annotate_schema2_parquet
This tool annotates Parquet datasets partitioned according to GPF's schema2 format. It expects the dataset's directory as input.
The tool always parallelizes the annotation, unless parallelization is explicitly disabled with the -j 1 argument.
Unlike the other annotation tools, reannotation is carried out automatically whenever a previous annotation is detected in the dataset.
Example usage of annotate_schema2_parquet
$ annotate_schema2_parquet input_parquet_dir annotation.yaml
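By default the annotated output is written to a separate directory; the input can also be annotated in place, or inspected without writing anything. A few sketched variations (the directory and file names are placeholders):

```shell
# Write the annotated dataset to a new directory.
$ annotate_schema2_parquet input_parquet_dir annotation.yaml -o output_parquet_dir

# Annotate the input directory in place.
$ annotate_schema2_parquet input_parquet_dir annotation.yaml -e

# Preview the annotation that would be performed, without writing (dry run).
$ annotate_schema2_parquet input_parquet_dir annotation.yaml -n

# Print the input dataset's meta properties.
$ annotate_schema2_parquet input_parquet_dir -m
```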
Options for annotate_schema2_parquet
$ annotate_schema2_parquet --help
usage: annotate_schema2_parquet [-h] [--verbose] [--logfile LOGFILE]
[-r REGION] [-s REGION_SIZE] [-w WORK_DIR]
[--full-reannotation]
[-o OUTPUT | -e | -m | -n]
[--batch-size BATCH_SIZE] [-i INSTANCE]
[-g GRR_FILENAME]
[--grr-directory GRR_DIRECTORY]
[-R REFERENCE_GENOME_RESOURCE_ID]
[-G GENE_MODELS_RESOURCE_ID] [-ar] [-j JOBS]
[--process-pool] [-N DASK_CLUSTER_NAME]
[-c DASK_CLUSTER_CONFIG_FILE]
[--task-log-dir TASK_LOG_DIR]
[-t TASK_IDS [TASK_IDS ...]] [--keep-going]
[--fork-tasks] [--force] [-d TASK_STATUS_DIR]
[input] [pipeline] [{run,list,status}]
Annotate Schema2 Parquet
positional arguments:
input the directory containing Parquet files (default: -)
pipeline The pipeline definition file. By default, or if the
value is gpf_instance, the annotation pipeline from
the configured gpf instance will be used. (default:
context)
options:
-h, --help show this help message and exit
--verbose, -v, -V
--logfile LOGFILE File to log output to. If not set, logs to console.
(default: None)
-r REGION, --region REGION
annotate only a specific region (default: None)
-s REGION_SIZE, --region-size REGION_SIZE
region size to parallelize by (default: 300000000)
-w WORK_DIR, --work-dir WORK_DIR
Directory to store intermediate output files in
(default: annotate_schema2_output)
--full-reannotation, --fr
Ignore any previous annotation and run a full
reannotation. (default: False)
-o OUTPUT, --output OUTPUT
Path of the directory to hold the output (default:
None)
-e, --in-place Produce output directly in the given input directory.
(default: False)
-m, --meta Print the input Parquet's meta properties. (default:
False)
-n, --dry-run Print the annotation that will be done without
writing. (default: False)
--batch-size BATCH_SIZE
Annotate in batches of the given size. (default: 0)
-i INSTANCE, --instance INSTANCE, --gpf-instance INSTANCE
The path to the GPF instance configuration file.
(default: None)
-g GRR_FILENAME, --grr-filename GRR_FILENAME, --grr GRR_FILENAME
The GRR configuration file. If the argument is absent,
a GRR repository from the current genomic context
(i.e. gpf_instance) will be used or, if that fails,
the default GRR configuration will be used. (default:
None)
--grr-directory GRR_DIRECTORY
Local GRR directory to use as repository. (default:
None)
-R REFERENCE_GENOME_RESOURCE_ID, --reference-genome-resource-id REFERENCE_GENOME_RESOURCE_ID, --ref REFERENCE_GENOME_RESOURCE_ID
The resource id for the reference genome. If the
argument is absent the reference genome from the
current genomic context will be used. (default: None)
-G GENE_MODELS_RESOURCE_ID, --gene-models-resource-id GENE_MODELS_RESOURCE_ID, --genes GENE_MODELS_RESOURCE_ID
The resource id of the gene models resource. If the
argument is absent the gene models from the current
genomic context will be used. (default: None)
-ar, --allow-repeated-attributes
Rename repeated attributes instead of raising an
error. (default: False)
Task Graph Executor:
-j JOBS, --jobs JOBS Number of jobs to run in parallel. Defaults to the
number of processors on the machine (default: None)
--process-pool, --pp Use a process pool executor with the specified number
of processes instead of a dask distributed executor.
(default: False)
-N DASK_CLUSTER_NAME, --dask-cluster-name DASK_CLUSTER_NAME, --dcn DASK_CLUSTER_NAME
The name of the dask cluster (default: None)
-c DASK_CLUSTER_CONFIG_FILE, --dccf DASK_CLUSTER_CONFIG_FILE, --dask-cluster-config-file DASK_CLUSTER_CONFIG_FILE
dask cluster config file (default: None)
--task-log-dir TASK_LOG_DIR
Path to the directory in which to store tasks' logs (default:
None)
Execution Mode:
{run,list,status} Command to execute on the import configuration:
run - runs the import process; list - lists the tasks
to be executed but doesn't run them; status - synonym
for list (default: run)
-t TASK_IDS [TASK_IDS ...], --task-ids TASK_IDS [TASK_IDS ...]
--keep-going Whether or not to keep executing in case of an error
(default: False)
--fork-tasks, --fork-task, --fork
Whether to fork a new worker process for each task
(default: False)
--force, -f Ignore precomputed state and always rerun all tasks.
(default: False)
-d TASK_STATUS_DIR, --task-status-dir TASK_STATUS_DIR, --tsd TASK_STATUS_DIR
Directory to store the task progress. (default:
./.task-progress)
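The Task Graph Executor and Execution Mode options control how the annotation tasks are scheduled and run. As a sketch (paths and job counts are arbitrary examples):

```shell
# Limit parallelism to 4 local jobs, using a process pool instead of dask.
$ annotate_schema2_parquet input_parquet_dir annotation.yaml -j 4 --process-pool

# List the tasks that would be executed, without running them.
$ annotate_schema2_parquet input_parquet_dir annotation.yaml list

# Rerun everything from scratch, ignoring both the recorded task state
# and any previous annotation stored in the dataset.
$ annotate_schema2_parquet input_parquet_dir annotation.yaml --force --full-reannotation
```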