annotate_schema2_parquet

This tool annotates Parquet datasets partitioned according to GPF's schema2 format. It expects the dataset's directory as input. The tool always parallelizes the annotation unless parallelism is explicitly disabled with the -j 1 argument.

Unlike the other annotation tools, this tool automatically carries out a reannotation if a previous annotation is detected in the dataset.
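If the previous annotation should be ignored entirely, a full reannotation can be forced with the --full-reannotation flag documented below (the directory and file names here are placeholders):

$ annotate_schema2_parquet input_parquet_dir annotation.yaml --full-reannotation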

Example usage of annotate_schema2_parquet

$ annotate_schema2_parquet input_parquet_dir annotation.yaml
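A few further illustrative invocations using options documented below; the paths are placeholders, and the region value assumes the usual chrom:start-end convention, which may differ for your reference genome:

$ annotate_schema2_parquet input_parquet_dir annotation.yaml -o annotated_parquet_dir
$ annotate_schema2_parquet input_parquet_dir annotation.yaml -r chr1:1-1000000 -j 1
$ annotate_schema2_parquet input_parquet_dir annotation.yaml --dry-run

The first writes the annotated dataset to a separate output directory, the second restricts annotation to a single region without parallelism, and the third previews the annotation that would be done without writing anything.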

Options for annotate_schema2_parquet

$ annotate_schema2_parquet --help
usage: annotate_schema2_parquet [-h] [--verbose] [--logfile LOGFILE]
                                [-r REGION] [-s REGION_SIZE] [-w WORK_DIR]
                                [--full-reannotation]
                                [-o OUTPUT | -e | -m | -n]
                                [--batch-size BATCH_SIZE] [-i INSTANCE]
                                [-g GRR_FILENAME]
                                [--grr-directory GRR_DIRECTORY]
                                [-R REFERENCE_GENOME_RESOURCE_ID]
                                [-G GENE_MODELS_RESOURCE_ID] [-ar] [-j JOBS]
                                [--process-pool] [-N DASK_CLUSTER_NAME]
                                [-c DASK_CLUSTER_CONFIG_FILE]
                                [--task-log-dir TASK_LOG_DIR]
                                [-t TASK_IDS [TASK_IDS ...]] [--keep-going]
                                [--fork-tasks] [--force] [-d TASK_STATUS_DIR]
                                [input] [pipeline] [{run,list,status}]

Annotate Schema2 Parquet

positional arguments:
  input                 the directory containing Parquet files (default: -)
  pipeline              The pipeline definition file. By default, or if the
                        value is gpf_instance, the annotation pipeline from
                        the configured gpf instance will be used. (default:
                        context)

options:
  -h, --help            show this help message and exit
  --verbose, -v, -V
  --logfile LOGFILE     File to log output to. If not set, logs to console.
                        (default: None)
  -r REGION, --region REGION
                        annotate only a specific region (default: None)
  -s REGION_SIZE, --region-size REGION_SIZE
                        region size to parallelize by (default: 300000000)
  -w WORK_DIR, --work-dir WORK_DIR
                        Directory to store intermediate output files in
                        (default: annotate_schema2_output)
  --full-reannotation, --fr
                        Ignore any previous annotation and run a full
                        reannotation. (default: False)
  -o OUTPUT, --output OUTPUT
                        Path of the directory to hold the output (default:
                        None)
  -e, --in-place        Produce output directly in given input dir. (default:
                        False)
  -m, --meta            Print the input Parquet's meta properties. (default:
                        False)
  -n, --dry-run         Print the annotation that will be done without
                        writing. (default: False)
  --batch-size BATCH_SIZE
                        Annotate in batches of the given size. (default: 0)
  -i INSTANCE, --instance INSTANCE, --gpf-instance INSTANCE
                        The path to the GPF instance configuration file.
                        (default: None)
  -g GRR_FILENAME, --grr-filename GRR_FILENAME, --grr GRR_FILENAME
                        The GRR configuration file. If the argument is absent,
                        a GRR repository from the current genomic context
                        (i.e. gpf_instance) will be used or, if that fails,
                        the default GRR configuration will be used. (default:
                        None)
  --grr-directory GRR_DIRECTORY
                        Local GRR directory to use as repository. (default:
                        None)
  -R REFERENCE_GENOME_RESOURCE_ID, --reference-genome-resource-id REFERENCE_GENOME_RESOURCE_ID, --ref REFERENCE_GENOME_RESOURCE_ID
                        The resource id for the reference genome. If the
                        argument is absent the reference genome from the
                        current genomic context will be used. (default: None)
  -G GENE_MODELS_RESOURCE_ID, --gene-models-resource-id GENE_MODELS_RESOURCE_ID, --genes GENE_MODELS_RESOURCE_ID
                        The resource id of the gene models resource. If the
                        argument is absent the gene models from the current
                        genomic context will be used. (default: None)
  -ar, --allow-repeated-attributes
                        Rename repeated attributes instead of raising an
                        error. (default: False)

Task Graph Executor:
  -j JOBS, --jobs JOBS  Number of jobs to run in parallel. Defaults to the
                        number of processors on the machine (default: None)
  --process-pool, --pp  Use a process pool executor with the specified number
                        of processes instead of a dask distributed executor.
                        (default: False)
  -N DASK_CLUSTER_NAME, --dask-cluster-name DASK_CLUSTER_NAME, --dcn DASK_CLUSTER_NAME
                        The name of the dask cluster (default: None)
  -c DASK_CLUSTER_CONFIG_FILE, --dccf DASK_CLUSTER_CONFIG_FILE, --dask-cluster-config-file DASK_CLUSTER_CONFIG_FILE
                        dask cluster config file (default: None)
  --task-log-dir TASK_LOG_DIR
                        Path to directory where to store tasks' logs (default:
                        None)

Execution Mode:
  {run,list,status}     Command to execute on the import configuration:
                        run - runs the import process; list - lists the tasks
                        to be executed but doesn't run them; status - synonym
                        for list (default: run)
  -t TASK_IDS [TASK_IDS ...], --task-ids TASK_IDS [TASK_IDS ...]
  --keep-going          Whether or not to keep executing in case of an error
                        (default: False)
  --fork-tasks, --fork-task, --fork
                        Whether to fork a new worker process for each task
                        (default: False)
  --force, -f           Ignore precomputed state and always rerun all tasks.
                        (default: False)
  -d TASK_STATUS_DIR, --task-status-dir TASK_STATUS_DIR, --tsd TASK_STATUS_DIR
                        Directory to store the task progress. (default:
                        ./.task-progress)