Getting started on CLI
Prerequisites
This guide assumes that you are working on a recent Linux or macOS machine.
Warning
GAIn is not currently supported on Windows, but it can be run through Windows Subsystem for Linux (WSL) if you have WSL configured.
GAIn is distributed as a Conda package and can be installed with conda install. For faster installation, we recommend using the libmamba solver with Conda or using Mamba directly. If you do not already have Conda or Mamba installed, or if you are unfamiliar with these package managers, we recommend installing Mamba through the Miniforge distribution, available at: https://github.com/conda-forge/miniforge.
Installation
We assume that you have a working mamba installation. If you do not have mamba but have a working conda installation, replace mamba with conda in the commands below. If you have neither, install Mamba through Miniforge as described above.
Start by creating an empty Conda environment named gain_cli:
mamba create -n gain_cli
To use this environment, activate it using the following command:
mamba activate gain_cli
Then install the gain-core Conda package:
mamba install -c conda-forge -c bioconda -c iossifovlab gain-core
This command installs GAIn and all of its dependencies. A simple test to confirm that GAIn is installed correctly is to run:
grr_browse --version
The result should look similar to this:
GAIn version: 2026.6.6
Note that the version number may be different depending on when you install GAIn, but the command should run without error and print a version number.
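As a further sanity check, the short sketch below reports which of the command-line tools mentioned in this guide are found on your PATH. The helper name check_tools is hypothetical, and the tool list is taken from this guide rather than from a complete list of GAIn commands.

```shell
# Report which of the given commands are available on PATH.
# Returns the number of missing commands (0 means all were found).
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# Tools named in this guide; the list is illustrative, not exhaustive.
check_tools grr_browse grr_cache_repo annotate_tabular annotate_vcf \
    annotate_doc prepare_tabular \
    || echo "Some tools are missing; is the gain_cli environment active?"
```

If any tool is reported as missing, activating the gain_cli environment again is a reasonable first step.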
In the sections below, we first browse the resources available for annotation, then run a small test annotation and build a custom pipeline. From there, we move to larger and more flexible workflows, including resource caching, parallel execution, VCF input, position and region annotation, public and local GRRs, and reannotation. The examples start small, but they are designed to scale. For larger datasets, caching resources locally and using indexed inputs for parallel annotation are especially important for making GAIn practical and efficient.
Browse available resources
GAIn is installed with access to the default IossifovLab GRR. You can confirm which GRRs are available to you and browse the resources hosted on them by running:
grr_browse
The output confirms that you have access to IossifovLab’s main GRR and lists all the resources available from that server:
No GRR definition found, using the DEFAULT_DEFINITION
id: main-GRR
type: http
url: https://grr.iossifovlab.com
gene_score 0 139 11.12 MB main-GRR gene_properties/gene_scores/GTEx_V11_RNAexpression
gene_score 0 9 11.84 MB main-GRR gene_properties/gene_scores/Iossifov_Wigler_PNAS_2015
gene_score 0 19 2.27 MB main-GRR gene_properties/gene_scores/LGD
gene_score 0 9 13.2 MB main-GRR gene_properties/gene_scores/LOEUF
gene_score 0 10 1.48 MB main-GRR gene_properties/gene_scores/RVIS
gene_score 0 9 202.88 KB main-GRR gene_properties/gene_scores/SFARI_gene_score_2024_Q1
...
This output contains several pieces of information. The first line shows that GAIn is using the default GRR definition, which points to the Iossifov lab’s main GRR at https://grr.iossifovlab.com. The next three lines show the default configuration. This section is useful for confirming that GAIn is connected to the expected GRR server. The following lines list the resources available on that server, including their type, size, and resource ID. For example, gene_properties/gene_scores/GTEx_V11_RNAexpression is the resource ID for the GTEx V11 RNA expression gene score resource. Resource IDs are used to refer to resources in annotation pipelines.
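Because the full listing is long, it can help to filter it by resource type. The sketch below assumes the column layout shown above (resource type in the first column, resource ID in the last); the helper name list_resource_ids is hypothetical and not part of GAIn.

```shell
# Print the resource IDs of one resource type from a grr_browse listing on stdin.
# Assumes the resource type is the first field and the resource ID is the last.
list_resource_ids() {
    awk -v wanted="$1" '$1 == wanted { print $NF }'
}

# Usage (requires GAIn and network access to the GRR):
#   grr_browse | list_resource_ids gene_score
```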
Quick annotation test
After installation, GAIn can immediately run a small annotation test using resources from the Iossifov Lab’s main GRR. This is a useful way to confirm that the command-line tools are working and can access the public resources.
In this example, we annotate a small comma-separated text file containing three variants. The test uses resources directly from the remote public GRR, so it is convenient for checking the setup but not intended for large annotation jobs.
Download the example input CSV file (small_input.csv), whose content is shown below. The file contains three variant annotatables, each described by the columns chrom, pos, ref, and alt, which specify the chromosome, genomic position, reference allele, and alternate allele:
| chrom | pos | ref | alt |
|---|---|---|---|
| chr14 | 21415880 | G | A |
| chr17 | 7674904 | TCT | T |
| chr7 | 117587806 | G | A |
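If the download is not at hand, the same file can be recreated from the table above. This sketch assumes a plain comma-separated layout with a single header line, as described in the text.

```shell
# Recreate the three-variant example input in the current directory.
cat > small_input.csv <<'EOF'
chrom,pos,ref,alt
chr14,21415880,G,A
chr17,7674904,TCT,T
chr7,117587806,G,A
EOF
```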
To annotate the file, run:
annotate_tabular small_input.csv pipeline/hg38_clinical_annotation
This command annotates small_input.csv using the predefined pipeline resource with id pipeline/hg38_clinical_annotation included in the main GRR.
GAIn writes the annotated output to a new file whose name is derived from the input file. For example, the command above produces small_input.annotated.csv, with the following content:
| chrom | pos | ref | alt | worst_effect_MANE_1_5 | effect_details_MANE_1_5 | gene_effects_MANE_1_5 | dbSNP_rs_number | gnomad_v4_exome_ALL_af | gnomad_v4_genome_ALL_af | clinical_significance | clinical_disease_name | phyloP7way | AlphaMissense_pathogenicity | AlphaMissense_class | MPC_score | worst_effect_GENCODE_49 | effect_details_GENCODE_49 | gene_effects_GENCODE_49 | pLI_rank_all | pLI_rank_min | LOEUF_rank_all | LOEUF_rank_min |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| chr14 | 21415880 | G | A | nonsense | ENST00000646647.2:CHD8:nonsense:582/2581(Arg->End) | CHD8:nonsense | 863224857 | | | Pathogenic/Likely_pathogenic | not_provided\|Intellectual_developmental_disorder_with_autism_and_macrocephaly | 0.917 | | | | nonsense | ENST00000645929.1:CHD8:nonsense:303/2302(Arg->End)\|ENST00000646647.2:CHD8:nonsense:582/2581(Arg->End)\|ENST00000864429.1:CHD8:nonsense:582/2581(Arg->End)\|ENST00000643469.1:CHD8:nonsense:582/2581(Arg->End)\|ENST00000934461.1:CHD8:nonsense:582/2581(Arg->End)\|ENST00000557364.6:CHD8:nonsense:582/2581(Arg->End)\|ENST00000934460.1:CHD8:nonsense:582/2581(Arg->End)\|ENST00000430710.8:CHD8:nonsense:303/2302(Arg->End) | CHD8:nonsense | CHD8:45 | 45 | CHD8:112.5 | 112.5 |
| chr17 | 7674904 | TCT | T | frame-shift | ENST00000269305.9:TP53:frame-shift:209/393 | TP53:frame-shift | 1057517840 | 6.84e-07 | | Pathogenic | Hereditary_cancer-predisposing_syndrome\|TP53-related_disorder\|not_provided\|Li-Fraumeni_syndrome_1\|Ovarian_neoplasm\|Li-Fraumeni_syndrome | -0.12 | | | | frame-shift | ENST00000622645.4:TP53:frame-shift:170/302\|ENST00000420246.6:TP53:frame-shift:209/341\|ENST00000610538.4:TP53:frame-shift:170/307\|ENST00000455263.6:TP53:frame-shift:209/346\|ENST00000949117.1:TP53:frame-shift:209/394\|ENST00000923567.1:TP53:frame-shift:209/393\|ENST00000905353.1:TP53:frame-shift:209/393\|ENST00000620739.4:TP53:frame-shift:170/354\|ENST00000714357.1:TP53:frame-shift:209/393\|ENST00000923566.1:TP53:frame-shift:209/393\|ENST00000510385.5:TP53:frame-shift:77/209\|ENST00000610623.4:TP53:frame-shift:50/187\|ENST00000504290.5:TP53:frame-shift:77/214\|ENST00000618944.4:TP53:frame-shift:50/182\|ENST00000504937.5:TP53:frame-shift:77/261\|ENST00000619186.4:TP53:frame-shift:50/234\|ENST00000445888.6:TP53:frame-shift:209/393\|ENST00000604348.6:TP53:frame-shift:202/386\|ENST00000619485.4:TP53:frame-shift:170/354\|ENST00000269305.9:TP53:frame-shift:209/393\|ENST00000714408.1:TP53:frame-shift:209/411\|ENST00000714409.1:TP53:frame-shift:209/367\|ENST00000413465.6:TP53:frame-shift:209/285\|ENST00000576024.2:TP53:frame-shift:209/344\|ENST00000923568.1:TP53:frame-shift:209/393\|ENST00000714359.1:TP53:frame-shift:209/393\|ENST00000714356.1:TP53:frame-shift:170/347\|ENST00000923569.1:TP53:frame-shift:209/393\|ENST00000359597.8:TP53:frame-shift:209/343\|ENST00000610292.4:TP53:frame-shift:170/354 | TP53:frame-shift | TP53:3122 | 3122 | TP53:4446.5 | 4446.5 |
| chr7 | 117587806 | G | A | missense | ENST00000003084.11:CFTR:missense:551/1480(Gly->Asp) | CFTR:missense | 75527207 | 0.000404 | 0.000276 | Pathogenic | CFTR-related_disorder\|Cystic_fibrosis\|Congenital_bilateral_aplasia_of_vas_deferens_from_CFTR_mutation\|not_provided\|Hereditary_pancreatitis\|Bronchiectasis_with_or_without_elevated_sweat_chloride_1\|ivacaftor_response_-_Efficacy | 0.917 | 0.99 | likely_pathogenic | 0.015 | missense | ENST00000889206.1:CFTR:missense:551/1451(Gly->Asp)\|ENST00000950799.1:CFTR:intron:11/22[15019]\|ENST00000003084.11:CFTR:missense:551/1480(Gly->Asp)\|ENST00000889208.1:CFTR:missense:551/1437(Gly->Asp)\|ENST00000699605.1:CFTR:missense:409/1338(Gly->Asp)\|ENST00000889209.1:CFTR:missense:551/1450(Gly->Asp)\|ENST00000649781.2:CFTR:missense:490/1419(Gly->Asp)\|ENST00000649406.1:CFTR:missense:490/1187(Gly->Asp)\|ENST00000648260.1:CFTR:intron:10/16[15019]\|ENST00000699602.1:CFTR:missense:551/1478(Gly->Asp)\|ENST00000889207.1:CFTR:missense:490/1354(Gly->Asp)\|ENST00000889210.1:CFTR:missense:490/1376(Gly->Asp) | CFTR:missense\|CFTR:intron | CFTR:18190 | 18190 | CFTR:13993.5 | 13993.5 |
The output contains the original variant columns followed by the annotation attributes produced by pipeline/hg38_clinical_annotation. See the pipeline summary page in the main GRR for a description of the attributes produced by this pipeline.
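With this many attributes, it can be convenient to list the output columns one per line before inspecting the values. The sketch below assumes a comma-separated file whose header line contains no quoted commas; the helper name list_columns is hypothetical.

```shell
# Print the column names of a comma-separated file, one per line with its index.
list_columns() {
    head -n 1 "$1" | awk -F, '{ for (i = 1; i <= NF; i++) print i ": " $i }'
}

# Usage: list_columns small_input.annotated.csv
```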
Custom annotation pipelines
In the quick annotation test, we used a predefined pipeline from the default GRR. GAIn also allows users to define their own annotation pipelines as YAML files. A custom pipeline is useful when you want to select genomic resources from one or more GRRs that fit a specific project or research question.
In this example, we will annotate the same three variants from small_input.csv, but this time using a custom pipeline stored locally as custom_pipeline.yaml.
Download the example custom annotation pipeline file (custom_pipeline.yaml), whose content is shown below.
preamble:
  summary: Simple custom pipeline
  input_reference_genome: hg38/genomes/GRCh38-hg38
annotators:
  - effect_annotator:
      gene_models: hg38/gene_models/MANE/1.5
      attributes:
        - worst_effect
        - genes
  - position_score_annotator:
      resource_id: hg38/scores/phyloP7way
  - normalize_allele_annotator
  - allele_score_annotator:
      resource_id: hg38/scores/ClinVar_20251019
      input_annotatable: normalized_allele
      attributes:
        - CLNSIG
        - CLNDN
This pipeline has an optional preamble section, which records metadata about the pipeline and specifies that the input variants use the hg38/genomes/GRCh38-hg38 reference genome. The annotators section lists the annotation steps that GAIn will run from top to bottom. This pipeline first uses the MANE 1.5 gene model to identify affected genes and predict the worst effect of each variant. It then adds a conservation score from phyloP7way. Finally, it normalizes each allele and looks up selected ClinVar attributes: CLNSIG, which describes clinical significance, and CLNDN, which reports associated disease names.
Note
When building custom annotation pipelines, users can either write the pipeline directly using GAIn’s YAML structure or use the pipeline authoring tool in the GAIn web interface, which simplifies pipeline creation by guiding users through annotator and resource selection.
To review the attributes produced by the custom pipeline, run the following command.
annotate_doc custom_pipeline.yaml > doc.html
You can open the generated HTML summary (doc.html) in your local folder. To annotate the input file with this custom pipeline, run:
annotate_tabular small_input.csv custom_pipeline.yaml -o small_input_custom.annotated.csv
This command applies the local custom_pipeline.yaml file to the variants in small_input.csv. To avoid overwriting the output from the previous section, we write the result to small_input_custom.annotated.csv, whose content is shown below.
| chrom | pos | ref | alt | worst_effect | genes | phyloP7way | CLNSIG | CLNDN |
|---|---|---|---|---|---|---|---|---|
| chr14 | 21415880 | G | A | nonsense | CHD8 | 0.917 | Pathogenic/Likely_pathogenic | not_provided\|Intellectual_developmental_disorder_with_autism_and_macrocephaly |
| chr17 | 7674904 | TCT | T | frame-shift | TP53 | -0.12 | Pathogenic | Hereditary_cancer-predisposing_syndrome\|TP53-related_disorder\|not_provided\|Li-Fraumeni_syndrome_1\|Ovarian_neoplasm\|Li-Fraumeni_syndrome |
| chr7 | 117587806 | G | A | missense | CFTR | 0.917 | Pathogenic | CFTR-related_disorder\|Cystic_fibrosis\|Congenital_bilateral_aplasia_of_vas_deferens_from_CFTR_mutation\|not_provided\|Hereditary_pancreatitis\|Bronchiectasis_with_or_without_elevated_sweat_chloride_1\|ivacaftor_response_-_Efficacy |
This approach is convenient for small tests and for developing custom pipelines. However, when annotation uses resources directly from the public GRR, it is practical only for small inputs. For larger inputs, input files should be sorted by genomic coordinates for more efficient processing. Users can also configure local resource caching and parallel execution, as described in the next sections.
Caching resources
By default, GAIn can access genomic resources directly from a remote GRR. This works well for small examples, but large annotation jobs may require repeated access to many large resources over the network. To make these jobs faster and more reliable, GAIn supports local resource caching.
When caching is enabled, GAIn downloads a required resource into a local cache directory the first time the resource is used. After that, GAIn uses the local copy for annotation and reuses it in future jobs without downloading it again.
So far, GAIn has been using the default GRR definition, which corresponds to the configuration shown by the first lines of grr_browse. To enable caching, create a GRR definition file in your home directory named ~/.grr_definition.yaml, with the same default GRR configuration plus a cache_dir entry. For example:
id: "main-GRR"
type: "url"
url: "https://grr.iossifovlab.com"
cache_dir: "<path_to_cache>/remote_grr_cache"
After this configuration, GAIn downloads each required resource to the specified cache directory before using it for annotation. Because genomic resources can be large, the cache directory should have sufficient disk space and write permission for the user. If <path_to_cache> does not have enough available space, use another cache directory with sufficient storage. The approximate space requirements for the resources used in this guide are described below.
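The setup above can be scripted. The sketch below is one possible way to do it: the helper name write_grr_definition and the example cache location are hypothetical, and the definition it writes is the same four-line configuration shown above.

```shell
# Create the cache directory and write a GRR definition that points to it.
# Arguments: <cache root directory> <definition file to write>
write_grr_definition() {
    cache_root="$1"
    def_file="$2"
    mkdir -p "$cache_root/remote_grr_cache" || return 1
    # Keep a backup if a definition file already exists.
    if [ -f "$def_file" ]; then
        cp "$def_file" "$def_file.bak" || return 1
    fi
    cat > "$def_file" <<EOF
id: "main-GRR"
type: "url"
url: "https://grr.iossifovlab.com"
cache_dir: "$cache_root/remote_grr_cache"
EOF
}

# Usage (the cache location is only an example):
#   write_grr_definition "$HOME/gain_cache" "$HOME/.grr_definition.yaml"
#   df -h "$HOME/gain_cache"    # check the free space available for the cache
```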
This is especially important for large annotation pipelines. For example, a comprehensive clinical pipeline such as pipeline/hg38_clinical_annotation requires many large resources, which total approximately 36 GB and may take substantial time to download, depending on network speed and storage performance. In our test on a Mac laptop, the download took approximately 16 minutes. Once cached, however, the resources are reused directly from the local cache, making future annotation jobs much faster.
GAIn can automatically download required resources during annotation. For large pipelines, however, it is often better to pre-download them before starting the annotation job. GAIn provides a dedicated tool for this purpose:
grr_cache_repo pipeline/hg38_clinical_annotation
This command downloads the resources required by the pipeline in one step, so that the actual annotation job does not need to pause while resources are being retrieved.
Custom pipelines can also reduce the amount of data that must be cached. A broad clinical pipeline may require more than 35 GB of resources, whereas a focused custom pipeline may require only the resources needed for a specific analysis. For example, the custom pipeline shown above requires approximately 8 GB of resources. Custom pipelines therefore help control annotation content while reducing storage requirements and setup time. You can cache the resources for the custom pipeline used above with the following command. Since this custom pipeline is a subset of the hg38 clinical pipeline, this command will not download any additional resources if the clinical pipeline has already been cached.
grr_cache_repo custom_pipeline.yaml
After the necessary resources have been cached, users can run large annotation jobs without waiting for GAIn to download each resource during the annotation process. To test this workflow, download the example input file (50k_variants.tsv.gz), which contains 50,000 variants randomly selected from approximately 1.4 million variants observed by whole-exome sequencing in the SSC project.
Depending on which pipeline you cached above, you can now run the annotation normally:
annotate_tabular 50k_variants.tsv.gz pipeline/hg38_clinical_annotation
or
annotate_tabular 50k_variants.tsv.gz custom_pipeline.yaml
Without caching, annotating a file of this size through remote resource access can take a very long time. With the required resources already cached, GAIn uses the local copies for annotation, making the same large-scale job much faster and less dependent on network performance. For example, in our test on a recent Mac laptop using cached resources, annotating 50,000 variants with pipeline/hg38_clinical_annotation took approximately 5 minutes. The input file used in this test was pre-sorted by chromosome and position, which allows GAIn to access genomic resources more efficiently. Unsorted input files can be annotated, but they will run significantly more slowly.
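Before starting a long job, it can be worth confirming that an input file is already in coordinate order. The sketch below is a plain awk check, not a GAIn tool. It assumes a tab-separated file with one header line, the chromosome in column 1, and the position in column 2, and it treats a file as sorted when each chromosome forms one contiguous block with non-decreasing positions.

```shell
# Exit status 0 if the input on stdin is grouped by chromosome and sorted by
# position within each chromosome; exit status 1 otherwise.
is_coordinate_sorted() {
    awk -F'\t' '
        NR == 1 { next }                       # skip the header line
        $1 != chrom {
            if ($1 in seen) { bad = 1; exit }  # chromosome split into separate blocks
            seen[$1] = 1; chrom = $1; last = $2 + 0; next
        }
        $2 + 0 < last { bad = 1; exit }        # position went backwards
        { last = $2 + 0 }
        END { exit bad + 0 }
    '
}

# Usage: gzip -dc 50k_variants.tsv.gz | is_coordinate_sorted && echo "sorted"
```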
Parallelizing large annotation jobs
Annotation can be computationally intensive, especially for large input files or pipelines with many steps. Because GAIn annotates each annotatable independently, these jobs can be accelerated by splitting the input into genomic regions and processing those regions in parallel across multiple CPU cores or cluster workers. Users could do this manually by splitting an input file into chunks, annotating each chunk separately, and merging the results. To avoid this extra workflow management, GAIn provides built-in parallelization support for indexed input files.
To use GAIn’s parallelization features, the input file must be sorted by genomic coordinates and indexed with tabix, a widely used genomic indexing tool. This requirement applies to both input formats supported by GAIn: tabular files and VCF files. VCF files can be sorted and indexed with bcftools, while tabular files can be sorted, compressed with bgzip, and indexed with tabix. See the “Preparing annotation input files for parallelization” section for details and examples.
When GAIn detects an indexed input file, it splits the annotation job into smaller tasks and executes them in parallel on a computation cluster. By default, GAIn uses a local cluster that runs on the available CPU cores of the host where the annotation command is run. For larger jobs, users can control both how the input is split and how many workers are used.
The degree of parallelization can be controlled with the -j option, which specifies the number of workers. The optimal value depends on the input size, pipeline complexity, available CPU cores, memory, and storage performance.
For example, download the example input file (SSC_WES_variants_select.tsv.gz), which contains all 1,413,298 variants on canonical chromosomes detected by WES in the SSC project. You can annotate this large variant collection with the pipeline/hg38_clinical_annotation pipeline by running the following command. However, even with cached resources, this annotation took approximately 17 minutes in our test:
annotate_tabular SSC_WES_variants_select.tsv.gz pipeline/hg38_clinical_annotation
To take advantage of parallel computation, first prepare the input file for indexed genomic access:
prepare_tabular SSC_WES_variants_select.tsv.gz
When run successfully, this command produces two files: SSC_WES_variants_select.sorted.tsv.bgz, which contains the sorted and compressed version of the input file, and SSC_WES_variants_select.sorted.tsv.bgz.tbi, its associated tabix index. These two files enable parallelization and fast genomic-region access in GAIn.
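For readers who want to see the individual steps, the sketch below shows a manual sort, compress, and index sequence using standard tools. It is an illustration under stated assumptions (tab-separated input, one header line, chromosome in column 1, position in column 2), and prepare_tabular may differ in its details. The helper name sort_tabular and the file name variants.tsv are hypothetical.

```shell
# Print a tab-separated file with its header first and the remaining lines
# sorted by chromosome (column 1) and numeric position (column 2).
sort_tabular() {
    input="$1"
    head -n 1 "$input"
    tail -n +2 "$input" | sort -t "$(printf '\t')" -k1,1 -k2,2n
}

# Usage (requires bgzip and tabix):
#   gzip -dc SSC_WES_variants_select.tsv.gz > variants.tsv
#   sort_tabular variants.tsv | bgzip > variants.sorted.tsv.bgz
#   tabix -s 1 -b 2 -e 2 -S 1 variants.sorted.tsv.bgz
```

In the tabix call, -s, -b, and -e name the chromosome, start, and end columns, and -S 1 skips the header line.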
The following command uses parallelization, and with the required resources already cached, annotating the sorted file with pipeline/hg38_clinical_annotation took approximately 1 minute and 15 seconds in our test.
annotate_tabular SSC_WES_variants_select.sorted.tsv.bgz pipeline/hg38_clinical_annotation
By default, GAIn splits indexed inputs by chromosome. For human genomes, this creates up to 24 chromosome-level tasks, which is already enough to use all available cores on our local test machine with 10 CPU cores. Therefore, splitting the input further with the -r option provides only a modest additional benefit on this computer. However, on larger compute systems or clusters with many more cores, chromosome-level splitting may not create enough tasks to fully use the available parallelism. In those cases, the -r option can split the input into smaller genomic regions and improve scaling. In our test, using the -r option reduced the annotation time to approximately 1 minute.
annotate_tabular SSC_WES_variants_select.sorted.tsv.bgz pipeline/hg38_clinical_annotation -r 30_000_000
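To get a feel for how many tasks a given region size creates, the arithmetic can be sketched as a ceiling division. This assumes that the value passed to -r is a region length in base pairs and that regions tile each chromosome from its start. Both are our reading of the option rather than a documented guarantee, and the helper name num_regions is hypothetical.

```shell
# Number of fixed-size regions needed to cover a chromosome (ceiling division).
num_regions() {
    chrom_len="$1"
    region_size="$2"
    echo $(( (chrom_len + region_size - 1) / region_size ))
}

# chr1 in GRCh38 is 248,956,422 bp long.
num_regions 248956422 30000000   # prints 9
```

Under these assumptions, the largest chromosome alone is split into 9 regions instead of being handled as a single task.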
GAIn can also use a configured cluster that creates workers on a larger compute system, such as SGE or SLURM. For example, if a cluster named my_sge_cluster has been configured to create workers on an SGE cluster, the annotation can be run with:
annotate_tabular SSC_WES_variants_select.sorted.tsv.bgz pipeline/hg38_clinical_annotation -r 30_000_000 -N my_sge_cluster -j 100
This runs the annotation across up to 100 workers on the configured cluster. See the “Configuring parallelization” and “Configuring Dask clusters” sections for more details on region splitting, worker configuration, and cluster setup.
Annotating VCF input
GAIn can also annotate variants stored in VCF files. The command is similar to annotate_tabular, but the input and output files are in VCF format. To annotate an example VCF file, download the example input file (small_input.vcf), whose content is shown below.
##fileformat=VCFv4.1
##reference=GRCh38-hg38
##contig=<ID=chr7>
##contig=<ID=chr14>
##contig=<ID=chr17>
#CHROM POS ID REF ALT QUAL FILTER INFO
chr14 21415880 . G A . . .
chr17 7674904 . TCT T . . .
chr7 117587806 . G A . . .
To annotate this file, run:
annotate_vcf small_input.vcf custom_pipeline.yaml -o vcf.annotated.vcf
This command produces an output file named vcf.annotated.vcf, which contains the same variants with additional annotation fields in the INFO column.
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##reference=GRCh38-hg38
##contig=<ID=chr7>
##contig=<ID=chr14>
##contig=<ID=chr17>
##pipeline_annotation_tool=GPF variant annotation.
##INFO=<ID=worst_effect,Number=A,Type=String,Description="Worst effect across all transcripts.">
##INFO=<ID=genes,Number=A,Type=String,Description="Comma separated list of all affected genes.">
##INFO=<ID=phyloP7way,Number=A,Type=String,Description="The score is a number that reflects the conservation at a position.">
##INFO=<ID=CLNSIG,Number=A,Type=String,Description="Aggregate germline classification for this single variant; multiple values are separated by a vertical bar">
##INFO=<ID=CLNDN,Number=A,Type=String,Description="ClinVar's preferred disease name for the concept specified by disease identifiers in CLNDISDB">
#CHROM POS ID REF ALT QUAL FILTER INFO
chr14 21415880 . G A . . worst_effect=nonsense;genes=CHD8;phyloP7way=0.917;CLNSIG=Pathogenic/Likely_pathogenic;CLNDN=not_provided|Intellectual_developmental_disorder_with_autism_and_macrocephaly
chr17 7674904 . TCT T . . worst_effect=frame-shift;genes=TP53;phyloP7way=-0.12;CLNSIG=Pathogenic;CLNDN=Hereditary_cancer-predisposing_syndrome|TP53-related_disorder|not_provided|Li-Fraumeni_syndrome_1|Ovarian_neoplasm|Li-Fraumeni_syndrome
chr7 117587806 . G A . . worst_effect=missense;genes=CFTR;phyloP7way=0.917;CLNSIG=Pathogenic;CLNDN=CFTR-related_disorder|Cystic_fibrosis|Congenital_bilateral_aplasia_of_vas_deferens_from_CFTR_mutation|not_provided|Hereditary_pancreatitis|Bronchiectasis_with_or_without_elevated_sweat_chloride_1|ivacaftor_response_-_Efficacy
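To pull a single annotation out of the INFO column for a quick look, a small awk filter is often enough. The sketch below is not a GAIn tool: it assumes tab-separated columns, as the VCF format specifies, and prints a dot when the requested key is absent. The helper name vcf_info_value is hypothetical.

```shell
# Print CHROM, POS and the value of one INFO key for each record of a VCF on stdin.
vcf_info_value() {
    awk -F'\t' -v key="$1" '
        /^#/ { next }                          # skip header lines
        {
            value = "."
            n = split($8, fields, ";")
            for (i = 1; i <= n; i++) {
                eq = index(fields[i], "=")
                if (eq > 0 && substr(fields[i], 1, eq - 1) == key)
                    value = substr(fields[i], eq + 1)
            }
            print $1 "\t" $2 "\t" value
        }
    '
}

# Usage: vcf_info_value worst_effect < vcf.annotated.vcf
```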
VCF files can also be prepared for parallel annotation. To do this, first install bcftools and then sort, compress, and index the VCF file:
mamba install -c conda-forge -c bioconda bcftools
bcftools sort small_input.vcf -o small_input.sorted.vcf.bgz -Oz -Wtbi
This creates a sorted, bgzip-compressed VCF file, small_input.sorted.vcf.bgz, together with its tabix index, small_input.sorted.vcf.bgz.tbi. GAIn can use this indexed VCF file for parallel annotation in the same way as indexed tabular inputs.
Annotating positions and regions
GAIn is well suited for annotating genetic variants obtained from sequencing data, but not all genomic experiments produce variant calls. Some assays instead identify genomic positions or regions of interest, such as transcription start sites mapped by CAGE-seq or regulatory intervals detected by ATAC-seq and ChIP-seq. For researchers working with these data types, it is often valuable to interpret them using the same kinds of genomic resources used in variant annotation.
Although positions and regions do not contain allele information, and therefore cannot support every type of variant-based annotation, GAIn can still take these inputs and annotate them with many relevant resources using the annotate_tabular tool, aggregating scores when needed.
Position inputs require only two columns: chromosome and position. Download positions.tsv, whose content is shown below:
| chrom | pos |
|---|---|
| chr7 | 117587806 |
| chr7 | 115587806 |
Because position inputs do not include reference and alternate alleles, GAIn cannot infer the effect of a specific allelic change on a gene product. However, GAIn provides a dedicated simple_effect_annotator that can infer the broad genomic context of a position, such as whether it is intergenic, genic, or coding. GAIn can also use other resource types with position inputs and, when needed, aggregate their values to produce a position-level annotation. For example, position score resources map directly to genomic positions and can be applied without modification, while allele score resources can be used by aggregating across the possible allelic changes at that site. In the example below, we use a single pipeline that combines these annotation types. Download annotation pipeline position_pipeline.yaml, whose content is shown below:
- simple_effect_annotator:
    gene_models: hg38/gene_models/MANE/1.5
- position_score_annotator:
    resource_id: hg38/scores/phyloP7way
    attributes:
      - name: phyloP7way_max
        source: phyloP7way
        aggregator: max
      - name: phyloP7way_mean
        source: phyloP7way
        aggregator: mean
      - name: phyloP7way_list
        source: phyloP7way
        aggregator: list
- allele_score_annotator:
    resource_id: hg38/variant_frequencies/gnomAD_4.1.0/exomes/ALL
    attributes:
      - name: af_max
        source: AF
        aggregator: max
      - name: af_mean
        source: AF
        aggregator: mean
      - name: af_alleles
        source: allele
        include_attributes:
          - AF
This pipeline combines three annotators. The simple_effect_annotator uses the MANE 1.5 gene models resource to classify each position by genomic context, such as coding or intergenic, and to report overlapping genes when applicable.
The position_score_annotator adds phyloP7way conservation scores and requests three aggregations: max, mean, and list. The max and mean aggregators report the maximum and average score, while list reports the individual scores before aggregation. For single-position inputs, these aggregators have no effect because there is only one genomic position, but they become useful for region inputs, where scores must be summarized across many positions. The allele_score_annotator uses AlphaMissense to summarize possible allelic changes at each position. Here, max and mean report the maximum and mean am_pathogenicity values across possible alleles, while the allele source reports the observed alleles together with their am_pathogenicity values. Run the following command to annotate the positions:
annotate_tabular positions.tsv position_pipeline.yaml
This produces positions.annotated.tsv, which contains:
| chrom | pos | worst_effect | worst_effect_genes | phyloP7way_max | phyloP7way_mean | phyloP7way_all | am_pathogenicity_max | am_pathogenicity_mean | allele |
|---|---|---|---|---|---|---|---|---|---|
| chr7 | 117587806 | coding | CFTR | 0.917 | 0.917 | [0.9169999957084656] | 0.99 | 0.953 | ['chr7:117587806:G:C:0.909', 'chr7:117587806:G:A:0.99', 'chr7:117587806:G:T:0.959'] |
| chr7 | 115587806 | intergenic | | 0.158 | 0.158 | [0.15800000727176666] | | | |
This output shows that the first position falls within a coding part of CFTR, whereas the second position is intergenic. The coding position has a higher phyloP7way conservation score and receives an aggregate am_pathogenicity score, while no am_pathogenicity value is reported for the intergenic position.
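The aggregated values for the coding position can be checked by hand from the three per-allele values listed in the last column (0.909, 0.99, and 0.959). The sketch below reproduces the max and mean with awk. It only illustrates the arithmetic and says nothing about how GAIn computes the values internally; the helper name aggregate_max_mean is hypothetical.

```shell
# Read one number per line and print the maximum and the mean (3 decimals).
aggregate_max_mean() {
    awk 'NR == 1 || $1 + 0 > max { max = $1 + 0 }
         { sum += $1 }
         END { printf "max=%g mean=%.3f\n", max, sum / NR }'
}

printf '%s\n' 0.909 0.99 0.959 | aggregate_max_mean
# prints: max=0.99 mean=0.953
```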
Region inputs require three columns: chromosome, beginning position, and end position. Download the example file regions.tsv, whose content is shown below:
| chrom | pos_beg | pos_end |
|---|---|---|
| chr1 | 11796260 | 11796280 |
| chr14 | 21397800 | 21397840 |
As with position inputs, region inputs do not include reference and alternate alleles, so GAIn cannot infer the effect of a specific allelic change on a gene product. However, many of the same genomic resource types can still be applied to region inputs. Region inputs can also be evaluated with simple_effect_annotator, which summarizes whether a region overlaps genic or intergenic sequence and reports broad functional categories when applicable. Position score resources can be used on region inputs by aggregating values across the positions spanned by each interval. Allele score resources can also be used, but in that case GAIn must aggregate both across the positions in the region and across the possible allelic changes at each position.
To illustrate this, reuse position_pipeline.yaml, shown below as a reminder.
- simple_effect_annotator:
    gene_models: hg38/gene_models/MANE/1.5
- position_score_annotator:
    resource_id: hg38/scores/phyloP7way
    attributes:
      - name: phyloP7way_max
        source: phyloP7way
        aggregator: max
      - name: phyloP7way_mean
        source: phyloP7way
        aggregator: mean
      - name: phyloP7way_list
        source: phyloP7way
        aggregator: list
- allele_score_annotator:
    resource_id: hg38/variant_frequencies/gnomAD_4.1.0/exomes/ALL
    attributes:
      - name: af_max
        source: AF
        aggregator: max
      - name: af_mean
        source: AF
        aggregator: mean
      - name: af_alleles
        source: allele
        include_attributes:
          - AF
Then run the following command to annotate the regions:
annotate_tabular regions.tsv position_pipeline.yaml
This produces regions.annotated.tsv shown below.
chrom |
pos_beg |
pos_end |
worst_effect |
worst_effect_genes |
phyloP7way_max |
phyloP7way_mean |
phyloP7way_list |
af_max |
af_mean |
af_alleles |
|---|---|---|---|---|---|---|---|---|---|---|
chr1 |
11796260 |
11796280 |
coding |
MTHFR |
1.06 |
0.493 |
[-1.2419999837875366, 0.9169999957084656, 0.8709999918937683, 0.032999999821186066, 0.9909999966621399, 0.9909999966621399, -0.6769999861717224, 1.062000036239624, 0.8709999918937683, 0.07500000298023224, 1.062000036239624, 1.062000036239624, 0.0689999982714653, -0.2750000059604645, -0.6830000281333923, 0.9169999957084656, 1.062000036239624, 1.062000036239624, 0.0689999982714653, 1.062000036239624, 1.062000036239624] |
0.000311 |
2.46e-05 |
[‘chr1:11796268:C:T:3.42e-06’, ‘chr1:11796278:G:C:6.84e-07’, ‘chr1:11796271:A:G:8.89e-06’, ‘chr1:11796262:C:T:1.37e-06’, ‘chr1:11796280:ATG:A:6.84e-07’, ‘chr1:11796275:G:C:1.37e-06’, ‘chr1:11796262:C:A:1.64e-05’, ‘chr1:11796260:T:C:5.81e-05’, ‘chr1:11796266:C:T:6.84e-07’, ‘chr1:11796275:G:A:6.84e-07’, ‘chr1:11796273:C:T:5.47e-06’, ‘chr1:11796276:A:G:1.37e-06’, ‘chr1:11796272:G:C:3.42e-06’, ‘chr1:11796269:A:G:4.1e-06’, ‘chr1:11796278:G:A:0.000311’, ‘chr1:11796264:T:C:1.37e-06’, ‘chr1:11796261:G:A:6.84e-07’, ‘chr1:11796273:CG:C:6.16e-06’, ‘chr1:11796274:G:A:4.24e-05’, ‘chr1:11796278:G:T:2.26e-05’] |
chr14 |
21397800 |
21397840 |
coding |
CHD8 |
1.06 |
0.334 |
[-0.0560000017285347, 0.07900000363588333, -0.35100001096725464, -0.3100000023841858, 0.09000000357627869, 0.09000000357627869, -1.3910000324249268, 0.041999999433755875, 0.07500000298023224, -0.4490000009536743, -0.9570000171661377, -0.21699999272823334, -0.6100000143051147, -0.4099999964237213, 0.041999999433755875, 1.062000036239624, 0.9909999966621399, -0.029999999329447746, 0.8709999918937683, 0.9909999966621399, 0.9909999966621399, 1.062000036239624, 0.8709999918937683, 0.8709999918937683, 0.8709999918937683, 0.9909999966621399, 0.9909999966621399, 0.8709999918937683, 0.04600000008940697, 1.062000036239624, 0.8709999918937683, -0.515999972820282, 1.062000036239624, 0.017999999225139618, 0.0689999982714653, 0.9909999966621399, 0.8709999918937683, 0.09000000357627869, 0.9169999957084656, 1.062000036239624, 0.0689999982714653] |
5e-05 |
8.14e-06 |
['chr14:21397809:A:T:6.85e-07', 'chr14:21397833:T:C:1.71e-05', 'chr14:21397800:A:G:6.86e-07', 'chr14:21397811:A:G:1.37e-06', 'chr14:21397824:C:A:6.84e-07', 'chr14:21397810:T:C:3.42e-06', 'chr14:21397804:A:T:5e-05', 'chr14:21397809:A:G:4.11e-06', 'chr14:21397823:C:T:6.84e-07', 'chr14:21397814:A:C:4.18e-05', 'chr14:21397806:C:G:6.85e-07', 'chr14:21397826:T:G:3.42e-06', 'chr14:21397830:C:T:1.37e-06', 'chr14:21397813:T:C:6.85e-07', 'chr14:21397824:C:G:6.84e-07', 'chr14:21397829:A:C:6.84e-07', 'chr14:21397808:A:C:1.3e-05', 'chr14:21397803:GAACA:G:5.49e-06'] |
This output shows how the same pipeline summarizes annotations over genomic intervals. The simple_effect_annotator reports the broad genomic context of each region and any overlapping genes. For phyloP7way, the max and mean columns summarize conservation scores across the positions spanned by each region, while the list column reports the individual position-level values. For AlphaMissense, GAIn aggregates across both the positions in the region and the possible allelic changes at those positions, producing summary am_pathogenicity values and listing the contributing alleles when available.
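As a rough illustration of the region-level aggregation described above (not GAIn's own implementation), the following Python sketch recomputes the max and mean summaries from the position-level phyloP7way values listed for the chr1 MTHFR region. The values are copied from the list column and rounded to three decimals; the function name summarize_region is hypothetical.

```python
from statistics import mean

# Position-level phyloP7way scores for chr1:11796260-11796280, copied from
# the list column of the table above and rounded to three decimals.
phylop_scores = [
    -1.242, 0.917, 0.871, 0.033, 0.991, 0.991, -0.677,
    1.062, 0.871, 0.075, 1.062, 1.062, 0.069, -0.275,
    -0.683, 0.917, 1.062, 1.062, 0.069, 1.062, 1.062,
]


def summarize_region(scores):
    """Return (max, mean) over the position-level scores of one region."""
    return max(scores), mean(scores)


region_max, region_mean = summarize_region(phylop_scores)
# Prints: max=1.06 mean=0.493
print(f"max={region_max:.3g} mean={region_mean:.3g}")
```

The 21 values correspond to the 21 positions spanned by the region, and the two summaries agree with the max and mean columns reported in the table.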
Adding public GRRs
So far, the annotation examples have used resources from the main IossifovLab GRR. We also provide another public repository, GRR-ENCODE, which contains ENCODE-derived functional genomics tracks that can be used in annotation pipelines. GRR-ENCODE contains approximately 8,000 resources, including ATAC-seq, DNase-seq, histone ChIP-seq, and transcription factor ChIP-seq tracks.
To use these resources, add GRR-ENCODE to the GRR definition file, ~/.grr_definition.yaml. The configuration below connects GAIn to both the main GRR and GRR-ENCODE:
id: "remote_GRRs"
type: group
cache_dir: "<path_to_cache>/remote_grr_cache"
children:
  - id: "main-GRR"
    type: "url"
    url: "https://grr.iossifovlab.com"
  - id: "GRR-ENCODE"
    type: "url"
    url: "https://grr-encode.iossifovlab.com"
With this configuration, GAIn can use resources from both repositories. For example, after adding GRR-ENCODE to the GRR definition file, a pipeline can use an ENCODE ATAC-seq resource as a position score annotator:
- position_score_annotator:
    resource_id: ATAC-seq/ENCSR814RGG
This makes ENCODE-derived regulatory tracks available through the same pipeline syntax used for other position score resources.
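The GRR definition shown above contains the placeholder <path_to_cache>, which must be replaced with a real directory before GAIn can use the cache. The following Python sketch is one possible way to catch a placeholder that was left in by mistake. The helper name find_placeholders is hypothetical, and the check is a plain text search, not a GAIn feature.

```python
import re

# Matches angle-bracket placeholders such as <path_to_cache>.
PLACEHOLDER = re.compile(r"<[A-Za-z_][A-Za-z0-9_]*>")


def find_placeholders(text):
    """Return the sorted, de-duplicated <placeholder> tokens found in text."""
    return sorted(set(PLACEHOLDER.findall(text)))


# Excerpt of the definition above, still containing the placeholder.
definition = '''\
id: "remote_GRRs"
type: group
cache_dir: "<path_to_cache>/remote_grr_cache"
'''

# Prints: ['<path_to_cache>']
print(find_placeholders(definition))
```

To check an actual definition file, read its contents (for example with pathlib.Path.home() / ".grr_definition.yaml") and pass the text to the same function. An empty list means no placeholders remain.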
Adding local GRRs
If an annotation workflow requires a resource that is not available in the public GRRs, you can add it through a local GRR. The resource may be a score generated in your own study or a previously published dataset. Here, we demonstrate this by adding the Collins rCNV 2022 dosage sensitivity scores, pHaplo and pTriplo, which estimate gene-level sensitivity to deletion and duplication, respectively. This section walks through one simple example: adding a gene score resource to a local GRR and using it together with the public GRRs. The "Getting started with GRR" and "Genomic resources and repositories" sections of the GAIn documentation provide more detailed instructions for adding other resource types, configuring resources, and managing local GRRs.
First, create a local GRR directory and a resource directory for the Collins rCNV 2022 dosage sensitivity scores:
mkdir -p local_GRR/collins_dosage_sensitivity
cd local_GRR/collins_dosage_sensitivity
Download the Collins rCNV dosage sensitivity score table:
curl -L -O https://zenodo.org/record/6347673/files/Collins_rCNV_2022.dosage_sensitivity_scores.tsv.gz
Inspect the first few lines, shown below, to see the available columns before writing the genomic_resource.yaml configuration file:
gzip -dc Collins_rCNV_2022.dosage_sensitivity_scores.tsv.gz | head -n 5
| #gene | pHaplo | pTriplo |
|---|---|---|
| CACNA1C | 0.99898184581082 | 1 |
| ZNF462 | 1 | 0.987995995573708 |
| CHD8 | 0.991649600531021 | 0.999999986508108 |
| GRIN2B | 0.996808517025246 | 0.999999958700358 |
The file contains a gene column with the header #gene and two gene scores, pHaplo and pTriplo, for each gene. To make this file available as a GAIn resource, create a genomic_resource.yaml file in the same directory:
local_GRR/
└── collins_dosage_sensitivity/
    ├── Collins_rCNV_2022.dosage_sensitivity_scores.tsv.gz
    └── genomic_resource.yaml
The genomic_resource.yaml file describes the resource to GAIn:
type: gene_score
filename: Collins_rCNV_2022.dosage_sensitivity_scores.tsv.gz
gene_column: "#gene"
scores:
  - id: pHaplo
    desc: "Haplosensitivity probability"
    histogram:
      type: number
  - id: pTriplo
    desc: "Triplosensitivity probability"
    histogram:
      type: number
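Before registering the resource, it can help to confirm that the column names used in genomic_resource.yaml (the gene_column value and the score ids) actually appear in the table header. The following Python sketch shows one way to do that with the standard library. The helper name missing_columns is hypothetical, and the sketch assumes the table is tab-separated, as the .tsv extension suggests.

```python
import csv
import gzip
import os


def missing_columns(table_path, required):
    """Return the required column names absent from a gzipped TSV header."""
    with gzip.open(table_path, "rt", newline="") as handle:
        header = next(csv.reader(handle, delimiter="\t"))
    return [name for name in required if name not in header]


# Run the check only if the downloaded table is in the current directory.
table = "Collins_rCNV_2022.dosage_sensitivity_scores.tsv.gz"
if os.path.exists(table):
    # An empty list means every configured column is present.
    print(missing_columns(table, ["#gene", "pHaplo", "pTriplo"]))
```

This only checks names. It does not validate the configuration against GAIn's resource schema.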
Then add the local GRR to the GRR definition file, ~/.grr_definition.yaml. For annotation, the local GRR only needs to be a directory that contains resource subdirectories with valid genomic_resource.yaml files. For example, the configuration below connects GAIn to the main GRR, GRR-ENCODE, and the new local GRR:
type: group
id: "my_GRRs"
children:
  - type: group
    id: "remote_GRRs"
    cache_dir: "<path_to_cache>/remote_grr_cache"
    children:
      - id: "main-GRR"
        type: "url"
        url: "https://grr.iossifovlab.com"
      - id: "GRR-ENCODE"
        type: "url"
        url: "https://grr-encode.iossifovlab.com"
  - id: "local_GRR"
    type: "directory"
    directory: "<path_to_local_GRR>/local_GRR"
With this configuration, GAIn can use the local resource in annotation pipelines, as well as the public resources in the main GRR and GRR-ENCODE. For example, the following custom pipeline combines resources from all three repositories: a gene models resource and ClinVar scores from the main GRR, an ATAC-seq track from GRR-ENCODE, and the collins_dosage_sensitivity gene score from local_GRR.
Download the example pipeline (multiple_grr_pipeline.yaml), which uses multiple GRRs:
preamble:
  summary: Custom pipeline using public and local resources
  input_reference_genome: hg38/genomes/GRCh38-hg38
annotators:
  - effect_annotator:
      gene_models: hg38/gene_models/MANE/1.5
      attributes:
        - worst_effect
        - gene_list
  - normalize_allele_annotator
  - allele_score_annotator:
      resource_id: hg38/scores/ClinVar_20251019
      input_annotatable: normalized_allele
      attributes:
        - CLNSIG
        - CLNDN
  - position_score_annotator: ATAC-seq/ENCSR814RGG
  - gene_score_annotator:
      resource_id: collins_dosage_sensitivity
      input_gene_list: gene_list
      attributes:
        - name: pHaplo
          source: pHaplo
To annotate the original example input with this pipeline, run:
annotate_tabular small_input.csv multiple_grr_pipeline.yaml -o small_input_multiple_grr.annotated.csv
The output contains the effect annotations, the ClinVar annotations, the ENCODE-derived position score, and the pHaplo score from the local GRR:
| chrom | pos | ref | alt | worst_effect | CLNSIG | CLNDN | ATAC-seq_ENCSR814RGG | pHaplo |
|---|---|---|---|---|---|---|---|---|
| chr14 | 21415880 | G | A | nonsense | Pathogenic/Likely_pathogenic | not_provided\|Intellectual_developmental_disorder_with_autism_and_macrocephaly | | CHD8:0.992 |
| chr17 | 7674904 | TCT | T | frame-shift | Pathogenic | Hereditary_cancer-predisposing_syndrome\|TP53-related_disorder\|not_provided\|Li-Fraumeni_syndrome_1\|Ovarian_neoplasm\|Li-Fraumeni_syndrome | 2.18 | TP53:0.85 |
| chr7 | 117587806 | G | A | missense | Pathogenic | CFTR-related_disorder\|Cystic_fibrosis\|Congenital_bilateral_aplasia_of_vas_deferens_from_CFTR_mutation\|not_provided\|Hereditary_pancreatitis\|Bronchiectasis_with_or_without_elevated_sweat_chloride_1\|ivacaftor_response_-_Efficacy | | CFTR:0.115 |
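The pHaplo values in this output are reported as GENE:score pairs. For downstream analysis it is often convenient to turn such a cell into a mapping from gene symbol to number. The following Python sketch shows one way to do that. The function name parse_gene_scores is hypothetical, and the separator between multiple entries (sep) is an assumption, since the example output only shows one gene per variant; adjust it to match your own output.

```python
def parse_gene_scores(cell, sep=";"):
    """Parse 'GENE:score' entries into a {gene: float} dictionary.

    An empty cell yields an empty dictionary. The separator between
    multiple entries is an assumption and can be changed through `sep`.
    """
    result = {}
    for entry in filter(None, (part.strip() for part in cell.split(sep))):
        # Split on the last colon so the score is always the final field.
        gene, _, value = entry.rpartition(":")
        result[gene] = float(value)
    return result


# Cells copied from the pHaplo column above, plus an empty cell.
for cell in ["CHD8:0.992", "TP53:0.85", "CFTR:0.115", ""]:
    print(parse_gene_scores(cell))
```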
Reannotation
When iterating on an analysis, you often want to run a new annotation pipeline on a dataset that has already been annotated. If the new pipeline shares any steps with the old one (for example, the same effect annotator or the same score lookup), recomputing those attributes is wasteful, especially for large annotation jobs.
GAIn supports reannotation, which allows it to reuse attributes produced by unchanged steps in a previous pipeline run while computing only the annotations that are new or modified. To illustrate this, we will use the clinical annotation pipeline, pipeline/hg38_clinical_annotation. The full contents of this pipeline, including the attributes it produces, can be viewed here: hg38 clinical annotation pipeline. This pipeline annotates hg38 variants with commonly used clinical resources, including gene effects, conservation scores, allele frequencies, clinical significance, and gene-level constraint scores.
To establish a baseline runtime, we run it on the larger SSC whole-exome sequencing input used earlier and record the elapsed time. This file contains approximately 1.4 million variants and has been sorted and indexed for parallel annotation (SSC_WES_variants_select.sorted.tsv.bgz).
time annotate_tabular SSC_WES_variants_select.sorted.tsv.bgz pipeline/hg38_clinical_annotation -o clinical_annotation.tsv.bgz
In our test, this took approximately 9 minutes and produced clinical_annotation.tsv.bgz, which contains the annotations generated by the clinical annotation pipeline.
Now suppose that, after running the clinical annotation pipeline, you want to apply a modified version of the pipeline. In this example, the modified pipeline differs from the original clinical pipeline in three ways:
it replaces the ClinVar resource ClinVar_20251019 with an earlier ClinVar version, ClinVar_20240730.
it removes the phyloP7way conservation annotation.
it adds one GRR-ENCODE annotation, the ATAC-seq position score track ATAC-seq/ENCSR814RGG.
These changes are defined in the modified pipeline, hg38_clinical_modified.yaml, shown below with comments marking the modifications:
preamble:
  input_reference_genome: hg38/genomes/GRCh38-hg38
  summary: Clinical Annotation Pipeline for hg38
  description: This pipeline annotates hg38 variants with clinical resources.
annotators:
  - effect_annotator:
      gene_models: hg38/gene_models/MANE/1.5
      attributes:
        - name: worst_effect_MANE_1_5
          source: worst_effect
        - name: effect_details_MANE_1_5
          source: effect_details
        - name: gene_effects_MANE_1_5
          source: gene_effects
  - normalize_allele_annotator:
      genome: hg38/genomes/GRCh38-hg38
  - allele_score_annotator:
      resource_id: hg38/scores/dbSNP
      input_annotatable: normalized_allele
      attributes:
        - name: dbSNP_rs_number
          source: RS
  - allele_score_annotator:
      resource_id: hg38/variant_frequencies/gnomAD_4.1.0/exomes/ALL
      input_annotatable: normalized_allele
  - allele_score_annotator:
      resource_id: hg38/variant_frequencies/gnomAD_4.1.0/genomes/ALL
      input_annotatable: normalized_allele
  - allele_score_annotator:
      resource_id: hg38/scores/ClinVar_20240730  #### MODIFIED: replaces ClinVar_20251019
      input_annotatable: normalized_allele
      attributes:
        - name: clinical_significance
          source: CLNSIG
        - name: clinical_disease_name
          source: CLNDN
  #### REMOVED ANNOTATOR
  # - position_score_annotator:
  #     resource_id: hg38/scores/phyloP7way
  #     attributes:
  #       - name: phyloP7way
  #         source: phyloP7way
  - allele_score_annotator:
      resource_id: hg38/scores/AlphaMissense
      input_annotatable: normalized_allele
      attributes:
        - name: AlphaMissense_pathogenicity
          source: am_pathogenicity
        - name: AlphaMissense_class
          source: am_class
  - liftover_annotator:
      chain: liftover/hg38_to_hg19
      source_genome: hg38/genomes/GRCh38-hg38
      target_genome: hg19/genomes/GATK_ResourceBundle_5777_b37_phiX174
      attributes:
        - source: liftover_annotatable
          name: hg19_annotatable
          internal: true
  - allele_score_annotator:
      resource_id: hg19/scores/MPC
      input_annotatable: hg19_annotatable
      attributes:
        - name: MPC_score
          source: MPC
  - effect_annotator:
      gene_models: hg38/gene_models/GENCODE/49/basic/ALL
      genome: hg38/genomes/GRCh38-hg38
      attributes:
        - name: worst_effect_GENCODE_49
          source: worst_effect
        - name: effect_details_GENCODE_49
          source: effect_details
        - name: gene_effects_GENCODE_49
          source: gene_effects
        - name: gene_list
          internal: true
  - gene_score_annotator:
      resource_id: gene_properties/gene_scores/pLI
      input_gene_list: gene_list
      attributes:
        - name: pLI_rank_all
          source: pLI_rank
        - name: pLI_rank_min
          source: pLI_rank
          aggregator: min
  - gene_score_annotator:
      resource_id: gene_properties/gene_scores/LOEUF
      input_gene_list: gene_list
      attributes:
        - name: LOEUF_rank_all
          source: LOEUF_rank
        - name: LOEUF_rank_min
          source: LOEUF_rank
          aggregator: min
  - position_score_annotator:  #### NEW ANNOTATOR
      resource_id: ATAC-seq/ENCSR814RGG
You could run hg38_clinical_modified.yaml directly on the original input file, but that would recompute many annotations that are already present in clinical_annotation.tsv.bgz. Instead, use --reannotate and pass the original clinical annotation pipeline, so GAIn can reuse the previously computed attributes and only compute the annotations that are new or modified.
The general form of a reannotation command is:
annotate_tabular first_annotation second_pipeline --reannotate first_pipeline -o second_annotation
Here, first_annotation is the output from the previous annotation run, second_pipeline is the updated pipeline you want to apply, first_pipeline is the pipeline that produced the existing annotations, and second_annotation is the new output file. In this example, we also use time to record the elapsed time:
time annotate_tabular clinical_annotation.tsv.bgz hg38_clinical_modified.yaml --reannotate pipeline/hg38_clinical_annotation -o modified_clinical_annotation.tsv.bgz
When you run this command, clinical_annotation.tsv.bgz is used as the input table. This file already contains the annotations produced by pipeline/hg38_clinical_annotation. The modified pipeline, hg38_clinical_modified.yaml, keeps the unchanged parts of the original clinical pipeline, replaces the ClinVar annotation with results from ClinVar_20240730, omits the phyloP7way annotation, and adds the ATAC-seq position score from ATAC-seq/ENCSR814RGG. The key part of the command is --reannotate pipeline/hg38_clinical_annotation: it tells GAIn which pipeline originally generated the annotation columns already present in the input file. GAIn can then recognize the shared annotation steps, reuse the existing attributes where appropriate, compute the modified or newly requested annotations, and exclude annotations that are no longer requested by the modified pipeline. In our test, this reannotation step took less than 30 seconds because most of the requested annotations were already present in the input file. The result is written to modified_clinical_annotation.tsv.bgz.
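The idea behind reannotation can be pictured as a comparison between the two pipelines. The following Python sketch is a conceptual illustration only, not GAIn's actual algorithm. It represents each annotator step as a hypothetical (annotator type, resource ID) pair, covers only a few steps of the clinical pipeline, and ignores details a real implementation would presumably need to handle, such as attribute lists and dependencies between annotators.

```python
def plan_reannotation(old_steps, new_steps):
    """Classify annotator steps as reused, computed, or dropped.

    Each pipeline is a list of (annotator_type, resource_id) tuples. A step
    is reused only if an identical step exists in the old pipeline.
    """
    old, new = set(old_steps), set(new_steps)
    return {
        "reuse": [step for step in new_steps if step in old],
        "compute": [step for step in new_steps if step not in old],
        "drop": [step for step in old_steps if step not in new],
    }


# Simplified excerpts of the original and modified clinical pipelines.
old_pipeline = [
    ("effect_annotator", "hg38/gene_models/MANE/1.5"),
    ("allele_score_annotator", "hg38/scores/ClinVar_20251019"),
    ("position_score_annotator", "hg38/scores/phyloP7way"),
    ("allele_score_annotator", "hg38/scores/AlphaMissense"),
]
new_pipeline = [
    ("effect_annotator", "hg38/gene_models/MANE/1.5"),
    ("allele_score_annotator", "hg38/scores/ClinVar_20240730"),
    ("allele_score_annotator", "hg38/scores/AlphaMissense"),
    ("position_score_annotator", "ATAC-seq/ENCSR814RGG"),
]

plan = plan_reannotation(old_pipeline, new_pipeline)
for action, steps in plan.items():
    print(action, [resource for _, resource in steps])
```

In this simplified view, the MANE effect annotator and AlphaMissense are reused, the older ClinVar version and the ATAC-seq track are computed, and the newer ClinVar version and phyloP7way are dropped from the output, which mirrors the three modifications described above.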