Getting started on CLI
Installation
The GAIn platform is developed in Python and supports Python 3.11 and up.
Begin by creating an empty Conda environment named gain:
mamba create -n gain
Activate the environment with the following command:
mamba activate gain
Afterwards, install the gain_core conda package:
mamba install -c bioconda -c iossifovlab gain_core
This command installs GAIn and all of its dependencies.
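Before moving on, it can help to confirm that the active environment meets the requirements above. The following is a minimal sketch using only the Python standard library: it checks the interpreter against the 3.11 minimum and looks on PATH for the three command-line tools used later in this guide. The helper name check_environment is hypothetical and is not part of GAIn.

```python
import shutil
import sys

# Tool names are taken from this guide; the helper itself is illustrative only.
TOOLS = ("grr_browse", "annotate_columns", "annotate_vcf")


def check_environment(min_version=(3, 11), tools=TOOLS):
    """Return a dict describing whether the environment looks ready."""
    report = {"python_ok": sys.version_info[:2] >= tuple(min_version)}
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        report[tool] = shutil.which(tool) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

Run it with the gain environment activated; any entry reported as missing suggests the installation step did not complete in that environment.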
Browse available resources
GAIn installs with access to the default IossifovLab GRR. You can confirm which GRRs are available to you and browse the resources hosted on them by running the command below:
grr_browse
The output shows that you have access to the IossifovLab GRR server and lists all the resources available to you on that server.
No GRR definition found, using the DEFAULT_DEFINITION
id: default
type: http
url: https://grr.iossifovlab.com
gene_score 0 139 11.19 MB GRR gene_properties/gene_scores/GTEx_V11_RNAexpression
gene_score 0 6 7.8 MB GRR gene_properties/gene_scores/Iossifov_Wigler_PNAS_2015
gene_score 0 11 576.07 KB GRR gene_properties/gene_scores/LGD
gene_score 0 9 13.18 MB GRR gene_properties/gene_scores/LOEUF
gene_score 0 13 505.9 KB GRR gene_properties/gene_scores/RVIS
...
Simple annotation pipeline
Using the command-line tools, users can annotate large sets of variants, positions, or regions on a standard personal computer. As a simple example, we will annotate the following three variants.
| chrom | pos | ref | alt |
|---|---|---|---|
| chr14 | 21415880 | G | A |
| chr17 | 7674904 | TCT | T |
| chr7 | 117587806 | G | A |
Each row gives a chromosome, a position, the reference allele, and the alternate allele.
Create a file named variants.txt with this content in a working folder of your choice. The columns must be tab-separated.
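Editors often replace tabs with spaces silently, so one way to be sure the file is tab-separated is to generate it. The sketch below writes the three example variants with Python's csv module. It assumes the header row shown in the table is part of the file; the function name write_variants is hypothetical.

```python
import csv
from pathlib import Path

# The three example variants from the table above.
VARIANTS = [
    ("chr14", 21415880, "G", "A"),
    ("chr17", 7674904, "TCT", "T"),
    ("chr7", 117587806, "G", "A"),
]


def write_variants(path, variants=VARIANTS):
    """Write a header row plus one tab-separated row per variant."""
    with Path(path).open("w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["chrom", "pos", "ref", "alt"])
        writer.writerows(variants)


if __name__ == "__main__":
    write_variants("variants.txt")
```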
In order to tell GAIn which annotation attributes we are interested in, we use simple YAML files called annotation pipelines.
Below we will introduce a simple annotation pipeline which we will use on variants.txt in the next section.
The preamble section is optional and can be used to define the genome the variants are in and to store additional metadata about the pipeline.
preamble:
  summary: Simple pipeline
  input_reference_genome: hg38/genomes/GRCh38-hg38
After the preamble, various annotators are listed. Annotation runs from top to bottom. Attributes produced by earlier annotators can be used by later annotators. The following lines tell GAIn to use version 1.3 of the MANE gene model to find which genes are affected by each variant and what the worst predicted effect is.
annotators:
  - effect_annotator:
      gene_models: hg38/gene_models/MANE/1.3
      attributes:
        - worst_effect
        - gene_list
Next is a position score annotator. phyloP7way provides a score for conservation at this genomic coordinate, computed from a multiple alignment of seven species.
  - position_score_annotator:
      resource_id: hg38/scores/phyloP7way
Next, we add allele scores from ClinVar: CLNSIG, which encodes the clinical significance of a
variant (e.g. benign, pathogenic), and CLNDN, the associated disease name. Allele score annotators are
preceded by a normalize_allele_annotator, which expresses the allele in canonical form.
  - normalize_allele_annotator
  - allele_score_annotator:
      resource_id: hg38/scores/ClinVar_20240730
      input_annotatable: normalized_allele
      attributes:
        - CLNSIG
        - CLNDN
Copy all of the pipeline lines above into a new text file called annotation_pipeline.yaml.
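Because YAML is sensitive to indentation, it may help to see the fragments joined into one file. The sketch below writes the assembled pipeline from Python. The nesting is a reconstruction based on ordinary YAML conventions rather than something copied from GAIn, and write_pipeline is a hypothetical helper name.

```python
from pathlib import Path

# Fragments from this section joined into one document. The indentation is an
# assumption based on ordinary YAML nesting; every key and value is unchanged.
PIPELINE_YAML = """\
preamble:
  summary: Simple pipeline
  input_reference_genome: hg38/genomes/GRCh38-hg38
annotators:
  - effect_annotator:
      gene_models: hg38/gene_models/MANE/1.3
      attributes:
        - worst_effect
        - gene_list
  - position_score_annotator:
      resource_id: hg38/scores/phyloP7way
  - normalize_allele_annotator
  - allele_score_annotator:
      resource_id: hg38/scores/ClinVar_20240730
      input_annotatable: normalized_allele
      attributes:
        - CLNSIG
        - CLNDN
"""


def write_pipeline(path="annotation_pipeline.yaml", text=PIPELINE_YAML):
    """Write the assembled pipeline text and return the path written."""
    target = Path(path)
    target.write_text(text)
    return target


if __name__ == "__main__":
    print(f"wrote {write_pipeline()}")
```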
Annotating tabular input
GAIn performs annotations by combining three ingredients: the genomic resources (in one or more GRRs), the annotatables to annotate, and a YAML annotation pipeline describing which attributes to compute. Now that all three are in place, we can execute the following command to apply the pipeline to our tabular variant file:
annotate_columns variants.txt annotation_pipeline.yaml
This command tells GAIn to annotate the tabular file called variants.txt using annotation_pipeline.yaml.
Running this command produces an output file named variants_annotated.txt, shown below.
| chrom | pos | ref | alt | worst_effect | genes | phylop7way | CLNSIG | CLNDN |
|---|---|---|---|---|---|---|---|---|
| chr14 | 21415880 | G | A | nonsense | CHD8 | 0.917 | Pathogenic/Likely_pathogenic | Intellectual_developmental_disorder_with_autism_and_macrocephaly\|not_provided |
| chr17 | 7674904 | TCT | T | frame-shift | TP53 | -0.12 | Pathogenic | Li-Fraumeni_syndrome_1\|Hereditary_cancer-predisposing_syndrome\|Li-Fraumeni_syndrome\|Ovarian_neoplasm\|not_provided\|TP53-related_disorder |
| chr7 | 117587806 | G | A | missense | CFTR | 0.917 | Pathogenic | Hereditary_pancreatitis\|CFTR-related_disorder\|Cystic_fibrosis\|Congenital_bilateral_aplasia_of_vas_deferens_from_CFTR_mutation\|ivacaftor_response_-_Efficacy\|Bronchiectasis_with_or_without_elevated_sweat_chloride_1\|not_provided |
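The annotated table is plain tab-separated text, so it can be loaded with standard tools for further filtering. The sketch below parses two of the rows above with Python's csv module, converts the numeric columns, and splits CLNDN on the vertical bar that ClinVar uses between disease names. It assumes every row has a phylop7way value; read_annotated is a hypothetical helper name.

```python
import csv
import io

# Two rows copied from the output above, rebuilt with tab separators.
SAMPLE = (
    "chrom\tpos\tref\talt\tworst_effect\tgenes\tphylop7way\tCLNSIG\tCLNDN\n"
    "chr14\t21415880\tG\tA\tnonsense\tCHD8\t0.917\tPathogenic/Likely_pathogenic\t"
    "Intellectual_developmental_disorder_with_autism_and_macrocephaly|not_provided\n"
    "chr17\t7674904\tTCT\tT\tframe-shift\tTP53\t-0.12\tPathogenic\t"
    "Li-Fraumeni_syndrome_1|Hereditary_cancer-predisposing_syndrome|"
    "Li-Fraumeni_syndrome|Ovarian_neoplasm|not_provided|TP53-related_disorder\n"
)


def read_annotated(handle):
    """Parse annotated rows, converting numbers and splitting CLNDN."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["pos"] = int(row["pos"])
        # Assumes a score is present; an empty cell would need extra handling.
        row["phylop7way"] = float(row["phylop7way"])
        row["CLNDN"] = row["CLNDN"].split("|")
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in read_annotated(io.StringIO(SAMPLE)):
        print(row["genes"], row["worst_effect"], len(row["CLNDN"]))
```

To read the real file instead of the embedded sample, pass an open file handle for variants_annotated.txt to read_annotated.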
Annotating VCF input
GAIn can also annotate variants stored in VCF files. The command is similar to annotate_columns, but the input and
output files are in VCF format. Here, the same variants are stored in a file called variants.vcf.
##fileformat=VCFv4.1
##reference=GRCh38-hg38
##contig=<ID=chr7>
##contig=<ID=chr14>
##contig=<ID=chr17>
#CHROM POS ID REF ALT QUAL FILTER INFO
chr14 21415880 . G A . . .
chr17 7674904 . TCT T . . .
chr7 117587806 . G A . . .
To annotate them, run:
annotate_vcf variants.vcf annotation_pipeline.yaml
This command produces an output file named variants_annotated.vcf, which contains the same variants with
additional annotation fields in the INFO column.
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##reference=GRCh38-hg38
##contig=<ID=chr7>
##contig=<ID=chr14>
##contig=<ID=chr17>
##pipeline_annotation_tool=GPF variant annotation.
##INFO=<ID=worst_effect,Number=A,Type=String,Description="Worst effect accross all transcripts.">
##INFO=<ID=genes,Number=A,Type=String,Description="Comma separated list of all affected genes.">
##INFO=<ID=phylop7way,Number=A,Type=String,Description="The score is a number that reflects the conservation at a position.">
##INFO=<ID=CLNSIG,Number=A,Type=String,Description="Aggregate germline classification for this single variant; multiple values are separated by a vertical bar">
##INFO=<ID=CLNDN,Number=A,Type=String,Description="ClinVar's preferred disease name for the concept specified by disease identifiers in CLNDISDB">
#CHROM POS ID REF ALT QUAL FILTER INFO
chr14 21415880 . G A . . worst_effect=nonsense;genes=CHD8;phylop7way=0.917;CLNSIG=Pathogenic/Likely_pathogenic;CLNDN=Intellectual_developmental_disorder_with_autism_and_macrocephaly|not_provided
chr17 7674904 . TCT T . . worst_effect=frame-shift;genes=TP53;phylop7way=-0.12;CLNSIG=Pathogenic;CLNDN=Li-Fraumeni_syndrome_1|Hereditary_cancer-predisposing_syndrome|Li-Fraumeni_syndrome|Ovarian_neoplasm|not_provided|TP53-related_disorder
chr7 117587806 . G A . . worst_effect=missense;genes=CFTR;phylop7way=0.917;CLNSIG=Pathogenic;CLNDN=Hereditary_pancreatitis|CFTR-related_disorder|Cystic_fibrosis|Congenital_bilateral_aplasia_of_vas_deferens_from_CFTR_mutation|ivacaftor_response_-_Efficacy|Bronchiectasis_with_or_without_elevated_sweat_chloride_1|not_provided
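In the VCF output, each annotation attribute becomes a key=value entry in the semicolon-separated INFO column. The sketch below shows one way to unpack that column with plain Python; it is a minimal parser for illustration, not a replacement for a full VCF library, and parse_info is a hypothetical helper name.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag entries map to True."""
    if info in ("", "."):
        return {}
    fields = {}
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        fields[key] = value if sep else True
    return fields


# Record for the CHD8 variant, copied from the output above. VCF columns are
# tab-separated, so the line is rebuilt with tabs here.
RECORD = "\t".join([
    "chr14", "21415880", ".", "G", "A", ".", ".",
    "worst_effect=nonsense;genes=CHD8;phylop7way=0.917;"
    "CLNSIG=Pathogenic/Likely_pathogenic;"
    "CLNDN=Intellectual_developmental_disorder_with_autism_and_macrocephaly|not_provided",
])

if __name__ == "__main__":
    info = parse_info(RECORD.split("\t")[7])
    print(info["genes"], info["worst_effect"], float(info["phylop7way"]))
```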
Annotating positions and regions
GAIn is well suited for annotating genetic variants obtained from sequencing data,
but not all genomic experiments produce variant calls. Some assays instead identify genomic
positions or regions of interest, such as transcription start sites mapped by CAGE-seq or
regulatory intervals detected by ATAC-seq and ChIP-seq. For researchers working with these data types,
it is often valuable to interpret them using the same kinds of genomic resources used in variant annotation.
Although positions and regions do not contain allele information, and therefore cannot support every type of
variant-based annotation, GAIn can still take these inputs and annotate them with many relevant resources using
the annotate_columns tool.
Position inputs require only two columns: chromosome and position. Save the following tab-delimited text in a
file called positions.txt.
| chrom | pos |
|---|---|
| chr7 | 117587806 |
| chr7 | 115587806 |
Because position inputs do not include reference and alternate alleles, GAIn cannot infer the effect of a
specific allelic change on a gene product. However, it can still determine whether a position falls within a gene and,
if so, what broad part of the gene it overlaps. To do this, use simple_effect_annotator, which classifies loci into
broad categories such as intergenic and genic, and further subdivides genic loci into coding and several noncoding
classes. Save the following text as annotation_pipeline2.yaml.
- simple_effect_annotator:
    gene_models: hg38/gene_models/MANE/1.4
Then run the following command to annotate the positions:
annotate_columns positions.txt annotation_pipeline2.yaml
This produces positions_annotated.txt which contains:
| chrom | pos | worst_effect | worst_effect_genes |
|---|---|---|---|
| chr7 | 117587806 | coding | CFTR |
| chr7 | 115587806 | intergenic | |
This shows that the first position falls within a coding part of CFTR, whereas the second position is intergenic.
Position score resources can be applied directly to genomic positions, so position_score_annotator works on this
input without modification. GAIn can also use allele score resources with position inputs. In that case, because
the input specifies only the genomic position and not a particular allele, GAIn reports an aggregate value across
possible allelic changes at that site.
To extend the example, add the following annotators to annotation_pipeline2.yaml.
- position_score_annotator:
    resource_id: hg38/scores/phyloP7way
- allele_score_annotator:
    resource_id: hg38/scores/CADD_v1.6
    attributes:
      - cadd_raw
Then run the command again:
annotate_columns positions.txt annotation_pipeline2.yaml
This produces positions_annotated.txt which contains:
| chrom | pos | worst_effect | worst_effect_genes | phylop7way | cadd_raw |
|---|---|---|---|---|---|
| chr7 | 117587806 | coding | CFTR | 0.917 | 3.98 |
| chr7 | 115587806 | intergenic | | 0.158 | 0.472 |
phyloP7way measures evolutionary conservation at a genomic position. In this example,
the coding position has a higher conservation score than the intergenic position. CADD estimates the
deleteriousness of allelic changes, and for position inputs GAIn reports an aggregate value for the possible
alleles at that site. Here, the first position has a higher aggregate cadd_raw score than the second.
Region inputs require three columns: chromosome, beginning position, and end position.
Save the following tab-delimited text in a file called regions.txt.
| chrom | pos_beg | pos_end |
|---|---|---|
| chr1 | 1 | 100000 |
| chr1 | 11796321 | 11800000 |
As with position inputs, region inputs do not include reference and alternate alleles,
so GAIn cannot infer the effect of a specific allelic change on a gene product. However, many of the same
genomic resource types can still be applied to region inputs. Region inputs can also be evaluated with
simple_effect_annotator, which summarizes whether a region overlaps genic or intergenic sequence and
reports broad functional categories when applicable. Position score resources can be used on region
inputs by aggregating values across the positions spanned by each interval. Allele score resources can also be used,
but in that case GAIn must aggregate both across the positions in the region and across the possible allelic
changes at each position.
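The two-level aggregation described above can be sketched in plain Python. This illustrates the idea only: taking the maximum across alleles and the mean across positions is an assumption, not a statement of GAIn's defaults, and the scores are made-up toy values rather than real CADD data. The function name aggregate_region is hypothetical.

```python
from statistics import mean

# Hypothetical allele scores for a three-position region: for each position,
# one made-up value per possible alternate allele. Not real CADD data.
TOY_SCORES = {
    100: {"A": 0.2, "C": 0.5, "T": 0.1},
    101: {"C": 1.0, "G": 0.4, "T": 0.7},
    102: {"A": 0.3, "C": 0.3, "G": 0.9},
}


def aggregate_region(scores, allele_agg=max, position_agg=mean):
    """Collapse alleles at each position, then collapse across positions."""
    per_position = [allele_agg(alleles.values()) for alleles in scores.values()]
    return position_agg(per_position)


if __name__ == "__main__":
    print(aggregate_region(TOY_SCORES))
    # A position input is the special case of a one-position region.
    print(aggregate_region({100: TOY_SCORES[100]}))
```

Passing different functions for allele_agg and position_agg shows how strongly the reported value depends on the aggregators chosen.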
To illustrate this, reuse annotation_pipeline2.yaml, shown below as a reminder.
- simple_effect_annotator:
    gene_models: hg38/gene_models/MANE/1.4
- position_score_annotator:
    resource_id: hg38/scores/phyloP7way
- allele_score_annotator:
    resource_id: hg38/scores/CADD_v1.6
    attributes:
      - cadd_raw
Then run the following command to annotate the regions:
annotate_columns regions.txt annotation_pipeline2.yaml
This produces regions_annotated.txt which contains:
| chrom | pos_beg | pos_end | worst_effect | worst_effect_genes | phylop7way | cadd_raw |
|---|---|---|---|---|---|---|
| chr1 | 1 | 100000 | coding | OR4F5 | 0.0599 | 0.43 |
| chr1 | 11796321 | 11800000 | coding | MTHFR | 0.0348 | 0.269 |
This output shows how GAIn summarizes the functional context of each region. Depending on the interval, a region may be classified as intergenic, coding, or another category, and overlapping genes are reported when applicable.
Parallelization
Reannotation
When iterating on an analysis, you often want to run a new annotation pipeline on a dataset that has already been annotated. If the new pipeline shares any steps with the old one (for example, the same effect annotator or the same score lookup), recomputing those attributes is wasteful, especially for large annotation jobs.
GAIn supports reannotation, which reuses attributes that were already computed by a
previous pipeline run and computes only what is new. To see an example of reannotation, create the
following annotation pipeline and save it as pipeline_A.yaml:
- effect_annotator:
    gene_models: hg38/gene_models/MANE/1.3
    attributes:
      - genes
      - worst_effect
- position_score_annotator:
    resource_id: hg38/scores/phyloP7way
Run the pipeline on your input variants:
annotate_columns variants.txt pipeline_A.yaml -o variants_A.txt
This produces variants_A.txt, which includes the requested attributes:
| chrom | pos | ref | alt | genes | worst_effect | phyloP7way |
|---|---|---|---|---|---|---|
| chr14 | 21415880 | G | A | CHD8 | nonsense | 0.917 |
| chr17 | 7674904 | TCT | T | TP53 | frame-shift | -0.12 |
| chr7 | 117587806 | G | A | CFTR | missense | 0.917 |
Now suppose you want to annotate the same variants with a modified pipeline saved as pipeline_B.yaml.
In this example, Pipeline B is the same as Pipeline A but adds two more position score annotators.
- effect_annotator:
    gene_models: hg38/gene_models/MANE/1.3
    attributes:
      - genes
      - worst_effect
- position_score_annotator:
    resource_id: hg38/scores/phyloP7way
- position_score_annotator:
    resource_id: hg38/scores/phyloP30way
- position_score_annotator:
    resource_id: hg38/scores/phyloP100way
You could run Pipeline B directly on variants_A.txt, but that would recompute genes, worst_effect,
and phyloP7way even though they are already present. Instead, use --reannotate and pass the old pipeline
that produced the existing annotations:
annotate_columns variants_A.txt pipeline_B.yaml --reannotate pipeline_A.yaml -o variants_B.txt
When you run the command above, variants_A.txt is used as the input table
(the output produced by Pipeline A), and pipeline_B.yaml is the updated
pipeline you want to apply. The key part is --reannotate pipeline_A.yaml:
it tells GAIn which pipeline originally generated the annotation columns already present
in variants_A.txt, so GAIn can recognize any overlapping work and reuse those precomputed
attributes instead of recalculating them. The result is written to variants_B.txt, which
contains the attributes requested by Pipeline B, with any shared attributes carried forward
from the earlier run.
| chrom | pos | ref | alt | genes | worst_effect | phyloP7way | phyloP30way | phyloP100way |
|---|---|---|---|---|---|---|---|---|
| chr14 | 21415880 | G | A | CHD8 | nonsense | 0.917 | 1.18 | 1.25 |
| chr17 | 7674904 | TCT | T | TP53 | frame-shift | -0.12 | -0.076 | -1.14 |
| chr7 | 117587806 | G | A | CFTR | missense | 0.917 | 1.18 | 8.82 |
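The bookkeeping behind reannotation can be pictured as a comparison between the attributes of the old and new pipelines. The sketch below is a simplified model, not GAIn's implementation: it treats each pipeline as a list of output attribute names taken from the tables above and ignores cases such as an annotator whose parameters changed between runs. The function name plan_reannotation is hypothetical.

```python
# Output attributes of the two example pipelines, as seen in the tables above.
PIPELINE_A = ["genes", "worst_effect", "phyloP7way"]
PIPELINE_B = ["genes", "worst_effect", "phyloP7way", "phyloP30way", "phyloP100way"]


def plan_reannotation(old_attrs, new_attrs):
    """Split attributes into those reused, newly computed, and dropped."""
    old = set(old_attrs)
    new = set(new_attrs)
    return {
        "reuse": [a for a in new_attrs if a in old],
        "compute": [a for a in new_attrs if a not in old],
        "drop": [a for a in old_attrs if a not in new],
    }


if __name__ == "__main__":
    for action, attrs in plan_reannotation(PIPELINE_A, PIPELINE_B).items():
        print(f"{action}: {', '.join(attrs) or '-'}")
```

For the two example pipelines, only phyloP30way and phyloP100way land in the compute list, which matches the two new columns in variants_B.txt.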