Getting started on CLI
======================

Installation
------------

The GAIn platform is developed in Python and supports Python 3.11 and up.

Begin by creating an empty Conda environment named ``gain``:

.. code-block:: bash

    mamba create -n gain

To use this environment, activate it with the following command:

.. code-block:: bash

    mamba activate gain

Afterwards, install the ``gain_core`` conda package:

.. code-block:: bash

    mamba install -c bioconda -c iossifovlab gain_core

This command installs GAIn and all of its dependencies.

Browse available resources
--------------------------

GAIn installs with access to the default IossifovLab GRR. You can confirm
which GRRs are available to you and browse the resources hosted on them by
running the command below:

.. code-block:: bash

    grr_browse

The output shows that you have access to the IossifovLab GRR server and lists
all the resources available to you on that server.

.. code-block:: bash

    No GRR definition found, using the DEFAULT_DEFINITION
    id: default
    type: http
    url: https://grr.iossifovlab.com
    gene_score  0  139  11.19 MB   GRR  gene_properties/gene_scores/GTEx_V11_RNAexpression
    gene_score  0  6    7.8 MB     GRR  gene_properties/gene_scores/Iossifov_Wigler_PNAS_2015
    gene_score  0  11   576.07 KB  GRR  gene_properties/gene_scores/LGD
    gene_score  0  9    13.18 MB   GRR  gene_properties/gene_scores/LOEUF
    gene_score  0  13   505.9 KB   GRR  gene_properties/gene_scores/RVIS
    ...

Simple annotation pipeline
--------------------------

Using the command-line tools, users can annotate large sets of variants,
positions, or regions on a standard personal computer. As a simple example,
we will annotate the following three variants.

.. csv-table::
    :header-rows: 1

    chrom,pos,ref,alt
    chr14,21415880,G,A
    chr17,7674904,TCT,T
    chr7,117587806,G,A

The input consists of the chromosome, the position, the reference allele, and
the alternate allele of each variant.

Create a file named ``variants.txt`` with this content in a working folder of
your choice. The columns should be tab-separated.

To tell GAIn which annotation attributes we are interested in, we use simple
YAML files called annotation pipelines. Below we introduce a simple annotation
pipeline, which we will apply to ``variants.txt`` in the next section.

The preamble section is optional. It can be used to define the reference
genome of the input variants and to store additional metadata about the
pipeline.

.. code-block:: yaml

    preamble:
      summary: Simple pipeline
      input_reference_genome: hg38/genomes/GRCh38-hg38

After the preamble, the annotators are listed. Annotation runs from top to
bottom, and attributes produced by earlier annotators can be used by later
annotators.

The following lines tell GAIn to use version 1.3 of the MANE gene models to
find which genes are affected by each variant and what the worst predicted
effect is.

.. code-block:: yaml

    annotators:
      - effect_annotator:
          gene_models: hg38/gene_models/MANE/1.3
          attributes:
          - worst_effect
          - genes

Next is a position score annotator. phyloP7way provides a conservation score
for each genomic coordinate, computed from a multiple alignment of seven
species.

.. code-block:: yaml

      - position_score_annotator:
          resource_id: hg38/scores/phyloP7way

Next, we add allele scores from ClinVar: CLNSIG, which encodes the clinical
significance of a variant (e.g. benign, pathogenic), and CLNDN, the associated
disease name. Allele score annotators are preceded by a
``normalize_allele_annotator``, which expresses the allele in canonical form.

.. code-block:: yaml

      - normalize_allele_annotator
      - allele_score_annotator:
          resource_id: hg38/scores/ClinVar_20240730
          input_annotatable: normalized_allele
          attributes:
          - CLNSIG
          - CLNDN

Copy all of the pipeline lines above, in order, into a new text file called
``annotation_pipeline.yaml``.

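
Before moving on, it can help to confirm that ``variants.txt`` really is
tab-separated, since spaces pasted from a rendered table are easy to miss. The
snippet below is a minimal sketch that uses only standard shell tools
(``printf`` and ``awk``); it is not part of GAIn. It writes the three example
variants with explicit tab characters and then checks that every line has
exactly four tab-separated fields.

.. code-block:: bash

    # Write the header and the three example variants, using \t for tabs.
    printf 'chrom\tpos\tref\talt\n'      >  variants.txt
    printf 'chr14\t21415880\tG\tA\n'     >> variants.txt
    printf 'chr17\t7674904\tTCT\tT\n'    >> variants.txt
    printf 'chr7\t117587806\tG\tA\n'     >> variants.txt

    # Sanity check: awk exits non-zero if any line lacks exactly four fields.
    if awk -F '\t' 'NF != 4 { bad = 1 } END { exit bad }' variants.txt; then
        echo "variants.txt is tab-separated with four columns"
    else
        echo "variants.txt has malformed lines" >&2
    fi

The same check can be reused for other tabular inputs by changing the expected
field count.
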

Annotating tabular input
------------------------

GAIn performs annotations by combining three ingredients: the genomic
resources (in one or more GRRs), the annotatables to annotate, and a YAML
annotation pipeline describing which attributes to compute. Now that all three
are in place, we can execute the following command to apply the pipeline to
our tabular variant file:

.. code-block:: bash

    annotate_columns variants.txt annotation_pipeline.yaml

This command tells GAIn to annotate the tabular file ``variants.txt`` using
``annotation_pipeline.yaml``. It produces an output file named
``variants_annotated.txt``, shown below.

.. csv-table::
    :header-rows: 1

    chrom,pos,ref,alt,worst_effect,genes,phylop7way,CLNSIG,CLNDN
    chr14,21415880,G,A,nonsense,CHD8,0.917,Pathogenic/Likely_pathogenic,Intellectual_developmental_disorder_with_autism_and_macrocephaly|not_provided
    chr17,7674904,TCT,T,frame-shift,TP53,-0.12,Pathogenic,Li-Fraumeni_syndrome_1|Hereditary_cancer-predisposing_syndrome|Li-Fraumeni_syndrome|Ovarian_neoplasm|not_provided|TP53-related_disorder
    chr7,117587806,G,A,missense,CFTR,0.917,Pathogenic,Hereditary_pancreatitis|CFTR-related_disorder|Cystic_fibrosis|Congenital_bilateral_aplasia_of_vas_deferens_from_CFTR_mutation|ivacaftor_response_-_Efficacy|Bronchiectasis_with_or_without_elevated_sweat_chloride_1|not_provided

Annotating VCF input
--------------------

GAIn can also annotate variants stored in VCF files. The command is similar to
``annotate_columns``, but the input and output files are in VCF format. Here,
the same variants are stored in a file called ``variants.vcf``.

.. code-block:: text

    ##fileformat=VCFv4.1
    ##reference=GRCh38-hg38
    ##contig=
    ##contig=
    ##contig=
    #CHROM  POS        ID  REF  ALT  QUAL  FILTER  INFO
    chr14   21415880   .   G    A    .     .       .
    chr17   7674904    .   TCT  T    .     .       .
    chr7    117587806  .   G    A    .     .       .

To annotate them, run:

.. code-block:: bash

    annotate_vcf variants.vcf annotation_pipeline.yaml

This command produces an output file named ``variants_annotated.vcf``, which
contains the same variants with additional annotation fields in the ``INFO``
column.

.. code-block:: text

    ##fileformat=VCFv4.1
    ##FILTER=
    ##reference=GRCh38-hg38
    ##contig=
    ##contig=
    ##contig=
    ##pipeline_annotation_tool=GPF variant annotation.
    ##INFO=
    ##INFO=
    ##INFO=
    ##INFO=
    ##INFO=
    #CHROM  POS  ID  REF  ALT  QUAL  FILTER  INFO
    chr14  21415880  .  G  A  .  .  worst_effect=nonsense;genes=CHD8;phylop7way=0.917;CLNSIG=Pathogenic/Likely_pathogenic;CLNDN=Intellectual_developmental_disorder_with_autism_and_macrocephaly|not_provided
    chr17  7674904  .  TCT  T  .  .  worst_effect=frame-shift;genes=TP53;phylop7way=-0.12;CLNSIG=Pathogenic;CLNDN=Li-Fraumeni_syndrome_1|Hereditary_cancer-predisposing_syndrome|Li-Fraumeni_syndrome|Ovarian_neoplasm|not_provided|TP53-related_disorder
    chr7  117587806  .  G  A  .  .  worst_effect=missense;genes=CFTR;phylop7way=0.917;CLNSIG=Pathogenic;CLNDN=Hereditary_pancreatitis|CFTR-related_disorder|Cystic_fibrosis|Congenital_bilateral_aplasia_of_vas_deferens_from_CFTR_mutation|ivacaftor_response_-_Efficacy|Bronchiectasis_with_or_without_elevated_sweat_chloride_1|not_provided

Annotating positions and regions
--------------------------------

GAIn is well suited for annotating genetic variants obtained from sequencing
data, but not all genomic experiments produce variant calls. Some assays
instead identify genomic positions or regions of interest, such as
transcription start sites mapped by CAGE-seq or regulatory intervals detected
by ATAC-seq and ChIP-seq. For researchers working with these data types, it is
often valuable to interpret them using the same kinds of genomic resources
used in variant annotation.
Although positions and regions do not contain allele information, and
therefore cannot support every type of variant-based annotation, GAIn can
still take these inputs and annotate them with many relevant resources using
the ``annotate_columns`` tool.

Position inputs require only two columns: chromosome and position. Save the
following tab-delimited text in a file called ``positions.txt``.

.. csv-table::
    :header-rows: 1

    chrom,pos
    chr7,117587806
    chr7,115587806

Because position inputs do not include reference and alternate alleles, GAIn
cannot infer the effect of a specific allelic change on a gene product.
However, it can still determine whether a position falls within a gene and, if
so, what broad part of the gene it overlaps. To do this, use
``simple_effect_annotator``, which classifies loci into broad categories such
as intergenic and genic, and further subdivides genic loci into coding and
several noncoding classes. Save the following text as
``annotation_pipeline2.yaml``.

.. code-block:: yaml

    - simple_effect_annotator:
        gene_models: hg38/gene_models/MANE/1.4

Then run the following command to annotate the positions:

.. code-block:: bash

    annotate_columns positions.txt annotation_pipeline2.yaml

This produces ``positions_annotated.txt``, which contains:

.. csv-table::
    :header-rows: 1

    chrom,pos,worst_effect,worst_effect_genes
    chr7,117587806,coding,CFTR
    chr7,115587806,intergenic,

This shows that the first position falls within a coding part of CFTR, whereas
the second position is intergenic.

Position score resources can be applied directly to genomic positions, so
``position_score_annotator`` works on this input without modification. GAIn
can also use allele score resources with position inputs. In that case,
because the input specifies only the genomic position and not a particular
allele, GAIn reports an aggregate value across possible allelic changes at
that site. To extend the example, add the following annotators to
``annotation_pipeline2.yaml``.

.. code-block:: yaml

    - position_score_annotator:
        resource_id: hg38/scores/phyloP7way
    - allele_score_annotator:
        resource_id: hg38/scores/CADD_v1.6
        attributes:
        - cadd_raw

Then run the command again:

.. code-block:: bash

    annotate_columns positions.txt annotation_pipeline2.yaml

This produces ``positions_annotated.txt``, which contains:

.. csv-table::
    :header-rows: 1

    chrom,pos,worst_effect,worst_effect_genes,phylop7way,cadd_raw
    chr7,117587806,coding,CFTR,0.917,3.98
    chr7,115587806,intergenic,,0.158,0.472

phyloP7way measures evolutionary conservation at a genomic position. In this
example, the coding position has a higher conservation score than the
intergenic position. CADD estimates the deleteriousness of allelic changes,
and for position inputs GAIn reports an aggregate value for the possible
alleles at that site. Here, the first position has a higher aggregate
``cadd_raw`` score than the second.

Region inputs require three columns: chromosome, beginning position, and end
position. Save the following tab-delimited text in a file called
``regions.txt``.

.. csv-table::
    :header-rows: 1

    chrom,pos_beg,pos_end
    chr1,1,100000
    chr1,11796321,11800000

As with position inputs, region inputs do not include reference and alternate
alleles, so GAIn cannot infer the effect of a specific allelic change on a
gene product. However, many of the same genomic resource types can still be
applied to region inputs.

Region inputs can also be evaluated with ``simple_effect_annotator``, which
summarizes whether a region overlaps genic or intergenic sequence and reports
broad functional categories when applicable. Position score resources can be
used on region inputs by aggregating values across the positions spanned by
each interval. Allele score resources can also be used, but in that case GAIn
must aggregate both across the positions in the region and across the possible
allelic changes at each position.

To illustrate this, reuse ``annotation_pipeline2.yaml``, shown below as a
reminder.

.. code-block:: yaml

    - simple_effect_annotator:
        gene_models: hg38/gene_models/MANE/1.4
    - position_score_annotator:
        resource_id: hg38/scores/phyloP7way
    - allele_score_annotator:
        resource_id: hg38/scores/CADD_v1.6
        attributes:
        - cadd_raw

Then run the following command to annotate the regions:

.. code-block:: bash

    annotate_columns regions.txt annotation_pipeline2.yaml

This produces ``regions_annotated.txt``, which contains:

.. csv-table::
    :header-rows: 1

    chrom,pos_beg,pos_end,worst_effect,worst_effect_genes,phylop7way,cadd_raw
    chr1,1,100000,coding,OR4F5,0.0599,0.43
    chr1,11796321,11800000,coding,MTHFR,0.0348,0.269

This output shows how GAIn summarizes the functional context of each region.
Depending on the interval, a region may be classified as intergenic, coding,
or another category, and overlapping genes are reported when applicable.

Parallelization
---------------

Reannotation
------------

When iterating on an analysis, you often want to run a new annotation pipeline
on a dataset that has already been annotated. If the new pipeline shares any
steps with the old one (for example, the same effect annotator or the same
score lookup), recomputing those attributes is wasteful, especially for large
annotation jobs.

GAIn supports reannotation, which allows it to reuse attributes that were
already computed by a previous pipeline run and compute only what is new.

As an example of reannotation, create the following annotation pipeline and
save it as ``pipeline_A.yaml``:

.. code-block:: yaml

    - effect_annotator:
        gene_models: hg38/gene_models/MANE/1.3
        attributes:
        - genes
        - worst_effect
    - position_score_annotator:
        resource_id: hg38/scores/phyloP7way

Run the pipeline on your input variants:

.. code-block:: bash

    annotate_columns variants.txt pipeline_A.yaml -o variants_A.txt

This produces ``variants_A.txt``, which includes the requested attributes:

.. csv-table::
    :header-rows: 1

    chrom,pos,ref,alt,genes,worst_effect,phyloP7way
    chr14,21415880,G,A,CHD8,nonsense,0.917
    chr17,7674904,TCT,T,TP53,frame-shift,-0.12
    chr7,117587806,G,A,CFTR,missense,0.917

Now suppose you want to annotate the same variants with a modified pipeline
saved as ``pipeline_B.yaml``. In this example, Pipeline B is the same as
Pipeline A, but adds two additional position score annotators.

.. code-block:: yaml

    - effect_annotator:
        gene_models: hg38/gene_models/MANE/1.3
        attributes:
        - genes
        - worst_effect
    - position_score_annotator:
        resource_id: hg38/scores/phyloP7way
    - position_score_annotator:
        resource_id: hg38/scores/phyloP30way
    - position_score_annotator:
        resource_id: hg38/scores/phyloP100way

You could run Pipeline B directly on ``variants_A.txt``, but that would
recompute ``genes``, ``worst_effect``, and ``phyloP7way`` even though they are
already present. Instead, use ``--reannotate`` and pass the old pipeline that
produced the existing annotations:

.. code-block:: bash

    annotate_columns variants_A.txt pipeline_B.yaml --reannotate pipeline_A.yaml -o variants_B.txt

In this command, ``variants_A.txt`` is the input table (the output produced by
Pipeline A), and ``pipeline_B.yaml`` is the updated pipeline you want to
apply. The key part is ``--reannotate pipeline_A.yaml``: it tells GAIn which
pipeline originally generated the annotation columns already present in
``variants_A.txt``, so GAIn can recognize any overlapping work and reuse those
precomputed attributes instead of recalculating them.

The result is written to ``variants_B.txt``, which contains the attributes
requested by Pipeline B, with any shared attributes carried forward from the
earlier run.

.. csv-table::
    :header-rows: 1

    chrom,pos,ref,alt,genes,worst_effect,phyloP7way,phyloP30way,phyloP100way
    chr14,21415880,G,A,CHD8,nonsense,0.917,1.18,1.25
    chr17,7674904,TCT,T,TP53,frame-shift,-0.12,-0.076,-1.14
    chr7,117587806,G,A,CFTR,missense,0.917,1.18,8.82
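
One way to sanity-check a reannotation run is to confirm that the columns
carried forward from Pipeline A are unchanged in the new output. The sketch
below is not part of GAIn: ``shared_columns_match`` is a hypothetical helper
written with standard shell tools, and it assumes that both tables are
tab-separated and that the shared attributes occupy the same leading columns
in both files, as they do in the tables above.

.. code-block:: bash

    # Compare the first N tab-separated columns of two annotated tables.
    # $1: earlier table, $2: newer table, $3: number of leading columns.
    # Returns 0 when the selected columns are identical, non-zero otherwise.
    shared_columns_match() {
        tmp_old=$(mktemp) || return 2
        cut -f "1-$3" "$1" > "$tmp_old"
        cut -f "1-$3" "$2" | cmp -s "$tmp_old" -
        status=$?
        rm -f "$tmp_old"
        return "$status"
    }

    # The first 7 columns (chrom, pos, ref, alt, genes, worst_effect,
    # phyloP7way) are the ones produced by Pipeline A in this example.
    if [ -f variants_A.txt ] && [ -f variants_B.txt ]; then
        if shared_columns_match variants_A.txt variants_B.txt 7; then
            echo "shared columns match"
        else
            echo "shared columns differ"
        fi
    fi

A mismatch does not by itself indicate a problem with reannotation; it only
signals that the two files should be inspected more closely, for example to
check whether the column order differs.
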