Getting started with GRR

Create your first GRR

To create a new Genomic Resource Repository (GRR), start by making an empty directory and moving into it:

mkdir my_GRR
cd my_GRR

Initialize this directory as a GRR:

grr_manage repo-init

From now on, any subdirectory under my_GRR becomes a genomic resource if:

It contains a properly configured genomic_resource.yaml, and
It includes the data files referenced in that YAML.
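
The first of the two conditions above can be approximated with a short directory scan. This is a minimal sketch and not how grr_manage itself discovers resources: the helper name find_resources is hypothetical, and it only checks that a genomic_resource.yaml is present, not that the file is valid or that the data files it references exist.

```python
import tempfile
from pathlib import Path


def find_resources(repo_root):
    """Return sub-directories (relative to repo_root) that hold a genomic_resource.yaml."""
    root = Path(repo_root)
    return sorted(
        cfg.parent.relative_to(root).as_posix()
        for cfg in root.rglob("genomic_resource.yaml")
    )


# Demo on a throwaway directory tree instead of a real GRR.
with tempfile.TemporaryDirectory() as tmp:
    resource_dir = Path(tmp) / "my_minigenome"
    resource_dir.mkdir()
    (resource_dir / "genomic_resource.yaml").write_text("type: genome\n")
    (Path(tmp) / "notes").mkdir()  # no YAML here, so not a resource
    found = find_resources(tmp)

print(found)
```

Only my_minigenome is reported, because the notes directory has no genomic_resource.yaml.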

Add new resources

1: Toy genome

Reference genomes are stored in FASTA files, and the human reference FASTA is about 3 GB in size, which makes it inconvenient as a first example. Instead of starting with the full human genome, we will use a toy genome made of two very short chromosomes (10 bases each). A genome resource in GAIn always has two components: the FASTA file that holds the sequences and a .fai index file that enables fast random access.

From within your my_GRR directory, create a directory called my_minigenome and move into it.

mkdir my_minigenome
cd my_minigenome

Create a new text file with the content below and save it as minigenome.fa.

>chr1 1st_chromosome
TATGAAATAA
>chr2 2nd_chromosome
AAAAAAAAAA

Before GAIn can efficiently access a genome, the reference FASTA must be indexed to enable fast random access. We use samtools (installed via bioconda and already included in the GAIn conda environment) for this step.

Run the following command to index minigenome.fa. This generates a FASTA index file (.fai) that GAIn uses as a lookup table:

samtools faidx minigenome.fa
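
To see what kind of lookup table the index provides, the sketch below recomputes the five columns of a .fai file (name, length, byte offset of the first base, bases per line, bytes per line) for the toy FASTA. It is an illustration only, not a replacement for samtools: the function name fai_records is hypothetical, and the code assumes ASCII text, "\n" line endings, and sequence lines of equal length.

```python
MINIGENOME = (
    ">chr1 1st_chromosome\n"
    "TATGAAATAA\n"
    ">chr2 2nd_chromosome\n"
    "AAAAAAAAAA\n"
)


def fai_records(fasta_text):
    """Return (name, length, offset, linebases, linewidth) for each sequence."""
    records = []
    current = None
    offset = 0  # bytes consumed so far (one byte per character for ASCII)
    for line in fasta_text.splitlines(keepends=True):
        if line.startswith(">"):
            if current is not None:
                records.append(tuple(current))
            name = line[1:].split()[0]
            # The sequence starts right after the header line.
            current = [name, 0, offset + len(line), 0, 0]
        elif current is not None and line.strip():
            bases = len(line.rstrip("\n"))
            current[1] += bases
            if current[3] == 0:
                current[3] = bases
                current[4] = len(line)
        offset += len(line)
    if current is not None:
        records.append(tuple(current))
    return records


records = fai_records(MINIGENOME)
for record in records:
    print("\t".join(str(field) for field in record))
```

With the byte offset and the line geometry, a reader can jump straight to any base of a chromosome without scanning the whole file.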

Next, create a file named genomic_resource.yaml with the following content; this configures the minigenome resource so GAIn can recognize and use it for annotation:

type: genome
filename: minigenome.fa
chrom_prefix: "chr"
meta:
  summary: mini genome

With the FASTA, index, and genomic_resource.yaml in place, the minigenome resource is already usable for annotation by GAIn. Note, however, that GAIn will only see it once your .grr_definition.yaml file includes my_GRR, the repository that contains this resource. We strongly recommend running the command below, which checks the resource and produces summary statistics.

grr_manage resource-repair

On successful completion, an index.html file will appear in the my_minigenome directory. It contains basic metadata about the resource, statistics such as chromosome length and nucleotide/dinucleotide composition, and a full inventory of the files that make up the resource.

_images/example1_resource.png — Summary html page created for `my_minigenome` resource.

2: Genome (GRCh38.p14)

Next, we configure a real human genome resource. From inside my_GRR, make a directory named my_genome and change into it:

mkdir my_genome
cd my_genome

Then fetch the GRCh38.p14 genome FASTA from the GENCODE FTP site:

curl -o GRCh38.p14.genome.fa.gz \
https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_46/GRCh38.p14.genome.fa.gz

Use the following commands in the my_genome directory to unzip and index the FASTA file:

gunzip GRCh38.p14.genome.fa.gz
samtools faidx GRCh38.p14.genome.fa

Now that you have both the FASTA and its .fai index, add a genomic_resource.yaml file in this directory with the content below. Here, filename specifies the FASTA file used for the genome sequence, chrom_prefix indicates that chromosome names in this assembly use the chr prefix, for example chr1, and PARS lists the pseudoautosomal regions on chromosomes X and Y.

type: genome
filename: GRCh38.p14.genome.fa
chrom_prefix: "chr"

PARS:
  "X":
    - "chrX:10000-2781479"
    - "chrX:155701382-156030895"
  "Y":
    - "chrY:10000-2781479"
    - "chrY:56887902-57217415"

meta:
  summary: Nucleotide sequence of the GRCh38.p14 genome assembly

At this point, the my_genome resource can be used for annotation by GAIn. To validate it and obtain summary statistics, run grr_manage resource-repair in this directory. Be aware that this step can be slow (~5 minutes), as it processes the entire genome to build the report.

On successful completion, an index.html file will appear in the my_genome directory, which contains basic metadata about the resource, statistics such as chromosome length and nucleotide/dinucleotide composition, and a full inventory of the files that make up the resource.

_images/example2_resource.png — Summary html page created for `my_genome` resource (partially displayed).

3: Gene models (MANE v1.4)

Next, we will create a gene model resource based on the MANE (Matched Annotation from NCBI and EBI) gene set, which offers a standardized transcript set for consistent use across genomic resources. From inside my_GRR, create a directory named my_genemodel and change into it:

mkdir my_genemodel
cd my_genemodel

First, we download the corresponding GTF file.

curl -O https://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/release_1.4/MANE.GRCh38.v1.4.ensembl_genomic.gtf.gz

Next, create a genomic_resource.yaml file in the my_genemodel directory with the following content:

type: gene_models

filename: MANE.GRCh38.v1.4.ensembl_genomic.gtf.gz
format: gtf

meta:
  summary: MANE gene model version 1.4

With these files in place, my_genemodel is usable as a gene-model resource in GAIn. To check it and produce an HTML summary with basic statistics, execute grr_manage resource-repair in this directory.

_images/example3_resource.png — Summary html page created for `my_genemodel` resource.

4: Toy position score

In this example, we will create a toy position score resource which only has scores for a few positions in chromosome 1. In your my_GRR directory, create a directory called my_miniposition:

mkdir my_miniposition
cd my_miniposition

Create a new text file with the content below and save it as mini_positionscore.tsv. The first column is the chromosome name, the second column is the 0-based position, and the third column is the score value.

chr1        0       0
chr1        1       0.1
chr1        2       0.2
chr1        3       0.3
chr1        4       0.4
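
The practical meaning of 0-based positions can be shown with a small lookup. This is a sketch under stated assumptions, not GAIn's implementation: it assumes the columns are tab-separated and that queries arrive as 1-based positions (as in VCF files), and the helper names load_scores and score_at are hypothetical.

```python
import csv
import io

# Same rows as mini_positionscore.tsv, with tabs assumed as the separator.
MINI_TSV = "chr1\t0\t0\nchr1\t1\t0.1\nchr1\t2\t0.2\nchr1\t3\t0.3\nchr1\t4\t0.4\n"


def load_scores(tsv_text):
    """Map (chrom, 0-based position) to the score stored in the table."""
    scores = {}
    for chrom, pos, value in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        scores[(chrom, int(pos))] = float(value)
    return scores


def score_at(scores, chrom, pos_1based):
    """Look up a 1-based query position in a 0-based table."""
    return scores.get((chrom, pos_1based - 1))


scores = load_scores(MINI_TSV)
print(score_at(scores, "chr1", 1))  # first base of chr1, stored at position 0
print(score_at(scores, "chr1", 5))  # fifth base of chr1, stored at position 4
print(score_at(scores, "chr1", 6))  # no score for this position
```

Declaring the coordinate convention in the resource configuration is what lets a tool apply this one-position shift consistently.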

Next, create a genomic_resource.yaml file in the my_miniposition directory with the following content:

type: position_score

table:
  filename: mini_positionscore.tsv
  zero_based: True

  chrom:
    index: 0
  pos_begin:
    index: 1
  pos_end:
    index: 1

scores:
- id: pos_tsv_0
  type: float
  index: 2

meta:
  summary: 0-based tsv position score

The resource is ready for use by GAIn. To check it and produce an HTML summary with basic statistics, execute grr_manage resource-repair in this directory.

_images/example4_resource.png — Summary html page created for `my_miniposition` resource.

5: Position score (PhyloP7)

Let’s now create a real position score resource. In your my_GRR directory, create a directory called my_position:

mkdir my_position
cd my_position

phyloP (phylogenetic P-values) scores measure the evolutionary rate at individual nucleotides or other genomic elements. Let’s download PhyloP7, which is derived from a multiple sequence alignment of the genomes of 7 species. This is a large file (around 5 GB), so make sure you have enough space on your local drive.

curl -O https://hgdownload.soe.ucsc.edu/goldenPath/hg38/phyloP7way/hg38.phyloP7way.bw

Next, create a genomic_resource.yaml file in the my_position directory with the following content:

type: position_score

table:
  filename: hg38.phyloP7way.bw
  header_mode: none

scores:
- id: phyloP7way
  type: float
  index: 3

meta:
  summary: Conservation score based on the multiple alignment of 7 species

The resource is ready for use by GAIn. To check it and produce an HTML summary with basic statistics, execute grr_manage resource-repair in this directory (this will take around one hour as GAIn processes the large file to create summary statistics).

_images/example5_resource.png — Summary html page created for `my_position` resource.

6: Toy allele score

Next, let’s create a toy allele score resource. In my_GRR, create a new directory called my_miniallele and move into it:

mkdir my_miniallele
cd my_miniallele

Create a new text file with the content below and save it as mini_allelescore.tsv. The first column is the chromosome name and the second column is the 0-based position. The third and fourth columns give the reference and alternate alleles. The last two columns are allele scores: a numerical score and a class represented as a string.

#chrom      pos     ref     alt     allele_score    allele_class
chr1        0       T       A       1       good
chr1        0       T       C       1.1     bad
chr1        0       T       G       1.2     bad
chr2        4       A       T       2       good
chr2        4       A       C       2.1     bad
chr2        4       A       G       2.2     bad

To prepare this allele score file for fast random access in GAIn, first compress and index it with:

bgzip mini_allelescore.tsv
tabix -s 1 -b 2 -e 2 -0 mini_allelescore.tsv.gz
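
tabix can only index a file whose rows are grouped by chromosome and sorted by position within each chromosome. The toy table already satisfies this, but for your own data a quick check before compressing can save a failed indexing run. The following is a minimal sketch; the function name is_position_sorted is hypothetical.

```python
def is_position_sorted(rows):
    """Return True if every chromosome forms one contiguous block of rows
    and positions never decrease inside a block.

    rows is an iterable of (chrom, pos) pairs in file order.
    """
    finished = set()
    prev_chrom, prev_pos = None, None
    for chrom, pos in rows:
        if chrom != prev_chrom:
            if chrom in finished:
                return False  # chromosome reappears after another one
            finished.add(chrom)
        elif pos < prev_pos:
            return False  # position went backwards within a chromosome
        prev_chrom, prev_pos = chrom, pos
    return True


# (chrom, pos) pairs taken from mini_allelescore.tsv
toy_rows = [("chr1", 0), ("chr1", 0), ("chr1", 0),
            ("chr2", 4), ("chr2", 4), ("chr2", 4)]
shuffled_rows = [("chr1", 0), ("chr2", 4), ("chr1", 0)]

print(is_position_sorted(toy_rows))
print(is_position_sorted(shuffled_rows))
```

If the check fails, sort the table by chromosome and position before running bgzip and tabix.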

Then create a genomic_resource.yaml file as shown. This file marks the resource as indexed, identifies which columns hold the chromosome, position, reference, and alternate alleles, and describes the score columns that GAIn can use during annotation.

type: allele_score

table:
  filename: mini_allelescore.tsv.gz
  format: tabix
  zero_based: True

  chrom:
    name: chrom
  pos_begin:
    name: pos
  pos_end:
    name: pos
  reference:
    name: ref
  alternative:
    name: alt

scores:
- id: allele_score
  name: allele_score
  type: float

- id: allele_class
  name: allele_class
  type: str

meta:
  summary: A toy allele score resource with allele scores and allele classes.

The resource is ready for use by GAIn. To check it and produce an HTML summary with basic statistics, execute grr_manage resource-repair in this directory.

_images/example6_resource.png — Summary html page created for `my_miniallele` resource.

7: Allele score (AlphaMissense)

Next, let’s create a real allele score resource. In your my_GRR directory, create a directory called my_allele:

mkdir my_allele
cd my_allele

We will use AlphaMissense, a deep-learning-based deleteriousness score for missense variants. Download the AlphaMissense score file for the hg38 genome:

curl -O https://zenodo.org/records/8208688/files/AlphaMissense_hg38.tsv.gz

A quick look at the downloaded file shows that the column names are listed on line four.

gzip -dc AlphaMissense_hg38.tsv.gz | head

# Copyright 2023 DeepMind Technologies Limited
#
# Licensed under CC BY-NC-SA 4.0 license
#CHROM      POS     REF     ALT     genome  uniprot_id      transcript_id   protein_variant am_pathogenicity        am_class
chr1        69094   G       T       hg38    Q8NH21  ENST00000335137.4       V2L     0.2937  likely_benign
chr1        69094   G       C       hg38    Q8NH21  ENST00000335137.4       V2L     0.2937  likely_benign
chr1        69094   G       A       hg38    Q8NH21  ENST00000335137.4       V2M     0.3296  likely_benign
chr1        69095   T       C       hg38    Q8NH21  ENST00000335137.4       V2A     0.2609  likely_benign
chr1        69095   T       A       hg38    Q8NH21  ENST00000335137.4       V2E     0.2922  likely_benign
chr1        69095   T       G       hg38    Q8NH21  ENST00000335137.4       V2G     0.203   likely_benign

GAIn expects the column headers on the first line. Accordingly, we decompress the file, strip the first three lines, write the processed content to a new file, and delete the original file to minimize disk usage.

gzip -dc AlphaMissense_hg38.tsv.gz \
| sed '1,3d' \
| bgzip > AlphaMissense_hg38_modified.tsv.gz
rm AlphaMissense_hg38.tsv.gz

A second look at the resource file confirms that the column names are on line 1.

#CHROM      POS     REF     ALT     genome  uniprot_id      transcript_id   protein_variant am_pathogenicity        am_class
chr1        69094   G       T       hg38    Q8NH21  ENST00000335137.4       V2L     0.2937  likely_benign
chr1        69094   G       C       hg38    Q8NH21  ENST00000335137.4       V2L     0.2937  likely_benign
chr1        69094   G       A       hg38    Q8NH21  ENST00000335137.4       V2M     0.3296  likely_benign
chr1        69095   T       C       hg38    Q8NH21  ENST00000335137.4       V2A     0.2609  likely_benign
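
The header-stripping step and the follow-up check can also be sketched in Python, here on a few in-memory lines rather than the full download. Note the assumptions: the helper names are hypothetical, and Python's gzip module writes ordinary gzip rather than the block-compressed format that tabix needs, so this sketch illustrates the logic only and bgzip remains the tool for producing the real file.

```python
import gzip
import io

# First lines of the downloaded file, abbreviated to one data row.
RAW = (
    "# Copyright 2023 DeepMind Technologies Limited\n"
    "#\n"
    "# Licensed under CC BY-NC-SA 4.0 license\n"
    "#CHROM\tPOS\tREF\tALT\tgenome\tuniprot_id\ttranscript_id\t"
    "protein_variant\tam_pathogenicity\tam_class\n"
    "chr1\t69094\tG\tT\thg38\tQ8NH21\tENST00000335137.4\tV2L\t0.2937\tlikely_benign\n"
)


def drop_leading_lines(text, count):
    """Remove the first `count` lines, like sed '1,<count>d'."""
    return "".join(text.splitlines(keepends=True)[count:])


def first_line_of_gzip(gz_bytes):
    """Return the first line of gzip-compressed text, without the newline."""
    with gzip.open(io.BytesIO(gz_bytes), "rt") as handle:
        return handle.readline().rstrip("\n")


compressed = gzip.compress(drop_leading_lines(RAW, 3).encode())
header = first_line_of_gzip(compressed)
print(header.split("\t")[:4])
```

After the three licence lines are dropped, the first line of the compressed data is the column header.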

The resource is now ready to be included in a GRR. As is typical for allele resources, the CHROM and POS columns specify the genomic coordinates of the variant, while REF and ALT describe the variant itself. AlphaMissense also provides genome, UniProt, and transcript identifiers, as well as the amino acid substitution caused by the missense mutation (for example, the first variant changes valine to leucine). The last two columns report the scoring results: am_pathogenicity provides the predicted effect of the variant on protein structure and function, and am_class converts this score into categorical labels, with values below 0.34 classified as likely_benign, values above 0.564 as likely_pathogenic, and values in between as ambiguous.
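
The relationship between the two score columns can be written as a small function. This is a sketch of the thresholds quoted above, not code from AlphaMissense or GAIn: the function name classify is hypothetical, and assigning scores exactly at 0.34 or 0.564 to the ambiguous class is an assumption.

```python
def classify(am_pathogenicity):
    """Map an am_pathogenicity score to an am_class label."""
    if am_pathogenicity < 0.34:
        return "likely_benign"
    if am_pathogenicity > 0.564:
        return "likely_pathogenic"
    return "ambiguous"  # boundary handling here is an assumption


# Scores taken from the first rows of the file shown above.
for score in (0.2937, 0.3296, 0.203):
    print(score, classify(score))
```

All three example scores fall below 0.34, which matches the likely_benign labels in the file.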

To index the resource by genomic coordinates, we run the following command:

tabix -s 1 -b 2 -e 2 AlphaMissense_hg38_modified.tsv.gz

Next, we create a text file called genomic_resource.yaml so that the resource is recognized by GAIn. As a first step, we configure it to ingest only the am_pathogenicity scores from the source file. In this YAML file, rather than using column indices, we explicitly specify the column names corresponding to chromosome, position, reference, and alternative alleles. We also customize the histogram generated by GAIn by setting the score range to 0-1 (the AlphaMissense score range), using 100 bins, and applying a logarithmic scale to the y-axis. Finally, we define the interpretation of low and high scores, which will be displayed on the summary page.

type: allele_score
allele_score_mode: substitutions

table:
  filename: AlphaMissense_hg38_modified.tsv.gz
  format: tabix

  chrom:
    name: CHROM
  pos_begin:
    name: POS
  pos_end:
    name: POS
  reference:
    name: REF
  alternative:
    name: ALT

scores:
- id: am_pathogenicity
  name: am_pathogenicity
  type: float
  desc: |
    AlphaMissense Pathogenicity score is a deleteriousness score for missense variants
  large_values_desc: "more pathogenic"
  small_values_desc: "less pathogenic"
  histogram:
    type: number
    number_of_bins: 100
    view_range:
      min: 0.0
      max: 1.0
    y_log_scale: True

meta:
  summary: Functional impact of mutations on protein function

At this point, GAIn can use the resource for annotation. Running grr_manage resource-repair produces the following summary page, which currently includes only the am_pathogenicity score.

_images/example7_resource.png — Summary html page created for `my_allele` resource.

_images/example7b_resource.png — Histogram created for `am_pathogenicity` scores.

To include the am_class scores, add the following entries to the scores section, configuring the histogram as categorical and enabling a log scale on the y-axis.

- id: am_class
  name: am_class
  type: str
  desc: |
    AlphaMissense Class is a deleteriousness category for missense variants
  histogram:
    type: categorical
    y_log_scale: True

Running grr_manage resource-repair with the updated genomic_resource.yaml file produces the updated resource page shown below, now displaying both the am_pathogenicity and am_class scores.

_images/example7c_resource.png — Updated summary HTML page created for `my_allele` resource.

8: Toy gene score

Create a new folder for the resource and move into it:

mkdir my_minigenescore
cd my_minigenescore

Create a comma-separated file called my_minigenescore.csv with the following content:

gene,my_minigenescore
CHD8,9
TP53,3
CFTR,7

This resource provides a single score for three example genes. Next, create a genomic_resource.yaml file in the same directory with this content:

type: gene_score
filename: my_minigenescore.csv

scores:
- id: my_minigenescore
  histogram:
    type: number

meta:
  summary: A custom gene score
  description: This is a custom gene score for demonstration purposes.
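
Conceptually, a gene score resource is a table keyed by gene symbol, and annotation looks up the genes affected by a variant. The sketch below illustrates that lookup on the toy CSV. It is not GAIn's implementation, and the helper names load_gene_scores and scores_for are hypothetical.

```python
import csv
import io

GENE_SCORE_CSV = "gene,my_minigenescore\nCHD8,9\nTP53,3\nCFTR,7\n"


def load_gene_scores(csv_text, score_id):
    """Read a gene score CSV into a {gene: score} dictionary."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["gene"]: float(row[score_id]) for row in reader}


def scores_for(gene_list, table):
    """Return the scores of the genes that are present in the table."""
    return {gene: table[gene] for gene in gene_list if gene in table}


table = load_gene_scores(GENE_SCORE_CSV, "my_minigenescore")
print(scores_for(["CHD8", "BRCA1"], table))
```

Genes without an entry, such as BRCA1 here, simply receive no score.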

Finally, while still in the my_minigenescore directory, run:

grr_manage resource-repair

This command checks that the resource is usable for annotation and produces an HTML summary file with basic descriptions and histograms for the my_minigenescore values.

_images/example8_resource.png — Summary html page created for `my_minigenescore` resource.

9: Gene score (pLI)

As a real-world example of a gene score, we use pLI (probability of loss-of-function intolerance), which reflects a gene’s sensitivity to loss-of-function mutations, with higher values indicating greater intolerance. The pLI score was introduced by Lek et al. in 2016.

Create a new folder for the resource and move into it:

mkdir my_genescore
cd my_genescore

At https://www.nature.com/articles/nature19057#Sec16, download the ZIP file containing the Supplementary Tables and unzip it. We focus on Supplementary Table 13, specifically the “Gene Constraint” sheet, which reports pLI scores and related constraint metrics for human genes. You may manually copy the gene and pLI columns into a new CSV file named pLI.csv, or generate this file automatically using the script below (before running the script, install openpyxl with mamba install openpyxl).

import pandas as pd
# Load the Excel file
df = pd.read_excel("nature19057-SI Table 13.xlsx", sheet_name="Gene Constraint")
# Extract only 'gene' and 'pLI'
df_subset = df[['gene', 'pLI']].copy()
# Write to CSV
df_subset.to_csv("pLI.csv", index=False)

The first few rows of pLI.csv look like this (displayed here as aligned columns; the file itself is comma-separated):

gene        pLI
AGRN        0.17335234
NOC2L       1.33E-19
B3GALT6     0.048104466
C1orf159    0.090877636
ISG15       0.009847813
KLHL17      2.52E-07
PLEKHN1     2.02E-08

Next, create a genomic_resource.yaml file in the same directory with this content:

type: gene_score
filename: pLI.csv
scores:
- id: pLI
  desc: Probability of Loss-of-Function Intolerance
  small_values_desc: "less likely to be Loss-of-function intolerant"
  large_values_desc: "more likely to be Loss-of-function intolerant"
  histogram:
    type: number
    number_of_bins: 100
    view_range:
        min: 0
        max: 1
    x_min_log: 0.00001
    x_log_scale: false
    y_log_scale: true

meta:
  summary: Probability of Loss-of-Function Intolerance
  description: The probability of loss-of-function intolerance (pLI) score reflects a gene's sensitivity to LoF mutations.

Finally, while still in the my_genescore directory, run:

grr_manage resource-repair

This command checks that the resource is usable for annotation and produces an HTML summary file with basic descriptions and histograms for the my_genescore values.

_images/example9_resource.png — Summary html page created for `my_genescore` resource.

10: Toy gene sets

Gene set resources group biologically related genes and support membership and enrichment analyses. In GAIn, gene sets use a custom text representation. To illustrate how gene set resources are defined and used, we will create a toy gene set as a concrete example.

Create a new folder for the resource and move into it:

mkdir my_minigenesets
cd my_minigenesets

First, create a text file named map.txt with the following content. This file defines gene-to-set memberships: the left column lists gene names, and the right column lists the set identifier(s) for each gene. In this example, CHD8 belongs only to set_1, TP53 belongs to set_2 and set_3, and CFTR belongs only to set_3.

CHD8        set_1
TP53        set_2 set_3
CFTR        set_3
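
To make the layout concrete, the sketch below inverts the gene-to-set lines into set-to-genes membership. It assumes each line is a gene name followed by whitespace-separated set identifiers, as in the listing above. The function name parse_map is hypothetical, and this is not GAIn's own parser.

```python
from collections import defaultdict

MAP_TXT = "CHD8\tset_1\nTP53\tset_2 set_3\nCFTR\tset_3\n"


def parse_map(text):
    """Turn 'gene set_id [set_id ...]' lines into {set_id: {genes}}."""
    gene_sets = defaultdict(set)
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        gene, *set_ids = line.split()
        for set_id in set_ids:
            gene_sets[set_id].add(gene)
    return dict(gene_sets)


gene_sets = parse_map(MAP_TXT)
for set_id in sorted(gene_sets):
    print(set_id, sorted(gene_sets[set_id]))
```

For the file content shown above, this yields three sets, with set_3 containing both TP53 and CFTR.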

Next, make a genomic_resource.yaml file with the following content.

type: gene_set_collection
id: genesets
format: map
filename: map.txt

meta:
  summary: mini gene sets collection

Finally, while still in the my_minigenesets directory, run:

grr_manage resource-repair

This command checks that the resource is usable for annotation and produces an HTML summary file with basic descriptions and histograms for the my_minigenesets resource.

_images/example10_resource.png — Summary html page created for `my_minigenesets` resource.

11: Gene sets (MSigDB)

As a real-world example of a gene set resource, we will use MSigDB (Molecular Signatures Database), a collection of gene sets derived from a variety of curated sources. The Curated (C2) collection in MSigDB includes gene sets from canonical pathway databases (e.g., KEGG, Reactome, BioCarta) and from published gene expression studies, capturing well-defined pathways and perturbation signatures.

To create a gene sets resource for MSigDB, make a new folder for the resource and move into it:

mkdir my_genesets
cd my_genesets

Download the MSigDB C2 gene sets (release 7.4) in GMT format from the Broad Institute website:

curl -O https://data.broadinstitute.org/gsea-msigdb/msigdb/release/7.4/c2.all.v7.4.symbols.gmt

Prepare a genomic_resource.yaml with the following content to make the resource available in GAIn:

type: gene_set_collection
id: MSigDB_curated
format: gmt
filename: c2.all.v7.4.symbols.gmt
histograms:
  genes_per_gene_set:
    type: number
    y_log_scale: True
  gene_sets_per_gene:
    type: number
    y_log_scale: True
meta:
  summary: MSigDB (Molecular Signatures Database) gene sets
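
The two histogram names in this configuration correspond to two simple counts over the collection. The sketch below computes them for a GMT-style text, where each tab-separated line holds a set name, a description, and then the member genes. The two set names in the demo are made up for illustration, the function names are hypothetical, and this is not GAIn's implementation.

```python
from collections import Counter

# Two hypothetical gene sets in GMT layout: name, description, genes...
TOY_GMT = "TOY_SET_A\ttoy description\tCHD8\tTP53\nTOY_SET_B\ttoy description\tTP53\n"


def parse_gmt(text):
    """Return {set_name: [genes]} from GMT-formatted text."""
    gene_sets = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, _description, *genes = line.split("\t")
        gene_sets[name] = [gene for gene in genes if gene]
    return gene_sets


def genes_per_gene_set(gene_sets):
    """Number of genes in each set."""
    return {name: len(genes) for name, genes in gene_sets.items()}


def gene_sets_per_gene(gene_sets):
    """Number of sets each gene belongs to."""
    return Counter(g for genes in gene_sets.values() for g in set(genes))


toy_sets = parse_gmt(TOY_GMT)
print(genes_per_gene_set(toy_sets))
print(dict(gene_sets_per_gene(toy_sets)))
```

Both distributions tend to be heavily skewed in large collections, which is one reason to plot them with a logarithmic y-axis.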

Finally, from inside the my_genesets directory, run:

grr_manage resource-repair

This validates the resource for annotation and generates an HTML summary page with basic descriptions and histograms for my_genesets.

_images/example11_resource.png — Partial screen shot of the summary html page created for `my_genesets` resource.

12: Toy CNV collection

Copy-number variants (CNVs) are deletions or duplications of genomic segments. In practice, CNV resources summarize previously observed gains and losses so you can contextualize a query locus by interval overlap. In GAIn, CNV collections are represented as tabular files plus a small YAML definition that declares which columns should be exposed as annotation attributes.

Create a new folder for the resource and move into it:

mkdir my_miniCNVcollection
cd my_miniCNVcollection

Create a tab-separated file called my_miniCNVcollection.txt with the following content:

chrom	pos_beg	pos_end	CNV_name	deletion_duplication	frequency
chr1	3	15	Chr1_duplication	Duplication	0.1
chr2	5	15	Chr2_deletion	Deletion	0.2

This file defines two example CNVs. Each row specifies an interval (chrom, pos_beg, pos_end), a CNV identifier (CNV_name), the CNV type (deletion_duplication) and frequency.

Next, create a genomic_resource.yaml file in the same directory with this content:

type: cnv_collection
table:
  filename: my_miniCNVcollection.txt

scores:
- id: CNV type
  name: deletion_duplication
  type: str
  desc: duplication or deletion
- id: CNV frequency
  name: frequency
  type: float
  desc: CNV frequency

meta:
  summary: CNV collection resource

In this resource, the interval columns (chrom, pos_beg, pos_end) are stored in the table and used for overlap queries, while the two fields listed under scores (CNV type and CNV frequency) are exposed as attributes that can be emitted in annotation output.
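
An overlap query of this kind can be sketched in a few lines. The example treats both the stored CNVs and the query as closed intervals, which is an assumption about the coordinate convention. The names Cnv and overlapping are hypothetical, and GAIn's own lookup may work differently.

```python
from collections import namedtuple

Cnv = namedtuple("Cnv", "chrom pos_beg pos_end cnv_type frequency")

# The two intervals from my_miniCNVcollection.txt
CNVS = [
    Cnv("chr1", 3, 15, "Duplication", 0.1),
    Cnv("chr2", 5, 15, "Deletion", 0.2),
]


def overlapping(cnvs, chrom, begin, end):
    """Return the CNVs on `chrom` that share at least one position with [begin, end]."""
    return [
        cnv for cnv in cnvs
        if cnv.chrom == chrom and cnv.pos_beg <= end and begin <= cnv.pos_end
    ]


hits = overlapping(CNVS, "chr1", 10, 20)
print([(cnv.cnv_type, cnv.frequency) for cnv in hits])
print(overlapping(CNVS, "chr1", 16, 20))  # starts after the CNV ends
```

The attributes returned for each hit (type and frequency) are the ones declared under scores in the YAML file.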

Finally, while still in the my_miniCNVcollection directory, run:

grr_manage resource-repair

This command checks that the resource is usable for annotation and produces an HTML summary file with basic descriptions for the CNV collection resource.

_images/example12_resource.png — Summary html page created for `my_miniCNVcollection` resource.

13: CNV collection (Iossifov 2021)

As a real-world example, we will build a CNV collection resource from Supplementary Data 4 of Yoon et al. (2021), which lists the de novo CNVs included in their analysis from WGS of SSC (simplex) and AGRE (multiplex) families.

Create a new folder for the resource and move into it:

mkdir my_CNVcollection
cd my_CNVcollection

First download the resource file:

curl -O https://static-content.springer.com/esm/art%3A10.1038%2Fs42003-021-02533-z/MediaObjects/42003_2021_2533_MOESM6_ESM.xlsx

This command downloads an Excel workbook with two sheets. In this example, we use the “De novo CNV in SSC and AGRE” sheet, which contains the de novo CNV intervals and associated metadata.

UCSCLink	collection	familyId	in affected status	personIds	location	variant	size	genomic region	number of genes	genes
LINK	SSC	12613	affected	12613.p1	chr1:1305145-1314126	duplication	8982	coding	3	ACAP3,INTS11,PUSL1
LINK	AGRE	AU2725202_AU2725201	affected	AU2725301	chr1:3069177-4783791	duplication	1714615	coding	13	AJAP1,ARHGEF16,C1orf174,CCDC27,CEP104,DFFB,LRRC47,MEGF6,PRDM16,SMIM1,TP73,TPRG1L,WRAP73
LINK	SSC	13424	unaffected	13424.s1	chr1:3975501-3977800	deletion	2300	intergenic	0
LINK	SSC	12852	affected	12852.p1	chr1:6647401-6650500	deletion	3100	inter-coding_intronic	1	DNAJC11
LINK	SSC	13776	affected	13776.p1	chr1:8652301-8657600	deletion	5300	coding	1	RERE
LINK	SSC	13373	unaffected	13373.s1	chr1:9992001-9994100	deletion	2100	intergenic	0

The downloaded table does not include explicit chrom, pos_beg, or pos_end columns. Instead, these coordinates are encoded in the location field (for example, chr1:1305145-1314126). Run the script below in your terminal to split location into chrom, pos_beg, and pos_end, retain the variant and size columns, and write the result to a tab-separated file named Iossifov_Lab_SSC_AGRE_2021.tsv (before running the script, install openpyxl by mamba install openpyxl).

python - <<'PY'
import pandas as pd

# Read the de novo CNV sheet from the downloaded workbook (needs openpyxl).
df = pd.read_excel(
    "42003_2021_2533_MOESM6_ESM.xlsx",
    sheet_name="De novo CNV in SSC and AGRE",
)
df = df.dropna(subset=["location"])

# Split a location such as "chr1:1305145-1314126" into its three parts.
coords = df["location"].str.extract(r"^(chr\w+):(\d+)-(\d+)$")
out = pd.DataFrame({
    "chrom": coords[0],
    "pos_beg": coords[1].astype(int),
    "pos_end": coords[2].astype(int),
    "variant": df["variant"],
    "size": df["size"],
})
out.to_csv("Iossifov_Lab_SSC_AGRE_2021.tsv", sep="\t", index=False)
PY

After running the script, inspect Iossifov_Lab_SSC_AGRE_2021.tsv to confirm the coordinate columns (chrom, pos_beg, and pos_end) and the attribute columns (variant and size).

chrom	pos_beg	pos_end	variant	size
chr1	1305145	1314126	duplication	8982
chr1	3069177	4783791	duplication	1714615
chr1	3975501	3977800	deletion	2300
chr1	6647401	6650500	deletion	3100
chr1	8652301	8657600	deletion	5300
chr1	9992001	9994100	deletion	2100

Prepare a genomic_resource.yaml with the following content to make the resource available in GAIn:

type: cnv_collection
table:
  filename: Iossifov_Lab_SSC_AGRE_2021.tsv

scores:
- id: CNV_type
  name: variant
  type: str
  desc: CNV type
- id: CNV_size
  name: size
  type: int
  desc: CNV size
  histogram:
    type: number
    y_log_scale: True

meta:
  summary: Iossifov Lab SSC AGRE 2021 CNV collection

Finally, while still in the resource directory, run:

grr_manage resource-repair

This command validates the CNV collection resource for use in annotation and generates an HTML summary page with basic descriptions and any available statistics for the CNV collection resource.

_images/example13_resource.png — Summary html page created for `my_CNVcollection` resource.

Select GRR to work with

As noted earlier, a fresh GAIn installation includes access to the default IossifovLab GRR. If you want GAIn to use a different set of GRRs for annotation (for example, a local GRR you created), you can define them explicitly by creating a file named .grr_definition.yaml in your home directory. This file specifies which GRRs GAIn should connect to when browsing and annotating.

To let GAIn see our newly created local repository, my_GRR, we create ~/.grr_definition.yaml with the following content:

id: development
type: group
children:
- id: GRR
  type: url
  url: https://grr.iossifovlab.com

- id: my_GRR
  type: directory
  directory: <path to myGRR>/my_GRR

This configuration tells GAIn that, when resolving a resource ID, it should first look in the public GRR hosted by the Iossifov lab (GRR). If the resource is not found there, it then falls back to the local directory-based GRR (my_GRR).
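
The lookup order described above can be pictured as a first-match search over the children of the group. The following is only a conceptual sketch with hypothetical names and a tiny hand-written catalogue per repository; it does not use GAIn's API.

```python
def resolve(resource_id, repositories):
    """Return the id of the first repository that contains resource_id.

    repositories is an ordered list of (repository id, set of resource ids),
    mirroring the order of the children in .grr_definition.yaml.
    """
    for repo_id, resource_ids in repositories:
        if resource_id in resource_ids:
            return repo_id
    return None


repositories = [
    ("GRR", {"gene_properties/gene_scores/LGD"}),
    ("my_GRR", {"my_genome", "my_genemodel", "my_allele"}),
]

print(resolve("gene_properties/gene_scores/LGD", repositories))
print(resolve("my_genome", repositories))
print(resolve("not_a_resource", repositories))
```

If the same resource ID existed in both repositories, the one listed first would win under this scheme, so the order of children matters.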

If you use the grr_browse command again, you will see that GAIn now recognizes both GRRs.

Working with GRR definition: <home directory>/.grr_definition.yaml
id: development
type: group
children:
- id: GRR
  type: url
  url: https://grr.iossifovlab.com
- id: my_GRR
  type: directory
  directory: <path to myGRR>/my_GRR

samocha_enrichment_background 0        4 1.38 MB      GRR enrichment/samocha_background
gene_score           0        6 7.8 MB       GRR gene_properties/gene_scores/Iossifov_Wigler_PNAS_2015
gene_score           0       11 576.07 KB    GRR gene_properties/gene_scores/LGD
...

Below is an example annotation pipeline you can run using your local resources. Copy the contents into a text file named annotation_pipeline_local.yaml.

preamble:
  summary: Local pipeline
  input_reference_genome: my_genome

annotators:
- effect_annotator:
    gene_models: my_genemodel
    attributes:
    - worst_effect
    - gene_list
    - genes

- position_score:
    resource_id: my_position

- normalize_allele_annotator
- allele_score:
    resource_id: my_allele
    input_annotatable: normalized_allele

- gene_score_annotator:
    resource_id: my_genescore
    input_gene_list: gene_list

Run the following command to annotate your variants using this pipeline.

annotate_columns variants.txt annotation_pipeline_local.yaml

This will create a variants_annotated.txt file with the following content:

chrom	pos	ref	alt	worst_effect	genes	phyloP7way	am_pathogenicity	am_class	my_genescore
chr14	21415880	G	A	nonsense	CHD8	0.917			{'CHD8': 9}
chr17	7674904	TCT	T	frame-shift	TP53	-0.12	0.151	likely_benignlikely_benignlikely_benignlikely_benignlikely_benignlikely_benignlikely_benignlikely_benignlikely_benign	{'TP53': 3}
chr7	117587806	G	A	missense	CFTR	0.917	0.99	likely_pathogenic	{'CFTR': 7}

You can now annotate your variants for gene effects using the latest genomic assembly (GRCh38.p14), the MANE gene models (v1.4), PhyloP7, and AlphaMissense, entirely offline. Simply comment out the Iossifov Lab GRR entry in your .grr_definition.yaml file, disconnect from the network, and run the annotation locally.

id: "development"
type: group
children:
#- id: "GRR"
#  type: "url"
#  url: "https://grr.iossifovlab.com"

- id: "my_GRR"
  type: "directory"
  directory: "<path to myGRR>/my_GRR"

miniGRR: a template GRR

Defining new genomic resources can feel a bit abstract at first: different resource types expect different file formats (FASTA, BigWig, tabix-indexed TSV/VCF, …), coordinate conventions (0-based vs 1-based), and configuration options in genomic_resource.yaml. To make this concrete, we provide miniGRR, a small, self-contained Genomic Resource Repository on GitHub that you can use as a template.

You can clone miniGRR with:

git clone https://github.com/iossifovlab/mini_grr.git
cd mini_grr

miniGRR contains a toy genome (two short chromosomes, 20 nucleotides each) with ready-made genomic_resource.yaml descriptors, plus minimal examples of gene models (RefSeq- and GTF-style, one gene per chromosome), position scores, allele scores, gene scores and gene sets. The resources span common file types (e.g., TSV/tabix, BedGraph/BigWig, VCF) and both 0-based and 1-based coordinate conventions, so you can see exactly how formats and offsets are declared in practice.

Once cloned, you can point GAIn to miniGRR in your .grr_definition.yaml and run pipelines against it on a laptop to verify that your installation and configuration work as expected. After you understand how a given resource is structured, you can swap in your own data (for example, replace the mini FASTA with a real assembly, or replace a toy score track with your own) by editing the corresponding genomic_resource.yaml. In this way, miniGRR serves as a template GRR that demonstrates directory layout, metadata fields, and attribute wiring with minimal compute, making it easier to bootstrap a private GRR and extend it resource by resource.