Getting started with GRR
========================

Create your first GRR
---------------------
To create a new Genomic Resource Repository (GRR), start by making an empty directory and moving into it:


.. code-block:: bash

   mkdir my_GRR
   cd my_GRR


Initialize this directory as a GRR:


.. code-block:: bash

   grr_manage repo-init


From now on, any subdirectory under ``my_GRR`` becomes a genomic resource if:

1. It contains a properly configured genomic_resource.yaml, and

2. It includes the data files referenced in that YAML.

Add new resources
---------------------

1: Toy genome
^^^^^^^^^^^^^^^^^^^^^^^

Reference genomes live in FASTA files, and the human reference FASTA is on the order of gigabytes (~3 Gb), 
which makes it inconvenient as a first example. Instead of starting with the full human genome, we will use 
a toy genome made of two very short chromosomes (10 bases each). A genome resource in GAIn always has two components: 
the FASTA file that holds the sequences and a ``.fai`` index file that enables fast random access.

From within your ``my_GRR`` directory, create a directory called ``my_minigenome`` and move into it.

.. code-block:: bash

   mkdir my_minigenome
   cd my_minigenome

Create a new text file with the content below and save it as ``minigenome.fa``.

.. code-block:: text

    >chr1 1st_chromosome
    TATGAAATAA
    >chr2 2nd_chromosome
    AAAAAAAAAA

Before GAIn can efficiently access a genome, the reference FASTA must be indexed to enable fast random access. 
We use ``samtools`` (installed via bioconda and already included in the ``GAIn`` conda environment) for this step.

Run the following command to index ``minigenome.fa``. This generates a FASTA index file (``.fai``) that GAIn uses as a 
lookup table:

.. code-block:: bash

    samtools faidx minigenome.fa


Next, create a file named ``genomic_resource.yaml`` with the following content; 
this configures the minigenome resource so GAIn can recognize and use it for annotation:

.. code-block:: yaml

    type: genome
    filename: minigenome.fa
    chrom_prefix: "chr"
    meta:
      summary: mini genome


With the FASTA, index, and ``genomic_resource.yaml`` in place, the minigenome resource is already usable for 
annotation by GAIn (however, you will need to update your ``.grr_definition.yaml`` file to include ``my_GRR``, 
which includes this resource). 
We strongly recommend running the command below, which checks the resource and produces summary statistics.

.. code-block:: bash

    grr_manage resource-repair

On successful completion, an ``index.html`` file will appear in the ``my_minigenome`` directory. 
It contains basic metadata about the resource, statistics such as chromosome length and 
nucleotide/dinucleotide composition, and a full inventory of the files that make up the resource.

.. figure:: figures/example1_resource.png
  :scale: 80 %
  :align: center

  Summary html page created for ``my_minigenome`` resource.


2: Genome (GRCh38.p14)
^^^^^^^^^^^^^^^^^^^^^^^
Next, we configure a real human genome resource. From inside ``my_GRR``, make a directory named ``my_genome`` and change 
into it:

.. code-block:: bash

    mkdir my_genome
    cd my_genome

Then fetch the ``GRCh38.p14`` genome FASTA from the GENCODE FTP site:

.. code-block:: bash

    curl -o GRCh38.p14.genome.fa.gz \
    https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_46/GRCh38.p14.genome.fa.gz

Use the following commands in the ``my_genome`` directory to unzip and index the FASTA file:


.. code-block:: bash

    gunzip GRCh38.p14.genome.fa.gz
    samtools faidx GRCh38.p14.genome.fa

Now that you have both the FASTA and its ``.fai`` index, add a ``genomic_resource.yaml`` file 
in this directory with the content below. Here, filename specifies the FASTA file used for the genome sequence, 
``chrom_prefix`` indicates that chromosome names in this assembly use the ``chr`` prefix, for example ``chr1``, and ``PARS`` 
lists the pseudoautosomal regions on chromosomes X and Y.

.. code-block:: yaml

    type: genome
    filename: GRCh38.p14.genome.fa
    chrom_prefix: "chr"

    PARS:
    "X":
        - "chrX:10000-2781479"
        - "chrX:155701382-156030895"
    "Y":
        - "chrY:10000-2781479"
        - "chrY:56887902-57217415"

    meta:
      summary: Nucleotide sequence of the GRCh38.p14 genome assembly


At this point, the ``my_genome`` resource can be used for annotation by GAIn. 
To validate it and obtain summary statistics, run ``grr_manage resource-repair`` in this directory. 
Be aware that this step can be slow (~5 minutes), as it processes the entire genome to build the report.

On successful completion, an ``index.html`` file will appear in the ``my_genome`` directory, 
which contains basic metadata about the resource, statistics such as chromosome length and nucleotide/dinucleotide 
composition, and a full inventory of the files that make up the resource. 

.. figure:: figures/example2_resource.png
  :scale: 80 %
  :align: center

  Summary html page created for ``my_genome`` resource (partially displayed).


3: Gene models (MANE v1.4)
^^^^^^^^^^^^^^^^^^^^^^^
Next, we will create a gene model resource based on the MANE (Matched Annotation from NCBI and EBI) gene set, which offers a standardized transcript set for consistent use across genomic resources.
From inside ``my_GRR``, create a directory named ``my_genemodel`` and change into it:

.. code-block:: bash

    mkdir my_genemodel
    cd my_genemodel


First, we download the corresponding GTF file.

.. code-block:: bash

    curl -O https://ftp.ncbi.nlm.nih.gov/refseq/MANE/MANE_human/release_1.4/MANE.GRCh38.v1.4.ensembl_genomic.gtf.gz


Next, create a ``genomic_resource.yaml`` file in the ``my_genemodel`` directory with the following content:

.. code-block:: yaml

    type: gene_models

    filename: MANE.GRCh38.v1.4.ensembl_genomic.gtf.gz
    format: gtf

    meta:
      summary: MANE gene model version 1.4


With these files in place, ``my_genemodel`` is usable as a gene-model resource in GAIn. 
To check it and produce an HTML summary with basic statistics, execute ``grr_manage resource-repair`` in this directory.


.. figure:: figures/example3_resource.png
  :scale: 80 %
  :align: center

  Summary html page created for ``my_genemodel`` resource.


4: Toy position score
^^^^^^^^^^^^^^^^^^^^^^^
In this example, we will create a toy position score resource which only has scores for a few positions 
in chromosome 1. In your ``my_GRR`` directory, create a directory called ``my_miniposition``:


.. code-block:: bash

    mkdir my_miniposition
    cd my_miniposition


Create a new text file with the content below and save it as ``mini_positionscore.tsv``.
The first column is the chromosome name, the second column is the 0-based position, and the third column is the score value.


.. code-block:: text

    chr1	0	0
    chr1	1	0.1
    chr1	2	0.2
    chr1	3	0.3
    chr1	4	0.4


Next, create a ``genomic_resource.yaml`` file in the ``my_miniposition`` directory with the following content:

.. code-block:: yaml

    type: position_score

    table:
    filename: mini_positionscore.tsv
    zero_based: True

    chrom:
        index: 0
    pos_begin:
        index: 1
    pos_end:
        index: 1

    scores:
    - id: pos_tsv_0
      type: float
      index: 2

    meta:
      summary: 0-based tsv position score

The resource is ready for use by GAIn. To check it and produce an HTML summary with basic statistics, 
execute ``grr_manage resource-repair`` in this directory.


.. figure:: figures/example4_resource.png
  :scale: 60 %
  :align: center

  Summary html page created for ``my_miniposition`` resource.


5: Position score (PhyloP7)
^^^^^^^^^^^^^^^^^^^^^^^

Let's now create a real position score resource. In your ``my_GRR`` directory, create a directory called ``my_position``:


.. code-block:: bash

    mkdir my_position
    cd my_position


phyloP (phylogenetic P-values) scores measure the evolutionary rate at individual nucleotides or other genomic elements. 
Let's download PhyloP7, which is derived from a multiple sequence alignment of the genomes of 7 different species 
(this is a large file around 5Gb, make sure you have enough space on your local drive).

.. code-block:: bash

    curl -O https://hgdownload.soe.ucsc.edu/goldenPath/hg38/phyloP7way/hg38.phyloP7way.bw


Next, create a ``genomic_resource.yaml`` file in the ``my_miniposition`` directory with the following content:

.. code-block:: yaml

    type: position_score

    table:
      filename: hg38.phyloP7way.bw
      header_mode: none
    
    scores:
    - id: phyloP7way
      type: float
      index: 3

    meta:
      summary: Conservation score based on the multiple alignment of 7 species

The resource is ready for use by GAIn. To check it and produce an HTML summary with basic statistics, 
execute ``grr_manage resource-repair`` in this directory (this will take around one hour as GAIn processes the 
large file to create summary statistics).


.. figure:: figures/example5_resource.png
  :scale: 60 %
  :align: center

  Summary html page created for ``my_position`` resource.


6: Toy allele score
^^^^^^^^^^^^^^^^^^^^^^^


Next, let's create a toy allele score resource. In ``my_GRR``, create a new directory called ``my_miniallele`` and move into it:

.. code-block:: bash

    mkdir my_miniallele
    cd my_miniallele


Create a new text file with the content below and save it as ``mini_allelescore.tsv``.
The first column is the chromosome name, the second column is the 0-based position. Third and fourth columns show the reference and alternate alleles. 
Last two columns are two allele scores, one is a numerical score and the second is a class represented by strings.

.. code-block:: text

    #chrom	pos	ref	alt	allele_score	allele_class
    chr1	0	T	A	1	good
    chr1	0	T	C	1.1	bad
    chr1	0	T	G	1.2	bad
    chr2	4	A	T	2	good
    chr2	4	A	C	2.1	bad
    chr2	4	A	G	2.2	bad


To prepare this allele score file for fast random access in GAIn, first compress and index it with:

.. code-block:: bash

    bgzip mini_allelescore.tsv 
    tabix -s 1 -b 2 -e 2 -0 mini_allelescore.tsv.gz


Then create a ``genomic_resource.yaml`` file as shown. This file marks the resource as indexed, identifies which columns hold the chromosome, 
position, reference, and alternate alleles, and describes the score columns that GAIn can use during annotation.


.. code-block:: yaml

    type: allele_score

    table:
      filename: mini_allelescore.tsv.gz
      format: tabix
      zero_based: True

      chrom:
        name: chrom
      pos_begin:
        name: pos
      pos_end:
        name: pos
      reference:
        name: ref
      alternative:
        name: alt

    scores:
    - id: allele_score
      name: allele_score
      type: float
        
    - id: allele_class
      name: allele_class
      type: str

    meta:
      summary: A toy allele score resource with allele scores and allele classes.


The resource is ready for use by GAIn. To check it and produce an HTML summary with basic statistics, execute ``grr_manage resource-repair`` in this directory.


.. figure:: figures/example6_resource.png
  :scale: 60 %
  :align: center

  Summary html page created for ``my_miniallele`` resource.


7: Allele score (AlphaMissense)
^^^^^^^^^^^^^^^^^^^^^^^
Next, let's create a real allele score resource. In your ``my_GRR`` directory, create a directory called ``my_allele``:

.. code-block:: bash

    mkdir my_allele
    cd my_allele

We will use AlphaMissense, a deep-learning-based missense variant deleteriousness score. Download the AlphaMissense score file for hg38 genome:

.. code-block:: bash

    curl -O https://zenodo.org/records/8208688/files/AlphaMissense_hg38.tsv.gz

A quick look at the downloaded file shows that the column names are listed on line four.

.. code-block:: bash

    gzip -dc AlphaMissense_hg38.tsv.gz | head


.. code-block:: text

    # Copyright 2023 DeepMind Technologies Limited
    #
    # Licensed under CC BY-NC-SA 4.0 license
    #CHROM	POS	REF	ALT	genome	uniprot_id	transcript_id	protein_variant	am_pathogenicity	am_class
    chr1	69094	G	T	hg38	Q8NH21	ENST00000335137.4	V2L	0.2937	likely_benign
    chr1	69094	G	C	hg38	Q8NH21	ENST00000335137.4	V2L	0.2937	likely_benign
    chr1	69094	G	A	hg38	Q8NH21	ENST00000335137.4	V2M	0.3296	likely_benign
    chr1	69095	T	C	hg38	Q8NH21	ENST00000335137.4	V2A	0.2609	likely_benign
    chr1	69095	T	A	hg38	Q8NH21	ENST00000335137.4	V2E	0.2922	likely_benign
    chr1	69095	T	G	hg38	Q8NH21	ENST00000335137.4	V2G	0.203	likely_benign

GAIn expects the column headers on the first line. Accordingly, we decompress the file, strip the first three lines, 
write the processed content to a new file, and delete the original file to minimize disk usage.

.. code-block:: bash

    gzip -dc AlphaMissense_hg38.tsv.gz \
    | sed '1,3d' \
    | bgzip > AlphaMissense_hg38_modified.tsv.gz
    rm AlphaMissense_hg38.tsv.gz

A second look at the resource file confirms that the column names are on line 1.

.. code-block:: bash
    bgzip -dc AlphaMissense_hg38_modified.tsv.gz | head -5


.. code-block:: text

    #CHROM	POS	REF	ALT	genome	uniprot_id	transcript_id	protein_variant	am_pathogenicity	am_class
    chr1	69094	G	T	hg38	Q8NH21	ENST00000335137.4	V2L	0.2937	likely_benign
    chr1	69094	G	C	hg38	Q8NH21	ENST00000335137.4	V2L	0.2937	likely_benign
    chr1	69094	G	A	hg38	Q8NH21	ENST00000335137.4	V2M	0.3296	likely_benign
    chr1	69095	T	C	hg38	Q8NH21	ENST00000335137.4	V2A	0.2609	likely_benign

The resource is now ready to be included in a GRR. As is typical for allele resources, the ``chrom`` and ``pos`` columns 
specify the genomic coordinates of the variant, while REF and ALT describe the variant itself. 
AlphaMissense also provides genome, UniProt, and transcript identifiers, as well as the amino acid substitution caused 
by the missense mutation (for example, the first variant changes valine to leucine). The last two columns report the 
scoring results: ``am_pathogenicity`` provides the predicted effect of the variant on protein structure and function, 
and ``am_class`` converts this score into categorical labels, with values below 0.34 classified as ``likely_benign`` and values above 0.564 as ``likely_pathogenic``.

To index the resource by genomic coordinates, we run the following command:

.. code-block:: bash

    tabix -s 1 -b 2 -e 2 AlphaMissense_hg38_modified.tsv.gz

Next, we create a text file called ``genomic_resource.yaml`` so that the resource is recognized 
by GAIn. As a first step, we configure it to ingest only the ``am_pathogenicity`` scores from 
the source file. In this ``YAML`` file, rather than using column indices, we explicitly specify 
the column names corresponding to chromosome, position, reference, and alternative alleles. 
We also customize the histogram generated by GAIn by setting the score range to 0-1 
(the AlphaMissense score range), using 100 bins, and applying a logarithmic scale to the ``y-axis``. 
Finally, we define the interpretation of low and high scores, which will be displayed 
on the summary page.


.. code-block:: yaml

    type: allele_score
    allele_score_mode: substitutions

    table:
      filename: AlphaMissense_hg38_modified.tsv.gz
      format: tabix

      chrom:
        name: CHROM
      pos_begin:
        name: POS
      pos_end:
        name: POS
      reference:
        name: REF
      alternative:
        name: ALT

    scores:
    - id: am_pathogenicity
      name: am_pathogenicity
      type: float
      desc: |
        AlphaMissense Pathogenicity score is a deleteriousness score for missense variants
      large_values_desc: "more pathogenic"
      small_values_desc: "less pathogenic"
      histogram:
        type: number
        number_of_bins: 100
        view_range:
          min: 0.0
          max: 1.0
        y_log_scale: True
      
    meta:
      summary: Functional impact of mutations on protein function

At this point, GAIn can use the resource for annotation. 
Running ``grr_manage resource-repair`` produces the following summary page, 
which currently includes only the ``am_pathogenicity`` score.

.. figure:: figures/example7_resource.png
  :scale: 40 %
  :align: center

  Summary html page created for ``my_allele`` resource.

.. figure:: figures/example7b_resource.png
  :scale: 40 %
  :align: center

  Histogram created for ``am_pathogenicity`` scores.

To include the ``am_class`` scores, add the following entries to the scores section, 
configuring the histogram as categorical and enabling a log scale on the ``y-axis``.


.. code-block:: yaml

  - id: am_class
    name: am_class
    type: str
    desc: |
      AlphaMissense Class is a deleteriousness category for missense variants
    histogram:
      type: categorical
      y_log_scale: True


Running ``grr_manage resource-repair`` with the updated ``genomic_resource.yaml`` 
file produces the updated resource page shown below, now displaying both the 
``am_pathogenicity`` and ``am_class`` scores.

.. figure:: figures/example7c_resource.png
  :scale: 60 %
  :align: center

  Updated summary HTML page created for ``my_allele`` resource.

8: Toy gene score
^^^^^^^^^^^^^^^^^^^^^^^

Create a new folder for the resource and move into it:

.. code-block:: bash

    mkdir my_minigenescore
    cd my_minigenescore

Create a comma-separated file called ``my_minigenescore.csv`` with the following content:

.. code-block:: bash

    gene,my_minigenescore
    CHD8,9
    TP53,3
    CFTR,7

This resource provides a single score for three example genes. 
Next, create a ``genomic_resource.yaml`` file in the same directory with this content:

.. code-block:: yaml

    type: gene_score
    filename: my_minigenescore.csv

    scores:
    - id: my_minigenescore
      histogram:
        type: number

    meta:
      summary: A custom gene score
      description: This is a custom gene score for demonstration purposes.

Finally, while still in the ``my_minigenescore`` directory, run:

.. code-block:: bash

    grr_manage resource-repair

This command checks that the resource is usable for annotation and produces an HTML 
summary file with basic descriptions and histograms for the ``my_minigenescore`` values.


.. figure:: figures/example8_resource.png
  :scale: 30 %
  :align: center

  Summary html page created for ``my_minigenescore`` resource.


9: Gene score (pLI)
^^^^^^^^^^^^^^^^^^^^^^^

As a real-world example of a gene score, we use pLI (probability of loss-of-function intolerance), which reflects a gene's sensitivity 
to loss-of-function mutations, with higher values indicating greater intolerance. The pLI score was introduced by `Lek et al.` in 2016.

Create a new folder for the resource and move into it:

.. code-block:: bash

    mkdir my_genescore
    cd my_genescore

At https://www.nature.com/articles/nature19057#Sec16, download the ZIP file containing the Supplementary Tables and unzip it. 
We focus on Supplementary Table 13, specifically the “Gene Constraint” sheet, which reports pLI scores and related constraint 
metrics for human genes. You may manually copy the gene and pLI columns into a new CSV file named pLI.csv, or 
generate this file automatically using the script below (before running the script, install ``openpyxl`` by ``mamba install openpyxl``).

.. code-block:: bash

    import pandas as pd
    # Load the Excel file
    df = pd.read_excel("nature19057-SI Table 13.xlsx", sheet_name="Gene Constraint")
    # Extract only 'gene' and 'pLI'
    df_subset = df[['gene', 'pLI']].copy()
    # Write to CSV
    df_subset.to_csv("pLI.csv", index=False)

The first few lines of pLI.csv will look like this:

.. code-block:: text

    gene	pLI
    AGRN	0.17335234
    NOC2L	1.33E-19
    B3GALT6	0.048104466
    C1orf159	0.090877636
    ISG15	0.009847813
    KLHL17	2.52E-07
    PLEKHN1	2.02E-08

Next, create a ``genomic_resource.yaml`` file in the same directory with this content:

.. code-block:: yaml

    type: gene_score
    filename: pLI.csv
    scores:
    - id: pLI
      desc: Probability of Loss-of-Function Intolerance
      small_values_desc: "less likely to be Loss-of-function intolerant"
      large_values_desc: "more likely to be Loss-of-function intolerant"
      histogram:
        type: number
        number_of_bins: 100
        view_range:
            min: 0
            max: 1
        x_min_log: 0.00001
        x_log_scale: false
        y_log_scale: true
    
    meta:
      summary: Probability of Loss-of-Function Intolerance
      description: The probability of loss-of-function intolerance (pLI) score reflects a gene's sensitivity to LoF mutations.


Finally, while still in the ``my_genescore`` directory, run:

.. code-block:: bash

    grr_manage resource-repair

This command checks that the resource is usable for annotation and produces an HTML 
summary file with basic descriptions and histograms for the ``my_genescore`` values.

.. figure:: figures/example9_resource.png
  :scale: 50 %
  :align: center

  Summary html page created for ``my_genescore`` resource.

10: Toy gene sets
^^^^^^^^^^^^^^^^^^^^^^^^

Gene set resources group biologically related genes and support membership and enrichment analyses. In GAIn, gene sets use a custom text representation. 
To illustrate how gene set resources are defined and used, we will create a toy gene set as a concrete example.

Create a new folder for the resource and move into it:

.. code-block:: bash

    mkdir my_minigenesets
    cd my_minigenesets


First, create another text file named ``map.txt`` with the following content. 
This file defines gene-to-set memberships: the left column lists gene names, and the right column lists the set identifier(s) for each gene. 
In this example, CHD8 and CFTR belong only to set 1, while TP53 belongs to sets 2 and 3.

.. code-block:: text

    CHD8	set_1
    TP53	set_2 set_3
    CFTR	set_3

Finally, make a ``genomic_resource.yaml`` file with the following content. 

.. code-block:: yaml

    type: gene_set_collection
    id: genesets
    format: map
    filename: map.txt

    meta:
      summary: mini gene sets collection


Finally, while still in the ``my_minigenesets`` directory, run:

.. code-block:: bash

    grr_manage resource-repair

This command checks that the resource is usable for annotation and produces 
an HTML summary file with basic descriptions and histograms for the ``my_minigenesets`` resource.

.. figure:: figures/example10_resource.png
  :scale: 50 %
  :align: center

  Summary html page created for ``my_minigenesets`` resource.


11: Gene sets (MSigDB)
^^^^^^^^^^^^^^^^^^^^^^^^

As a real-world example of a gene set resource, we will create a MSigDB (Molecular Signatures Database) gene sets derived from a variety of curated sources
The Curated (C2) collection in MSigDB includes gene sets from canonical pathway databases (e.g., KEGG, Reactome, BioCarta) and from published gene expression studies, capturing well-defined pathways and perturbation signatures.

To create a gene sets resource for MSigDB, make a new folder for the resource and move into it:

.. code-block:: bash

    mkdir my_genesets
    cd my_genesets


Grab the latest MSigDB gene sets in GMT format from the Broad Institute website:

.. code-block:: bash

    curl -O https://data.broadinstitute.org/gsea-msigdb/msigdb/release/7.4/c2.all.v7.4.symbols.gmt


Prepare a ``genomic_resource.yaml`` with the following content to make the resource available in GAIn:

.. code-block:: yaml

    type: gene_set_collection
    id: MSigDB_curated
    format: gmt
    filename: c2.all.v7.4.symbols.gmt
    histograms:
    genes_per_gene_set:
      type: number
      y_log_scale: True
    gene_sets_per_gene:
      type: number
      y_log_scale: True
    meta:
      summary: MSigDB (Molecular Signatures Database) gene sets


Finally, from inside the ``my_genesets`` directory, run:

.. code-block:: bash

    grr_manage resource-repair

This validates the resource for annotation and generates an HTML summary page with basic descriptions and histograms for ``my_genesets``.

.. figure:: figures/example11_resource.png
  :scale: 50 %
  :align: center

  Partial screen shot of the summary html page created for ``my_genesets`` resource.


12: Toy CNV collection
^^^^^^^^^^^^^^^^^^^

Copy-number variants (CNVs) are deletions or duplications of genomic segments. 
In practice, CNV resources summarize previously observed gains and losses so you can contextualize a query locus by 
interval overlap. In GAIn, CNV collections are represented as tabular files plus a small YAML definition that 
declares which columns should be exposed as annotation attributes.

Create a new folder for the resource and move into it:

.. code-block:: bash

    mkdir my_miniCNVcollection
    cd my_miniCNVcollection


Create a tab-separated file called my_miniCNVcollection.txt with the following content:

.. csv-table::
    :header-rows: 1

    chrom,pos_beg,pos_end,CNV_name,deletion_duplication,frequency
    chr1,3,15,Chr1_duplication,Duplication,0.1
    chr2,5,15,Chr2_duplication,Deletion,0.2


This file defines two example CNVs. Each row specifies an interval (``chrom``, ``pos_beg``, ``pos_end``), a CNV identifier (``CNV_name``), 
the CNV type (``deletion_duplication``) and ``frequency``.

Next, create a ``genomic_resource.yaml`` file in the same directory with this content:

.. code-block:: yaml

    type: cnv_collection
    table:
      filename: my_miniCNVcollection.txt

    scores:
    - id: CNV type
      name: deletion_duplication
      type: str
      desc: duplication or deletion
    - id: CNV frequency
      name: frequency
      type: float
      desc: CNV frequency

    meta:
      summary: CNV collection resource


In this resource, the interval columns (``chrom``, ``pos_beg``, ``pos_end``) are stored in the table and used for overlap queries, 
while the two fields listed under scores (CNV type and CNV frequency) are exposed as attributes that can be emitted in annotation output.

Finally, while still in the ``my_miniCNVcollection`` directory, run:

.. code-block:: bash

    grr_manage resource-repair


This command checks that the resource is usable for annotation and produces an HTML summary file with basic descriptions for the CNV collection resource.

.. figure:: figures/example12_resource.png
    :scale: 50 %
    :align: center

    Summary html page created for ``my_miniCNVcollection`` resource.


13: CNV collection (Iossifov 2021)
^^^^^^^^^^^^^^^^^^^
As a real-world example, we will build a CNV collection resource from Supplementary Data 4 of `Yoon et al.` (2021), which lists the `de novo` CNVs 
included in their analysis from WGS of SSC (simplex) and AGRE (multiplex) families.

Create a new folder for the resource and move into it:

.. code-block:: bash

    mkdir my_CNVcollection
    cd my_CNVcollection


First download the resource file:


.. code-block:: bash

    curl -O https://static-content.springer.com/esm/art%3A10.1038%2Fs42003-021-02533-z/MediaObjects/42003_2021_2533_MOESM6_ESM.xlsx

This command downloads an Excel workbook with two sheets. In this example, we use the "De novo CNV in SSC and AGRE" sheet, which contains the de novo CNV intervals and associated metadata.

.. csv-table::
    :header-rows: 1

    UCSCLink,collection,familyId,in affected status,personIds,location,variant,size,publicaton,genomic region,number of genes,genes
    LINK,SSC,12613,affected,12613.p1,chr1:1305145-1314126,duplication,8982,,coding,3,"ACAP3,INTS11,PUSL1"
    LINK,AGRE,AU2725202_AU2725201,affected,AU2725301,chr1:3069177-4783791,duplication,1714615,,coding,13,"AJAP1,ARHGEF16,C1orf174,CCDC27,CEP104,DFFB,LRRC47,MEGF6,PRDM16,SMIM1,TP73,TPRG1L,WRAP73"
    LINK,SSC,13424,unaffected,13424.s1,chr1:3975501-3977800,deletion,2300,,intergenic,0,
    LINK,SSC,12852,affected,12852.p1,chr1:6647401-6650500,deletion,3100,,inter-coding_intronic,1,DNAJC11
    LINK,SSC,13776,affected,13776.p1,chr1:8652301-8657600,deletion,5300,,coding,1,RERE
    LINK,SSC,13373,unaffected,13373.s1,chr1:9992001-9994100,deletion,2100,,intergenic,0,

The downloaded table does not include explicit ``chrom``, ``pos_beg``, or ``pos_end`` columns. 
Instead, these coordinates are encoded in the location field (for example, ``chr1:1305145-1314126``). 
Run the script below in your terminal to split location into ``chrom``, ``pos_beg``, and ``pos_end``, 
retain the ``variant`` and ``size`` columns, and write the result to a tab-separated file 
named ``Iossifov_Lab_SSC_AGRE_2021.tsv`` (before running the script, install ``openpyxl`` by ``mamba install openpyxl``).

.. code-block:: bash

    python - <<'PY'
    import pandas as pd

    df = pd.read_csv("download-csv.php", sep=",", dtype=str)
    df["chrom"] = "chr" + df["cnv-locus"].str.extract(r"^(\d+|X|Y|M|MT)(?=[pq])")[0]
    df[["pos_beg","pos_end"]] = df["basepair-range"].str.extract(r"(\d+)-(\d+)").astype(int)
    out = pd.DataFrame({
        "chrom": df["chrom"],
        "pos_beg": df["pos_beg"],
        "pos_end": df["pos_end"],
        "CNV_name": df["cnv-locus"] + " " + df["cnv-type"],
        "deletion_duplication": df["cnv-type"],
    })
    out.to_csv("Iossifov_Lab_SSC_AGRE_2021.tsv", sep="\t", index=False)
    PY


After running the script, inspect ``Iossifov_Lab_SSC_AGRE_2021.tsv`` to confirm the coordinate columns (``chrom``, ``pos_begin``, and ``pos_end``) 
and attribute columns.


.. csv-table::
    :header-rows: 1

    chrom,pos_beg,pos_end,variant,size
    chr1,1305145,1314126,duplication,8982
    chr1,3069177,4783791,duplication,1714615
    chr1,3975501,3977800,deletion,2300
    chr1,6647401,6650500,deletion,3100
    chr1,8652301,8657600,deletion,5300
    chr1,9992001,9994100,deletion,2100


Prepare a ``genomic_resource.yaml`` with the following content to make the resource available in GAIn:


.. code-block:: yaml

    type: cnv_collection
    table:
      filename: Iossifov_Lab_SSC_AGRE_2021.tsv

    scores:
    - id: CNV_type 
      name: variant
      type: str
      desc: CNV type
    - id: CNV_size 
      name: size
      type: int
      desc: CNV size
      histogram:
        type: number
        y_log_scale: True
        
    meta:
      summary: Iossifov Lab SSC AGRE 2021 CNV collection


Finally, while still in the resource directory, run:

.. code-block:: bash

    grr_manage resource-repair


This command validates the CNV collection resource for use in annotation and generates an HTML summary page with basic descriptions 
and any available statistics for the CNV collection resource.


.. figure:: figures/example13_resource.png
    :scale: 50 %
    :align: center

    Summary html page created for ``my_CNVcollection`` resource.


Select GRR to work with
-----------------------

As noted earlier, a fresh GAIn installation includes access to the default IossifovLab GRR. 
If you want GAIn to use a different set of GRRs for annotation (for example, a local GRR you created), 
you can define them explicitly by creating a file named .grr_definition.yaml in your home directory. 
This file specifies which GRRs GAIn should connect to when browsing and annotating.

To let GAIn see our newly created local repository, my_GRR, we create ~/.grr_definition.yaml with 
the following content:

.. code-block:: yaml

   id: development
   type: group
   children:
   - id: GRR
     type: url
     url: https://grr.iossifovlab.com

   - id: my_GRR
     type: directory
     directory: <path to myGRR>/my_GRR


This configuration tells GAIn that, when resolving a resource ID, it should first look in the public GRR
hosted by the Iossifov lab (``GRR``). If the resource is not found there, it then falls back to the local
directory-based GRR (``my_GRR``). 

If you use the ``grr_browse`` command again, you will see that GAIn now recognizes both GRRs.

.. code-block:: text

    Working with GRR definition: <home directory>/.grr_definition.yaml
    id: development
    type: group
    children:
    - id: GRR
      type: url
      url: https://grr.iossifovlab.com
    - id: my_GRR
      type: directory
      directory: <path to myGRR>/my_GRR

    samocha_enrichment_background 0        4 1.38 MB      GRR enrichment/samocha_background
    gene_score           0        6 7.8 MB       GRR gene_properties/gene_scores/Iossifov_Wigler_PNAS_2015
    gene_score           0       11 576.07 KB    GRR gene_properties/gene_scores/LGD
    ...

Below is an example annotation pipeline you can run using your 
local resources. Copy the contents into a text file named 
``annotation_pipeline_local.yaml``.

.. code-block:: yaml

    preamble:
    summary: Local pipeline 
    input_reference_genome: my_genome

    annotators:
    - effect_annotator:
        gene_models: my_genemodel
        attributes:
        - worst_effect
        - gene_list
        - genes

    - position_score:
        resource_id: my_position

    - normalize_allele_annotator
    - allele_score: 
        resource_id: my_allele 
        input_annotatable: normalized_allele

    - gene_score_annotator:
        resource_id: my_genescore
        input_gene_list: gene_list

Run the following command to annotate your variants using this pipeline.

.. code-block:: bash

    annotate_columns variants.txt annotation_pipeline_local.yaml


This will create a ``variants_annotated.txt`` file with the following content:

.. csv-table::
    :header-rows: 1

    chrom,pos,ref,alt,worst_effect,genes,phyloP7way,am_pathogenicity,am_class,my_genescore
    chr14,21415880,G,A,nonsense,CHD8,0.917,,,{'CHD8': 9}
    chr17,7674904,TCT,T,frame-shift,TP53,-0.12,0.151,likely_benignlikely_benignlikely_benignlikely_benignlikely_benignlikely_benignlikely_benignlikely_benignlikely_benign,{'TP53': 3}
    chr7,117587806,G,A,missense,CFTR,0.917,0.99,likely_pathogenic,{'CFTR': 7}

You can now annotate your variants for gene effects using the latest genomic 
assembly (GRCh38-p14), the MANE gene model (v1.5), PhyloP7, and AlphaMissense — 
entirely offline. Simply comment out the Iossifov Lab GRR resource in your 
``.grr_definition.yaml`` file, disconnect from the network, and run the annotation locally.


.. code-block:: yaml

    id: "development"
    type: group
    children:
    #- id: "GRR"
    #  type: "url"
    #  url: "https://grr.iossifovlab.com"

    - id: "my_GRR"
    type: "directory"
    directory: "/Users/muratcokol/Desktop/my_GRR"


miniGRR: a template GRR
-----------------------

Defining new genomic resources can feel a bit abstract at first: different resource types expect different file formats 
(FASTA, BigWig, tabix-indexed TSV/VCF, …), coordinate conventions (0-based vs 1-based), and configuration options in 
``genomic_resource.yaml``. To make this concrete, we provide **miniGRR**, a small, self-contained Genomic Resource Repository on 
GitHub that you can use as a template.

You can clone miniGRR with:

.. code-block:: bash

   git clone https://github.com/iossifovlab/mini_grr.git
   cd mini_grr

miniGRR contains a toy genome (two short chromosomes, 20 nucleotides each) with ready-made ``genomic_resource.yaml`` 
descriptors, plus minimal examples of gene models (RefSeq- and GTF-style, one gene per chromosome), position scores, 
allele scores, gene scores and gene sets. The resources span common file types (e.g., TSV/tabix, BedGraph/BigWig, VCF) 
and both 0-based and 1-based coordinate conventions, so you can see exactly how formats and offsets are declared in practice.

Once cloned, you can point GAIn to miniGRR in your ``.grr_definition.yaml`` and run pipelines against it on a laptop to verify 
that your installation and configuration work as expected. After you understand how a given resource is structured, you can 
swap in your own data (for example, replace the mini FASTA with a real assembly, or replace a toy score track with your own) 
by editing the corresponding ``genomic_resource.yaml``. In this way, miniGRR serves as a template GRR that demonstrates directory 
layout, metadata fields, and attribute wiring with minimal compute, making it easier to bootstrap a private GRR and extend it resource by resource.