Genomic resources and repositories

A Genomic Resource Repository (GRR) is a collection of genomic resources (e.g., genomes, gene models, scores, and gene sets) stored either locally (on disk) or remotely (over the network). GAIn uses GRRs as the backing store for resources during annotation and analysis.

Repository discovery

A GRR configuration file, also called a GRR definition file, is a small YAML file that tells GAIn which GRRs to use and in what order to search them. It contains no genomic data itself; instead, it points to the repositories where resources live, such as local directories or remote URLs, and can also describe how those repositories are combined and cached.

GAIn command-line tools determine which GRR definition to use by checking several sources in order. A GRR definition passed directly with the -g or --grr command-line option has the highest priority and applies only to that command:

grr_browse -g /path/to/my_grr_definition.yaml
annotate_tabular -g /path/to/my_grr_definition.yaml input.tsv pipeline.yaml

If no command-line GRR definition is provided, GAIn next checks the GRR_DEFINITION_FILE environment variable:

export GRR_DEFINITION_FILE=/path/to/my_grr_definition.yaml
grr_browse

If GRR_DEFINITION_FILE is not set, GAIn then checks the default ~/.grr_definition.yaml file in the user’s home directory.

If none of these are available, GAIn falls back to the public IossifovLab GRR.
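
The lookup order above can be sketched as a small Python function. This is only an illustration of the precedence rules, not GAIn's actual implementation: the function name resolve_grr_definition, its parameters, and the PUBLIC_DEFAULT placeholder are assumptions made for this example.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Assumption: a placeholder label for the built-in public IossifovLab GRR fallback.
PUBLIC_DEFAULT = "<public IossifovLab GRR>"


def resolve_grr_definition(
    cli_path: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    home: Optional[Path] = None,
) -> str:
    """Return the GRR definition source that wins under the documented order."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else home

    # 1. An explicit -g/--grr argument has the highest priority.
    if cli_path:
        return cli_path

    # 2. Next, the GRR_DEFINITION_FILE environment variable.
    env_path = env.get("GRR_DEFINITION_FILE")
    if env_path:
        return env_path

    # 3. Then ~/.grr_definition.yaml, if it exists.
    default_file = home / ".grr_definition.yaml"
    if default_file.is_file():
        return str(default_file)

    # 4. Otherwise fall back to the public repository.
    return PUBLIC_DEFAULT
```

Passing env and home explicitly keeps the sketch easy to try out without touching the real environment or home directory.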

To configure which GRRs GAIn uses by default, create a file named .grr_definition.yaml in your home directory. The example below points GAIn to the public IossifovLab GRR (a remote repository accessed via URL):

id: development
type: group
children:
- id: GRR
  type: url
  url: https://grr.iossifovlab.com

If .grr_definition.yaml contains the next example, GAIn will resolve resources from your local directory-based GRR, such as one created in “Getting Started in GRR”. This overrides the default behavior, so the public IossifovLab GRR will no longer be used unless you add it explicitly.

id: development
type: group
children:
- id: grr_local
  type: directory
  directory: [path to my_grr]/my_grr

The configuration below defines two GRRs and searches them in order. When GAIn resolves a resource ID, it first queries the GRR with id GRR. If the resource is not found there, GAIn then queries the GRR with id grr_local.

id: development
type: group
children:
- id: GRR
  type: url
  url: https://grr.iossifovlab.com
- id: grr_local
  type: directory
  directory: [path to my_grr]/my_grr

Repository configuration

A repository configuration is a YAML mapping that describes a single repository. Every repository has a required id and type, plus additional fields depending on the repository type. A repository can be a concrete repository, such as a local directory or remote URL, or a group that combines several child repositories.

Common fields

id (string, required): Identifier for the repository.

type (string, required): directory, http, url, s3, embedded, or group.

cache_dir (string, optional): Directory used to cache downloaded resources locally. May be added to any repository type, including a group.

Repository types

directory — a local repository on disk: A GRR stored in a local directory. The aliases dir and file are accepted as synonyms of directory.

directory (string, required): Absolute path to a local directory containing the resources. A relative path is rejected.

url — a remote repository: The general-purpose remote repository type. The scheme of the URL selects the protocol; http, https, and s3 are supported.

url (string, required): Base URL of the remote repository.

http — a remote HTTP(S) repository: Like url but restricted to http and https URLs.

url (string, required): Base URL of the remote repository.

s3 — a remote S3 repository: Like url but restricted to s3 URLs.

url (string, required): s3:// URL of the remote repository.

embedded — an in-memory repository: A repository whose resources are defined inline in the configuration. This is used mainly for testing and small examples. The alias memory is accepted as a synonym of embedded.

content (mapping, required): Nested dictionary describing files and directories. Directory values are nested mappings; file values are file contents.

group — a collection of repositories: Combines several repositories and searches them in the order they appear in children. When a resource ID is requested, the group queries each child in turn and returns the first match. Groups can be nested.

children (list, required): A list of repository configurations. Each child can be a concrete repository or another group.

Within a group, the first repository that contains the requested resource wins. Order children accordingly: list a local directory before a remote repository if local resources should take precedence, or after it if the remote repository should be authoritative.
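
The first-match rule for groups, including nested groups, can be illustrated with a short recursive search. This is a sketch under simplifying assumptions: each concrete repository is modelled as a dictionary with a hypothetical resources set, and the repository and resource ids are made up for the example.

```python
from typing import Any, Dict, Optional


def find_resource(repo: Dict[str, Any], resource_id: str) -> Optional[str]:
    """Return the id of the first concrete repository that holds resource_id."""
    if repo["type"] == "group":
        # Children are queried in the order they are listed; the first hit wins.
        for child in repo["children"]:
            found = find_resource(child, resource_id)
            if found is not None:
                return found
        return None
    # Assumption: a concrete repository is modelled as a set of resource ids.
    return repo["id"] if resource_id in repo["resources"] else None


# Hypothetical definition: a nested group of two remotes, then a local directory.
definition = {
    "id": "all",
    "type": "group",
    "children": [
        {
            "id": "remotes",
            "type": "group",
            "children": [
                {"id": "remote_a", "type": "url", "resources": {"genome_x", "score_y"}},
                {"id": "remote_b", "type": "url", "resources": {"score_z"}},
            ],
        },
        {"id": "local", "type": "directory", "resources": {"genome_x", "my_score"}},
    ],
}
```

Here genome_x exists in both remote_a and local, but remote_a is listed first, so find_resource(definition, "genome_x") resolves to remote_a. Moving local to the top of children would make the local copy take precedence.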

Repository caching

When a repository is configured with a cache_dir option, GAIn caches resources locally before using them. This matters because many genomic resources are large (often hundreds of MB to many GB), and repeatedly downloading or streaming them from a remote GRR can be slow and network-dependent.

With caching enabled, the first use of a resource may take longer while GAIn downloads it into cache_dir. After that, GAIn reuses the cached copy, which is typically much faster and avoids repeated network transfers. This is especially useful for resources you access frequently (for example, common reference genomes, gene models, or widely used scores).

cache_dir can be attached to any repository, including a group. When attached to a group, it caches resources served through that group, which provides a convenient way to use a single cache in front of several repositories.

The tradeoff is disk usage: cached resources can occupy substantial space, so choose a cache_dir location with enough capacity (and keep in mind that the cache may grow over time as you use more resources).
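
The download-once, reuse-afterwards behavior can be sketched as follows. The function fetch_with_cache, the download callback, and the on-disk layout (cache_dir/resource_id/filename) are assumptions for illustration and do not describe GAIn's real cache format.

```python
from pathlib import Path
from typing import Callable


def fetch_with_cache(
    resource_id: str,
    filename: str,
    cache_dir: Path,
    download: Callable[[str, str], bytes],
) -> Path:
    """Return a local path for the file, downloading it only on first use."""
    cached = cache_dir / resource_id / filename
    if not cached.exists():
        # First use: pay the network cost once and store the result on disk.
        cached.parent.mkdir(parents=True, exist_ok=True)
        cached.write_bytes(download(resource_id, filename))
    # Later uses: the cached copy is returned without calling download again.
    return cached
```

The sketch also makes the tradeoff visible: every distinct file requested adds to the contents of cache_dir, and nothing in this simplified version ever removes it.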

Complete GRR definition example

The configuration below combines several features described above. It defines a top-level group with two children: a nested group of remote repositories that share a cache directory, followed by a local directory-based GRR.

In this example, GAIn searches main-GRR first, then GRR-ENCODE, and finally My_First_GRR. The first repository that contains the requested resource is used.

type: group
id: my_GRRs
children:
- type: group
  id: remote_GRRs
  cache_dir: <path_to_cache>/remote_grr_cache
  children:
  - id: main-GRR
    type: url
    url: https://grr.iossifovlab.com

  - id: GRR-ENCODE
    type: url
    url: https://grr-encode.iossifovlab.com

- id: My_First_GRR
  type: directory
  directory: <path_to_My_First_GRR>/My_First_GRR

To use this configuration, save it as ~/.grr_definition.yaml, point GRR_DEFINITION_FILE to it, or pass it explicitly with -g:

grr_browse -g my_grr_definition.yaml

Repository management

GAIn provides two command-line tools for working with genomic resources and repositories. Their usage is outlined below.

grr_manage: create, inspect, and maintain GRRs (manifests, stats, info pages, repair).

grr_browse: browse the resources available through a GRR definition file.

grr_manage

$ grr_manage --help
usage: grr_manage [-h] [--version] [--verbose] [--logfile LOGFILE]
                {list,repo-init,repo-manifest,resource-manifest,repo-stats,resource-stats,repo-info,resource-info,repo-repair,resource-repair}
                ...

Genomic Resource Repository Management Tool

positional arguments:
{list,repo-init,repo-manifest,resource-manifest,repo-stats,resource-stats,repo-info,resource-info,repo-repair,resource-repair}
                        Command to execute
    list                List a GR Repo
    repo-init           Initialize a directory to turn it into a GRR
    repo-manifest       Create/update manifests for whole GRR
    resource-manifest   Create/update manifests for a resource
    repo-stats          Build the statistics for a resource
    resource-stats      Build the statistics for a resource
    repo-info           Build the index.html for the whole GRR
    resource-info       Build the index.html for the specific resource
    repo-repair         Update/rebuild manifest and histograms whole GRR
    resource-repair     Update/rebuild manifest and histograms for a resource

options:
-h, --help            show this help message and exit
--version             Prints GAIn version and exists.
--verbose, -v, -V
--logfile LOGFILE     File to log output to. If not set, logs to console.

grr_browse

$ grr_browse --help
usage: grr_browse [-h] [--version] [--verbose] [--logfile LOGFILE] [-g GRR]
                [--bytes]

Genomic Resource Repository Browse Tool

options:
-h, --help         show this help message and exit
--version          Prints GAIn version and exists.
--verbose, -v, -V
--logfile LOGFILE  File to log output to. If not set, logs to console.
--bytes            Print the resource size in bytes

Repository/Resource:
-g GRR, --grr GRR  path to GRR definition file.

Version control for GRRs

GRRs can be managed under version control using a combination of Git, DVC, and grr_manage. In this setup, small files such as genomic_resource.yaml, .MANIFEST files, histogram metadata, and .dvc tracking files are stored in Git, while large resource files are stored with DVC. The grr_manage tool is then used to generate or update GRR metadata, including manifests, resource statistics, histograms, and HTML info pages.

This organization makes it possible to track both the structure and content of a GRR while avoiding the need to store large genomic data files directly in Git.

Initializing version control

A version-controlled GRR starts as a directory-based GRR. In the Adding local GRRs section of the “Getting started on CLI” page, we created a local GRR named My_First_GRR and initialized it with grr_manage repo-init. We can now place that GRR directory under Git and DVC control.

Git is used for small files, such as genomic_resource.yaml, .MANIFEST, histogram metadata, and .dvc pointer files. DVC is used for large genomic resource files.

From the directory that contains My_First_GRR, enter the GRR root directory and initialize Git and DVC:

cd My_First_GRR
git init
dvc init

After initializing DVC, configure a DVC remote where large files will be stored. The remote can be a shared filesystem, SSH server, cloud bucket, or another DVC-supported storage backend. For example:

dvc remote add -d myremote <remote_url>

After initializing Git locally, the GRR directory can optionally be connected to a remote Git repository, such as a private or public GitHub repository, so that the GRR structure and metadata can be shared with other users:

git remote add origin git@github.com:<organization>/<repository>.git
git add .
git commit -m "Initialize version-controlled GRR"
git push -u origin main

Only small files and DVC pointer files should be committed to Git. Large genomic resource files should be added with dvc add and stored in the configured DVC remote.

Once this is done, the GRR can be managed using the same pattern as other version-controlled data repositories: small files are committed to Git, large resource files are added with dvc add and pushed with dvc push, and grr_manage is used to regenerate manifests, statistics, histograms, and info pages after changes. The public IossifovLab GRR is managed using this same approach.

Adding a resource to version control

In the Adding local GRRs section of the “Getting started on CLI” page, we added a gene score resource named my_score to My_First_GRR. The resource directory contains the downloaded score file and its genomic_resource.yaml configuration file:

My_First_GRR/
└── my_score/
    ├── Collins_rCNV_2022.dosage_sensitivity_scores.tsv.gz
    └── genomic_resource.yaml

In a version-controlled GRR, the large resource file should be added to DVC, while the small configuration file should be added to Git. From the root of My_First_GRR:

cd my_score
dvc add Collins_rCNV_2022.dosage_sensitivity_scores.tsv.gz

This creates a .dvc pointer file for the large resource file. The .dvc file and the resource configuration should be added to Git:

git add Collins_rCNV_2022.dosage_sensitivity_scores.tsv.gz.dvc
git add genomic_resource.yaml
git commit -m "Add my_score resource"

The large resource file itself should be pushed to the configured DVC remote:

dvc push

After adding the resource, return to the GRR root directory and run grr_manage resource-repair to generate or update the resource manifest, histograms, and other derived files:

cd ..
grr_manage resource-repair

The generated files should also be added to Git:

git add my_score/.MANIFEST
git add my_score/histograms/
git commit -m "Add my_score generated metadata"
git push

In this workflow, Git tracks the resource structure, configuration, DVC pointer files, manifests, and histogram outputs, while DVC stores the large resource data file itself.

Metadata-only updates

Metadata-only changes are simpler than changes to the underlying resource data. For example, suppose we want to update the summary, description, labels, or score descriptions in the genomic_resource.yaml file for the my_score resource. Because the large score file itself is not changing, we do not need to download or modify the DVC-managed resource file.

Edit the resource configuration file:

cd My_First_GRR
vi my_score/genomic_resource.yaml

After editing genomic_resource.yaml, run the repair command from the GRR root directory:

grr_manage resource-repair

This updates the resource manifest and any derived metadata that depend on the configuration. Then add the changed files to Git:

git add my_score/genomic_resource.yaml
git add my_score/.MANIFEST
git commit -m "Update my_score metadata"
git push

This workflow is efficient because metadata files are small and stored directly in Git, while large genomic resource files remain in DVC and do not need to be downloaded or modified for metadata-only updates.

Genomic resource configuration

GAIn supports a large number of genomic resource types (for example, genomes, gene models, and position scores). Each resource lives in its own folder within a GRR and includes the resource files plus a genomic_resource.yaml configuration file. In the sections below, we describe the configuration options available for each resource type.

All genomic_resource.yaml files share the same top-level structure: the first line sets the resource type (a string that determines how GAIn interprets the resource), and an optional meta section can provide human-readable metadata via summary, description, and labels.

type: <genomic resource type>

# resource-specific configuration

meta:
  summary: <(string) Short summary of the resource>
  description: <(string) Longer description of the resource>
  labels: <(dictionary) Arbitrary key/value pairs>

While describing genomic_resource.yaml configuration options, we will first cover the resource types whose genomic_resource.yaml files are relatively simple (genome, gene models, liftover chains, and annotation pipelines). Next, we will cover position score and allele score resources, whose configuration files are typically more complex because the underlying data files are large and often follow resource-specific conventions. To support these cases, we introduce additional options for table and column matching, histogram configuration, and annotation defaults. Finally, we cover gene scores (which are similar to position and allele scores) and gene sets, which have their own resource-specific configuration in genomic_resource.yaml.

Genomes

Genome resources use a reference assembly FASTA and (optionally) provide assembly-specific metadata such as chromosome naming conventions and pseudoautosomal regions.

Resource-specific fields in genomic_resource.yaml for genome resources (type: genome) are:

filename (string): Path to the genome FASTA file, relative to the resource directory.

index_file (string, optional): Path to the FASTA .fai index, relative to the resource directory. Default: <filename>.fai.

chrom_prefix (string, optional): Prefix expected in contig names (e.g., chr). Default: no prefix.

PARS (subsection, optional): Pseudoautosomal regions for the assembly.

The genome FASTA may be either a plain .fa file or a bgzipped FASTA (.fa.gz or .bgz). GAIn selects how to read the sequence from the file extension — a bgzipped genome is read with random access via pysam.FastaFile — so no extra configuration is required. A plain .fa genome needs only its .fai index; a bgzipped genome must be accompanied by two index files in the resource directory: a .fai FASTA index and a .gzi bgzip block index. Both are produced together by samtools faidx:

samtools faidx GRCh38.p14.genome.fa.gz

which writes GRCh38.p14.genome.fa.gz.fai and GRCh38.p14.genome.fa.gz.gzi next to the FASTA.
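
As a quick sanity check before using a genome resource, the index requirements above can be expressed in a few lines of Python. This is a hedged sketch: the helper names are hypothetical, and it assumes the default index names written by samtools faidx rather than a custom index_file setting.

```python
from pathlib import Path
from typing import List


def expected_index_files(fasta_name: str) -> List[str]:
    """List the index files expected next to a genome FASTA."""
    # Every genome needs a .fai FASTA index.
    indexes = [fasta_name + ".fai"]
    # A bgzipped genome additionally needs a .gzi block index.
    if fasta_name.endswith((".gz", ".bgz")):
        indexes.append(fasta_name + ".gzi")
    return indexes


def missing_index_files(resource_dir: Path, fasta_name: str) -> List[str]:
    """Return the expected index files that are absent from resource_dir."""
    return [
        name
        for name in expected_index_files(fasta_name)
        if not (resource_dir / name).is_file()
    ]
```

An empty list from missing_index_files means the resource directory has the companion files described above; a non-empty list names the files that samtools faidx still needs to produce.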

A bgzipped genome is configured exactly like a plain one — only the filename extension differs:

type: genome
filename: GRCh38.p14.genome.fa.gz
chrom_prefix: "chr"

meta:
  summary: Nucleotide sequence of the GRCh38.p14 genome assembly (bgzipped)

Let’s revisit the example genomic_resource.yaml from the Getting started with GRR genome section. As before, filename points to the downloaded FASTA file and contig names use the chr prefix. We now also include PARS, which defines the pseudoautosomal regions on chromosomes X and Y.

type: genome
filename: GRCh38.p14.genome.fa
chrom_prefix: "chr"

PARS:
  "X":
    - "chrX:10000-2781479"
    - "chrX:155701382-156030895"
  "Y":
    - "chrY:10000-2781479"
    - "chrY:56887902-57217415"

meta:
  summary: Nucleotide sequence of the GRCh38.p14 genome assembly
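
To make the PARS region strings concrete, the sketch below parses them and tests whether a position falls inside a pseudoautosomal region. It is an illustration only: the helper names are hypothetical, and treating both interval ends as inclusive is an assumption, not a statement about how GAIn interprets the coordinates.

```python
import re
from typing import Dict, List, Tuple

Region = Tuple[str, int, int]
_REGION_RE = re.compile(r"^([^:]+):(\d+)-(\d+)$")


def parse_region(text: str) -> Region:
    """Split a 'chrom:begin-end' string into its three parts."""
    match = _REGION_RE.match(text)
    if match is None:
        raise ValueError(f"not a chrom:begin-end region: {text!r}")
    chrom, begin, end = match.groups()
    return chrom, int(begin), int(end)


def in_par(pars: Dict[str, List[str]], chrom: str, pos: int) -> bool:
    """Check whether chrom:pos lies in any configured region (inclusive ends assumed)."""
    for regions in pars.values():
        for text in regions:
            r_chrom, begin, end = parse_region(text)
            if chrom == r_chrom and begin <= pos <= end:
                return True
    return False


# The PARS mapping from the configuration above, as a Python dictionary.
PARS = {
    "X": ["chrX:10000-2781479", "chrX:155701382-156030895"],
    "Y": ["chrY:10000-2781479", "chrY:56887902-57217415"],
}
```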

Gene models

For gene model resources, the genomic_resource.yaml file has a minimal resource-specific section with only filename and format.

Resource-specific fields (type: gene_models):

filename (string): Path to the gene model file, relative to the resource directory.

format (string): Gene model format. Supported values include default, refflat, refseq, ccds, knowngene, gtf, and ucscgenepred.

In the Getting started with GRR gene models example, the gene model file is a GTF, so we set format: gtf.

type: gene_models

filename: MANE.GRCh38.v1.4.ensembl_genomic.gtf.gz
format: gtf

meta:
  summary: MANE gene model version 1.4

Liftover chains

For liftover chain resources, the genomic_resource.yaml file has a minimal resource-specific section with only filename.

Resource-specific fields (type: liftover_chain):

filename (string): Path to the chain file, relative to the resource directory.

type: liftover_chain
filename: hg38-chm13v2.over.chain.gz
meta:
  summary: Liftover Chain hg38 to T2T

Annotation pipelines

For annotation pipeline resources, the genomic_resource.yaml file has a minimal resource-specific section with only filename.

Resource-specific fields (type: annotation_pipeline):

filename (string): Path to the pipeline YAML file, relative to the resource directory.

type: annotation_pipeline
filename: Clinical_annotation.yaml
meta:
  summary: Clinical Annotation Pipeline

Position scores

Position score resources (type: position_score) use a genomic_resource.yaml file with three resource-specific sections: table, scores, and (optionally) default_annotation.

table

The table section specifies the data file (filename), its format, and how GAIn should interpret the columns.

Currently supported formats are tabix, vcf_info, tsv, csv, and bw.

The header_mode setting controls how column names (the header) are determined:

file: Extract the header from the file (default).

list: Use the explicit header provided via header.

none: No header is used; columns can only be referenced by index.

The header field is used only when header_mode is set to list. Example:

header_mode: list
header: ["chrom", "start", "end", "score_value"]

The user must tell GAIn which columns correspond to chrom (chromosome), pos_begin (start position), and pos_end (end position). This can be done by column index or by column name.

If the resource file has no header, columns must be specified by index. For example:

table:
  filename: positionscore1.bedGraph.gz
  format: tabix
  header_mode: none
  chrom:
    index: 0
  pos_begin:
    index: 1
  pos_end:
    index: 2

If the resource file includes a header, columns can be specified by name. In the next example, positionscore2.bedGraph.gz has columns named chr and pos:

table:
  filename: positionscore2.bedGraph.gz
  format: tabix
  header_mode: file
  chrom:
    name: chr
  pos_begin:
    name: pos
  pos_end:
    name: pos
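
The two ways of referring to a column can be sketched as a single lookup. This is a simplified illustration, not GAIn's parser: resolve_column is a hypothetical helper, and the header is passed in as a plain list (or None when header_mode is none).

```python
from typing import Dict, List, Optional, Union


def resolve_column(
    spec: Dict[str, Union[int, str]],
    header: Optional[List[str]],
) -> int:
    """Turn an {index: ...} or {name: ...} column spec into a 0-based index."""
    # An explicit index works with or without a header.
    if "index" in spec:
        return int(spec["index"])
    # A name can only be resolved when a header is available.
    if header is None:
        raise ValueError("column names need a header (header_mode: file or list)")
    return header.index(str(spec["name"]))
```

With the header ["chr", "pos", "positionscore2"], both pos_begin and pos_end from the example above resolve to column 1, which is how a single-position file reuses one column for both ends of the interval.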

The table section also supports chrom_mapping, which can be used to reconcile chromosome naming differences between the resource file and the reference genome. This is useful, for example, when the resource uses contig names like chr1 but the genome uses only numbers.

Three options are available under chrom_mapping:

add_prefix: Takes a string value and adds it as a prefix.

del_prefix: Takes a string value and removes it from the start of each chromosome name.

filename: Takes a filepath (relative to the genomic resource directory).

The file must contain two whitespace-delimited columns. The first line must be a header with the column names chrom and file_chrom. Values in file_chrom are what appear in the resource file, and values in chrom are what they will be mapped to. For example:

chrom           file_chrom
Chromosome_1     1
Chromosome_22    22

An example of using chrom_mapping (useful when the genome uses a chr prefix but the resource does not) is shown below:

table:
...
  chrom_mapping:
    add_prefix: "chr"
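
The three chrom_mapping options can be sketched as one function that maps the chromosome names found in the resource file to the names they are exposed under. This is an illustration rather than GAIn's code: build_chrom_map is a hypothetical helper, and the direction of the mapping (file name to exposed name) is inferred from the description of the mapping file above.

```python
from typing import Dict, Iterable, Optional


def build_chrom_map(
    file_chroms: Iterable[str],
    add_prefix: Optional[str] = None,
    del_prefix: Optional[str] = None,
    mapping_text: Optional[str] = None,
) -> Dict[str, str]:
    """Map chromosome names used in the file to the names they are exposed as."""
    if mapping_text is not None:
        # Mapping file: whitespace-delimited, header names chrom and file_chrom.
        rows = [line.split() for line in mapping_text.splitlines() if line.strip()]
        header, body = rows[0], rows[1:]
        chrom_col = header.index("chrom")
        file_col = header.index("file_chrom")
        return {row[file_col]: row[chrom_col] for row in body}

    result = {}
    for name in file_chroms:
        mapped = name
        # del_prefix strips the prefix from the start of the name, if present.
        if del_prefix and mapped.startswith(del_prefix):
            mapped = mapped[len(del_prefix):]
        # add_prefix prepends the given string.
        if add_prefix:
            mapped = add_prefix + mapped
        result[name] = mapped
    return result
```

For example, build_chrom_map(["1", "X"], add_prefix="chr") maps 1 to chr1 and X to chrX, while build_chrom_map(["chr1"], del_prefix="chr") maps chr1 to 1.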

scores

The table section configures how the data file is read. The scores section specifies which score columns to extract, how to name them in the GRR, and what data type they should have. For example, the minimal configuration below extracts a float score from column index 3 (the column that follows chrom, pos_begin, and pos_end in a headerless bedGraph file) and stores it under the id my_positionscore1:

scores:
- id: my_positionscore1
  type: float
  index: 3

Alternatively, score columns can be specified by name. In the next example, the score column in the file is named positionscore2, and the extracted score is stored under the id my_positionscore2:

scores:
- id: my_positionscore2
  type: float
  name: positionscore2

Optionally, the user may also add human-readable descriptions. These fields are used on the HTML summary page for the resource. For example:

desc: "conservation score"
large_values_desc: "more conserved"
small_values_desc: "less conserved"

The HTML summary page displays a default histogram for each score. Optionally, the user may provide a histogram configuration to override the default and control how the score distribution is displayed. Histogram configuration options are covered in the Histogram configuration section. The example below shows a custom histogram within a complete scores entry. If the resource includes multiple scores, add additional entries under scores with different id values.

scores:
- id: my_positionscore2
  type: float
  name: positionscore2

  desc: "conservation score"
  large_values_desc: "more conserved"
  small_values_desc: "less conserved"

  histogram:
    type: number
    number_of_bins: 100
    view_range:
      min: 0.0
      max: 1.0
    y_log_scale: True

default_annotation

Annotation pipelines can choose which scores from a resource to use. If a pipeline does not explicitly specify scores for this resource, GAIn falls back to the resource’s default_annotation list. If default_annotation is not provided, all scores in the resource are used by default. An example is shown below:

default_annotation:
- source: my_positionscore2
  name: my_positionscore2
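
The fallback order described above can be sketched as a small selection function. It is a hedged illustration: select_scores and the pipeline_attributes argument are hypothetical names, and the resource configuration is represented as a plain dictionary mirroring genomic_resource.yaml.

```python
from typing import Any, Dict, List, Optional


def select_scores(
    resource_config: Dict[str, Any],
    pipeline_attributes: Optional[List[Dict[str, str]]] = None,
) -> List[Dict[str, str]]:
    """Pick the scores to annotate with, following the documented fallback order."""
    # 1. Scores named explicitly by the pipeline take priority.
    if pipeline_attributes:
        return list(pipeline_attributes)
    # 2. Otherwise use the resource's default_annotation list, if present.
    defaults = resource_config.get("default_annotation")
    if defaults:
        return [
            {"source": d["source"], "name": d.get("name", d["source"])}
            for d in defaults
        ]
    # 3. With no default_annotation, every score in the resource is used.
    return [
        {"source": s["id"], "name": s["id"]}
        for s in resource_config.get("scores", [])
    ]
```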

Putting all the pieces together, the following is a complete genomic_resource.yaml example for a position score resource. The optional meta field is omitted for conciseness.

type: position_score                         # resource type

table:                                       # how to read the input table
  filename: positionscore2.bedGraph.gz       # input file (relative path)
  format: tabix                              # file format
  header_mode: file                          # read header from file
  chrom:                                     # chromosome column
    name: chr                                # column name
  pos_begin:                                 # start position column
    name: pos                                # column name
  pos_end:                                   # end position column
    name: pos                                # column name

scores:                                      # how to extract data columns as scores
  - id: my_positionscore2                    # score id stored in GRR
    type: float                              # data type of the score values
    name: positionscore2                     # column name containing the score

    desc: "a description"                    # shown on the HTML summary page
    large_values_desc: "more"                # meaning of larger values (HTML)
    small_values_desc: "less"                # meaning of smaller values (HTML)

    histogram:                               # optional histogram override (HTML)
      type: number                           # numeric histogram
      number_of_bins: 100                    # bin count used in the histogram
      view_range:                            # visible range shown on the x-axis
        min: 0.0                             # minimum visible range in the histogram
        max: 1.0                             # maximum visible range in the histogram
      y_log_scale: True                      # use log scale on the y-axis

default_annotation:                          # default scores used for annotation
  - source: my_positionscore2                # score id to annotate from
    name: my_positionscore2                  # name of the annotation field

Allele scores

genomic_resource.yaml files for allele score resources are almost exactly the same as for position score resources, with three differences:

The resource type is set to allele_score (type: allele_score).

allele_score_mode must be specified. Options are:

substitutions: single nucleotide substitutions (for example, C>T)

allele: covers all allele types (for example, insertions and deletions in addition to substitutions)

In the table section, the user must also specify which columns contain the reference and alternative alleles using reference and alternative.

The scores, default_annotation, and meta sections are the same as for position scores. The example below shows the beginning of a valid genomic_resource.yaml for an allele score resource:

type: allele_score
allele_score_mode: substitutions

table:
  filename: AlphaMissense_hg38_modified.tsv.gz
  format: tabix

  chrom:
    name: CHROM
  pos_begin:
    name: POS
  pos_end:
    name: POS
  reference:
    name: REF
  alternative:
    name: ALT

... (scores, default_annotation, and meta sections follow) ...

CNV collections

genomic_resource.yaml files for CNV collection resources are the same as for position score resources, except that the resource type is set to cnv_collection.

CNV collections are coordinate-based, like position scores: they are queried by chromosome and interval and do not model allele changes. Annotation consists of reporting overlapping CNVs and the selected associated fields (for example, CNV class and frequency).

The example below shows a valid genomic_resource.yaml for a CNV collection resource (my_CNVcollection.txt), which uses chrom, pos_begin and pos_end as column names for chromosome, beginning position and end position, respectively. It also has a column called deletion_duplication which describes the event type recorded.

type: cnv_collection
table:
  filename: my_CNVcollection.txt

scores:
- id: CNV type
  name: deletion_duplication
  type: str
  desc: duplication or deletion

meta:
  summary: CNV collection resource

Gene scores

Gene scores are gene-level annotations, such as constraint metrics, expression summaries, or intolerance scores. genomic_resource.yaml files for gene score resources are similar to position score resources, except that the resource type is set to gene_score and there is no table section. The underlying data file is a table with a gene identifier column and one or more score columns. By default the gene identifier column must be named gene; if the file uses a different column name, set gene_column to that name.

In the example genomic_resource.yaml file below, the data file gene_scores.tsv contains a required column named gene, plus two score columns named constraint_score and intolerance_score. The scores section defines which columns are exposed as scores, and default_annotation works the same way as for position scores.

The HTML summary page displays a default histogram for each score. Optionally, the user may provide a histogram configuration to override the default and control how the score distribution is displayed, as shown for constraint_score in this example. Histogram configuration options are covered in the Histogram configuration section.

type: gene_score

filename: gene_scores.tsv

gene_column: gene  # optional; defaults to "gene"

scores:
- id: intolerance_score
  desc: Probability of Loss-of-Function Intolerance

- id: constraint_score
  desc: Gene conservation score
  histogram:
    type: number
    number_of_bins: 126
    view_range:
      min: 0
      max: 1
    x_min_log: 0.00001
    x_log_scale: false
    y_log_scale: true

default_annotation:
- source: constraint_score
  name: constraint_score

meta:
  summary: Gene score resource

Gene set collections

A gene_set_collection defines relationships between genes and gene sets. These relationships can be provided either directly as gene sets (gmt format) or as gene-set mappings (map format). In both cases, the underlying structure is the same: a many-to-many association between genes and sets.

In gmt format, each row of the file directly defines a single gene set and its member genes. The first column defines the set identifier, the second column typically provides a description, and the remaining columns list the genes belonging to that set. No additional processing is required to construct the gene sets.

example.gmt, an example gmt data file:

PATHWAY_A   Description of pathway A    GENE1    GENE2    GENE3
PATHWAY_B   Description of pathway B    GENE2    GENE4

Example genomic_resource.yaml file for a gmt gene set collection resource:

type: gene_set_collection
id: example_gmt
format: gmt
filename: example.gmt

meta:
  summary: Minimal GMT example

In map format, each row defines a relationship between a gene and a gene set. The first column contains the gene identifier, and the second column contains the set identifier. Gene sets are formed by grouping all rows with the same set identifier. A companion file may optionally be provided to associate each set identifier with a human-readable description.

example-map.txt, an example map file:

GENE1   SET_A
GENE2   SET_A
GENE3   SET_A
GENE2   SET_B
GENE4   SET_B

Optional companion file: example-mapnames.txt

SET_A   Pathway A description
SET_B   Pathway B description

Example genomic_resource.yaml file for a map gene set collection resource:

type: gene_set_collection
id: example_map
format: map
filename: example-map.txt

histograms:
  genes_per_gene_set:
    type: number
    y_log_scale: true

  gene_sets_per_gene:
    type: number
    y_log_scale: true

meta:
  summary: Example MAP-based gene set collection

For both gmt and map resources, the optional histograms section can be used to summarize the structure of the collection. For example, genes_per_gene_set describes the distribution of gene set sizes, while gene_sets_per_gene describes how many sets each gene belongs to.
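
The sketch below shows how both formats lead to the same gene-to-set structure and how the two summary distributions can be derived from it. It is an illustration only: the function names are hypothetical, gmt rows are assumed to be tab-delimited, and map rows are assumed to be whitespace-delimited with exactly two columns.

```python
from collections import Counter, defaultdict
from typing import Dict, Set


def parse_gmt(text: str) -> Dict[str, Set[str]]:
    """Each row: set id, description, then the member genes (tab-delimited)."""
    sets: Dict[str, Set[str]] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        set_id, _description, *genes = line.split("\t")
        sets[set_id] = set(genes)
    return sets


def parse_map(text: str) -> Dict[str, Set[str]]:
    """Each row: gene id, set id; rows sharing a set id form one gene set."""
    sets: Dict[str, Set[str]] = defaultdict(set)
    for line in text.splitlines():
        if not line.strip():
            continue
        gene, set_id = line.split()
        sets[set_id].add(gene)
    return dict(sets)


def genes_per_gene_set(sets: Dict[str, Set[str]]) -> Dict[str, int]:
    """Size of each gene set."""
    return {set_id: len(genes) for set_id, genes in sets.items()}


def gene_sets_per_gene(sets: Dict[str, Set[str]]) -> Dict[str, int]:
    """Number of sets each gene belongs to."""
    return dict(Counter(gene for genes in sets.values() for gene in genes))
```

Applied to the example map file above, genes_per_gene_set gives 3 for SET_A and 2 for SET_B, and gene_sets_per_gene gives 2 for GENE2 and 1 for every other gene. These per-set and per-gene counts are the values that the two histograms summarize.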

Histogram configuration

Histograms provide a quick visual summary of how a score is distributed across the genome or across observed variants. Seeing the distribution is often as important as seeing individual values, because it helps interpret what “large” or “small” values typically look like for a given score and whether the score has outliers, heavy tails, or distinct modes.

For each score, the HTML summary page shows a default histogram whenever it is possible to compute one from the underlying data. Histogram configuration is optional. If a score includes a histogram block under scores, GAIn uses it to override the default display and control how the distribution is visualized.

Histogram behavior is controlled by the type field, which selects the histogram implementation. GAIn supports three histogram types: number for numeric scores, categorical for string or discrete category scores, and null to explicitly disable histogram computation/display when a histogram is not meaningful. The value of type must be exactly one of number, categorical, or null.

Some options are shared across number and categorical histogram types. For example, y_log_scale controls whether the y-axis is displayed on a log scale (default: False), which can be helpful when counts vary widely across bins or categories. x_log_scale controls whether the x-axis is displayed on a log scale (default: False). When x_log_scale is set to True, x_min_log defines the minimum x-axis value used for the logarithmic scale. The example below shows a minimal histogram configuration that overrides the default by enabling log-scale display on the y-axis for a numeric score. Other options depend on the selected type and are described in the sections below.

scores:
  - id: myscore
    column_name: RS
    type: int
    desc: a genomic score

    histogram:
      type: number
      y_log_scale: True

Number histograms

Number histograms are used for numeric scores, both continuous-valued and integer-valued, and are supported for scores of type int and float. They summarize the distribution by grouping values into bins along the x-axis and showing the number of observations per bin. By default, the histogram is calculated with 100 bins and uses linear scaling on both axes.

A number histogram configuration supports two options.

number_of_bins: number of bins used to partition the score values (default: 100).

view_range: the visible range on the x-axis using min and max values, which is useful for bounded scores (for example, 0-1) or for focusing on the region of interest without being dominated by extreme outliers. By default, all values are shown.

The example below shows a number histogram configuration with an explicit bin count and visible range.

histogram:
  type: number
  number_of_bins: 10
  view_range:
    min: 0.0
    max: 1.0
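The effect of number_of_bins can be illustrated with a small equal-width binning sketch. It is an assumption-laden illustration, not GAIn's histogram code: the function bin_counts is hypothetical, and it assumes, purely for the example, that the bins span the configured min and max and that values outside that range are ignored.

```python
from typing import List, Sequence


def bin_counts(
    values: Sequence[float], number_of_bins: int, vmin: float, vmax: float
) -> List[int]:
    """Count values in equal-width bins over [vmin, vmax] (hypothetical helper)."""
    if number_of_bins <= 0 or vmax <= vmin:
        raise ValueError("need number_of_bins > 0 and vmax > vmin")
    counts = [0] * number_of_bins
    width = (vmax - vmin) / number_of_bins
    for value in values:
        if value < vmin or value > vmax:
            continue  # assumption: out-of-range values are not counted
        # The maximum value falls into the last bin rather than past it.
        index = min(int((value - vmin) / width), number_of_bins - 1)
        counts[index] += 1
    return counts


counts = bin_counts([0.05, 0.12, 0.18, 0.95, 1.0, 1.7], 10, 0.0, 1.0)
print(counts)
```

With 10 bins over 0.0 to 1.0, each bin is 0.1 wide; fewer bins give a coarser picture, more bins a finer but noisier one.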

Categorical histograms

Categorical histograms are used for scores whose values are discrete labels (e.g., ClinVar clinical significance categories or review-status labels). They are supported for scores of type str and int, and only for scores with fewer than 100 unique values. They summarize the distribution by counting how many observations fall into each unique value and displaying those counts.

A categorical histogram configuration supports five options.

displayed_values_count: the number of unique values that will be displayed in the histogram (default: 20). The remaining values are grouped into the Other category.

displayed_values_percent: the percentage of total mass of unique values that will be displayed. The remaining values are grouped into the Other category. Only one of displayed_values_count and displayed_values_percent can be set.

label_rotation: rotation angle for x-axis category labels in degrees (default: 0).

value_order: the order in which the unique values are displayed in the histogram.

plot_function: optional custom plotting function used instead of the default categorical histogram rendering. This is useful when the default plot and the available options are not sufficient, for example to reorder, filter, or relabel categories. The value should be provided as <python module>:<python function>, where the Python module path is relative to the resource directory. When plot_function is set, GAIn uses the custom function to render the histogram and ignores built-in categorical histogram options such as displayed_values_count, displayed_values_percent, and label_rotation.

The examples below show two common categorical histogram setups. The first uses the built-in categorical histogram rendering with displayed_values_count and label_rotation. The second uses plot_function, which overrides the default categorical histogram rendering.

Example 1: built-in categorical histogram options (top 5 values + label rotation)

histogram:
  type: categorical
  displayed_values_count: 5
  label_rotation: 90
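The grouping into the Other category can be sketched as below. This is an illustration under stated assumptions rather than GAIn's implementation: the function displayed_values is hypothetical, values are ranked by count, and displayed_values_percent is interpreted as "keep the most frequent values until they cover at least that percentage of all observations".

```python
from typing import Dict, Optional


def displayed_values(
    counts: Dict[str, int],
    displayed_values_count: Optional[int] = None,
    displayed_values_percent: Optional[float] = None,
) -> Dict[str, int]:
    """Keep the most frequent values and fold the rest into "Other"."""
    if displayed_values_count is not None and displayed_values_percent is not None:
        raise ValueError("set only one of displayed_values_count/_percent")
    if displayed_values_count is None and displayed_values_percent is None:
        displayed_values_count = 20  # documented default

    ranked = sorted(counts.items(), key=lambda item: -item[1])
    total = sum(counts.values())
    if displayed_values_count is not None:
        kept = ranked[:displayed_values_count]
    else:
        # Assumed interpretation of the percent option (see lead-in).
        target = total * displayed_values_percent / 100.0
        kept, mass = [], 0
        for value, count in ranked:
            if mass >= target:
                break
            kept.append((value, count))
            mass += count

    result = dict(kept)
    other = total - sum(result.values())
    if other > 0:
        result["Other"] = other
    return result


observed = {"benign": 50, "pathogenic": 30, "uncertain": 15, "conflicting": 4, "drug_response": 1}
top_two = displayed_values(observed, displayed_values_count=2)
top_ninety = displayed_values(observed, displayed_values_percent=90)
print(top_two)
print(top_ninety)
```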

Example 2: custom categorical histogram rendering using plot_function

histogram:
  type: categorical
  plot_function: "customplot1.py:my_own_plot"

For GAIn to render the second histogram using a custom plotting function, place a Python module such as customplot1.py that contains the function my_own_plot in the resource directory. The custom function must render and write a plot to the provided output stream (outfile) so it can be embedded in the HTML summary output. The simple example below sorts categories by count, keeps the top 20, and renders a basic bar chart (with an optional log-scaled y-axis) to the provided output stream:

from typing import IO

import matplotlib.pyplot as plt
from dae.genomic_resources.histogram import CategoricalHistogram


def my_own_plot(
    outfile: IO, histogram: CategoricalHistogram, xlabel: str, *_args, **_kw
) -> None:
    # Sort categories by count (descending) and keep the 20 most frequent.
    items = sorted(histogram.raw_values.items(), key=lambda x: -x[1])[:20]
    labels, counts = zip(*items) if items else ([], [])
    fig = plt.figure()
    plt.bar(labels, counts, log=histogram.config.y_log_scale)
    plt.xlabel(xlabel)
    plt.ylabel("count")
    # Write the plot to the stream provided by the caller, then release the figure.
    plt.savefig(outfile)
    plt.close(fig)

Null histograms

Null histograms are used when calculating a histogram is not possible or does not make sense for a score. In this case, the HTML summary page will not display a histogram for the score, and instead records the reason why histogram display is disabled.

A null histogram configuration supports one required field.

reason: a short explanation of why the histogram is disabled.

Example:

histogram:
  type: null
  reason: "Histogram is not available for this score."
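The rules stated in this section (an allowed set of type values, a required reason for null histograms, and mutually exclusive displayed_values_count and displayed_values_percent) can be expressed as a small check. The sketch below is a hypothetical helper for illustrating those rules, not part of GAIn, and it operates on a plain dictionary such as the one a YAML loader would produce.

```python
from typing import Any, Dict, List

ALLOWED_TYPES = ("number", "categorical", "null")


def check_histogram_config(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a histogram block (hypothetical helper)."""
    if "type" not in config:
        return ["missing required field: type"]

    hist_type = config["type"]
    # YAML loaders typically read an unquoted `null` as None, so accept both.
    if hist_type is None:
        hist_type = "null"
    if hist_type not in ALLOWED_TYPES:
        return [f"type must be one of {ALLOWED_TYPES}, got {hist_type!r}"]

    problems: List[str] = []
    if hist_type == "null" and not config.get("reason"):
        problems.append("null histogram requires a reason")
    if (
        hist_type == "categorical"
        and "displayed_values_count" in config
        and "displayed_values_percent" in config
    ):
        problems.append(
            "only one of displayed_values_count and "
            "displayed_values_percent can be set"
        )
    return problems


print(check_histogram_config({"type": "number", "y_log_scale": True}))
print(check_histogram_config({"type": None}))
```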

VCF score auto-detection

VCF files already describe many score-like fields in their headers. In particular, each ##INFO line provides an ID, a type, and a human-readable description. GAIn uses this metadata to automatically create score definitions for INFO fields, which you can then reference in configuration just like manually defined scores.

Create the following file and save it as example.vcf. It contains a single INFO field, A. As in any VCF file, the columns of the header and data lines must be separated by tabs:

##fileformat=VCFv4.1
##INFO=<ID=A,Number=1,Type=Integer,Description="Score A">
#CHROM POS ID REF ALT QUAL FILTER  INFO
chr1   5   .  A   T   .    .       A=1

Create the following genomic_resource.yaml for this score, omitting an explicit scores section:

type: position_score

table:
  filename: example.vcf
  format: vcf_info

When you run grr_manage resource-repair, the scores and their descriptions are generated automatically from the INFO fields in the VCF file.

The configuration above is equivalent to spelling out the generated score definition explicitly:

type: position_score

table:
  filename: example.vcf
  format: vcf_info

scores:
- id: A
  type: int
  column_name: A
  desc: Score A

Not every field can be generated from the VCF header; a histogram configuration, for example, has no counterpart in an ##INFO line. To customize a generated definition, add a scores: entry with the same id and include only the fields you want to change or extend (for example, overriding type or adding a histogram block):

scores:
- id: A
  type: float
  histogram:
    type: number
    number_of_bins: 10

GAIn derives each score’s type directly from the VCF INFO field type: Integer maps to int, Float to float, String to str, and Flag to bool.
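The mapping from ##INFO lines to score definitions can be sketched in a few lines of Python. This illustrates the idea and is not GAIn's parser: the names scores_from_header and apply_overrides are hypothetical, the regular expression handles only the common ID, Number, Type, Description layout, and the override step assumes a simple field-by-field merge keyed by id.

```python
import re
from typing import Any, Dict, List

# Type mapping as documented: Integer -> int, Float -> float, String -> str, Flag -> bool.
VCF_TO_SCORE_TYPE = {"Integer": "int", "Float": "float", "String": "str", "Flag": "bool"}

INFO_RE = re.compile(
    r'^##INFO=<ID=(?P<id>[^,]+),Number=(?P<number>[^,]+),'
    r'Type=(?P<type>[^,]+),Description="(?P<desc>[^"]*)"'
)


def scores_from_header(header_lines: List[str]) -> List[Dict[str, Any]]:
    """Build one score definition per ##INFO line (hypothetical helper)."""
    scores = []
    for line in header_lines:
        match = INFO_RE.match(line)
        if match is None:
            continue  # not an INFO line, or a layout this sketch does not handle
        score_type = VCF_TO_SCORE_TYPE.get(match["type"])
        if score_type is None:
            continue  # types outside the documented mapping are skipped here
        scores.append({
            "id": match["id"],
            "type": score_type,
            "column_name": match["id"],
            "desc": match["desc"],
        })
    return scores


def apply_overrides(
    generated: List[Dict[str, Any]], overrides: List[Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """Merge user-supplied fields over generated ones, matching on id (assumed)."""
    by_id = {override["id"]: override for override in overrides}
    return [{**score, **by_id.get(score["id"], {})} for score in generated]


header = [
    "##fileformat=VCFv4.1",
    '##INFO=<ID=A,Number=1,Type=Integer,Description="Score A">',
]
generated = scores_from_header(header)
customized = apply_overrides(generated, [{"id": "A", "type": "float"}])
print(generated)
print(customized)
```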

Tabix indexing

Many GAIn resource types are backed by on-disk tables, typically tab-delimited genomic files (for example TSV/BED-like tables, bedGraph, or VCF-derived tables). These files can be large, but GAIn still needs to look up the records that overlap a given genomic interval during annotation (for example, chr1:100000-101000). Scanning the full file for every query would be too slow, so GAIn supports Tabix-indexed tables for fast random access by genomic region. (Some resource formats such as bigWig are already indexed and do not use Tabix.)

When you set format: tabix under a resource’s table section, you are telling GAIn that the data file is bgzip-compressed, coordinate-sorted, and accompanied by a Tabix index (.tbi or .csi). With that index in place, GAIn can jump directly to the relevant file blocks, and your table: mapping tells it how to interpret each row (which columns provide chrom, pos_begin, and pos_end, plus any header handling you specify).

The main pitfall is coordinate conventions: BED-style files are typically 0-based, half-open, while many TSV tables and VCF positions are 1-based. Keep the tabix indexing flags (for example -0) consistent with the file, and set zero_based accordingly in the resource YAML to avoid subtle off-by-one overlaps.
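The off-by-one risk is easy to see in a short sketch. The helper names below are hypothetical and exist only for illustration: a 0-based, half-open interval [start, end) corresponds to the 1-based, closed interval [start + 1, end].

```python
from typing import Tuple


def to_one_based_closed(start: int, end: int, zero_based: bool) -> Tuple[int, int]:
    """Normalize an interval to 1-based, closed coordinates (hypothetical helper)."""
    if zero_based:
        # BED-style [start, end) becomes [start + 1, end].
        return start + 1, end
    return start, end


def overlaps(a: Tuple[int, int], b: Tuple[int, int]) -> bool:
    """True if two 1-based, closed intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


# A BED-style record "chr1 99 100" covers exactly one base: 1-based position 100.
correct = to_one_based_closed(99, 100, zero_based=True)
misread = to_one_based_closed(99, 100, zero_based=False)

query = (99, 99)  # a variant at 1-based position 99
print(overlaps(correct, query))
print(overlaps(misread, query))
```

Read with the right convention, the record does not overlap position 99; misread as 1-based, it appears to, which is the kind of subtle error that a consistent zero_based setting avoids.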

Common options:

-p, --preset: preset parser for common formats (e.g., vcf, bed, gff), which sets the expected coordinate columns automatically.

-s, --sequence: 1-based column index for the chromosome/contig (sequence name) column.

-b, --begin: 1-based column index for the start (begin) coordinate column.

-e, --end: 1-based column index for the end (stop) coordinate column. If the file has no end column, set -e to the same value as -b (single-position intervals).

-0, --zero-based: interpret coordinates as 0-based (BED-style) instead of 1-based.

-C, --csi: generate a CSI index instead of the default TBI index (useful for very large coordinates/contigs).

-f, --force: overwrite an existing index file.

For a full list of options run tabix --help. The examples below show how to produce Tabix indexes for common file layouts.

Example usage of tabix

For a VCF-format score (-p vcf: use the VCF preset):

$ tabix -p vcf score.vcf.gz

For a 1-based TSV score with a single position column (-s: chrom column, -b: pos column, -e: same as -b):

$ tabix -s 1 -b 2 -e 2 score.tsv.gz

For a 1-based TSV score with start and stop position columns (-s: chrom, -b: start, -e: end):

$ tabix -s 1 -b 2 -e 3 score.tsv.gz

For a 0-based TSV score with start and stop position columns (-0: 0-based coordinates, plus -s/-b/-e as above):

$ tabix -0 -s 1 -b 2 -e 3 score.tsv.gz
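When indexing is scripted, the same commands can be assembled programmatically. The sketch below uses a hypothetical helper, tabix_command, that builds the argument list from the short flags described above; it only constructs the command and does not run it, and it assumes the tabix executable is available on the PATH if you choose to execute the result.

```python
import shlex
from typing import List, Optional


def tabix_command(
    path: str,
    *,
    preset: Optional[str] = None,
    sequence: Optional[int] = None,
    begin: Optional[int] = None,
    end: Optional[int] = None,
    zero_based: bool = False,
    csi: bool = False,
    force: bool = False,
) -> List[str]:
    """Build a tabix argument list (hypothetical helper, not part of GAIn)."""
    cmd = ["tabix"]
    if zero_based:
        cmd.append("-0")
    if csi:
        cmd.append("-C")
    if force:
        cmd.append("-f")
    if preset is not None:
        cmd += ["-p", preset]
    else:
        if sequence is None or begin is None:
            raise ValueError("give either a preset or sequence and begin columns")
        # With no end column, reuse the begin column (single-position intervals).
        end_column = end if end is not None else begin
        cmd += ["-s", str(sequence), "-b", str(begin), "-e", str(end_column)]
    cmd.append(path)
    return cmd


cmd = tabix_command("score.tsv.gz", sequence=1, begin=2, end=3, zero_based=True)
print(shlex.join(cmd))
# To execute: subprocess.run(cmd, check=True)
```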