GRR configuration files
A GRR configuration file (also called a GRR definition file) is a small YAML
file that tells the GAIn command-line tools — grr_browse, grr_manage,
annotate_tabular, annotate_vcf, annotate_variant_effects, and the
rest — which Genomic Resource Repositories (GRRs) to use and in what order to
search them. It does not contain any genomic data itself; it only points to
the repositories (local directories or remote URLs) where resources live and
describes how to combine and cache them.
This page documents the structure of the configuration file, how the CLI tools locate it, the available repository types, and resource caching. For an introduction to GRRs and the resources they hold, see Genomic resources and repositories.
How the CLI tools find the configuration
Every CLI tool resolves the GRR it should use from the first of the following sources that is available (highest precedence first):
1. The -g/--grr command-line option: a path to a GRR configuration file. This overrides everything else for that invocation:

   $ grr_browse -g /path/to/my_grr_definition.yaml
   $ annotate_tabular -g /path/to/my_grr_definition.yaml input.tsv pipeline.yaml

2. The --grr-directory command-line option: a shortcut that uses a single local directory directly as a GRR, without writing a configuration file at all. It is equivalent to a one-entry configuration with a directory repository (see below):

   $ grr_browse --grr-directory /path/to/my_grr

3. The GRR_DEFINITION_FILE environment variable: a path to a GRR configuration file, used when no command-line option is given:

   $ export GRR_DEFINITION_FILE=/path/to/my_grr_definition.yaml
   $ grr_browse

4. The ~/.grr_definition.yaml file in your home directory: the default configuration file. If it exists and none of the options above are set, the tools use it automatically. This is the most common way to configure your default GRRs once and have every tool pick them up.

5. The built-in default: if none of the above is present, GAIn falls back to the public IossifovLab GRR. This is equivalent to the following configuration:

   id: main-GRR
   type: http
   url: https://grr.iossifovlab.com
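The lookup order above can be sketched in Python. This is only an illustration of the documented precedence, not GAIn's actual implementation: the function name resolve_grr_source, its parameters, and its return shape are hypothetical. Only the option names, the environment variable, the default file name, and the built-in default are taken from this page.

```python
import os
from pathlib import Path

# Built-in default, as quoted on this page.
BUILTIN_DEFAULT = {
    "id": "main-GRR",
    "type": "http",
    "url": "https://grr.iossifovlab.com",
}


def resolve_grr_source(grr_file=None, grr_directory=None, environ=None, home=None):
    """Return (kind, value) for the first available source, highest precedence first.

    Hypothetical helper: it mirrors the documented order only and does not
    read or validate the configuration file it points to.
    """
    environ = os.environ if environ is None else environ
    home = Path.home() if home is None else Path(home)

    if grr_file:                                    # 1. -g/--grr
        return ("file", str(grr_file))
    if grr_directory:                               # 2. --grr-directory
        return ("directory", str(grr_directory))
    env_path = environ.get("GRR_DEFINITION_FILE")   # 3. environment variable
    if env_path:
        return ("file", env_path)
    default_file = home / ".grr_definition.yaml"    # 4. default file in the home directory
    if default_file.exists():
        return ("file", str(default_file))
    return ("builtin", dict(BUILTIN_DEFAULT))       # 5. built-in default
```

Passing environ and home explicitly keeps the sketch easy to exercise without touching the real environment or home directory.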
Configuration file structure
A GRR configuration file is a YAML mapping describing a single repository. Every
repository has a required type and an id, plus additional fields that
depend on the type. A repository can be a real repository (a local directory
or a remote URL) or a group that combines several child repositories.
Common fields
- id (string): Identifier for the repository. Used in log messages and to refer to the repository.
- type (string, required): One of group, directory, url, http, s3, or embedded (see below).
- cache_dir (string, optional): Directory used to cache downloaded resources locally. May be added to any repository type, including a group (see Resource caching).
Repository types
directory - a local repository on disk

A GRR stored in a local directory.

- directory (string, required): Absolute path to a local directory containing the resources. A relative path is rejected.

id: My_First_GRR
type: directory
directory: /home/user/grrs/My_First_GRR

The aliases dir and file are accepted as synonyms of directory.

url - a remote repository (HTTP, HTTPS, or S3)

The general-purpose remote repository type. The scheme of the URL selects the protocol; http, https, and s3 are supported.

- url (string, required): Base URL of the remote repository.

id: main-GRR
type: url
url: https://grr.iossifovlab.com

http - a remote HTTP(S) repository

Like url but restricted to http/https URLs.

- url (string, required): Base URL of the remote repository.

s3 - a remote S3 repository

Like url but restricted to s3 URLs.

- url (string, required): s3:// URL of the remote repository.

embedded - an in-memory repository

A repository whose resources are defined inline in the configuration. This is used mainly for testing and small examples.

- content (mapping, required): Nested dictionary describing files and directories. Directory values are nested mappings; file values are file contents.

The alias memory is accepted as a synonym of embedded.

group - a collection of repositories

Combines several repositories and searches them in the order they appear in children. When a resource ID is requested, the group queries each child in turn and returns the first match. Groups can be nested.

- children (list, required): A list of repository configurations (each a real repository or another group).
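The per-type rules above (required field, accepted URL schemes, absolute directory path, aliases) can be summarised as a small checker. The sketch below is an assumption-laden illustration, not GAIn's validation code: check_repository and the three lookup tables are hypothetical names, the aliases are read as alternative values of the type field, and paths are checked with POSIX rules.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Assumption: the aliases are alternative spellings of the `type` value.
TYPE_ALIASES = {"dir": "directory", "file": "directory", "memory": "embedded"}

# Required type-specific field for each repository type described on this page.
REQUIRED_FIELD = {
    "directory": "directory",
    "url": "url",
    "http": "url",
    "s3": "url",
    "embedded": "content",
    "group": "children",
}

# URL schemes each remote type accepts.
ALLOWED_SCHEMES = {
    "url": {"http", "https", "s3"},
    "http": {"http", "https"},
    "s3": {"s3"},
}


def check_repository(config):
    """Return a list of problems found in one repository mapping (empty if none)."""
    raw_type = config.get("type")
    repo_type = TYPE_ALIASES.get(raw_type, raw_type)
    if repo_type not in REQUIRED_FIELD:
        return [f"unknown or missing type: {raw_type!r}"]

    field = REQUIRED_FIELD[repo_type]
    if field not in config:
        return [f"{repo_type} repository requires the {field!r} field"]

    problems = []
    if repo_type == "directory":
        # POSIX-style check; a relative path is rejected.
        if not PurePosixPath(config["directory"]).is_absolute():
            problems.append("directory must be an absolute path")
    elif repo_type in ALLOWED_SCHEMES:
        scheme = urlsplit(config["url"]).scheme
        if scheme not in ALLOWED_SCHEMES[repo_type]:
            problems.append(f"{repo_type} repository does not accept scheme {scheme!r}")
    elif repo_type == "group":
        # Groups can be nested, so check each child recursively.
        for child in config["children"]:
            problems.extend(check_repository(child))
    return problems
```

A checker like this could be run over a parsed configuration before handing it to the tools; it reports problems instead of raising so that every issue in a nested group surfaces in one pass.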
Search order
Within a group, repositories are searched top to bottom and the first
repository that contains the requested resource wins. Order your children
accordingly — for example, list a local directory before a remote repository if
you want your local copies to take precedence, or after it if the remote should
be authoritative.
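The first-match rule can be illustrated with a short Python sketch. It is a simplified model rather than GAIn code: find_first_match is a hypothetical function, the resources set stands in for whatever a real repository contains, and the resource IDs are invented for the example.

```python
def find_first_match(repo, resource_id):
    """Return the id of the first repository, in search order, that holds resource_id."""
    if repo["type"] == "group":
        # Children are searched top to bottom; nested groups are searched depth-first.
        for child in repo["children"]:
            found = find_first_match(child, resource_id)
            if found is not None:
                return found
        return None
    # "resources" is a stand-in for the real contents of a repository.
    return repo["id"] if resource_id in repo["resources"] else None


local = {"id": "My_First_GRR", "type": "directory",
         "resources": {"example/resource_a"}}
remote = {"id": "main-GRR", "type": "url",
          "resources": {"example/resource_a", "example/resource_b"}}

local_first = {"id": "my_GRRs", "type": "group", "children": [local, remote]}
remote_first = {"id": "my_GRRs", "type": "group", "children": [remote, local]}

print(find_first_match(local_first, "example/resource_a"))   # My_First_GRR
print(find_first_match(remote_first, "example/resource_a"))  # main-GRR
```

Both groups contain the same repositories; only the order of children decides which copy of example/resource_a is used.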
Resource caching
Many genomic resources are large (often hundreds of MB to many GB), and
repeatedly downloading or streaming them from a remote GRR can be slow and
network-dependent. Adding a cache_dir to a repository tells GAIn to cache
resources locally before using them.
With caching enabled, the first use of a resource may take longer while GAIn
downloads it into cache_dir; after that, GAIn reuses the cached copy, which
is typically much faster and avoids repeated network transfers. The tradeoff is
disk usage, so choose a cache_dir location with enough capacity.
cache_dir can be attached to any repository, including a group. When
attached to a group, it caches every resource served by that group — a
convenient way to put a single cache in front of several remote repositories at
once.
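The download-once, reuse-afterwards behaviour can be sketched as a thin wrapper around any fetch function. This is a minimal model under stated assumptions, not a description of how GAIn lays out or manages its cache: CachedRepository, the fetch callable, and the mirrored path layout inside cache_dir are all hypothetical.

```python
from pathlib import Path


class CachedRepository:
    """Serve files from cache_dir, calling fetch only for files not cached yet."""

    def __init__(self, fetch, cache_dir):
        # fetch: callable taking a relative resource path and returning bytes;
        # it stands in for a download from one remote repository or a group.
        self.fetch = fetch
        self.cache_dir = Path(cache_dir)

    def read(self, resource_path):
        cached = self.cache_dir / resource_path
        if not cached.exists():
            # First use: slower, because the content is fetched and stored.
            cached.parent.mkdir(parents=True, exist_ok=True)
            cached.write_bytes(self.fetch(resource_path))
        # Later uses: read the local copy, no network transfer.
        return cached.read_bytes()
```

Because the wrapper only needs a fetch callable, the same cache can sit in front of a single repository or in front of a whole group, which mirrors attaching cache_dir at group level.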
A complete annotated example
The configuration below covers most of the features described above. It defines
a top-level group, my_GRRs, with two children searched in order:
- remote_GRRs: a nested group of two remote (url) repositories that share a single cache directory. Because cache_dir is set on the group, resources from both remote repositories are cached under remote_grr_cache.
- My_First_GRR: a local directory repository.
When a resource ID is requested, GAIn first searches main-GRR, then
GRR-ENCODE (both via the shared cache), and finally the local
My_First_GRR; the first match is returned.
type: group
id: "my_GRRs"
children:
- type: group
  id: "remote_GRRs"
  cache_dir: "<path_to_cache>/remote_grr_cache" # caches both remote GRRs below
  children:
  - id: "main-GRR"
    type: "url"
    url: "https://grr.iossifovlab.com"
  - id: "GRR-ENCODE"
    type: "url"
    url: "https://grr-encode.iossifovlab.com"
- id: "My_First_GRR"
  type: "directory"
  directory: "<path_to_My_First_GRR>/My_First_GRR" # must be an absolute path
To use this configuration, save it as ~/.grr_definition.yaml (so every tool
picks it up automatically), point GRR_DEFINITION_FILE at it, or pass it
explicitly with -g:
$ grr_browse -g my_grr_definition.yaml