Genomic resources and resource repositories
The GPF system uses genomic resources such as reference genomes, gene models, genomic scores, etc. These resources are provided by resource repositories which can be accessed remotely or locally. The system can use multiple repositories at a time.
Genomic resources and resource repositories are fundamentally a collection of directories and files with special YAML configurations.
This documentation explains which genomic resources are available and how they can be configured, and how resource repositories are configured and discovered by the system. It ends with a short tutorial on creating a local repository with a custom resource.
Genomic resources
A genomic resource is a directory containing a special genomic_resource.yaml configuration and an arbitrary number of files.
GPF also creates additional files (the .MANIFEST file and the .grr subdirectory), which it uses internally to track changes to the resource.
genomic_resource.yaml
type: <genomic resource type>
# ...
meta:
  description: <resource description>
  summary: <resource summary>
  labels:
    <custom label>: <custom label value>
    # ...
This is the configuration file for a genomic resource. Directories containing this file will be treated as genomic resources by the system.
It must be named genomic_resource.yaml, as this is how the system will search for it.
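As an illustration of the naming rule above, here is a minimal sketch that scans a directory tree for genomic_resource.yaml files. It is not GPF's own discovery code; the helper name find_resource_ids is hypothetical, and it assumes that a resource ID is the resource directory path relative to the repository root.

```python
import tempfile
from pathlib import Path

CONFIG_NAME = "genomic_resource.yaml"


def find_resource_ids(repo_root: Path) -> list[str]:
    """Return the relative paths of all directories that hold a config file."""
    ids = []
    for config in repo_root.rglob(CONFIG_NAME):
        # Assumption: the resource ID is the directory path relative to the root.
        ids.append(config.parent.relative_to(repo_root).as_posix())
    return sorted(ids)


# Build a throwaway directory layout to demonstrate the scan.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "hg38/scores/score9").mkdir(parents=True)
    (root / "hg38/scores/score9" / CONFIG_NAME).touch()
    # A directory without the config file is not reported.
    (root / "hg38/scores/not_a_resource").mkdir(parents=True)
    found = find_resource_ids(root)
    print(found)
```

Only the directory that contains the configuration file is reported as a resource.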
Below are some of the common fields that can be found in every config. Depending on the resource type, other fields may be present.
Field | Description |
---|---|
type | String. Sets the type of the resource. |
meta | Subsection. Contains fields with information about the resource. |
labels | Dictionary. Can contain arbitrary key/values. |
Below are the fields in the meta section:
Field | Description |
---|---|
description | String. Description of the resource. |
summary | String. Short summary of the resource. |
Types of genomic resources and their configurations
Genomic scores
Field | Description |
---|---|
type | One of |
table | Subsection. Describes the file containing the scores, what columns/fields are present in it, etc. |
scores | List of dictionaries that describes each score column available in the resource. |
default_annotation | Subsection. The default annotation configuration to use with this resource. |
Gene models
Field | Description |
---|---|
type | |
filename | String. Path to the models file. Relative to the resource directory. |
format | String. Sets the expected format of the gene models. One of |
Reference genome
Field | Description |
---|---|
type | |
filename | String. Path to the genome file. Relative to the resource directory. |
PARS | Subsection. Configures the pseudoautosomal regions of the genome. |
chrom_prefix | String. Configures the prefix contig names are expected to have in the genome. |
The format for the PARS subsection is as follows:
PARS:
  "X":
    - "chrX:10000-2781479"
    - "chrX:155701382-156030895"
  "Y":
    - "chrY:10000-2781479"
    - "chrY:56887902-57217415"
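To make the region format concrete, the following sketch parses the chrom:start-end strings from the example above and checks whether a position falls inside one of them. It is an illustration only, not GPF's implementation; the helper names parse_region and in_par are hypothetical, and treating both ends of a region as inclusive is an assumption.

```python
import re

# The PARS subsection from the example above, as a plain Python dictionary.
PARS = {
    "X": ["chrX:10000-2781479", "chrX:155701382-156030895"],
    "Y": ["chrY:10000-2781479", "chrY:56887902-57217415"],
}

REGION_RE = re.compile(r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_region(region: str) -> tuple[str, int, int]:
    """Split a 'chrom:start-end' string into its three parts."""
    match = REGION_RE.match(region)
    if match is None:
        raise ValueError(f"malformed region: {region!r}")
    return match["chrom"], int(match["start"]), int(match["end"])


def in_par(chrom: str, pos: int) -> bool:
    """Check whether a position lies in any configured region."""
    for regions in PARS.values():
        for region in regions:
            r_chrom, start, end = parse_region(region)
            # Assumption: both region ends are inclusive.
            if chrom == r_chrom and start <= pos <= end:
                return True
    return False


print(parse_region("chrX:10000-2781479"))
print(in_par("chrX", 20000), in_par("chrX", 5000000))
```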
Liftover chain
Field | Description |
---|---|
type | |
filename | String. Path to the chain file. Relative to the resource directory. |
Annotation pipeline
Field | Description |
---|---|
type | |
filename | String. Path to the annotation configuration file. Relative to the resource directory. |
Histograms and statistics
Each resource type defines a set of statistics that can be calculated for the resource. These statistics are calculated by the grr_manage command line tool and stored in the statistics subdirectory of the resource directory.
For genomic and gene score resources, the grr_manage command line tool calculates and draws histograms for each of the scores defined in the resource.
This section describes the common behavior for calculating and drawing histograms for genomic and gene score resources. Other statistics are specific to the resource type and should be described in the resource type documentation.
Histograms
Histograms are calculated for each of the scores defined in a gene score or genomic score resource. GPF supports three types of histograms:
NumberHistogram - supported for scores of type int and float. By default the histogram is calculated with 100 bins and is linear on both axes.
CategoricalHistogram - supported for scores of type str and int. This is a histogram that shows the distribution of the unique values in the score. It is supported only for scores with fewer than 100 unique values.
NullHistogram - this histogram type defines a missing histogram. It is used when calculating a histogram is not possible or does not make sense.
Number Histograms Configuration
For each score defined in the genomic_resource.yaml file of a genomic or gene score resource, a histogram configuration can be defined. The number histogram configuration supports the following fields:
type - the type of the histogram. This should be set to number.
number_of_bins - the number of bins in the histogram. By default this is set to 100.
view_range - the range of values that are shown in the histogram. This range could differ from the actual range of the score values. This is useful for adjusting the histogram view.
y_log_scale - if set to True, the y axis of the histogram will be logarithmic.
x_log_scale - if set to True, the x axis of the histogram will be logarithmic.
x_min_log - when x_log_scale is set to True, this value defines the minimum value of the x axis.
plot_function - user defined plot function. When the default plot function is not suitable for the score, a user defined function can be used.
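To illustrate how number_of_bins and view_range interact, here is a minimal sketch of linear binning. It is not the code GPF uses; the function number_histogram is hypothetical, and skipping values that fall outside the view range is an assumption made only for this example.

```python
def number_histogram(values, view_min, view_max, number_of_bins=100):
    """Count values into equal-width bins that span [view_min, view_max]."""
    width = (view_max - view_min) / number_of_bins
    counts = [0] * number_of_bins
    for value in values:
        if value < view_min or value > view_max:
            # Assumption: values outside the view range are not counted.
            continue
        # The maximum value goes into the last bin instead of a new one.
        index = min(int((value - view_min) / width), number_of_bins - 1)
        counts[index] += 1
    return counts


# Three bins over the view range used in the phyloP100way example below.
counts = number_histogram(
    [-25.0, -20.0, -5.0, 0.0, 0.1, 10.0], -20.0, 10.0, number_of_bins=3)
print(counts)
```

With three bins of width 10, the value -25.0 is dropped and the remaining five values are distributed over the bins.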
Example 1: Number histogram configuration
Here is a full example of a number histogram configuration coming from the hg38/score/phyloP100way genomic score resource:
type: position_score
table:
  filename: hg38.phyloP100way.bw
  header_mode: none # this makes no sense and should be removed
# score values
scores:
- id: phyloP100way
  type: float
  desc: "The score is a number that reflects the conservation at a position."
  large_values_desc: "more conserved"
  small_values_desc: "less conserved"
  index: 3 # this makes no sense and should be removed
  histogram:
    type: number
    number_of_bins: 100
    view_range:
      min: -20.0
      max: 10.0
    y_log_scale: True
Example 2: Number histogram configuration
Here is a full example of a number histogram configuration coming from the hg38/variant_frequencies/gnomAD_v3 genomic score resource:
type: allele_score
table:
  filename: gnomad.genomes.r3.0.extract.tsv.gz
  format: tabix
  chrom:
    name: CHROM
  pos_begin:
    name: POS
  pos_end:
    name: POS
  reference:
    name: REF
  alternative:
    name: ALT
scores:
...
- id: AF
  name: AF
  type: float
  desc: "Alternative allele frequency in the all gnomAD v3.0 genome samples."
  histogram:
    type: number
    number_of_bins: 126
    view_range:
      min: 0.0
      max: 1.0
    y_log_scale: True
    x_log_scale: True
    x_min_log: 0.00001
...
Categorical Histograms Configuration
Categorical histograms are suitable for scores that have a limited number (fewer than 100) of unique values. By default, the values are displayed in the order of their frequency and only the top 20 values are shown in the histogram. Other values are grouped into the Other category.
The categorical histogram configuration supports the following fields:
type - the type of the histogram. This should be set to categorical.
y_log_scale - if set to True, the y axis of the histogram will be logarithmic.
displayed_values_count - the number of unique values that will be displayed in the histogram. The default value for this field is 20. The rest of the values are grouped into the Other category.
displayed_values_percent - the percentage of the total mass of unique values that will be displayed. Other values are grouped into the Other category. Only one of displayed_values_count and displayed_values_percent can be set.
value_order - the order in which the unique values are displayed in the histogram.
plot_function - user defined plot function. When the default plot function is not suitable for the score, a user defined function can be used.
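The grouping into the Other category can be sketched as follows. This is an illustration rather than GPF's implementation: the function group_values is hypothetical, and the reading of displayed_values_percent used here (keep the most frequent values until the requested share of all counts is covered) is an assumption about what "percentage of the total mass" means.

```python
from collections import Counter


def group_values(counts, displayed_values_count=None,
                 displayed_values_percent=None):
    """Keep the most frequent values and sum the rest into 'Other'."""
    if displayed_values_count is not None and displayed_values_percent is not None:
        raise ValueError(
            "only one of displayed_values_count and "
            "displayed_values_percent can be set")
    if displayed_values_count is None and displayed_values_percent is None:
        displayed_values_count = 20  # the documented default
    ordered = Counter(counts).most_common()  # most frequent first
    total = sum(count for _, count in ordered)
    shown, other, covered = {}, 0, 0
    for value, count in ordered:
        if displayed_values_count is not None:
            keep = len(shown) < displayed_values_count
        else:
            # Assumed reading: keep values until the requested share is covered.
            keep = covered * 100.0 < displayed_values_percent * total
        if keep:
            shown[value] = count
            covered += count
        else:
            other += count
    if other:
        shown["Other"] = other
    return shown


example = {"A": 50, "B": 30, "C": 15, "D": 5}
by_count = group_values(example, displayed_values_count=2)
by_percent = group_values(example, displayed_values_percent=85.0)
print(by_count)
print(by_percent)
```

Passing both options raises an error, which mirrors the rule that only one of the two fields can be set.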
Example 1: Categorical histogram configuration
Here is a full example of a number histogram and a categorical histogram configuration coming from the hg38/scores/AlphaMissense genomic score resource:
type: np_score
table:
  filename: AlphaMissense_hg38_modified.tsv.gz
  format: tabix
  chrom:
    name: chrom
  pos_begin:
    name: pos
  pos_end:
    name: pos
  reference:
    name: ref
  alternative:
    name: alt
scores:
- id: am_pathogenicity
  name: am_pathogenicity
  type: float
  desc: |
    AlphaMissense Pathogenicity score is a deleteriousness score for missense variants
  large_values_desc: "more pathogenic"
  small_values_desc: "less pathogenic"
  histogram:
    type: number
    number_of_bins: 100
    view_range:
      min: 0.0
      max: 1.0
    y_log_scale: True
- id: am_class
  name: am_class
  type: str
  desc: |
    AlphaMissense Class is a deleteriousness category for missense variants
  histogram:
    type: categorical
    y_log_scale: True
Example 2: Categorical histogram configuration
Here is an example of a categorical histogram configuration that shows the use of the plot_function, displayed_values_count, and displayed_values_percent fields.
Note that plot_function uses the following format: <python module>:<python function>. The path to the Python module should be relative to the resource directory.
type: allele_score
table:
  filename: clinvar_20221105_chr.vcf.gz
  index_filename: clinvar_20221105_chr.vcf.gz.tbi
scores:
- id: CLNSIG
  name: CLNSIG
  type: str
  desc: |
    Clinical significance for this single variant; multiple values
    are separated by a vertical bar
  histogram:
    type: categorical
    y_log_scale: True
    plot_function: "clinvar_plots.py:plot_clnsig"
- id: CLNREVSTAT
  name: CLNREVSTAT
  type: str
  desc: |
    ClinVar review status for the Variation ID
  histogram:
    type: categorical
    y_log_scale: True
    displayed_values_count: 35
- id: CLNVC
  name: CLNVC
  type: str
  desc: |
    Variant type
  histogram:
    type: categorical
    y_log_scale: True
    displayed_values_percent: 85.0
Here is the content of the clinvar_plots.py file:
from typing import IO

from dae.genomic_resources.histogram import CategoricalHistogram
import matplotlib
import matplotlib.pyplot as plt

matplotlib.use("agg")


def plot_clnsig(
    outfile: IO,
    histogram: CategoricalHistogram,
    xlabel: str,
    _small_values_description: str | None = None,
    _large_values_description: str | None = None,
) -> None:
    """Plot histogram and save it into outfile."""
    # pylint: disable=import-outside-toplevel
    # Sort the unique values by descending count.
    values = list(sorted(histogram.raw_values.items(), key=lambda x: -x[1]))
    # Skip combined values, which contain a vertical bar.
    values = [v for v in values if "|" not in v[0]]
    labels = [v[0] for v in values]
    counts = [v[1] for v in values]
    plt.figure(figsize=(40, 80), tight_layout=True)
    _, ax = plt.subplots()
    ax.bar(
        x=labels,
        height=counts,
        tick_label=[str(v) for v in labels],
        log=histogram.config.y_log_scale,
        align="center",
    )
    plt.xlabel(f"\n{xlabel}")
    plt.ylabel("count")
    plt.tick_params(axis="x", labelrotation=90, direction="out")
    plt.tight_layout()
    plt.savefig(outfile)
    plt.clf()
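The <python module>:<python function> format described above could be resolved roughly as in the sketch below. This is not how GPF is known to load plot functions; load_plot_function, demo_plots.py and plot_demo are hypothetical names used only to show the idea with the standard library.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_plot_function(resource_dir: Path, spec: str):
    """Resolve '<python module>:<python function>' relative to resource_dir."""
    module_path, _, function_name = spec.partition(":")
    if not module_path or not function_name:
        raise ValueError(
            f"expected '<python module>:<python function>', got {spec!r}")
    # Load the module straight from its file inside the resource directory.
    module_spec = importlib.util.spec_from_file_location(
        Path(module_path).stem, resource_dir / module_path)
    module = importlib.util.module_from_spec(module_spec)
    module_spec.loader.exec_module(module)
    return getattr(module, function_name)


# Demonstrate with a throwaway module instead of a real plotting function.
with tempfile.TemporaryDirectory() as tmp:
    resource_dir = Path(tmp)
    (resource_dir / "demo_plots.py").write_text(
        "def plot_demo():\n    return 'plotted'\n")
    plot = load_plot_function(resource_dir, "demo_plots.py:plot_demo")
    result = plot()
    print(result)
```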
Null Histograms Configuration
Null histograms are used when calculating a histogram is not possible or does not make sense. The null histogram configuration supports the following fields:
type - the type of the histogram. This should be set to null.
reason - the reason why the histogram is disabled. This field is required.
Example: Null histogram configuration
type: allele_score
table:
  filename: clinvar_20221105_chr.vcf.gz
  index_filename: clinvar_20221105_chr.vcf.gz.tbi
scores:
- id: RS
  name: RS
  type: str
  desc: dbSNP ID (i.e. rs number)
  histogram:
    type: null
    reason: "Histogram is not available for this score."
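The rules stated for the three histogram configurations can be collected into a small check. The sketch below is a hypothetical helper, not part of GPF, and it only covers the constraints spelled out in this section: a known type, at most one of the two displayed_values options, and a required reason for null histograms. It takes the type as a plain string; note that a generic YAML loader reads an unquoted null as an empty value rather than as the string null, and how GPF treats that case is not covered here.

```python
def validate_histogram_config(config: dict) -> list[str]:
    """Return a list of problems found in a histogram configuration."""
    errors = []
    hist_type = config.get("type")
    if hist_type not in ("number", "categorical", "null"):
        errors.append(f"unknown histogram type: {hist_type!r}")
    if (hist_type == "categorical"
            and "displayed_values_count" in config
            and "displayed_values_percent" in config):
        errors.append(
            "only one of displayed_values_count and "
            "displayed_values_percent can be set")
    if hist_type == "null" and not config.get("reason"):
        errors.append("null histogram requires a reason")
    return errors


print(validate_histogram_config({"type": "null"}))
print(validate_histogram_config(
    {"type": "null", "reason": "Histogram is not available for this score."}))
```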
Resource repositories
Resource repositories are collections of genomic resources hosted either locally or remotely.
Repository discovery
By default, the GPF system looks for a .grr_definition.yaml file in the home directory of your user.
Alternatively, the system will use a repository configuration file pointed to by the GRR_DEFINITION_FILE environment variable if it has been set.
Finally, most CLI tools that use GRRs have a --grr <filename> argument that overrides the defaults.
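The lookup order described above can be summarized in a short sketch. It reflects one reading of this section (the --grr argument first, then GRR_DEFINITION_FILE, then the file in the home directory) and is not GPF's actual code; resolve_grr_definition is a hypothetical helper, and returning None when nothing is found is a placeholder, since the fallback behavior is not described here.

```python
import os
from pathlib import Path


def resolve_grr_definition(cli_grr=None, environ=None, home=None):
    """Pick the GRR definition file following the documented precedence."""
    environ = os.environ if environ is None else environ
    home = Path.home() if home is None else Path(home)
    if cli_grr:
        # A --grr <filename> argument overrides the defaults.
        return Path(cli_grr)
    env_file = environ.get("GRR_DEFINITION_FILE")
    if env_file:
        return Path(env_file)
    default = home / ".grr_definition.yaml"
    # Placeholder: None stands for "no definition file found".
    return default if default.exists() else None


print(resolve_grr_definition("custom.yaml", {"GRR_DEFINITION_FILE": "env.yaml"}))
print(resolve_grr_definition(None, {"GRR_DEFINITION_FILE": "env.yaml"}))
```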
To configure the GRRs to be used by default for your user, you can create the file ~/.grr_definition.yaml. Here is an example of its contents:
id: "development"
type: group
children:
- id: "grr_local"
  type: "directory"
  directory: "~/my_grr"
- id: "default"
  type: "url"
  url: "https://grr.iossifovlab.com"
  cache_dir: "~/default_grr_cache"
Repository configuration
Field | Description |
---|---|
id | String. The id of the repository. |
type | String. One of |
children | List of repository configurations for |
url | String. URL of the remote repository for |
directory | String. Path to the directory of resources for |
content | Dictionary describing files and directories for |
cache_dir | String. Path to a directory in which the resources from this repository will be cached. |
directory
A local filesystem repository.
http
A remote HTTP repository.
url
A remote S3 repository.
embedded
An in-memory repository.
group
A group of a number of repositories.
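Because a group contains other repository configurations, a definition forms a small tree. The sketch below walks such a tree and lists the non-group repositories in the order they are declared. It uses the example definition from the repository discovery section written as a Python dictionary; leaf_repositories is a hypothetical helper, and the sketch makes no claim about how GPF itself orders or queries repositories.

```python
definition = {
    "id": "development",
    "type": "group",
    "children": [
        {"id": "grr_local", "type": "directory", "directory": "~/my_grr"},
        {
            "id": "default",
            "type": "url",
            "url": "https://grr.iossifovlab.com",
            "cache_dir": "~/default_grr_cache",
        },
    ],
}


def leaf_repositories(config):
    """Yield the non-group repository configurations in declaration order."""
    if config["type"] == "group":
        for child in config.get("children", []):
            # Groups may be nested, so recurse into each child.
            yield from leaf_repositories(child)
    else:
        yield config


leaf_ids = [repo["id"] for repo in leaf_repositories(definition)]
print(leaf_ids)
```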
Caching of repositories
When a repository is configured with a cache_dir option, it will cache resources locally before using them.
Using cached resources is significantly faster, but caching them takes some time the first time they are used, and the cached copies occupy substantial disk space.
Management of resources and repositories with CLI tools
The GPF system provides two CLI tools for management of genomic resources and repositories. Their usage is outlined below:
grr_manage
$ grr_manage --help
usage: grr_manage [-h] [--version] [--verbose]
{list,repo-init,repo-manifest,resource-manifest,repo-stats,resource-stats,repo-info,resource-info,repo-repair,resource-repair}
...
Genomic Resource Repository Management Tool
positional arguments:
{list,repo-init,repo-manifest,resource-manifest,repo-stats,resource-stats,repo-info,resource-info,repo-repair,resource-repair}
Command to execute
list List a GR Repo
repo-init Initialize a directory to turn it into a GRR
repo-manifest Create/update manifests for whole GRR
resource-manifest Create/update manifests for a resource
repo-stats Build the statistics for a resource
resource-stats Build the statistics for a resource
repo-info Build the index.html for the whole GRR
resource-info Build the index.html for the specific resource
repo-repair Update/rebuild manifest and histograms whole GRR
resource-repair Update/rebuild manifest and histograms for a resource
options:
-h, --help show this help message and exit
--version Prints GPF version and exists.
--verbose, -v, -V
grr_browse
$ grr_browse --help
usage: grr_browse [-h] [--version] [--verbose] [-g GRR] [--bytes]
Genomic Resource Repository Browse Tool
options:
-h, --help show this help message and exit
--version Prints GPF version and exists.
--verbose, -v, -V
--bytes Print the resource size in bytes
Repository/Resource:
-g GRR, --grr GRR path to GRR definition file.
Tutorial: Create a local repository with a custom resource
A genomic resource is a set of files stored in a directory. To make a given directory a genomic resource, it should contain a genomic_resource.yaml file.
A genomic resources repository is a directory that contains genomic resources. To make a given directory into a repository, it should have a .CONTENTS file.
Create an empty GRR
To create an empty GRR, first create an empty directory. For example, let us create an empty directory named grr_test, enter that directory, and run the grr_manage repo-init command:
mkdir grr_test
cd grr_test
grr_manage repo-init
After that the directory should contain an empty .CONTENTS file:
ls -a
. .. .CONTENTS
If we try to list all resources in this repository we should get an empty list:
grr_manage list
Create an empty genomic resource
Let us create our first genomic resource. Create a directory hg38/scores/score9 inside the grr_test repository and create an empty genomic_resource.yaml file inside that directory:
mkdir -p hg38/scores/score9
cd hg38/scores/score9
touch genomic_resource.yaml
This will create an empty genomic resource in our repository with ID hg38/scores/score9.
If we list the resources in our repository we would get:
grr_manage list
working with repository: .../grr_test
Basic 0 1 0 hg38/scores/score9
When we create or change a resource we need to repair the repository:
grr_manage repo-repair
This command will create a .MANIFEST file for our new resource hg38/scores/score9 and will update the repository .CONTENTS to include the resource.
Add genomic score resources
Add all score resource files (the score file and its Tabix index) inside the created directory hg38/scores/score9. Let’s say these files are:
score9.tsv.gz
score9.tsv.gz.tbi
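If you do not have a score file at hand, a small tab-separated file can be generated for experimenting. The sketch below writes a few invented rows with the columns chrom, start, end and score9, which are the column names used in the configuration that follows; the values are made up for illustration. The resulting plain-text file would still need to be compressed with bgzip and indexed with tabix to obtain the two files named above.

```python
import csv
import tempfile
from pathlib import Path

# Invented example rows: chromosome, start, end, score value.
rows = [
    ("chr1", 10001, 10010, 0.25),
    ("chr1", 10011, 10020, 1.5),
    ("chr2", 500, 600, -0.75),
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "score9.tsv"
    with path.open("w", newline="") as out:
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        # The header line gives the columns their names.
        writer.writerow(["chrom", "start", "end", "score9"])
        writer.writerows(rows)
    lines = path.read_text().splitlines()
    print(lines[0])
    print(lines[1])
```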
Configure the resource hg38/scores/score9. To this end, put the position score configuration into the genomic_resource.yaml file:
type: position_score
table:
  filename: score9.tsv.gz
  format: tabix
  # defined by score_type
  chrom:
    name: chrom
  pos_begin:
    name: start
  pos_end:
    name: end
# score values
scores:
- id: score9
  type: float
  desc: "score9"
  index: 3
histograms:
- score: score9
  bins: 100
  y_scale: "log"
  x_scale: "linear"
default_annotation:
  attributes:
  - source: score9
    destination: score9
meta: |
  ## score9
  TODO
When ready, run grr_manage resource-repair from inside the resource directory:
cd hg38/scores/score9
grr_manage resource-repair
This command is going to calculate histograms for the score (if they are configured) and create or update the resource manifest.
Once the resource is ready we need to regenerate the repository contents:
grr_manage repo-repair