Genomic resources and resource repositories
The GPF system uses genomic resources such as reference genomes, gene models, genomic scores, etc. These resources are provided by resource repositories which can be accessed remotely or locally. The system can use multiple repositories at a time.
Genomic resources and resource repositories are fundamentally a collection of directories and files with special YAML configurations.
This documentation explains which genomic resources are available and how they are configured, describes how resource repositories are configured and discovered by the system, and ends with a short tutorial on creating a local repository with a custom resource.
We also prepared an extensive demo to help individuals get started with their own GRR (https://github.com/iossifovlab/mini_grr).
Genomic resources
A genomic resource is a directory containing a special
genomic_resource.yaml configuration and an arbitrary number of files.
GPF also creates additional files (.MANIFEST, the .grr
subdirectory), which it uses internally to track changes to the resource.
genomic_resource.yaml
type: <genomic resource type>
# ...
meta:
description: <resource description>
summary: <resource summary>
labels:
<custom label>: <custom label value>
# ...
This is the configuration file for a genomic resource. Directories containing
this file will be treated as genomic resources by the system.
It must be named genomic_resource.yaml, as this is how the system will
search for it.
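The discovery rule above can be illustrated with a minimal sketch. It is not GPF's own implementation; the function name find_resources is hypothetical, and the sketch assumes that a resource ID is simply the directory path relative to the repository root.

```python
import tempfile
from pathlib import Path

CONFIG_NAME = "genomic_resource.yaml"


def find_resources(repo_root: Path) -> list[str]:
    """Return the IDs of all resources found under repo_root.

    Assumption: a resource ID is the directory path relative to the
    repository root, e.g. "hg38/scores/score9".
    """
    return sorted(
        config.parent.relative_to(repo_root).as_posix()
        for config in repo_root.rglob(CONFIG_NAME)
    )


if __name__ == "__main__":
    # Build a throwaway directory tree to demonstrate the scan.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        resource_dir = root / "hg38" / "scores" / "score9"
        resource_dir.mkdir(parents=True)
        (resource_dir / CONFIG_NAME).touch()
        (root / "notes").mkdir()  # no config file, so not a resource
        print(find_resources(root))
```

Only directories that directly contain a file named genomic_resource.yaml are reported; a file with any other name is ignored.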
Below are some of the common fields that can be found in every config. Depending on the resource type, other fields may be present.
| Field | Description |
|---|---|
| type | String. Sets the type of the resource. |
| meta | Subsection. Contains fields with information about the resource. |
| labels | Dictionary. Can contain arbitrary key/values. |
Below are the fields in the meta section:
| Field | Description |
|---|---|
| description | String. Description of the resource. |
| summary | String. Short summary of the resource. |
Types of genomic resources and their configurations
Genomic scores
| Field | Description |
|---|---|
| type | One of |
| table | Subsection. Describes the file containing the scores, what columns/fields are present in it, etc. |
| scores | List of dictionaries that describes each score column available in the resource. |
| default_annotation | Subsection. The default annotation configuration to use with this resource. |
Gene models
| Field | Description |
|---|---|
| type | |
| filename | String. Path to the models file. Relative to the resource directory. |
| format | String. Sets the expected format of the gene models. One of |
Reference genome
| Field | Description |
|---|---|
| type | |
| filename | |
| PARS | |
| chrom_prefix | |
The format for the PARS subsection is as follows:
PARS:
  "X":
    - "chrX:10000-2781479"
    - "chrX:155701382-156030895"
  "Y":
    - "chrY:10000-2781479"
    - "chrY:56887902-57217415"
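To make the region format concrete, here is a small sketch that parses the region strings from the example above. It is only an illustration: the helper names parse_region and in_par are hypothetical, and treating both region bounds as inclusive is an assumption, not something stated by GPF.

```python
import re

# The PARS subsection from the example, as a plain Python dictionary.
PARS = {
    "X": ["chrX:10000-2781479", "chrX:155701382-156030895"],
    "Y": ["chrY:10000-2781479", "chrY:56887902-57217415"],
}

REGION_RE = re.compile(r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_region(region: str) -> tuple[str, int, int]:
    """Split "<chrom>:<start>-<end>" into (chrom, start, end)."""
    match = REGION_RE.match(region)
    if match is None:
        raise ValueError(f"malformed region: {region!r}")
    start, end = int(match["start"]), int(match["end"])
    if start > end:
        raise ValueError(f"region start is after region end: {region!r}")
    return match["chrom"], start, end


def in_par(pars: dict[str, list[str]], chrom: str, pos: int) -> bool:
    """Check whether a position lies in any configured region.

    Assumption: both bounds of a region are inclusive.
    """
    return any(
        region_chrom == chrom and start <= pos <= end
        for regions in pars.values()
        for region_chrom, start, end in map(parse_region, regions)
    )


if __name__ == "__main__":
    print(parse_region("chrX:10000-2781479"))
    print(in_par(PARS, "chrY", 57000000))
```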
Liftover chain
| Field | Description |
|---|---|
| type | |
| filename | |
Annotation pipeline
| Field | Description |
|---|---|
| type | |
| filename | |
Histograms and statistics
Each resource type defines a set of statistics that can be calculated for the
resource. These statistics are calculated by the grr_manage command line
tool and stored in the resource directory under the statistics subdirectory.
For genomic and gene score resources the grr_manage command line tool
calculates and draws histograms for each of the scores defined in the
resource.
This section describes the common behavior for calculating and drawing histograms for genomic and gene score resources. Other statistics are specific to the resource type and should be described in the resource type documentation.
Histograms
Histograms are calculated for each of the scores defined in a gene score or genomic score resource. GPF supports three types of histograms:
- NumberHistogram - supported for scores of type int and float. By default the histogram is calculated with 100 bins and is linear on both axes.
- CategoricalHistogram - supported for scores of type str and int. This histogram shows the distribution of the unique values in the score. It is supported only for scores with fewer than 100 unique values.
- NullHistogram - this histogram type defines a missing histogram. It is used when calculating a histogram is not possible or does not make sense.
Number Histograms Configuration
For each score defined in the genomic_resource.yaml file of a genomic or gene
score resource, a histogram configuration can be defined. The
number histogram configuration supports the following fields:
- type - the type of the histogram. This should be set to number.
- number_of_bins - the number of bins in the histogram. By default this is set to 100.
- view_range - the range of values that are shown in the histogram. This range can differ from the actual range of the score values, which is useful for adjusting the histogram view.
- y_log_scale - if set to True, the y axis of the histogram will be logarithmic.
- x_log_scale - if set to True, the x axis of the histogram will be logarithmic.
- x_min_log - when x_log_scale is set to True, this value defines the minimum value of the x axis.
- plot_function - user defined plot function. When the default plot function is not suitable for the score, a user defined function can be used.
Example 1: Number histogram configuration
Here is a full example of a number histogram configuration coming from the hg38/score/phyloP100way genomic score resource:
type: position_score
table:
  filename: hg38.phyloP100way.bw
  header_mode: none # this makes no sense and should be removed
# score values
scores:
  - id: phyloP100way
    type: float
    desc: "The score is a number that reflects the conservation at a position."
    large_values_desc: "more conserved"
    small_values_desc: "less conserved"
    index: 3 # this makes no sense and should be removed
    histogram:
      type: number
      number_of_bins: 100
      view_range:
        min: -20.0
        max: 10.0
      y_log_scale: True
Example 2: Number histogram configuration
Here is a full example of a number histogram configuration coming from the hg38/variant_frequencies/gnomAD_v3 genomic score resource:
type: allele_score
table:
  filename: gnomad.genomes.r3.0.extract.tsv.gz
  format: tabix
  chrom:
    name: CHROM
  pos_begin:
    name: POS
  pos_end:
    name: POS
  reference:
    name: REF
  alternative:
    name: ALT
scores:
  ...
  - id: AF
    name: AF
    type: float
    desc: "Alternative allele frequency in the all gnomAD v3.0 genome samples."
    histogram:
      type: number
      number_of_bins: 126
      view_range:
        min: 0.0
        max: 1.0
      y_log_scale: True
      x_log_scale: True
      x_min_log: 0.00001
  ...
Categorical Histograms Configuration
Categorical histograms are suitable for scores that have a limited number
(fewer than 100) of unique values. By default the values are displayed
in the order of their frequency and only the top 20 values are displayed
in the histogram. Other values are grouped into the Other category.
The categorical histogram configuration supports the following fields:
- type - the type of the histogram. This should be set to categorical.
- y_log_scale - if set to True, the y axis of the histogram will be logarithmic.
- displayed_values_count - the number of unique values that will be displayed in the histogram. Default value for this field is 20. The rest of the values are grouped into the Other category.
- displayed_values_percent - the percentage of total mass of unique values that will be displayed. Other values are grouped into the Other category. Only one of displayed_values_count and displayed_values_percent can be set.
- value_order - the order in which the unique values are displayed in the histogram.
- plot_function - user defined plot function. When the default plot function is not suitable for the score, a user defined function can be used.
Example 1: Categorical histogram configuration
Here is a full example of a number and a categorical histogram configuration coming from the hg38/scores/AlphaMissense genomic score resource:
type: np_score
table:
  filename: AlphaMissense_hg38_modified.tsv.gz
  format: tabix
  chrom:
    name: chrom
  pos_begin:
    name: pos
  pos_end:
    name: pos
  reference:
    name: ref
  alternative:
    name: alt
scores:
  - id: am_pathogenicity
    name: am_pathogenicity
    type: float
    desc: |
      AlphaMissense Pathogenicity score is a deleteriousness score for missense variants
    large_values_desc: "more pathogenic"
    small_values_desc: "less pathogenic"
    histogram:
      type: number
      number_of_bins: 100
      view_range:
        min: 0.0
        max: 1.0
      y_log_scale: True
  - id: am_class
    name: am_class
    type: str
    desc: |
      AlphaMissense Class is a deleteriousness category for missense variants
    histogram:
      type: categorical
      y_log_scale: True
Example 2: Categorical histogram configuration
Here is an example of a categorical histogram configuration displaying usage
of plot_function, displayed_values_count, and displayed_values_percent fields.
Note that plot_function uses the following format:
<python module>:<python function>. The path to the python module should be
relative to the resource directory.
type: allele_score
table:
  filename: clinvar_20221105_chr.vcf.gz
  index_filename: clinvar_20221105_chr.vcf.gz.tbi
scores:
  - id: CLNSIG
    name: CLNSIG
    type: str
    desc: |
      Clinical significance for this single variant; multiple values
      are separated by a vertical bar
    histogram:
      type: categorical
      y_log_scale: True
      plot_function: "clinvar_plots.py:plot_clnsig"
  - id: CLNREVSTAT
    name: CLNREVSTAT
    type: str
    desc: |
      ClinVar review status for the Variation ID
    histogram:
      type: categorical
      y_log_scale: True
      displayed_values_count: 35
  - id: CLNVC
    name: CLNVC
    type: str
    desc: |
      Variant type
    histogram:
      type: categorical
      y_log_scale: True
      displayed_values_percent: 85.0
Here is the content of the clinvar_plots.py file:
from typing import IO
from dae.genomic_resources.histogram import CategoricalHistogram
import matplotlib
import matplotlib.pyplot as plt

matplotlib.use("agg")


def plot_clnsig(
    outfile: IO,
    histogram: CategoricalHistogram,
    xlabel: str,
    _small_values_description: str | None = None,
    _large_values_description: str | None = None,
) -> None:
    """Plot histogram and save it into outfile."""
    # pylint: disable=import-outside-toplevel
    # Sort the unique values by descending count.
    values = list(sorted(histogram.raw_values.items(), key=lambda x: -x[1]))
    # Skip combined entries, i.e. values that contain a vertical bar.
    values = [v for v in values if "|" not in v[0]]
    labels = [v[0] for v in values]
    counts = [v[1] for v in values]
    plt.figure(figsize=(40, 80), tight_layout=True)
    _, ax = plt.subplots()
    ax.bar(
        x=labels,
        height=counts,
        tick_label=[str(v) for v in labels],
        log=histogram.config.y_log_scale,
        align="center",
    )
    plt.xlabel(f"\n{xlabel}")
    plt.ylabel("count")
    plt.tick_params(axis="x", labelrotation=90, direction="out")
    plt.tight_layout()
    plt.savefig(outfile)
    plt.clf()
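To clarify the <python module>:<python function> format of plot_function, the sketch below shows how such a value could be resolved relative to the resource directory using only the Python standard library. This is not GPF's loader; the function name load_plot_function and the demo module demo_plots.py are hypothetical.

```python
import importlib.util
import tempfile
from pathlib import Path
from typing import Callable


def load_plot_function(resource_dir: Path, spec: str) -> Callable:
    """Resolve "<python module>:<python function>" to a callable."""
    module_file, sep, func_name = spec.partition(":")
    if not (sep and module_file and func_name):
        raise ValueError(
            f"expected '<python module>:<python function>', got {spec!r}")

    # The module path is interpreted relative to the resource directory.
    path = resource_dir / module_file
    module_spec = importlib.util.spec_from_file_location(path.stem, path)
    if module_spec is None or module_spec.loader is None:
        raise ImportError(f"cannot load module from {path}")
    module = importlib.util.module_from_spec(module_spec)
    module_spec.loader.exec_module(module)
    return getattr(module, func_name)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        resource = Path(tmp)
        # Hypothetical plot module living next to genomic_resource.yaml.
        (resource / "demo_plots.py").write_text(
            "def plot_demo():\n    return 'called plot_demo'\n")
        func = load_plot_function(resource, "demo_plots.py:plot_demo")
        print(func())
```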
Null Histograms Configuration
Null histograms are used when calculating a histogram is not possible or does not make sense. The null histogram configuration supports the following fields:
- type - the type of the histogram. This should be set to null.
- reason - the reason why the histogram is disabled. This field is required.
Example: Null histogram configuration
type: allele_score
table:
  filename: clinvar_20221105_chr.vcf.gz
  index_filename: clinvar_20221105_chr.vcf.gz.tbi
scores:
  - id: RS
    name: RS
    type: str
    desc: dbSNP ID (i.e. rs number)
    histogram:
      type: "null"
      reason: "Histogram is not available for this score."
Resource repositories
Resource repositories are collections of genomic resources hosted either locally or remotely.
Repository discovery
The GPF system will by default look for a .grr_definition.yaml file in the
home directory of your user.
Alternatively, the system will use a repository configuration file pointed to
by the GRR_DEFINITION_FILE environment variable if it has been set.
Finally, most CLI tools that use GRRs have a --grr <filename> argument
that overrides the defaults.
To configure the GRRs to be used by default for your user, you can create
the file ~/.grr_definition.yaml. An example of what the contents of this
file can be is:
id: "development"
type: group
children:
  - id: "grr_local"
    type: "directory"
    directory: "~/my_grr"
  - id: "default"
    type: "url"
    url: "https://grr.iossifovlab.com"
    cache_dir: "~/default_grr_cache"
Repository configuration
| Field | Description |
|---|---|
| id | String. The id of the repository. |
| type | |
| children | |
| url | |
| directory | |
| content | |
| cache_dir | |
- directory - A local filesystem repository.
- http - A remote HTTP repository.
- url - A remote S3 repository.
- embedded - An in-memory repository.
- group - A group of a number of repositories.
Caching of repositories
When a repository is configured with a cache_dir option, it will cache
resources locally before using them. It is significantly faster to use cached
resources, but it takes some time to cache them the first time they are used
and they occupy substantial disk space.
Management of resources and repositories with CLI tools
The GPF system provides two CLI tools for management of genomic resources and repositories. Their usage is outlined below:
grr_manage
$ grr_manage --help
usage: grr_manage [-h] [--version] [--verbose]
{list,repo-init,repo-manifest,resource-manifest,repo-stats,resource-stats,repo-info,resource-info,repo-repair,resource-repair}
...
Genomic Resource Repository Management Tool
positional arguments:
{list,repo-init,repo-manifest,resource-manifest,repo-stats,resource-stats,repo-info,resource-info,repo-repair,resource-repair}
Command to execute
list List a GR Repo
repo-init Initialize a directory to turn it into a GRR
repo-manifest Create/update manifests for whole GRR
resource-manifest Create/update manifests for a resource
repo-stats Build the statistics for a resource
resource-stats Build the statistics for a resource
repo-info Build the index.html for the whole GRR
resource-info Build the index.html for the specific resource
repo-repair Update/rebuild manifest and histograms whole GRR
resource-repair Update/rebuild manifest and histograms for a resource
options:
-h, --help show this help message and exit
--version Prints GPF version and exists.
--verbose, -v, -V
grr_browse
$ grr_browse --help
usage: grr_browse [-h] [--version] [--verbose] [-g GRR] [--bytes]
Genomic Resource Repository Browse Tool
options:
-h, --help show this help message and exit
--version Prints GPF version and exists.
--verbose, -v, -V
--bytes Print the resource size in bytes
Repository/Resource:
-g GRR, --grr GRR path to GRR definition file.
Tutorial: Create a local repository with a custom resource
A genomic resource is a set of files stored in a directory. To make a given
directory a genomic resource, it should contain a genomic_resource.yaml
file.
A genomic resources repository is a directory that contains genomic resources.
To make a given directory into a repository, it should have a .CONTENTS
file.
Create an empty GRR
To create an empty GRR, first create an empty directory. For example, let us
create an empty directory named grr_test, enter that directory and
run the grr_manage repo-init command:
mkdir grr_test
cd grr_test
grr_manage repo-init
After that the directory should contain an empty .CONTENTS file:
ls -a
. .. .CONTENTS
If we try to list all resources in this repository we should get an empty list:
grr_manage list
Create an empty genomic resource
Let us create our first genomic resource. Create a directory
hg38/scores/score9 inside the
grr_test repository and create an empty genomic_resource.yaml file
inside that directory:
mkdir -p hg38/scores/score9
cd hg38/scores/score9
touch genomic_resource.yaml
This will create an empty genomic resource in our repository
with ID hg38/scores/score9.
If we list the resources in our repository we would get:
grr_manage list
working with repository: .../grr_test
Basic 0 1 0 hg38/scores/score9
When we create or change a resource we need to repair the repository:
grr_manage repo-repair
This command will create a .MANIFEST file for our new resource
hg38/scores/score9 and will update the repository .CONTENTS to include
the resource.
Add genomic score resources
Add all score resource files (score file and Tabix index) inside
the created directory hg38/scores/score9. Let’s say these files are:
score9.tsv.gz
score9.tsv.gz.tbi
Configure the resource hg38/scores/score9. To this end, edit
the genomic_resource.yaml file so that it contains the position score
configuration:
type: position_score
table:
  filename: score9.tsv.gz
  format: tabix
  # defined by score_type
  chrom:
    name: chrom
  pos_begin:
    name: start
  pos_end:
    name: end
# score values
scores:
  - id: score9
    type: float
    desc: "score9"
    index: 3
histograms:
  - score: score9
    bins: 100
    y_scale: "log"
    x_scale: "linear"
default_annotation:
  attributes:
    - source: score9
      destination: score9
meta: |
  ## score9
  TODO
When ready, you should run grr_manage resource-repair from inside the resource
directory:
cd hg38/scores/score9
grr_manage resource-repair
This command is going to calculate histograms for the score (if they are configured) and create or update the resource manifest.
Once the resource is ready we need to regenerate the repository contents:
grr_manage repo-repair
Genomic position table configuration
Table configuration fields
- filename
Path to the file containing the data, relative to the genomic resource’s directory.
- format
Format of the file configured in filename. Currently supported formats are tabix, vcf_info, tsv, csv and bw. Auto-detection of the format works for the following filename extensions:
| Extension | Format |
|---|---|
| .bgz | tabix |
| .vcf.gz | vcf_info |
| .txt, .txt.gz, .tsv, .tsv.gz | tsv |
| .csv, .csv.gz | csv |
| .bw | bw |
- header_mode
The default value is file.
| Value | Effect |
|---|---|
| file | Will attempt to extract a header from the provided file. |
| list | Will take the list of strings provided with the configuration field header as header. |
| none | No header. Columns will only be able to be configured via index. |
- header
Used for providing a header when header_mode is set to list. Example:
header_mode: list
header: ["chrom", "start", "end", "score_value"]
- chrom_mapping
Allows transformation of the values in the chromosome column. Three options are available:
  - add_prefix
  Takes a string value and adds it as a prefix.
  - del_prefix
  Takes a string value to remove from the start of each chromosome.
  - filename
  Takes a filepath, relative to the genomic resource’s directory. The file’s contents must contain two columns delimited by whitespace. The first line must be the header, containing chrom and file_chrom as values. The file_chrom column contains values that will be found in the file, while the chrom column contains what they will be mapped to. An example is given below:
  chrom          file_chrom
  Chromosome_1   1
  Chromosome_22  22
- {column}
Generic configuration for a column in the genomic position table.
  - column_name
  Takes a string value. The name of the column as it appears in the file’s header. Cannot be used if no header has been provided for the table.
  - column_index
  Takes an integer value. The index of the column in the file.
  - name
  Deprecated version of column_name.
  - index
  Deprecated version of column_index.
- chrom
Column configuration for the chromosome column. See explanation for {column} above.
- pos_begin
Column configuration for the start position column. See explanation for {column} above.
- pos_end
Column configuration for the end position column. See explanation for {column} above.
- reference
Column configuration for the reference column. See explanation for {column} above.
- alternative
Column configuration for the alternative column. See explanation for {column} above.
Score configuration fields
- id
Takes a string value. The identifier the system will use to refer to this score column in annotation configurations.
- type
Type of the column’s values. Takes one of the following values: str, float, int.
- column_name
Takes a string value. The name of the column as it appears in the file’s header. Cannot be used if no header has been provided for the table.
- column_index
Takes an integer value. The index of the column in the file.
- name
Deprecated version of column_name.
- index
Deprecated version of column_index.
- desc
A string describing the score column.
- na_values
Takes a string or list of strings value. Which score values to consider as na.
- histogram
Histogram configuration. See Histograms and statistics for more info.
Auto generated score definition
VCF files provide enough information to allow automatic generation of score definitions. These definitions can be overridden manually if necessary, either partially or fully.
Example VCF file:
##fileformat=VCFv4.1
##INFO=<ID=A,Number=1,Type=Integer,Description="Score A">
#CHROM POS ID REF ALT QUAL FILTER INFO
chr1 5 . A T . . A=1
Score A will get an auto-generated score definition, as if it had been created by a configuration like this:
scores:
  - id: A
    type: int
    column_name: A
    desc: Score A
Some fields cannot be automatically generated. Use overriding to add more fields or to change existing auto-generated fields.
To override a score definition manually, first specify the score id,
then add new fields (like histogram) or override existing auto-generated ones (like type):
scores:
  - id: A
    type: float
    histogram:
      type: categorical
      value_order: ["alpha", "beta"]
The resulting score definition with updated type and added histogram will be equivalent to the following configuration:
scores:
  - id: A
    type: float
    column_name: A
    desc: Score A
    histogram:
      type: categorical
      value_order: ["alpha", "beta"]
How VCF types correspond to our types
| VCF | GPF |
|---|---|
| Integer | int |
| Float | float |
| String | str |
| Flag | bool |
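The auto-generation and overriding described above can be sketched in a few lines. This is an illustration under simplifying assumptions, not GPF's parser: the function names auto_score_definition and apply_override are hypothetical, and the regular expression only handles INFO header lines shaped like the one in the example.

```python
import re

# Type correspondence taken from the table above.
VCF_TO_GPF_TYPE = {
    "Integer": "int",
    "Float": "float",
    "String": "str",
    "Flag": "bool",
}

INFO_RE = re.compile(
    r'^##INFO=<ID=(?P<id>[^,]+),Number=(?P<number>[^,]+),'
    r'Type=(?P<type>[^,]+),Description="(?P<desc>[^"]*)"'
)


def auto_score_definition(info_line: str) -> dict:
    """Build a score definition from a VCF ##INFO header line."""
    match = INFO_RE.match(info_line)
    if match is None:
        raise ValueError(f"not an INFO header line: {info_line!r}")
    return {
        "id": match["id"],
        "type": VCF_TO_GPF_TYPE[match["type"]],
        "column_name": match["id"],
        "desc": match["desc"],
    }


def apply_override(auto: dict, manual: dict) -> dict:
    """Overlay manually configured fields on the generated definition."""
    if auto["id"] != manual["id"]:
        raise ValueError("override must refer to the same score id")
    merged = dict(auto)
    merged.update(manual)
    return merged


if __name__ == "__main__":
    line = '##INFO=<ID=A,Number=1,Type=Integer,Description="Score A">'
    generated = auto_score_definition(line)
    print(generated)
    print(apply_override(generated, {"id": "A", "type": "float"}))
```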
Zero-based / BED format scores
table:
  filename: data.txt.gz
  format: tabix
  zero_based: True
scores:
  - id: score_1
    name: score 1
    type: float
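The zero_based argument controls how the positions in the score file are read. When it is set to True, the positions are interpreted as zero-based, as in the BED format, rather than one-based.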
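For readers unfamiliar with the difference between the two coordinate systems, the sketch below converts a BED-style interval to a one-based one. It assumes the usual BED convention (zero-based start, end not included in the interval); whether GPF applies exactly this convention when zero_based is set is an assumption here, and the function name to_one_based is hypothetical.

```python
def to_one_based(start: int, end: int) -> tuple[int, int]:
    """Convert a zero-based, half-open interval [start, end) to a
    one-based interval with both ends included."""
    if start < 0 or end <= start:
        raise ValueError(f"invalid zero-based interval: {start}-{end}")
    # Only the start shifts; the excluded end of the half-open interval
    # equals the included end of the one-based interval.
    return start + 1, end


if __name__ == "__main__":
    # The first ten bases of a chromosome.
    print(to_one_based(0, 10))
```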
Example configurations
Example table configuration for a genomic score resource.
This configuration is embedded in the score’s genomic_resource.yaml config.
# Example genomic_resource.yaml for an NP score resource.
table:
  filename: whole_genome_SNVs.tsv.gz
  format: tabix
  # how to modify the values found when reading the chromosome column
  chrom_mapping:
    add_prefix: chr
  # configuration for essential columns
  chrom:
    name: Chrom
  pos_begin:
    name: Pos
  reference:
    name: Ref
  alternative:
    name: Alt
# score values
scores:
  - id: cadd_raw
    type: float
    name: RawScore
    desc: |
      CADD raw score for functional prediction of a SNP. The larger the score
      the more likely the SNP has damaging effect
    large_values_desc: "more damaging"
    small_values_desc: "less damaging"
    histogram:
      type: number
      number_of_bins: 100
      view_range:
        min: -8.0
        max: 36.0
      y_log_scale: True
# Example genomic_resource.yaml for a position score resource with multiple scores
# with different histogram configurations.
table:
  filename: scorefile.tsv.gz
  format: tabix
  # configuration for essential columns
  chrom:
    name: chromosome
  pos_begin:
    name: start
  pos_end:
    name: stop
# score values
scores:
  # float score
  - id: score_A
    type: float
    name: NumericScore
    number_hist:
      number_of_bins: 120
      view_range:
        min: -10.0
        max: 225.0
      x_log_scale: True
      x_min_log: 0.05
  # integer score
  - id: score_B
    type: int
    name: IntegerScore
    number_hist:
      number_of_bins: 10
  # string score with categorical histogram
  - id: score_C
    type: str
    name: CategoricalScore
    histogram:
      type: categorical
      value_order: ["alpha", "beta", "gamma", "delta"]
  # string score with no histogram
  - id: score_D
    type: str
    name: WeirdScore
    histogram:
      # quoted so that YAML reads the string "null" and not an empty value
      type: "null"
      reason: "Don't care about this score"
# Example bigWig score configuration.
type: position_score
table:
  filename: hg38.phyloP7way.bw
  # header mode must be set to none for bigWig scores
  header_mode: none
# currently, it's necessary to explicitly configure the score with its index set to 3
scores:
  - id: phyloP7way
    type: float
    column_index: 3
default_annotation:
  - source: phyloP7way
    name: phylop7way
How to generate tabix files
Note - in order to use tabix, the score file must already be compressed using bgzip.
$ tabix --help
Version: 1.22.1
Usage: tabix [OPTIONS] [FILE] [REGION [...]]
Indexing Options:
-0, --zero-based coordinates are zero-based
-b, --begin INT column number for region start [4]
-c, --comment CHAR skip comment lines starting with CHAR [null]
-C, --csi generate CSI index for VCF (default is TBI)
-e, --end INT column number for region end (if no end, set INT to -b) [5]
-f, --force overwrite existing index without asking
-m, --min-shift INT set minimal interval size for CSI indices to 2^INT [14]
-p, --preset STR gff, bed, sam, vcf, gaf
-s, --sequence INT column number for sequence names (suppressed by -p) [1]
-S, --skip-lines INT skip first INT lines [0]
Querying and other options:
-h, --print-header print also the header lines
-H, --only-header print only the header lines
-l, --list-chroms list chromosome names
-r, --reheader FILE replace the header with the content of FILE
-R, --regions FILE restrict to regions listed in the file
-T, --targets FILE similar to -R but streams rather than index-jumps
-D do not download the index file
--cache INT set cache size to INT megabytes (0 disables) [10]
--separate-regions separate the output by corresponding regions
--verbosity INT set verbosity [3]
-@, --threads INT number of additional threads to use [0]
$ bgzip --help
Version: 1.22.1
Usage: bgzip [OPTIONS] [FILE] ...
Options:
-b, --offset INT decompress at virtual file pointer (0-based uncompressed offset)
-c, --stdout write on standard output, keep original files unchanged
-d, --decompress decompress
-f, --force overwrite files without asking
-g, --rebgzip use an index file to bgzip a file
-h, --help give this help
-i, --index compress and create BGZF index
-I, --index-name FILE name of BGZF index file [file.gz.gzi]
-k, --keep don't delete input files during operation
-l, --compress-level INT Compression level to use when compressing; 0 to 9, or -1 for default [-1]
-o, --output FILE write to file, keep original files unchanged
-r, --reindex (re)index compressed file
-s, --size INT decompress INT bytes (uncompressed size)
-t, --test test integrity of compressed file
--binary Don't align blocks with text lines
-@, --threads INT number of compression threads to use [1]
Example usage of tabix
For a VCF-format score:
$ tabix -p vcf score.vcf.gz
For a 1-based TSV score with a single position column:
$ tabix -s 1 -b 2 score.tsv.gz
For a 1-based TSV score with start and stop position columns:
$ tabix -s 1 -b 2 -e 3 score.tsv.gz
For a 0-based TSV score with start and stop position columns:
$ tabix -0 -s 1 -b 2 -e 3 score.tsv.gz