Genomic position table configuration

Table configuration fields

filename

Path to the file containing the data, relative to the genomic resource’s directory.

format

Format of the file specified in filename. The supported formats are tabix, vcf_info, tsv, csv and bw. The format can be auto-detected from the following filename extensions:

Extension	Format
.bgz	tabix
.vcf.gz	vcf_info
.txt, .txt.gz, .tsv, .tsv.gz	tsv
.csv, .csv.gz	csv
.bw	bw
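
The extension table above can be read as a simple suffix lookup. The following is a minimal sketch of that idea, not GPF's actual implementation; the names EXTENSION_FORMATS and detect_format are hypothetical, and only the extension-to-format pairs come from the table.

```python
# Hypothetical helper: only the extension/format pairs come from the table above.
# Longer suffixes are listed first so that ".vcf.gz" is checked before shorter ones.
EXTENSION_FORMATS = [
    (".vcf.gz", "vcf_info"),
    (".txt.gz", "tsv"),
    (".tsv.gz", "tsv"),
    (".csv.gz", "csv"),
    (".bgz", "tabix"),
    (".txt", "tsv"),
    (".tsv", "tsv"),
    (".csv", "csv"),
    (".bw", "bw"),
]


def detect_format(filename):
    """Return the format implied by the filename extension, or None if unknown."""
    for extension, table_format in EXTENSION_FORMATS:
        if filename.endswith(extension):
            return table_format
    return None
```

Under this mapping a file named scores.tsv.gz would be detected as tsv, so a bgzip-compressed, tabix-indexed file with that extension would presumably need format: tabix set explicitly.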

header_mode

Controls where the table header is taken from. The default value is file.

Value	Effect
file	Attempts to extract the header from the provided file.
list	Uses the list of strings given in the header configuration field as the header.
none	No header. Columns can only be configured by index.

header

Used for providing a header when header_mode is set to list. Example:

header_mode: list
header: ["chrom", "start", "end", "score_value"]
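
The three header_mode values can be summarized as a small dispatch. This is a sketch under stated assumptions rather than GPF's own code: resolve_header is a hypothetical name, and the assumption that the header is the first line of the file split on a tab character is mine.

```python
def resolve_header(header_mode="file", first_line=None, header=None, separator="\t"):
    """Return the list of column names, or None when there is no header.

    Assumption: in "file" mode the header is the first line of the file,
    split on the separator.
    """
    if header_mode == "file":
        if first_line is None:
            raise ValueError("header_mode 'file' needs the first line of the file")
        return first_line.rstrip("\n").split(separator)
    if header_mode == "list":
        if not header:
            raise ValueError("header_mode 'list' needs a non-empty header field")
        return list(header)
    if header_mode == "none":
        # Columns can then only be addressed by index.
        return None
    raise ValueError(f"unknown header_mode: {header_mode!r}")
```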

chrom_mapping

Allows transformation of the values in the chromosome column. Three options are available:

add_prefix

Takes a string value and adds it as a prefix.

del_prefix

Takes a string value to remove from the start of each chromosome.

filename

Takes a file path, relative to the genomic resource’s directory. The file must contain two columns delimited by whitespace. The first line must be a header with the column names chrom and file_chrom. The file_chrom column lists the chromosome names as they appear in the data file, and the chrom column lists the names they will be mapped to. An example is given below:

chrom           file_chrom
Chromosome_1    1
Chromosome_22   22
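
To make the three chrom_mapping options concrete, here is a minimal sketch of each transformation. The function names are hypothetical, and the handling of edge cases (values without the prefix, values missing from the mapping file) is an assumption, since the text above does not specify it.

```python
def add_prefix(chrom, prefix):
    """add_prefix: prepend the configured string to the chromosome name."""
    return prefix + chrom


def del_prefix(chrom, prefix):
    """del_prefix: strip the configured string from the start of the name.

    Assumption: names that do not start with the prefix are left unchanged.
    """
    return chrom[len(prefix):] if chrom.startswith(prefix) else chrom


def load_chrom_mapping(text):
    """filename: parse the two-column mapping file into {file_chrom: chrom}."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    header, body = rows[0], rows[1:]
    if sorted(header) != ["chrom", "file_chrom"]:
        raise ValueError("header must contain chrom and file_chrom")
    chrom_idx = header.index("chrom")
    file_idx = header.index("file_chrom")
    return {row[file_idx]: row[chrom_idx] for row in body}


# The example mapping file shown above.
example_mapping = load_chrom_mapping(
    "chrom           file_chrom\n"
    "Chromosome_1    1\n"
    "Chromosome_22   22\n"
)
```

With the example file, the value 22 read from the data file would be reported as Chromosome_22.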

{column}

Generic configuration for a column in the genomic position table.

column_name: Takes a string value. The name of the column as it appears in the file’s header. Cannot be used if no header has been provided for the table.
column_index: Takes an integer value. The index of the column in the file.
name: Deprecated version of column_name.
index: Deprecated version of column_index.
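
The relationship between the name-based and index-based fields can be sketched as follows. This is an illustration only: resolve_column_index is a hypothetical helper, the precedence of an index over a name is an assumption, and so is the 0-based indexing (it is consistent with the bigWig example later in this document, where the score column has index 3).

```python
def resolve_column_index(column_config, header=None):
    """Return the position of a configured column.

    Assumptions: an explicit index wins over a name, the new field names win
    over the deprecated ones, and indices are 0-based.
    """
    index = column_config.get("column_index", column_config.get("index"))
    if index is not None:
        return index
    name = column_config.get("column_name", column_config.get("name"))
    if name is None:
        raise ValueError("column needs column_name or column_index")
    if header is None:
        raise ValueError("column_name cannot be used without a header")
    return header.index(name)  # raises ValueError if the name is not in the header
```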

chrom

Column configuration for the chromosome column. See explanation for {column} above.

pos_begin

Column configuration for the start position column. See explanation for {column} above.

pos_end

Column configuration for the end position column. See explanation for {column} above.

reference

Column configuration for the reference column. See explanation for {column} above.

alternative

Column configuration for the alternative column. See explanation for {column} above.

Score configuration fields

id: Takes a string value. The identifier the system will use to refer to this score column in annotation configurations.
type: Type of the column’s values. Takes one of the following values: str, float, int.
column_name: Takes a string value. The name of the column as it appears in the file’s header. Cannot be used if no header has been provided for the table.
column_index: Takes an integer value. The index of the column in the file.
name: Deprecated version of column_name.
index: Deprecated version of column_index.
desc: A string describing the score column.
na_values: Takes a string or a list of strings. Score values that should be treated as missing (NA).
histogram: Histogram configuration. See Histograms and statistics for more info.
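
As an illustration of how type and na_values might interact when a raw value is read from the file, consider the sketch below. It is not GPF's implementation; parse_score_value is a hypothetical name, and representing a missing value as None is an assumption.

```python
# Converters for the three documented score types.
SCORE_TYPES = {"str": str, "float": float, "int": int}


def parse_score_value(raw, score_type, na_values=None):
    """Convert a raw string to the configured type, or None if it is an NA value."""
    if na_values is None:
        na_values = []
    elif isinstance(na_values, str):
        # na_values accepts a single string as well as a list of strings.
        na_values = [na_values]
    if raw in na_values:
        return None
    return SCORE_TYPES[score_type](raw)
```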

Auto generated score definition

VCF files provide enough information to allow automatic generation of score definitions. These definitions can be overridden manually if necessary, either partially or fully.

Example VCF file:

##fileformat=VCFv4.1
##INFO=<ID=A,Number=1,Type=Integer,Description="Score A">
#CHROM POS ID REF ALT QUAL FILTER  INFO
chr1   5   .  A   T   .    .       A=1

Score A will get an auto-generated score definition, as if it had been created by the following configuration:

scores:
- id: A
  type: int
  column_name: A
  desc: Score A

Some fields cannot be generated automatically. To add fields or change auto-generated ones, list the score under scores by its id, then add new fields (such as histogram) or override existing auto-generated fields (such as type):

scores:
- id: A
  type: float
  histogram:
    type: categorical
    value_order: ["alpha", "beta"]

The resulting score definition with updated type and added histogram will be equivalent to the following configuration:

scores:
- id: A
  type: float
  column_name: A
  desc: Score A
  histogram:
    type: categorical
    value_order: ["alpha", "beta"]
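
The override behaviour described in this section amounts to a per-id merge of two lists of definitions. The sketch below reproduces the example with plain dictionaries. It is a hypothetical illustration, and the shallow (field-level) merge is an assumption: a manually supplied field replaces the auto-generated one as a whole.

```python
def merge_score_definitions(auto_generated, overrides):
    """Merge manual score definitions over auto-generated ones, matching by id."""
    merged = {score["id"]: dict(score) for score in auto_generated}
    for override in overrides:
        # Fields present in the override replace or extend the generated ones.
        merged.setdefault(override["id"], {}).update(override)
    return list(merged.values())


auto = [{"id": "A", "type": "int", "column_name": "A", "desc": "Score A"}]
manual = [{
    "id": "A",
    "type": "float",
    "histogram": {"type": "categorical", "value_order": ["alpha", "beta"]},
}]
result = merge_score_definitions(auto, manual)
```

Here result holds a single definition for A with type float, the generated column_name and desc, and the added histogram.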

How VCF types correspond to our types

VCF	GPF
Integer	int
Float	float
String	str
Flag	bool
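
Combining the type correspondence with the example VCF header from this section, an auto-generated definition could be derived from an INFO line roughly as follows. This is a simplified sketch, not GPF's parser: the regular expression only handles INFO lines whose fields appear in the order ID, Number, Type, Description, and the names used are hypothetical.

```python
import re

# Type correspondence from the table above.
VCF_TO_GPF_TYPE = {"Integer": "int", "Float": "float", "String": "str", "Flag": "bool"}

# Simplified pattern; assumes the ID, Number, Type, Description field order.
INFO_PATTERN = re.compile(
    r'^##INFO=<ID=(?P<id>[^,]+),Number=(?P<number>[^,]+),'
    r'Type=(?P<type>[^,]+),Description="(?P<desc>[^"]*)"'
)


def score_definition_from_info(line):
    """Build a score definition dictionary from a VCF ##INFO header line."""
    match = INFO_PATTERN.match(line)
    if match is None:
        raise ValueError(f"not a recognised INFO line: {line!r}")
    return {
        "id": match["id"],
        "type": VCF_TO_GPF_TYPE[match["type"]],
        "column_name": match["id"],
        "desc": match["desc"],
    }
```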

Zero-based / BED format scores

table:
  filename: data.txt.gz
  format: tabix
  zero_based: True
scores:
- id: score_1
  name: score 1
  type: float

The zero_based argument controls how the score file will be read.

Setting it to true reads the score file as BED-style, with 0-based, half-open intervals.

By default it is set to false, which reads the score file in GPF’s internal format, with 1-based, closed intervals.
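
The difference between the two conventions comes down to the start coordinate: a 0-based, half-open interval [start, end) covers the same bases as the 1-based, closed interval [start + 1, end]. A minimal sketch of that conversion, using a hypothetical helper name:

```python
def to_one_based_closed(start, end, zero_based=False):
    """Convert an interval to 1-based, closed coordinates.

    With zero_based=True the input is treated as BED-style [start, end),
    so only the start is shifted; the end stays the same.
    """
    if zero_based:
        return start + 1, end
    return start, end
```

For example, the BED-style interval 0 to 10 covers the first ten bases, which is 1 to 10 in 1-based, closed coordinates.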

Example configurations

Example table configuration for a genomic score resource. This configuration is embedded in the score’s genomic_resource.yaml config.

# Example genomic_resource.yaml for an NP score resource.

table:
  filename: whole_genome_SNVs.tsv.gz
  format: tabix

  # how to modify the values found when reading the chromosome column
  chrom_mapping:
    add_prefix: chr

  # configuration for essential columns
  chrom:
    name: Chrom
  pos_begin:
    name: Pos
  reference:
    name: Ref
  alternative:
    name: Alt

# score values
scores:
  - id: cadd_raw
    type: float
    name: RawScore
    desc: |
      CADD raw score for functional prediction of a SNP. The larger the score,
      the more likely the SNP has a damaging effect.
    large_values_desc: "more damaging"
    small_values_desc: "less damaging"
    histogram:
      type: number
      number_of_bins: 100
      view_range:
        min: -8.0
        max: 36.0
      y_log_scale: True

# Example genomic_resource.yaml for a position score resource with multiple scores
# with different histogram configurations.

table:
  filename: scorefile.tsv.gz
  format: tabix

  # configuration for essential columns
  chrom:
    name: chromosome
  pos_begin:
    name: start
  pos_end:
    name: stop

# score values
scores:
  # float score
  - id: score_A
    type: float
    name: NumericScore
    number_hist:
      number_of_bins: 120
      view_range:
        min: -10.0
        max: 225.0
      x_log_scale: True
      x_min_log: 0.05
  # integer score
  - id: score_B
    type: int
    name: IntegerScore
    number_hist:
      number_of_bins: 10
  # string score with categorical histogram
  - id: score_C
    type: str
    name: CategoricalScore
    histogram:
      type: categorical
      value_order: ["alpha", "beta", "gamma", "delta"]
  # string score with no histogram
  - id: score_D
    type: str
    name: WeirdScore
    histogram:
      type: null
      reason: "Don't care about this score"

# Example bigWig score configuration.

type: position_score

table:
  filename: hg38.phyloP7way.bw
  # header mode must be set to none for bigWig scores
  header_mode: none

# currently, it's necessary to explicitly configure the score with its index set to 3
scores:
  - id: phyloP7way
    type: float
    column_index: 3

default_annotation:
  - source: phyloP7way
    name: phylop7way

How to generate tabix files

Note: before running tabix, the score file must be compressed with bgzip.

$ tabix --help

Version: 1.22.1
Usage:   tabix [OPTIONS] [FILE] [REGION [...]]

Indexing Options:
   -0, --zero-based           coordinates are zero-based
   -b, --begin INT            column number for region start [4]
   -c, --comment CHAR         skip comment lines starting with CHAR [null]
   -C, --csi                  generate CSI index for VCF (default is TBI)
   -e, --end INT              column number for region end (if no end, set INT to -b) [5]
   -f, --force                overwrite existing index without asking
   -m, --min-shift INT        set minimal interval size for CSI indices to 2^INT [14]
   -p, --preset STR           gff, bed, sam, vcf, gaf
   -s, --sequence INT         column number for sequence names (suppressed by -p) [1]
   -S, --skip-lines INT       skip first INT lines [0]

Querying and other options:
   -h, --print-header         print also the header lines
   -H, --only-header          print only the header lines
   -l, --list-chroms          list chromosome names
   -r, --reheader FILE        replace the header with the content of FILE
   -R, --regions FILE         restrict to regions listed in the file
   -T, --targets FILE         similar to -R but streams rather than index-jumps
   -D                         do not download the index file
       --cache INT            set cache size to INT megabytes (0 disables) [10]
       --separate-regions     separate the output by corresponding regions
       --verbosity INT        set verbosity [3]
   -@, --threads INT          number of additional threads to use [0]

$ bgzip --help

Version: 1.22.1
Usage:   bgzip [OPTIONS] [FILE] ...
Options:
   -b, --offset INT           decompress at virtual file pointer (0-based uncompressed offset)
   -c, --stdout               write on standard output, keep original files unchanged
   -d, --decompress           decompress
   -f, --force                overwrite files without asking
   -g, --rebgzip              use an index file to bgzip a file
   -h, --help                 give this help
   -i, --index                compress and create BGZF index
   -I, --index-name FILE      name of BGZF index file [file.gz.gzi]
   -k, --keep                 don't delete input files during operation
   -l, --compress-level INT   Compression level to use when compressing; 0 to 9, or -1 for default [-1]
   -o, --output FILE          write to file, keep original files unchanged
   -r, --reindex              (re)index compressed file
   -s, --size INT             decompress INT bytes (uncompressed size)
   -t, --test                 test integrity of compressed file
       --binary               Don't align blocks with text lines
   -@, --threads INT          number of compression threads to use [1]

Example usage of `tabix`

For a VCF-format score:

$ tabix -p vcf score.vcf.gz

For a 1-based TSV score with a single position column:

$ tabix -s 1 -b 2 score.tsv.gz

For a 1-based TSV score with start and stop position columns:

$ tabix -s 1 -b 2 -e 3 score.tsv.gz

For a 0-based TSV score with start and stop position columns:

$ tabix -0 -s 1 -b 2 -e 3 score.tsv.gz
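
When indexing is scripted, the commands above can be assembled from the column layout. The helper below is a hypothetical sketch that only builds the argument list, using the -p, -0, -s, -b and -e options shown in the help output. It does not run tabix itself.

```python
def tabix_command(filename, chrom_col=1, begin_col=2, end_col=None,
                  zero_based=False, preset=None):
    """Build a tabix argument list; column numbers are 1-based, as tabix expects."""
    command = ["tabix"]
    if preset is not None:
        # A preset such as "vcf" replaces the explicit column options.
        return command + ["-p", preset, filename]
    if zero_based:
        command.append("-0")
    command += ["-s", str(chrom_col), "-b", str(begin_col)]
    if end_col is not None:
        command += ["-e", str(end_col)]
    command.append(filename)
    return command
```

The resulting list can be passed to subprocess.run(command, check=True) once the file has been compressed with bgzip.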