GPF Getting Started Guide

Setup

Prerequisites

This guide assumes that you are working on a recent Linux (or Mac OS X) machine.

The GPF system is distributed as a Conda package. If you do not already have a working installation of Anaconda, Miniconda, or Mamba, you must install a Conda or Mamba package manager distribution. We recommend using a Miniforge distribution.

Go to the Miniforge home page and follow the instructions for your platform.

Warning

The GPF system is not supported on Windows.

GPF Installation

The GPF system is developed in Python and supports Python 3.11 and up.

Begin by creating an empty Conda environment named gpf:

mamba create -n gpf

To use this environment, you need to activate it using the following command:

mamba activate gpf

Afterwards, install the gpf_wdae conda package:

mamba install \
    -c conda-forge \
    -c bioconda \
    -c iossifovlab \
    -c defaults \
    gpf_wdae

This command installs GPF and all of its dependencies.
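
After the installation, a quick sanity check of the active environment can be useful. The following is a minimal sketch, not part of GPF: it only checks the Python version mentioned above and whether the command-line tools used later in this guide (wgpf, import_genotypes, import_phenotypes) can be found on the PATH. The function name check_gpf_environment is hypothetical, and the sketch assumes that the gpf_wdae package puts these tools on the PATH.

```python
import shutil
import sys

# Tool names are taken from this guide; the check itself is only an illustration.
GPF_TOOLS = ("wgpf", "import_genotypes", "import_phenotypes")


def check_gpf_environment(tools=GPF_TOOLS):
    """Return a dict that reports whether the active environment looks usable."""
    # The guide states that GPF supports Python 3.11 and up.
    report = {"python_ok": sys.version_info >= (3, 11)}
    for tool in tools:
        # shutil.which returns None when the executable is not on the PATH.
        report[tool] = shutil.which(tool) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_gpf_environment().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

Run the script inside the activated gpf environment; any line reported as missing suggests that the environment is not activated or that the installation did not complete.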

Getting the demonstration data

Clone the gpf-getting-started repository, which contains the demonstration data:

git clone https://github.com/iossifovlab/gpf-getting-started.git

Navigate to the newly created directory:

cd gpf-getting-started

This repository provides a minimal GPF instance configuration and sample data to be imported.

Starting and stopping the GPF web interface

By default, the GPF system looks for a file named gpf_instance.yaml in the current directory and then in its parent directories. If GPF finds such a file, it uses it as the configuration for the GPF instance. Otherwise, GPF falls back to the DAE_DB_DIR environment variable. If that variable is not set either, GPF raises an exception.
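
The lookup order described above can be sketched in a few lines of Python. This is an illustration of the documented behavior, not GPF's actual implementation: the function name find_instance_config is hypothetical, and the sketch assumes that DAE_DB_DIR points at a directory that contains gpf_instance.yaml.

```python
import os
from pathlib import Path

CONFIG_NAME = "gpf_instance.yaml"


def find_instance_config(start=None, environ=None):
    """Mimic the documented lookup order; this is not GPF's own code."""
    start = Path(start or Path.cwd()).resolve()
    environ = os.environ if environ is None else environ

    # 1. Look in the starting directory, then in each parent directory in turn.
    for directory in (start, *start.parents):
        candidate = directory / CONFIG_NAME
        if candidate.is_file():
            return candidate

    # 2. Fall back to DAE_DB_DIR (assumed here to be the instance directory).
    dae_db_dir = environ.get("DAE_DB_DIR")
    if dae_db_dir:
        return Path(dae_db_dir) / CONFIG_NAME

    # 3. Nothing was found: report an error.
    raise FileNotFoundError(f"{CONFIG_NAME} not found and DAE_DB_DIR is not set")
```

Note that a gpf_instance.yaml file found in the current directory or one of its parents takes precedence over DAE_DB_DIR in this sketch, matching the order in which the two are described above.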

For this guide, we recommend setting the DAE_DB_DIR environment variable.

From within the gpf-getting-started directory, run the following command:

export DAE_DB_DIR=$(pwd)/minimal_instance

For this guide, we use a gpf_instance.yaml file that is already provided in the minimal_instance subdirectory:

instance_id: minimal_instance

reference_genome:
    resource_id: "hg38/genomes/GRCh38-hg38"

gene_models:
    resource_id: "hg38/gene_models/MANE/1.3"

GPF instance configuration requires a reference genome and gene models to annotate variants with effects on genes.

If not specified otherwise, GPF uses the GPF Genomic Resources Repository (GRR) located at https://grr.iossifovlab.com/ to find the resources it needs.

For this guide, we use the HG38 reference genome (hg38/genomes/GRCh38-hg38) and MANE 1.3 gene models (hg38/gene_models/MANE/1.3) provided in the default GRR.

Note

For more on GPF instance configuration, see GPF Instance Configuration.

Now we can run the GPF development web server and browse our empty GPF instance:

wgpf run

and browse the GPF development server at http://localhost:8000.

The web interface will be mostly empty because no data has been imported into the instance yet.
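
If you prefer to check from a script whether the development server is reachable, the following minimal sketch uses only the Python standard library. The function name server_is_up is hypothetical; the URL is the one given above, and the sketch makes no assumption about what the GPF home page returns beyond a successful HTTP response.

```python
import urllib.request


def server_is_up(url="http://localhost:8000", timeout=2.0):
    """Return True if an HTTP server answers at `url` with a non-error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # URLError, HTTPError, timeouts and refused connections all derive from OSError.
        return False


if __name__ == "__main__":
    print("server is up" if server_is_up() else "server is not reachable")
```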

To stop the GPF development web server, press Ctrl-C, the usual key binding for stopping long-running commands in a terminal.

Warning

The development web server started by wgpf run, which is used throughout this guide, is meant for development purposes only and is not suitable for serving the GPF system in production.

Importing genotype data

Import Tools and Import Project

The tool used to import genotype data is named import_genotypes. This tool expects an import project file that describes the import.

We support importing variants from multiple formats.

For this demonstration, we will be importing from the following formats:

  • List of de novo variants

  • Variant Call Format (VCF)

Example import of de novo variants

Let us import a small list of de novo variants.

Note

All the data files needed for this example are available in the gpf-getting-started repository under the subdirectory input_genotype_data.

A pedigree file that describes the families is needed - input_genotype_data/example.ped:

familyId  personId  dadId     momId     sex       status
f1        f1.dad                        M         unaffected
f1        f1.mom                        F         unaffected
f1        f1.p1     f1.dad    f1.mom    M         affected
f1        f1.s1     f1.dad    f1.mom    F         unaffected
f2        f2.mom                        F         unaffected
f2        f2.dad                        M         unaffected
f2        f2.p1     f2.dad    f2.mom    F         affected
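
To make the structure of the pedigree file concrete, here is a minimal sketch that groups the individuals by family and lists the children (the individuals that have both parents set). It does not use any GPF API. The sketch assumes that the file on disk is tab-separated, and the function names load_families and children are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Contents of example.ped as listed above, written with explicit tab separators
# (assumption: the file on disk is tab-separated).
PEDIGREE_TEXT = """\
familyId\tpersonId\tdadId\tmomId\tsex\tstatus
f1\tf1.dad\t\t\tM\tunaffected
f1\tf1.mom\t\t\tF\tunaffected
f1\tf1.p1\tf1.dad\tf1.mom\tM\taffected
f1\tf1.s1\tf1.dad\tf1.mom\tF\tunaffected
f2\tf2.mom\t\t\tF\tunaffected
f2\tf2.dad\t\t\tM\tunaffected
f2\tf2.p1\tf2.dad\tf2.mom\tF\taffected
"""


def load_families(text):
    """Group pedigree rows by family ID."""
    families = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        families[row["familyId"]].append(row)
    return dict(families)


def children(members):
    """Return the person IDs of members that have both parents listed."""
    return [m["personId"] for m in members if m["dadId"] and m["momId"]]


if __name__ == "__main__":
    for family_id, members in load_families(PEDIGREE_TEXT).items():
        print(family_id, len(members), "members, children:", children(members))
```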

We will also need the list of de novo variants, input_genotype_data/example.tsv:

chrom     pos       ref       alt       person_id
chr14     21403214  T         C         f1.p1
chr14     21431459  G         C         f1.p1
chr14     21391016  A         AT        f2.p1
chr14     21403019  G         A         f2.p1
chr14     21393484  TCTTC     T         f2.p1
chr14     21409849  A         G         f1.p1
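
Before importing, it can be worth checking that every person_id in the variant list also appears in the pedigree. The following is a minimal sketch of such a check, independent of GPF. It assumes a tab-separated file, and the function name count_variants_per_person is hypothetical.

```python
import csv
import io
from collections import Counter

# Person IDs from example.ped.
KNOWN_PERSONS = {"f1.dad", "f1.mom", "f1.p1", "f1.s1", "f2.mom", "f2.dad", "f2.p1"}

# Contents of example.tsv as listed above (assumption: tab-separated on disk).
DENOVO_TEXT = """\
chrom\tpos\tref\talt\tperson_id
chr14\t21403214\tT\tC\tf1.p1
chr14\t21431459\tG\tC\tf1.p1
chr14\t21391016\tA\tAT\tf2.p1
chr14\t21403019\tG\tA\tf2.p1
chr14\t21393484\tTCTTC\tT\tf2.p1
chr14\t21409849\tA\tG\tf1.p1
"""


def count_variants_per_person(text, known_persons):
    """Count variants per person and fail on IDs missing from the pedigree."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        person = row["person_id"]
        if person not in known_persons:
            raise ValueError(f"{person} is not listed in the pedigree")
        counts[person] += 1
    return counts


if __name__ == "__main__":
    print(count_variants_per_person(DENOVO_TEXT, KNOWN_PERSONS))
```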

A project configuration file for importing this study - input_genotype_data/denovo_example.yaml - is also provided:

id: denovo_example

input:
  pedigree:
    file: example.ped

  denovo:
    files:
      - example.tsv

To import this project, run the following command:

import_genotypes input_genotype_data/denovo_example.yaml

Note

For more information on the import project configuration file, see Import Tools.

For more information on working with pedigree files, see Working With Pedigree Files Guide.

The import_genotypes tool will read all the variants from the files specified in the project configuration, annotate them using the reference genome and gene models specified in the GPF instance configuration, and finally store them in an appropriate format for use by the GPF system. By default, the imported files are stored in the internal_storage subdirectory of the GPF instance directory.

A minimal study configuration file for the imported study will also be created at minimal_instance/studies/denovo_example.yaml.

Intermediary files created during import will be stored in the import project directory input_genotype_data/denovo_example.

When the import finishes, you can run the GPF development server using:

wgpf run

and browse the content of the GPF development server at http://localhost:8000

The home page of the GPF system will show the imported study denovo_example.

../_images/denovo-example-home-page.png

If you follow the link to the study and choose the Dataset Statistics page, you will see some summary information for the imported study: families and individuals included in the study, types of families, and rates of de novo variants.

../_images/denovo-example-dataset-statistics.png

Statistics of individuals by affected status and sex

../_images/denovo-example-families-by-pedigree.png

Statistics of families by family type

../_images/denovo-example-rate-de-novo-variants.png

Rate of de novo variants

If you select the Genotype Browser page, you will be able to see the imported de novo variants. The default filters search for likely gene-disrupting (LGD) de novo variants. It happens that all de novo variants imported in the denovo_example study are LGD variants.

So, when you click the Table Preview button, all the imported variants will be shown.

../_images/denovo-example-genotype-browser.png

Genotype Browser with de novo variants

Example import of VCF variants

Similar to the sample de novo variants, there are also sample variants in VCF format. They can be found in input_genotype_data/example.vcf; the same pedigree file as before is used.

##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr14>
#CHROM    POS       ID        REF       ALT       QUAL      FILTER    INFO      FORMAT    f1.mom    f1.dad    f1.p1     f1.s1     f2.mom    f2.dad    f2.p1
chr14     21385738  .         C         T         .         .         .         GT        0/0       0/1       0/1       0/0       0/0       0/1       0/0
chr14     21385954  .         A         C         .         .         .         GT        0/0       0/0       0/0       0/0       0/1       0/0       0/1
chr14     21393173  .         T         C         .         .         .         GT        0/1       0/0       0/0       0/1       0/0       0/0       0/0
chr14     21393702  .         C         T         .         .         .         GT        0/0       0/0       0/0       0/0       0/0       0/1       0/1
chr14     21393860  .         G         A         .         .         .         GT        0/0       0/1       0/1       0/1       0/0       0/0       0/0
chr14     21403023  .         G         A         .         .         .         GT        0/0       0/1       0/0       0/1       0/1       0/0       0/0
chr14     21405222  .         T         C         .         .         .         GT        0/0       0/0       0/0       0/0       0/0       0/1       0/0
chr14     21409888  .         T         C         .         .         .         GT        0/1       0/0       0/1       0/0       0/1       0/0       1/0
chr14     21429019  .         C         T         .         .         .         GT        0/0       0/1       0/1       0/0       0/0       0/1       0/1
chr14     21431306  .         G         A         .         .         .         GT        0/0       0/1       0/1       0/1       0/0       0/0       0/0
chr14     21431623  .         A         C         .         .         .         GT        0/0       0/0       0/0       0/0       0/1       1/1       1/1
chr14     21393540  .         GGAA      G         .         .         .         GT        0/1       0/1       1/1       0/0       0/0       0/0       0/0
chr14     21431499  .         T         C         .         .         .         GT        0/0       0/0       0/0       0/0       0/0       0/1       0/1
chr14     21402010  .         G         A         .         .         .         GT        0/0       0/1       0/1       0/0       0/0       0/0       0/0
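
To illustrate how the genotype columns of the VCF file relate to individuals, the following minimal sketch lists, for each variant, the samples that carry at least one alternative allele. It is a simplified reader written for this example only (it is not a general VCF parser and not what GPF uses), and the function name parse_vcf_carriers is hypothetical.

```python
# Header line and two of the records from example.vcf. Fields are split on any
# whitespace here, which works because none of these fields contains spaces.
VCF_LINES = [
    "##fileformat=VCFv4.2",
    "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT "
    "f1.mom f1.dad f1.p1 f1.s1 f2.mom f2.dad f2.p1",
    "chr14 21385954 . A C . . . GT 0/0 0/0 0/0 0/0 0/1 0/0 0/1",
    "chr14 21431623 . A C . . . GT 0/0 0/0 0/0 0/0 0/1 1/1 1/1",
]


def parse_vcf_carriers(lines):
    """Map (chrom, pos, ref, alt) to the list of samples with a non-reference allele."""
    samples = []
    carriers = {}
    for line in lines:
        if line.startswith("##"):
            continue  # meta-information lines
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "#CHROM":
            samples = fields[9:]  # sample names follow the nine fixed columns
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        found = []
        for sample, genotype in zip(samples, fields[9:]):
            alleles = genotype.replace("|", "/").split("/")
            if any(allele not in ("0", ".") for allele in alleles):
                found.append(sample)
        carriers[(chrom, int(pos), ref, alt)] = found
    return carriers


if __name__ == "__main__":
    for variant, found in parse_vcf_carriers(VCF_LINES).items():
        print(variant, found)
```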

A project configuration file is provided - input_genotype_data/vcf_example.yaml.

id: vcf_example

input:
  pedigree:
    file: example.ped

  vcf:
    files:
      - example.vcf

To import them, run the following command:

import_genotypes input_genotype_data/vcf_example.yaml

When the import finishes, you can run the GPF development server using:

wgpf run

and browse the content of the GPF development server at http://localhost:8000

The GPF instance Home Page now includes the imported study vcf_example.

../_images/vcf-example-home-page.png

If you follow the link to vcf_example, you will get to the Gene Browser page for the study. It happens that all imported VCF variants are located in the CHD8 gene. Enter CHD8 in the Gene Symbol box and click the Go button.

../_images/vcf-example-gene-browser.png

Gene Browser for CHD8 gene shows variants from vcf_example study

In the Gene Browser results, the top section has the summary variants, which show the location and frequency of the variants. The bottom section has the family variants, which show the family information, pedigree, and additional annotations.

You can also view these variants in the Genotype Browser by choosing:

  • ‘All’ in Present in Child

  • ‘All’ in Present in Parent and ‘All’ in Rarity

  • ‘All’ in Effect Types.

Example of a dataset (group of genotype studies)

The already imported studies denovo_example and vcf_example have genomic variants for the same group of individuals, described in example.ped. We can create a dataset (group of genotype studies) that includes both studies.

To this end, create a directory datasets/example_dataset inside the GPF instance directory minimal_instance:

mkdir -p minimal_instance/datasets/example_dataset

and create the following configuration file example_dataset.yaml inside that directory:

id: example_dataset
name: Example Dataset

studies:
  - denovo_example
  - vcf_example
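
The two steps above (creating the directory and writing the configuration file) can also be scripted. The following is a minimal sketch using only the Python standard library. The function name write_dataset_config is hypothetical; the paths and the file content are the ones given above.

```python
from pathlib import Path

# The dataset configuration exactly as shown above.
DATASET_CONFIG = """\
id: example_dataset
name: Example Dataset

studies:
  - denovo_example
  - vcf_example
"""


def write_dataset_config(instance_dir="minimal_instance"):
    """Create datasets/example_dataset and write example_dataset.yaml into it."""
    dataset_dir = Path(instance_dir) / "datasets" / "example_dataset"
    dataset_dir.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
    config_path = dataset_dir / "example_dataset.yaml"
    config_path.write_text(DATASET_CONFIG)
    return config_path


if __name__ == "__main__":
    # Run from within the gpf-getting-started directory.
    print("wrote", write_dataset_config())
```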

When the configuration is ready, re-run the wgpf run command. The home page of the GPF instance will now include the configured dataset example_dataset.

../_images/example-dataset-home-page.png

Home page of the GPF instance showing the example_dataset

Follow the link to the Example Dataset, choose the Gene Browser page, and enter CHD8 in the Gene Symbol box. Click the Go button, and you will now be able to see the variants from both studies.

../_images/example-dataset-gene-browser.png

Gene Browser for CHD8 gene shows variants from both studies - denovo_example and vcf_example

Getting Started with Annotation

The import of genotype data into a GPF instance always runs the GPF effect annotator. It is easy to extend the annotation of genotype data during the import.

To define the annotation used during the import into a GPF instance, we have to add a configuration that defines the pipeline of annotators and resources to be used during the import.

In the public GPF Genomic Resources Repository (GRR) there is a collection of public genomic resources available for use with the GPF system.

Let’s say that we want to annotate the genotype variants with gnomAD and ClinVar. We need to find the appropriate resources in the public GRR:

  • hg38/variant_frequencies/gnomAD_4.1.0/genomes/ALL - this is an allele_score resource and the annotator by default produces one additional attribute, gnomad_v4_genome_ALL_af, which is the allele frequency for the variant (check the hg38/variant_frequencies/gnomAD_4.1.0/genomes/ALL page for more information about the resource);

  • hg38/scores/ClinVar_20240730 - this is an allele_score resource and the annotator by default produces two additional attributes: CLNSIG, the aggregate germline classification for the variant, and CLNDN, the preferred disease name (check the hg38/scores/ClinVar_20240730 page for more information about the resource).

In order to use these resources in the GPF instance annotation, we need to edit the GPF instance configuration (minimal_instance/gpf_instance.yaml) and add lines 9-12 to it:

 1  instance_id: minimal_instance
 2
 3  reference_genome:
 4    resource_id: "hg38/genomes/GRCh38-hg38"
 5
 6  gene_models:
 7    resource_id: "hg38/gene_models/MANE/1.3"
 8
 9  annotation:
10    config:
11      - allele_score: hg38/variant_frequencies/gnomAD_4.1.0/genomes/ALL
12      - allele_score: hg38/scores/ClinVar_20240730
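
If you would rather apply this change from a script than edit the file by hand, the following minimal sketch appends the annotation section to the instance configuration, and does nothing if a top-level annotation key is already present. It treats the file as plain text rather than parsing the YAML, and the function name add_annotation is hypothetical.

```python
from pathlib import Path

# The annotation section shown above, preceded by a blank line.
ANNOTATION_BLOCK = """
annotation:
  config:
    - allele_score: hg38/variant_frequencies/gnomAD_4.1.0/genomes/ALL
    - allele_score: hg38/scores/ClinVar_20240730
"""


def add_annotation(config_path="minimal_instance/gpf_instance.yaml"):
    """Append the annotation section unless the file already has one."""
    path = Path(config_path)
    text = path.read_text()
    # Plain-text check for an existing top-level `annotation:` key.
    if any(line.startswith("annotation:") for line in text.splitlines()):
        return False
    path.write_text(text.rstrip("\n") + "\n" + ANNOTATION_BLOCK)
    return True


if __name__ == "__main__":
    # Run from within the gpf-getting-started directory.
    print("updated" if add_annotation() else "annotation section already present")
```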

When you start the GPF instance using the wgpf tool, it will automatically re-annotate any genotype data that is not up to date:

wgpf run

The variants in our Example Dataset will now have additional attributes that come from the annotation with gnomAD and ClinVar:

  • gnomad_v4_genome_ALL_af

  • CLNSIG

  • CLNDN

By default, the additional attributes produced by the annotation are usable in the following ways:

  • If you download the variants using the Genotype Browser download button, the additional attributes will be included in the downloaded file.

  • We can query the variants using the gnomad_v4_genome_ALL_af, CLNSIG and CLNDN genomic scores.

Let’s say we want to find all variants from Example Dataset that have a gnomAD frequency. Navigate to the Genotype Browser tab for the Example Dataset. Select all checkboxes in the Genotype Browser filters. In the Genomic Score filter, select the gnomad_v4_genome_ALL_af score.

../_images/example-dataset-all-variants-with-gnomad-filter.png

Genotype browser for Example Dataset with all filters selected

Then click on the Download button. This will download the family variants matching the selected filters in a tab-separated file similar to the one shown below. Attributes from the annotation are included as the last columns in the downloaded file.

family id	study	location	variant	CHROM	POS	REF	ALT	family person ids	family structure	family best state	family genotype	carrier person ids	carrier person attributes	inheritance type	family phenotypes	carrier phenotypes	parents called	study AF	worst effect	genes	all effects	effect details	gnomad_v4_genome_ALL_af	CLNSIG	CLNDN
f1	denovo_example	chr14:21409849	sub(A->G)	chr14	21409849	A	G	f1.dad;f1.mom;f1.p1;f1.s1	dad:M:unaffected;mom:F:unaffected;prb:M:affected;sib:F:unaffected	2212/0010	0/0;0/0;0/1;0/0	f1.p1	prb:M:affected	denovo					splice-site	CHD8	CHD8:splice-site	ENST00000646647.2:CHD8:splice-site:789/2581	0.00002	Uncertain_significance	not_provided
f2	vcf_example	chr14:21385954	sub(A->C)	chr14	21385954	A	C	f2.mom;f2.dad;f2.p1	mom:F:unaffected;dad:M:unaffected;prb:F:affected	121/101	0/1;0/0;0/1	f2.mom;f2.p1	mom:F:unaffected;prb:F:affected	mendelian			4	12.5	missense	CHD8	CHD8:missense	ENST00000646647.2:CHD8:missense:2469/2581(Ser->Ala)	0.00001	Uncertain_significance	not_provided
f1	vcf_example	chr14:21393173	sub(T->C)	chr14	21393173	T	C	f1.dad;f1.mom;f1.p1;f1.s1	dad:M:unaffected;mom:F:unaffected;prb:M:affected;sib:F:unaffected	2121/0101	0/0;0/1;0/0;0/1	f1.mom;f1.s1	mom:F:unaffected;sib:F:unaffected	mendelian			4	12.5	missense	CHD8	CHD8:missense	ENST00000646647.2:CHD8:missense:2134/2581(Asp->Gly)	0.00003	Uncertain_significance	Inborn_genetic_diseases
f2	vcf_example	chr14:21393702	sub(C->T)	chr14	21393702	C	T	f2.mom;f2.dad;f2.p1	mom:F:unaffected;dad:M:unaffected;prb:F:affected	211/011	0/0;0/1;0/1	f2.dad;f2.p1	dad:M:unaffected;prb:F:affected	mendelian			4	12.5	synonymous	CHD8	CHD8:synonymous	ENST00000646647.2:CHD8:synonymous:2031/2581	0.00001	Likely_benign	not_provided
f2	vcf_example	chr14:21405222	sub(T->C)	chr14	21405222	T	C	f2.mom;f2.dad;f2.p1	mom:F:unaffected;dad:M:unaffected;prb:F:affected	212/010	0/0;0/1;0/0	f2.dad	dad:M:unaffected	mendelian			4	12.5	synonymous	CHD8	CHD8:synonymous	ENST00000646647.2:CHD8:synonymous:1098/2581	0.00003	Likely_benign	not_provided
f1	vcf_example	chr14:21431306	sub(G->A)	chr14	21431306	G	A	f1.dad;f1.mom;f1.p1;f1.s1	dad:M:unaffected;mom:F:unaffected;prb:M:affected;sib:F:unaffected	1211/1011	0/1;0/0;0/1;0/1	f1.dad;f1.p1;f1.s1	dad:M:unaffected;prb:M:affected;sib:F:unaffected	mendelian			4	12.5	missense	CHD8	CHD8:missense	ENST00000646647.2:CHD8:missense:113/2581(Ser->Leu)	0.00005	Conflicting_classifications_of_pathogenicity	Inborn_genetic_diseases|not_provided
f2	vcf_example	chr14:21431623	sub(A->C)	chr14	21431623	A	C	f2.mom;f2.dad;f2.p1	mom:F:unaffected;dad:M:unaffected;prb:F:affected	100/122	0/1;1/1;1/1	f2.mom;f2.dad;f2.p1	mom:F:unaffected;dad:M:unaffected;prb:F:affected	mendelian			4	37.5	missense	CHD8	CHD8:missense	ENST00000646647.2:CHD8:missense:7/2581(Asp->Glu)	0.00001	Uncertain_significance	not_provided|Inborn_genetic_diseases
f1	vcf_example	chr14:21393541	del(3)	chr14	21393540	GGAA	G	f1.dad;f1.mom;f1.p1;f1.s1	dad:M:unaffected;mom:F:unaffected;prb:M:affected;sib:F:unaffected	1102/1120	0/1;0/1;1/1;0/0	f1.dad;f1.mom;f1.p1	dad:M:unaffected;mom:F:unaffected;prb:M:affected	mendelian			4	25.0	no-frame-shift	CHD8	CHD8:no-frame-shift	ENST00000646647.2:CHD8:no-frame-shift:2084/2581(SerSer->Ser)	0.00013	Conflicting_classifications_of_pathogenicity	Intellectual_developmental_disorder_with_autism_and_macrocephaly|not_provided
f2	vcf_example	chr14:21431499	sub(T->C)	chr14	21431499	T	C	f2.mom;f2.dad;f2.p1	mom:F:unaffected;dad:M:unaffected;prb:F:affected	211/011	0/0;0/1;0/1	f2.dad;f2.p1	dad:M:unaffected;prb:F:affected	mendelian			4	12.5	missense	CHD8	CHD8:missense	ENST00000646647.2:CHD8:missense:49/2581(Met->Val)	0.00276	Benign	not_provided|Inborn_genetic_diseases
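
Because the downloaded file is tab-separated, it is easy to post-process with a script. The following minimal sketch selects rare variants that ClinVar does not classify as benign. It uses a small excerpt (a few columns from three of the rows shown above) instead of a real download; the threshold, the function name rare_non_benign, and the handling of non-numeric frequency values are assumptions made for this illustration.

```python
import csv
import io

# A few columns from three of the rows shown above; a real download has all columns.
DOWNLOAD_TEXT = (
    "location\tvariant\tworst effect\tgnomad_v4_genome_ALL_af\tCLNSIG\n"
    "chr14:21409849\tsub(A->G)\tsplice-site\t0.00002\tUncertain_significance\n"
    "chr14:21393702\tsub(C->T)\tsynonymous\t0.00001\tLikely_benign\n"
    "chr14:21431499\tsub(T->C)\tmissense\t0.00276\tBenign\n"
)


def rare_non_benign(text, max_af=0.001):
    """Return locations with gnomAD frequency <= max_af and a non-benign CLNSIG."""
    selected = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        try:
            frequency = float(row["gnomad_v4_genome_ALL_af"])
        except ValueError:
            continue  # skip rows without a numeric frequency value
        if frequency <= max_af and "benign" not in row["CLNSIG"].lower():
            selected.append(row["location"])
    return selected


if __name__ == "__main__":
    print(rare_non_benign(DOWNLOAD_TEXT))
```

To apply the same filter to a real download, read the file content with open(...).read() and pass it to the function.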

Note

The attributes produced by the annotation can be used in the Genotype Browser preview table as described in Getting Started with Preview Columns.

Getting Started with Phenotype Data

Importing phenotype data

The import_phenotypes tool is used to import phenotype data.

Note

All the data files needed for this example are available in the gpf-getting-started repository under the subdirectory input_phenotype_data.

The tool requires an import project, a YAML file describing the contents of the phenotype data to be imported, along with configuration options on how to import them.

As an example, we are going to show how to import simulated phenotype data into our GPF instance.

Inside the input_phenotype_data directory, the following data is provided:

  • pedigree.ped is the phenotype data pedigree file. input_phenotype_data/pedigree.ped:

    familyId    personId    dadId       momId       sex         status      role
    f1          f1.dad                              M           unaffected  dad
    f1          f1.mom                              F           unaffected  mom
    f1          f1.p1       f1.dad      f1.mom      M           affected    prb
    f1          f1.s1       f1.dad      f1.mom      F           unaffected  sib
    f2          f2.mom                              F           unaffected  mom
    f2          f2.dad                              M           unaffected  dad
    f2          f2.p1       f2.dad      f2.mom      F           affected    prb
    f2          f2.s1       f2.dad      f2.mom      M           unaffected  sib
    f3          f3.dad                              M           unaffected  dad
    f3          f3.mom                              F           unaffected  mom
    f3          f3.p1       f3.dad      f3.mom      M           affected    prb
    f3          f3.s1       f3.dad      f3.mom      F           unaffected  sib
    f4          f4.dad                              M           unaffected  dad
    f4          f4.mom                              F           unaffected  mom
    f4          f4.p1       f4.dad      f4.mom      F           affected    prb
    f4          f4.s1       f4.dad      f4.mom      M           unaffected  sib
    f5          f5.dad                              M           unaffected  dad
    f5          f5.mom                              F           unaffected  mom
    f5          f5.p1       f5.dad      f5.mom      M           affected    prb
    f5          f5.s1       f5.dad      f5.mom      F           unaffected  sib
    f6          f6.dad                              M           unaffected  dad
    f6          f6.mom                              F           unaffected  mom
    f6          f6.p1       f6.dad      f6.mom      F           affected    prb
    f6          f6.s1       f6.dad      f6.mom      M           unaffected  sib
    
  • instruments contains the phenotype instruments and measures to be imported. There are two instruments in the example:

    input_phenotype_data/instruments/basic_medical.csv

    personId    age         weight      height      race
    f1.dad      50          200         180         white
    f1.mom      23          160         170         white
    f1.s1       2           20          79          white
    f1.p1       4           40          80          white
    f2.dad      32          230         170         white
    f2.mom      30          153         165         white
    f2.s1       12          80          130         white
    f2.p1       3           30          70          white
    f3.dad      45          175         165         white
    f3.mom      41          173         154         white
    f3.s1       23          170         180         white
    f3.p1       7           80          120         white
    f4.dad      25          190         185         white
    f4.mom      35          200         150         white
    f4.s1       17          160         165         white
    f4.p1       5           50          100         white
    f5.dad      31          250         176         asian
    f5.mom      39          180         154         asian
    f5.s1       11          130         150         asian
    f5.p1       5           55          100         asian
    f6.dad      30          200         173         african american
    f6.mom      27          140         178         african american
    f6.s1       1           15          30          african american
    f6.p1       8           80          130         african american

    input_phenotype_data/instruments/iq.csv

    personId    verbal-iq   non-verbal-iq   diagnosis-notes
    f1.p1       98          70              originally diagnosed as Asperger
    f1.s1       98          102
    f2.p1       60          45              walked late
                                            severe seizures
    f2.s1       115         83
    f3.p1       108         93              excels at school
    f3.s1       97          90
    f4.p1       90          80
    f4.s1       107         91
    f5.p1       85          92
    f5.s1       105         115
    f6.p1       90          70              sleep abnormality
    f6.s1       95          101

  • measure_descriptions.tsv contains descriptions of the provided measures.

    input_phenotype_data/measure_descriptions.tsv:

    instrumentName      measureName         description
    basic_medical       age                 The individual's age in years
    basic_medical       weight              The individual's weight in pounds
    basic_medical       height              The individual's height in centimeters
    basic_medical       race                The individual's race
    iq                  verbal_iq           Verbal IQ
    iq                  non_verbal_iq       Non verbal IQ
    
  • import_project.yaml is the import project configuration that we will use to import this data.

    input_phenotype_data/import_project.yaml:

    id: mini_pheno
    
    instrument_files:
      - instruments/basic_medical.csv
      - instruments/iq.csv
    
    data_dictionary:
      files:
        - path: measure_descriptions.tsv
    
    pedigree: pedigree.ped
    
    person_column: personId
    
    study_config:
      regressions:
        reg_1:
          display_name: "Age"
          instrument_name: basic_medical
          measure_names:
            - age
          jitter: 0.1
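
To get a feeling for the data before importing it, the instrument files can be summarized with a short script. The following minimal sketch computes the mean of a measure separately for probands and siblings, using the values from instruments/iq.csv and the roles from pedigree.ped listed above. It is independent of GPF; the function name mean_by_role is hypothetical, and the sketch assumes that the instrument file is comma-separated.

```python
import csv
import io
from statistics import mean

# First three columns of instruments/iq.csv as listed above
# (assumption: the file is comma-separated).
IQ_TEXT = """\
personId,verbal-iq,non-verbal-iq
f1.p1,98,70
f1.s1,98,102
f2.p1,60,45
f2.s1,115,83
f3.p1,108,93
f3.s1,97,90
f4.p1,90,80
f4.s1,107,91
f5.p1,85,92
f5.s1,105,115
f6.p1,90,70
f6.s1,95,101
"""

# Roles as given in pedigree.ped: every *.p1 is a proband, every *.s1 a sibling.
ROLES = {f"f{i}.p1": "prb" for i in range(1, 7)}
ROLES |= {f"f{i}.s1": "sib" for i in range(1, 7)}


def mean_by_role(text, measure, roles):
    """Return the mean value of `measure` for each role that has data."""
    values = {}
    for row in csv.DictReader(io.StringIO(text)):
        role = roles.get(row["personId"])
        if role is not None and row[measure] != "":
            values.setdefault(role, []).append(float(row[measure]))
    return {role: mean(vals) for role, vals in values.items()}


if __name__ == "__main__":
    for measure in ("verbal-iq", "non-verbal-iq"):
        print(measure, mean_by_role(IQ_TEXT, measure, ROLES))
```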
    

Note

For more information on how to import phenotype data, see Phenotype Database Tools.

We will use the import_phenotypes tool to import the phenotype data. It will import the phenotype database directly to our GPF instance’s phenotype storage:

import_phenotypes input_phenotype_data/import_project.yaml

When the import finishes, you can run the GPF development server using:

wgpf run

Now, on the GPF instance Home Page, you should see the mini_pheno phenotype study.

../_images/mini-pheno-home-page.png

Home page with imported phenotype study

If you follow the link, you will see the Phenotype Browser tab with the imported data.

../_images/mini-pheno-phenotype-browser.png

Phenotype Browser tab with imported data

In the Phenotype Browser tab, you can search for phenotype instruments and measures, see the aggregated figures for the measures, and download selected instruments and measures.

Configure a genotype study to use phenotype data

To demonstrate how a study is configured with a phenotype database, we will be working with the already configured example_dataset dataset.

The phenotype databases can be attached to one or more studies and/or datasets. If you want to attach the mini_pheno phenotype study to the example_dataset dataset, you need to specify it in the dataset’s configuration file, which can be found at minimal_instance/datasets/example_dataset/example_dataset.yaml.

Add the following line to the configuration file:

phenotype_data: mini_pheno

When you restart the server, you should be able to see Phenotype Browser and Phenotype Tool tabs enabled for the Example Dataset dataset.

Additionally, in the Genotype Browser, the Family Filters and Person Filters sections will have the Pheno Measures filters enabled.

../_images/example-dataset-genotype-browser-pheno-filters.png

Example Dataset genotype browser using Pheno Measures family filters

Getting Started with Preview Columns

Configure genotype columns in Genotype Browser

Once you have annotated your variants, the additional attributes produced by the annotation can be displayed in the variants preview table.

In our example, the annotation produces three additional attributes:

  • gnomad_v4_genome_ALL_af

  • CLNSIG

  • CLNDN

Let us add these attributes to the variants preview table for the example_dataset dataset.

In the preview table, each column can show multiple values. In GPF, when you want to show multiple values in a single column, you need to define a column group.

The column group is a collection of attributes that are shown together in the preview table. The values in a column group are shown in a single cell.

By default, the study configuration includes several predefined column groups: family, variant, genotype, effect and frequency.

../_images/example-dataset-default-column-groups.png

Default column groups in the Preview Table

In the study configuration, you can define new column groups or redefine already existing ones. Let us redefine the existing column group frequency to include the gnomAD frequency and define a new column group clinvar to include the ClinVar attributes.

The column group is defined in the column_groups section of the configuration file.

Edit the example_dataset.yaml dataset configuration in minimal_instance/datasets/example_dataset and add the following section at the end of the configuration file:

 1  genotype_browser:
 2    column_groups:
 3      frequency:
 4        name: frequency
 5        columns:
 6        - allele_freq
 7        - gnomad_v4_genome_ALL_af
 8
 9      clinvar:
10        name: ClinVar
11        columns:
12        - CLNSIG
13        - CLNDN
14
15    preview_columns_ext:
16      - clinvar
In lines 3-7, we re-define the existing column group frequency to include the study frequency allele_freq and gnomAD frequency gnomad_v4_genome_ALL_af.

In lines 9-13, we define a new column group clinvar that contains the values of the annotation attributes CLNSIG and CLNDN.

In lines 15-16, we extend the preview table columns. The new column group clinvar will be added to the preview table.
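
A common mistake in this kind of configuration is to list a name in preview_columns_ext that is not defined anywhere. The following minimal sketch mirrors the configuration above as a Python dictionary and reports such names. It is an illustration only and not a GPF validation tool: the function name undefined_preview_groups is hypothetical, and a reported name may still be valid if it refers to one of the predefined column groups.

```python
# The genotype_browser section from above, written as a Python dictionary.
GENOTYPE_BROWSER = {
    "column_groups": {
        "frequency": {
            "name": "frequency",
            "columns": ["allele_freq", "gnomad_v4_genome_ALL_af"],
        },
        "clinvar": {
            "name": "ClinVar",
            "columns": ["CLNSIG", "CLNDN"],
        },
    },
    "preview_columns_ext": ["clinvar"],
}


def undefined_preview_groups(config):
    """Return preview_columns_ext entries that this configuration does not define."""
    defined = set(config.get("column_groups", {}))
    return [name for name in config.get("preview_columns_ext", []) if name not in defined]


if __name__ == "__main__":
    missing = undefined_preview_groups(GENOTYPE_BROWSER)
    print("not defined in this file:", missing or "none")
```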

If we now stop the wgpf tool and rerun it, we will be able to see the new columns in the preview table.

From the GPF instance Home Page, follow the link to the Example Dataset page and choose the Genotype Browser. Select all checkboxes in Present in Child, Present in Parent and Effect Types sections.

../_images/example-dataset-genotype-browser-extended-columns-filters.png

Then click the Table Preview button, and you will be able to see all the imported variants with their additional attributes coming from the annotation.

../_images/example-dataset-genotype-browser-extended-columns-variants.png

Example Dataset genotype browser displaying variants with additional columns gnomAD v4 and ClinVar.

Configure phenotype columns in Genotype Browser

The Genotype Browser allows you to add phenotype attributes to the table preview and the download file.

Phenotype attributes show values from a phenotype database that are associated with the displayed family variant. To configure such a column, you need to specify the following properties:

  • source - the measure ID whose values will be shown in the column;

  • role - the role of the person in the family for which we are going to show the phenotype measure value;

  • name - the display name of the column in the table.

Let’s add some phenotype columns to the Genotype Browser preview table in Example Dataset. To do this, you need to define them in the genotype_browser section of the dataset’s configuration file. We are going to modify the dataset configuration file minimal_instance/datasets/example_dataset/example_dataset.yaml:

 1  genotype_browser:
 2    columns:
 3      phenotype:
 4        prb_verbal_iq:
 5          role: prb
 6          name: Verbal IQ
 7          source: iq.verbal_iq
 8
 9        prb_non_verbal_iq:
10          role: prb
11          name: Non-Verbal IQ
12          source: iq.non_verbal_iq
13
14    column_groups:
15      frequency:
16        name: frequency
17        columns:
18        - allele_freq
19        - gnomad_v4_genome_ALL_af
20
21      clinvar:
22        name: ClinVar
23        columns:
24        - CLNSIG
25        - CLNDN
26
27      proband_iq:
28        name: Proband IQ
29        columns:
30        - prb_verbal_iq
31        - prb_non_verbal_iq
32
33    preview_columns_ext:
34      - clinvar
35      - proband_iq

Lines 2-12 define the two new columns with values coming from the phenotype data attributes:

  • prb_verbal_iq - is a column that uses the value of the phenotype measure iq.verbal_iq for the family proband. The display name of the column will be Verbal IQ;

  • prb_non_verbal_iq - is a column that uses the value of the phenotype measure iq.non_verbal_iq for the family proband. The display name of the column will be Non-Verbal IQ.

We want these two columns to be shown together in the preview table. To do this, we need to define a new column group. In lines 27-31, we define a column group called proband_iq that contains the columns prb_verbal_iq and prb_non_verbal_iq.

To add the new column group proband_iq to the preview table, we need to add it to the preview_columns_ext section of the configuration file. In line 35, we add the new column group proband_iq at the end of the preview table.
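As a rough illustration of how these pieces fit together, the following Python sketch expands the names listed in preview_columns_ext into the individual columns of each group. The expand_preview_columns function and the dictionary layout are assumptions made for this example; they mirror the YAML above but are not part of the GPF API.

```python
# Hypothetical in-memory mirror of part of the genotype_browser section above.
genotype_browser = {
    "column_groups": {
        "clinvar": {"name": "ClinVar", "columns": ["CLNSIG", "CLNDN"]},
        "proband_iq": {
            "name": "Proband IQ",
            "columns": ["prb_verbal_iq", "prb_non_verbal_iq"],
        },
    },
    "preview_columns_ext": ["clinvar", "proband_iq"],
}


def expand_preview_columns(config):
    """Return (display name, column IDs) pairs in preview order."""
    groups = config.get("column_groups", {})
    expanded = []
    for entry in config.get("preview_columns_ext", []):
        if entry in groups:
            group = groups[entry]
            expanded.append((group["name"], list(group["columns"])))
        else:
            # Not a group: treat the entry as a single stand-alone column.
            expanded.append((entry, [entry]))
    return expanded


for display_name, columns in expand_preview_columns(genotype_browser):
    print(display_name, "->", ", ".join(columns))
```

The sketch only shows the relationship between column groups and the preview list; GPF itself resolves these names when it loads the study configuration.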

After you restart the server, go to the Genotype Browser tab of the Example Dataset dataset and select all checkboxes in the Present in Child, Present in Parent, and Effect Types sections:

../_images/example-dataset-proband-iq-column-group-filters.png

When you click on the Table Preview button, you will be able to see the new column group proband_iq in the preview table.

../_images/example-dataset-proband-iq-column-group-variants.png

Example Dataset genotype browser using pheno measures columns

Note

For more on study configuration, see the GPF Study Configuration section.

Example import of real de Novo variants

Source of the data

As an example, let us import de novo variants from the following paper: Yoon, S., et al. Rates of contributory de novo mutation in high and low-risk autism families. Commun Biol 4, 1026 (2021).

We will focus on de novo variants from the SSC collection published in the paper mentioned above.

To import these variants into the GPF system, we need a pedigree file describing the families and a list of de novo variants.

From the supplementary data for the paper, you can download the following files:

Note

All the data files needed for this example are available in the gpf-getting-started repository under the subdirectory example_imports/denovo_and_cnv_import.

Preprocess the Family Data

The list of children in Supplementary_Data_1.tsv.gz contains a lot of data that is not relevant for the import. We are going to use only the first five columns from that file that look as follows:

gunzip -c Supplementary_Data_1.tsv.gz | head | cut -f 1-5 | less -S -x 20
collection          familyId            personId            affected status     sex
SSC                 11000               11000.p1            affected            M
SSC                 11000               11000.s1            unaffected          F
SSC                 11003               11003.p1            affected            M
SSC                 11003               11003.s1            unaffected          F
SSC                 11004               11004.p1            affected            M
SSC                 11004               11004.s1            unaffected          M
SSC                 11006               11006.p1            affected            M
SSC                 11006               11006.s1            unaffected          M
SSC                 11008               11008.p1            affected            M

  • The first column contains the collection. This study includes data from the SSC and AGRE collections. We are going to import only variants from the SSC collection.

  • The second column contains the family ID.

  • The third column contains the person’s ID.

  • The fourth column contains the affected status of the individual.

  • The fifth column contains the sex of the individual.

We need a pedigree file describing the family structure to import the data into GPF. The Supplementary_Data_1.tsv.gz file contains only the children; it does not include information about their parents. Fortunately, for the SSC collection it is not difficult to build the full family structures from the information we have.

So, before starting the work on the import, we need to preprocess the list of children and transform it into a pedigree file.

For the SSC collection, if you have a family with ID <fam_id>, then the identifiers of the individuals in the family are formed as follows:

  • mother - <fam_id>.mo;

  • father - <fam_id>.fa;

  • proband - <fam_id>.p1;

  • first sibling - <fam_id>.s1;

  • second sibling - <fam_id>.s2.

Another essential restriction for SSC is that the only affected person in the family is the proband. The affected status of the mother, father, and siblings is unaffected.

With this information, we can use the following Awk script to transform the list of children into a pedigree file:

gunzip -c Supplementary_Data_1.tsv.gz | awk '
    BEGIN {
        OFS="\t"
        print "familyId", "personId", "dadId", "momId", "status", "sex"
    }
    # Keep only SSC rows; this also skips the header line.
    $1 == "SSC" {
        fid = $2
        # On the first child of a family, emit the two parents.
        if (!(fid in families)) {
            families[fid] = 1
            print fid, fid".mo", "0", "0", "unaffected", "F"
            print fid, fid".fa", "0", "0", "unaffected", "M"
        }
        # Emit the child with links to both parents.
        print fid, $3, fid".fa", fid".mo", $4, $5
    }' > ssc_denovo.ped

If we run this script, it will read Supplementary_Data_1.tsv.gz and produce the appropriate pedigree file ssc_denovo.ped.
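For readers who prefer Python, the same transformation can be sketched as follows. This is an illustrative equivalent of the Awk script under the same assumptions (SSC naming scheme, unaffected parents); the children_to_pedigree function name is made up for this example and the sample rows are copied from the listing above.

```python
import csv
import io


def children_to_pedigree(rows):
    """Build pedigree records from (collection, family, person, status, sex) rows."""
    pedigree = [("familyId", "personId", "dadId", "momId", "status", "sex")]
    seen_families = set()
    for collection, family_id, person_id, status, sex in rows:
        if collection != "SSC":
            continue  # keep only the SSC collection, as the Awk script does
        mom_id, dad_id = f"{family_id}.mo", f"{family_id}.fa"
        if family_id not in seen_families:
            seen_families.add(family_id)
            # Parents are not listed among the children, so add them once.
            pedigree.append((family_id, mom_id, "0", "0", "unaffected", "F"))
            pedigree.append((family_id, dad_id, "0", "0", "unaffected", "M"))
        pedigree.append((family_id, person_id, dad_id, mom_id, status, sex))
    return pedigree


# Two children copied from the listing above.
sample = "SSC\t11000\t11000.p1\taffected\tM\nSSC\t11000\t11000.s1\tunaffected\tF\n"
rows = [row[:5] for row in csv.reader(io.StringIO(sample), delimiter="\t")]
for record in children_to_pedigree(rows):
    print("\t".join(record))
```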

Note

The resulting pedigree file is also available in the gpf-getting-started repository under the subdirectory example_imports/denovo_and_cnv_import.

Here is a fragment from the resulting pedigree file:

familyId       personId       dadId          momId          status         sex
11000          11000.mo       0              0              unaffected     F
11000          11000.fa       0              0              unaffected     M
11000          11000.p1       11000.fa       11000.mo       affected       M
11000          11000.s1       11000.fa       11000.mo       unaffected     F
11003          11003.mo       0              0              unaffected     F
11003          11003.fa       0              0              unaffected     M
11003          11003.p1       11003.fa       11003.mo       affected       M
11003          11003.s1       11003.fa       11003.mo       unaffected     F
11004          11004.mo       0              0              unaffected     F
11004          11004.fa       0              0              unaffected     M

Preprocess the SNP and INDEL de Novo variants

The Supplementary_Data_2.tsv.gz file contains 255232 variants. For the import, we will use columns four and nine from this file:

gunzip -c Supplementary_Data_2.tsv.gz | head | cut -f 4,9 | less -S -x 20
personIds           variant in VCF format
13210.p1            chr1:184268:G:A
12782.s1            chr1:191408:G:A
12972.s1            chr1:271774:AG:A
12420.p1            chr1:484721:AG:A
12518.p1,12518.s1   chr1:691130:T:C
13882.p1            chr1:738645:C:G
14039.s1            chr1:819832:G:T
13872.p1            chr1:824001:AAAAT:A

Using the following Awk script, we can transform this file into an easy-to-import list of de Novo variants:

gunzip -c Supplementary_Data_2.tsv.gz | cut -f 4,9 | awk '
    BEGIN {
        OFS="\t"
        print "chrom", "pos", "ref", "alt", "person_id"
    }
    # Skip the header line; $1 is personIds, $2 is chrom:pos:ref:alt.
    NR > 1 {
        split($2, v, ":")
        print v[1], v[2], v[3], v[4], $1
    }' > ssc_denovo.tsv

This script will produce a file named ssc_denovo.tsv with the following content:

chrom          pos            ref            alt            person_id
chr1           184268         G              A              13210.p1
chr1           191408         G              A              12782.s1
chr1           271774         AG             A              12972.s1
chr1           484721         AG             A              12420.p1
chr1           691130         T              C              12518.p1,12518.s1
chr1           738645         C              G              13882.p1
chr1           819832         G              T              14039.s1
chr1           824001         AAAAT          A              13872.p1
chr1           826779         T              C              12132.s1
chr1           834505         G              A              13801.p1
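The core of this transformation is splitting the colon-separated variant description into its four parts. A minimal Python sketch of that step is shown below; the split_variant helper is a name chosen for this example, and the sample values come from the listing above.

```python
def split_variant(person_ids, variant):
    """Split 'chrom:pos:ref:alt' into the columns used in ssc_denovo.tsv."""
    # Unpacking raises ValueError if the description does not have four parts.
    chrom, pos, ref, alt = variant.split(":")
    return chrom, int(pos), ref, alt, person_ids


print(split_variant("12518.p1,12518.s1", "chr1:691130:T:C"))
```

Note that a variant shared by several individuals keeps the comma-separated list of person IDs in a single column, exactly as in the file produced by the Awk script.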

Note

The resulting ssc_denovo.tsv file is also available in the gpf-getting-started repository under the subdirectory example_imports/denovo_and_cnv_import/input_data.

Caching GRR

Now we are about to import 255K variants. During the import, the GPF system will annotate these variants using the GRR resources from our public GRR. For small studies with few variants, this approach is quite convenient. However, for larger studies, it is better to cache the GRR resources locally.

To do this, we need to configure GPF to use a local cache. Create a file named .grr_definition.yaml in your home directory with the following content:

id: "seqpipe"
type: "url"
url: "https://grr.iossifovlab.com"
cache_dir: "<path_to_your_cache_dir>"

The cache_dir parameter specifies the directory where the GRR resources will be cached. The cache directory should be specified as an absolute path. For example, /tmp/grr_cache or /Users/lubo/grrCache.
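If you generate this file from a script, it can help to check that the cache directory is absolute before writing it. The sketch below only renders the file content as a string; the grr_definition function is a hypothetical helper written for this guide, not a GPF tool.

```python
from pathlib import Path


def grr_definition(cache_dir):
    """Render the .grr_definition.yaml content with an absolute cache_dir."""
    cache_path = Path(cache_dir).expanduser()
    if not cache_path.is_absolute():
        raise ValueError(f"cache_dir must be an absolute path: {cache_dir}")
    return (
        'id: "seqpipe"\n'
        'type: "url"\n'
        'url: "https://grr.iossifovlab.com"\n'
        f'cache_dir: "{cache_path}"\n'
    )


print(grr_definition("/tmp/grr_cache"), end="")
```

The returned text can then be saved as .grr_definition.yaml in your home directory.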

To download all the resources needed for our minimal_instance annotation, run the following command from the gpf-getting-started directory:

grr_cache_repo -i minimal_instance/gpf_instance.yaml

Note

The grr_cache_repo command will download all the resources needed for the GPF instance. This may take a while, depending on your internet connection and the number of resources your configuration requires.

The resources will be downloaded to the directory specified in the cache_dir parameter in the .grr_definition.yaml file.

For the gpf-getting-started repository, the resources that will be downloaded are:

  • hg38/genomes/GRCh38-hg38

  • hg38/gene_models/MANE/1.3

  • hg38/variant_frequencies/gnomAD_4.1.0/genomes/ALL

  • hg38/scores/ClinVar_20240730

The total size of the downloaded resources is about 15 GB.

Data Import of ssc_denovo

Now we have a pedigree file, ssc_denovo.ped, and a list of de novo variants, ssc_denovo.tsv. To import this data we need to prepare an import project. The import project is already available in the example imports directory example_imports/denovo_and_cnv_import/ssc_denovo.yaml:

 1id: ssc_denovo
 2
 3input:
 4  pedigree:
 5    file: ssc_denovo.ped
 6
 7  denovo:
 8    files:
 9    - ssc_denovo.tsv
10
11processing_config:
12  denovo: chromosome

When importing genotype data, we often need to instruct the import tool how to split the import process into multiple jobs. For this purpose, we can use the processing_config section of the import project. On lines 11-12 of the ssc_denovo.yaml file, we have defined a processing_config section that splits the import of de Novo variants into jobs by chromosome. (For more on import project configuration, see Import Tools.)
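Conceptually, splitting by chromosome means that the variants are partitioned so that each chromosome can be processed as a separate job. The following Python sketch illustrates that idea on records shaped like the rows of ssc_denovo.tsv; it is an illustration only and does not reflect how import_genotypes is implemented.

```python
from collections import defaultdict


def split_by_chromosome(variants):
    """Group (chrom, pos, ref, alt, person_id) records by chromosome."""
    jobs = defaultdict(list)
    for variant in variants:
        jobs[variant[0]].append(variant)
    return dict(jobs)


# The first record is taken from ssc_denovo.tsv; the second is made up.
variants = [
    ("chr1", 184268, "G", "A", "13210.p1"),
    ("chr2", 1000, "C", "T", "hypothetical.p1"),
]
for chrom, chunk in split_by_chromosome(variants).items():
    print(chrom, len(chunk))
```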

Note

The project file ssc_denovo.yaml is available in the gpf-getting-started repository under the subdirectory example_imports/denovo_and_cnv_import.

To import the study, from the gpf-getting-started directory we should run:

time import_genotypes -v -j 10 example_imports/denovo_and_cnv_import/ssc_denovo.yaml

The -j 10 option instructs the import_genotypes tool to use 10 threads and the -v option controls the verbosity of the output.

This command will take a while to run. The time it takes to run will depend on the number of variants in the input file and the number of threads used for the import.

Note

For example, on a MacBook Pro with the Apple M1 Pro chip, the import of the SSC de Novo variants took about 5 minutes:

real    5m29.950s
user    31m52.320s
sys     1m41.755s

When the import finishes, we can run the development GPF server:

wgpf run

On the home page of the GPF instance, we should see the new study ssc_denovo.

../_images/ssc_denovo_home_page.png

The home page has the imported SSC de Novo study.

If you follow the link to the study and choose the Genotype Browser tab, you will be able to query the imported variants.

../_images/ssc_denovo_genotype_browser.png

Genotype browser for the SSC de novo variants.

Configure preview and download columns

While importing the SSC de novo variants, we used the annotation defined in the minimal instance configuration file. So, all imported variants are annotated with gnomAD and ClinVar genomic scores.

We can use these score values to define additional columns in the preview table and the download file, similar to what we did in Getting Started with Preview Columns.

Edit the ssc_denovo configuration file located at minimal_instance/studies/ssc_denovo/ssc_denovo.yaml and add the following snippet to the configuration file:

genotype_browser:
  column_groups:
    frequency:
      name: frequency
      columns:
      - allele_freq
      - gnomad_v4_genome_ALL_af

    clinvar:
      name: ClinVar
      columns:
      - CLNSIG
      - CLNDN

  preview_columns_ext:
    - clinvar

Now, restart the GPF development server:

wgpf run

Go to the Genotype Browser tab of the ssc_denovo study and click the Table Preview button. The preview table should now contain the additional columns for gnomAD and ClinVar genomic scores.

../_images/ssc_denovo_genotype_browser_with_annotated_columns.png

Genotype browser with additional columns for gnomAD and ClinVar genomic scores.

Example import of real CNV variants

Source of the data

As an example for the import of CNV variants, we will use data from the following paper: Yoon, S., et al. Rates of contributory de novo mutation in high and low-risk autism families. Commun Biol 4, 1026 (2021).

We already discussed the import of de Novo variants from this paper in Example import of real de Novo variants.

Now we will focus on the import of CNV variants from the same paper.

To import these variants into the GPF system, we need a pedigree file describing the families and a list of CNV variants.

From the supplementary data for the paper, you can download the following files:

Note

All the data files needed for this example are available in the gpf-getting-started repository under the subdirectory example_imports/denovo_and_cnv_import.

We already discussed how to transform the list of children into a pedigree file in the Preprocess the Family Data section.

Now we need to prepare the CNV variants file.

Preprocess the CNV variants

The Supplementary_Data_4.tsv.gz file contains 376 CNV variants from SSC and AGRE collections.

For the import, we will use columns two, five, six, and seven:

gunzip -c Supplementary_Data_4.tsv.gz | cut -f 2,5-7 | less -S -x 25
collection               personIds                location                 variant
SSC                      12613.p1                 chr1:1305145-1314126     duplication
AGRE                     AU2725301                chr1:3069177-4783791     duplication
SSC                      13424.s1                 chr1:3975501-3977800     deletion
SSC                      12852.p1                 chr1:6647401-6650500     deletion
SSC                      13776.p1                 chr1:8652301-8657600     deletion
SSC                      13373.s1                 chr1:9992001-9994100     deletion
SSC                      14198.p1                 chr1:12224601-12227300   deletion
SSC                      13259.p1                 chr1:15687701-15696200   deletion
SSC                      14696.s1                 chr1:30388501-30398807   deletion

Using the following Awk script, we keep only the variants from the SSC collection:

gunzip -c Supplementary_Data_4.tsv.gz | cut -f 2,5-7 | awk '
    BEGIN{
        OFS="\t"
        print "location", "variant", "person_id"
    }
    $1 == "SSC" {
        print $3, $4, $2
    }' > ssc_cnv.tsv

This script will produce a file named ssc_cnv.tsv with the following content:

location                      variant                       person_id
chr1:1305145-1314126          duplication                   12613.p1
chr1:3975501-3977800          deletion                      13424.s1
chr1:6647401-6650500          deletion                      12852.p1
chr1:8652301-8657600          deletion                      13776.p1
chr1:9992001-9994100          deletion                      13373.s1
chr1:12224601-12227300        deletion                      14198.p1
chr1:15687701-15696200        deletion                      13259.p1
chr1:30388501-30398807        deletion                      14696.s1
chr1:40513501-40534200        deletion                      14534.p1
chr1:40513501-40534200        deletion                      14534.s1
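The same filtering and column reordering can be sketched in Python. The ssc_cnv_rows generator below is an illustrative name, and the two sample rows are copied from the listing of Supplementary_Data_4.tsv.gz above.

```python
def ssc_cnv_rows(rows):
    """Yield (location, variant, person_id) for rows from the SSC collection."""
    for collection, person_ids, location, variant in rows:
        if collection == "SSC":
            yield location, variant, person_ids


rows = [
    ("SSC", "12613.p1", "chr1:1305145-1314126", "duplication"),
    ("AGRE", "AU2725301", "chr1:3069177-4783791", "duplication"),
]
for record in ssc_cnv_rows(rows):
    print("\t".join(record))
```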

Note

The resulting ssc_cnv.tsv file is available in the gpf-getting-started repository under the subdirectory example_imports/denovo_and_cnv_import/input_data.

Data Import of ssc_cnv

Now we have a pedigree file, ssc_denovo.ped, and a list of CNV variants, ssc_cnv.tsv. To import the data we need an import project. The import project for the ssc_cnv data is already available in the examples directory example_imports/denovo_and_cnv_import/ssc_cnv.yaml:

 1id: ssc_cnv
 2
 3input:
 4  pedigree:
 5    file: ssc_denovo.ped
 6
 7  cnv:
 8    files:
 9    - ssc_cnv.tsv
10
11    location: location
12    variant_type: variant
13    plus_values: duplication
14    minus_values: deletion
15    person_id: person_id

Lines 12-14 configure how the CNV variant type is read from the input file. The variant_type option names the column that holds the type of the variant (here, variant), and plus_values and minus_values list the values in that column that mark a duplication and a deletion, respectively.
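To make the role of these options more concrete, here is a small Python sketch that interprets one row of ssc_cnv.tsv using the same values. The interpret_cnv function and the "gain" and "loss" labels are assumptions chosen for this illustration; GPF uses its own internal representation of CNV types.

```python
# Values taken from lines 13-14 of ssc_cnv.yaml above.
PLUS_VALUES = {"duplication"}
MINUS_VALUES = {"deletion"}


def interpret_cnv(location, variant):
    """Return (chrom, start, end, label) for one row of ssc_cnv.tsv."""
    chrom, span = location.split(":")
    start, end = (int(coord) for coord in span.split("-"))
    if variant in PLUS_VALUES:
        label = "gain"  # illustrative label for a duplication
    elif variant in MINUS_VALUES:
        label = "loss"  # illustrative label for a deletion
    else:
        raise ValueError(f"unknown CNV type: {variant!r}")
    return chrom, start, end, label


print(interpret_cnv("chr1:1305145-1314126", "duplication"))
```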

Note

The project file ssc_cnv.yaml is available in the gpf-getting-started repository under the subdirectory example_imports/denovo_and_cnv_import.

To import the study, from the gpf-getting-started directory we should run:

time import_genotypes -v -j 1 example_imports/denovo_and_cnv_import/ssc_cnv.yaml

When the import finishes, we can run the development GPF server:

wgpf run

On the home page of the GPF instance, we should see the new study ssc_cnv.

../_images/ssc_cnv_home_page.png

Home page with the imported ssc_cnv study.

If you follow the link to the ssc_cnv study and choose the Genotype Browser tab, you will be able to query the imported CNV variants.

../_images/ssc_cnv_genotype_browser.png

Genotype browser for the SSC CNV variants.

Getting Started with Gene Sets

The GPF system provides support for the collection of gene symbols of interest for the analysis of genotype data. There are two types of gene sets that can be used in GPF:

  • de Novo gene sets - for each genotype study that has de Novo variants, the GPF system can create gene sets that contain a list of genes with de Novo variants of interest; for example, genes with LGDs de Novo variants, genes with LGDs de Novo variants in males, etc.

  • pre-defined gene sets - these are gene sets that are defined in the GRR used by the GPF instance; for example, in the public GPF Genomic Resources Repository (GRR) there are multiple gene set collections ready for use in the GPF instance.

De Novo Gene Set

By default, for each genotype study with de Novo variants, the GPF system creates a collection of de Novo gene sets with pre-defined properties. For example:

  • LGDs - genes with LGDs de Novo variants;

  • LGDs.Female - genes with LGDs de Novo variants in females;

  • LGDs.Male - genes with LGDs de Novo variants in males;

  • Missense - genes with missense de Novo variants;

  • Missense.Female - genes with missense de Novo variants in females;

  • Missense.Male - genes with missense de Novo variants in males;

  • etc.
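The naming scheme above can be pictured as grouping genes by effect group and, optionally, by the sex of the carrier. The following Python sketch shows that grouping on a few made-up records; the denovo_gene_sets function and the toy records are hypothetical and only illustrate the idea, not GPF's actual implementation.

```python
from collections import defaultdict


def denovo_gene_sets(records):
    """Map (gene, effect_group, sex) records to {gene set name: genes}."""
    gene_sets = defaultdict(set)
    for gene, effect_group, sex in records:
        gene_sets[effect_group].add(gene)
        gene_sets[f"{effect_group}.{sex}"].add(gene)
    return dict(gene_sets)


# Toy records invented for this example; they are not real study data.
records = [
    ("CHD8", "LGDs", "Male"),
    ("DSCAM", "LGDs", "Female"),
    ("CHD8", "Missense", "Female"),
]
for name, genes in sorted(denovo_gene_sets(records).items()):
    print(name, sorted(genes))
```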

You can use these gene sets in multiple tools in the GPF system. For example, if you navigate to the Genotype Browser for the ssc_denovo study and select the Genes > Gene Sets tab, you will see the list of de Novo gene sets generated for the study.

../_images/ssc_denovo_denovo_gene_sets.png

De Novo Gene Sets for ssc_denovo study

You can use these gene sets in the Genotype Browser to filter the variants in genes that are included in the selected gene set.

Pre-defined Gene Set Collections

To add pre-defined gene sets from the GRR to the GPF instance, you need to edit the GPF instance configuration file (minimal_instance/gpf_instance.yaml).

Let’s say that we want to add the following gene set collections from the public GRR:

To do this, you need to add lines 14-18 to the GPF instance configuration file (minimal_instance/gpf_instance.yaml):

 1instance_id: minimal_instance
 2
 3reference_genome:
 4  resource_id: "hg38/genomes/GRCh38-hg38"
 5
 6gene_models:
 7  resource_id: "hg38/gene_models/MANE/1.3"
 8
 9annotation:
10  config:
11    - allele_score: hg38/variant_frequencies/gnomAD_4.1.0/genomes/ALL
12    - allele_score: hg38/scores/ClinVar_20240730
13
14gene_sets_db:
15  gene_set_collections:
16  - gene_properties/gene_sets/autism
17  - gene_properties/gene_sets/relevant
18  - gene_properties/gene_sets/GO_2024-06-17_release

When you restart the GPF instance, the configured gene set collections will be available in the GPF instance user interface. For example, if you navigate to the Genotype Browser for the ssc_denovo study and select the Genes > Gene Sets tab, you will see the configured gene set collections.

../_images/ssc_denovo_gene_set_collections.png

Gene Set Collections in the ssc_denovo Genotype Browser interface

Pre-defined Gene Scores

To add pre-defined gene scores from the GRR to the GPF instance, you need to edit the GPF instance configuration file (minimal_instance/gpf_instance.yaml).

Let’s say that we want to add the following gene scores from the public GRR:

To do this, you need to add lines 20-26 to the GPF instance configuration file (minimal_instance/gpf_instance.yaml):

 1instance_id: minimal_instance
 2
 3reference_genome:
 4  resource_id: "hg38/genomes/GRCh38-hg38"
 5
 6gene_models:
 7  resource_id: "hg38/gene_models/MANE/1.3"
 8
 9annotation:
10  config:
11    - allele_score: hg38/variant_frequencies/gnomAD_4.1.0/genomes/ALL
12    - allele_score: hg38/scores/ClinVar_20240730
13
14gene_sets_db:
15  gene_set_collections:
16  - gene_properties/gene_sets/autism
17  - gene_properties/gene_sets/relevant
18  - gene_properties/gene_sets/GO_2024-06-17_release
19
20gene_scores_db:
21  gene_scores:
22  - gene_properties/gene_scores/Satterstrom_Buxbaum_Cell_2020
23  - gene_properties/gene_scores/Iossifov_Wigler_PNAS_2015
24  - gene_properties/gene_scores/LGD
25  - gene_properties/gene_scores/RVIS
26  - gene_properties/gene_scores/LOEUF

When you restart the GPF instance, the configured gene scores will be available in the GPF instance user interface. For example, if you navigate to the Genotype Browser for the ssc_denovo study and select the Genes > Gene Scores tab, you will see the configured gene scores.

../_images/ssc_denovo_gene_scores.png

Gene Scores in the ssc_denovo Genotype Browser interface

Getting Started with Enrichment Tool

By default, for each genotype study with de Novo variants, the GPF system enables the Enrichment Tool.

The Enrichment Tool allows the user to test if a given set of genes is affected by more or fewer de novo mutations in the children than expected.

To use the Enrichment Tool, a user must choose a set of genes either by selecting one of the gene sets that have already been configured in GPF or by providing their own list of gene symbols.

The user also must select among the background models that GPF uses to compute the expected number of de novo mutations within the given dataset.
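To give a feeling for what "more or fewer than expected" means, the sketch below computes a generic two-sided binomial p-value for an observed count given an expected rate. This is a textbook illustration written for this guide; it is not the computation GPF performs, and the binomial_two_sided function and the example numbers are hypothetical.

```python
from math import comb


def binomial_two_sided(observed, total, expected_rate):
    """Two-sided binomial p-value: sum over outcomes no more likely than observed."""

    def pmf(k):
        return comb(total, k) * expected_rate**k * (1 - expected_rate) ** (total - k)

    # A small relative tolerance guards against floating-point ties.
    threshold = pmf(observed) * (1 + 1e-7)
    p_value = sum(pmf(k) for k in range(total + 1) if pmf(k) <= threshold)
    return min(1.0, p_value)


# Hypothetical numbers: 3 events in the gene set out of 100 events overall,
# where the background model would assign the gene set a rate of 1%.
print(binomial_two_sided(observed=3, total=100, expected_rate=0.01))
```

A small p-value suggests that the observed count differs from what the background model predicts; the background model determines the expected rate.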

Note

By default, for studies with de Novo variants, only one background model is configured: enrichment/samocha_background

To use other background models, the user must edit the study configuration file.

If you navigate to the Enrichment Tool page for the ssc_denovo study, you will be able to use the tool to run different tests.

../_images/ssc_denovo_enrichment_tool.png

Enrichment Tool for ssc_denovo study.

Getting Started with Gene Profiles

The Gene Profile tool provides summary statistics of the data managed by GPF and additional relevant information organized by gene.

To enable the Gene Profile tool, you need to create a configuration for the tool and add it to the GPF instance configuration file.

Let us create a configuration for the Gene Profile tool in the GPF instance directory, minimal_instance/gene_profiles.yaml, with the following content:

 1datasets:
 2  ssc_denovo:
 3    statistics:
 4    - id: denovo_lgds
 5      description: de Novo LGDs
 6      display_name: dn LGDs
 7      effects:
 8      - LGDs
 9      category: denovo
10    - id: denovo_missense
11      description: de Novo missense
12      display_name: dn mis
13      effects:
14      - missense
15      category: denovo
16      default_visible: true
17    - id: denovo_intronic_indels
18      description: number of de Novo intronic indels
19      display_name: dn IIND
20      effects:
21      - intron
22      category: denovo
23      variant_types:
24      - ins
25      - del
26      default_visible: true
27    person_sets:
28    - set_name: affected
29      collection_name: status
30      default_visible: true
31    - set_name: unaffected
32      collection_name: status
33      default_visible: true
34
35gene_scores:
36- category: autism_scores
37  display_name: Autism Gene Scores
38  scores:
39  - score_name: Satterstrom Buxbaum Cell 2020 qval
40    format: "%%.2f"
41  - score_name: Iossifov Wigler PNAS 2015 post noaut
42    format: "%%.2f"
43
44- category: protection_scores
45  display_name: Protection Gene Scores
46  scores:
47    - score_name: RVIS_rank
48      format: "%%s"
49    - score_name: LGD_rank
50      format: "%%s"
51    - score_name: LOEUF_rank
52      format: "%%s"
53
54gene_sets:
55- category: autism_gene_sets
56  display_name: Autism Gene Sets
57  sets:
58  - set_id: autism candidates from Iossifov PNAS 2015
59    collection_id: autism
60  - set_id: autism candidates from Sanders Neuron 2015
61    collection_id: autism
62  - set_id: Yuen Scherer Nature 2017
63    collection_id: autism
64  - set_id: Turner Eichler ajhg 2019
65    collection_id: autism
66  - set_id: Satterstrom Buxbaum Cell 2020 top
67    collection_id: autism
68
69gene_links:
70- name: Gene Browser
71  url: "datasets/ssc_denovo/gene-browser/{gene}"
72
73- name: GeneCards
74  url: "https://www.genecards.org/cgi-bin/carddisp.pl?gene={gene}"
75- name: SFARI gene
76  url: "https://gene.sfari.org/database/human-gene/{gene}"
77
78default_dataset: ssc_denovo
79
80order:
81- autism_gene_sets_rank
82- autism_scores
83- ssc_denovo
84- protection_scores

There are several sections in this configuration file:

  • datasets: This section defines the studies and datasets that will be used to collect variants statistics. In our example we are going to use the ssc_denovo study - see lines 2-33.

    • For each study or dataset we should define what type of variant statistics we want to collect. In our example we are going to collect three types of statistics:

      • Count of LGDs de Novo variants for each gene - lines 4-9;

      • Count of missense de Novo variants for each gene - lines 10-16;

      • Count of intronic INDEL variants for each gene - lines 17-26.

    • For each study or dataset we should define how to split individuals into groups. In our example we are going to split them into two groups - affected and unaffected - lines 27-33.

  • gene_scores: This section defines groups of gene scores that will be used in gene profiles. The gene profiles will include score values for each gene from the defined gene scores. In our example we are going to use two groups of gene scores:

    • autism_scores - lines 36-42;

    • protection_scores - lines 44-52.

    Please note that all gene scores used in this configuration section should be defined in the GPF instance configuration file.

  • gene_sets: This section defines groups of gene sets that will be used in gene profiles. The gene profiles will show if the gene is included in the defined gene sets. In our example we are going to use one group of gene sets:

    • autism_gene_sets - lines 54-67;

      Please note that all gene sets used in this configuration section should be defined in the GPF instance configuration file.

  • gene_links: This section defines links to internal and external tools that contain information about genes. In our example we are defining three links:

    • lines 70-71 - link to the GPF Gene Browser tool;

    • lines 73-74 - link to the GeneCards site;

    • lines 75-76 - link to the SFARI Gene site.
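Each URL in the gene_links section contains a {gene} placeholder that stands for the gene symbol. The Python sketch below shows how such a template expands for a given gene; using str.format here is an assumption made for illustration, and GPF may perform the substitution differently.

```python
# The two external links from the gene_links section above.
gene_links = [
    {
        "name": "GeneCards",
        "url": "https://www.genecards.org/cgi-bin/carddisp.pl?gene={gene}",
    },
    {
        "name": "SFARI gene",
        "url": "https://gene.sfari.org/database/human-gene/{gene}",
    },
]


def links_for(gene, links):
    """Fill the {gene} placeholder of every link template."""
    return {link["name"]: link["url"].format(gene=gene) for link in links}


for name, url in links_for("CHD8", gene_links).items():
    print(name, url)
```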

Once we have this configuration we need to add it to the GPF instance configuration:

 1instance_id: minimal_instance
 2
 3reference_genome:
 4  resource_id: "hg38/genomes/GRCh38-hg38"
 5
 6gene_models:
 7  resource_id: "hg38/gene_models/MANE/1.3"
 8
 9annotation:
10  config:
11    - allele_score: hg38/variant_frequencies/gnomAD_4.1.0/genomes/ALL
12    - allele_score: hg38/scores/ClinVar_20240730
13
14gene_sets_db:
15  gene_set_collections:
16  - gene_properties/gene_sets/autism
17  - gene_properties/gene_sets/relevant
18  - gene_properties/gene_sets/GO_2024-06-17_release
19
20gene_scores_db:
21  gene_scores:
22  - gene_properties/gene_scores/Satterstrom_Buxbaum_Cell_2020
23  - gene_properties/gene_scores/Iossifov_Wigler_PNAS_2015
24  - gene_properties/gene_scores/LGD
25  - gene_properties/gene_scores/RVIS
26  - gene_properties/gene_scores/LOEUF
27
28gene_profiles_config:
29  conf_file: gene_profiles.yaml

Once we have configured the GPF Gene Profiles, we need to prebuild the gene profiles. This is done using the generate_gene_profile command. By default, the generate_gene_profile command will generate profiles for all genes in the GPF instance gene models.

The gene models we are using in our example, hg38/gene_models/MANE/1.3, have 19,285 genes. Please note that generating gene profiles for all genes will take a while. On a MacBook Pro M1 with 32GB of RAM, it took about 10 minutes.

generate_gene_profile

Note

If you want to speed up the process of generating gene profiles, you can limit the number of genes for which the profiles will be generated. For example, in the following command we are generating gene profiles for a list of ten genes:

generate_gene_profile \
    --genes \
    CHD8,NCKAP1,DSCAM,ANK2,GRIN2B,SYNGAP1,ARID1B,MED13L,GIGYF1,WDFY3

Once the generation of gene profiles is finished, you can start the GPF instance using the wgpf command:

wgpf run

On the home page of the GPF instance, you should be able to see the Gene Profiles tool:

../_images/gene_profiles_home_page.png

Gene Profiles tool links added to the GPF instance home page

If you follow the All Genes link from the Home Page, you will be taken to the Gene Profiles table with information about genes.

../_images/gene_profiles.png

Gene Profiles table with summary information about genes

If you select a gene from the table, GPF will open the Gene Profile page for the selected gene.

../_images/gene_profile_page.png

Gene Profile page for the CHD8 gene

Note

For more information about the Gene Profile tool, please refer to the user interface documentation Gene Profiles.

Example Usage of GPF Python Interface

The simplest way to start using GPF’s Python API is to import the GPFInstance class and instantiate it:

from dae.gpf_instance.gpf_instance import GPFInstance
gpf_instance = GPFInstance.build()

This gpf_instance object groups several interfaces, each dedicated to managing different parts of the underlying data. It can be used to interact with the system as a whole.

Querying genotype data

For example, to list all studies configured in the startup GPF instance, use:

gpf_instance.get_genotype_data_ids()

This will return a list with the IDs of all configured studies:

['ssc_denovo', 'denovo_example', 'vcf_example', 'ssc_cnv', 'example_dataset']

To get a specific study and query it, you can use:

st = gpf_instance.get_genotype_data('example_dataset')
vs = list(st.query_variants())

Note

The query_variants method returns a Python iterator.
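One practical consequence of working with an iterator is that it can be consumed only once, which is why the example above wraps the result in list(). The following sketch demonstrates this with a stand-in generator; fake_query is a hypothetical placeholder and does not call GPF.

```python
def fake_query():
    """Stand-in for an iterator-returning call; not part of GPF."""
    yield from ["v1", "v2", "v3"]


result = fake_query()
first_pass = list(result)
second_pass = list(result)  # the iterator is already exhausted here
print(len(first_pass), len(second_pass))
```

If you need to go over the variants more than once, keep the list rather than the iterator.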

To get the basic information about variants found by the query_variants method, you can use:

for v in vs:
    for aa in v.alt_alleles:
        print(aa)

This will produce the following output:

chr14:21391016 A->AT f2
chr14:21393484 TCTTC->T f2
chr14:21402010 G->A f1
chr14:21403019 G->A f2
chr14:21403214 T->C f1
chr14:21431459 G->C f1
chr14:21385738 C->T f1
chr14:21385738 C->T f2
chr14:21385954 A->C f2
chr14:21393173 T->C f1
chr14:21393702 C->T f2
chr14:21393860 G->A f1
chr14:21403023 G->A f1
chr14:21403023 G->A f2
chr14:21405222 T->C f2
chr14:21409888 T->C f1
chr14:21409888 T->C f2
chr14:21429019 C->T f1
chr14:21429019 C->T f2
chr14:21431306 G->A f1
chr14:21431623 A->C f2
chr14:21393540 GGAA->G f1
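The printed alleles can also be tallied per family. The sketch below is an illustration only: the helper alleles_per_family is hypothetical, and it assumes that the printed form of an allele ends with the family ID, as in the output above. If the printed format differs in your GPF version, the parsing step needs to be adapted.

```python
from collections import Counter


def alleles_per_family(variants):
    """Count alternative alleles per family ID, based on their printed form."""
    counts = Counter()
    for variant in variants:
        for allele in variant.alt_alleles:
            # Assumption: str(allele) looks like "chr14:21391016 A->AT f2",
            # with the family ID as the last whitespace-separated token.
            family_id = str(allele).split()[-1]
            counts[family_id] += 1
    return dict(counts)
```

Calling alleles_per_family(vs) on the list of variants from the query above returns a dictionary keyed by family ID.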

The query_variants interface allows you to specify what kind of variants you are interested in. For example, if you only need “synonymous” variants, you can use:

st = gpf_instance.get_genotype_data('example_dataset')
vs = st.query_variants(effect_types=['synonymous'])
vs = list(vs)
len(vs)
>> 4

Or, if you are interested in “synonymous” variants only in people with the “prb” (proband) role, you can use:

vs = st.query_variants(effect_types=['synonymous'], roles='prb')
vs = list(vs)
len(vs)
>> 1
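The same filters can be applied in a loop to compare several effect types at once. This is a minimal sketch that uses only the effect_types and roles arguments shown above; the helper count_by_effect_type is hypothetical, and any effect type name other than “synonymous” that you pass in is your own choice rather than something defined in this guide.

```python
def count_by_effect_type(study, effect_types, **filters):
    """Return {effect_type: number of matching variants} for one study.

    Extra keyword arguments (for example roles='prb') are passed through
    to query_variants unchanged.
    """
    counts = {}
    for effect_type in effect_types:
        # One query per effect type, so each count can be reported separately.
        variants = study.query_variants(effect_types=[effect_type], **filters)
        counts[effect_type] = sum(1 for _ in variants)
    return counts
```

For example, count_by_effect_type(st, ['synonymous'], roles='prb') runs the same query as above and returns the count in a dictionary. A variant that matches more than one requested effect type is counted once under each of them.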

Querying phenotype data

To list all available phenotype data, use:

gpf_instance.get_phenotype_data_ids()

This will return a list with the IDs of all configured phenotype data:

['mini_pheno']

To get a specific phenotype data object and query it, pass one of the listed IDs:

pd = gpf_instance.get_phenotype_data("mini_pheno")
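A small guard can make a mistyped ID easier to spot. The sketch below assumes that the ID passed to get_phenotype_data should be one of those returned by get_phenotype_data_ids; the helper get_phenotype_data_checked is hypothetical and not part of GPF.

```python
def get_phenotype_data_checked(gpf_instance, pheno_id):
    """Fetch phenotype data, failing early with the list of known IDs."""
    available = gpf_instance.get_phenotype_data_ids()
    if pheno_id not in available:
        # Report the configured IDs so that a typo is easy to correct.
        raise KeyError(
            f"Unknown phenotype data ID {pheno_id!r}; available IDs: {available}"
        )
    return gpf_instance.get_phenotype_data(pheno_id)
```

The returned object is whatever get_phenotype_data returns, so it can be used in the same way as pd in the rest of this section.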

We can see what instruments and measures are available in the data:

pd.instruments

>> {'i1': Instrument(i1, 7)}

pd.measures

>> {'i1.age': Measure(i1.age, MeasureType.continuous, [68.00148724003327, 606.2292731817272]),
    'i1.iq': Measure(i1.iq, MeasureType.continuous, [-11.109304318239424, 174.2897342432941]),
    'i1.m1': Measure(i1.m1, MeasureType.continuous, [28.876821569323646, 143.02866815069675]),
    'i1.m2': Measure(i1.m2, MeasureType.continuous, [17.650256211303596, 69.72059461639753]),
    'i1.m3': Measure(i1.m3, MeasureType.continuous, [20.34949100410408, 122.8324621617449]),
    'i1.m4': Measure(i1.m4, MeasureType.continuous, [0, 10]),
    'i1.m5': Measure(i1.m5, MeasureType.categorical, val1, val2, val3, val4, val5)}
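The measure IDs above follow an instrument.measure pattern, which makes it easy to group them by instrument. The following sketch relies only on that key format; the helper measures_by_instrument is hypothetical and not part of GPF.

```python
from collections import defaultdict


def measures_by_instrument(measure_ids):
    """Group IDs such as 'i1.iq' into {'i1': ['iq', ...]}."""
    grouped = defaultdict(list)
    for measure_id in measure_ids:
        # Split on the first dot only: the instrument name comes before it.
        instrument, _, name = measure_id.partition(".")
        grouped[instrument].append(name)
    return dict(grouped)
```

Since iterating over a dictionary yields its keys, measures_by_instrument(pd.measures) can be called directly on the mapping shown above.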

We can then get specific measure values for specific individuals:

from dae.variants.attributes import Role

list(pd.get_people_measure_values(["i1.iq"], roles=[Role.prb], family_ids=["f1", "f2", "f3"]))

>> [{'person_id': 'f1.p1',
     'family_id': 'f1',
     'role': 'prb',
     'status': 'affected',
     'sex': 'M',
     'i1.iq': 104.9118881225586},
    {'person_id': 'f2.p1',
     'family_id': 'f2',
     'role': 'prb',
     'status': 'affected',
     'sex': 'M',
     'i1.iq': 66.6941146850586},
    {'person_id': 'f3.p1',
     'family_id': 'f3',
     'role': 'prb',
     'status': 'affected',
     'sex': 'M',
     'i1.iq': 69.3330078125}]
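Each record is a plain dictionary, so the values can be summarized with the Python standard library. This is a minimal sketch; the helper summarize_measure is hypothetical, and it assumes that records have the shape shown above, with the measure ID used as a key.

```python
from statistics import mean


def summarize_measure(records, measure_id):
    """Return count, mean, min and max of one measure across records."""
    # Skip records where the measure is absent or has no value.
    values = [r[measure_id] for r in records if r.get(measure_id) is not None]
    if not values:
        return {"count": 0}
    return {
        "count": len(values),
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
    }
```

Passing the list returned by get_people_measure_values together with "i1.iq" gives a quick summary of the selected probands.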