GPF Getting Started Guide

Setup

Prerequisites

This guide assumes that you are working on a recent Linux or macOS machine.

The GPF system is distributed as a Conda package.

If you do not have a working version of Anaconda, Miniconda, or Mamba, you must install one. We recommend using a Miniforge distribution.

Go to the Miniforge home page and follow the instructions for your platform.

Warning

The GPF system is not supported on Windows.

GPF Installation

The GPF system is developed in Python and supports Python 3.11 and up. The recommended way to set up GPF is to install it into a dedicated Conda environment.

Create an empty Conda environment named gpf:

mamba create -n gpf

To use this environment, you need to activate it using the following command:

mamba activate gpf

Install the gpf_wdae conda package into the already activated gpf environment:

mamba install \
    -c conda-forge \
    -c bioconda \
    -c iossifovlab \
    -c defaults \
    gpf_wdae

This command is going to install GPF and all of its dependencies.

Getting the demonstration data

Clone the gpf-getting-started repository, which contains the demonstration data:

git clone https://github.com/iossifovlab/gpf-getting-started.git

Navigate to the newly-created directory:

cd gpf-getting-started

This repository provides a minimal instance and sample data to be imported.

Starting and stopping the GPF web interface

By default, the GPF system looks for a file named gpf_instance.yaml in the current directory and then in its parent directories. If GPF finds such a file, it uses it as the configuration for the GPF instance. Otherwise, GPF looks for the DAE_DB_DIR environment variable. If that variable is not set, GPF raises an exception.
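The lookup order described above can be sketched in plain Python. This is only an illustration, not GPF's actual implementation: the helper name find_instance_config is hypothetical, and the sketch assumes that DAE_DB_DIR points at a directory that contains gpf_instance.yaml, as the minimal_instance directory does in this guide.

```python
import tempfile
from pathlib import Path


def find_instance_config(start_dir, environ):
    """Return the gpf_instance.yaml path chosen by the described lookup order.

    Hypothetical helper for illustration only; it is not part of GPF.
    """
    start = Path(start_dir).resolve()
    # 1. Look in the starting directory, then in each parent directory.
    for directory in (start, *start.parents):
        candidate = directory / "gpf_instance.yaml"
        if candidate.is_file():
            return candidate
    # 2. Fall back to the DAE_DB_DIR environment variable.
    dae_db_dir = environ.get("DAE_DB_DIR")
    if dae_db_dir is None:
        raise ValueError("no gpf_instance.yaml found and DAE_DB_DIR is not set")
    # Assumption: the configuration file sits directly inside DAE_DB_DIR.
    return Path(dae_db_dir) / "gpf_instance.yaml"


with tempfile.TemporaryDirectory() as tmp:
    instance_dir = Path(tmp).resolve() / "minimal_instance"
    nested = instance_dir / "work" / "deep"
    nested.mkdir(parents=True)
    config = instance_dir / "gpf_instance.yaml"
    config.write_text("instance_id: minimal_instance\n")

    # Found by walking up from a nested working directory.
    found_by_walk = find_instance_config(nested, environ={})

    # Found through DAE_DB_DIR when no parent directory has the file.
    outside = Path(tmp).resolve() / "elsewhere"
    outside.mkdir()
    found_by_env = find_instance_config(
        outside, environ={"DAE_DB_DIR": str(instance_dir)}
    )

print(found_by_walk == config, found_by_env == config)
```

Passing the environment as an argument, instead of reading os.environ directly, keeps the sketch easy to try out without changing the shell environment.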

For this manual we recommend setting the DAE_DB_DIR environment variable.

From within the gpf-getting-started directory run the following command:

export DAE_DB_DIR=$(pwd)/minimal_instance

For this guide we use a gpf_instance.yaml file that is already provided in the minimal_instance subdirectory:

# The id of the instance.
instance_id: minimal_instance

# The reference genome to use for this instance.
reference_genome:
    resource_id: "hg38/genomes/GRCh38-hg38"

# The gene models to use for this instance.
gene_models:
    resource_id: "hg38/gene_models/MANE/1.3"

# The annotation pipeline configuration to use. Uncomment to enable.
# annotation:
#   config:
#     - allele_score: hg38/variant_frequencies/gnomAD_4.1.0/genomes/ALL
#     - allele_score: hg38/scores/ClinVar_20240730

A GPF instance configuration requires a reference genome and gene models to annotate variants with effects on genes. For this guide we use the HG38 reference genome and MANE 1.3 gene models.

If not specified otherwise, GPF uses the GPF Genomic Resources Repository (GRR) located at https://grr.iossifovlab.com/ to find the resources it needs.

The reference genome used by this GPF instance is hg38/genomes/GRCh38-hg38 and the gene models are hg38/gene_models/MANE/1.3 from the default GRR.

Note

For more on GPF instance configuration see GPF Instance Configuration.

Now we can run the GPF development web server and browse our empty GPF instance:

wgpf run

and browse the GPF development server at http://localhost:8000.

The web interface will be mostly empty, as at this point there is no data imported into the instance.

To stop the development GPF web server, you should press Ctrl-C - the usual keybinding for stopping long-running Linux commands in a terminal.

Warning

The development web server run by wgpf run used in this guide is meant for development purposes only and is not suitable for serving the GPF system in production.

Importing genotype data

Import Tools and Import Project

Importing genotype data into a GPF instance involves multiple steps. The tool used to import genotype data is named import_genotypes. This tool expects an import project file that describes the import.

We support importing variants from multiple formats.

For this demonstration, we will be importing from the following formats:

  • List of de novo variants

  • Variant Call Format (VCF)

Example import of de novo variants

Let us import a small list of de novo variants.

A pedigree file that describes the families is needed - input_genotype_data/example.ped:

familyId  personId  dadId     momId     sex       status
f1        f1.dad    0         0         M         unaffected
f1        f1.mom    0         0         F         unaffected
f1        f1.p1     f1.dad    f1.mom    M         affected
f1        f1.s1     f1.dad    f1.mom    F         unaffected
f2        f2.mom    0         0         F         unaffected
f2        f2.dad    0         0         M         unaffected
f2        f2.p1     f2.dad    f2.mom    F         affected
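Before importing, it can be useful to check that a pedigree file is internally consistent. The following is a minimal sketch in plain Python, not a GPF tool: the helper names read_pedigree and missing_parents are hypothetical, and the sketch assumes whitespace-separated columns and that 0 marks a parent who is not part of the pedigree, as in the file above.

```python
PEDIGREE = """\
familyId  personId  dadId     momId     sex       status
f1        f1.dad    0         0         M         unaffected
f1        f1.mom    0         0         F         unaffected
f1        f1.p1     f1.dad    f1.mom    M         affected
f1        f1.s1     f1.dad    f1.mom    F         unaffected
f2        f2.mom    0         0         F         unaffected
f2        f2.dad    0         0         M         unaffected
f2        f2.p1     f2.dad    f2.mom    F         affected
"""


def read_pedigree(text):
    """Parse a whitespace-separated pedigree into a list of dicts."""
    header, *rows = [line.split() for line in text.strip().splitlines()]
    return [dict(zip(header, row)) for row in rows]


def missing_parents(persons):
    """Return (personId, parentId) pairs whose parent is not in the same family."""
    members = {}
    for person in persons:
        members.setdefault(person["familyId"], set()).add(person["personId"])
    problems = []
    for person in persons:
        for column in ("dadId", "momId"):
            parent = person[column]
            # "0" means the parent is not described in the pedigree.
            if parent != "0" and parent not in members[person["familyId"]]:
                problems.append((person["personId"], parent))
    return problems


persons = read_pedigree(PEDIGREE)
print(len(persons), "individuals in", len({p["familyId"] for p in persons}), "families")
print("missing parents:", missing_parents(persons))
```

An empty list of missing parents means every dadId and momId refers to a person listed in the same family.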

We will also need the list of de novo variants input_genotype_data/example.tsv:

chrom     pos       ref       alt       person_id
chr14     21403214  T         C         f1.p1
chr14     21431459  G         C         f1.p1
chr14     21391016  A         AT        f2.p1
chr14     21403019  G         A         f2.p1
chr14     21393484  TCTTC     T         f2.p1
chr14     21409849  A         G         f1.p1

A project configuration file for importing this study - input_genotype_data/denovo_example.yaml - is also provided:

id: denovo_example

input:
  pedigree:
    file: example.ped

  denovo:
    files:
      - example.tsv

To import this project run the following command:

import_genotypes input_genotype_data/denovo_example.yaml

Note

For more information on the import project configuration file, see Import Tools.

For more information on working with pedigree files, see Working With Pedigree Files Guide.

When the import finishes you can run the GPF development server using:

wgpf run

and browse the content of the GPF development server at http://localhost:8000

The home page of the GPF system will show the imported study denovo_example.

../_images/denovo-example-home-page.png

If you follow the link to the study and choose the Dataset Statistics page, you will see some summary information for the imported study: families and individuals included in the study, types of families, and rates of de novo variants.

../_images/denovo-example-dataset-statistics.png

Statistics of individuals by affected status and sex

../_images/denovo-example-families-by-pedigree.png

Statistics of families by family type

../_images/denovo-example-rate-de-novo-variants.png

Rate of de novo variants

If you select the Genotype Browser page, you will be able to see the imported de novo variants. The default filters search for LGD (likely gene-disrupting) de novo variants. It happens that all de novo variants imported in the denovo_example study are LGD variants.

So, when you click the Preview button, all the imported variants will be shown.

../_images/denovo-example-genotype-browser.png

Genotype Browser with de novo variants

Example import of VCF variants

Similar to the sample de novo variants, there are also sample variants in VCF format. They can be found in input_genotype_data/example.vcf; the same pedigree file as before is used.

##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr14>
#CHROM    POS       ID        REF       ALT       QUAL      FILTER    INFO      FORMAT    f1.mom    f1.dad    f1.p1     f1.s1     f2.mom    f2.dad    f2.p1
chr14     21385738  .         C         T         .         .         .         GT        0/0       0/1       0/1       0/0       0/0       0/1       0/0
chr14     21385954  .         A         C         .         .         .         GT        0/0       0/0       0/0       0/0       0/1       0/0       0/1
chr14     21393173  .         T         C         .         .         .         GT        0/1       0/0       0/0       0/1       0/0       0/0       0/0
chr14     21393702  .         C         T         .         .         .         GT        0/0       0/0       0/0       0/0       0/0       0/1       0/1
chr14     21393860  .         G         A         .         .         .         GT        0/0       0/1       0/1       0/1       0/0       0/0       0/0
chr14     21403023  .         G         A         .         .         .         GT        0/0       0/1       0/0       0/1       0/1       0/0       0/0
chr14     21405222  .         T         C         .         .         .         GT        0/0       0/0       0/0       0/0       0/0       0/1       0/0
chr14     21409888  .         T         C         .         .         .         GT        0/1       0/0       0/1       0/0       0/1       0/0       1/0
chr14     21429019  .         C         T         .         .         .         GT        0/0       0/1       0/1       0/0       0/0       0/1       0/1
chr14     21431306  .         G         A         .         .         .         GT        0/0       0/1       0/1       0/1       0/0       0/0       0/0
chr14     21431623  .         A         C         .         .         .         GT        0/0       0/0       0/0       0/0       0/1       1/1       1/1
chr14     21393540  .         GGAA      G         .         .         .         GT        0/1       0/1       1/1       0/0       0/0       0/0       0/0
chr14     21431499  .         T         C         .         .         .         GT        0/0       0/0       0/0       0/0       0/0       0/1       0/1
chr14     21402010  .         G         A         .         .         .         GT        0/0       0/1       0/1       0/0       0/0       0/0       0/0
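To get a feeling for what the genotype columns encode, here is a minimal sketch in plain Python that lists, for a few of the records above, the samples that carry the alternative allele. It is not how GPF reads VCF files. The helper name carriers is hypothetical, and the sketch splits columns on any whitespace, whereas real VCF files are tab-separated.

```python
import re

VCF = """\
##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr14>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT f1.mom f1.dad f1.p1 f1.s1 f2.mom f2.dad f2.p1
chr14 21385738 . C T . . . GT 0/0 0/1 0/1 0/0 0/0 0/1 0/0
chr14 21393173 . T C . . . GT 0/1 0/0 0/0 0/1 0/0 0/0 0/0
chr14 21431623 . A C . . . GT 0/0 0/0 0/0 0/0 0/1 1/1 1/1
"""


def carriers(vcf_text):
    """Map each variant to the samples with at least one non-reference allele."""
    samples = []
    result = {}
    for line in vcf_text.splitlines():
        if line.startswith("##") or not line.strip():
            continue  # skip meta-information and empty lines
        fields = line.split()
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        genotypes = fields[9:]
        result[f"{chrom}:{pos} {ref}->{alt}"] = [
            sample
            for sample, gt in zip(samples, genotypes)
            # Alleles are separated by "/" or "|"; "0" is the reference
            # allele and "." a missing call.
            if any(allele not in ("0", ".") for allele in re.split(r"[/|]", gt))
        ]
    return result


for variant, who in carriers(VCF).items():
    print(variant, who)
```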

A project configuration file is provided - input_genotype_data/vcf_example.yaml.

id: vcf_example

input:
  pedigree:
    file: example.ped

  vcf:
    files:
      - example.vcf

To import them, run the following command:

import_genotypes input_genotype_data/vcf_example.yaml

When the import finishes you can run the GPF development server using:

wgpf run

and browse the content of the GPF development server at http://localhost:8000

The GPF instance Home Page now includes the imported study vcf_example.

../_images/vcf-example-home-page.png

If you follow the link to vcf_example, you will get to the Gene Browser page for the study. It happens that all imported VCF variants are located in the CHD8 gene. Enter CHD8 in the Gene Symbol box and click the Go button.

../_images/vcf-example-gene-browser.png

Gene Browser for CHD8 gene shows variants from vcf_example study

Example of a dataset (group of genotype studies)

The already imported studies denovo_example and vcf_example have genomic variants for the same group of individuals, described in example.ped. We can create a dataset (group of genotype studies) that includes both studies.

To this end create a directory datasets/example_dataset inside the GPF instance directory minimal_instance:

mkdir -p minimal_instance/datasets/example_dataset

and place the following configuration file example_dataset.yaml inside that directory:

id: example_dataset
name: Example Dataset

studies:
  - denovo_example
  - vcf_example
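If you prefer to script this step, the directory and the configuration file can be created from Python as well. The following is a minimal sketch: the helper name write_dataset_config is hypothetical, and the demonstration writes into a temporary directory so that it does not touch a real GPF instance.

```python
import tempfile
from pathlib import Path

DATASET_CONFIG = """\
id: example_dataset
name: Example Dataset

studies:
  - denovo_example
  - vcf_example
"""


def write_dataset_config(instance_dir):
    """Create datasets/example_dataset/example_dataset.yaml under instance_dir."""
    dataset_dir = Path(instance_dir) / "datasets" / "example_dataset"
    dataset_dir.mkdir(parents=True, exist_ok=True)  # same effect as mkdir -p
    config_path = dataset_dir / "example_dataset.yaml"
    config_path.write_text(DATASET_CONFIG)
    return config_path


with tempfile.TemporaryDirectory() as tmp:
    path = write_dataset_config(Path(tmp) / "minimal_instance")
    relative = path.relative_to(tmp)
    written = path.read_text()

print(relative)
```

When following this guide, the helper would be called as write_dataset_config("minimal_instance") from within the gpf-getting-started directory.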

Once the configuration is ready, restart the wgpf command. The home page of the GPF instance will now include the configured dataset example_dataset.

../_images/example-dataset-home-page.png

Home page of the GPF instance showing the example_dataset

Follow the link to the Example Dataset, choose the Gene Browser page, and enter CHD8 in the Gene Symbol box. Click the Go button and you will see the variants from both studies.

../_images/example-dataset-gene-browser.png

Gene Browser for CHD8 gene shows variants from both studies - denovo_example and vcf_example

Getting Started with Annotation

The import of genotype data into a GPF instance always runs effect annotation. It is easy to extend the annotation of genotype data during the import.

To define the annotation used during the import into a GPF instance we have to add a configuration that defines the pipeline of annotators and resources to be used during the import.

In the public GPF Genomic Resources Repository (GRR) there is a collection of public genomic resources available for use with the GPF system.

Let us say that we want to annotate the genotype variants with GnomAD and ClinVar. We need to find the appropriate resources in the public GRR:

  • hg38/variant_frequencies/gnomAD_4.1.0/genomes/ALL - this is an allele_score resource and the annotator by default produces one additional attribute gnomad_v4_genome_ALL_af that is the allele frequency for the variant (check the hg38/variant_frequencies/gnomAD_4.1.0/genomes/ALL page for more information about the resource);

  • hg38/scores/ClinVar_20240730 - this is an allele_score resource and the annotator by default produces two additional attributes: CLNSIG, the aggregate germline classification for the variant, and CLNDN, the preferred disease name (check the hg38/scores/ClinVar_20240730 page for more information about the resource).

In order to use these resources in the GPF instance annotation, we need to edit the GPF instance configuration (minimal_instance/gpf_instance.yaml) and add the following snippet to the configuration file:

# The annotation pipeline configuration to use. Uncomment to enable.
annotation:
  config:
    - allele_score: hg38/variant_frequencies/gnomAD_4.1.0/genomes/ALL
    - allele_score: hg38/scores/ClinVar_20240730

Re-running GPF will automatically re-annotate any genotype data that is not up to date:

wgpf run

The variants in our Example Dataset will now have additional attributes that come from the annotation with GnomAD and ClinVar:

  • gnomad_v4_genome_ALL_af;

  • CLNSIG;

  • CLNDN.

If we browse our Example Dataset there is almost no difference. The only difference is that now in the genotype browser, the genomic scores section is not empty and we can query our variants using the gnomad_v4_genome_ALL_af, CLNSIG and CLNDN genomic scores.

../_images/example-dataset-genotype-browser-gnomics-scores.png

Genotype browser using GnomAD and ClinVar genomic scores

Note

The attributes produced by the annotation can be used in the Genotype Browser preview table as described in Getting Started with Preview and Download Columns.

Getting Started with Phenotype Data

Importing phenotype data

To import phenotype data, the import_phenotypes tool is used.

The tool requires an import project, a YAML file describing the contents of the phenotype data to be imported, along with configuration options on how to import them.

As an example, we are going to show how to import simulated phenotype data into our GPF instance.

Inside the input_phenotype_data directory, the following data is provided:

  • pedigree.ped is the phenotype data pedigree file. input_phenotype_data/pedigree.ped:

    familyId  personId  dadId     momId     sex       status    role
    f1        f1.dad    0         0         1         1         dad
    f1        f1.mom    0         0         2         1         mom
    f1        f1.p1     f1.dad    f1.mom    1         2         prb
    f1        f1.s1     f1.dad    f1.mom    2         1         sib
    f2        f2.mom    0         0         2         1         mom
    f2        f2.dad    0         0         1         1         dad
    f2        f2.p1     f2.dad    f2.mom    2         2         prb
    f2        f2.s1     f2.dad    f2.mom    1         1         sib
    f3        f3.dad    0         0         1         1         dad
    f3        f3.mom    0         0         2         1         mom
    f3        f3.p1     f3.dad    f3.mom    1         2         prb
    f3        f3.s1     f3.dad    f3.mom    2         1         sib
    f4        f4.dad    0         0         1         1         dad
    f4        f4.mom    0         0         2         1         mom
    f4        f4.p1     f4.dad    f4.mom    2         2         prb
    f4        f4.s1     f4.dad    f4.mom    1         1         sib
    f5        f5.dad    0         0         1         1         dad
    f5        f5.mom    0         0         2         1         mom
    f5        f5.p1     f5.dad    f5.mom    1         2         prb
    f5        f5.s1     f5.dad    f5.mom    2         1         sib
    f6        f6.dad    0         0         1         1         dad
    f6        f6.mom    0         0         2         1         mom
    f6        f6.p1     f6.dad    f6.mom    2         2         prb
    f6        f6.s1     f6.dad    f6.mom    1         1         sib
    
  • instruments contains the phenotype instruments and measures to be imported. There are two instruments in the example:

    input_phenotype_data/instruments/basic_medical.csv:

    personId,age,weight,height,race
    f1.dad,50,200,180,white
    f1.mom,23,160,170,white
    f1.s1,2,20,79,white
    f1.p1,4,40,80,white
    f2.dad,32,230,170,white
    f2.mom,30,153,165,white
    f2.s1,12,80,130,white
    f2.p1,3,30,70,white
    f3.dad,45,175,165,white
    f3.mom,41,173,154,white
    f3.s1,23,170,180,white
    f3.p1,7,80,120,white
    f4.dad,25,190,185,white
    f4.mom,35,200,150,white
    f4.s1,17,160,165,white
    f4.p1,5,50,100,white
    f5.dad,31,250,176,asian
    f5.mom,39,180,154,asian
    f5.s1,11,130,150,asian
    f5.p1,5,55,100,asian
    f6.dad,30,200,173,affrican american
    f6.mom,27,140,178,affrican american
    f6.s1,1,15,30,affrican american
    f6.p1,8,80,130,affrican american
    

    input_phenotype_data/instruments/iq.csv:

    personId,verbal_iq,non_verbal_iq,diagnosis_notes
    f1.p1,60,45,"walked late, severe seizures"
    f1.s1,98,102,
    f2.p1,98,70,originally diagnosed as Asperger
    f2.s1,115,83,
    f3.p1,90,80,
    f3.s1,97,90,
    f4.p1,108,93,excels at school
    f4.s1,107,91,
    f5.p1,90,70,sleep abnormality
    f5.s1,105,115,
    f6.p1,85,92,
    f6.s1,95,101,
    
  • measure_descriptions.tsv contains descriptions for the provided measures.

    input_phenotype_data/measure_descriptions.tsv:

    instrumentName      measureName         description
    basic_medical       age                 The individual's age in years
    basic_medical       weight              The individual's weight in punds
    basic_medical       height              The individual's height in centimeters
    basic_medical       race                The individual's race
    iq                  verbal_iq           Verbal IQ
    iq                  non_verbal_iq       Non verbal IQ
    
  • import_project.yaml is the import project configuration that we will use to import this data.

    input_phenotype_data/import_project.yaml:

    id: mini_pheno
    
    instrument_files:
      - instruments/basic_medical.csv
      - instruments/iq.csv
    
    data_dictionary:
      files:
        - path: measure_descriptions.tsv
    
    pedigree: pedigree.ped
    
    person_column: personId
    
    work_dir: work
    
    study_config:
      regressions:
        reg_1:
          display_name: "Age regression"
          measure_names:
            - age
          instrument_name: basic_medical
          jitter: 0.1
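Later in this guide, phenotype measures are referred to by IDs such as iq.verbal_iq. The sketch below shows how such IDs relate to the instrument files listed in the import project. It assumes that the instrument name is the CSV file name without its extension and that every column except the person column is a measure. This matches the IDs used in this guide, but it is not taken from the import_phenotypes implementation, and the helper name measure_ids is hypothetical. Only the first data row of each instrument file is reproduced.

```python
import csv
import io

# Header and first data row of each instrument file from this guide.
INSTRUMENTS = {
    "basic_medical": (
        "personId,age,weight,height,race\n"
        "f1.dad,50,200,180,white\n"
    ),
    "iq": (
        "personId,verbal_iq,non_verbal_iq,diagnosis_notes\n"
        'f1.p1,60,45,"walked late, severe seizures"\n'
    ),
}


def measure_ids(instrument_name, csv_text, person_column="personId"):
    """Build '<instrument>.<measure>' IDs from the columns of an instrument CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        f"{instrument_name}.{column}"
        for column in reader.fieldnames
        if column != person_column  # the person column identifies rows
    ]


for name, text in INSTRUMENTS.items():
    print(name, measure_ids(name, text))
```

The csv module is used instead of splitting on commas because values such as "walked late, severe seizures" contain a comma inside quotes.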
    
    

Note

For more information on how to import phenotype data, see Phenotype Database Tools.

To import the phenotype data, we will use the import_phenotypes tool. It will import the phenotype database directly to our GPF instance’s phenotype storage:

import_phenotypes input_phenotype_data/import_project.yaml

When the import finishes you can run the GPF development server using:

wgpf run

Now on the GPF instance Home Page you should see the mini_pheno phenotype study.

../_images/mini-pheno-home-page.png

Home page with imported phenotype study

If you follow the link, you will see the Phenotype Browser tab with the imported data.

../_images/mini-pheno-phenotype-browser.png

Phenotype Browser tab with imported data

In the Phenotype Browser tab you can search for phenotype instruments and measures, see the aggregated figures for the measures, and download selected instruments and measures.

Configure a genotype study to use phenotype data

To demonstrate how a study is configured with a phenotype database, we will be working with the already imported example_dataset dataset.

The phenotype databases can be attached to one or more studies and/or datasets. If you want to attach the mini_pheno phenotype study to the example_dataset dataset, you need to specify it in the dataset’s configuration file, which can be found at minimal_instance/datasets/example_dataset/example_dataset.yaml.

Add the following line to the configuration file:

phenotype_data: mini_pheno

When you restart the server, you should be able to see Phenotype Browser and Phenotype Tool tabs enabled for the Example Dataset dataset.

Additionally, in the Genotype Browser the Family Filters and Person Filters sections will have the Pheno Measures filters enabled.

../_images/example-dataset-genotype-browser-pheno-filters-2.png

Example Dataset genotype browser using Pheno Measures family filters

Getting Started with Preview and Download Columns

Configure genotype columns in Genotype Browser

Once you have annotated your variants the additional attributes produced by the annotation can be used in the variants preview table and in the variants download file. For each study and dataset you can specify which columns are shown in the variants’ table preview, as well as those which will be downloaded.

In our example the annotation produces three additional attributes: gnomad_v4_genome_ALL_af, CLNSIG, and CLNDN. Let us add these attributes to the variants preview table and the variants download file for the example_dataset dataset.

Edit the example_dataset.yaml dataset configuration in minimal_instance/datasets/example_dataset and add the following section at the end of the configuration file:

 1genotype_browser:
 2  columns:
 3    genotype:
 4      gnomad_v4_genome_af:
 5        name: gnomAD v4 AF
 6        source: gnomad_v4_genome_ALL_af
 7        format: "%%.5f"
 8      clinvar_clnsig:
 9        name: CLNSIG
10        source: CLNSIG
11      clinvar_clndn:
12        name: CLNDN
13        source: CLNDN
14
15  column_groups:
16    gnomad_v4:
17      name: gnomAD v4
18      columns:
19      - gnomad_v4_genome_af
20
21    clinvar:
22      name: ClinVar
23      columns:
24      - clinvar_clnsig
25      - clinvar_clndn
26
27  preview_columns_ext:
28    - gnomad_v4
29    - clinvar
30
31  download_columns_ext:
32    - gnomad_v4_genome_af
33    - clinvar_clnsig
34    - clinvar_clndn

Lines 3-13 define the three new columns with values coming from the genotype data attributes:

  • gnomad_v4_genome_af - is a column that uses the value of the attribute gnomad_v4_genome_ALL_af and formats it as a float with 5 decimal places. The display name of the column will be gnomAD v4 AF;

  • clinvar_clnsig - is a column that uses the value of the attribute CLNSIG. The display name of the column will be CLNSIG;

  • clinvar_clndn - is a column that uses the value of the attribute CLNDN. The display name of the column will be CLNDN.

In the preview table each column can show multiple values. In GPF, when you want to show multiple values in a single column, you need to define a column group.

The column group is a collection of columns that are shown together in the preview table. The values in a column group are shown in a single cell. The column group is defined in the column_groups section of the configuration file.

In lines 16-19 we define a column group gnomad_v4 that contains the column gnomad_v4_genome_af.

In lines 21-25 we define a column group clinvar that contains the columns clinvar_clnsig and clinvar_clndn.

In lines 27-29 we extend the preview table columns. The new column groups gnomad_v4 and clinvar will be added to the preview table.

In lines 32-34 we extend the download file columns. The columns gnomad_v4_genome_af, clinvar_clnsig, clinvar_clndn will be added to the download file.
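The relationships between the sections of this configuration (columns, column groups, preview columns and download columns) can be checked with a small script. The sketch below mirrors the YAML above as a Python dictionary and reports names that are used but never defined. The helper name undefined_references is hypothetical and is not a GPF tool. The sketch also assumes that preview_columns_ext and download_columns_ext may each name either a column or a column group.

```python
GENOTYPE_BROWSER = {
    "columns": {
        "genotype": {
            "gnomad_v4_genome_af": {"name": "gnomAD v4 AF", "source": "gnomad_v4_genome_ALL_af"},
            "clinvar_clnsig": {"name": "CLNSIG", "source": "CLNSIG"},
            "clinvar_clndn": {"name": "CLNDN", "source": "CLNDN"},
        },
    },
    "column_groups": {
        "gnomad_v4": {"name": "gnomAD v4", "columns": ["gnomad_v4_genome_af"]},
        "clinvar": {"name": "ClinVar", "columns": ["clinvar_clnsig", "clinvar_clndn"]},
    },
    "preview_columns_ext": ["gnomad_v4", "clinvar"],
    "download_columns_ext": ["gnomad_v4_genome_af", "clinvar_clnsig", "clinvar_clndn"],
}


def undefined_references(config):
    """List names that are referenced in the configuration but not defined."""
    columns = set()
    for kind in config.get("columns", {}).values():
        columns.update(kind)  # column IDs defined under genotype, phenotype, ...
    groups = config.get("column_groups", {})
    problems = []
    # Every column listed in a group must be a defined column.
    for group_id, group in groups.items():
        for column_id in group["columns"]:
            if column_id not in columns:
                problems.append(f"column_groups.{group_id}: {column_id}")
    # Assumption: both extension lists accept columns as well as column groups.
    for section in ("preview_columns_ext", "download_columns_ext"):
        for name in config.get(section, []):
            if name not in columns and name not in groups:
                problems.append(f"{section}: {name}")
    return problems


print(undefined_references(GENOTYPE_BROWSER))
```

An empty list means that every name used in column_groups, preview_columns_ext and download_columns_ext is defined somewhere in the configuration.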

If we now stop the wgpf tool and run it again, we will be able to see the new columns in the preview table and in the download file.

From the GPF instance Home Page follow the link to the Example Dataset page and choose the Genotype Browser. Select all checkboxes in the Present in Child, Present in Parent and Effect Types sections.

../_images/example-dataset-genotype-browser-extended-columns-filters.png

Then click the Preview button and you will be able to see all the imported variants with their additional attributes coming from the annotation.

../_images/example-dataset-genotype-browser-extended-columns-variants.png

Example Dataset genotype browser displaying variants with additional columns gnomAD v4 and ClinVar.

Configure phenotype columns in Genotype Browser

The Genotype Browser allows you to add phenotype columns to the table preview and download file.

Phenotype columns show values from a phenotype database. To configure such a column you need to specify the following attributes:

  • source - the ID of the measure whose values will be shown in the column;

  • role - the role of the person in the family for whom the phenotype measure value will be shown;

  • name - the display name of the column in the table.

Let’s add phenotype columns to the Genotype Browser preview table. To do this, you need to define them in the study’s configuration, in the genotype_browser section of the configuration file.

 1genotype_browser:
 2  columns:
 3    genotype:
 4      gnomad_v4_genome_af:
 5        name: gnomAD v4 AF
 6        source: gnomad_v4_genome_ALL_af
 7        format: "%%.5f"
 8      clinvar_clnsig:
 9        name: CLNSIG
10        source: CLNSIG
11      clinvar_clndn:
12        name: CLNDN
13        source: CLNDN
14
15    phenotype:
16      prb_verbal_iq:
17        role: prb
18        name: Verbal IQ
19        source: iq.verbal_iq
20
21      prb_non_verbal_iq:
22        role: prb
23        name: Non-Verbal IQ
24        source: iq.non_verbal_iq
25
26  column_groups:
27    gnomad_v4:
28      name: gnomAD v4
29      columns:
30      - gnomad_v4_genome_af
31
32    clinvar:
33      name: ClinVar
34      columns:
35      - clinvar_clnsig
36      - clinvar_clndn
37
38    proband_iq:
39      name: Proband IQ
40      columns:
41      - prb_verbal_iq
42      - prb_non_verbal_iq
43
44  preview_columns_ext:
45    - gnomad_v4
46    - clinvar
47    - proband_iq
48
49  download_columns_ext:
50    - gnomad_v4_genome_af
51    - clinvar_clnsig
52    - clinvar_clndn
53    - prb_verbal_iq
54    - prb_non_verbal_iq

Lines 15-24 define two new columns with values coming from the phenotype data attributes:

  • prb_verbal_iq - is a column that uses the value of the phenotype measure iq.verbal_iq for the family proband. The display name of the column will be Verbal IQ;

  • prb_non_verbal_iq - is a column that uses the value of the phenotype measure iq.non_verbal_iq for the family proband. The display name of the column will be Non-Verbal IQ.

In the preview table each column can show multiple values. In GPF, when you want to show multiple values in a single column, you need to define a column group.

The column group is a collection of columns that are shown together in the preview table. The values in a column group are shown in a single cell. The column group is defined in the column_groups section of the configuration file.

In lines 38-42 we define a column group called proband_iq that contains the columns prb_verbal_iq and prb_non_verbal_iq.

To add the new column group proband_iq to the preview table, we need to add it to the preview_columns_ext section of the configuration file. In line 47 we add the new column group proband_iq at the end of the preview table.

When you restart the server, go to the Genotype Browser tab of the Example Dataset dataset and select all checkboxes in the Present in Child, Present in Parent and Effect Types sections:

../_images/example-dataset-proband-iq-column-group-filters.png

When you click on the Table Preview button, you will be able to see the new column group proband_iq in the preview table.

../_images/example-dataset-proband-iq-column-group-variants.png

Example Dataset genotype browser using pheno measures columns

Example Usage of GPF Python Interface

The simplest way to start using GPF’s Python API is to import the GPFInstance class and instantiate it:

from dae.gpf_instance.gpf_instance import GPFInstance
gpf_instance = GPFInstance.build()

This gpf_instance object groups together a number of objects, each dedicated to managing different parts of the underlying data. It can be used to interact with the system as a whole.

For example, to list all studies configured in the startup GPF instance, use:

gpf_instance.get_genotype_data_ids()

This will return a list with the ids of all configured studies:

['denovo_example',
 'vcf_example',
 'example_dataset']

To get a specific study and query it, you can use:

st = gpf_instance.get_genotype_data('example_dataset')
vs = list(st.query_variants())

Note

The query_variants method returns a Python iterator.
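Because an iterator can be consumed only once, the examples in this section wrap the result in list(...) when the variants are needed more than once. The effect can be seen with a plain generator used as a stand-in. The function fake_query_variants below is hypothetical and does not call GPF.

```python
def fake_query_variants():
    """Stand-in for query_variants: yields items lazily, like any iterator."""
    yield from ["variant-1", "variant-2", "variant-3"]


vs = fake_query_variants()
first_pass = list(vs)   # consumes the iterator
second_pass = list(vs)  # the iterator is already exhausted

print(len(first_pass), len(second_pass))
```

Storing the result of list(...) in a variable, as done above with vs = list(st.query_variants()), lets you iterate over the same variants several times.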

To get the basic information about variants found by the query_variants method, you can use:

for v in vs:
    for aa in v.alt_alleles:
        print(aa)

chr14:21391016 A->AT f2
chr14:21393484 TCTTC->T f2
chr14:21402010 G->A f1
chr14:21403019 G->A f2
chr14:21403214 T->C f1
chr14:21431459 G->C f1
chr14:21385738 C->T f1
chr14:21385738 C->T f2
chr14:21385954 A->C f2
chr14:21393173 T->C f1
chr14:21393702 C->T f2
chr14:21393860 G->A f1
chr14:21403023 G->A f1
chr14:21403023 G->A f2
chr14:21405222 T->C f2
chr14:21409888 T->C f1
chr14:21409888 T->C f2
chr14:21429019 C->T f1
chr14:21429019 C->T f2
chr14:21431306 G->A f1
chr14:21431623 A->C f2
chr14:21393540 GGAA->G f1

The query_variants interface allows you to specify what kind of variants you are interested in. For example, if you only need “synonymous” variants, you can use:

st = gpf_instance.get_genotype_data('example_dataset')
vs = st.query_variants(effect_types=['synonymous'])
vs = list(vs)
len(vs)

>> 4

Or, if you are interested in “synonymous” variants only in people with “prb” role, you can use:

vs = st.query_variants(effect_types=['synonymous'], roles='prb')
vs = list(vs)
len(vs)

>> 1

Example import of real de novo variants

Source of the data

As an example let us import de novo variants from the following paper: Yoon, S., et al. Rates of contributory de novo mutation in high and low-risk autism families. Commun Biol 4, 1026 (2021).

We will focus on de novo variants from the SSC collection published in the aforementioned paper.

To import these variants into the GPF system, we need a pedigree file describing the families and a list of de novo variants.

From the supplementary data for the paper, we can download the following files:

Note

All the data files needed for this example are available in the gpf-getting-started repository under the subdirectory example_imports/denovo_and_cnv_import.

Preprocess the Family Data

The list of children in Supplementary_Data_1.tsv.gz contains a lot of data that is not relevant for the import. We are going to use only the first five columns from that file that look as follows:

gunzip -c Supplementary_Data_1.tsv.gz | head | cut -f 1-5 | less -S -x 20


collection          familyId            personId            affected status     sex
SSC                 11000               11000.p1            affected            M
SSC                 11000               11000.s1            unaffected          F
SSC                 11003               11003.p1            affected            M
SSC                 11003               11003.s1            unaffected          F
SSC                 11004               11004.p1            affected            M
SSC                 11004               11004.s1            unaffected          M
SSC                 11006               11006.p1            affected            M
SSC                 11006               11006.s1            unaffected          M
SSC                 11008               11008.p1            affected            M

  • The first column contains the collection. This study contains data from SSC and AGRE collections. We are going to import only variants from the SSC collection.

  • The second column contains the family ID.

  • The third column contains the person’s ID.

  • The fourth column contains the affected status of the individual.

  • The fifth column contains the sex of the individual.

We need a pedigree file describing the structure of the families to import the data into GPF. The Supplementary_Data_1.tsv.gz file contains only the children; it does not include information about their parents. Fortunately, for the SSC collection it is not difficult to build the full family structures from the information we have.

So, before starting the work on the import, we need to preprocess the list of children and transform it into a pedigree file.

For the SSC collection, if you have a family with ID <fam_id>, then the identifiers of the individuals in the family are formed as follows:

  • mother - <fam_id>.mo;

  • father - <fam_id>.fa;

  • proband - <fam_id>.p1;

  • first sibling - <fam_id>.s1;

  • second sibling - <fam_id>.s2.

Another essential restriction for SSC is that the only affected person in the family is the proband. The affected status of the mother, father, and siblings is unaffected.

Having this information, we can use the following Awk script to transform the list of children into a pedigree:

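gunzip -c Supplementary_Data_1.tsv.gz | awk '
    BEGIN {
        OFS="\t"
        # Header line of the pedigree file.
        print "familyId", "personId", "dadId", "momId", "status", "sex"
    }
    # Keep only children from the SSC collection.
    $1 == "SSC" {
        fid = $2
        # On the first child of a family, emit the unaffected parents.
        if (!(fid in families)) {
            families[fid] = 1
            print fid, fid".mo", "0", "0", "unaffected", "F"
            print fid, fid".fa", "0", "0", "unaffected", "M"
        }
        # Emit the child, linked to both parents.
        print fid, $3, fid".fa", fid".mo", $4, $5
    }' > ssc_denovo.ped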

If we run this script, it will read Supplementary_Data_1.tsv.gz and produce the appropriate pedigree file ssc_denovo.ped.
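
For readers who prefer Python, the same transformation can be sketched as follows. This is only an illustration of the logic in the Awk script, not part of GPF; the in-memory sample (including the AGRE row with its made-up IDs) and the function name build_pedigree are assumptions introduced for the example.

```python
import csv
import io

# Hypothetical in-memory sample standing in for the first five columns
# of Supplementary_Data_1.tsv.gz. The AGRE row uses made-up identifiers.
CHILDREN_TSV = (
    "collection\tfamilyId\tpersonId\taffected status\tsex\n"
    "SSC\t11000\t11000.p1\taffected\tM\n"
    "SSC\t11000\t11000.s1\tunaffected\tF\n"
    "AGRE\tAU0001\tAU000101\taffected\tM\n"
    "SSC\t11003\t11003.p1\taffected\tM\n"
)


def build_pedigree(children_tsv: str) -> list[list[str]]:
    """Return pedigree rows (header first) for the SSC children."""
    rows = [["familyId", "personId", "dadId", "momId", "status", "sex"]]
    seen_families = set()
    reader = csv.reader(io.StringIO(children_tsv), delimiter="\t")
    next(reader)  # skip the header line
    for collection, fam_id, person_id, status, sex, *_ in reader:
        if collection != "SSC":
            continue
        mom_id, dad_id = f"{fam_id}.mo", f"{fam_id}.fa"
        if fam_id not in seen_families:
            # First child of the family: emit the unaffected parents.
            seen_families.add(fam_id)
            rows.append([fam_id, mom_id, "0", "0", "unaffected", "F"])
            rows.append([fam_id, dad_id, "0", "0", "unaffected", "M"])
        # Emit the child, linked to both parents.
        rows.append([fam_id, person_id, dad_id, mom_id, status, sex])
    return rows


pedigree = build_pedigree(CHILDREN_TSV)
for row in pedigree:
    print("\t".join(row))
```

To process the real file, the sample string could be replaced by text read with gzip.open(..., "rt").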

Note

The resulting pedigree file is also available in the gpf-getting-started repository under the subdirectory example_imports/denovo_and_cnv_import.

Here is a fragment from the resulting pedigree file:

familyId       personId       dadId          momId          status         sex
11000          11000.mo       0              0              unaffected     F
11000          11000.fa       0              0              unaffected     M
11000          11000.p1       11000.fa       11000.mo       affected       M
11000          11000.s1       11000.fa       11000.mo       unaffected     F
11003          11003.mo       0              0              unaffected     F
11003          11003.fa       0              0              unaffected     M
11003          11003.p1       11003.fa       11003.mo       affected       M
11003          11003.s1       11003.fa       11003.mo       unaffected     F
11004          11004.mo       0              0              unaffected     F
11004          11004.fa       0              0              unaffected     M

Preprocess the SNP and INDEL de Novo variants

The Supplementary_Data_2.tsv.gz file contains 255232 variants. For the import, we will use columns four and nine from this file:

gunzip -c Supplementary_Data_2.tsv.gz | head | cut -f 4,9 | less -S -x 20

personIds           variant in VCF format
13210.p1            chr1:184268:G:A
12782.s1            chr1:191408:G:A
12972.s1            chr1:271774:AG:A
12420.p1            chr1:484721:AG:A
12518.p1,12518.s1   chr1:691130:T:C
13882.p1            chr1:738645:C:G
14039.s1            chr1:819832:G:T
13872.p1            chr1:824001:AAAAT:A

Using the following Awk script, we can transform this file into an easy-to-import list of de novo variants:

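gunzip -c Supplementary_Data_2.tsv.gz | cut -f 4,9 | awk '
    BEGIN{
        OFS="\t"
        # Header line of the de novo variants file.
        print "chrom", "pos", "ref", "alt", "person_id"
    }
    # Skip the header line of the input.
    NR > 1 {
        # Split chrom:pos:ref:alt into its four parts.
        split($2, v, ":")
        print v[1], v[2], v[3], v[4], $1
    }' > ssc_denovo.tsv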

This script will produce a file named ssc_denovo.tsv with the following content:

chrom          pos            ref            alt            person_id
chr1           184268         G              A              13210.p1
chr1           191408         G              A              12782.s1
chr1           271774         AG             A              12972.s1
chr1           484721         AG             A              12420.p1
chr1           691130         T              C              12518.p1,12518.s1
chr1           738645         C              G              13882.p1
chr1           819832         G              T              14039.s1
chr1           824001         AAAAT          A              13872.p1
chr1           826779         T              C              12132.s1
chr1           834505         G              A              13801.p1

Note

The resulting ssc_denovo.tsv file is also available in the gpf-getting-started repository under the subdirectory example_imports/denovo_and_cnv_import/input_data.
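
Before importing, it can be useful to check that the person IDs in the variants file also appear in the pedigree. The following is a minimal sketch of such a check, not a GPF tool: the function names parse_denovo_line and unknown_person_ids, and the small set of pedigree IDs, are hypothetical and chosen for illustration. The variant rows are taken from the listing above.

```python
def parse_denovo_line(person_ids: str, vcf_like: str):
    """Split 'chrom:pos:ref:alt' and keep the person IDs as given."""
    chrom, pos, ref, alt = vcf_like.split(":")
    return chrom, int(pos), ref, alt, person_ids


def unknown_person_ids(rows, pedigree_ids):
    """Return the person IDs used by variants but absent from the pedigree."""
    missing = set()
    for *_, person_ids in rows:
        # One variant may list several comma-separated persons.
        for pid in person_ids.split(","):
            if pid not in pedigree_ids:
                missing.add(pid)
    return missing


records = [
    ("13210.p1", "chr1:184268:G:A"),
    ("12518.p1,12518.s1", "chr1:691130:T:C"),
    ("13872.p1", "chr1:824001:AAAAT:A"),
]
rows = [parse_denovo_line(p, v) for p, v in records]

# Hypothetical subset of pedigree IDs, used only to exercise the check.
pedigree_ids = {"13210.p1", "12518.p1", "12518.s1"}
missing = unknown_person_ids(rows, pedigree_ids)
print(sorted(missing))
```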

Caching GRR

Now we are about to import 255K variants. During the import, the GPF system will annotate these variants using the GRR resources from our public GRR.

For small studies with few variants this approach is quite convenient. However, for larger studies, it is better to cache the GRR resources locally.

To do this, we need to configure the GRR to use a local cache. Create a file named .grr_definition.yaml in your home directory with the following content:

id: "seqpipe"
type: "url"
url: "https://grr.iossifovlab.com"
cache_dir: "<path_to_your_cache_dir>"

The cache_dir parameter specifies the directory where the GRR resources will be cached. The cache directory should be specified as an absolute path. For example, /tmp/grr_cache or /Users/lubo/grrCache.
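
Because cache_dir should be an absolute path, one way to avoid mistakes is to generate the file content from a resolved path. The sketch below is an illustration only: the helper name grr_definition_text is hypothetical, and the example writes into a temporary directory rather than your home directory so that it can be run safely.

```python
import tempfile
from pathlib import Path


def grr_definition_text(cache_dir: Path) -> str:
    """Render the GRR definition shown above with an absolute cache_dir."""
    cache_dir = cache_dir.expanduser().resolve()
    return (
        'id: "seqpipe"\n'
        'type: "url"\n'
        'url: "https://grr.iossifovlab.com"\n'
        f'cache_dir: "{cache_dir}"\n'
    )


with tempfile.TemporaryDirectory() as tmp:
    # Stand-ins for the real cache directory and for ~/.grr_definition.yaml.
    cache_dir = Path(tmp) / "grr_cache"
    cache_dir.mkdir()
    text = grr_definition_text(cache_dir)
    target = Path(tmp) / ".grr_definition.yaml"
    target.write_text(text)
    print(target.read_text())
```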

To download all the resources needed for our minimal_instance, run the following command from the gpf-getting-started directory:

grr_cache_repo -i minimal_instance/gpf_instance.yaml

Note

The grr_cache_repo command will download all the resources needed for the GPF instance. This may take a while, depending on your internet connection and the number of resources your configuration requires.

The resources will be downloaded to the directory specified in the cache_dir parameter in the .grr_definition.yaml file.

For the gpf-getting-started repository, the resources that will be downloaded are:

  • hg38/genomes/GRCh38-hg38

  • hg38/gene_models/MANE/1.3

  • hg38/variant_frequencies/gnomAD_4.1.0/genomes/ALL

  • hg38/scores/ClinVar_20240730

The total size of the downloaded resources is about 15 GB.

Data Import of ssc_denovo

Now we have a pedigree file, ssc_denovo.ped, and a list of de novo variants, ssc_denovo.tsv. Let us prepare an import project configuration file, ssc_denovo.yaml:

 1  id: ssc_denovo
 2
 3  input:
 4    pedigree:
 5      file: ssc_denovo.ped
 6
 7    denovo:
 8      files:
 9      - ssc_denovo.tsv
10
11  processing_config:
12    denovo: chromosome

When importing genotype data, we often need to instruct the import tool how to split the import process into multiple jobs. For this purpose, we can use the processing_config section of the import project. On lines 11-12 of the ssc_denovo.yaml file, we have defined a processing_config section that splits the import of de novo variants into jobs by chromosome. (For more on import project configuration see Import Tools.)
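
Conceptually, splitting by chromosome means that variants are grouped by their chrom value and each group can be processed as a separate job. The sketch below only illustrates that idea and is not how import_genotypes is implemented; the function name split_by_chromosome and the chr2 row are made up for the example.

```python
from collections import defaultdict

variants = [
    ("chr1", 184268, "G", "A", "13210.p1"),
    ("chr1", 191408, "G", "A", "12782.s1"),
    ("chr2", 1000, "C", "T", "11000.p1"),  # made-up row for illustration
]


def split_by_chromosome(variants):
    """Group variants into one bucket per chromosome."""
    buckets = defaultdict(list)
    for variant in variants:
        buckets[variant[0]].append(variant)
    return dict(buckets)


jobs = split_by_chromosome(variants)
for chrom, bucket in jobs.items():
    print(chrom, len(bucket))
```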

Note

The project file ssc_denovo.yaml is available in the gpf-getting-started repository under the subdirectory example_imports/denovo_and_cnv_import.

To import the study, from the gpf-getting-started directory we should run:

time import_genotypes -v -j 10 example_imports/denovo_and_cnv_import/ssc_denovo.yaml

real    5m29.950s
user    31m52.320s
sys     1m41.755s

The -j 10 option instructs the import_genotypes tool to use 10 threads and the -v option controls the verbosity of the output.

When the import finishes, we can run the development GPF server:

wgpf run

On the home page of the GPF instance we should now see the new study ssc_denovo.

../_images/ssc_denovo_home_page.png

Home page with the imported SSC de novo variants.

If you follow the link to the study, and choose the Genotype Browser tab, you will be able to query the imported variants.

../_images/ssc_denovo_genotype_browser.png

Genotype browser for the SSC de novo variants.

Configure preview and download columns

While importing the SSC de novo variants, we used the annotation defined in the minimal instance configuration file. As a result, all imported variants are annotated with gnomAD and ClinVar genomic scores.

We can use these scores to define additional columns in the preview table and the download file, as described in Getting Started with Preview and Download Columns.

Edit the ssc_denovo configuration file located at minimal_instance/datasets/ssc_denovo/ssc_denovo.yaml and add the following snippet to it:

genotype_browser:
  columns:
    genotype:
      gnomad_v4_genome_af:
        name: gnomAD v4 AF
        source: gnomad_v4_genome_ALL_af
        format: "%%.5f"
      clinvar_clnsig:
        name: CLNSIG
        source: CLNSIG
      clinvar_clndn:
        name: CLNDN
        source: CLNDN

  column_groups:
    gnomad_v4:
      name: gnomAD v4
      columns:
      - gnomad_v4_genome_af

    clinvar:
      name: ClinVar
      columns:
      - clinvar_clnsig
      - clinvar_clndn

  preview_columns_ext:
    - gnomad_v4
    - clinvar

  download_columns_ext:
    - gnomad_v4_genome_af
    - clinvar_clnsig
    - clinvar_clndn

Now, restart the GPF development server:

wgpf run

Go to the Genotype Browser tab of the ssc_denovo study and click the Preview Table button. The preview table should now contain the additional columns for gnomAD and ClinVar genomic scores.

../_images/ssc_denovo_genotype_browser_with_annotated_columns.png

Genotype browser with additional columns for gnomAD and ClinVar genomic scores.

Example import of real CNV variants

Source of the data

As an example for import of CNV variants we will use data from the following paper: Yoon, S., et al. Rates of contributory de novo mutation in high and low-risk autism families. Commun Biol 4, 1026 (2021).

We already discussed the import of de Novo variants from this paper in Example import of real de Novo variants.

Now we will focus on the import of CNV variants from the same paper.

To import these variants into the GPF system, we need a pedigree file describing the families and a list of CNV variants.

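From the supplementary data for the paper we can download the following files: Supplementary_Data_1.tsv.gz, which lists the children, and Supplementary_Data_4.tsv.gz, which lists the CNV variants.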

Note

All the data files needed for this example are available in the gpf-getting-started repository under the subdirectory example_imports/denovo_and_cnv_import.

We already discussed how to transform the list of children into a pedigree file in the Preprocess the Family Data section.

Now we need to prepare the CNV variants file.

Preprocess the CNV variants

The Supplementary_Data_4.tsv.gz file contains 376 CNV variants from the SSC and AGRE collections.

For the import we will use columns two, five, six and seven:

gunzip -c Supplementary_Data_4.tsv.gz | cut -f 2,5-7 | less -S -x 25

collection               personIds                location                 variant
SSC                      12613.p1                 chr1:1305145-1314126     duplication
AGRE                     AU2725301                chr1:3069177-4783791     duplication
SSC                      13424.s1                 chr1:3975501-3977800     deletion
SSC                      12852.p1                 chr1:6647401-6650500     deletion
SSC                      13776.p1                 chr1:8652301-8657600     deletion
SSC                      13373.s1                 chr1:9992001-9994100     deletion
SSC                      14198.p1                 chr1:12224601-12227300   deletion
SSC                      13259.p1                 chr1:15687701-15696200   deletion
SSC                      14696.s1                 chr1:30388501-30398807   deletion

Using the following Awk script, we will keep only the variants from the SSC collection:

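gunzip -c Supplementary_Data_4.tsv.gz | cut -f 2,5-7 | awk '
    BEGIN{
        OFS="\t"
        # Header line of the CNV variants file.
        print "location", "variant", "person_id"
    }
    # Keep only rows from the SSC collection; this also skips the input header.
    $1 == "SSC" {
        # Reorder to location, variant, person ID.
        print $3, $4, $2
    }' > ssc_cnv.tsv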

This script will produce a file named ssc_cnv.tsv with the following content:

location                      variant                       person_id
chr1:1305145-1314126          duplication                   12613.p1
chr1:3975501-3977800          deletion                      13424.s1
chr1:6647401-6650500          deletion                      12852.p1
chr1:8652301-8657600          deletion                      13776.p1
chr1:9992001-9994100          deletion                      13373.s1
chr1:12224601-12227300        deletion                      14198.p1
chr1:15687701-15696200        deletion                      13259.p1
chr1:30388501-30398807        deletion                      14696.s1
chr1:40513501-40534200        deletion                      14534.p1
chr1:40513501-40534200        deletion                      14534.s1

Note

The resulting ssc_cnv.tsv file is available in the gpf-getting-started repository under the subdirectory example_imports/denovo_and_cnv_import/input_data.
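
The filtering step, together with a quick sanity check of the location format, can also be sketched in Python. This is an illustration only and not part of GPF; the helper name parse_location and the regular expression are assumptions based on the chrom:start-end values shown in the listing above.

```python
import re

# Assumed location format, based on the values shown above: chrom:start-end.
LOCATION_RE = re.compile(r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_location(location: str):
    """Split 'chrom:start-end' into (chrom, start, end)."""
    match = LOCATION_RE.match(location)
    if match is None:
        raise ValueError(f"unexpected location format: {location!r}")
    return match["chrom"], int(match["start"]), int(match["end"])


# Rows as (collection, personIds, location, variant), taken from the listing.
rows = [
    ("SSC", "12613.p1", "chr1:1305145-1314126", "duplication"),
    ("AGRE", "AU2725301", "chr1:3069177-4783791", "duplication"),
    ("SSC", "13424.s1", "chr1:3975501-3977800", "deletion"),
]

# Keep only SSC rows and reorder to (location, variant, person_id).
ssc_cnvs = [(loc, variant, pid) for coll, pid, loc, variant in rows if coll == "SSC"]

for loc, variant, pid in ssc_cnvs:
    print(parse_location(loc), variant, pid)
```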

Data Import of ssc_cnv

Now we have a pedigree file, ssc_denovo.ped, and a list of CNV variants, ssc_cnv.tsv. Let us prepare an import project configuration file, ssc_cnv.yaml:

 1  id: ssc_cnv
 2
 3  input:
 4    pedigree:
 5      file: ssc_denovo.ped
 6
 7    cnv:
 8      files:
 9      - ssc_cnv.tsv
10
11      location: location
12      variant_type: variant
13      plus_values: duplication
14      minus_values: deletion
15      person_id: person_id

Lines 12-14 describe how the CNV type is encoded in the input file. The variant_type option names the column that holds the type (here, variant); plus_values and minus_values list the values in that column that mark a variant as a duplication or a deletion, respectively.
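
The mapping configured by plus_values and minus_values can be pictured with the following sketch. It is a conceptual illustration, not GPF code: the function name classify_cnv and the labels "gain" and "loss" are hypothetical and chosen only for this example.

```python
# Values from the ssc_cnv.yaml project file.
PLUS_VALUES = {"duplication"}
MINUS_VALUES = {"deletion"}


def classify_cnv(value: str) -> str:
    """Map a value of the variant column to an illustrative CNV label."""
    if value in PLUS_VALUES:
        return "gain"  # hypothetical label for a duplication
    if value in MINUS_VALUES:
        return "loss"  # hypothetical label for a deletion
    raise ValueError(f"unknown CNV variant type: {value!r}")


for value in ("duplication", "deletion"):
    print(value, "->", classify_cnv(value))
```

Raising an error for unexpected values is a design choice of this sketch; it makes typos in the input file visible instead of silently dropping rows.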

Note

The project file ssc_cnv.yaml is available in the gpf-getting-started repository under the subdirectory example_imports/denovo_and_cnv_import.

To import the study, from the gpf-getting-started directory we should run:

time import_genotypes -v -j 1 example_imports/denovo_and_cnv_import/ssc_cnv.yaml

When the import finishes, we can run the development GPF server:

wgpf run

On the home page of the GPF instance we should now see the new study ssc_cnv.

../_images/ssc_cnv_home_page.png

Home page with the imported ssc_cnv study.

If you follow the link to the study, and choose the Genotype Browser tab, you will be able to query the imported CNV variants.

../_images/ssc_cnv_genotype_browser.png

Genotype browser for the SSC CNV variants.