GPF Getting Started Guide

Prerequisites

This guide assumes that you are working on a recent Linux box.

Working version of anaconda or miniconda

The GPF system is distributed as an Anaconda package using the conda package manager.

If you do not have a working installation of Anaconda or Miniconda, you must install one. We recommend using Miniconda.

Go to the Miniconda distribution page and download the Linux installer

wget -c https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

and install it in your local environment:

sh Miniconda3-latest-Linux-x86_64.sh

Note

At the end of the installation process, you will be asked if you wish to allow the installer to initialize Miniconda3 by running conda init. If you choose to, every terminal you open after that will have the base Anaconda environment activated, and you’ll have access to the conda commands used below.

Once Anaconda/Miniconda is installed, we recommend installing mamba and using it instead of conda. Mamba speeds up the installation of packages:

conda install -c conda-forge mamba

GPF Installation

The GPF system is developed in Python and supports Python 3.9 and up. The recommended way to set up the GPF development environment is to use Anaconda.

Install GPF

Create an empty Anaconda environment named gpf:

conda create -n gpf

To use this environment, you need to activate it using the following command:

conda activate gpf

Install the gpf_wdae conda package into the already activated gpf environment:

mamba install \
    -c conda-forge \
    -c bioconda \
    -c iossifovlab \
    -c defaults \
    gpf_wdae

This command is going to install GPF and all of its dependencies.

Clone the example “getting-started” repository

git clone https://github.com/iossifovlab/getting-started.git

This repository provides a minimal instance and sample data to be imported.

If not specified otherwise, GPF uses the default genomic resources repository (GRR) located at https://www.iossifovlab.com/distribution/public/genomic-resources-repository/; resources are used without caching. This GPF instance uses the reference genome hg38/genomes/GRCh38-hg38 and the gene models hg38/gene_models/refSeq_v20200330, both from the default GRR.
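These two resources are declared in the instance configuration file; a minimal gpf_instance.yaml for this setup looks roughly like the following sketch (the cloned getting-started repository already contains one, so you do not need to write it yourself):

```yaml
reference_genome:
  resource_id: "hg38/genomes/GRCh38-hg38"

gene_models:
  resource_id: "hg38/gene_models/refSeq_v20200330"
```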

Run the GPF development web server

By default, the GPF system looks for a file gpf_instance.yaml in the current directory (and its parent directories). If GPF finds such a file, it uses it as the configuration for the GPF instance; otherwise, it raises an exception.

Additionally, GPF will also consider the DAE_DB_DIR environment variable. Sourcing the provided setenv.sh file will set this variable for you.

source setenv.sh

Now we can run the GPF development web server and browse our empty GPF instance:

wgpf run

and browse the GPF development server at http://localhost:8000.

To stop the development GPF web server, you should press Ctrl-C - the usual keybinding for stopping long-running Linux commands in a terminal.

Warning

The development web server run by wgpf run used in this guide is meant for development purposes only and is not suitable for serving the GPF system in production.

Import genotype variants

Data Storage

The GPF system uses genotype storages for storing genomic variants.

We are going to use in-memory genotype storage for this guide. It is the easiest to set up and use, but it is unsuitable for large studies.

By default, each GPF instance has internal in-memory genotype storage.

Import Tools and Import Project

Importing genotype data into a GPF instance involves multiple steps. The tool used to import genotype data is named import_tools. This tool expects an import project file that describes the import.

This tool supports importing variants from three formats:

  • List of de novo variants

  • List of de novo CNV variants

  • Variant Call Format (VCF)

Example import of de novo variants: helloworld

Let us import a small list of de novo variants. We will need the list of de novo variants, raw_genotype_data/helloworld.tsv, and a pedigree file that describes the families, raw_genotype_data/helloworld.ped.

A project configuration file for importing this study (raw_genotype_data/import_denovo_project.yaml) is also provided.
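For orientation, a de novo import project file generally has the following shape (a sketch modeled on the ssc_denovo example later in this guide; the provided import_denovo_project.yaml may differ in details such as column mappings):

```yaml
id: denovo_helloworld

input:
  pedigree:
    file: raw_genotype_data/helloworld.ped

  denovo:
    files:
      - raw_genotype_data/helloworld.tsv
```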

To import this project run the following command:

import_tools raw_genotype_data/import_denovo_project.yaml

When the import finishes you can run the GPF development server using:

wgpf run

and browse the content of the GPF development server at http://localhost:8000.

Example import of VCF variants: vcf_helloworld

Similar to the sample de novo variants, sample variants are also available in VCF format. They can be found in raw_genotype_data/helloworld.vcf; the same pedigree file from before is used.

To import them, run the following command:

import_tools raw_genotype_data/vcf_helloworld.yaml

When the import finishes you can run the GPF development server using:

wgpf run

and browse the content of the GPF development server at http://localhost:8000.

Example of a dataset (group of genotype studies)

The already imported studies denovo_helloworld and vcf_helloworld contain genomic variants for the same group of individuals described in helloworld.ped. We can create a dataset (a group of genotype studies) that includes both studies.

To this end, create a directory datasets/helloworld inside the GPF instance directory minimal_instance:

cd minimal_instance
mkdir -p datasets/helloworld

and place the following configuration file helloworld.yaml inside that directory:

id: helloworld
name: Hello World Dataset

studies:
  - denovo_helloworld
  - vcf_helloworld

Example import of de novo variants from Rates of contributory de novo mutation in high and low-risk autism families

Let us import de novo variants from Yoon, S., Munoz, A., Yamrom, B. et al., "Rates of contributory de novo mutation in high and low-risk autism families", Commun Biol 4, 1026 (2021).

We will focus on de novo variants from the SSC collection published in the aforementioned paper. To import these variants into the GPF system we need a list of de novo variants and a pedigree file describing the families. The list of de novo variants is available from Supplementary Data 2. The pedigree file for this study is not available. Instead, we have a list of children available from Supplementary Data 1.

Let us first export these Excel spreadsheets to TSV files. Suppose the list of de novo variants from the SSC collection is saved in a file named SupplementaryData2_SSC.tsv and the list of children is saved in a file named SupplementaryData1_Children.tsv.

Note

Input files for this example can be downloaded from denovo-in-high-and-low-risk-papter.tar.gz.

Preprocess the families data

To import the data into GPF, we need a pedigree file describing the structure of the families. The SupplementaryData1_Children.tsv file contains only the list of children; there is no information about their parents. Fortunately, for the SSC collection it is not difficult to reconstruct the full family structures from the information we have. If a family has ID <fam_id>, the identifiers of the individuals in the family are formed as follows:

  • mother - <fam_id>.mo;

  • father - <fam_id>.fa;

  • proband - <fam_id>.p1;

  • first sibling - <fam_id>.s1;

  • second sibling - <fam_id>.s2.

Another important property of the SSC collection is that the only affected person in a family is the proband; the mother, father, and siblings are unaffected.
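As a quick illustration of these ID conventions, here is a tiny hypothetical helper (not part of GPF) that derives the individual IDs for a family:

```python
def ssc_member_ids(fam_id: str, n_siblings: int = 1) -> list[str]:
    """Derive SSC individual IDs from a family ID (illustrative helper)."""
    ids = [f"{fam_id}.mo", f"{fam_id}.fa", f"{fam_id}.p1"]
    # siblings are numbered s1, s2, ...
    ids += [f"{fam_id}.s{i}" for i in range(1, n_siblings + 1)]
    return ids

print(ssc_member_ids("11542"))
# ['11542.mo', '11542.fa', '11542.p1', '11542.s1']
```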

Using all these conventions, we can write a simple Python script build_ssc_pedigree.py to convert SupplementaryData1_Children.tsv into a pedigree file ssc_denovo.ped:

"""Converts SupplementaryData1_Children.tsv into a pedigree file."""
import pandas as pd

children = pd.read_csv("SupplementaryData1_Children.tsv", sep="\t")
ssc = children[children.collection == "SSC"]

# list of all individuals in SSC
persons = []
# each person is represented by a tuple:
# (familyId, personId, dadId, momId, status, sex)

for fam_id, members in ssc.groupby("familyId"):
    persons.append((fam_id, f"{fam_id}.mo", "0", "0", "unaffected", "F"))
    persons.append((fam_id, f"{fam_id}.fa", "0", "0", "unaffected", "M"))
    for child in members.to_dict(orient="records"):
        persons.append((
            fam_id, child["personId"], f"{fam_id}.fa", f"{fam_id}.mo",
            child["affected status"], child["sex"]))

with open("ssc_denovo.ped", "wt", encoding="utf8") as output:
    output.write(
        "\t".join(("familyId", "personId", "dadId", "momId", "status", "sex")))
    output.write("\n")

    for person in persons:
        output.write("\t".join(person))
        output.write("\n")

Running this script reads SupplementaryData1_Children.tsv and produces the appropriate pedigree file ssc_denovo.ped.

Preprocess the variants data

The SupplementaryData2_SSC.tsv file contains 255,231 variants. Importing that many variants into in-memory genotype storage is not appropriate, so for this example we are going to use a subset of 10,000 variants:

head -n 10001 SupplementaryData2_SSC.tsv > ssc_denovo.tsv

Data import of ssc_denovo

Now we have a pedigree file ssc_denovo.ped and a list of de novo variants ssc_denovo.tsv. Let us prepare an import project configuration file ssc_denovo.yaml:

id: ssc_denovo

input:
  pedigree:
    file: ssc_denovo.ped

  denovo:
    files:
      - ssc_denovo.tsv
    person_id: personIds
    variant: variant
    location: location

To import the study we should run:

import_tools ssc_denovo.yaml

and when the import finishes we can run the development GPF server:

wgpf run

In the list of studies, we should have a new study ssc_denovo.

Getting started with Dataset Statistics

To generate family and de novo variant reports, you can use the generate_common_report.py tool. It supports the option --show-studies to list all studies and datasets configured in the GPF instance:

generate_common_report.py --show-studies

To generate the reports for a given study or dataset, you can use the --studies option.

By default, dataset statistics are disabled. If we try to run

generate_common_report.py --studies helloworld

it will not generate the dataset statistics. Instead, it will print a message that the reports are disabled for the study helloworld:

WARNING:generate_common_reports:skipping study helloworld

To enable the dataset statistics for the helloworld dataset we need to modify the configuration and add a new section that enables dataset statistics:

id: helloworld
name: Hello World Dataset

studies:
  - denovo_helloworld
  - vcf_helloworld

common_report:
  enabled: True

Let us now re-run the generate_common_report.py command:

generate_common_report.py --studies helloworld

If we now start the GPF development server:

wgpf run

and browse the helloworld dataset we will see the Dataset Statistics section available.

Getting started with de novo gene sets

To generate de novo gene sets, you can use the generate_denovo_gene_sets.py tool. Similar to the generate_common_report.py tool above, it supports the --show-studies and --studies options.

By default, the de novo gene sets are disabled. If you want to enable them for a specific study or dataset, you need to update the configuration and add a section that enables the de novo gene sets:

denovo_gene_sets:
  enabled: true

For example, the configuration of the helloworld dataset should become similar to:

id: helloworld
name: Hello World Dataset

studies:
  - denovo_helloworld
  - vcf_helloworld

common_report:
  enabled: True

denovo_gene_sets:
  enabled: true

Then we can generate the de novo gene sets for the helloworld dataset by running:

generate_denovo_gene_sets.py --studies helloworld

Getting Started with Annotation

The import of genotype data into a GPF instance always runs effect annotation. It is easy to extend the annotation of genotype data during the import.

To define the annotation used during the import into a GPF instance we have to add a configuration file that defines the pipeline of annotators. After that, we need to configure the GPF instance to use this annotation pipeline.

There is a public Genomic Resources Repository (GRR) with a collection of public genomic resources available for use with the GPF system.

Example: Annotation with GnomAD 3.0

To annotate the genotype variants with GnomAD allele frequencies, we need to find a GnomAD genomic resource in the public GRR. We will use the hg38/variant_frequencies/gnomAD_v3/genomes resource. Its resource page shows that it is an allele_score resource, so to use it in the annotation we should use the allele_score annotator.

The minimal configuration of annotation with this GnomAD resource is the following:

- allele_score: hg38/variant_frequencies/gnomAD_v3/genomes

Store this annotation configuration in a file named annotation.yaml and configure the GPF instance to use this annotation configuration:

reference_genome:
  resource_id: "hg38/genomes/GRCh38-hg38"

gene_models:
  resource_id: "hg38/gene_models/refSeq_v20200330"

annotation:
  conf_file: annotation.yaml

Now we can re-run the import for our helloworld examples:

  • Go to the denovo-helloworld project directory and re-run the import:

    import_tools -f denovo_helloworld.yaml
    
  • Go to the vcf-helloworld project directory and re-run the import:

    import_tools -f vcf_helloworld.yaml
    

Once the re-import finishes, the variants in our Hello World Dataset will have additional attributes that come from the annotation with GnomAD v3.0. By default, the annotation adds the following three attributes:

  • genome_gnomad_v3_af_percent - allele frequencies as a percent;

  • genome_gnomad_v3_ac - allele count;

  • genome_gnomad_v3_an - number of sequenced alleles.

If we run the GPF development server and browse our Hello World Dataset, there is almost no visible difference. The only change is that in the genotype browser the genomic scores section is no longer empty, and we can query our variants using the genome_gnomad_v3_af_percent genomic score.

[Figure: genomic scores filter using the GnomAD annotation in the genotype browser]

To make the new annotation attributes available in the variants preview table and in the variants download we need to change the study configuration. Check the Getting Started with Preview and Download Columns section for additional information.

Getting Started with Preview and Download Columns

When importing data into a GPF instance we can run an annotation pipeline that adds additional attributes to each variant. To make these attributes available in the variants preview table and in the variants download file we need to change the configuration of the corresponding study or dataset.

For each study dataset, you can specify which columns are shown in the variants’ table preview, as well as those which will be downloaded.

Example: Redefine the Frequency column in the preview table of the Hello World Dataset

As an example, we are going to redefine the Frequency column for the helloworld dataset to include attributes from the annotation with the GnomAD v3 genomic score.

Navigate to the helloworld dataset folder:

cd datasets/helloworld

and edit the helloworld.yaml file. Add the following section to the end:

genotype_browser:
  columns:
    genotype:
      genome_gnomad_v3_af_percent:
        name: gnomAD v3 AF
        source: genome_gnomad_v3_af_percent
        format: "%%.3f"
      genome_gnomad_v3_ac:
        name: gnomAD v3 AC
        source: genome_gnomad_v3_ac
        format: "%%d"
      genome_gnomad_v3_an:
        name: gnomAD v3 AN
        source: genome_gnomad_v3_an
        format: "%%d"
  column_groups:
    freq:
      name: "Frequency"
      columns:
        - genome_gnomad_v3_af_percent
        - genome_gnomad_v3_ac
        - genome_gnomad_v3_an

This overwrites the definition of the default preview column Frequency to include the gnomAD v3 frequencies. If we now browse the Hello World Dataset and run variants preview in the genotype browser we will start seeing the GnomAD attributes:

[Figure: GnomAD frequency columns in the genotype browser preview table]

Example: Add GnomAD v3 columns to the variants download

As an example let us add GnomAD v3 columns to the variants downloads.

By default, each genotype study or dataset has a list of predefined columns used when downloading variants. Users can replace the default list of download columns by defining the download_columns list, or extend it by defining the download_columns_ext list.

In the example below we are going to use download_columns_ext to add GnomAD v3 columns to the properties of downloaded variants:

genotype_browser:
  columns:
    genotype:
      genome_gnomad_v3_af_percent:
        name: gnomAD v3 AF
        source: genome_gnomad_v3_af_percent
        format: "%%.3f"
      genome_gnomad_v3_ac:
        name: gnomAD v3 AC
        source: genome_gnomad_v3_ac
        format: "%%d"
      genome_gnomad_v3_an:
        name: gnomAD v3 AN
        source: genome_gnomad_v3_an
        format: "%%d"
  column_groups:
    freq:
      name: "Frequency"
      columns:
        - genome_gnomad_v3_af_percent
        - genome_gnomad_v3_ac
        - genome_gnomad_v3_an

  download_columns_ext:
    - genome_gnomad_v3_af_percent
    - genome_gnomad_v3_ac
    - genome_gnomad_v3_an

Getting Started with Gene Browser

The Gene Browser in the GPF system uses the allele frequency as a Y-coordinate when displaying the allele. By default, the allele frequency used is the frequency of the alleles in the imported data.

[Figure: Gene Browser using the study allele frequency]

After annotation of the helloworld data with GnomAD v3 we can use the GnomAD allele frequency in the Gene Browser.

Example: Configure the Gene Browser to use the GnomAD frequency as the variant frequency

To configure the Hello World Dataset to use the GnomAD v3 allele frequency, we need to add a new section gene_browser in the dataset's configuration file datasets/helloworld/helloworld.yaml as follows:

id: helloworld
name: Hello World Dataset

...

gene_browser:
  frequency_column: genome_gnomad_v3_af_percent

If we restart the GPF development server and navigate to the Hello World Dataset Gene Browser, the Y-axis will use the GnomAD allele frequency instead of the study allele frequency.

[Figure: Gene Browser using the GnomAD allele frequency]

Getting Started with Enrichment Tool

For studies that include de novo variants, you can enable the enrichment tool UI. As an example, let us enable it for the already imported iossifov_2014 study.

Go to the directory where the configuration file of the iossifov_2014 study is located:

cd gpf_test/studies/iossifov_2014

Edit the study configuration file iossifov_2014.conf and add the following section at the end of the file:

[enrichment]
enabled = true

Restart the GPF web server:

wdaemanage.py runserver 0.0.0.0:8000

Now when you navigate to the iossifov_2014 study in the browser, the Enrichment Tool tab will be available.

Getting Started with Phenotype Data

Setting up the GPF instance phenotype database

The GPF instance has four configuration settings that determine how phenotype data is read and stored:

  • The most important is the phenotype data directory, which is where the phenotype data configurations are stored. If not specified, GPF will look for the DAE_PHENODB_DIR environment variable; if that is not set either, it defaults to the pheno directory inside the GPF instance directory.

  • Phenotype storages can be configured to tell the GPF instance where to look for phenotype data DB files. If no phenotype storages are defined, a default phenotype storage is used, which uses the phenotype data directory.

  • The cache option can be configured to tell the GPF instance and GPF tools where to store generated phenotype browser data. Data will be stored inside the <cache_dir>/pheno directory.

  • The phenotype images option can be configured to tell the GPF instance and GPF tools where to store generated phenotype browser images.

You can examine the provided gpf_instance.yaml to see how these settings are configured in it.

Importing phenotype data

To import phenotype data, the import_phenotypes tool is used.

The tool requires an import project, a YAML file describing the contents of the phenotype data to be imported, along with configuration options on how to import them.

As an example, we are going to show how to import simulated phenotype data into our GPF instance.

Inside the raw_phenotype_data directory, the following data is provided:

  • instruments contains the phenotype instruments and measures to be imported.

  • pedigree.ped is the corresponding pedigree file.

  • measure_descriptions.tsv contains descriptions for the provided measures.

  • import_project.yaml is the import project configuration that we will use to import this data.

To import the phenotype data, we will use the import_phenotypes tool. It will import the phenotype data directly into our GPF instance's phenotype storage:

import_phenotypes raw_phenotype_data/import_project.yaml

When the import finishes you can run the GPF development server using:

wgpf run

This will generate a phenotype browser database automatically, and the phenotype study should be directly accessible.

Phenotype browser databases are necessary to view the data through the web application. They are further described in the phenotype data documentation.

Configuring a phenotype database

Phenotype databases have a short configuration file which points the system to their files and specifies additional properties. When importing a phenotype database through the import_phenotypes tool, a configuration file is generated automatically. You may inspect the minimal_instance/pheno/mini_pheno/mini_pheno.yaml configuration file generated by the import tool:

browser_images_url: static/images/
id: mini_pheno
name: mini_pheno
phenotype_storage:
  db: mini_pheno/mini_pheno.db
  id: storage1
regressions:
  reg_1:
    display_name: Regression one
    instrument_name: instrument_1
    jitter: 0.1
    measure_names:
    - measure_1
type: study

Configure Genotype Study With Phenotype Data

To demonstrate how a study is configured with a phenotype database, we will be working with the already imported helloworld dataset.

The phenotype databases can be attached to one or more studies and/or datasets. If you want to attach the mini_pheno phenotype study to the helloworld dataset, you need to specify it in the dataset’s configuration file, which can be found at minimal_instance/datasets/helloworld/helloworld.yaml.

Add the following line at the beginning of the file, outside of any section:

phenotype_data: mini_pheno

To enable the Phenotype Browser UI, add this line:

phenotype_browser: true

After this, the beginning of the configuration file should look like this:

id: helloworld
name: Hello World Dataset

phenotype_data: mini_pheno
phenotype_browser: true

studies:
- denovo_helloworld
- vcf_helloworld

When you restart the server, you should be able to see the ‘Phenotype Browser’ tab in the helloworld dataset.

Configure Family Filters in Genotype Browser

A study or a dataset can have phenotype filters configured for its Genotype Browser UI when it has a phenotype database attached. The configuration looks like this:

genotype_browser:
  enabled: true
  family_filters:
    sample_continuous_filter:
      name: Sample Filter Name
      source_type: continuous
      from: phenodb
      filter_type: multi
      role: prb

After adding the family filters configuration, restart the web server and navigate to the Genotype Browser. You should be able to see the Advanced option under the Family Filters - this is where the family filters can be applied.

Configure Phenotype Columns in Genotype Browser

Phenotype columns contain values from a phenotype database. These values are selected from the individual who has the displayed variant and are shown in the Genotype Browser's preview table. They can be added when a phenotype database is attached to a study.

Let’s add a phenotype column. To do this, you need to define it in the study’s config, in the genotype browser section:

genotype_browser:
  # ...
  columns:
    phenotype:
      sample_pheno_measure:
        role: prb
        source: instrument_1.measure_1
        name: Sample Pheno Measure Column

For the phenotype columns to be in the Genotype Browser table preview or download file, they have to be present in the preview_columns or the download_columns in the Genotype Browser configuration. Add this in the genotype_browser section:

preview_columns:
- family
- variant
- genotype
- effect
- gene_scores
- sample_pheno_measure

Enabling the Phenotype Tool

To enable the Phenotype Tool UI for a study, you must edit the study's configuration file and set the appropriate property, as with the Phenotype Browser UI. Open the helloworld dataset configuration file and add the following line:

phenotype_tool: true

After editing, it should look like this:

id: helloworld
name: Hello World Dataset

phenotype_data: mini_pheno
phenotype_browser: true
phenotype_tool: true

studies:
- denovo_helloworld
- vcf_helloworld

Restart the GPF web server and select the helloworld dataset. You should see the Phenotype Tool tab. Once you have selected it, you can select a phenotype measure of your choice. To get the tool to acknowledge the variants in the helloworld dataset, select the "All" option of the "Present in Parent" field.

Click on the “Report” button to produce the results.

Example Usage of GPF Python Interface

The simplest way to start using GPF’s Python API is to import the GPFInstance class and instantiate it:

from dae.gpf_instance.gpf_instance import GPFInstance
gpf_instance = GPFInstance.build()

This gpf_instance object groups together a number of objects, each dedicated to managing different parts of the underlying data. It can be used to interact with the system as a whole.

For example, to list all studies configured in the GPF instance, use:

gpf_instance.get_genotype_data_ids()

This will return a list of the IDs of all configured studies:

['denovo_helloworld',
 'vcf_helloworld',
 'helloworld']

To get a specific study and query it, you can use:

st = gpf_instance.get_genotype_data('helloworld')
vs = list(st.query_variants())

Note

The query_variants method returns a Python iterator.
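Because the result is an iterator, variants are produced lazily; you can, for example, inspect only the first few without materializing the whole result. The sketch below uses a stand-in generator in place of st.query_variants():

```python
from itertools import islice

def query_variants_stub():
    """Stand-in for st.query_variants(): yields results lazily."""
    for i in range(1_000_000):
        yield f"variant-{i}"

# Take just the first three variants; the remaining results are never produced.
first_three = list(islice(query_variants_stub(), 3))
print(first_three)
# ['variant-0', 'variant-1', 'variant-2']
```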

To get the basic information about variants found by the query_variants method, you can use:

for v in vs:
    for aa in v.alt_alleles:
        print(aa)

chr1:1287138 C->A f1
chr1:3602485 AC->A f1
chr1:12115730 G->A f1
chr1:20639952 C->T f2
chr1:21257524 C->T f2
chr14:21385738 C->T f1
chr14:21385738 C->T f2
chr14:21385954 A->C f2
chr14:21393173 T->C f1
chr14:21393702 C->T f2
chr14:21393860 G->A f1
chr14:21403023 G->A f1
chr14:21403023 G->A f2
chr14:21405222 T->C f2
chr14:21409888 T->C f1
chr14:21409888 T->C f2
chr14:21429019 C->T f1
chr14:21429019 C->T f2
chr14:21431306 G->A f1
chr14:21431623 A->C f2
chr14:21393540 GGAA->G f1

The query_variants interface allows you to specify what kind of variants you are interested in. For example, if you only need “splice-site” variants, you can use:

st = gpf_instance.get_genotype_data('helloworld')
vs = st.query_variants(effect_types=['splice-site'])
vs = list(vs)
print(len(vs))

>> 2

Or, if you are interested in "splice-site" variants only in individuals with the prb role, you can use:

vs = st.query_variants(effect_types=['splice-site'], roles='prb')
vs = list(vs)
len(vs)

>> 1