GPF Getting Started Guide
Prerequisites
This guide assumes that you are working on a recent Linux box.
Working version of anaconda or miniconda
The GPF system is distributed as an Anaconda package using the conda
package manager.
If you do not have a working version of Anaconda or Miniconda, you must install one. We recommend using Miniconda.
Go to the Miniconda distribution page, download the Linux installer
wget -c https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
and install it in your local environment:
sh Miniconda3-latest-Linux-x86_64.sh
Note
At the end of the installation process, you will be asked if you wish
to allow the installer to initialize Miniconda3 by running conda init.
If you choose to, every terminal you open after that will have the base
Anaconda environment activated, and you’ll have access to the conda
commands used below.
Once Anaconda/Miniconda is installed, we recommend installing mamba and using it instead of conda. Mamba speeds up the installation of packages:
conda install -c conda-forge mamba
GPF Installation
The GPF system is developed in Python and supports Python 3.9 and up. The recommended way to set up the GPF development environment is to use Anaconda.
Install GPF
Create an empty Anaconda environment named gpf:
conda create -n gpf
To use this environment, you need to activate it using the following command:
conda activate gpf
Install the gpf_wdae conda package into the already activated gpf environment:
mamba install \
-c conda-forge \
-c bioconda \
-c iossifovlab \
-c defaults \
gpf_wdae
This command is going to install GPF and all of its dependencies.
Clone the example “getting-started” repository
git clone https://github.com/iossifovlab/getting-started.git
This repository provides a minimal instance and sample data to be imported.
The reference genome used by this GPF instance is hg38/genomes/GRCh38-hg38
from the default GRR.
The gene models used by this GPF instance are hg38/gene_models/refSeq_v20200330
from the default GRR.
If not specified otherwise, GPF uses the default genomic resources
repository located at
https://www.iossifovlab.com/distribution/public/genomic-resources-repository/.
Resources are used without caching.
Run the GPF development web server
By default, the GPF system looks for a file gpf_instance.yaml in the
current directory (and its parent directories). If GPF finds such a file, it
uses it as the configuration for the GPF instance. Otherwise, it throws an
exception.
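The lookup rule described above can be sketched in a few lines of Python. This is only an illustration of the search order (current directory first, then each parent); the helper name find_instance_config is hypothetical and is not part of GPF, and where GPF throws an exception this sketch simply returns None.

```python
from pathlib import Path
from typing import Optional

CONFIG_NAME = "gpf_instance.yaml"  # file name taken from this guide


def find_instance_config(start: Path) -> Optional[Path]:
    """Return the nearest gpf_instance.yaml at or above `start`, else None.

    Hypothetical helper that mimics the documented search order only.
    """
    start = start.resolve()
    # check the starting directory first, then walk up through its parents
    for directory in (start, *start.parents):
        candidate = directory / CONFIG_NAME
        if candidate.is_file():
            return candidate
    return None
```

Calling find_instance_config(Path.cwd()) from inside the instance directory, or from any of its subdirectories, would return the path of the configuration file that the tools are expected to pick up.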
GPF also considers the DAE_DB_DIR environment variable.
Sourcing the provided setenv.sh file will set this variable for you.
source setenv.sh
Now we can run the GPF development web server and browse our empty GPF instance:
wgpf run
and browse the GPF development server at http://localhost:8000.
To stop the development GPF web server, press Ctrl-C, the usual
keybinding for stopping long-running Linux commands in a terminal.
Warning
The development web server started by wgpf run in this guide is meant for
development purposes only and is not suitable for serving the GPF system in
production.
Import genotype variants
Data Storage
The GPF system uses genotype storages for storing genomic variants.
We are going to use in-memory genotype storage for this guide. It is easiest to set up and use, but it is unsuitable for large studies.
By default, each GPF instance has internal in-memory genotype storage.
Import Tools and Import Project
Importing genotype data into a GPF instance involves multiple steps. The tool used to import genotype data is named import_tools. This tool expects an import project file that describes the import.
This tool supports importing variants from three formats:
List of de novo variants
List of de novo CNV variants
Variant Call Format (VCF)
Example import of de novo variants: helloworld
Let us import a small list of de novo variants. We will need the list of
de novo variants raw_genotype_data/helloworld.tsv and a pedigree file
that describes the families, raw_genotype_data/helloworld.ped.
A project configuration file for importing this study
(raw_genotype_data/import_denovo_project.yaml) is also provided.
To import this project run the following command:
import_genotypes raw_genotype_data/import_denovo_project.yaml
When the import finishes you can run the GPF development server using:
wgpf run
and browse the content of the GPF development server at http://localhost:8000
Example import of VCF variants: vcf_helloworld
Similar to the sample de novo variants, there are also sample variants in VCF format.
They can be found in raw_genotype_data/helloworld.vcf, and the same pedigree
file as before is used.
To import them, run the following command:
import_genotypes raw_genotype_data/vcf_helloworld.yaml
When the import finishes you can run the GPF development server using:
wgpf run
and browse the content of the GPF development server at http://localhost:8000
Example of a dataset (group of genotype studies)
The already imported studies denovo_helloworld and vcf_helloworld
have genomic variants for the same group of individuals, described in helloworld.ped.
We can create a dataset (group of genotype studies) that includes both studies.
To this end, create a directory datasets/helloworld inside the GPF instance
directory minimal_instance:
cd minimal_instance
mkdir -p datasets/helloworld
and place the following configuration file helloworld.yaml inside that directory:
id: helloworld
name: Hello World Dataset
studies:
- denovo_helloworld
- vcf_helloworld
Example import of de novo variants from Rates of contributory de novo mutation in high and low-risk autism families
Let us import de novo variants from Yoon, S., Munoz, A., Yamrom, B. et al. Rates of contributory de novo mutation in high and low-risk autism families. Commun Biol 4, 1026 (2021).
We will focus on de novo variants from the SSC collection published in the aforementioned paper. To import these variants into the GPF system we need a list of de novo variants and a pedigree file describing the families. The list of de novo variants is available from Supplementary Data 2. The pedigree file for this study is not available. Instead, we have a list of children available from Supplementary Data 1.
Let us first export these Excel spreadsheets into tab-separated (TSV) files. Let us say that the
list of de novo variants from the SSC collection is saved into a file named
SupplementaryData2_SSC.tsv and the list of children is saved into a file
named SupplementaryData1_Children.tsv.
Note
Input files for this example can be downloaded from
denovo-in-high-and-low-risk-papter.tar.gz.
Preprocess the families data
To import the data into GPF we need a pedigree file describing the structure
of the families. The SupplementaryData1_Children.tsv file contains only the list
of children. There is no information about their parents. Fortunately, for the
SSC collection it is not difficult to build the full family structures from
the information we have. For the SSC collection, if you have a family with ID
<fam_id>, then the identifiers of the individuals in the family are
formed as follows:
mother - <fam_id>.mo;
father - <fam_id>.fa;
proband - <fam_id>.p1;
first sibling - <fam_id>.s1;
second sibling - <fam_id>.s2.
Another important restriction for SSC is that the only affected person in the
family is the proband. The mother, father, and siblings are all unaffected.
Using all these conventions, we can write a simple Python script
build_ssc_pedigree.py to convert SupplementaryData1_Children.tsv
into a pedigree file ssc_denovo.ped:
"""Converts SupplementaryData1_Children.tsv into a pedigree file."""
import pandas as pd

children = pd.read_csv("SupplementaryData1_Children.tsv", sep="\t")
ssc = children[children.collection == "SSC"]

# list of all individuals in SSC
persons = []
# each person is represented by a tuple:
# (familyId, personId, dadId, momId, status, sex)
for fam_id, members in ssc.groupby("familyId"):
    # parents are not listed in the input; both are unaffected founders
    persons.append((fam_id, f"{fam_id}.mo", "0", "0", "unaffected", "F"))
    persons.append((fam_id, f"{fam_id}.fa", "0", "0", "unaffected", "M"))
    for child in members.to_dict(orient="records"):
        persons.append((
            fam_id, child["personId"], f"{fam_id}.fa", f"{fam_id}.mo",
            child["affected status"], child["sex"]))

with open("ssc_denovo.ped", "wt", encoding="utf8") as output:
    output.write(
        "\t".join(("familyId", "personId", "dadId", "momId", "status", "sex")))
    output.write("\n")
    for person in persons:
        # convert every field to str: pandas may parse family IDs as integers
        output.write("\t".join(str(field) for field in person))
        output.write("\n")
If we run this script, it will read SupplementaryData1_Children.tsv and
produce the appropriate pedigree file ssc_denovo.ped.
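Before importing, it can be worth checking that the generated pedigree is internally consistent. The sketch below is not part of GPF; check_pedigree is a hypothetical helper that verifies every dadId and momId refers to a person in the file and that the referenced parent has the expected sex. The sample rows use a made-up family f1, and the status and sex codes in them are assumptions about what the spreadsheet contains.

```python
import csv
import io

# Made-up sample in the column layout written by the script above;
# family "f1" is a placeholder, not a real SSC family.
SAMPLE_PED = (
    "familyId\tpersonId\tdadId\tmomId\tstatus\tsex\n"
    "f1\tf1.mo\t0\t0\tunaffected\tF\n"
    "f1\tf1.fa\t0\t0\tunaffected\tM\n"
    "f1\tf1.p1\tf1.fa\tf1.mo\taffected\tM\n"
)


def check_pedigree(handle):
    """Return a list of problems found in a tab-separated pedigree."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    by_id = {row["personId"]: row for row in rows}
    problems = []
    for row in rows:
        for column, expected_sex in (("dadId", "M"), ("momId", "F")):
            parent_id = row[column]
            if parent_id == "0":  # "0" marks a parent that is not recorded
                continue
            parent = by_id.get(parent_id)
            if parent is None:
                problems.append(
                    f"{row['personId']}: unknown {column} {parent_id}")
            elif parent["sex"] != expected_sex:
                problems.append(
                    f"{parent_id}: expected sex {expected_sex}")
    return problems


print(check_pedigree(io.StringIO(SAMPLE_PED)))
```

To check the real file, open ssc_denovo.ped and pass the file object to check_pedigree; an empty list means no problems were found.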
Preprocess the variants data
The SupplementaryData2_SSC.tsv file contains 255231 variants. Importing that
many variants into in-memory genotype storage is not appropriate. For this
example we are going to use a subset of 10000 variants:
head -n 10001 SupplementaryData2_SSC.tsv > ssc_denovo.tsv
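The head command takes 10001 lines because the first line of the file is assumed to be the header, followed by 10000 variants. For readers who prefer to stay in Python, a rough equivalent could look like the sketch below; take_header_and_rows is a hypothetical helper name, not a GPF tool.

```python
from itertools import islice


def take_header_and_rows(src_path, dst_path, n_rows):
    """Copy the header line plus the first `n_rows` data lines."""
    with open(src_path, "rt", encoding="utf8") as src, \
            open(dst_path, "wt", encoding="utf8") as dst:
        # +1 keeps the header line in front of the data rows
        dst.writelines(islice(src, n_rows + 1))
```

Calling take_header_and_rows("SupplementaryData2_SSC.tsv", "ssc_denovo.tsv", 10000) should produce the same file as the head command above.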
Data import of ssc_denovo
Now we have a pedigree file ssc_denovo.ped and a list of de novo
variants ssc_denovo.tsv. Let us prepare an import project configuration
file ssc_denovo.yaml:
id: ssc_denovo
input:
  pedigree:
    file: ssc_denovo.ped
  denovo:
    files:
      - ssc_denovo.tsv
    person_id: personIds
    variant: variant
    location: location
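The person_id, variant and location entries in the import project refer to column names in the variants file, so a quick header check before running the import can save a failed run. The following is a minimal sketch, not a GPF utility; missing_columns is a hypothetical helper and the sample header is made up for illustration.

```python
import csv
import io

# Column names taken from the import project above.
REQUIRED_COLUMNS = ("personIds", "variant", "location")


def missing_columns(handle, required=REQUIRED_COLUMNS):
    """Return the required column names absent from a TSV header."""
    header = next(csv.reader(handle, delimiter="\t"), [])
    return [name for name in required if name not in header]


# Made-up header for illustration; the real file may have more columns.
sample = io.StringIO("personIds\tlocation\tvariant\n")
print(missing_columns(sample))
```

For the real data, open ssc_denovo.tsv and pass the file object to missing_columns; any names it returns would need to be fixed either in the file or in the import project.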
To import the study we should run:
import_tools ssc_denovo.yaml
and when the import finishes we can run the development GPF server:
wgpf run
In the list of studies, we should have a new study ssc_denovo.
Getting started with Dataset Statistics
To generate family and de novo variant reports, you can use
the generate_common_report.py tool. It supports the option --show-studies
to list all studies and datasets configured in the GPF instance:
generate_common_report.py --show-studies
To generate the reports for a given study or dataset, you can use the
--studies option.
By default the dataset statistics are disabled. If we try to run
generate_common_report.py --studies helloworld
it will not generate the dataset statistics. Instead, it will print
a message that the reports are disabled for study helloworld:
WARNING:generate_common_reports:skipping study helloworld
To enable the dataset statistics for the helloworld dataset, we need to
modify its configuration and add a new section that enables dataset statistics:
id: helloworld
name: Hello World Dataset
studies:
- denovo_helloworld
- vcf_helloworld
common_report:
  enabled: True
Let us now re-run the generate_common_report.py command:
generate_common_report.py --studies helloworld
If we now start the GPF development server:
wgpf run
and browse the helloworld dataset, we will see the Dataset Statistics
section available.
Getting started with de novo gene sets
To generate de novo gene sets, you can use the
generate_denovo_gene_sets.py tool. Similar to the generate_common_report.py
tool above, you can use the --show-studies and --studies options.
By default the de novo gene sets are disabled. If you want to enable them for a specific study or dataset, you need to update the configuration and add a section that enables the de novo gene sets:
denovo_gene_sets:
  enabled: true
For example, the configuration of the helloworld dataset should become similar to:
id: helloworld
name: Hello World Dataset
studies:
- denovo_helloworld
- vcf_helloworld
common_report:
  enabled: True
denovo_gene_sets:
  enabled: true
Then we can generate the de novo gene sets for the helloworld dataset by
running:
generate_denovo_gene_sets.py --studies helloworld
Getting Started with Annotation
The import of genotype data into a GPF instance always runs effect annotation. It is easy to extend the annotation of genotype data during the import.
To define the annotation used during the import into a GPF instance we have to add a configuration file that defines the pipeline of annotators. After that, we need to configure the GPF instance to use this annotation pipeline.
There is a public Genomic Resources Repository (GRR) with a collection of public genomic resources available for use with the GPF system.
Example: Annotation with GnomAD 3.0
To annotate the genotype variants with GnomAD allele frequencies, we should
find the GnomAD genomic resource in our public GRR. We will use the
hg38/variant_frequencies/gnomAD_v3/genomes resource. If we navigate
to the resource page, we will see that this is an allele_score resource.
So to use it in the annotation we should use the allele_score annotator.
The minimal configuration of annotation with this GnomAD resource is the following:
- allele_score: hg38/variant_frequencies/gnomAD_v3/genomes
Store this annotation configuration in a file named annotation.yaml and
configure the GPF instance to use this annotation configuration:
reference_genome:
  resource_id: "hg38/genomes/GRCh38-hg38"
gene_models:
  resource_id: "hg38/gene_models/refSeq_v20200330"
annotation:
  conf_file: annotation.yaml
Now we can re-run the import for our helloworld examples:
Go to the denovo-helloworld project directory and re-run the import:
import_tools -f denovo_helloworld.yaml
Go to the vcf-helloworld project directory and re-run the import:
import_tools -f vcf_helloworld.yaml
Once the re-import finishes, the variants in our Hello World Dataset have
additional attributes that come from the annotation with GnomAD v3.0. By
default, the annotation adds the following three attributes:
genome_gnomad_v3_af_percent - allele frequencies as a percent;
genome_gnomad_v3_ac - allele count;
genome_gnomad_v3_an - number of sequenced alleles.
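As a rough guide to how these three attributes relate, the usual definition of allele frequency is the allele count divided by the number of sequenced alleles, and the percent form multiplies that by 100. The sketch below assumes this standard definition; whether the resource computes its stored value in exactly this way is an assumption, and the counts are made up.

```python
def allele_frequency_percent(allele_count: int, allele_number: int) -> float:
    """Allele frequency in percent, i.e. 100 * AC / AN."""
    if allele_number <= 0:
        raise ValueError("allele_number must be positive")
    return 100.0 * allele_count / allele_number


# Made-up counts: 3 alternate alleles among 1500 sequenced alleles.
print(f"{allele_frequency_percent(3, 1500):.3f}")  # prints 0.200
```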
If we run the GPF development server and browse our Hello World Dataset,
there is almost no difference. The only difference is that now in the
genotype browser the genomic scores section is not empty and we can query
our variants using the genome_gnomad_v3_af_percent genomic score.

To make the new annotation attributes available in the variants preview table and in the variants download we need to change the study configuration. Check the Getting Started with Preview and Download Columns section for additional information.
Getting Started with Preview and Download Columns
When importing data into a GPF instance we can run an annotation pipeline that adds additional attributes to each variant. To make these attributes available in the variants preview table and in the variants download file we need to change the configuration of the corresponding study or dataset.
For each study or dataset, you can specify which columns are shown in the variants table preview, as well as those which will be downloaded.
Example: Redefine the Frequency column in the preview table of Hello World Dataset
As an example, we are going to redefine the Frequency column for the helloworld
dataset to include attributes from the annotation with the GnomAD v3 genomic score.
Navigate to the helloworld dataset folder:
cd datasets/helloworld
and edit the helloworld.yaml file. Add the following section to the end:
genotype_browser:
  columns:
    genotype:
      genome_gnomad_v3_af_percent:
        name: gnomAD v3 AF
        source: genome_gnomad_v3_af_percent
        format: "%%.3f"
      genome_gnomad_v3_ac:
        name: gnomAD v3 AC
        source: genome_gnomad_v3_ac
        format: "%%d"
      genome_gnomad_v3_an:
        name: gnomAD v3 AN
        source: genome_gnomad_v3_an
        format: "%%d"
  column_groups:
    freq:
      name: "Frequency"
      columns:
        - genome_gnomad_v3_af_percent
        - genome_gnomad_v3_ac
        - genome_gnomad_v3_an
This overwrites the definition of the default preview column Frequency to include the gnomAD v3 frequencies. If we now browse the Hello World Dataset and run variants preview in the genotype browser, we will see the GnomAD attributes.

Example: Add GnomAD v3 columns to the variants download
As an example let us add GnomAD v3 columns to the variants downloads.
By default, each genotype study or dataset has a list of predefined columns used
when downloading variants. Users can replace the default list of download
columns by defining the download_columns list, or they can extend the predefined
list of download columns by defining the download_columns_ext list of columns.
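The difference between replacing and extending can be sketched as a small function. This is an illustration of the semantics described above, not GPF's implementation: effective_download_columns is a hypothetical name, the default column names are placeholders, and the behavior when both lists are defined at once is an assumption.

```python
def effective_download_columns(defaults, download_columns=None,
                               download_columns_ext=None):
    """Resolve the download column list as described in the text."""
    if download_columns is not None:
        base = list(download_columns)  # replaces the predefined list
    else:
        base = list(defaults)          # keeps the predefined list
    # the extension list is appended, skipping columns already present
    for column in download_columns_ext or []:
        if column not in base:
            base.append(column)
    return base


# Placeholder defaults; the real predefined list is longer.
defaults = ["family", "variant"]
print(effective_download_columns(
    defaults, download_columns_ext=["genome_gnomad_v3_ac"]))
```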
In the example below we are going to use download_columns_ext to add
GnomAD v3 columns to the properties of downloaded variants:
genotype_browser:
  columns:
    genotype:
      genome_gnomad_v3_af_percent:
        name: gnomAD v3 AF
        source: genome_gnomad_v3_af_percent
        format: "%%.3f"
      genome_gnomad_v3_ac:
        name: gnomAD v3 AC
        source: genome_gnomad_v3_ac
        format: "%%d"
      genome_gnomad_v3_an:
        name: gnomAD v3 AN
        source: genome_gnomad_v3_an
        format: "%%d"
  column_groups:
    freq:
      name: "Frequency"
      columns:
        - genome_gnomad_v3_af_percent
        - genome_gnomad_v3_ac
        - genome_gnomad_v3_an
  download_columns_ext:
    - genome_gnomad_v3_af_percent
    - genome_gnomad_v3_ac
    - genome_gnomad_v3_an
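The format values in the column definitions above ("%%.3f" and "%%d") look like printf-style format strings with the percent sign doubled. A plausible reading, stated here as an assumption rather than documented behavior, is that the doubled percent is an escape that collapses to a single one before the value is formatted. The sketch below shows that interpretation with plain Python string formatting; render is a hypothetical helper and the values are made up.

```python
def render(config_format, value):
    """Format `value` with a printf-style string whose '%' is doubled."""
    # "%%" collapses to "%", e.g. "%%.3f" becomes "%.3f" (assumed escaping)
    printf_format = config_format % ()
    return printf_format % value


print(render("%%.3f", 0.012345))  # prints 0.012
print(render("%%d", 152312))      # prints 152312
```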
Getting Started with Gene Browser
The Gene Browser in the GPF system uses the allele frequency as a Y-coordinate when displaying the allele. By default, the allele frequency used is the frequency of the alleles in the imported data.

After annotation of the helloworld data with GnomAD v3, we can use the GnomAD
allele frequency in the Gene Browser.
Example: configure the gene browser to use gnomAD frequency as the variant frequency
To configure the Hello World Dataset to use the GnomAD v3 allele frequency,
we need to add a new gene_browser section to the dataset configuration file
datasets/helloworld/helloworld.yaml as follows:
id: helloworld
name: Hello World Dataset
...
gene_browser:
  frequency_column: genome_gnomad_v3_af_percent
If we restart the GPF development server and navigate to the Hello World Dataset
Gene Browser, the Y-axis will use the GnomAD allele frequency instead of the
study allele frequency.

Getting Started with Enrichment Tool
For studies that include de novo variants, you can enable the enrichment tool UI. As an example, let us enable it for the already imported iossifov_2014 study.
Go to the directory where the configuration file of the iossifov_2014 study is located:
cd gpf_test/studies/iossifov_2014
Edit the study configuration file iossifov_2014.conf and add the following section at the end of the file:
[enrichment]
enabled = true
Restart the GPF web server:
wdaemanage.py runserver 0.0.0.0:8000
Now when you navigate to the iossifov_2014 study in the browser, the Enrichment Tool tab will be available.
Getting Started with Phenotype Data
Setting up the GPF instance phenotype database
The GPF instance has four configuration settings that determine how phenotype data is read and stored:
The most important is the phenotype data directory, which is where the phenotype data configurations are. If not specified, GPF will look for the environment variable DAE_PHENODB_DIR, and if that is not set, will default to the directory pheno inside the GPF instance directory.
Phenotype storages can be configured to tell the GPF instance where to look for phenotype data DB files. If no phenotype storages are defined, a default phenotype storage is used, which uses the phenotype data directory.
The cache option can be configured to tell the GPF instance and GPF tools where to store generated phenotype browser data. Data will be stored inside the <cache_dir>/pheno directory.
The phenotype images option can be configured to tell the GPF instance and GPF tools where to store generated phenotype browser images.
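The fallback order for the phenotype data directory can be sketched as follows. This only mirrors the order described above (explicit setting, then the DAE_PHENODB_DIR environment variable, then the pheno directory inside the instance); resolve_pheno_dir and its parameter names are hypothetical and do not correspond to GPF's own code or configuration keys.

```python
import os
from pathlib import Path


def resolve_pheno_dir(instance_dir, configured=None, environ=os.environ):
    """Pick the phenotype data directory using the documented fallbacks."""
    if configured:
        return Path(configured)            # 1. explicit configuration
    from_env = environ.get("DAE_PHENODB_DIR")
    if from_env:
        return Path(from_env)              # 2. environment variable
    return Path(instance_dir) / "pheno"    # 3. default inside the instance


# With nothing configured and an empty environment, the default is used.
print(resolve_pheno_dir("minimal_instance", environ={}))
```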
You can examine the provided gpf_instance.yaml to see how these settings are configured in it.
Importing phenotype data
To import phenotype data, the import_phenotypes tool is used.
The tool requires an import project, a YAML file describing the contents of the phenotype data to be imported, along with configuration options on how to import them.
As an example, we are going to show how to import simulated phenotype data into our GPF instance.
Inside the raw_phenotype_data directory, the following data is provided:
instruments contains the phenotype instruments and measures to be imported.
pedigree.ped is the corresponding pedigree file.
measure_descriptions.tsv contains descriptions for the provided measures.
import_project.yaml is the import project configuration that we will use to import this data.
To import the phenotype data, we will use the import_phenotypes tool. It will import
the phenotype database directly into our GPF instance's phenotype storage:
import_phenotypes raw_phenotype_data/import_project.yaml
When the import finishes you can run the GPF development server using:
wgpf run
This will generate a phenotype browser database automatically, and the phenotype study should be directly accessible.
Phenotype browser databases are necessary to view the data through the web application. They are further described in the phenotype data documentation.
Configuring a phenotype database
Phenotype databases have a short configuration file which points
the system to their files, as well as specifying additional properties.
When importing a phenotype database through the import_phenotypes tool,
a configuration file is automatically generated. You may inspect the
minimal_instance/pheno/mini_pheno/mini_pheno.yaml configuration file
generated by the import tool:
browser_images_url: static/images/
id: mini_pheno
name: mini_pheno
phenotype_storage:
  db: mini_pheno/mini_pheno.db
  id: storage1
regressions:
  reg_1:
    display_name: Regression one
    instrument_name: instrument_1
    jitter: 0.1
    measure_names:
    - measure_1
type: study
Configure Genotype Study With Phenotype Data
To demonstrate how a study is configured with a phenotype database, we will
be working with the already imported helloworld dataset.
The phenotype databases can be attached to one or more studies and/or datasets.
If you want to attach the mini_pheno phenotype study to the helloworld dataset,
you need to specify it in the dataset's configuration file, which can be found at
minimal_instance/datasets/helloworld/helloworld.yaml.
Add the following line at the beginning of the file, outside of any section:
phenotype_data: mini_pheno
To enable the Phenotype Browser, add this line:
phenotype_browser: true
After this, the beginning of the configuration file should look like this:
id: helloworld
name: Hello World Dataset
phenotype_data: mini_pheno
phenotype_browser: true
studies:
- denovo_helloworld
- vcf_helloworld
When you restart the server, you should be able to see the ‘Phenotype Browser’ tab in the helloworld dataset.
Configure Family Filters in Genotype Browser
A study or a dataset can have phenotype filters configured for its Genotype Browser when it has a phenotype database attached to it. The configuration looks like this:
genotype_browser:
  enabled: true
  family_filters:
    sample_continuous_filter:
      name: Sample Filter Name
      source_type: continuous
      from: phenodb
      filter_type: multi
      role: prb
After adding the family filters configuration, restart the web server and navigate to the Genotype Browser. You should be able to see the Advanced option under the Family Filters - this is where the family filters can be applied.
Configure Phenotype Columns in Genotype Browser
Phenotype columns contain values from a phenotype database. These values are selected from the individual who has the variant displayed in the Genotype Browser's table preview. They can be added when a phenotype database is attached to a study.
Let's add a phenotype column. To do this, you need to define it in the study's config, in the genotype browser section:
genotype_browser:
  # ...
  columns:
    phenotype:
      sample_pheno_measure:
        role: prb
        source: instrument_1.measure_1
        name: Sample Pheno Measure Column
For the phenotype columns to appear in the Genotype Browser table preview or download file,
they have to be present in the preview_columns or the download_columns list in the
Genotype Browser configuration. Add this in the genotype_browser section:
preview_columns:
- family
- variant
- genotype
- effect
- gene_scores
- sample_pheno_measure
Enabling the Phenotype Tool
To enable the Phenotype Tool for a study, you must edit the study's configuration file and set the appropriate property, as with the Phenotype Browser. Open the helloworld dataset configuration file and add the following line:
phenotype_tool: true
After editing, it should look like this:
id: helloworld
name: Hello World Dataset
phenotype_data: mini_pheno
phenotype_browser: true
phenotype_tool: true
studies:
- denovo_helloworld
- vcf_helloworld
Restart the GPF web server and select the helloworld dataset.
You should see the Phenotype Tool tab. Once you have selected it, you
can select a phenotype measure of your choice. To get the tool to acknowledge
the variants in the helloworld dataset, select the “All” option of the
“Present in Parent” field.
Click on the “Report” button to produce the results.
Example Usage of GPF Python Interface
The simplest way to start using GPF's Python API is to import the GPFInstance
class and instantiate it:
from dae.gpf_instance.gpf_instance import GPFInstance
gpf_instance = GPFInstance.build()
This gpf_instance object groups together a number of objects, each dedicated
to managing different parts of the underlying data. It can be used to interact
with the system as a whole.
For example, to list all studies configured in the startup GPF instance, use:
gpf_instance.get_genotype_data_ids()
This will return a list with the ids of all configured studies:
['denovo_helloworld',
'vcf_helloworld',
'helloworld']
To get a specific study and query it, you can use:
st = gpf_instance.get_genotype_data('helloworld')
vs = list(st.query_variants())
Note
The query_variants method returns a Python iterator.
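Because an iterator can be consumed only once, the examples in this section wrap the result in list() when the variants are needed more than once. The sketch below shows the general Python behavior with a stand-in generator; fake_query is hypothetical and does not call GPF.

```python
def fake_query():
    """Stand-in for a query that yields its results lazily."""
    yield from ("variant-1", "variant-2", "variant-3")


results = fake_query()
first_pass = list(results)    # consumes the iterator
second_pass = list(results)   # nothing is left to read

print(len(first_pass), len(second_pass))  # prints 3 0
```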
To get the basic information about variants found by the query_variants method,
you can use:
for v in vs:
    for aa in v.alt_alleles:
        print(aa)
chr1:1287138 C->A f1
chr1:3602485 AC->A f1
chr1:12115730 G->A f1
chr1:20639952 C->T f2
chr1:21257524 C->T f2
chr14:21385738 C->T f1
chr14:21385738 C->T f2
chr14:21385954 A->C f2
chr14:21393173 T->C f1
chr14:21393702 C->T f2
chr14:21393860 G->A f1
chr14:21403023 G->A f1
chr14:21403023 G->A f2
chr14:21405222 T->C f2
chr14:21409888 T->C f1
chr14:21409888 T->C f2
chr14:21429019 C->T f1
chr14:21429019 C->T f2
chr14:21431306 G->A f1
chr14:21431623 A->C f2
chr14:21393540 GGAA->G f1
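Each printed line above appears to consist of a location, a reference-to-alternative change, and a family ID. As an illustration only, the sketch below parses that printed text and counts alleles per family for the first five lines of the output; the line layout is inferred from the output itself, parse_allele_line is a hypothetical helper, and real code would normally read these values from the allele objects rather than from their printed form.

```python
from collections import Counter

# First five lines of the output shown above.
lines = [
    "chr1:1287138 C->A f1",
    "chr1:3602485 AC->A f1",
    "chr1:12115730 G->A f1",
    "chr1:20639952 C->T f2",
    "chr1:21257524 C->T f2",
]


def parse_allele_line(line):
    """Split one printed allele summary into its parts."""
    location, change, family_id = line.split()
    chrom, position = location.split(":")
    reference, alternative = change.split("->")
    return {
        "chrom": chrom,
        "position": int(position),
        "reference": reference,
        "alternative": alternative,
        "family_id": family_id,
    }


per_family = Counter(parse_allele_line(line)["family_id"] for line in lines)
print(per_family)  # prints Counter({'f1': 3, 'f2': 2})
```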
The query_variants interface allows you to specify what kind of variants
you are interested in. For example, if you only need “splice-site” variants, you
can use:
st = gpf_instance.get_genotype_data('helloworld')
vs = st.query_variants(effect_types=['splice-site'])
vs = list(vs)
print(len(vs))
>> 2
Or, if you are interested in “splice-site” variants only in people with “prb” role, you can use:
vs = st.query_variants(effect_types=['splice-site'], roles='prb')
vs = list(vs)
len(vs)
>> 1