GPF Getting Started Guide
Setup
Prerequisites
This guide assumes that you are working on a recent Linux (or macOS) machine.
The GPF system is distributed as a Conda package.
If you do not have a working version of Anaconda, Miniconda, or Mamba, you must install one. We recommend using the Miniforge distribution.
Go to the Miniforge home page and follow the instructions for your platform.
Warning
The GPF system is not supported on Windows.
GPF Installation
The GPF system is developed in Python and supports Python 3.11 and up. The recommended way to install GPF is to set up a dedicated Conda environment.
Create an empty Conda environment named gpf:
mamba create -n gpf
To use this environment, you need to activate it using the following command:
mamba activate gpf
Install the gpf_wdae conda package into the already activated gpf environment:
mamba install \
-c conda-forge \
-c bioconda \
-c iossifovlab \
-c defaults \
gpf_wdae
This command is going to install GPF and all of its dependencies.
Getting the demonstration data
Clone the gpf-getting-started repository:
git clone https://github.com/iossifovlab/gpf-getting-started.git
Navigate to the newly-created directory:
cd gpf-getting-started
This repository provides a minimal instance and sample data to be imported.
Starting and stopping the GPF web interface
By default, the GPF system looks for a file named gpf_instance.yaml in the current directory (and its parent directories). If GPF finds such a file, it uses it as the configuration of the GPF instance. Otherwise, GPF looks for the DAE_DB_DIR environment variable. If that variable is not set, GPF raises an exception.
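The lookup order described above can be sketched in a few lines of Python. This is only an illustration of the documented behaviour, not GPF's actual implementation: the function name find_instance_config is hypothetical, and the sketch assumes that DAE_DB_DIR points at a directory containing gpf_instance.yaml.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_instance_config(
    start: Path, env: Optional[Mapping[str, str]] = None
) -> Path:
    """Return the path of the gpf_instance.yaml that would be used.

    Hypothetical helper that mirrors the lookup order described in the guide.
    """
    env = os.environ if env is None else env
    start = start.resolve()

    # 1. Look in the start directory, then in each of its parents.
    for directory in [start, *start.parents]:
        candidate = directory / "gpf_instance.yaml"
        if candidate.is_file():
            return candidate

    # 2. Fall back to the DAE_DB_DIR environment variable
    #    (assumed to name the instance directory).
    dae_db_dir = env.get("DAE_DB_DIR")
    if dae_db_dir:
        return Path(dae_db_dir) / "gpf_instance.yaml"

    # 3. Nothing was found: report an error.
    raise FileNotFoundError(
        "no gpf_instance.yaml found and DAE_DB_DIR is not set"
    )
```

Calling find_instance_config(Path.cwd()) from inside the gpf-getting-started directory, after the export shown below, would under these assumptions point at minimal_instance/gpf_instance.yaml.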
For this manual we recommend setting the DAE_DB_DIR
environment variable.
From within the gpf-getting-started
directory run the following command:
export DAE_DB_DIR=$(pwd)/minimal_instance
For this guide we use a gpf_instance.yaml
file that is already provided
in the minimal_instance
subdirectory:
# The id of the instance.
instance_id: minimal_instance
# The reference genome to use for this instance.
reference_genome:
  resource_id: "hg38/genomes/GRCh38-hg38"
# The gene models to use for this instance.
gene_models:
  resource_id: "hg38/gene_models/MANE/1.3"
# The annotation pipeline configuration to use. Uncomment to enable.
# annotation:
#   config:
#     - allele_score: hg38/variant_frequencies/gnomAD_4.1.0/genomes/ALL
#     - allele_score: hg38/scores/ClinVar_20240730
A GPF instance configuration requires a reference genome and gene models to annotate variants with effects on genes. For this guide we use the HG38 reference genome and the MANE 1.3 gene models.
If not specified otherwise, GPF uses the GPF Genomic Resources Repository (GRR) located at https://grr.iossifovlab.com/ to find the resources it needs.
The reference genome used by this GPF instance is hg38/genomes/GRCh38-hg38
and the gene models are hg38/gene_models/MANE/1.3
from the default GRR.
Note
For more on GPF instance configuration see GPF Instance Configuration.
Now we can run the GPF development web server and browse our empty GPF instance:
wgpf run
and browse the GPF development server at http://localhost:8000
.
The web interface will be mostly empty, as at this point there is no data imported into the instance.
To stop the development GPF web server, you should press Ctrl-C
- the usual
keybinding for stopping long-running Linux commands in a terminal.
Warning
The development web server run by wgpf run
used in this guide
is meant for development purposes only
and is not suitable for serving the GPF system in production.
Importing genotype data
Import Tools and Import Project
Importing genotype data into a GPF instance involves multiple steps.
The tool used to import genotype data is named import_genotypes
. This tool
expects an import project file that describes the import.
We support importing variants from multiple formats.
For this demonstration, we will be importing from the following formats:
List of de novo variants
Variant Call Format (VCF)
Example import of de novo variants
Let us import a small list of de novo variants.
A pedigree file that describes the families is needed -
input_genotype_data/example.ped
:
familyId personId dadId momId sex status
f1 f1.dad 0 0 M unaffected
f1 f1.mom 0 0 F unaffected
f1 f1.p1 f1.dad f1.mom M affected
f1 f1.s1 f1.dad f1.mom F unaffected
f2 f2.mom 0 0 F unaffected
f2 f2.dad 0 0 M unaffected
f2 f2.p1 f2.dad f2.mom F affected
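Before importing, it can be useful to sanity-check a pedigree file. The following sketch (hypothetical helpers, not part of GPF) parses the pedigree shown above and reports every dadId or momId that does not refer to a person listed in the file. It assumes whitespace-separated columns and that 0 means "no parent recorded".

```python
PEDIGREE_TEXT = """\
familyId personId dadId momId sex status
f1 f1.dad 0 0 M unaffected
f1 f1.mom 0 0 F unaffected
f1 f1.p1 f1.dad f1.mom M affected
f1 f1.s1 f1.dad f1.mom F unaffected
f2 f2.mom 0 0 F unaffected
f2 f2.dad 0 0 M unaffected
f2 f2.p1 f2.dad f2.mom F affected
"""


def load_pedigree(text):
    """Parse whitespace-separated pedigree text into a list of dicts."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, row)) for row in rows]


def missing_parents(persons):
    """Return (personId, parentId) pairs whose parent is not in the file."""
    known = {person["personId"] for person in persons}
    missing = []
    for person in persons:
        for column in ("dadId", "momId"):
            parent = person[column]
            if parent != "0" and parent not in known:
                missing.append((person["personId"], parent))
    return missing


persons = load_pedigree(PEDIGREE_TEXT)
print(len(persons), "persons;", "missing parents:", missing_parents(persons))
```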
We will also need the list of de novo variants
input_genotype_data/example.tsv
:
chrom pos ref alt person_id
chr14 21403214 T C f1.p1
chr14 21431459 G C f1.p1
chr14 21391016 A AT f2.p1
chr14 21403019 G A f2.p1
chr14 21393484 TCTTC T f2.p1
chr14 21409849 A G f1.p1
A project configuration file for importing this study -
input_genotype_data/denovo_example.yaml
- is also provided:
id: denovo_example
input:
  pedigree:
    file: example.ped
  denovo:
    files:
      - example.tsv
To import this project run the following command:
import_genotypes input_genotype_data/denovo_example.yaml
Note
For more information on the import project configuration file, see Import Tools.
For more information on working with pedigree files, see Working With Pedigree Files Guide.
When the import finishes you can run the GPF development server using:
wgpf run
and browse the content of the GPF development server at
http://localhost:8000
The home page of the GPF system will show the imported study
denovo_example
.

If you follow the link to the study and choose the Dataset Statistics page, you will see some summary information for the imported study: families and individuals included in the study, types of families, and rates of de novo variants.

Statistics of individuals by affected status and sex

Statistics of families by family type

Rate of de novo variants
If you select the Genotype Browser page, you will be able to see the imported de novo variants. The default filters search for LGD de novo variants. It so happens that all de novo variants imported in the denovo_example study are LGD variants.
So, when you click the Preview button, all the imported variants will be shown.

Genotype Browser with de novo variants
Example import of VCF variants
Similar to the sample de novo variants, there are also sample variants in
VCF format. They can be found in input_genotype_data/example.vcf
and
the same pedigree file from before is used.
##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr14>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT f1.mom f1.dad f1.p1 f1.s1 f2.mom f2.dad f2.p1
chr14 21385738 . C T . . . GT 0/0 0/1 0/1 0/0 0/0 0/1 0/0
chr14 21385954 . A C . . . GT 0/0 0/0 0/0 0/0 0/1 0/0 0/1
chr14 21393173 . T C . . . GT 0/1 0/0 0/0 0/1 0/0 0/0 0/0
chr14 21393702 . C T . . . GT 0/0 0/0 0/0 0/0 0/0 0/1 0/1
chr14 21393860 . G A . . . GT 0/0 0/1 0/1 0/1 0/0 0/0 0/0
chr14 21403023 . G A . . . GT 0/0 0/1 0/0 0/1 0/1 0/0 0/0
chr14 21405222 . T C . . . GT 0/0 0/0 0/0 0/0 0/0 0/1 0/0
chr14 21409888 . T C . . . GT 0/1 0/0 0/1 0/0 0/1 0/0 1/0
chr14 21429019 . C T . . . GT 0/0 0/1 0/1 0/0 0/0 0/1 0/1
chr14 21431306 . G A . . . GT 0/0 0/1 0/1 0/1 0/0 0/0 0/0
chr14 21431623 . A C . . . GT 0/0 0/0 0/0 0/0 0/1 1/1 1/1
chr14 21393540 . GGAA G . . . GT 0/1 0/1 1/1 0/0 0/0 0/0 0/0
chr14 21431499 . T C . . . GT 0/0 0/0 0/0 0/0 0/0 0/1 0/1
chr14 21402010 . G A . . . GT 0/0 0/1 0/1 0/0 0/0 0/0 0/0
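To read the genotype columns of this file, it helps to see which individuals carry the alternative allele in a given record. The sketch below is a plain-Python illustration (the carriers function is hypothetical and is not how GPF parses VCF). It assumes that sample columns start at the tenth field and that GT is the first FORMAT entry, as in the example file.

```python
HEADER = (
    "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT "
    "f1.mom f1.dad f1.p1 f1.s1 f2.mom f2.dad f2.p1"
)
RECORD = "chr14 21385738 . C T . . . GT 0/0 0/1 0/1 0/0 0/0 0/1 0/0"


def carriers(header_line, record_line):
    """Return the sample IDs that have at least one non-reference allele."""
    samples = header_line.lstrip("#").split()[9:]
    genotypes = record_line.split()[9:]
    result = []
    for sample, genotype in zip(samples, genotypes):
        # Keep only the GT subfield and accept phased or unphased separators.
        alleles = genotype.split(":")[0].replace("|", "/").split("/")
        if any(allele not in ("0", ".") for allele in alleles):
            result.append(sample)
    return result


print(carriers(HEADER, RECORD))
```

For the first record the function reports f1.dad, f1.p1 and f2.dad, matching the 0/1 genotypes in that line.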
A project configuration file is provided -
input_genotype_data/vcf_example.yaml
.
id: vcf_example
input:
  pedigree:
    file: example.ped
  vcf:
    files:
      - example.vcf
To import them, run the following command:
import_genotypes input_genotype_data/vcf_example.yaml
When the import finishes you can run the GPF development server using:
wgpf run
and browse the content of the GPF development server at
http://localhost:8000
The GPF instance Home Page now includes the imported study vcf_example
.

If you follow the link to vcf_example, you will get to the Gene Browser page for the study. It so happens that all imported VCF variants are located in the CHD8 gene. Enter CHD8 in the Gene Symbol box and click the Go button.

Gene Browser for CHD8 gene shows variants from vcf_example
study
Example of a dataset (group of genotype studies)
The already imported studies denovo_example
and vcf_example
have genomic variants for the same group of individuals example.ped
.
We can create a dataset (group of genotype studies) that includes both studies.
To this end create a directory datasets/example_dataset
inside the GPF
instance directory minimal_instance
:
mkdir -p minimal_instance/datasets/example_dataset
and place the following configuration file example_dataset.yaml
inside
that directory:
id: example_dataset
name: Example Dataset
studies:
- denovo_example
- vcf_example
When the configuration is ready, restart the wgpf
command. The home page
of the GPF instance will change and now will include the configured dataset
example_dataset
.

Home page of the GPF instance showing the example_dataset
Follow the link to the Example Dataset, choose the Gene Browser page, and enter CHD8 in the Gene Symbol box. Click the Go button and you will see the variants from both studies.

Gene Browser for CHD8 gene shows variants from both studies -
denovo_example
and vcf_example
Getting Started with Annotation
The import of genotype data into a GPF instance always runs effect annotation. It is easy to extend the annotation of genotype data during the import.
To define the annotation used during the import into a GPF instance we have to add a configuration that defines the pipeline of annotators and resources to be used during the import.
In the public GPF Genomic Resources Repository (GRR) there is a collection of public genomic resources available for use with the GPF system.
Let us say that we want to annotate the genotype variants with GnomAD and ClinVar. We need to find the appropriate resources in the public GRR:
hg38/variant_frequencies/gnomAD_4.1.0/genomes/ALL - this is an allele_score resource, and the annotator by default produces one additional attribute, gnomad_v4_genome_ALL_af, that is the allele frequency for the variant (check the hg38/variant_frequencies/gnomAD_4.1.0/genomes/ALL page for more information about the resource);
hg38/scores/ClinVar_20240730 - this is an allele_score resource, and the annotator by default produces two additional attributes: CLNSIG, that is the aggregate germline classification for the variant, and CLNDN, that is the preferred disease name (check the hg38/scores/ClinVar_20240730 page for more information about the resource).
In order to use these resources in the GPF instance annotation, we need to
edit the GPF instance configuration (minimal_instance/gpf_instance.yaml
)
and add the following snippet to the configuration file:
# The annotation pipeline configuration to use. Uncomment to enable.
annotation:
  config:
    - allele_score: hg38/variant_frequencies/gnomAD_4.1.0/genomes/ALL
    - allele_score: hg38/scores/ClinVar_20240730
Re-running GPF will automatically re-annotate any genotype data that is not up to date:
wgpf run
The variants in our Example Dataset
will now have additional attributes
that come from the annotation with GnomAD and ClinVar:
gnomad_v4_genome_ALL_af
;CLNSIG
;CLNDN
.
If we browse our Example Dataset
there is almost no difference.
The only difference is that now in the
genotype browser, the genomic scores section is not empty and we can query
our variants using the gnomad_v4_genome_ALL_af
, CLNSIG
and CLNDN
genomic scores.

Genotype browser using GnomAD and ClinVar genomic scores
Note
The attributes produced by the annotation can be used in the Genotype Browser preview table as described in Getting Started with Preview and Download Columns.
Getting Started with Phenotype Data
Importing phenotype data
To import phenotype data, the import_phenotypes
tool is used.
The tool requires an import project, a YAML file describing the contents of the phenotype data to be imported, along with configuration options on how to import them.
As an example, we are going to show how to import simulated phenotype data into our GPF instance.
Inside the input_phenotype_data
directory, the following data is provided:
pedigree.ped is the phenotype data pedigree file.

input_phenotype_data/pedigree.ped:
familyId personId dadId momId sex status role
f1 f1.dad 0 0 1 1 dad
f1 f1.mom 0 0 2 1 mom
f1 f1.p1 f1.dad f1.mom 1 2 prb
f1 f1.s1 f1.dad f1.mom 2 1 sib
f2 f2.mom 0 0 2 1 mom
f2 f2.dad 0 0 1 1 dad
f2 f2.p1 f2.dad f2.mom 2 2 prb
f2 f2.s1 f2.dad f2.mom 1 1 sib
f3 f3.dad 0 0 1 1 dad
f3 f3.mom 0 0 2 1 mom
f3 f3.p1 f3.dad f3.mom 1 2 prb
f3 f3.s1 f3.dad f3.mom 2 1 sib
f4 f4.dad 0 0 1 1 dad
f4 f4.mom 0 0 2 1 mom
f4 f4.p1 f4.dad f4.mom 2 2 prb
f4 f4.s1 f4.dad f4.mom 1 1 sib
f5 f5.dad 0 0 1 1 dad
f5 f5.mom 0 0 2 1 mom
f5 f5.p1 f5.dad f5.mom 1 2 prb
f5 f5.s1 f5.dad f5.mom 2 1 sib
f6 f6.dad 0 0 1 1 dad
f6 f6.mom 0 0 2 1 mom
f6 f6.p1 f6.dad f6.mom 2 2 prb
f6 f6.s1 f6.dad f6.mom 1 1 sib

instruments contains the phenotype instruments and measures to be imported. There are two instruments in the example:

input_phenotype_data/instruments/basic_medical.csv:
personId,age,weight,height,race
f1.dad,50,200,180,white
f1.mom,23,160,170,white
f1.s1,2,20,79,white
f1.p1,4,40,80,white
f2.dad,32,230,170,white
f2.mom,30,153,165,white
f2.s1,12,80,130,white
f2.p1,3,30,70,white
f3.dad,45,175,165,white
f3.mom,41,173,154,white
f3.s1,23,170,180,white
f3.p1,7,80,120,white
f4.dad,25,190,185,white
f4.mom,35,200,150,white
f4.s1,17,160,165,white
f4.p1,5,50,100,white
f5.dad,31,250,176,asian
f5.mom,39,180,154,asian
f5.s1,11,130,150,asian
f5.p1,5,55,100,asian
f6.dad,30,200,173,affrican american
f6.mom,27,140,178,affrican american
f6.s1,1,15,30,affrican american
f6.p1,8,80,130,affrican american

input_phenotype_data/instruments/iq.csv:
personId,verbal_iq,non_verbal_iq,diagnosis_notes
f1.p1,60,45,"walked late, severe seizures"
f1.s1,98,102,
f2.p1,98,70,originally diagnosed as Asperger
f2.s1,115,83,
f3.p1,90,80,
f3.s1,97,90,
f4.p1,108,93,excels at school
f4.s1,107,91,
f5.p1,90,70,sleep abnormality
f5.s1,105,115,
f6.p1,85,92,
f6.s1,95,101,

measure_descriptions.tsv contains descriptions for the provided measures.

input_phenotype_data/measure_descriptions.tsv:
instrumentName measureName description
basic_medical age The individual's age in years
basic_medical weight The individual's weight in punds
basic_medical height The individual's height in centimeters
basic_medical race The individual's race
iq verbal_iq Verbal IQ
iq non_verbal_iq Non verbal IQ

import_project.yaml is the import project configuration that we will use to import this data.

input_phenotype_data/import_project.yaml:
id: mini_pheno
instrument_files:
  - instruments/basic_medical.csv
  - instruments/iq.csv
data_dictionary:
  files:
    - path: measure_descriptions.tsv
pedigree: pedigree.ped
person_column: personId
work_dir: work
study_config:
  regressions:
    reg_1:
      display_name: "Age regression"
      measure_names:
        - age
      instrument_name: basic_medical
      jitter: 0.1
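To see how instrument files relate to measures, the following sketch reads an instrument CSV and keys each value by person and by a measure ID of the form instrument.measure. That form follows the iq.verbal_iq style IDs used elsewhere in this guide. The measure_values helper is hypothetical and only illustrates the data layout; it is not what import_phenotypes does internally.

```python
import csv
import io

# First rows of instruments/iq.csv, copied from the listing above.
IQ_CSV = '''personId,verbal_iq,non_verbal_iq,diagnosis_notes
f1.p1,60,45,"walked late, severe seizures"
f1.s1,98,102,
'''


def measure_values(instrument_name, csv_text, person_column="personId"):
    """Map (person ID, measure ID) to the raw string value; skip empty cells."""
    values = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        person = row[person_column]
        for column, value in row.items():
            if column == person_column or value == "":
                continue
            values[(person, f"{instrument_name}.{column}")] = value
    return values


values = measure_values("iq", IQ_CSV)
print(values[("f1.p1", "iq.verbal_iq")])
```

Note that the quoted diagnosis note containing a comma is kept as one value, which is why a CSV parser is used instead of a plain split on commas.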
Note
For more information on how to import phenotype data, see Phenotype Database Tools.
To import the phenotype data, we will use the import_phenotypes
tool.
It will import the phenotype database directly to our GPF instance’s phenotype
storage:
import_phenotypes input_phenotype_data/import_project.yaml
When the import finishes you can run the GPF development server using:
wgpf run
Now on the GPF instance Home Page you should see the mini_pheno
phenotype
study.

Home page with imported phenotype study
If you follow the link, you will see the Phenotype Browser tab with the imported data.

Phenotype Browser tab with imported data
In the Phenotype Browser tab you can search for phenotype instruments and measures, see the aggregated figures for the measures, and download selected instruments and measures.
Configure a genotype study to use phenotype data
To demonstrate how a study is configured with a phenotype database, we will
be working with the already imported example_dataset
dataset.
The phenotype databases can be attached to one or more studies and/or datasets.
If you want to attach the mini_pheno
phenotype study to the
example_dataset
dataset,
you need to specify it in the dataset’s configuration file, which can be found
at minimal_instance/datasets/example_dataset/example_dataset.yaml
.
Add the following line to the configuration file:
phenotype_data: mini_pheno
When you restart the server, you should be able to see Phenotype Browser and Phenotype Tool tabs enabled for the Example Dataset dataset.
Additionally, in the Genotype Browser the Family Filters and Person Filters sections will have the Pheno Measures filters enabled.

Example Dataset genotype browser using Pheno Measures family filters
Getting Started with Preview and Download Columns
Configure genotype columns in Genotype Browser
Once you have annotated your variants the additional attributes produced by the annotation can be used in the variants preview table and in the variants download file. For each study and dataset you can specify which columns are shown in the variants’ table preview, as well as those which will be downloaded.
In our example the annotation produces three additional attributes:
gnomad_v4_genome_ALL_af
, CLNSIG
, and CLNDN
. Let us add these
attributes to the
variants preview table and the variants download file for the
example_dataset
dataset.
Edit the example_dataset.yaml
dataset configuration in
minimal_instance/datasets/example_dataset
and add the following section
at the end of the configuration file:
 1 genotype_browser:
 2   columns:
 3     genotype:
 4       gnomad_v4_genome_af:
 5         name: gnomAD v4 AF
 6         source: gnomad_v4_genome_ALL_af
 7         format: "%%.5f"
 8       clinvar_clnsig:
 9         name: CLNSIG
10         source: CLNSIG
11       clinvar_clndn:
12         name: CLNDN
13         source: CLNDN
14
15   column_groups:
16     gnomad_v4:
17       name: gnomAD v4
18       columns:
19         - gnomad_v4_genome_af
20
21     clinvar:
22       name: ClinVar
23       columns:
24         - clinvar_clnsig
25         - clinvar_clndn
26
27   preview_columns_ext:
28     - gnomad_v4
29     - clinvar
30
31   download_columns_ext:
32     - gnomad_v4_genome_af
33     - clinvar_clnsig
34     - clinvar_clndn
Lines 3-13 define the three new columns with values coming from the genotype data attributes:
gnomad_v4_genome_af - is a column that uses the value of the attribute gnomad_v4_genome_ALL_af and formats it as a float with 5 decimal places. The display name of the column will be gnomAD v4 AF;
clinvar_clnsig - is a column that uses the value of the attribute CLNSIG. The display name of the column will be CLNSIG;
clinvar_clndn - is a column that uses the value of the attribute CLNDN. The display name of the column will be CLNDN.
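The format value is a printf-style pattern. As a rough illustration of "a float with 5 decimal places", the sketch below applies the same pattern with Python's % operator. It assumes that the doubled %% in the configuration is simply an escaped %; that is an assumption of this sketch, not something stated in this guide, and the allele frequency used is a made-up number.

```python
# Hypothetical allele frequency value, used only for illustration.
value = 0.000123456

# Assumption: "%%" in the YAML stands for a single literal "%".
pattern = "%%.5f".replace("%%", "%")

formatted = pattern % value
print(formatted)  # 0.00012
```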
In the preview table, each column can show multiple values. In GPF, when you want to show multiple values in a single column, you need to define a column group.
The column group is a collection of columns that are
shown together in the preview table. The values in a column group are shown
in a single cell. The column group is defined in the
column_groups
section of the configuration file.
In lines 16-19 we define a column group
gnomad_v4
that contains the column
gnomad_v4_genome_af
.
In lines 21-25 we define a column group
clinvar
that contains the columns
clinvar_clnsig
and clinvar_clndn
.
In lines 27-29 we extend the preview table columns. The new column groups
gnomad_v4
and clinvar
will be added to the preview table.
In lines 32-34 we extend the download file columns. The columns
gnomad_v4_genome_af
, clinvar_clnsig
, clinvar_clndn
will be added
to the download file.
If we now stop the wgpf
tool and run it again, we will be able to see
the new columns in the preview table and in the download file.
From the GPF instance Home Page follow the link to the Example Dataset page and choose the Genotype Browser. Select all checkboxes in Present in Child, Present in Parent and Effect Types sections.

Then click the Preview button and you will be able to see all the imported variants with their additional attributes coming from the annotation.

Example Dataset genotype browser displaying variants with additional columns gnomAD v4 and ClinVar.
Configure phenotype columns in Genotype Browser
The Genotype Browser allows you to add phenotype columns to the table preview and download file.
Phenotype columns show values from a phenotype database. To configure such a column you need to specify the following attributes:
source - the ID of the measure whose values are shown in the column;
role - the role of the person in the family for whom the phenotype measure value is shown;
name - the display name of the column in the table.
Let us add some phenotype columns to the Genotype Browser preview table. To do this, you need to define them in the study’s configuration, in the genotype_browser section of the configuration file.
 1 genotype_browser:
 2   columns:
 3     genotype:
 4       gnomad_v4_genome_af:
 5         name: gnomAD v4 AF
 6         source: gnomad_v4_genome_ALL_af
 7         format: "%%.5f"
 8       clinvar_clnsig:
 9         name: CLNSIG
10         source: CLNSIG
11       clinvar_clndn:
12         name: CLNDN
13         source: CLNDN
14
15     phenotype:
16       prb_verbal_iq:
17         role: prb
18         name: Verbal IQ
19         source: iq.verbal_iq
20
21       prb_non_verbal_iq:
22         role: prb
23         name: Non-Verbal IQ
24         source: iq.non_verbal_iq
25
26   column_groups:
27     gnomad_v4:
28       name: gnomAD v4
29       columns:
30         - gnomad_v4_genome_af
31
32     clinvar:
33       name: ClinVar
34       columns:
35         - clinvar_clnsig
36         - clinvar_clndn
37
38     proband_iq:
39       name: Proband IQ
40       columns:
41         - prb_verbal_iq
42         - prb_non_verbal_iq
43
44   preview_columns_ext:
45     - gnomad_v4
46     - clinvar
47     - proband_iq
48
49   download_columns_ext:
50     - gnomad_v4_genome_af
51     - clinvar_clnsig
52     - clinvar_clndn
53     - prb_verbal_iq
54     - prb_non_verbal_iq
Lines 15-24 define two new columns with values coming from the phenotype data attributes:
prb_verbal_iq - is a column that uses the value of the phenotype measure iq.verbal_iq for the family proband. The display name of the column will be Verbal IQ;
prb_non_verbal_iq - is a column that uses the value of the phenotype measure iq.non_verbal_iq for the family proband. The display name of the column will be Non-Verbal IQ.
In the preview table, each column can show multiple values. In GPF, when you want to show multiple values in a single column, you need to define a column group.
The column group is a collection of columns that are
shown together in the preview table. The values in a column group are shown
in a single cell. The column group is defined in the
column_groups
section of the configuration file.
In lines 38-42 we define a column group called proband_iq that contains the
columns prb_verbal_iq
and prb_non_verbal_iq
.
To add the new column group proband_iq
to the preview table, we need to
add it to the preview_columns_ext
section of the configuration file.
In line 47 we add the new column group proband_iq
at the end of the
preview table.
When you restart the server, go to the Genotype Browser tab of the
Example Dataset
dataset and select all checkboxes in Present in Child,
Present in Parent and Effect Types sections:

When you click on the Table Preview button, you will be able to see the new
column group proband_iq
in the preview table.

Example Dataset genotype browser using pheno measures columns
Example Usage of GPF Python Interface
The simplest way to start using GPF’s Python API is to import the GPFInstance
class and instantiate it:
from dae.gpf_instance.gpf_instance import GPFInstance
gpf_instance = GPFInstance.build()
This gpf_instance
object groups together a number of objects, each dedicated
to managing different parts of the underlying data. It can be used to interact
with the system as a whole.
For example, to list all studies configured in the startup GPF instance, use:
gpf_instance.get_genotype_data_ids()
This will return a list with the ids of all configured studies:
['denovo_example',
'vcf_example',
'example_dataset']
To get a specific study and query it, you can use:
st = gpf_instance.get_genotype_data('example_dataset')
vs = list(st.query_variants())
Note
The query_variants
method returns a Python iterator.
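Because the result is an iterator, it can be consumed only once, and it can be read lazily. The sketch below demonstrates this with a stand-in generator (the hypothetical fake_query) so that it runs without a GPF instance; with a real study the same pattern would apply to the object returned by query_variants.

```python
from itertools import islice


def fake_query():
    """Stand-in for a variants query; yields strings instead of variants."""
    for position in (21391016, 21393484, 21402010, 21403019):
        yield f"chr14:{position}"


variants = fake_query()

# Take only the first two results without reading the rest.
first_two = list(islice(variants, 2))

# The iterator continues where it stopped.
rest = list(variants)

# Once exhausted, the iterator yields nothing more.
again = list(variants)

print(first_two, rest, again)
```

This is why the examples in this section wrap the result in list(...) when the variants need to be traversed more than once.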
To get the basic information about variants found by the query_variants
method,
you can use:
for v in vs:
    for aa in v.alt_alleles:
        print(aa)
chr14:21391016 A->AT f2
chr14:21393484 TCTTC->T f2
chr14:21402010 G->A f1
chr14:21403019 G->A f2
chr14:21403214 T->C f1
chr14:21431459 G->C f1
chr14:21385738 C->T f1
chr14:21385738 C->T f2
chr14:21385954 A->C f2
chr14:21393173 T->C f1
chr14:21393702 C->T f2
chr14:21393860 G->A f1
chr14:21403023 G->A f1
chr14:21403023 G->A f2
chr14:21405222 T->C f2
chr14:21409888 T->C f1
chr14:21409888 T->C f2
chr14:21429019 C->T f1
chr14:21429019 C->T f2
chr14:21431306 G->A f1
chr14:21431623 A->C f2
chr14:21393540 GGAA->G f1
The query_variants
interface allows you to specify what kind of variants
you are interested in. For example, if you only need “synonymous” variants, you
can use:
st = gpf_instance.get_genotype_data('example_dataset')
vs = st.query_variants(effect_types=['synonymous'])
vs = list(vs)
len(vs)
>> 4
Or, if you are interested in “synonymous” variants only in people with “prb” role, you can use:
vs = st.query_variants(effect_types=['synonymous'], roles='prb')
vs = list(vs)
len(vs)
>> 1
Example import of real de Novo variants
Source of the data
As an example let us import de novo variants from the following paper: Yoon, S., et al. Rates of contributory de novo mutation in high and low-risk autism families. Commun Biol 4, 1026 (2021).
We will focus on de novo variants from the SSC collection published in the aforementioned paper.
To import these variants into the GPF system, we need a pedigree file describing the families and a list of de novo variants.
From the supplementary data for the paper we can download the following files:
The list of sequenced children is available from Supplementary Data 1.
The list of SNP and INDEL de novo variants is available from Supplementary Data 2.
Note
All the data files needed for this example are available in the
gpf-getting-started
repository under the subdirectory example_imports/denovo_and_cnv_import
.
Preprocess the Family Data
The list of children in Supplementary_Data_1.tsv.gz
contains a lot of data
that is not relevant for the import.
We are going to use only the first five
columns from that file that look as follows:
gunzip -c Supplementary_Data_1.tsv.gz | head | cut -f 1-5 | less -S -x 20
collection familyId personId affected status sex
SSC 11000 11000.p1 affected M
SSC 11000 11000.s1 unaffected F
SSC 11003 11003.p1 affected M
SSC 11003 11003.s1 unaffected F
SSC 11004 11004.p1 affected M
SSC 11004 11004.s1 unaffected M
SSC 11006 11006.p1 affected M
SSC 11006 11006.s1 unaffected M
SSC 11008 11008.p1 affected M
The first column contains the collection. This study contains data from SSC and AGRE collections. We are going to import only variants from the SSC collection.
The second column contains the family ID.
The third column contains the person’s ID.
The fourth column contains the affected status of the individual.
The fifth column contains the sex of the individual.
We need a pedigree file describing the families’ structure to import the data into GPF. The Supplementary_Data_1.tsv.gz file contains only the children; it does not include information about their parents. Fortunately, for the SSC collection it is not difficult to build the whole families’ structures from the information we have.
So, before starting the work on the import, we need to preprocess the list of children and transform it into a pedigree file.
For the SSC collection, if you have a family with ID <fam_id>, then the identifiers of the individuals in the family are formed as follows:
mother - <fam_id>.mo;
father - <fam_id>.fa;
proband - <fam_id>.p1;
first sibling - <fam_id>.s1;
second sibling - <fam_id>.s2.
Another essential restriction for SSC is that the only affected person in the family is the proband. The affected status of the mother, father, and siblings is unaffected.
Having this information, we can use the following Awk script to transform the list of children into a pedigree:
gunzip -c Supplementary_Data_1.tsv.gz | awk '
    BEGIN {
        OFS="\t"
        print "familyId", "personId", "dadId", "momId", "status", "sex"
    }
    $1 == "SSC" {
        fid = $2
        # Emit the parents once, the first time a family is seen.
        if (!(fid in families)) {
            families[fid] = 1
            print fid, fid".mo", "0", "0", "unaffected", "F"
            print fid, fid".fa", "0", "0", "unaffected", "M"
        }
        # Emit the child with references to both parents.
        print fid, $3, fid".fa", fid".mo", $4, $5
    }' > ssc_denovo.ped
If we run this script, it will read Supplementary_Data_1.tsv.gz
and produce
the appropriate pedigree file ssc_denovo.ped
.
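For readers who prefer Python to Awk, the same transformation can be sketched as follows. This is an equivalent illustration, not part of the guide's repository; the function name children_to_pedigree is hypothetical, and the sketch assumes tab-separated input with the five columns described above.

```python
def children_to_pedigree(lines):
    """Build pedigree rows (including parents) from the list of children."""
    rows = [("familyId", "personId", "dadId", "momId", "status", "sex")]
    seen_families = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Keep only SSC rows; this also skips the header line.
        if len(fields) < 5 or fields[0] != "SSC":
            continue
        _, fid, person_id, status, sex = fields[:5]
        if fid not in seen_families:
            # Emit the parents once, the first time a family is seen.
            seen_families.add(fid)
            rows.append((fid, f"{fid}.mo", "0", "0", "unaffected", "F"))
            rows.append((fid, f"{fid}.fa", "0", "0", "unaffected", "M"))
        rows.append((fid, person_id, f"{fid}.fa", f"{fid}.mo", status, sex))
    return rows


sample = [
    "collection\tfamilyId\tpersonId\taffected status\tsex\n",
    "SSC\t11000\t11000.p1\taffected\tM\n",
    "SSC\t11000\t11000.s1\tunaffected\tF\n",
]
for row in children_to_pedigree(sample):
    print("\t".join(row))
```

To process the real file, the lines could be read with gzip.open("Supplementary_Data_1.tsv.gz", "rt") from the standard library.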
Note
The resulting pedigree file is also available in the
gpf-getting-started
repository under the subdirectory
example_imports/denovo_and_cnv_import
.
Here is a fragment from the resulting pedigree file:
familyId personId dadId momId status sex
11000 11000.mo 0 0 unaffected F
11000 11000.fa 0 0 unaffected M
11000 11000.p1 11000.fa 11000.mo affected M
11000 11000.s1 11000.fa 11000.mo unaffected F
11003 11003.mo 0 0 unaffected F
11003 11003.fa 0 0 unaffected M
11003 11003.p1 11003.fa 11003.mo affected M
11003 11003.s1 11003.fa 11003.mo unaffected F
11004 11004.mo 0 0 unaffected F
11004 11004.fa 0 0 unaffected M
Preprocess the SNP and INDEL de Novo variants
The Supplementary_Data_2.tsv.gz file contains 255232 variants. For the import, we will use columns four and nine from this file:
gunzip -c Supplementary_Data_2.tsv.gz | head | cut -f 4,9 | less -S -x 20
personIds variant in VCF format
13210.p1 chr1:184268:G:A
12782.s1 chr1:191408:G:A
12972.s1 chr1:271774:AG:A
12420.p1 chr1:484721:AG:A
12518.p1,12518.s1 chr1:691130:T:C
13882.p1 chr1:738645:C:G
14039.s1 chr1:819832:G:T
13872.p1 chr1:824001:AAAAT:A
Using the following Awk script, we can transform this file into an easy-to-import list of de novo variants:
gunzip -c Supplementary_Data_2.tsv.gz | cut -f 4,9 | awk '
    BEGIN {
        OFS="\t"
        print "chrom", "pos", "ref", "alt", "person_id"
    }
    NR > 1 {
        split($2, v, ":")
        print v[1], v[2], v[3], v[4], $1
    }' > ssc_denovo.tsv
This script will produce a file named ssc_denovo.tsv
with the following
content:
chrom pos ref alt person_id
chr1 184268 G A 13210.p1
chr1 191408 G A 12782.s1
chr1 271774 AG A 12972.s1
chr1 484721 AG A 12420.p1
chr1 691130 T C 12518.p1,12518.s1
chr1 738645 C G 13882.p1
chr1 819832 G T 14039.s1
chr1 824001 AAAAT A 13872.p1
chr1 826779 T C 12132.s1
chr1 834505 G A 13801.p1
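The same conversion can also be sketched in Python. As before, this is an illustrative equivalent of the Awk script rather than code from the repository; to_denovo_rows is a hypothetical name, and the sketch assumes tab-separated input in which column four holds the person IDs and column nine holds the variant as chrom:pos:ref:alt.

```python
def to_denovo_rows(lines):
    """Convert supplementary-table lines into de novo variant rows."""
    rows = [("chrom", "pos", "ref", "alt", "person_id")]
    for number, line in enumerate(lines, start=1):
        if number == 1:
            continue  # skip the header line
        fields = line.rstrip("\n").split("\t")
        person_ids, variant = fields[3], fields[8]
        chrom, pos, ref, alt = variant.split(":")
        rows.append((chrom, pos, ref, alt, person_ids))
    return rows


# Two input lines; columns other than 4 and 9 are filled with placeholders.
sample_lines = [
    "c1\tc2\tc3\tpersonIds\tc5\tc6\tc7\tc8\tvariant in VCF format\n",
    "x\tx\tx\t13210.p1\tx\tx\tx\tx\tchr1:184268:G:A\n",
    "x\tx\tx\t12518.p1,12518.s1\tx\tx\tx\tx\tchr1:691130:T:C\n",
]
for row in to_denovo_rows(sample_lines):
    print("\t".join(row))
```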
Note
The resulting ssc_denovo.tsv
file is also available in the
gpf-getting-started
repository under the subdirectory
example_imports/denovo_and_cnv_import/input_data
.
Caching GRR
Now we are about to import 255K variants. During the import, the GPF system will annotate these variants using the GRR resources from our public GRR.
For small studies with few variants this approach is quite convenient. However, for larger studies, it is better to cache the GRR resources locally.
To do this, we need to configure the GRR to use a local cache. Create a file
named .grr_definition.yaml
in your home directory with the following
content:
id: "seqpipe"
type: "url"
url: "https://grr.iossifovlab.com"
cache_dir: "<path_to_your_cache_dir>"
The cache_dir
parameter specifies the directory where the GRR resources
will be cached. The cache directory should be specified as an absolute path.
For example, /tmp/grr_cache
or /Users/lubo/grrCache
.
To download all the resources needed for our minimal_instance
, run
the following command from gpf-getting-started
directory:
grr_cache_repo -i minimal_instance/gpf_instance.yaml
Note
The grr_cache_repo
command will download all the resources needed for
the GPF instance. This may take a while, depending on your internet
connection and the number of resources your configuration requires.
The resources will be downloaded to the directory specified in the
cache_dir
parameter in the .grr_definition.yaml
file.
For the gpf-getting-started
repository, the resources that will be
downloaded are:
hg38/genomes/GRCh38-hg38
hg38/gene_models/MANE/1.3
hg38/variant_frequencies/gnomAD_4.1.0/genomes/ALL
hg38/scores/ClinVar_20240730
The total size of the downloaded resources is about 15 GB.
Data Import of ssc_denovo
Now we have a pedigree file, ssc_denovo.ped
, and a list of de novo
variants, ssc_denovo.tsv
. Let us prepare an import project configuration
file, ssc_denovo.yaml
:
 1  id: ssc_denovo
 2
 3  input:
 4    pedigree:
 5      file: ssc_denovo.ped
 6
 7    denovo:
 8      files:
 9        - ssc_denovo.tsv
10
11  processing_config:
12    denovo: chromosome
When importing genotype data, we often need to instruct the import tool how to
split the import process into multiple jobs. For this purpose, we can use the
processing_config section of the import project. Lines 11-12 of the
ssc_denovo.yaml file define a processing_config section that splits the import
of de novo variants into jobs by chromosome. (For more
on import project configuration see Import Tools.)
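To make the idea of splitting by chromosome concrete, here is a small Python sketch that partitions de novo rows into one bucket per chromosome. It illustrates the concept only and is not how import_genotypes is implemented. The five-field layout (chromosome, position, reference, alternative, person id) is assumed from the ssc_denovo.tsv listing, and the chr2 row is a made-up placeholder added so that more than one bucket appears.

```python
from collections import defaultdict

# Rows in the assumed layout of ssc_denovo.tsv.
ROWS = [
    ("chr1", 738645, "C", "G", "13882.p1"),
    ("chr1", 819832, "G", "T", "14039.s1"),
    ("chr2", 1000, "A", "T", "00000.p1"),  # hypothetical row, not real data
]


def split_by_chromosome(rows):
    """Group rows by their first field so each chromosome forms one job."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[0]].append(row)
    return dict(buckets)


jobs = split_by_chromosome(ROWS)
for chrom, chunk in jobs.items():
    print(chrom, len(chunk))
```

Each bucket can then be processed independently, which is what allows the jobs to run in parallel.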
Note
The project file ssc_denovo.yaml is available in the gpf-getting-started
repository under the subdirectory
example_imports/denovo_and_cnv_import.
To import the study, from the gpf-getting-started
directory we should run:
time import_genotypes -v -j 10 example_imports/denovo_and_cnv_import/ssc_denovo.yaml
real 5m29.950s
user 31m52.320s
sys 1m41.755s
The -j 10
option instructs the import_genotypes
tool to use 10 threads
and the -v
option controls the verbosity of the output.
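The timing figures reported above give a rough idea of how much of the import actually ran in parallel. The following Python sketch converts the three stamps to seconds and divides the total CPU time (user plus sys) by the wall-clock time (real). The helper name to_seconds is hypothetical, and the ratio is only an approximate indicator of how many cores were busy on average.

```python
import re


def to_seconds(stamp: str) -> float:
    """Convert a `time` stamp such as '5m29.950s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized time stamp: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


real = to_seconds("5m29.950s")
cpu = to_seconds("31m52.320s") + to_seconds("1m41.755s")
ratio = cpu / real
print(f"wall clock: {real:.2f} s, CPU: {cpu:.2f} s, ratio: {ratio:.2f}")
```

For this run the ratio comes out at roughly 6, so on average about six cores were busy even though ten were requested.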
When the import finishes, we can run the development GPF server:
wgpf run
On the home page of the GPF instance we should now see the new study
ssc_denovo.

Home page with the imported SSC de novo variants.
If you follow the link to the study, and choose the Genotype Browser tab, you will be able to query the imported variants.

Genotype browser for the SSC de novo variants.
Configure preview and download columns
While importing the SSC de novo variants, we used the annotation defined in the minimal instance configuration file, so all imported variants are annotated with gnomAD and ClinVar genomic scores.
We can use these scores to define additional columns in the preview table and the download file similar to Getting Started with Preview and Download Columns.
Edit the ssc_denovo configuration file located at
minimal_instance/datasets/ssc_denovo/ssc_denovo.yaml
and add the following snippet to it:
genotype_browser:
  columns:
    genotype:
      gnomad_v4_genome_af:
        name: gnomAD v4 AF
        source: gnomad_v4_genome_ALL_af
        format: "%%.5f"
      clinvar_clnsig:
        name: CLNSIG
        source: CLNSIG
      clinvar_clndn:
        name: CLNDN
        source: CLNDN

  column_groups:
    gnomad_v4:
      name: gnomAD v4
      columns:
        - gnomad_v4_genome_af

    clinvar:
      name: ClinVar
      columns:
        - clinvar_clnsig
        - clinvar_clndn

  preview_columns_ext:
    - gnomad_v4
    - clinvar

  download_columns_ext:
    - gnomad_v4_genome_af
    - clinvar_clnsig
    - clinvar_clndn
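A misspelled column name is an easy mistake to make in a snippet like this. The Python sketch below mirrors the names from the configuration in a plain dictionary and reports any reference that does not resolve. It is an illustration rather than GPF's own validation: the function name undefined_references is hypothetical, and the sketch assumes that the preview and download lists may name either a column or a column group.

```python
# A plain-dict mirror of the genotype_browser snippet, trimmed to the names.
CONFIG = {
    "columns": {
        "genotype": {
            "gnomad_v4_genome_af": {"source": "gnomad_v4_genome_ALL_af"},
            "clinvar_clnsig": {"source": "CLNSIG"},
            "clinvar_clndn": {"source": "CLNDN"},
        }
    },
    "column_groups": {
        "gnomad_v4": {"columns": ["gnomad_v4_genome_af"]},
        "clinvar": {"columns": ["clinvar_clnsig", "clinvar_clndn"]},
    },
    "preview_columns_ext": ["gnomad_v4", "clinvar"],
    "download_columns_ext": [
        "gnomad_v4_genome_af",
        "clinvar_clnsig",
        "clinvar_clndn",
    ],
}


def undefined_references(config):
    """List references to columns or groups that are not defined."""
    columns = set(config["columns"]["genotype"])
    groups = config["column_groups"]
    missing = []
    # Every column listed in a group must be a defined column.
    for group_id, group in groups.items():
        for column in group["columns"]:
            if column not in columns:
                missing.append(f"column_groups.{group_id}: {column}")
    # Assumption: preview/download entries may be columns or groups.
    known = columns | set(groups)
    for key in ("preview_columns_ext", "download_columns_ext"):
        for name in config[key]:
            if name not in known:
                missing.append(f"{key}: {name}")
    return missing


print(undefined_references(CONFIG))
```

An empty list means every name in the groups and in the preview and download lists points at something defined in the snippet.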
Now, restart the GPF development server:
wgpf run
Go to the Genotype Browser tab of the ssc_denovo study and click the
Preview Table button. The preview table should now contain the additional
columns for gnomAD and ClinVar genomic scores.

Genotype browser with additional columns for gnomAD and ClinVar genomic scores.
Example import of real CNV variants
Source of the data
As an example for import of CNV variants we will use data from the following paper: Yoon, S., et al. Rates of contributory de novo mutation in high and low-risk autism families. Commun Biol 4, 1026 (2021).
We already discussed the import of de Novo variants from this paper in Example import of real de Novo variants.
Now we will focus on the import of CNV variants from the same paper.
To import these variants into the GPF system, we need a pedigree file describing the families and a list of CNV variants.
From the supplementary data for the paper we can download the following files:
The list of sequenced children is available from Supplementary Data 1.
The list of CNV de novo variants is available from Supplementary Data 4.
Note
All the data files needed for this example are available in the
gpf-getting-started
repository under the subdirectory example_imports/denovo_and_cnv_import
.
We already discussed how to transform the list of children into a pedigree file in the Preprocess the Family Data section.
Now we need to prepare the CNV variants file.
Preprocess the CNV variants
The Supplementary_Data_4.tsv.gz file contains 376 CNV variants from SSC and AGRE collections.
For the import we will use columns two, five, six and seven:
gunzip -c Supplementary_Data_4.tsv.gz | cut -f 2,5-7 | less -S -x 25
collection personIds location variant
SSC 12613.p1 chr1:1305145-1314126 duplication
AGRE AU2725301 chr1:3069177-4783791 duplication
SSC 13424.s1 chr1:3975501-3977800 deletion
SSC 12852.p1 chr1:6647401-6650500 deletion
SSC 13776.p1 chr1:8652301-8657600 deletion
SSC 13373.s1 chr1:9992001-9994100 deletion
SSC 14198.p1 chr1:12224601-12227300 deletion
SSC 13259.p1 chr1:15687701-15696200 deletion
SSC 14696.s1 chr1:30388501-30398807 deletion
Using the following Awk script we will keep only the variants from the SSC collection:
gunzip -c Supplementary_Data_4.tsv.gz | cut -f 2,5-7 | awk '
    BEGIN {
        OFS="\t"
        print "location", "variant", "person_id"
    }
    $1 == "SSC" {
        print $3, $4, $2
    }' > ssc_cnv.tsv
This script will produce a file named ssc_cnv.tsv
with the following
content:
location variant person_id
chr1:1305145-1314126 duplication 12613.p1
chr1:3975501-3977800 deletion 13424.s1
chr1:6647401-6650500 deletion 12852.p1
chr1:8652301-8657600 deletion 13776.p1
chr1:9992001-9994100 deletion 13373.s1
chr1:12224601-12227300 deletion 14198.p1
chr1:15687701-15696200 deletion 13259.p1
chr1:30388501-30398807 deletion 14696.s1
chr1:40513501-40534200 deletion 14534.p1
chr1:40513501-40534200 deletion 14534.s1
Note
The resulting ssc_cnv.tsv
file is available in the
gpf-getting-started
repository under the subdirectory
example_imports/denovo_and_cnv_import/input_data
.
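For readers more comfortable with Python than Awk, the same filtering step can be sketched as follows. The sample text reproduces a few rows from the cut listing and assumes tab-separated fields. The function name filter_ssc is hypothetical, and the sketch works on an in-memory string rather than on the compressed supplementary file.

```python
import csv
import io

# Rows in the layout produced by `cut -f 2,5-7`, taken from the listing.
SAMPLE = (
    "collection\tpersonIds\tlocation\tvariant\n"
    "SSC\t12613.p1\tchr1:1305145-1314126\tduplication\n"
    "AGRE\tAU2725301\tchr1:3069177-4783791\tduplication\n"
    "SSC\t13424.s1\tchr1:3975501-3977800\tdeletion\n"
)


def filter_ssc(text: str) -> str:
    """Keep SSC rows and reorder fields to location, variant, person_id."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["location", "variant", "person_id"])
    for row in reader:
        if row["collection"] == "SSC":
            writer.writerow([row["location"], row["variant"], row["personIds"]])
    return out.getvalue()


print(filter_ssc(SAMPLE), end="")
```

As in the Awk version, the AGRE row is dropped and a new header line replaces the original one.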
Data Import of ssc_cnv
Now we have a pedigree file, ssc_denovo.ped, and a list of de novo CNV
variants, ssc_cnv.tsv. Let us prepare an import project configuration
file, ssc_cnv.yaml:
 1  id: ssc_cnv
 2
 3  input:
 4    pedigree:
 5      file: ssc_denovo.ped
 6
 7    cnv:
 8      files:
 9        - ssc_cnv.tsv
10
11      location: location
12      variant_type: variant
13      plus_values: duplication
14      minus_values: deletion
15      person_id: person_id
Lines 12-14 define how the CNV variant type is read from the input file.
The variant_type option names the column that holds the type (here, variant),
while plus_values and minus_values list the values in that column that mark
the two CNV types: duplication and deletion, respectively.
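The column mapping described above can be pictured with a short Python sketch that turns one row of ssc_cnv.tsv into a structured record. This is not GPF's loader: the Cnv type, the parse_cnv function and the labels "plus" and "minus" are hypothetical names chosen to echo the plus_values and minus_values options.

```python
from typing import NamedTuple


class Cnv(NamedTuple):
    chrom: str
    start: int
    end: int
    kind: str  # "plus" or "minus" (hypothetical labels)
    person_id: str


# Values taken from the plus_values / minus_values settings.
PLUS_VALUES = {"duplication"}
MINUS_VALUES = {"deletion"}


def parse_cnv(location: str, variant: str, person_id: str) -> Cnv:
    """Parse a 'chrom:start-end' location and classify the variant value."""
    chrom, _, span = location.partition(":")
    start, _, end = span.partition("-")
    if variant in PLUS_VALUES:
        kind = "plus"
    elif variant in MINUS_VALUES:
        kind = "minus"
    else:
        raise ValueError(f"unknown CNV variant value: {variant!r}")
    return Cnv(chrom, int(start), int(end), kind, person_id)


cnv = parse_cnv("chr1:1305145-1314126", "duplication", "12613.p1")
print(cnv)
```

A value that appears in neither list raises an error here, which makes unexpected entries in the variant column easy to spot before an import.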
Note
The project file ssc_cnv.yaml is available in the gpf-getting-started
repository under the subdirectory
example_imports/denovo_and_cnv_import.
To import the study, from the gpf-getting-started
directory we should run:
time import_genotypes -v -j 1 example_imports/denovo_and_cnv_import/ssc_cnv.yaml
When the import finishes, we can run the development GPF server:
wgpf run
On the home page of the GPF instance we should now see the new
study ssc_cnv.

Home page with the imported ssc_cnv
study.
If you follow the link to the study, and choose the Genotype Browser tab, you will be able to query the imported CNV variants.

Genotype browser for the SSC CNV variants.