Example import of real de Novo variants
Source of the data
As an example, let us import de novo variants from the following paper: Yoon, S., et al. Rates of contributory de novo mutation in high and low-risk autism families. Commun Biol 4, 1026 (2021).
We will focus on de novo variants from the SSC collection published in the paper mentioned above.
To import these variants into the GPF system, we need a pedigree file describing the families and a list of de novo variants.
From the supplementary data for the paper, you can download the following files:
The list of sequenced children is available from Supplementary Data 1
The list of SNP and INDEL de novo variants is available from Supplementary Data 2
Note
All the data files needed for this example are available in the
gpf-getting-started
repository under the subdirectory example_imports/denovo_and_cnv_import.
Preprocess the Family Data
The list of children in Supplementary_Data_1.tsv.gz contains a lot of data
that is not relevant for the import.
We are going to use only the first five
columns of that file, which look as follows:
gunzip -c Supplementary_Data_1.tsv.gz | head | cut -f 1-5 | less -S -x 20
collection familyId personId affected status sex
SSC 11000 11000.p1 affected M
SSC 11000 11000.s1 unaffected F
SSC 11003 11003.p1 affected M
SSC 11003 11003.s1 unaffected F
SSC 11004 11004.p1 affected M
SSC 11004 11004.s1 unaffected M
SSC 11006 11006.p1 affected M
SSC 11006 11006.s1 unaffected M
SSC 11008 11008.p1 affected M
The first column contains the collection. This study includes data from the SSC and AGRE collections. We are going to import only variants from the SSC collection.
The second column contains the family ID.
The third column contains the person’s ID.
The fourth column contains the affected status of the individual.
The fifth column contains the sex of the individual.
We need a pedigree file describing the structure of each family to import the data into GPF. The Supplementary_Data_1.tsv.gz file contains only the children; it does not include information about their parents. Fortunately, for the SSC collection it is not difficult to build the full family structures from the information we have.
So, before starting the work on the import, we need to preprocess the list of children and transform it into a pedigree file.
For the SSC collection, if you have a family with ID <fam_id>, then the identifiers of the individuals in the family are going to be formed as follows:
mother - <fam_id>.mo;
father - <fam_id>.fa;
proband - <fam_id>.p1;
first sibling - <fam_id>.s1;
second sibling - <fam_id>.s2.
Another essential restriction for SSC is that the only affected person in the family is the proband. The affected status of the mother, father, and siblings is unaffected.
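Both conventions can be checked before building the pedigree. The following is a minimal sketch, run against a few rows copied from the listing above; the file name sample_children.tsv is a placeholder introduced for this example, and the check assumes the five-column, tab-separated layout described earlier.

```shell
# Sample rows copied from the first five columns shown above
# (sample_children.tsv is a placeholder name for this sketch).
printf '%s\t%s\t%s\t%s\t%s\n' \
    collection familyId personId "affected status" sex \
    SSC 11000 11000.p1 affected M \
    SSC 11000 11000.s1 unaffected F \
    SSC 11003 11003.p1 affected M \
    SSC 11003 11003.s1 unaffected F \
    > sample_children.tsv

violations=$(awk -F '\t' '
$1 == "SSC" {
    n = split($3, parts, ".")
    role = parts[2]
    # The person ID must be <fam_id>.<role>.
    if (n != 2 || parts[1] != $2) print "bad person ID: " $3
    # Probands (p1) are affected; siblings (s1, s2) are unaffected.
    if (role ~ /^p/ && $4 != "affected") print "unaffected proband: " $3
    if (role ~ /^s/ && $4 != "unaffected") print "affected sibling: " $3
}' sample_children.tsv)

if [ -z "$violations" ]; then
    echo "no violations"
else
    printf '%s\n' "$violations"
fi
```

With the real data, the same Awk program can read the output of gunzip -c Supplementary_Data_1.tsv.gz | cut -f 1-5 instead of the sample file.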
With this information, we can use the following Awk script to transform the list of children into a pedigree file:
gunzip -c Supplementary_Data_1.tsv.gz | awk '
BEGIN {
    OFS = "\t"
    print "familyId", "personId", "dadId", "momId", "status", "sex"
}
# Keep only children from the SSC collection.
$1 == "SSC" {
    fid = $2
    # On the first child of a family, emit the two parents as founders.
    if (!(fid in families)) {
        families[fid] = 1
        print fid, fid".mo", "0", "0", "unaffected", "F"
        print fid, fid".fa", "0", "0", "unaffected", "M"
    }
    # Emit the child, linked to both parents.
    print fid, $3, fid".fa", fid".mo", $4, $5
}' > ssc_denovo.ped
If we run this script, it will read Supplementary_Data_1.tsv.gz and produce
the appropriate pedigree file ssc_denovo.ped.
Note
The resulting pedigree file is also available in the
gpf-getting-started
repository under the subdirectory
example_imports/denovo_and_cnv_import.
Here is a fragment from the resulting pedigree file:
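The fragment can be reproduced on a small scale. As a sketch, the same Awk program applied to two rows copied from the listing of Supplementary_Data_1.tsv.gz above yields the pedigree records for family 11000; sample_children.tsv and sample.ped are placeholder names used only in this example.

```shell
# Two children of family 11000, copied from the listing above.
printf '%s\t%s\t%s\t%s\t%s\n' \
    collection familyId personId "affected status" sex \
    SSC 11000 11000.p1 affected M \
    SSC 11000 11000.s1 unaffected F \
    > sample_children.tsv

awk '
BEGIN {
    OFS = "\t"
    print "familyId", "personId", "dadId", "momId", "status", "sex"
}
$1 == "SSC" {
    fid = $2
    if (!(fid in families)) {
        families[fid] = 1
        print fid, fid".mo", "0", "0", "unaffected", "F"
        print fid, fid".fa", "0", "0", "unaffected", "M"
    }
    print fid, $3, fid".fa", fid".mo", $4, $5
}' sample_children.tsv > sample.ped

cat sample.ped
# Output (columns are tab-separated):
# familyId  personId  dadId     momId     status      sex
# 11000     11000.mo  0         0         unaffected  F
# 11000     11000.fa  0         0         unaffected  M
# 11000     11000.p1  11000.fa  11000.mo  affected    M
# 11000     11000.s1  11000.fa  11000.mo  unaffected  F
```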
Preprocess the SNP and INDEL de Novo variants
The Supplementary_Data_2.tsv.gz file contains 255232 variants. For the import, we will use columns four and nine from this file:
gunzip -c Supplementary_Data_2.tsv.gz | head | cut -f 4,9 | less -S -x 20
personIds variant in VCF format
13210.p1 chr1:184268:G:A
12782.s1 chr1:191408:G:A
12972.s1 chr1:271774:AG:A
12420.p1 chr1:484721:AG:A
12518.p1,12518.s1 chr1:691130:T:C
13882.p1 chr1:738645:C:G
14039.s1 chr1:819832:G:T
13872.p1 chr1:824001:AAAAT:A
Using the following Awk script, we can transform this file into an easy-to-import list of de novo variants:
gunzip -c Supplementary_Data_2.tsv.gz | cut -f 4,9 | awk '
BEGIN {
    OFS = "\t"
    print "chrom", "pos", "ref", "alt", "person_id"
}
# Skip the header row; split chrom:pos:ref:alt into separate columns.
NR > 1 {
    split($2, v, ":")
    print v[1], v[2], v[3], v[4], $1
}' > ssc_denovo.tsv
This script will produce a file named ssc_denovo.tsv with the following
content:
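As a sketch of what to expect, the same Awk program can be applied to three rows copied from the listing of columns four and nine above; sample_variants.tsv and sample_denovo.tsv are placeholder names used only in this example.

```shell
# Three rows copied from the listing above (person IDs, variant).
printf '%s\t%s\n' \
    personIds "variant in VCF format" \
    13210.p1 chr1:184268:G:A \
    12972.s1 chr1:271774:AG:A \
    12518.p1,12518.s1 chr1:691130:T:C \
    > sample_variants.tsv

awk '
BEGIN {
    OFS = "\t"
    print "chrom", "pos", "ref", "alt", "person_id"
}
NR > 1 {
    split($2, v, ":")
    print v[1], v[2], v[3], v[4], $1
}' sample_variants.tsv > sample_denovo.tsv

cat sample_denovo.tsv
# Output (columns are tab-separated):
# chrom  pos     ref  alt  person_id
# chr1   184268  G    A    13210.p1
# chr1   271774  AG   A    12972.s1
# chr1   691130  T    C    12518.p1,12518.s1
```

Note that a variant shared by two children keeps both identifiers, comma-separated, in the person_id column.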
Note
The resulting ssc_denovo.tsv file is also available in the
gpf-getting-started
repository under the subdirectory
example_imports/denovo_and_cnv_import/input_data.
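Before importing, it can help to confirm that every person ID in the variants file also appears in the pedigree. The following is a minimal sketch of such a check, not part of the GPF tooling. The files check.ped, check.tsv, and missing_ids.txt are placeholders; the variant row is taken from the listing above, while the pedigree rows (including the sex of the proband) are made up for illustration and deliberately omit the sibling.

```shell
# Illustrative pedigree for family 12518 WITHOUT the sibling 12518.s1
# (values are hypothetical; with real data use ssc_denovo.ped).
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
    familyId personId dadId momId status sex \
    12518 12518.mo 0 0 unaffected F \
    12518 12518.fa 0 0 unaffected M \
    12518 12518.p1 12518.fa 12518.mo affected M \
    > check.ped

# One variant shared by two children (with real data use ssc_denovo.tsv).
printf '%s\t%s\t%s\t%s\t%s\n' \
    chrom pos ref alt person_id \
    chr1 691130 T C 12518.p1,12518.s1 \
    > check.tsv

awk -F '\t' '
NR == FNR { if (FNR > 1) known[$2] = 1; next }   # first file: pedigree
FNR > 1 {
    # person_id may hold several comma-separated identifiers.
    n = split($5, ids, ",")
    for (i = 1; i <= n; i++)
        if (!(ids[i] in known)) print ids[i]
}' check.ped check.tsv | sort -u > missing_ids.txt

echo "person IDs missing from the pedigree:"
cat missing_ids.txt
# 12518.s1
```

An empty missing_ids.txt means that every person referenced by a variant is described in the pedigree.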
Caching GRR
Now we are about to import 255K variants. During the import, the GPF system will annotate these variants using the GRR resources from our public GRR. For small studies with few variants, this approach is quite convenient. However, for larger studies, it is better to cache the GRR resources locally.
To do this, we need to configure GPF to use a local cache. Create a file
named .grr_definition.yaml in your home directory with the following
content:
id: "seqpipe"
type: "url"
url: "https://grr.iossifovlab.com"
cache_dir: "<path_to_your_cache_dir>"
The cache_dir parameter specifies the directory where the GRR resources
will be cached. The cache directory should be specified as an absolute path.
For example, /tmp/grr_cache or /Users/lubo/grrCache.
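One way to prepare this file from the command line is sketched below. It assumes /tmp/grr_cache as the cache location (one of the example paths above) and writes to a scratch file named grr_definition.sample.yaml, a name chosen for this example, so that an existing ~/.grr_definition.yaml is not overwritten by accident.

```shell
# Assumed cache location for this sketch; any absolute path works.
cache_dir="/tmp/grr_cache"
mkdir -p "$cache_dir"

# Write the definition to a scratch file first.
cat > grr_definition.sample.yaml <<EOF
id: "seqpipe"
type: "url"
url: "https://grr.iossifovlab.com"
cache_dir: "$cache_dir"
EOF

cat grr_definition.sample.yaml
# When the content looks right, copy it into place:
#   cp grr_definition.sample.yaml ~/.grr_definition.yaml
```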
To download all the resources needed for our minimal_instance annotation,
run the following command from the gpf-getting-started directory:
grr_cache_repo -i minimal_instance/gpf_instance.yaml
Note
The grr_cache_repo command will download all the resources needed for
the GPF instance. This may take a while, depending on your internet
connection and the number of resources your configuration requires.
The resources will be downloaded to the directory specified in the
cache_dir parameter in the .grr_definition.yaml file.
For the gpf-getting-started repository, the resources that will be
downloaded are:
hg38/genomes/GRCh38-hg38
hg38/gene_models/MANE/1.3
hg38/variant_frequencies/gnomAD_4.1.0/genomes/ALL
hg38/scores/ClinVar_20240730
The total size of the downloaded resources is about 15 GB.
Data Import of ssc_denovo
Now we have a pedigree file, ssc_denovo.ped, and a list of de novo
variants, ssc_denovo.tsv. To import this data we need to prepare an import
project. The import project is already available in the example imports
directory example_imports/denovo_and_cnv_import/ssc_denovo.yaml:
When importing genotype data, we often need to instruct the import tool how to
split the import process into multiple jobs. For this purpose, we can use the
processing_config section of the import project. On lines 11-12 of the
ssc_denovo.yaml file, we have defined the processing_config section
that will split the import of de novo variants into jobs by chromosome. (For more
on import project configuration, see import_tool.)
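Because the jobs are split by chromosome, the number of variants per chromosome gives a rough idea of how the work is distributed. The sketch below counts variants per chromosome in a file with the same layout as ssc_denovo.tsv. The sample file counts_sample.tsv is a placeholder: its chr1 rows come from the listing above, while the chr2 and chrX rows are invented so that the example has more than one group.

```shell
# chr1 rows are from the listing above; chr2 and chrX rows are invented.
printf '%s\t%s\t%s\t%s\t%s\n' \
    chrom pos ref alt person_id \
    chr1 184268 G A 13210.p1 \
    chr1 191408 G A 12782.s1 \
    chr1 271774 AG A 12972.s1 \
    chr2 1000 C T 11000.p1 \
    chrX 2000 A G 11003.s1 \
    > counts_sample.tsv

# Count data rows per value of the first column (chrom).
counts=$(awk -F '\t' '
NR > 1 { count[$1]++ }
END { for (c in count) print c "\t" count[c] }
' counts_sample.tsv | sort)

printf '%s\n' "$counts"
# chr1    3
# chr2    1
# chrX    1
```

With the real data, replace counts_sample.tsv with ssc_denovo.tsv.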
Note
The project file ssc_denovo.yaml is available in the gpf-getting-started
repository under the subdirectory
example_imports/denovo_and_cnv_import.
To import the study, from the gpf-getting-started directory we should run:
time import_genotypes -v -j 10 example_imports/denovo_and_cnv_import/ssc_denovo.yaml
The -j 10 option instructs the import_genotypes tool to use 10 threads
and the -v option controls the verbosity of the output.
This command will take a while to run. The time it takes to run will depend on the number of variants in the input file and the number of threads used for the import.
Note
For example, on a MacBook Pro with the Apple M1 Pro chip, the import of the SSC de novo variants took about 5 minutes:
real 5m29.950s
user 31m52.320s
sys 1m41.755s
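These figures also indicate how much of the requested parallelism was used: dividing the total CPU time (user plus sys) by the wall-clock time (real) gives the average number of cores kept busy. A small worked calculation, using only the numbers reported above:

```shell
result=$(awk 'BEGIN {
    real = 5 * 60 + 29.950     # wall-clock seconds
    user = 31 * 60 + 52.320    # user-mode CPU seconds, summed over workers
    sys  = 1 * 60 + 41.755     # kernel-mode CPU seconds
    printf "average busy cores: %.1f", (user + sys) / real
}')

echo "$result"
# average busy cores: 6.1
```

So, with -j 10 on that machine, roughly six cores were busy on average over the run.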
When the import finishes, we can run the development GPF server:
wgpf run
On the home page of the GPF instance, we should see the new study
ssc_denovo.
The home page has the imported SSC de Novo study.
If you follow the link to the study and choose the Genotype Browser tab, you will be able to query the imported variants.
Genotype browser for the SSC de novo variants.
Configure preview and download columns
While importing the SSC de novo variants, we used the annotation defined in the minimal instance configuration file. As a result, all imported variants are annotated with gnomAD and ClinVar genomic scores.
We can use these score values to define additional columns in the preview table and the download file, as described in Getting Started with Preview Columns.
Edit the ssc_denovo configuration file located at
minimal_instance/studies/ssc_denovo/ssc_denovo.yaml and add the following
snippet to the configuration file:
genotype_browser:
  column_groups:
    frequency:
      name: frequency
      columns:
        - allele_freq
        - gnomad_v4_genome_ALL_af

    clinvar:
      name: ClinVar
      columns:
        - CLNSIG
        - CLNDN

  preview_columns_ext:
    - clinvar
Now, restart the GPF development server:
wgpf run
Go to the Genotype Browser tab of the ssc_denovo study and click the
Preview Table button. The preview table should now contain the additional
columns for gnomAD and ClinVar genomic scores.
Genotype browser with additional columns for GnomAD and ClinVar genomic scores.