Example import of de novo variants from Rates of contributory de novo mutation in high and low-risk autism families
Let us import de novo variants from Yoon, S., Munoz, A., Yamrom, B. et al. Rates of contributory de novo mutation in high and low-risk autism families. Commun Biol 4, 1026 (2021).
We will focus on de novo variants from the SSC collection published in the aforementioned paper. To import these variants into the GPF system we need a list of de novo variants and a pedigree file describing the families. The list of de novo variants is available from Supplementary Data 2. The pedigree file for this study is not available. Instead, we have a list of children available from Supplementary Data 1.
Let us first export these Excel spreadsheets into TSV (tab-separated) files.
Let us say that the list of de novo variants from the SSC collection is saved
into a file named SupplementaryData2_SSC.tsv and the list of children is saved
into a file named SupplementaryData1_Children.tsv.
Note
Input files for this example can be downloaded from denovo-in-high-and-low-risk-papter.tar.gz.
Preprocess the families data
To import the data into GPF we need a pedigree file describing the structure
of the families. The SupplementaryData1_Children.tsv file contains only the
list of children. There is no information about their parents. Fortunately, for
the SSC collection it is not difficult to build the full family structures from
the information we have. For the SSC collection, if you have a family with ID
<fam_id>, then the identifiers of the individuals in the family are formed as
follows:
mother - <fam_id>.mo;
father - <fam_id>.fa;
proband - <fam_id>.p1;
first sibling - <fam_id>.s1;
second sibling - <fam_id>.s2.
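The naming convention above can be sketched as a small helper. This is only an illustration and not part of GPF: the function name ssc_member_ids and the role labels are hypothetical, and numbering siblings beyond s2 is an assumption, since the convention above only lists s1 and s2.

```python
def ssc_member_ids(fam_id, n_siblings=0):
    """Return a role -> person ID mapping for one SSC family.

    Hypothetical helper that illustrates the naming convention;
    it is not part of GPF.
    """
    ids = {
        "mother": f"{fam_id}.mo",
        "father": f"{fam_id}.fa",
        "proband": f"{fam_id}.p1",
    }
    # Siblings are numbered from 1: <fam_id>.s1, <fam_id>.s2, ...
    for index in range(1, n_siblings + 1):
        ids[f"sibling{index}"] = f"{fam_id}.s{index}"
    return ids


# "f1" is a placeholder family ID, not a real SSC identifier.
print(ssc_member_ids("f1", n_siblings=2))
```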
Another important restriction for SSC is that the only affected person in the
family is the proband. The mother, the father and the siblings are all
unaffected.
Using all these conventions we can write a simple Python script,
build_ssc_pedigree.py, to convert SupplementaryData1_Children.tsv into a
pedigree file ssc_denovo.ped:
"""Converts SupplementaryData1_Children.tsv into a pedigree file."""
import pandas as pd

children = pd.read_csv("SupplementaryData1_Children.tsv", sep="\t")
ssc = children[children.collection == "SSC"]

# list of all individuals in SSC
persons = []
# each person is represented by a tuple:
# (familyId, personId, dadId, momId, status, sex)

for fam_id, members in ssc.groupby("familyId"):
    # parents are founders: no known parents, always unaffected
    persons.append((fam_id, f"{fam_id}.mo", "0", "0", "unaffected", "F"))
    persons.append((fam_id, f"{fam_id}.fa", "0", "0", "unaffected", "M"))
    for child in members.to_dict(orient="records"):
        persons.append((
            fam_id, child["personId"], f"{fam_id}.fa", f"{fam_id}.mo",
            child["affected status"], child["sex"]))

with open("ssc_denovo.ped", "wt", encoding="utf8") as output:
    output.write(
        "\t".join(("familyId", "personId", "dadId", "momId", "status", "sex")))
    output.write("\n")

    for person in persons:
        # str() guards against non-string fields, e.g. numeric family IDs
        output.write("\t".join(str(field) for field in person))
        output.write("\n")
If we run this script, it will read SupplementaryData1_Children.tsv and
produce the pedigree file ssc_denovo.ped.
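Before importing, it can be worth confirming that the generated pedigree is internally consistent. The following is a minimal sketch, not a GPF tool: the function name check_pedigree is hypothetical, and the sample rows use made-up placeholder IDs and status values. It assumes the six-column, tab-separated layout written by the script, with "0" marking a missing parent.

```python
import csv
import io


def check_pedigree(ped_text):
    """Return a list of problems found in tab-separated pedigree text."""
    rows = list(csv.DictReader(io.StringIO(ped_text), delimiter="\t"))
    problems = []

    # Every person ID should appear exactly once.
    known_ids = set()
    for row in rows:
        if row["personId"] in known_ids:
            problems.append(f"duplicate personId {row['personId']}")
        known_ids.add(row["personId"])

    # Every referenced parent should itself be listed in the pedigree.
    for row in rows:
        for parent_column in ("dadId", "momId"):
            parent_id = row[parent_column]
            if parent_id != "0" and parent_id not in known_ids:
                problems.append(
                    f"{row['personId']}: unknown {parent_column} {parent_id}")
    return problems


# Made-up placeholder family, used only to exercise the check.
sample = (
    "familyId\tpersonId\tdadId\tmomId\tstatus\tsex\n"
    "f1\tf1.mo\t0\t0\tunaffected\tF\n"
    "f1\tf1.fa\t0\t0\tunaffected\tM\n"
    "f1\tf1.p1\tf1.fa\tf1.mo\taffected\tM\n"
)
print(check_pedigree(sample))  # an empty list means no problems were found
```

To check the real file, read ssc_denovo.ped into a string and pass it to check_pedigree in the same way.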
Preprocess the variants data
The SupplementaryData2_SSC.tsv file contains 255231 variants. Importing that
many variants into the in-memory genotype storage is not appropriate. For this
example we are going to use a subset of 10000 variants. The command below keeps
the header line plus the first 10000 variant rows:
head -n 10001 SupplementaryData2_SSC.tsv > ssc_denovo.tsv
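Where the head utility is not available, the same subset can be produced in Python. This is a sketch of an equivalent step, not part of the GPF tooling; the function name take_first_variants is hypothetical, and it assumes the file has exactly one header line.

```python
from itertools import islice


def take_first_variants(source_path, target_path, n_variants):
    """Copy the header line plus the first n_variants data lines."""
    with open(source_path, encoding="utf8") as source, \
            open(target_path, "w", encoding="utf8") as target:
        # +1 accounts for the header line, mirroring `head -n 10001`.
        target.writelines(islice(source, n_variants + 1))
```

Calling take_first_variants("SupplementaryData2_SSC.tsv", "ssc_denovo.tsv", 10000) should then write the same ssc_denovo.tsv as the head command.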
Data import of ssc_denovo
Now we have a pedigree file ssc_denovo.ped and a list of de novo variants
ssc_denovo.tsv. Let us prepare an import project configuration file
ssc_denovo.yaml:
id: ssc_denovo

input:
  pedigree:
    file: ssc_denovo.ped

  denovo:
    files:
      - ssc_denovo.tsv
    person_id: personIds
    variant: variant
    location: location
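The person_id, variant and location entries name columns in the header of ssc_denovo.tsv, so a mismatch there is an easy mistake to make. The following is a minimal sketch of a pre-import check, not a GPF feature: the function name missing_columns is hypothetical, and the sample header is illustrative rather than copied from the real file.

```python
import csv
import io

# Column names taken from the denovo section of ssc_denovo.yaml.
REQUIRED_COLUMNS = ("personIds", "variant", "location")


def missing_columns(tsv_text, required=REQUIRED_COLUMNS):
    """Return the required column names absent from the TSV header."""
    header = next(csv.reader(io.StringIO(tsv_text), delimiter="\t"), [])
    return [name for name in required if name not in header]


# Illustrative header only; the real file may contain more columns.
sample_header = "familyId\tpersonIds\tlocation\tvariant\n"
print(missing_columns(sample_header))  # an empty list means all are present
```

To check the real file, pass its first line (or its whole content) to missing_columns.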
To import the study we should run:
import_tools ssc_denovo.yaml
and when the import finishes we can run the development GPF server:
wgpf run
In the list of studies, we should have a new study ssc_denovo.