.. _example_denovo_import: Example import of real de Novo variants ####################################### Source of the data ++++++++++++++++++ As an example, let us import de novo variants from the following paper: `Yoon, S., et al. Rates of contributory de novo mutation in high and low-risk autism families. Commun Biol 4, 1026 (2021). `_ We will focus on de novo variants from the SSC collection published in the paper mentioned above. To import these variants into the GPF system, we need a pedigree file describing the families and a list of de novo variants. From the supplementary data for the paper, you can download the following files: * The list of sequenced children available from `Supplementary Data 1 `_ * The list of SNP and INDEL de novo variants is available from `Supplementary Data 2 `_ .. note:: All the data files needed for this example are available in the `gpf-getting-started `_ repository under the subdirectory ``example_imports/denovo_and_cnv_import``. .. _example_denovo_pedigree: Preprocess the Family Data ++++++++++++++++++++++++++ The list of children in ``Supplementary_Data_1.tsv.gz`` contains a lot of data that is not relevant for the import. We are going to use only the first five columns from that file that look as follows: .. code-block:: bash gunzip -c Supplementary_Data_1.tsv.gz | head | cut -f 1-5 | less -S -x 20 .. code-block:: text collection familyId personId affected status sex SSC 11000 11000.p1 affected M SSC 11000 11000.s1 unaffected F SSC 11003 11003.p1 affected M SSC 11003 11003.s1 unaffected F SSC 11004 11004.p1 affected M SSC 11004 11004.s1 unaffected M SSC 11006 11006.p1 affected M SSC 11006 11006.s1 unaffected M SSC 11008 11008.p1 affected M * The first column contains the collection. This study includes data from the SSC and AGRE collections. We are going to import only variants from the SSC collection. * The second column contains the family ID. * The third column contains the person's ID. * The fourth column contains the affected status of the individual. * The fifth column contains the sex of the individual. We need a pedigree file describing the family's structure to import the data into GPF. The `SupplementaryData1_Children.tsv.gz` contains only the children; it does not include information about their parents. Fortunately for the SSC collection, it is not difficult to build the whole families' structures from the information we have. So, before starting the work on the import, we need to preprocess the list of children and transform it into a pedigree file. For the SSC collection, if you have a family with ID``, then the identifiers of the individuals in the family are going to be formed as follows: * mother - ``.mo``; * father - ``.fa``; * proband - ``.p1``; * first sibling - ``.s1``; * second sibling - ``.s2``. Another essential restriction for SSC is that the only affected person in the family is the proband. The affected status of the mother, father, and siblings is unaffected. Having this information, we can use the following `Awk` script to transform the list of children in a pedigree: .. code-block:: bash gunzip -c Supplementary_Data_1.tsv.gz | awk ' BEGIN { OFS="\t" print "familyId", "personId", "dadId", "momId", "status", "sex" } $1 == "SSC" { fid = $2 if( fid in families == 0) { families[fid] = 1 print fid, fid".mo", "0", "0", "unaffected", "F" print fid, fid".fa", "0", "0", "unaffected", "M" } print fid, $3, fid".fa", fid".mo", $4, $5 }' > ssc_denovo.ped If we run this script, it will read ``Supplementary_Data_1.tsv.gz`` and produce the appropriate pedigree file ``ssc_denovo.ped``. .. note:: The resulting pedigree file is also available in the `gpf-getting-started `_ repository under the subdirectory ``example_imports/denovo_and_cnv_import``. Here is a fragment from the resulting pedigree file: .. literalinclude:: gpf-getting-started/example_imports/denovo_and_cnv_import/ssc_denovo.ped :tab-width: 15 :lines: 1-11 Preprocess the SNP and INDEL de Novo variants +++++++++++++++++++++++++++++++++++++++++++++ The `Supplementary_Data_2.tsv.gz` file contains 255232 variants. For the import, we will use columns four and nine from this file: .. code-block:: bash gunzip -c Supplementary_Data_2.tsv.gz | head | cut -f 4,9 | less -S -x 20 .. code-block:: text personIds variant in VCF format 13210.p1 chr1:184268:G:A 12782.s1 chr1:191408:G:A 12972.s1 chr1:271774:AG:A 12420.p1 chr1:484721:AG:A 12518.p1,12518.s1 chr1:691130:T:C 13882.p1 chr1:738645:C:G 14039.s1 chr1:819832:G:T 13872.p1 chr1:824001:AAAAT:A Using the following `Awk` script, we can transform this file into easy to import the list of de Novo variants: .. code-block:: bash gunzip -c Supplementary_Data_2.tsv.gz | cut -f 4,9 | awk ' BEGIN{ OFS="\t" print "chrom", "pos", "ref", "alt", "person_id" } NR > 1 { split($2, v, ":") print v[1], v[2], v[3], v[4], $1 }' > ssc_denovo.tsv This script will produce a file named ``ssc_denovo.tsv`` with the following content: .. literalinclude:: gpf-getting-started/example_imports/denovo_and_cnv_import/ssc_denovo.tsv :tab-width: 15 :lines: 1-11 .. note:: The resulting ``ssc_denovo.tsv`` file is also available in the `gpf-getting-started `_ repository under the subdirectory ``example_imports/denovo_and_cnv_import/input_data``. Caching GRR +++++++++++ Now we are about to import 255K variants. During the import, the GPF system will annotate these variants using the GRR resources from our `public GRR. `_ For small studies with few variants, this approach is quite convenient. However, for larger studies, it is better to cache the GRR resources locally. To do this, we need to configure the GPF to use a local cache. Create a file named ``.grr_definition.yaml`` in your home directory with the following content: .. code-block:: yaml id: "seqpipe" type: "url" url: "https://grr.iossifovlab.com" cache_dir: "" The ``cache_dir`` parameter specifies the directory where the GRR resources will be cached. The cache directory should be specified as an absolute path. For example, ``/tmp/grr_cache`` or ``/Users/lubo/grrCache``. To download all the resources needed for our ``minimal_instance`` annotation, run the following command from the ``gpf-getting-started`` directory: .. code-block:: bash grr_cache_repo -i minimal_instance/gpf_instance.yaml .. note:: The ``grr_cache_repo`` command will download all the resources needed for the GPF instance. This may take a while, depending on your internet connection and the number of resources your configuration requires. The resources will be downloaded to the directory specified in the ``cache_dir`` parameter in the ``.grr_definition.yaml`` file. For the ``gpf-getting-started`` repository, the resources that will be downloaded are: * ``hg38/genomes/GRCh38-hg38`` * ``hg38/gene_models/MANE/1.3`` * ``hg38/variant_frequencies/gnomAD_4.1.0/genomes/ALL`` * ``hg38/scores/ClinVar_20240730`` The total size of the downloaded resources is about 15 GB. Data Import of ``ssc_denovo`` +++++++++++++++++++++++++++++ Now we have a pedigree file, ``ssc_denovo.ped``, and a list of de novo variants, ``ssc_denovo.tsv``. To import this data we need to prepare an import project. The import project is already available in the example imports directory ``example_imports/denovo_and_cnv_import/ssc_denovo.yaml``: .. literalinclude:: gpf-getting-started/example_imports/denovo_and_cnv_import/ssc_denovo.yaml :linenos: When importing genotype data, we often need to instruct the import tool how to split the import process into multiple jobs. For this purpose, we can use ``processing_config`` section of the import project. On lines 11-12 of the ``ssc_denovo.yaml`` file, we have defined the ``processing_config`` section that will split the import de Novo variants into jobs by chromosome. (For more on import project configuration, see :doc:`import_tool`.) .. note:: The project file ``ssc_denovo.yaml`` is available in the the ``gpf-getting-started`` repository under the subdirectory ``example_imports/denovo_and_cnv_import``. To import the study, from the ``gpf-getting-started`` directory we should run: .. code-block:: bash time import_genotypes -v -j 10 example_imports/denovo_and_cnv_import/ssc_denovo.yaml The ``-j 10`` option instructs the ``import_genotypes`` tool to use 10 threads and the ``-v`` option controls the verbosity of the output. This command will take a while to run. The time it takes to run will depend on the number of variants in the input file and the number of threads used for the import. .. note:: For example, on a MacBook Pro with the Apple M1 Pro chip, the import of the SSC de Novo variants took about 5 minutes: .. code-block:: bash real 5m29.950s user 31m52.320s sys 1m41.755s When the import finishes, we can run the development GPF server: .. code-block:: bash wgpf run In the `Home` page of the GPF instance, we should have the new study ``ssc_denovo``. .. figure:: getting_started_files/ssc_denovo_home_page.png The home page has the imported SSC de Novo study. If you follow the link to the study and choose the `Genotype Browser` tab, you will be able to query the imported variants. .. figure:: getting_started_files/ssc_denovo_genotype_browser.png Genotype browser for the SSC de novo variants. Configure preview and download columns ++++++++++++++++++++++++++++++++++++++ While importing the SSC de novo variants, we used the annotation defined in the minimal instance configuration file. So, all imported variants are annotated with GnomAD and ClinVar genomic scores. We can use these score values to define additional columns in the preview table and the download file similar to the :ref:`getting_started_with_preview_columns`. Edit the ``ssc_denovo`` configuration file located at ``minimal_instance/studies/ssc_denovo/ssc_denovo.yaml`` and add the following snippet to the configuration file: .. code-block:: yaml :linenos: genotype_browser: column_groups: frequency: name: frequency columns: - allele_freq - gnomad_v4_genome_ALL_af clinvar: name: ClinVar columns: - CLNSIG - CLNDN preview_columns_ext: - clinvar Now, restart the GPF development server: .. code:: bash wgpf run Go to the `Genotype Browser` tab of the ``ssc_denovo`` study and click `Preview Table` button. The preview table should now contain the additional columns for `GnomAD` and `ClinVar` genomic scores. .. figure:: getting_started_files/ssc_denovo_genotype_browser_with_annotated_columns.png Genotype browser with additional columns for GnomAD and ClinVar genomic scores.