Working With VCF Files Guide
Import data from the “1000 Genome Project”
The data used in this guide is from the “1000 Genome Project”. For more information, visit IGSR.
Begin by making a new directory, in which you can download and create files:
mkdir 1KGP
Navigate to it:
cd 1KGP
Download the data:
wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/supporting/related_samples_vcf/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5_related_samples.20130502.genotypes.vcf.gz
wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20130606_sample_info/20130606_sample_info.xlsx
The three downloaded files are:
ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
- contains the variants data of the individualsALL.chr1.phase3_shapeit2_mvncall_integrated_v5_related_samples.20130502.genotypes.vcf.gz
- contains the variants data of the related individuals20130606_sample_info.xlsx
- contains information about all the examined individuals
Creating the pedigree file
Information about the individual’s relationships within their family can be found
in the spreadsheet file 20130606_sample_info.xlsx
, in the ‘Sample Info’ tab.
Let’s create a pedigree file for the family with id “PR05”. For more information
on working with pedigree files, refer to the
Working With Pedigrees Guide.
First, create the pedigree file:
touch PR05.ped
Then open it in a text editor and add the necessary columns - familyId, personId, momId, dadId, sex, status and role. Fill in the individuals and the values in each column, by referring to the spreadshee’s information. After the editing, the pedigree file should look like this:
familyId personId momId dadId sex status role
PR05 HG00731 0 0 M unspecified dad
PR05 HG00732 0 0 F unspecified mom
PR05 HG00733 HG00732 HG00731 M unspecified prb
Warning
The columns should be separated by tabs, not spaces.
Next, you have to standardize the pedigree file, using the ped2ped.py
tool:
ped2ped.py PR05.ped -o PR05_standardized.ped --ped-layout-mode generate
This command will generate a new pedigree file - PR05_standardized.ped
with
two newly added columns - sampleId and layout, which will be used
by the GPF system. Now the pedigree file is ready for importing.
Creating the VCF files
To extract the variant data for the individuals HG00731, HG00732 and HG00733 in a separate files, we will use Bcftools.
Let’s start with individual HG00733, whose data is in the
ALL.chr1.phase3_shapeit2_mvncall_integrated_v5_related_samples.20130502.genotypes.vcf.gz
file.
Using bcftools’ view --samples
argument, we can get the data for a specific individual.
Adding > HG00733.vcf in the end of the command will redirect the command’s result
into a new file, named HG00733.vcf
:
bcftools view \
--samples HG00733 \
ALL.chr1.phase3_shapeit2_mvncall_integrated_v5_related_samples.20130502.genotypes.vcf.gz \
> HG00733.vcf
The data for individuals HG00731 and HG00732 is in the second vcf file -
ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
.
To extract the variants data for the other two individuals in
a file named HG00731_HG00732.vcf
, run:
bcftools view \
--samples HG00731,HG00732 \
ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
> HG00731_HG00732.vcf
Importing the data into GPF
To import the collected data into the GPF system, it’s recommended to use the
impala_batch_import.py
tool. To do so, run:
impala_batch_import.py PR05.ped \
--vcf-files HG00731_HG00732.vcf HG00733.vcf \
--gs genotype_impala \
--id 1KGP \
-o parquet
Note
To see a list of it’s commands, use:
impala_batch_import.py --help
Navigate to the newly created parquet directory:
cd parquet
and run this command to initiate the importing:
make -j 10
This command will take some time to complete.
Afer it’s done, run the GPF web server:
wdaemanage.py runserver 0.0.0.0:8000
Now you should be able to see the “1KGP” dataset. To view the imported variants, navigate to the genotype_browser_ui tab and click on the Table Preview button.