Working With Pedigree Files Guide

Brief outline

The GPF system has flexible support for various input formats of pedigree files. When loaded, the pedigree is kept in an internal representation. The ped2ped.py tool loads an input pedigree file and stores it in the canonical GPF pedigree representation.

Additionally, GPF has a tool, draw_pedigree.py, that can generate a PDF file with drawings of all pedigrees loaded from the input.

Input pedigree file

A pedigree file (usually found with a .ped extension) is a text file with delimiter-separated values (comma, tab, etc.). Each row in this file describes an individual.

The file should contain the following columns:

Family ID:: Contains family ID for the individual.
Person ID:: Contains person ID for the individual.
Father ID:: Contains the person ID of the individual’s father. If the individual is without a specified father this column contains 0 or is left empty.
Mother ID:: Contains the person ID of the individual’s mother. If the individual is without a specified mother this column contains 0 or is left empty.
Sex:: The sex of the individual.
Status:: The affected status of the individual - whether they are affected or not.

The pedigree file can contain additional columns that specify other attributes of the individuals.
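As an illustration of the file structure described above, here is a minimal sketch that reads a tab-separated pedigree with Python's standard csv module and lists the individuals without specified parents. It is not the GPF loader. The function names read_pedigree and has_parent are hypothetical, and the sample rows use the canonical column names (familyId, personId, momId, dadId, sex, status) as an assumption.

```python
import csv
import io

# Hypothetical sample content; in practice this would come from a .ped file.
PED_TEXT = (
    "familyId\tpersonId\tmomId\tdadId\tsex\tstatus\n"
    "f1\tf1.01\t0\t0\tF\tunaffected\n"
    "f1\tf1.02\t0\t0\tM\tunaffected\n"
    "f1\tf1.03\tf1.01\tf1.02\tF\taffected\n"
)


def read_pedigree(handle, sep="\t"):
    """Return one dict per individual, keyed by column name."""
    return list(csv.DictReader(handle, delimiter=sep))


def has_parent(value):
    """A parent ID of 0 or an empty string means 'not specified'."""
    return value not in ("", "0")


persons = read_pedigree(io.StringIO(PED_TEXT))

# Individuals for whom neither mother nor father is specified.
founders = [
    p["personId"]
    for p in persons
    if not has_parent(p["momId"]) and not has_parent(p["dadId"])
]
print(founders)
```

For a comma-separated file, the same sketch would be called with sep=",".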

Canonical pedigree file

The canonical pedigree file matches closely the internal representation of the pedigree data. Each line in the pedigree file specifies the attributes of an individual.

The columns in a canonical pedigree file are:

familyId:

Contains IDs for families included in the pedigree file.

personId:

Contains IDs for individuals included in the pedigree file. The assumption is that person IDs are unique across the whole pedigree file.

momId:

Contains the ID of the individual’s mother. If a mother is not specified this column should contain 0 or should be left empty.

dadId:

Contains the ID of the individual’s father. If a father is not specified this column should contain 0 or should be left empty.

sex:

The sex of the individual. Supported values for the sex column are described in Supported values for sex.

status:

The affected status of the individual. Supported values for the status column are described in Supported values for status.

role:

The role of the individual. Supported values for the role column are described in Supported values for role.

sampleId:

This column is used to map an individual to the sample ID used in the genotype files (e.g. VCF files). If this column is not specified in the input pedigree, it will be created and its values will coincide with the personId column.

layout:

This column is optional. The supported format is the following: <rank>:<x>,<y>, where

<rank> is the rank of the individual in the family, where the individuals from the earliest generation in the family have a rank of 1, the individuals from the next generation have a rank of 2, etc. For example, in a nuclear family, the mother and father have a rank of 1 and the children have a rank of 2.
<x> is the x-coordinate of the individual icon.
<y> is the y-coordinate of the individual icon.

generated:

This column specifies if a given individual is generated. The supported values in this column are True and False.

When the pedigree file contains incomplete families, the GPF tools add individuals to make each family complete.

For example, if a family contains only two individuals, a mother and a proband, GPF adds a father to this family so that the family can be visualized properly.

These additional individuals are marked as generated and are not used in any downstream analysis. Their use is purely for visualization purposes.

When the input pedigree contains any additional columns the GPF tools keep these columns in the canonical representation.
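The layout format and the sampleId default described above can be sketched as follows. This is only an illustration, not GPF code: the names Layout, parse_layout and with_sample_id are hypothetical, and treating the rank as an integer and the coordinates as floating-point numbers is an assumption.

```python
from typing import NamedTuple


class Layout(NamedTuple):
    rank: int   # generation rank, 1 for the earliest generation
    x: float    # x-coordinate of the individual's icon
    y: float    # y-coordinate of the individual's icon


def parse_layout(value: str) -> Layout:
    """Parse a '<rank>:<x>,<y>' string; raises ValueError if malformed."""
    rank_part, coords = value.split(":", 1)
    x_part, y_part = coords.split(",", 1)
    return Layout(int(rank_part), float(x_part), float(y_part))


def with_sample_id(person: dict) -> dict:
    """Return a copy of the record whose sampleId defaults to personId."""
    result = dict(person)
    result.setdefault("sampleId", result["personId"])
    return result


# Hypothetical values, chosen only to show the shape of the data.
layout = parse_layout("2:31.5,60.0")
person = with_sample_id({"familyId": "f1", "personId": "f1.03"})
```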

Possible input pedigree file structures

When the input pedigree file is given to the GPF system it tries to transform it into the canonical representation described in Canonical pedigree file.

The GPF system uses individuals’ roles for various queries. When the role column is not present in the input pedigree file, the GPF system tries to deduce the role of each individual with respect to the family’s proband.

The GPF system has different strategies to infer the role of each individual. Which strategy to use depends on the input data.

Plain pedigree (familyId, personId, momId, dadId, sex, status)

Often, the pedigree does not contain a role column. In this case the GPF system uses the following approach:

Assign the role proband to the first affected child in each family.
The roles of all other members in the family are inferred with respect to the proband.

Note

If no proband is found, all the roles will be set to unknown.

Example: simple pedigree file

Let’s say we have the following input pedigree file:

familyId	personId	momId	dadId	sex	status
f1	f1.01	0	0	F	unaffected
f1	f1.02	0	0	M	unaffected
f1	f1.03	f1.01	f1.02	F	affected
f1	f1.04	f1.01	f1.02	M	affected
f1	f1.05	0	0	M	unaffected
f1	f1.06	f1.01	f1.05	F	unaffected

To assign roles to the members of family f1, the GPF system looks for the first affected child in the family. This is f1.03, so this individual gets the role proband. The mother and father of f1.03 get the roles mom and dad: f1.01 gets the role mom and f1.02 gets the role dad. The sibling of f1.03, f1.04, gets the role sib.

This process continues until all individuals in the family have their roles set.

familyId	personId	momId	dadId	sex	status	role
f1	f1.01	0	0	F	unaffected	mom
f1	f1.02	0	0	M	unaffected	dad
f1	f1.03	f1.01	f1.02	F	affected	prb
f1	f1.04	f1.01	f1.02	M	affected	sib
f1	f1.05	0	0	M	unaffected	step_dad
f1	f1.06	f1.01	f1.05	F	unaffected	maternal_half_sibling
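The inference walked through above can be approximated with the following sketch. It is a simplified illustration under stated assumptions, not the GPF implementation: the function infer_roles is hypothetical, a "child" is assumed to be an individual with at least one specified parent, and only the roles that occur in this example (prb, mom, dad, sib, half siblings and step parents) are handled.

```python
def has_parent(value):
    """A parent ID of 0 or an empty string means 'not specified'."""
    return value not in ("", "0")


def infer_roles(family):
    """family: list of dicts for one family, in file order."""
    roles = {p["personId"]: "unknown" for p in family}
    # Assumption: the proband is the first affected individual with a parent.
    proband = next(
        (p for p in family
         if p["status"] == "affected"
         and (has_parent(p["momId"]) or has_parent(p["dadId"]))),
        None,
    )
    if proband is None:
        return roles  # no proband found: every role stays unknown
    roles[proband["personId"]] = "prb"
    mom, dad = proband["momId"], proband["dadId"]
    if mom in roles:
        roles[mom] = "mom"
    if dad in roles:
        roles[dad] = "dad"
    for p in family:
        pid = p["personId"]
        if roles[pid] != "unknown":
            continue
        same_mom = has_parent(mom) and p["momId"] == mom
        same_dad = has_parent(dad) and p["dadId"] == dad
        if same_mom and same_dad:
            roles[pid] = "sib"
        elif same_mom:
            roles[pid] = "maternal_half_sibling"
        elif same_dad:
            roles[pid] = "paternal_half_sibling"
    # The other parent of a half sibling is treated as a step parent.
    for p in family:
        role = roles[p["personId"]]
        if role == "maternal_half_sibling" and roles.get(p["dadId"]) == "unknown":
            roles[p["dadId"]] = "step_dad"
        elif role == "paternal_half_sibling" and roles.get(p["momId"]) == "unknown":
            roles[p["momId"]] = "step_mom"
    return roles


COLUMNS = ("personId", "momId", "dadId", "sex", "status")
ROWS = [
    ("f1.01", "0", "0", "F", "unaffected"),
    ("f1.02", "0", "0", "M", "unaffected"),
    ("f1.03", "f1.01", "f1.02", "F", "affected"),
    ("f1.04", "f1.01", "f1.02", "M", "affected"),
    ("f1.05", "0", "0", "M", "unaffected"),
    ("f1.06", "f1.01", "f1.05", "F", "unaffected"),
]
family = [dict(zip(COLUMNS, row)) for row in ROWS]
roles = infer_roles(family)
```

Applied to the example family, the sketch assigns the same roles as the output table above.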

Pedigree with proband column (familyId, personId, momId, dadId, sex, status, prb)

When the strategy described in Plain pedigree (familyId, personId, momId, dadId, sex, status) is not appropriate, GPF can use a pedigree file with a proband column that specifies which individual in the family has the role proband.

The first individual in the family for whom the proband column has the value True receives the role proband.
The roles of all other individuals are inferred with respect to the proband.

Note

If no proband is indicated, the tools fall back to the strategy described in Plain pedigree (familyId, personId, momId, dadId, sex, status).

Note

If more than one proband is selected, the role prb is assigned to the first of them and the rest of the roles are inferred with respect to the first (in the pedigree file) proband.

Example: pedigree file with prb column

Let’s say we have the following input pedigree file:

familyId	personId	momId	dadId	sex	status	prb
f1	f1.01	0	0	F	unaffected	0
f1	f1.02	0	0	M	unaffected	0
f1	f1.03	f1.01	f1.02	F	affected	0
f1	f1.04	f1.01	f1.02	M	affected	1
f1	f1.05	0	0	M	unaffected	0
f1	f1.06	f1.01	f1.05	F	unaffected	0

Note the prb column, which specifies which individual has the role proband. Here f1.04 receives the role prb. The mother and father of f1.04 get the roles mom and dad: f1.01 gets the role mom and f1.02 gets the role dad. The sibling of f1.04, f1.03, gets the role sib.

This process continues until all individuals in the family have their roles set.

familyId	personId	momId	dadId	sex	status	role
f1	f1.01	0	0	F	unaffected	mom
f1	f1.02	0	0	M	unaffected	dad
f1	f1.03	f1.01	f1.02	F	affected	sib
f1	f1.04	f1.01	f1.02	M	affected	prb
f1	f1.05	0	0	M	unaffected	step_dad
f1	f1.06	f1.01	f1.05	F	unaffected	maternal_half_sibling
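The proband selection rules from this section (use the first individual flagged in the proband column, otherwise fall back to the first affected child) can be sketched as below. The function pick_proband is hypothetical and is not part of GPF. Which strings count as a true value is an assumption: the text above mentions True and the example uses 1, so the sketch accepts "1", "true" and "yes" case-insensitively.

```python
TRUE_VALUES = {"1", "true", "yes"}  # assumption, not taken from GPF


def has_parent(value):
    """A parent ID of 0 or an empty string means 'not specified'."""
    return value not in ("", "0")


def pick_proband(family, proband_column=None):
    """Return the personId of the proband, or None if there is none."""
    if proband_column is not None:
        for p in family:
            flag = p.get(proband_column, "").strip().lower()
            if flag in TRUE_VALUES:
                return p["personId"]  # the first flagged individual wins
    # Fallback: the plain-pedigree strategy (first affected child).
    for p in family:
        is_child = has_parent(p["momId"]) or has_parent(p["dadId"])
        if is_child and p["status"] == "affected":
            return p["personId"]
    return None


COLUMNS = ("personId", "momId", "dadId", "sex", "status", "prb")
ROWS = [
    ("f1.01", "0", "0", "F", "unaffected", "0"),
    ("f1.02", "0", "0", "M", "unaffected", "0"),
    ("f1.03", "f1.01", "f1.02", "F", "affected", "0"),
    ("f1.04", "f1.01", "f1.02", "M", "affected", "1"),
    ("f1.05", "0", "0", "M", "unaffected", "0"),
    ("f1.06", "f1.01", "f1.05", "F", "unaffected", "0"),
]
family = [dict(zip(COLUMNS, row)) for row in ROWS]

with_column = pick_proband(family, proband_column="prb")
without_column = pick_proband(family)
```

With the prb column the sketch selects f1.04, as in the example. Without it, the fallback selects f1.03, the first affected child.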

Pedigree with role column (familyId, personId, momId, dadId, sex, status, role)

When a role column is defined in the input pedigree it becomes the source of truth about individuals’ roles. Whatever is saved in this column is interpreted as the role of the individual.

Example: pedigree with role column

familyId	personId	momId	dadId	sex	status	role
f1	f1.01	0	0	F	unaffected	mom
f1	f1.02	0	0	M	unaffected	dad
f1	f1.03	f1.01	f1.02	F	affected	prb
f1	f1.04	f1.01	f1.02	M	affected	sib
f1	f1.05	0	0	M	unaffected	step_dad
f1	f1.06	f1.01	f1.05	F	unaffected	maternal_half_sibling

Full canonical pedigree

The canonical pedigree file contains the role column, so the GPF system uses this column to assign the role of each individual.

Todo

The loader will report an ERROR if the role is not one of the recognized names or synonyms.

The loader will output a WARNING if no proband is assigned for a family (can be suppressed with an argument???) OR consider it an ERROR condition that can be suppressed with an argument.

The loader will output a WARNING if more than one proband is assigned for a family?? (can be suppressed with an argument???)

Preparing the pedigree data

The pedigree data may require preparation beforehand. This section describes the requirements for pedigree data that must be met to use the tools.

In some cases, the initial pedigree file must be expanded with additional individuals to correctly form some families. Following that, the existing individuals must be linked to their parents among the newly added individuals.

The values in the sex, status and role columns of the file must be supported by the GPF system. The supported values are listed in Supported values for sex, Supported values for status and Supported values for role.

Also, these properties support synonyms, which are listed in the tables below:

Supported values for sex

Sex column canonical values	Synonyms (case insensitive)
F	female, F, 2
M	male, M, 1
U	unspecified, U, 0

Supported values for status

Status column canonical values	Synonyms (case insensitive)
affected	affected, 2
unaffected	unaffected, 1
unspecified	unspecified, -, 0

Supported values for role

Role column canonical values	Synonyms (case insensitive)
prb	proband, prb
sib	sibling, younger sibling, older sibling, sib
maternal_grandmother	maternal grandmother, maternal_grandmother
maternal_grandfather	maternal grandfather, maternal_grandfather
paternal_grandmother	paternal grandmother, paternal_grandmother
paternal_grandfather	paternal grandfather, paternal_grandfather
mom	mom, mother
dad	dad, father
child	child
maternal_half_sibling	maternal half sibling, maternal_half_sibling
paternal_half_sibling	paternal half sibling, paternal_half_sibling
half_sibling	half sibling, half_sibling
maternal_aunt	maternal aunt, maternal_aunt
maternal_uncle	maternal uncle, maternal_uncle
paternal_aunt	paternal aunt, paternal_aunt
paternal_uncle	paternal uncle, paternal_uncle
maternal_cousin	maternal cousin, maternal_cousin
paternal_cousin	paternal cousin, paternal_cousin
step_mom	step mom, step_mom, step mother
step_dad	step dad, step_dad, step father
spouse	spouse
unknown	unknown
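The synonym tables above lend themselves to a simple case-insensitive lookup. The following sketch covers the sex and status tables and is meant only as an illustration: the names build_lookup and normalize are hypothetical, and raising ValueError for an unsupported value is an assumption about error handling. The role table could be handled the same way.

```python
# Canonical value -> synonyms, copied from the sex and status tables above.
SEX_SYNONYMS = {
    "F": ["female", "F", "2"],
    "M": ["male", "M", "1"],
    "U": ["unspecified", "U", "0"],
}
STATUS_SYNONYMS = {
    "affected": ["affected", "2"],
    "unaffected": ["unaffected", "1"],
    "unspecified": ["unspecified", "-", "0"],
}


def build_lookup(table):
    """Invert a synonym table into a lowercase synonym -> canonical map."""
    return {
        synonym.lower(): canonical
        for canonical, synonyms in table.items()
        for synonym in synonyms
    }


SEX_LOOKUP = build_lookup(SEX_SYNONYMS)
STATUS_LOOKUP = build_lookup(STATUS_SYNONYMS)


def normalize(value, lookup):
    """Map a raw column value to its canonical form, ignoring case."""
    key = value.strip().lower()
    if key not in lookup:
        raise ValueError(f"unsupported value: {value!r}")
    return lookup[key]


sex = normalize("Female", SEX_LOOKUP)
status = normalize("2", STATUS_LOOKUP)
```

Note that the same raw value can mean different things in different columns: 2 is F in the sex column and affected in the status column, so each column needs its own lookup.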

Common arguments for the pedigree tools

positional arguments:

<families filename> families filename in pedigree or simple family format

optional arguments:

--ped-family PED_FAMILY: specify the name of the column in the pedigree file that holds the ID of the family the person belongs to [default: familyId]
--ped-person PED_PERSON: specify the name of the column in the pedigree file that holds the person’s ID [default: personId]
--ped-mom PED_MOM: specify the name of the column in the pedigree file that holds the ID of the person’s mother [default: momId]
--ped-dad PED_DAD: specify the name of the column in the pedigree file that holds the ID of the person’s father [default: dadId]
--ped-sex PED_SEX: specify the name of the column in the pedigree file that holds the sex of the person [default: sex]
--ped-status PED_STATUS: specify the name of the column in the pedigree file that holds the status of the person [default: status]
--ped-role PED_ROLE: specify the name of the column in the pedigree file that holds the role of the person [default: role]
--ped-no-role: indicates that the provided pedigree file has no role column. If this argument is provided, the import tool will guess the roles of individuals and write them in a “role” column.
--ped-proband PED_PROBAND: specify the name of the column in the pedigree file that specifies persons with role proband; this column is used only when option --ped-no-role is specified. [default: None]
--ped-no-header: indicates that the provided pedigree file has no header. The pedigree column arguments will accept indices if this argument is given. [default: False]
--ped-file-format PED_FILE_FORMAT: Families file format. It should be pedigree, or simple for the simple family format [default: pedigree]
--ped-layout-mode PED_LAYOUT_MODE: Layout mode specifies how the pedigree drawing of each family is handled. Available options are generate and load. When the layout mode is set to generate, the loader tries to generate a layout for each family pedigree. When load is specified, the loader tries to load the layout from the layout column of the pedigree. [default: load]
--ped-sep PED_SEP: Families file field separator [default: \t]
-o OUTPUT_FILENAME: specify the name of the output file

Transform a pedigree file into canonical GPF form

To transform a pedigree file into canonical GPF form you can use the ped2ped.py tool. To see the tool’s full functionality use:

ped2ped.py --help

To demonstrate how it works, we will use the sample data. To standardize the example_families.ped file use:

ped2ped.py example_families.ped \
--ped-layout-mode generate -o example_family_standardized.ped

The output example_family_standardized.ped file has two newly generated columns - sampleId and layout, which are used by the GPF system.

The ped2ped.py tool can also process pedigree files with noncanonical column names. For such cases, it has arguments that can be used to specify which column contains the family id, role, status, sex, etc. For example, see the case of the example_families_with_noncanonical_column_names.ped file:

ped2ped.py example_families_with_noncanonical_column_names.ped \
--ped-family Family_id --ped-person Person_id --ped-dad Dad_id --ped-mom Mom_id \
--ped-sex Sex --ped-status Status --ped-role Role \
--ped-layout-mode generate -o example_families_from_noncanonical_column_names.ped

The ped2ped.py tool can also process pedigree files without headers. One such file is example_families_without_header.ped. In this case, we have to map each column index to a specific column name. In the same way that we mapped ‘Family_id’ to the family ID column in the previous example, here we map the first column to the family ID. Keep in mind that column indices begin from 0. See the example below:

ped2ped.py example_families_without_header.ped \
--ped-no-header --ped-family 0 --ped-person 1 --ped-dad 2 --ped-mom 3 \
--ped-sex 4 --ped-status 5 --ped-role 6 \
--ped-layout-mode generate -o example_families_from_no_header.ped
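To make the index mapping concrete, the sketch below reads a headerless, tab-separated file using the same column positions as the command above (family 0, person 1, dad 2, mom 3, sex 4, status 5, role 6). It illustrates the idea only and is not how ped2ped.py is implemented: the function read_headerless and the sample rows are hypothetical.

```python
import csv
import io

# Zero-based column positions, matching the --ped-* indices used above.
COLUMN_INDEX = {
    "familyId": 0,
    "personId": 1,
    "dadId": 2,
    "momId": 3,
    "sex": 4,
    "status": 5,
    "role": 6,
}

# Hypothetical headerless content: every line is a data row.
PED_TEXT = (
    "f1\tf1.01\t0\t0\tF\tunaffected\tmom\n"
    "f1\tf1.02\t0\t0\tM\tunaffected\tdad\n"
    "f1\tf1.03\tf1.02\tf1.01\tF\taffected\tprb\n"
)


def read_headerless(handle, column_index, sep="\t"):
    """Build one dict per row by picking fields at the given positions."""
    persons = []
    for row in csv.reader(handle, delimiter=sep):
        if not row:
            continue  # skip blank lines
        persons.append({name: row[i] for name, i in column_index.items()})
    return persons


persons = read_headerless(io.StringIO(PED_TEXT), COLUMN_INDEX)
print(persons[2]["dadId"], persons[2]["momId"])
```

Because the father column comes before the mother column in this file, swapping the two indices would silently exchange the parents, which is why the mapping has to be given explicitly.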

Visualize a pedigree file into a PDF file

To visualize a pedigree file as a PDF file containing drawings of the family pedigrees, you can use the draw_pedigree.py tool. To see its full functionality use:

draw_pedigree.py --help

Notice that it shares a lot of common flags with the ped2ped.py tool. Similar to the ped2ped.py tool, it can also process pedigree files with noncanonically named columns or without a header.

In addition to that, it has a --mode flag, which supports two values:

report
the tool will generate a family pedigree drawing for each unique family structure
families
the tool will generate a family pedigree drawing for every individual family

To demonstrate how to use the draw_pedigree.py tool we will visualize the example_families.ped file:

draw_pedigree.py example_families.ped -o example_families_visualization.pdf

This command outputs the example_families_visualization.pdf file with the pedigree drawings.