dae.parquet package

Submodules

dae.parquet.helpers module

dae.parquet.helpers.merge_parquets(in_files: list[str], out_file: str, *, delete_in_files: bool = True, row_group_size: int = 50000, parquet_version: str | None = None) None[source]

Merge in_files into one large file called out_file.

dae.parquet.helpers.url_to_pyarrow_fs(filename: str, filesystem: AbstractFileSystem | None = None) FileSystem[source]

Turn a URL into a pyarrow filesystem instance.

Parameters

filename : str
    The fsspec-compatible URL.

filesystem : fsspec.AbstractFileSystem, optional
    An fsspec filesystem for filename.

Returns

filesystem : pyarrow.fs.FileSystem
    The new filesystem discovered from filename and filesystem.

dae.parquet.parquet_writer module

dae.parquet.parquet_writer.add_missing_parquet_fields(pps: schema, ped_df: DataFrame) tuple[DataFrame, schema][source]

Add missing parquet fields.

dae.parquet.parquet_writer.append_meta_to_parquet(meta_filename: str, key: list[str], value: list[str]) None[source]

Append key-value pair to meta data parquet file.

dae.parquet.parquet_writer.collect_pedigree_parquet_schema(ped_df: DataFrame) Schema[source]

Build the pedigree parquet schema.

dae.parquet.parquet_writer.fill_family_bins(families: FamiliesData, partition_descriptor: PartitionDescriptor | None = None) None[source]

Assign family bins to the families according to the partition descriptor.

dae.parquet.parquet_writer.merge_variants_parquets(partition_descriptor: PartitionDescriptor, variants_dir: str, partitions: list[tuple[str, str]], row_group_size: int = 25000, parquet_version: str | None = None) None[source]

Merge parquet files in variants_dir.

dae.parquet.parquet_writer.pedigree_parquet_schema() schema[source]

Return the schema for pedigree parquet file.

dae.parquet.parquet_writer.save_ped_df_to_parquet(ped_df: DataFrame, filename: str, filesystem: AbstractFileSystem | None = None, parquet_version: str | None = None) None[source]

Save ped_df as a parquet file named filename.

dae.parquet.parquet_writer.serialize_summary_schema(annotation_attributes: list[AttributeInfo], partition_descriptor: PartitionDescriptor) str[source]

Serialize the summary schema.

dae.parquet.parquet_writer.serialize_variants_data_schema(annotation_attributes: list[AttributeInfo]) str[source]

Serialize the variants data schema.

dae.parquet.partition_descriptor module

class dae.parquet.partition_descriptor.PartitionDescriptor(chromosomes: list[str] | None = None, region_length: int = 0, family_bin_size: int = 0, coding_effect_types: list[str] | None = None, rare_boundary: float = 0)[source]

Bases: object

Class to represent the partitioning of a genotype dataset.

dataset_family_partition() list[tuple[str, str]][source]

Build family dataset partition for table creation.

When creating an Impala or BigQuery table it is helpful to have the list of partitions and types used in the parquet dataset.

dataset_summary_partition() list[tuple[str, str]][source]

Build summary parquet dataset partition for table creation.

When creating an Impala or BigQuery table it is helpful to have the list of partitions and types used in the parquet dataset.

family_partition(allele: FamilyAllele, *, seen_as_denovo: bool) list[tuple[str, str]][source]

Produce family partition for an allele.

The partition is returned as a list of tuples consisting of the partition name and its value.

Example:

    [("region_bin", "chr1_0"), ("frequency_bin", "0"), ("coding_bin", "1"), ("family_bin", "1")]

get_variant_partitions(chrom_lens: dict[str, int]) tuple[list[list[tuple[str, str]]], list[list[tuple[str, str]]]][source]

Return the output summary and family variant partition names.

has_coding_bins() bool[source]
has_family_bins() bool[source]
has_family_partitions() bool[source]

Check if partitions applicable to family alleles are defined.

has_frequency_bins() bool[source]
has_partitions() bool[source]

Equivalent to the has_family_partitions method.

has_region_bins() bool[source]
has_summary_partitions() bool[source]

Check if partitions applicable to summary alleles are defined.

make_coding_bin(effect_types: Iterable[str]) int[source]

Produce coding bin for given list of effect types.

make_family_bin(family_id: str) int[source]

Produce family bin for given family ID.

make_frequency_bin(allele_count: int, allele_freq: float, *, is_denovo: bool = False) str[source]

Produce the frequency bin from the allele count, allele frequency, and de novo flag.
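The bin values "0" through "3" seen in the partition examples suggest a tiered classification. The sketch below is one plausible reading; the exact thresholds and ordering used by PartitionDescriptor are assumptions here, not taken from the GPF source.

```python
def make_frequency_bin(allele_count, allele_freq, rare_boundary, *, is_denovo=False):
    # Assumed tiering: de novo, ultra-rare, rare (below rare_boundary,
    # a percentage), then common. The real thresholds may differ.
    if is_denovo:
        return "0"
    if allele_count <= 1:
        return "1"
    if allele_freq < rare_boundary:
        return "2"
    return "3"
```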

make_region_bin(chrom: str, pos: int) str[source]

Produce region bin for given chromosome and position.
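Region bins like "chr1_0" in the examples above pair the chromosome with the index of a fixed-length window. A minimal sketch, assuming that scheme (the exact windowing arithmetic in the GPF source may differ):

```python
def make_region_bin(chrom, pos, region_length):
    # Assumed scheme: chromosome name plus the index of the
    # region_length-sized window containing pos.
    return f"{chrom}_{pos // region_length}"
```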

static parse(path_name: Path | str) PartitionDescriptor[source]

Parse partition description from a file.

When the file name has a .conf suffix or no suffix, the file is assumed to be in Python config format and is parsed using the Python ConfigParser class.

When the file name has a .yaml suffix, the file is parsed using the YAML parser.
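The suffix dispatch described above can be sketched with the standard library's configparser; the YAML branch would use PyYAML. This is an illustrative sketch (the function name parse_partition_config and the plain-dict return value are assumptions), not the PartitionDescriptor.parse implementation.

```python
import configparser
from pathlib import Path

def parse_partition_config(path_name):
    # Dispatch on the file suffix, as the parse() docs describe.
    path = Path(path_name)
    text = path.read_text()
    if path.suffix in ("", ".conf"):
        parser = configparser.ConfigParser()
        parser.read_string(text)
        # Return plain nested dicts, e.g. {"region_bin": {"region_length": "10"}}.
        return {section: dict(parser[section]) for section in parser.sections()}
    if path.suffix == ".yaml":
        import yaml  # requires PyYAML
        return yaml.safe_load(text)
    raise ValueError(f"unsupported partition config: {path_name}")
```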

static parse_dict(config_dict: dict[str, Any]) PartitionDescriptor[source]

Parse a configuration dictionary and create a partition descriptor.

static parse_string(content: str, content_format: str = 'conf') PartitionDescriptor[source]

Parse partition description from a string.

The supported formats are the Python config format and YAML. Example string content should be as follows.

Example Python config format:

    [region_bin]
    chromosomes = chr1, chr2
    region_length = 10

    [frequency_bin]
    rare_boundary = 5.0

    [coding_bin]
    coding_effect_types = frame-shift,splice-site,nonsense,missense

    [family_bin]
    family_bin_size = 10

Example YAML format:

    region_bin:
        chromosomes: chr1, chr2
        region_length: 10
    frequency_bin:
        rare_boundary: 5.0
    coding_bin:
        coding_effect_types: frame-shift,splice-site,nonsense,missense
    family_bin:
        family_bin_size: 10

static partition_directory(output_dir: str, partition: list[tuple[str, str]]) str[source]

Construct a partition dataset directory.

Given a partition in the format returned by the summary_partition or family_partition methods, this function constructs the directory name corresponding to the partition.
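Partitioned parquet datasets conventionally use Hive-style name=value directory segments; a sketch under that assumption (the real layout produced by partition_directory is not shown here):

```python
def partition_directory(output_dir, partition):
    # Join Hive-style name=value segments onto the output directory.
    segments = [f"{name}={value}" for name, value in partition]
    return "/".join([output_dir.rstrip("/"), *segments])
```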

static partition_filename(prefix: str, partition: list[tuple[str, str]], bucket_index: int | None) str[source]

Construct a partition dataset base filename.

Given a partition in the format returned by the summary_partition or family_partition methods, this function constructs the file name corresponding to the partition.

static path_to_partitions(raw_path: str) list[tuple[str, str]][source]

Convert a path into the partitions it is composed of.
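Assuming the Hive-style name=value layout common to partitioned parquet datasets, the inverse operation is a matter of splitting path segments. An illustrative sketch, not the GPF implementation:

```python
def path_to_partitions(raw_path):
    # Recover (name, value) pairs from name=value path segments,
    # skipping segments (such as file names) that carry no "=".
    return [
        tuple(segment.split("=", 1))
        for segment in raw_path.strip("/").split("/")
        if "=" in segment
    ]
```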

region_to_bins(region: Region, chrom_lens: dict[str, int]) list[tuple[str, str]][source]

Provide a list of bins the given region intersects.

schema1_partition(allele: FamilyAllele) list[tuple[str, str]][source]

Produce Schema1 family partition for an allele.

The partition is returned as a list of tuples consisting of the partition name and its value.

Example:

    [("region_bin", "chr1_0"), ("frequency_bin", "0"), ("coding_bin", "1"), ("family_bin", "1")]

serialize(output_format: str = 'conf') str[source]

Serialize a partition descriptor into a string.

summary_partition(allele: SummaryAllele, *, seen_as_denovo: bool) list[tuple[str, str]][source]

Produce summary partition for an allele.

The partition is returned as a list of tuples consisting of the partition name and its value.

Example:

    [("region_bin", "chr1_0"), ("frequency_bin", "0"), ("coding_bin", "1")]

to_dict() dict[str, Any][source]

Convert the partition descriptor to a dict.

Module contents