dae.parquet package
Subpackages
- dae.parquet.schema2 package
- Subpackages
- Submodules
- dae.parquet.schema2.parquet_io module
ContinuousParquetFileWriter
ContinuousParquetFileWriter.BATCH_ROWS
ContinuousParquetFileWriter.DEFAULT_COMPRESSION
ContinuousParquetFileWriter.append_family_allele()
ContinuousParquetFileWriter.append_summary_allele()
ContinuousParquetFileWriter.build_batch()
ContinuousParquetFileWriter.build_table()
ContinuousParquetFileWriter.close()
ContinuousParquetFileWriter.data_reset()
ContinuousParquetFileWriter.size()
VariantsParquetWriter
- dae.parquet.schema2.serializers module
AlleleParquetSerializer
AlleleParquetSerializer.ENUM_PROPERTIES
AlleleParquetSerializer.FAMILY_ALLELE_BASE_SCHEMA
AlleleParquetSerializer.SUMMARY_ALLELE_BASE_SCHEMA
AlleleParquetSerializer.build_family_allele_batch_dict()
AlleleParquetSerializer.build_family_schema()
AlleleParquetSerializer.build_summary_allele_batch_dict()
AlleleParquetSerializer.build_summary_schema()
AlleleParquetSerializer.schema_family
AlleleParquetSerializer.schema_summary
- dae.parquet.schema2.variant_serializers module
- Module contents
- dae.parquet.tests package
Submodules
dae.parquet.helpers module
dae.parquet.parquet_writer module
- dae.parquet.parquet_writer.add_missing_parquet_fields(pps: Schema, ped_df: DataFrame) tuple[DataFrame, Schema] [source]
Add missing parquet fields.
- dae.parquet.parquet_writer.append_meta_to_parquet(meta_filename: str, key: list[str], value: list[str]) None [source]
Append key-value pairs to the metadata parquet file.
- dae.parquet.parquet_writer.collect_pedigree_parquet_schema(ped_df: DataFrame) Schema [source]
Build the pedigree parquet schema.
- dae.parquet.parquet_writer.fill_family_bins(families: FamiliesData, partition_descriptor: PartitionDescriptor | None = None) None [source]
Assign family bin values to families according to the partition descriptor.
- dae.parquet.parquet_writer.merge_variants_parquets(partition_descriptor: PartitionDescriptor, variants_dir: str, partitions: list[tuple[str, str]], row_group_size: int = 50000, parquet_version: str | None = None) None [source]
Merge parquet files in variants_dir.
- dae.parquet.parquet_writer.pedigree_parquet_schema() Schema [source]
Return the schema for pedigree parquet file.
- dae.parquet.parquet_writer.save_ped_df_to_parquet(ped_df: DataFrame, filename: str, parquet_version: str | None = None) None [source]
Save ped_df as a parquet file named filename; a combined usage sketch for these writer helpers follows this list.
- dae.parquet.parquet_writer.serialize_summary_schema(annotation_attributes: list[AttributeInfo], partition_descriptor: PartitionDescriptor) str [source]
Serialize the summary schema.
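The writer helpers above compose into a small pedigree write path. Below is a minimal, hypothetical sketch; the dataframe columns, file names, and meta key are illustrative assumptions, and only the function signatures are taken from this page.

```python
import pandas as pd

from dae.parquet.parquet_writer import (
    append_meta_to_parquet,
    collect_pedigree_parquet_schema,
    save_ped_df_to_parquet,
)

# Hypothetical pedigree dataframe; real pedigrees carry more columns.
ped_df = pd.DataFrame({
    "family_id": ["f1", "f1"],
    "person_id": ["p1", "p2"],
    "mom_id": ["0", "0"],
    "dad_id": ["0", "0"],
    "sex": ["M", "F"],
    "status": ["1", "2"],
})

# Build the pedigree parquet schema and write the dataframe out.
schema = collect_pedigree_parquet_schema(ped_df)
save_ped_df_to_parquet(ped_df, "out/pedigree.parquet")

# Record provenance in the dataset's metadata parquet file.
append_meta_to_parquet(
    "out/meta/meta.parquet",
    key=["pedigree_source"],
    value=["pedigree.ped"],
)
```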
dae.parquet.partition_descriptor module
- class dae.parquet.partition_descriptor.PartitionDescriptor(*, chromosomes: list[str] | None = None, region_length: int = 0, integer_region_bins: bool = False, family_bin_size: int = 0, coding_effect_types: list[str] | None = None, rare_boundary: float = 0)[source]
Bases: object
Class to represent partition of a genotype dataset.
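A minimal construction sketch using the keyword-only arguments from the signature above; the chromosome list, region length, and bin settings are illustrative values, not defaults:

```python
from dae.parquet.partition_descriptor import PartitionDescriptor

# Illustrative configuration mirroring the examples further below.
pd_desc = PartitionDescriptor(
    chromosomes=["chr1", "chr2"],
    region_length=10_000_000,
    family_bin_size=10,
    coding_effect_types=[
        "frame-shift", "splice-site", "nonsense", "missense"],
    rare_boundary=5.0,
)
```

The later sketches on this page reuse `pd_desc`.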
- dataset_family_partition() list[tuple[str, str]] [source]
Build family dataset partition for table creation.
When creating an Impala or BigQuery table it is helpful to have the list of partitions and types used in the parquet dataset.
- dataset_summary_partition() list[tuple[str, str]] [source]
Build summary parquet dataset partition for table creation.
When creating an Impala or BigQuery table it is helpful to have the list of partitions and types used in the parquet dataset.
- family_partition(allele: FamilyAllele, *, seen_as_denovo: bool) list[tuple[str, str]] [source]
Produce family partition for an allele.
The partition is returned as a list of tuples consisting of the name of the partition and its value.
Example: [("region_bin", "chr1_0"), ("frequency_bin", "0"), ("coding_bin", "1"), ("family_bin", "1")]
- get_variant_partitions(chromosome_lengths: dict[str, int]) tuple[list[list[tuple[str, str]]], list[list[tuple[str, str]]]] [source]
Return the output summary and family variant partition names.
- has_summary_partitions() bool [source]
Check whether partitions applicable to summary alleles are defined.
- make_all_region_bins(chromosome_lengths: dict[str, int]) list[str] [source]
Produce all region bins for all chromosomes.
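make_all_region_bins() and get_variant_partitions() both take a mapping of chromosome name to length. A short sketch, continuing with `pd_desc` from the construction example and using illustrative lengths:

```python
chromosome_lengths = {"chr1": 248_956_422, "chr2": 242_193_529}

# All region bins across the configured chromosomes.
region_bins = pd_desc.make_all_region_bins(chromosome_lengths)

# Per-partition name/value lists for the summary and family outputs.
summary_parts, family_parts = pd_desc.get_variant_partitions(
    chromosome_lengths)
```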
- make_coding_bin(effect_types: Iterable[str]) int [source]
Produce coding bin for given list of effect types.
- make_frequency_bin(allele_count: int, allele_freq: float, *, is_denovo: bool = False) str [source]
Produce a frequency bin from the allele count, allele frequency, and de novo flag.
- make_region_bin(chrom: str, pos: int) str [source]
Produce region bin for given chromosome and position.
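A sketch of the per-allele bin helpers, continuing with `pd_desc`; the positions, counts, and effect types are made-up inputs, and the exact bin labels depend on the configuration:

```python
# Region bin for a chromosome and position (e.g. "chr1_3" with the
# illustrative 10 Mbp region_length).
region_bin = pd_desc.make_region_bin("chr1", 35_000_000)

# Frequency bin from allele count, allele frequency, and de novo flag.
frequency_bin = pd_desc.make_frequency_bin(
    allele_count=12, allele_freq=0.7, is_denovo=False)

# Coding bin for a list of effect types.
coding_bin = pd_desc.make_coding_bin(["missense"])
```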
- make_region_bins_regions(chromosomes: list[str], chromosome_lengths: dict[str, int]) dict[str, list[Region]] [source]
Generate a mapping from region bins to regions based on the partition descriptor.
- static parse(path_name: Path | str) PartitionDescriptor [source]
Parse partition description from a file.
When the file name has a .conf suffix or no suffix, the file is assumed to be a Python config file and is parsed using the Python ConfigParser class. When the file name has a .yaml suffix, the file is parsed using the YAML parser.
- static parse_dict(config_dict: dict[str, Any]) PartitionDescriptor [source]
Parse a configuration dictionary and create a partition descriptor.
- static parse_string(content: str, content_format: str = 'conf') PartitionDescriptor [source]
Parse partition description from a string.
The supported formats are the Python config format and YAML; example content in each format is shown below.
Example Python config format:
```
[region_bin]
chromosomes = chr1, chr2
region_length = 10
integer_region_bins = False

[frequency_bin]
rare_boundary = 5.0

[coding_bin]
coding_effect_types = frame-shift,splice-site,nonsense,missense

[family_bin]
family_bin_size = 10
```
Example YAML format:
```
region_bin:
  chromosomes: chr1, chr2
  region_length: 10
  integer_region_bins: False
frequency_bin:
  rare_boundary: 5.0
coding_bin:
  coding_effect_types: frame-shift,splice-site,nonsense,missense
family_bin:
  family_bin_size: 10
```
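A hypothetical parsing sketch built on the YAML example above; it assumes "yaml" is the accepted content_format value for YAML content, with "conf" being the documented default:

```python
from dae.parquet.partition_descriptor import PartitionDescriptor

yaml_content = """
region_bin:
  chromosomes: chr1, chr2
  region_length: 10
  integer_region_bins: False
frequency_bin:
  rare_boundary: 5.0
"""

pd_from_yaml = PartitionDescriptor.parse_string(
    yaml_content, content_format="yaml")
```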
- static partition_directory(output_dir: str, partition: list[tuple[str, str]]) str [source]
Construct a partition dataset directory.
Given a partition in the format returned by the summary_partition or family_partition methods, this function constructs the directory name corresponding to the partition.
- static partition_filename(prefix: str, partition: list[tuple[str, str]], bucket_index: int | None) str [source]
Construct a partition dataset base filename.
Given a partition in the format returned by the summary_partition or family_partition methods, this function constructs the file name corresponding to the partition.
- static path_to_partitions(raw_path: str) list[tuple[str, str]] [source]
Convert a path into the partitions it is composed of.
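The three static path helpers compose; a sketch assuming Hive-style key=value path components, which is the format path_to_partitions() implies. The output directory, prefix, and bucket index are illustrative:

```python
from dae.parquet.partition_descriptor import PartitionDescriptor

partition = [
    ("region_bin", "chr1_0"),
    ("frequency_bin", "0"),
    ("coding_bin", "1"),
    ("family_bin", "1"),
]

# Directory for this partition under an output directory.
out_dir = PartitionDescriptor.partition_directory("out/family", partition)

# Base filename for one bucket of this partition.
out_name = PartitionDescriptor.partition_filename(
    "family", partition, bucket_index=1)

# Recover the partition name/value pairs from a dataset path.
parts = PartitionDescriptor.path_to_partitions(
    "region_bin=chr1_0/frequency_bin=0/coding_bin=1/family_bin=1")
```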
- region_to_region_bins(region: Region, chrom_lens: dict[str, int]) list[str] [source]
Provide a list of bins the given region intersects.
- schema1_partition(allele: FamilyAllele) list[tuple[str, str]] [source]
Produce Schema1 family partition for an allele.
The partition is returned as a list of tuples consisting of the name of the partition and its value.
Example: [("region_bin", "chr1_0"), ("frequency_bin", "0"), ("coding_bin", "1"), ("family_bin", "1")]
- serialize(output_format: str = 'conf') str [source]
Serialize a partition descriptor into a string.
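serialize() and parse_string() are natural round-trip partners; a minimal sketch, continuing with `pd_desc` from the construction example:

```python
# Serialize to the default Python config format...
content = pd_desc.serialize()

# ...and parse the result back into an equivalent descriptor.
pd_copy = PartitionDescriptor.parse_string(content, content_format="conf")
```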
- summary_partition(allele: SummaryAllele, *, seen_as_denovo: bool) list[tuple[str, str]] [source]
Produce summary partition for an allele.
The partition is returned as a list of tuples consisting of the name of the partition and its value.
Example: [("region_bin", "chr1_0"), ("frequency_bin", "0"), ("coding_bin", "1")]