dae.parquet package
Subpackages
- dae.parquet.schema2 package
- Submodules
- dae.parquet.schema2.annotation module
- dae.parquet.schema2.annotation_utils module
- dae.parquet.schema2.loader module
- dae.parquet.schema2.merge_parquet module
- dae.parquet.schema2.parquet_io module
ContinuousParquetFileWriter
ContinuousParquetFileWriter.BATCH_ROWS
ContinuousParquetFileWriter.DEFAULT_COMPRESSION
ContinuousParquetFileWriter.append_allele()
ContinuousParquetFileWriter.build_batch()
ContinuousParquetFileWriter.build_table()
ContinuousParquetFileWriter.close()
ContinuousParquetFileWriter.data_reset()
ContinuousParquetFileWriter.size()
VariantsParquetWriterDeprecated()
- dae.parquet.schema2.processing_pipeline module
AnnotationPipelineVariantsBatchFilter
AnnotationPipelineVariantsFilter
AnnotationPipelineVariantsFilterMixin
DeleteAttributesFromVariantFilter
DeleteAttributesFromVariantsBatchFilter
Schema2SummaryVariantsBatchSource
Schema2SummaryVariantsSource
VariantsBatchConsumer
VariantsBatchFilter
VariantsBatchPipelineProcessor
VariantsBatchSource
VariantsConsumer
VariantsFilter
VariantsLoaderBatchSource
VariantsLoaderSource
VariantsPipelineProcessor
VariantsSource
- dae.parquet.schema2.serializers module
- dae.parquet.schema2.variants_parquet_writer module
- Module contents
Submodules
dae.parquet.helpers module
dae.parquet.parquet_writer module
- dae.parquet.parquet_writer.add_missing_parquet_fields(pps: Schema, ped_df: DataFrame) tuple[DataFrame, Schema] [source]
Add missing parquet fields.
- dae.parquet.parquet_writer.append_meta_to_parquet(meta_filename: str, key: list[str], value: list[str]) None [source]
Append a key-value pair to the metadata parquet file.
- dae.parquet.parquet_writer.collect_pedigree_parquet_schema(ped_df: DataFrame) Schema [source]
Build the pedigree parquet schema.
- dae.parquet.parquet_writer.fill_family_bins(families: FamiliesData, partition_descriptor: PartitionDescriptor | None = None) None [source]
Assign family bin values to the families according to the partition descriptor.
- dae.parquet.parquet_writer.merge_variants_parquets(partition_descriptor: PartitionDescriptor, variants_dir: str, partition: list[tuple[str, str]], row_group_size: int = 50000, parquet_version: str | None = None) None [source]
Merge parquet files in variants_dir.
- dae.parquet.parquet_writer.pedigree_parquet_schema() Schema [source]
Return the schema for pedigree parquet file.
- dae.parquet.parquet_writer.save_ped_df_to_parquet(ped_df: DataFrame, filename: str, parquet_version: str | None = None) None [source]
Save ped_df as a parquet file named filename.
- dae.parquet.parquet_writer.serialize_summary_schema(annotation_attributes: list[AttributeInfo], partition_descriptor: PartitionDescriptor) str [source]
Serialize the summary schema.
dae.parquet.partition_descriptor module
- class dae.parquet.partition_descriptor.Partition(region_bin: str | None = None, frequency_bin: str | None = None, coding_bin: str | None = None, family_bin: str | None = None)[source]
Bases:
object
Class to represent a partition of a genotype dataset.
- coding_bin: str | None = None
- family_bin: str | None = None
- frequency_bin: str | None = None
- static from_pylist(partition: list[tuple[str, str]]) Partition [source]
Create a partition from a list of tuples.
- region_bin: str | None = None
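To illustrate the shape of this class, here is a minimal self-contained stand-in: the field names and the from_pylist signature follow the documented API, while the constructor logic shown is an assumption, not the library's actual implementation.

```python
from __future__ import annotations
from dataclasses import dataclass


# Illustrative stand-in for dae.parquet.partition_descriptor.Partition.
@dataclass
class Partition:
    region_bin: str | None = None
    frequency_bin: str | None = None
    coding_bin: str | None = None
    family_bin: str | None = None

    @staticmethod
    def from_pylist(partition: list[tuple[str, str]]) -> Partition:
        # Each tuple is (partition_name, value), e.g. ("region_bin", "chr1_0");
        # unspecified bins stay None.
        return Partition(**dict(partition))


part = Partition.from_pylist([("region_bin", "chr1_0"), ("coding_bin", "1")])
print(part.region_bin)  # chr1_0
```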
- class dae.parquet.partition_descriptor.PartitionDescriptor(*, chromosomes: list[str] | None = None, region_length: int = 0, integer_region_bins: bool = False, family_bin_size: int = 0, coding_effect_types: list[str] | None = None, rare_boundary: float = 0)[source]
Bases:
object
Class to represent partition of a genotype dataset.
- build_family_partitions(chromosome_lengths: dict[str, int]) list[Partition] [source]
Build family partitions for all variants in the dataset.
- build_summary_partitions(chromosome_lengths: dict[str, int]) list[Partition] [source]
Build summary partitions for all variants in the dataset.
- family_partition(allele: FamilyAllele, *, seen_as_denovo: bool) list[tuple[str, str]] [source]
Produce family partition for an allele.
The partition is returned as a list of tuples, each consisting of the partition name and its value.
Example: [("region_bin", "chr1_0"), ("frequency_bin", "0"), ("coding_bin", "1"), ("family_bin", "1")]
- family_partition_schema() list[tuple[str, str]] [source]
Build family dataset partition schema for table creation.
When creating an Impala or BigQuery table it is helpful to have the list of partitions and types used in the parquet dataset.
- get_variant_partitions(chromosome_lengths: dict[str, int]) tuple[list[list[tuple[str, str]]], list[list[tuple[str, str]]]] [source]
Return the output summary and family variant partition names.
- has_summary_partitions() bool [source]
Check whether partitions applicable to summary alleles are defined.
- make_all_region_bins(chromosome_lengths: dict[str, int]) list[str] [source]
Produce all region bins for all chromosomes.
- make_coding_bin(effect_types: Iterable[str]) int [source]
Produce coding bin for given list of effect types.
- make_frequency_bin(allele_count: int, allele_freq: float, *, is_denovo: bool = False) str [source]
Produce a frequency bin from the allele count, allele frequency, and de novo flag.
- make_region_bin(chrom: str, pos: int) str [source]
Produce region bin for given chromosome and position.
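The region bin names seen in the examples ("chr1_0") suggest a chromosome name joined with a region index. A minimal sketch, assuming 1-based positions and a region_length taken from the descriptor:

```python
def make_region_bin(chrom: str, pos: int, region_length: int = 1000) -> str:
    # Assumed: positions are 1-based, so the bin index is (pos - 1) // region_length.
    return f"{chrom}_{(pos - 1) // region_length}"


print(make_region_bin("chr1", 1))     # chr1_0
print(make_region_bin("chr1", 1001))  # chr1_1
```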
- make_region_bins_regions(chromosomes: list[str], chromosome_lengths: dict[str, int]) dict[str, list[Region]] [source]
Generate a mapping from region_bin to regions based on the partition descriptor.
- static parse(path_name: Path | str) PartitionDescriptor [source]
Parse partition description from a file.
When the file name has a .conf suffix or no suffix, the file is assumed to be in Python config format and is parsed using Python's ConfigParser class.
When the file name has .yaml suffix the file is parsed using the YAML parser.
- static parse_dict(config_dict: dict[str, Any]) PartitionDescriptor [source]
Parse a configuration dictionary and create a partition descriptor.
- static parse_string(content: str, content_format: str = 'conf') PartitionDescriptor [source]
Parse partition description from a string.
The supported formats are the Python config format and YAML.
Example Python config format:
```
[region_bin]
chromosomes = chr1, chr2
region_length = 10
integer_region_bins = False
[frequency_bin]
rare_boundary = 5.0
[coding_bin]
coding_effect_types = frame-shift,splice-site,nonsense,missense
[family_bin]
family_bin_size=10
```
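Since the .conf format is handled by Python's standard ConfigParser, the example content above can be read back with the standard library alone; this sketch only demonstrates the file format, not the library's parse_string implementation.

```python
from configparser import ConfigParser

# The example .conf content from above.
content = """
[region_bin]
chromosomes = chr1, chr2
region_length = 10
integer_region_bins = False
[frequency_bin]
rare_boundary = 5.0
[coding_bin]
coding_effect_types = frame-shift,splice-site,nonsense,missense
[family_bin]
family_bin_size = 10
"""

parser = ConfigParser()
parser.read_string(content)

# Comma-separated values come back as plain strings and need splitting.
chromosomes = [c.strip() for c in parser["region_bin"]["chromosomes"].split(",")]
print(chromosomes)                                        # ['chr1', 'chr2']
print(parser["frequency_bin"].getfloat("rare_boundary"))  # 5.0
```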
Example YAML format:
```
region_bin:
  chromosomes: chr1, chr2
  region_length: 10
  integer_region_bins: False
frequency_bin:
  rare_boundary: 5.0
coding_bin:
  coding_effect_types: frame-shift,splice-site,nonsense,missense
family_bin:
  family_bin_size: 10
```
- static partition_directory(dataset_dir: str, partition: Partition | list[tuple[str, str]]) str [source]
Construct a partition dataset directory.
Given a partition in the format returned by the summary_partition or family_partition methods, this function constructs the directory name corresponding to the partition.
- static partition_filename(prefix: str, partition: Partition | list[tuple[str, str]], bucket_index: int | None) str [source]
Construct a partition dataset base filename.
Given a partition in the format returned by the summary_partition or family_partition methods, this function constructs the file name corresponding to the partition.
- static path_to_partitions(raw_path: str) list[tuple[str, str]] [source]
Convert a path into the partitions it is composed of.
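The inverse operation can be sketched in the same spirit: keep only the "name=value" path components and return them as (name, value) tuples. This again assumes a Hive-style layout, not the library's exact parsing rules.

```python
def path_to_partitions(raw_path: str) -> list[tuple[str, str]]:
    # Components without "=" (the dataset root, the parquet file name) are skipped.
    result: list[tuple[str, str]] = []
    for component in raw_path.split("/"):
        if "=" in component:
            name, _, value = component.partition("=")
            result.append((name, value))
    return result


print(path_to_partitions("out/region_bin=chr1_0/coding_bin=1/part.parquet"))
# [('region_bin', 'chr1_0'), ('coding_bin', '1')]
```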
- region_to_region_bins(region: Region, chrom_lens: dict[str, int]) list[str] [source]
Provide a list of bins the given region intersects.
- schema1_partition(allele: FamilyAllele) list[tuple[str, str]] [source]
Produce Schema1 family partition for an allele.
The partition is returned as a list of tuples, each consisting of the partition name and its value.
Example: [("region_bin", "chr1_0"), ("frequency_bin", "0"), ("coding_bin", "1"), ("family_bin", "1")]
- serialize(output_format: str = 'conf') str [source]
Serialize a partition descriptor into a string.
- summary_partition(allele: SummaryAllele, *, seen_as_denovo: bool) list[tuple[str, str]] [source]
Produce summary partition for an allele.
The partition is returned as a list of tuples, each consisting of the partition name and its value.
Example: [("region_bin", "chr1_0"), ("frequency_bin", "0"), ("coding_bin", "1")]