dae.parquet package
Subpackages
- dae.parquet.schema2 package
- Submodules
- dae.parquet.schema2.parquet_io module
ContinuousParquetFileWriter
ContinuousParquetFileWriter.BATCH_ROWS
ContinuousParquetFileWriter.DEFAULT_COMPRESSION
ContinuousParquetFileWriter.append_family_allele()
ContinuousParquetFileWriter.append_summary_allele()
ContinuousParquetFileWriter.build_batch()
ContinuousParquetFileWriter.build_table()
ContinuousParquetFileWriter.close()
ContinuousParquetFileWriter.data_reset()
ContinuousParquetFileWriter.size()
VariantsParquetWriter
- dae.parquet.schema2.serializers module
AlleleParquetSerializer
AlleleParquetSerializer.ENUM_PROPERTIES
AlleleParquetSerializer.FAMILY_ALLELE_BASE_SCHEMA
AlleleParquetSerializer.SUMMARY_ALLELE_BASE_SCHEMA
AlleleParquetSerializer.build_family_allele_batch_dict()
AlleleParquetSerializer.build_family_schema()
AlleleParquetSerializer.build_summary_allele_batch_dict()
AlleleParquetSerializer.build_summary_schema()
AlleleParquetSerializer.schema_family
AlleleParquetSerializer.schema_summary
- dae.parquet.schema2.variant_serializers module
- Module contents
- dae.parquet.tests package
- Submodules
- dae.parquet.tests.test_family_variant_serialization module
- dae.parquet.tests.test_parquet_helpers module
- dae.parquet.tests.test_parquet_io_write_summary_variants module
- dae.parquet.tests.test_parquet_writer module
- dae.parquet.tests.test_partition_descriptor module
- dae.parquet.tests.test_summary_variant_serialization module
- Module contents
Submodules
dae.parquet.helpers module
- dae.parquet.helpers.merge_parquets(in_files: list[str], out_file: str, *, delete_in_files: bool = True, row_group_size: int = 50000, parquet_version: str | None = None) None [source]
Merge in_files into one large file called out_file.
- dae.parquet.helpers.url_to_pyarrow_fs(filename: str, filesystem: AbstractFileSystem | None = None) FileSystem [source]
Turn URL into pyarrow filesystem instance.
Parameters
- filename : str
    The fsspec-compatible URL.
- filesystem : fsspec.AbstractFileSystem, optional
    An fsspec filesystem for filename.
Returns
- filesystem : pyarrow.fs.FileSystem
    The new filesystem discovered from filename and filesystem.
dae.parquet.parquet_writer module
- dae.parquet.parquet_writer.add_missing_parquet_fields(pps: schema, ped_df: DataFrame) tuple[DataFrame, schema] [source]
Add missing parquet fields.
- dae.parquet.parquet_writer.append_meta_to_parquet(meta_filename: str, key: list[str], value: list[str]) None [source]
Append key-value pair to meta data parquet file.
- dae.parquet.parquet_writer.collect_pedigree_parquet_schema(ped_df: DataFrame) Schema [source]
Build the pedigree parquet schema.
- dae.parquet.parquet_writer.fill_family_bins(families: FamiliesData, partition_descriptor: PartitionDescriptor | None = None) None [source]
Save families data into a parquet file.
- dae.parquet.parquet_writer.merge_variants_parquets(partition_descriptor: PartitionDescriptor, variants_dir: str, partitions: list[tuple[str, str]], row_group_size: int = 25000, parquet_version: str | None = None) None [source]
Merge parquet files in variants_dir.
- dae.parquet.parquet_writer.pedigree_parquet_schema() schema [source]
Return the schema for pedigree parquet file.
- dae.parquet.parquet_writer.save_ped_df_to_parquet(ped_df: DataFrame, filename: str, filesystem: AbstractFileSystem | None = None, parquet_version: str | None = None) None [source]
Save ped_df as a parquet file named filename.
- dae.parquet.parquet_writer.serialize_summary_schema(annotation_attributes: list[AttributeInfo], partition_descriptor: PartitionDescriptor) str [source]
Serialize the summary schema.
- dae.parquet.parquet_writer.serialize_variants_data_schema(annotation_attributes: list[AttributeInfo]) str [source]
Serialize the variants data schema.
dae.parquet.partition_descriptor module
- class dae.parquet.partition_descriptor.PartitionDescriptor(chromosomes: list[str] | None = None, region_length: int = 0, family_bin_size: int = 0, coding_effect_types: list[str] | None = None, rare_boundary: float = 0)[source]
Bases:
object
Class to represent partition of a genotype dataset.
- dataset_family_partition() list[tuple[str, str]] [source]
Build family dataset partition for table creation.
When creating an Impala or BigQuery table it is helpful to have the list of partitions and types used in the parquet dataset.
- dataset_summary_partition() list[tuple[str, str]] [source]
Build summary parquet dataset partition for table creation.
When creating an Impala or BigQuery table it is helpful to have the list of partitions and types used in the parquet dataset.
- family_partition(allele: FamilyAllele, *, seen_as_denovo: bool) list[tuple[str, str]] [source]
Produce family partition for an allele.
The partition is returned as a list of tuples consisting of the name of the partition and its value.
Example:
[("region_bin", "chr1_0"), ("frequency_bin", "0"), ("coding_bin", "1"), ("family_bin", "1")]
- get_variant_partitions(chrom_lens: dict[str, int]) tuple[list[list[tuple[str, str]]], list[list[tuple[str, str]]]] [source]
Return the output summary and family variant partition names.
- has_summary_partitions() bool [source]
Check if partitions applicable to summary alleles are defined.
- make_coding_bin(effect_types: Iterable[str]) int [source]
Produce coding bin for given list of effect types.
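A plausible sketch of the coding-bin rule, under the assumption that the bin is 1 when any of the allele's effect types is among the configured coding effect types and 0 otherwise; the names here are illustrative, not the library's implementation.

```python
def make_coding_bin(effect_types, coding_effect_types):
    """Return 1 if any effect type is configured as coding, else 0."""
    return int(bool(set(effect_types) & set(coding_effect_types)))
```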
- make_frequency_bin(allele_count: int, allele_freq: float, *, is_denovo: bool = False) str [source]
Produce frequency bin from allele count, allele frequency, and de novo flag.
- make_region_bin(chrom: str, pos: int) str [source]
Produce region bin for given chromosome and position.
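Assuming the region bin is the chromosome name joined with the index of the fixed-length region containing the position (consistent with example values like "chr1_0" elsewhere on this page), a sketch:

```python
def make_region_bin(chrom, pos, region_length=10_000_000):
    """Sketch: bin index is the position integer-divided by region_length."""
    return f"{chrom}_{pos // region_length}"
```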
- static parse(path_name: Path | str) PartitionDescriptor [source]
Parse partition description from a file.
When the file name has a .conf suffix or no suffix, the file is assumed to be in Python config format and is parsed using the Python ConfigParser class.
When the file name has a .yaml suffix, the file is parsed using the YAML parser.
- static parse_dict(config_dict: dict[str, Any]) PartitionDescriptor [source]
Parse a configuration dictionary and create a partition descriptor.
- static parse_string(content: str, content_format: str = 'conf') PartitionDescriptor [source]
Parse partition description from a string.
The supported formats are the Python config format and YAML. Example string content should be as follows.
Example Python config format:

[region_bin]
chromosomes = chr1, chr2
region_length = 10
[frequency_bin]
rare_boundary = 5.0
[coding_bin]
coding_effect_types = frame-shift,splice-site,nonsense,missense
[family_bin]
family_bin_size = 10

Example YAML format:

region_bin:
    chromosomes: chr1, chr2
    region_length: 10
frequency_bin:
    rare_boundary: 5.0
coding_bin:
    coding_effect_types: frame-shift,splice-site,nonsense,missense
family_bin:
    family_bin_size: 10
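The Python config format shown above can be read with the standard-library ConfigParser; a minimal sketch of extracting a couple of fields (the actual parse_string handles all sections, defaults, and the YAML format as well):

```python
import configparser

conf = """
[region_bin]
chromosomes = chr1, chr2
region_length = 10
[frequency_bin]
rare_boundary = 5.0
"""

parser = configparser.ConfigParser()
parser.read_string(conf)

# Comma-separated chromosome list becomes a Python list of names.
chromosomes = [c.strip() for c in parser["region_bin"]["chromosomes"].split(",")]
region_length = parser.getint("region_bin", "region_length")
```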
- static partition_directory(output_dir: str, partition: list[tuple[str, str]]) str [source]
Construct a partition dataset directory.
Given a partition in the format returned by the summary_partition or family_partition methods, this function constructs the directory name corresponding to the partition.
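Assuming the directory layout follows the common Hive-style `name=value` convention (consistent with path_to_partitions below, which inverts it), a sketch:

```python
def partition_directory(output_dir, partition):
    """Build a Hive-style path: one `name=value` segment per partition."""
    parts = [f"{name}={value}" for name, value in partition]
    return "/".join([output_dir.rstrip("/")] + parts)
```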
- static partition_filename(prefix: str, partition: list[tuple[str, str]], bucket_index: int | None) str [source]
Construct a partition dataset base filename.
Given a partition in the format returned by the summary_partition or family_partition methods, this function constructs the file name corresponding to the partition.
- static path_to_partitions(raw_path: str) list[tuple[str, str]] [source]
Convert a path into the partitions it is composed of.
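Under the same Hive-style `name=value` assumption, recovering the partitions from a path can be sketched as:

```python
def path_to_partitions(raw_path):
    """Extract (name, value) pairs from `name=value` path segments."""
    return [
        tuple(segment.split("=", 1))
        for segment in raw_path.split("/")
        if "=" in segment
    ]
```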
- region_to_bins(region: Region, chrom_lens: dict[str, int]) list[tuple[str, str]] [source]
Provide a list of bins the given region intersects.
- schema1_partition(allele: FamilyAllele) list[tuple[str, str]] [source]
Produce Schema1 family partition for an allele.
The partition is returned as a list of tuples consisting of the name of the partition and its value.
Example:
[("region_bin", "chr1_0"), ("frequency_bin", "0"), ("coding_bin", "1"), ("family_bin", "1")]
- serialize(output_format: str = 'conf') str [source]
Serialize a partition descriptor into a string.
- summary_partition(allele: SummaryAllele, *, seen_as_denovo: bool) list[tuple[str, str]] [source]
Produce summary partition for an allele.
The partition is returned as a list of tuples consisting of the name of the partition and its value.
Example:
[("region_bin", "chr1_0"), ("frequency_bin", "0"), ("coding_bin", "1")]