dae.parquet.schema2 package

Submodules

dae.parquet.schema2.parquet_io module

class dae.parquet.schema2.parquet_io.ContinuousParquetFileWriter(filepath: str, annotation_schema: list[AttributeInfo], filesystem: AbstractFileSystem | None = None, row_group_size: int = 10000, schema: str = 'schema', blob_column: str | None = None)

Bases: object

A continuous parquet writer.

The writer buffers incoming data and writes it to the given parquet file once enough has accumulated. Any leftover data is written to the file when the writer is closed.
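The buffer-then-flush behaviour described above can be illustrated with a minimal sketch. It is not the class's actual implementation: `BufferedRowWriter`, its `sink` callable and `pending_rows` are hypothetical names, and a plain Python callable stands in for the real parquet write call so that the example needs only the standard library.

```python
from typing import Any, Callable


class BufferedRowWriter:
    """Hypothetical stand-in: buffers rows and flushes them in groups."""

    def __init__(
        self,
        sink: Callable[[list[dict[str, Any]]], None],
        row_group_size: int = 10_000,
    ) -> None:
        self._sink = sink  # stand-in for the real parquet write call
        self._row_group_size = row_group_size
        self._rows: list[dict[str, Any]] = []

    def append(self, row: dict[str, Any]) -> None:
        self._rows.append(row)
        # Flush as soon as a full row group has accumulated.
        if len(self._rows) >= self._row_group_size:
            self._flush()

    def pending_rows(self) -> int:
        """Number of rows buffered but not yet handed to the sink."""
        return len(self._rows)

    def close(self) -> None:
        # Write whatever is left over; an empty buffer writes nothing.
        if self._rows:
            self._flush()

    def _flush(self) -> None:
        self._sink(self._rows)
        self._rows = []


# Usage: seven rows with a group size of three.
written: list[list[dict[str, Any]]] = []
writer = BufferedRowWriter(written.append, row_group_size=3)
for i in range(7):
    writer.append({"summary_index": i})
writer.close()
group_sizes = [len(group) for group in written]
```

In this sketch the two full groups are flushed during `append`, and the single leftover row is only written by `close`, which is why forgetting to close such a writer would lose trailing data.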

BATCH_ROWS = 1000
DEFAULT_COMPRESSION = 'SNAPPY'
append_family_allele(allele: FamilyAllele, json_data: bytes) -> None

Append the data for an entire variant to the correct file.

append_summary_allele(allele: SummaryAllele, json_data: bytes) -> None

Append the data for an entire variant to the correct file.

build_batch() -> RecordBatch
build_table() -> Table
close() -> None

Close the parquet writer and write any remaining data.

data_reset() -> None
size() -> int
class dae.parquet.schema2.parquet_io.VariantsParquetWriter(out_dir: str, annotation_schema: list[AttributeInfo], partition_descriptor: PartitionDescriptor, *, serializer: VariantsDataSerializer | None = None, bucket_index: int = 1, row_group_size: int = 10000, include_reference: bool = True, filesystem: AbstractFileSystem | None = None)

Bases: object

Provide functions for storing variants into a parquet dataset.

close() -> None
write_dataset(full_variants_iterator: Iterator[tuple[SummaryVariant, list[FamilyVariant]]]) -> list[str]

Write variants to a partitioned parquet dataset.

write_summary_variant(summary_variant: SummaryVariant, attributes: dict[str, Any] | None = None, sj_base_index: int | None = None) -> None

Write a single summary variant to the correct parquet file.
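The input shape accepted by `write_dataset`, an iterator of (summary variant, list of family variants) pairs, and the idea of routing each pair to a partition can be sketched as follows. This is only an illustration under stated assumptions: plain dicts stand in for the variant classes, and `partition_key` is a hypothetical chromosome-plus-region key, since this section does not describe the scheme `PartitionDescriptor` actually uses.

```python
from collections import defaultdict
from typing import Any, Iterator

Record = dict[str, Any]


def partition_key(summary: Record, region_length: int = 1_000_000) -> str:
    """Hypothetical key: chromosome plus a fixed-width position bin."""
    region_bin = summary["position"] // region_length
    return f"{summary['chromosome']}_{region_bin}"


def group_by_partition(
    variants: Iterator[tuple[Record, list[Record]]],
) -> dict[str, list[Record]]:
    """Collect each summary record and its family records per partition."""
    partitions: dict[str, list[Record]] = defaultdict(list)
    for summary, family_records in variants:
        key = partition_key(summary)
        partitions[key].append(summary)
        partitions[key].extend(family_records)
    return dict(partitions)


# Usage: three summary records, two of which share a region bin.
pairs = iter([
    ({"chromosome": "chr1", "position": 1_500_000}, [{"family_id": "f1"}]),
    ({"chromosome": "chr1", "position": 1_700_000}, []),
    ({"chromosome": "chr2", "position": 10}, [{"family_id": "f2"}]),
])
result = group_by_partition(pairs)
```

A real writer would presumably keep one open file writer per key instead of an in-memory list, and return the written file names rather than the grouped records.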

dae.parquet.schema2.serializers module

class dae.parquet.schema2.serializers.AlleleParquetSerializer(annotation_schema: list[AttributeInfo], extra_attributes: list[str] | None = None)

Bases: object

Serialize a bunch of alleles.

ENUM_PROPERTIES: ClassVar[dict[str, Any]] = {'allele_in_roles': <enum 'Role'>, 'allele_in_sexes': <enum 'Sex'>, 'allele_in_statuses': <enum 'Status'>, 'inheritance_in_members': <enum 'Inheritance'>, 'transmission_type': <enum 'TransmissionType'>, 'variant_type': <enum 'Type'>}
FAMILY_ALLELE_BASE_SCHEMA: ClassVar[dict[str, Any]] = {'allele_in_members': ListType(list<item: string>), 'allele_in_roles': DataType(int32), 'allele_in_sexes': DataType(int8), 'allele_in_statuses': DataType(int8), 'allele_index': DataType(int32), 'bucket_index': DataType(int32), 'family_id': DataType(string), 'family_index': DataType(int32), 'inheritance_in_members': DataType(int16), 'is_denovo': DataType(int8), 'sj_index': DataType(int64), 'summary_index': DataType(int32)}
SUMMARY_ALLELE_BASE_SCHEMA: ClassVar[dict[str, Any]] = {'af_allele_count': DataType(int32), 'af_allele_freq': DataType(float), 'af_parents_called_count': DataType(int32), 'af_parents_called_percent': DataType(float), 'allele_index': DataType(int32), 'bucket_index': DataType(int32), 'chromosome': DataType(string), 'effect_gene': ListType(list<item: struct<effect_gene_symbols: string, effect_types: string>>), 'end_position': DataType(int32), 'family_alleles_count': DataType(int32), 'family_variants_count': DataType(int32), 'position': DataType(int32), 'reference': DataType(string), 'seen_as_denovo': DataType(bool), 'seen_in_status': DataType(int8), 'sj_index': DataType(int64), 'summary_index': DataType(int32), 'transmission_type': DataType(int8), 'variant_type': DataType(int8)}
build_family_allele_batch_dict(allele: FamilyAllele, family_variant_data: bytes) -> dict[str, list[Any]]

Build a batch of family allele data in the form of a dict.

classmethod build_family_schema() -> Schema

Build the schema for the family alleles.

build_summary_allele_batch_dict(allele: SummaryAllele, summary_variant_data: bytes) -> dict[str, Any]

Build a batch of summary allele data in the form of a dict.
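The two `build_*_batch_dict` methods return a dict keyed by column name. A minimal sketch of that column-oriented shape is given below, under the assumption that enum-typed properties (those listed in `ENUM_PROPERTIES`) are stored by their integer value, which would match the integer column types in the base schemas. The `Sex` enum and `build_batch_dict` function here are hypothetical stand-ins, not the library's own definitions.

```python
from enum import Enum
from typing import Any


class Sex(Enum):
    """Hypothetical stand-in enum; the real `Sex` values may differ."""

    MALE = 1
    FEMALE = 2


def build_batch_dict(records: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Turn row-oriented records into a column-oriented dict of lists.

    Assumes every record carries the same set of keys.
    """
    batch: dict[str, list[Any]] = {}
    for record in records:
        for name, value in record.items():
            # Store enum members by their integer value so the column
            # fits a compact integer type.
            if isinstance(value, Enum):
                value = value.value
            batch.setdefault(name, []).append(value)
    return batch


rows = [
    {"family_id": "f1", "allele_index": 0, "sex": Sex.MALE},
    {"family_id": "f1", "allele_index": 1, "sex": Sex.FEMALE},
]
batch = build_batch_dict(rows)
```

A dict of equal-length lists like this maps directly onto a columnar record batch, which is why the row data is pivoted before writing.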

classmethod build_summary_schema(annotation_schema: list[AttributeInfo]) -> Schema

Build the schema for the summary alleles.

property schema_family: Schema

Lazily construct and return the schema for the family alleles.

property schema_summary: Schema

Lazily construct and return the schema for the summary alleles.
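The lazy construction used by the `schema_family` and `schema_summary` properties follows a common pattern: build the value on first access, then return the cached object. The sketch below shows that pattern only; `LazySchemaHolder` is a hypothetical class, and a plain dict stands in for the real `Schema` object.

```python
from typing import Any


class LazySchemaHolder:
    """Hypothetical class showing a lazily built, cached attribute."""

    def __init__(self, fields: list[str]) -> None:
        self._fields = fields
        self._schema: dict[str, Any] | None = None
        self.build_count = 0  # only here to show the build runs once

    @property
    def schema(self) -> dict[str, Any]:
        # Build on first access, then reuse the cached object.
        if self._schema is None:
            self.build_count += 1
            self._schema = {name: "string" for name in self._fields}
        return self._schema


holder = LazySchemaHolder(["chromosome", "position"])
first = holder.schema
second = holder.schema
```

Both accesses return the same object, so the cost of building the schema is paid at most once per instance, and not at all if the property is never read.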

dae.parquet.schema2.variant_serializers module

class dae.parquet.schema2.variant_serializers.JsonVariantsDataSerializer(metadata: dict[str, Any] | None = None)

Bases: VariantsDataSerializer

Serialize family and summary alleles to json.

deserialize_family_record(data: bytes) -> dict[str, Any]

Deserialize a family allele from a byte string.

deserialize_summary_record(data: bytes) -> list[dict[str, Any]]

Deserialize a summary allele from a byte string.

serialize_family(variant: FamilyVariant) -> bytes

Serialize a family variant part to a byte string.

serialize_summary(variant: SummaryVariant) -> bytes

Serialize a summary allele to a byte string.
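The byte-string round trip that a JSON serializer performs can be sketched with the standard library alone. The functions and the record layout below are hypothetical; how the real class turns a `FamilyVariant` or `SummaryVariant` into a dict is not covered in this section.

```python
import json
from typing import Any


def serialize_record(record: dict[str, Any]) -> bytes:
    """Encode a record as compact UTF-8 JSON bytes."""
    return json.dumps(record, separators=(",", ":")).encode("utf-8")


def deserialize_record(data: bytes) -> dict[str, Any]:
    """Decode UTF-8 JSON bytes back into a dict."""
    return json.loads(data.decode("utf-8"))


record = {"family_id": "f1", "summary_index": 7, "genotype": [[0, 1], [0, 0]]}
blob = serialize_record(record)
restored = deserialize_record(blob)
```

The round trip is lossless only for JSON-compatible values (strings, numbers, booleans, lists, dicts, `None`); anything else, such as enum members or tuples, would need an explicit conversion first.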

class dae.parquet.schema2.variant_serializers.VariantsDataSerializer(metadata: dict[str, Any] | None = None)

Bases: ABC

Interface for serializing family and summary alleles.

static build_serializer(metadata: dict[str, Any] | None = None) -> VariantsDataSerializer

Build a serializer based on the metadata.

abstract deserialize_family_record(data: bytes) -> dict[str, Any]

Deserialize a family allele from a byte string.

abstract deserialize_summary_record(data: bytes) -> list[dict[str, Any]]

Deserialize a summary allele from a byte string.

abstract serialize_family(variant: FamilyVariant) -> bytes

Serialize a family variant part to a byte string.

abstract serialize_summary(variant: SummaryVariant) -> bytes

Serialize a summary allele to a byte string.
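An abstract interface combined with a static factory, as `VariantsDataSerializer` and `build_serializer` are declared above, can be sketched like this. All names are hypothetical, and the `"format"` metadata key in particular is an assumption made for illustration: this section does not say which metadata entries the real factory inspects.

```python
import json
from abc import ABC, abstractmethod
from typing import Any, Optional


class RecordSerializer(ABC):
    """Hypothetical interface: one encode and one decode method."""

    @abstractmethod
    def serialize(self, record: dict[str, Any]) -> bytes: ...

    @abstractmethod
    def deserialize(self, data: bytes) -> dict[str, Any]: ...

    @staticmethod
    def build(metadata: Optional[dict[str, Any]] = None) -> "RecordSerializer":
        # Assumption: missing metadata selects JSON; unknown names raise.
        fmt = (metadata or {}).get("format", "json")
        if fmt == "json":
            return JsonRecordSerializer()
        raise ValueError(f"unsupported serializer format: {fmt}")


class JsonRecordSerializer(RecordSerializer):
    """Concrete implementation backed by the standard json module."""

    def serialize(self, record: dict[str, Any]) -> bytes:
        return json.dumps(record).encode("utf-8")

    def deserialize(self, data: bytes) -> dict[str, Any]:
        return json.loads(data)


serializer = RecordSerializer.build()
payload = serializer.serialize({"family_id": "f1"})
```

Keeping the choice of implementation inside the factory means callers depend only on the abstract interface, so a reader can pick the matching decoder from metadata stored next to the data.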

class dae.parquet.schema2.variant_serializers.ZstdIndexedVariantsDataSerializer(metadata: dict[str, Any] | None = None)

Bases: VariantsDataSerializer

Serialize family and summary alleles to zstd.

classmethod build_serialization_meta(annotation_fields: list[str]) -> dict[str, Any]

Build the serialization schema.

deserialize_family_record(data: bytes) -> dict[str, Any]

Deserialize a family allele from a byte string.

deserialize_summary_record(data: bytes) -> list[dict[str, Any]]

Deserialize a summary allele from a byte string.

serialize_family(variant: FamilyVariant) -> bytes

Serialize a family variant part to a byte string.

serialize_summary(variant: SummaryVariant) -> bytes

Serialize a summary allele to a byte string.
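One plausible reading of an "indexed" compressed serializer is that field names are recorded once in the serialization metadata, and each record is then stored as a positional list of values before compression. The sketch below illustrates that idea only and is an assumption, not the class's documented behaviour. It also swaps in `zlib` from the standard library where the class docstring names zstd, so that the example runs without extra dependencies; all function names are hypothetical.

```python
import json
import zlib
from typing import Any


def build_meta(fields: list[str]) -> dict[str, Any]:
    """Record the field order once so records can omit field names."""
    return {"fields": list(fields)}


def serialize_indexed(record: dict[str, Any], meta: dict[str, Any]) -> bytes:
    # Store values positionally; the field names live only in `meta`.
    values = [record.get(name) for name in meta["fields"]]
    # zlib is a stand-in here; the class docstring names zstd.
    return zlib.compress(json.dumps(values).encode("utf-8"))


def deserialize_indexed(data: bytes, meta: dict[str, Any]) -> dict[str, Any]:
    # Reverse the steps: decompress, parse, re-attach the field names.
    values = json.loads(zlib.decompress(data))
    return dict(zip(meta["fields"], values))


meta = build_meta(["chromosome", "position", "reference"])
record = {"chromosome": "chr1", "position": 12345, "reference": "A"}
blob = serialize_indexed(record, meta)
restored = deserialize_indexed(blob, meta)
```

Under this reading, the same metadata must be available to both writer and reader, because a blob on its own carries no field names.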

Module contents