dae.parquet.schema2 package
Submodules
dae.parquet.schema2.parquet_io module
- class dae.parquet.schema2.parquet_io.ContinuousParquetFileWriter(filepath: str, annotation_schema: list[AttributeInfo], filesystem: AbstractFileSystem | None = None, row_group_size: int = 10000, schema: str = 'schema', blob_column: str | None = None)[source]
Bases: object
A continuous parquet writer.
Writes to the given parquet file automatically once enough data has been supplied, and writes any leftover data to the file when it is closed.
- BATCH_ROWS = 1000
- DEFAULT_COMPRESSION = 'SNAPPY'
- append_family_allele(allele: FamilyAllele, json_data: bytes) None [source]
Append the data for an entire variant to the correct file.
- append_summary_allele(allele: SummaryAllele, json_data: bytes) None [source]
Append the data for an entire variant to the correct file.
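The buffering behaviour described for ContinuousParquetFileWriter (write once enough rows have accumulated, then write the remainder on close) can be illustrated with a minimal sketch. This is not the library's implementation: BufferedBatchWriter, its append and close methods, and the in-memory flushed list are hypothetical names chosen for illustration, and the sketch keeps batches in memory instead of writing a parquet file.

```python
from typing import Any


class BufferedBatchWriter:
    """Hypothetical stand-in: collects rows and flushes them in fixed-size batches."""

    def __init__(self, batch_rows: int = 1000) -> None:
        self.batch_rows = batch_rows
        self._buffer: list[dict[str, Any]] = []
        # Flushed batches are kept in memory here; a real writer
        # would write each batch to the parquet file instead.
        self.flushed: list[list[dict[str, Any]]] = []

    def append(self, row: dict[str, Any]) -> None:
        """Buffer one row and flush as soon as a full batch is available."""
        self._buffer.append(row)
        if len(self._buffer) >= self.batch_rows:
            self._flush()

    def _flush(self) -> None:
        if self._buffer:
            self.flushed.append(self._buffer)
            self._buffer = []

    def close(self) -> None:
        """Write whatever is left over in the buffer."""
        self._flush()


writer = BufferedBatchWriter(batch_rows=3)
for i in range(7):
    writer.append({"summary_index": i})
writer.close()

batch_sizes = [len(batch) for batch in writer.flushed]
print(batch_sizes)
```

With seven rows and a batch size of three, two full batches are flushed while appending and the final single row is flushed by close.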
- class dae.parquet.schema2.parquet_io.VariantsParquetWriter(out_dir: str, annotation_schema: list[AttributeInfo], partition_descriptor: PartitionDescriptor, *, serializer: VariantsDataSerializer | None = None, bucket_index: int = 1, row_group_size: int = 10000, include_reference: bool = True, filesystem: AbstractFileSystem | None = None)[source]
Bases: object
Provide functions for storing variants in a parquet dataset.
- write_dataset(full_variants_iterator: Iterator[tuple[SummaryVariant, list[FamilyVariant]]]) list[str] [source]
Write variants to a partitioned parquet dataset.
- write_summary_variant(summary_variant: SummaryVariant, attributes: dict[str, Any] | None = None, sj_base_index: int | None = None) None [source]
Write a single summary variant to the correct parquet file.
dae.parquet.schema2.serializers module
- class dae.parquet.schema2.serializers.AlleleParquetSerializer(annotation_schema: list[AttributeInfo], extra_attributes: list[str] | None = None)[source]
Bases: object
Serialize a collection of alleles.
- ENUM_PROPERTIES: ClassVar[dict[str, Any]] = {'allele_in_roles': <enum 'Role'>, 'allele_in_sexes': <enum 'Sex'>, 'allele_in_statuses': <enum 'Status'>, 'inheritance_in_members': <enum 'Inheritance'>, 'transmission_type': <enum 'TransmissionType'>, 'variant_type': <enum 'Type'>}
- FAMILY_ALLELE_BASE_SCHEMA: ClassVar[dict[str, Any]] = {'allele_in_members': ListType(list<item: string>), 'allele_in_roles': DataType(int32), 'allele_in_sexes': DataType(int8), 'allele_in_statuses': DataType(int8), 'allele_index': DataType(int32), 'bucket_index': DataType(int32), 'family_id': DataType(string), 'family_index': DataType(int32), 'inheritance_in_members': DataType(int16), 'is_denovo': DataType(int8), 'sj_index': DataType(int64), 'summary_index': DataType(int32)}
- SUMMARY_ALLELE_BASE_SCHEMA: ClassVar[dict[str, Any]] = {'af_allele_count': DataType(int32), 'af_allele_freq': DataType(float), 'af_parents_called_count': DataType(int32), 'af_parents_called_percent': DataType(float), 'allele_index': DataType(int32), 'bucket_index': DataType(int32), 'chromosome': DataType(string), 'effect_gene': ListType(list<item: struct<effect_gene_symbols: string, effect_types: string>>), 'end_position': DataType(int32), 'family_alleles_count': DataType(int32), 'family_variants_count': DataType(int32), 'position': DataType(int32), 'reference': DataType(string), 'seen_as_denovo': DataType(bool), 'seen_in_status': DataType(int8), 'sj_index': DataType(int64), 'summary_index': DataType(int32), 'transmission_type': DataType(int8), 'variant_type': DataType(int8)}
- build_family_allele_batch_dict(allele: FamilyAllele, family_variant_data: bytes) dict[str, list[Any]] [source]
Build a batch of family allele data in the form of a dict.
- build_summary_allele_batch_dict(allele: SummaryAllele, summary_variant_data: bytes) dict[str, Any] [source]
Build a batch of summary allele data in the form of a dict.
- classmethod build_summary_schema(annotation_schema: list[AttributeInfo]) Schema [source]
Build the schema for the summary alleles.
- property schema_family: Schema
Lazily construct and return the schema for the family alleles.
- property schema_summary: Schema
Lazily construct and return the schema for the summary alleles.
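The lazy construction described for schema_family and schema_summary can be sketched as a property that builds its value on first access and reuses it afterwards. The sketch below is an assumption about the general pattern, not the library's code: LazySchemaHolder, build_count, and the plain-dict "schema" are hypothetical, and every annotation column is given the type name "string" purely for illustration.

```python
from typing import Optional


class LazySchemaHolder:
    """Hypothetical holder that builds a schema-like dict only on first access."""

    def __init__(self, base_fields: dict[str, str], annotation_fields: list[str]) -> None:
        self._base_fields = base_fields
        self._annotation_fields = annotation_fields
        self._schema_summary: Optional[dict[str, str]] = None
        self.build_count = 0  # counts how many times the schema was built

    @property
    def schema_summary(self) -> dict[str, str]:
        if self._schema_summary is None:
            self.build_count += 1
            schema = dict(self._base_fields)
            # Assumption for this sketch: annotation columns are all strings.
            for name in self._annotation_fields:
                schema[name] = "string"
            self._schema_summary = schema
        return self._schema_summary


holder = LazySchemaHolder(
    base_fields={"chromosome": "string", "position": "int32"},
    annotation_fields=["gene_score"],  # hypothetical annotation attribute
)
first = holder.schema_summary
second = holder.schema_summary
print(first, holder.build_count)
```

Both accesses return the same object, and the schema is built only once.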
dae.parquet.schema2.variant_serializers module
- class dae.parquet.schema2.variant_serializers.JsonVariantsDataSerializer(metadata: dict[str, Any] | None = None)[source]
Bases: VariantsDataSerializer
Serialize family and summary alleles to json.
- deserialize_family_record(data: bytes) dict[str, Any] [source]
Deserialize a family allele from a byte string.
- deserialize_summary_record(data: bytes) list[dict[str, Any]] [source]
Deserialize a summary allele from a byte string.
- serialize_family(variant: FamilyVariant) bytes [source]
Serialize a family variant part to a byte string.
- serialize_summary(variant: SummaryVariant) bytes [source]
Serialize a summary allele to a byte string.
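The JSON round trip implied by the serialize and deserialize methods above can be illustrated with the standard library alone. This sketch does not use FamilyVariant or SummaryVariant, and the actual JSON layout produced by JsonVariantsDataSerializer is not documented here: serialize_record, deserialize_record, and the example records are hypothetical. The shapes follow the documented return types, a dict for a family record and a list of dicts for a summary record.

```python
import json
from typing import Any


def serialize_record(record: Any) -> bytes:
    """Encode a JSON-compatible record as a UTF-8 byte string."""
    return json.dumps(record, sort_keys=True).encode("utf-8")


def deserialize_record(data: bytes) -> Any:
    """Decode a UTF-8 byte string back into Python objects."""
    return json.loads(data.decode("utf-8"))


# Hypothetical family record: a single dict.
family_record = {"family_id": "f1", "summary_index": 7, "allele_index": 1}

# Hypothetical summary record: one dict per allele.
summary_record = [
    {"chromosome": "chr1", "position": 12345, "allele_index": 0},
    {"chromosome": "chr1", "position": 12345, "allele_index": 1},
]

family_bytes = serialize_record(family_record)
summary_bytes = serialize_record(summary_record)

family_roundtrip = deserialize_record(family_bytes)
summary_roundtrip = deserialize_record(summary_bytes)
print(family_roundtrip, len(summary_roundtrip))
```

A round trip like this only preserves values that JSON can represent, so enum members or byte strings would need to be converted to integers or strings before encoding.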
- class dae.parquet.schema2.variant_serializers.VariantsDataSerializer(metadata: dict[str, Any] | None = None)[source]
Bases: ABC
Interface for serializing family and summary alleles.
- static build_serializer(metadata: dict[str, Any] | None = None) VariantsDataSerializer [source]
Build a serializer based on the metadata.
- abstract deserialize_family_record(data: bytes) dict[str, Any] [source]
Deserialize a family allele from a byte string.
- abstract deserialize_summary_record(data: bytes) list[dict[str, Any]] [source]
Deserialize a summary allele from a byte string.
- abstract serialize_family(variant: FamilyVariant) bytes [source]
Serialize a family variant part to a byte string.
- abstract serialize_summary(variant: SummaryVariant) bytes [source]
Serialize a summary allele to a byte string.
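The combination of an abstract interface and a static build_serializer factory can be sketched as follows. This is an illustration of the pattern under stated assumptions, not the library's code: RecordSerializer and JsonRecordSerializer are hypothetical classes, and the "serializer_type" metadata key is invented for this example, since the documentation above does not say which metadata entries drive the choice of serializer.

```python
import json
from abc import ABC, abstractmethod
from typing import Any, Optional


class RecordSerializer(ABC):
    """Hypothetical interface, loosely modelled on the one documented above."""

    def __init__(self, metadata: Optional[dict[str, Any]] = None) -> None:
        self.metadata = metadata or {}

    @staticmethod
    def build_serializer(metadata: Optional[dict[str, Any]] = None) -> "RecordSerializer":
        # Assumption: a hypothetical "serializer_type" key selects the
        # concrete class, defaulting to JSON when it is absent.
        kind = (metadata or {}).get("serializer_type", "json")
        if kind == "json":
            return JsonRecordSerializer(metadata)
        raise ValueError(f"unsupported serializer type: {kind!r}")

    @abstractmethod
    def serialize(self, record: dict[str, Any]) -> bytes:
        """Serialize a record to a byte string."""

    @abstractmethod
    def deserialize(self, data: bytes) -> dict[str, Any]:
        """Deserialize a record from a byte string."""


class JsonRecordSerializer(RecordSerializer):
    """Concrete serializer that stores records as UTF-8 encoded JSON."""

    def serialize(self, record: dict[str, Any]) -> bytes:
        return json.dumps(record).encode("utf-8")

    def deserialize(self, data: bytes) -> dict[str, Any]:
        return json.loads(data.decode("utf-8"))


serializer = RecordSerializer.build_serializer()
payload = serializer.serialize({"family_id": "f1"})
restored = serializer.deserialize(payload)
print(type(serializer).__name__, restored)
```

Callers depend only on the abstract interface, so adding another format means adding one subclass and one branch in the factory.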