dae.genomic_resources.gene_models package

Submodules

dae.genomic_resources.gene_models.gene_models module

class dae.genomic_resources.gene_models.gene_models.GeneModels(resource: GenomicResource)[source]

Bases: ResourceConfigValidationMixin

Manage and query gene model data from genomic resources.

This class provides access to gene models loaded from various file formats (GTF, refFlat, refSeq, CCDS, etc.) and offers efficient querying by gene name or genomic location.

The class maintains three internal data structures: - transcript_models: Dict mapping transcript IDs to TranscriptModel objects - gene_models: Dict mapping gene names to lists of TranscriptModel objects - _tx_index: IntervalTree index for fast location-based queries

Attributes:

resource (GenomicResource): The genomic resource containing gene models. config (dict): Validated configuration from the resource. reference_genome_id (str | None): ID of the reference genome. gene_models (dict[str, list[TranscriptModel]]): Gene name to

transcript models mapping.

transcript_models (dict[str, TranscriptModel]): Transcript ID to: transcript model mapping.

Example:

>>> from dae.genomic_resources.gene_models.gene_models_factory import \
...     build_gene_models_from_file
>>> gene_models = build_gene_models_from_file("genes.gtf")
>>> gene_models.load()
>>> # Query by gene name
>>> tp53_transcripts = gene_models.gene_models_by_gene_name("TP53")
>>> # Query by location
>>> transcripts = gene_models.gene_models_by_location("chr17", 7676592)

Note:

The gene models must be loaded using the load() method before queries can be performed. The class is thread-safe for concurrent access.

close() → None[source]

gene_models_by_gene_name(name: str) → list[TranscriptModel] | None[source]

Retrieve all transcript models for a specific gene.

Args:

name (str): The gene name/symbol to search for.

Returns:

list[TranscriptModel] | None: List of transcript models for the: gene, or None if the gene is not found.

Example:

>>> transcripts = gene_models.gene_models_by_gene_name("BRCA1")
>>> if transcripts:
...     print(f"BRCA1 has {len(transcripts)} transcript variants")

gene_models_by_location(chrom: str, pos_begin: int, pos_end: int | None = None) → list[TranscriptModel][source]

Retrieve transcripts overlapping a genomic position or region.

This method uses an interval tree index for efficient querying of transcripts by genomic coordinates.

Args:

chrom (str): The chromosome name (e.g., “chr1”, “17”). pos_begin (int): The start position (1-based, inclusive). pos_end (int | None): The end position (1-based, inclusive).

If None, queries a single position.

Returns:

list[TranscriptModel]: List of TranscriptModel objects whose: transcript regions overlap the query position/region. Returns empty list if no overlaps found.

Example:

>>> # Query single position
>>> models = gene_models.gene_models_by_location("chr17", 7676592)
>>> # Query region
>>> models = gene_models.gene_models_by_location(
...     "chr17", 7661779, 7687550
... )
>>> for tm in models:
...     print(f"{tm.gene}: {tm.tr_id}")

Note:

Positions are swapped automatically if pos_end < pos_begin.

gene_names() → list[str][source]

Get list of all gene names in the loaded gene models.

Returns:

list[str]: List of gene names (symbols).

Example:

>>> gene_models.load()
>>> genes = gene_models.gene_names()
>>> print(f"Loaded {len(genes)} genes")

static get_schema() → dict[str, Any][source]: Return schema to be used for config validation.

has_chromosome(chrom: str) → bool[source]

Check if a chromosome has any gene models.

Args:

chrom (str): The chromosome name to check.

Returns:

bool: True if the chromosome has gene models, False otherwise.

Example:

>>> if gene_models.has_chromosome("chr1"):
...     print("Chromosome 1 has gene annotations")

is_loaded() → bool[source]

Check if gene models have been loaded.

Returns:

bool: True if load() has been called and completed, False otherwise.

Example:

>>> if not gene_models.is_loaded():
...     gene_models.load()

static join_gene_models(*gene_models: GeneModels) → GeneModels[source]

Merge multiple gene models into a single GeneModels object.

This combines transcript models from multiple sources into one unified gene models object.

Args:

*gene_models (GeneModels): Two or more GeneModels objects to merge.

Returns:

GeneModels: New GeneModels object containing all transcripts.

Raises:

ValueError: If fewer than 2 gene models provided.

Example:

>>> gm1 = build_gene_models_from_file("genes1.gtf")
>>> gm2 = build_gene_models_from_file("genes2.gtf")
>>> merged = GeneModels.join_gene_models(gm1, gm2)

Note:

Transcript IDs should be unique across all input gene models.

load() → GeneModels[source]

Load gene models from the genomic resource.

This method parses the gene model file and builds internal indexes for efficient querying. It is thread-safe and will only load once.

Returns:

GeneModels: Self, for method chaining.

Example:

>>> gene_models = build_gene_models_from_file("genes.gtf")
>>> gene_models.load()
>>> num_transcripts = len(gene_models.transcript_models)
>>> print(f"Loaded {num_transcripts} transcripts")

Note:

Calling load() multiple times is safe - subsequent calls return immediately if already loaded.

property resource_id: str

dae.genomic_resources.gene_models.gene_models.create_regions_from_genes(gene_models: GeneModels, genes: list[str], regions: list[Region] | None, gene_regions_heuristic_cutoff: int = 20, gene_regions_heuristic_extend: int = 20000) → list[Region] | None[source]

Produce a list of regions from given gene symbols.

If given a list of regions, will merge the newly-created regions from the genes with the provided ones.

dae.genomic_resources.gene_models.gene_models_factory module

dae.genomic_resources.gene_models.gene_models_factory.build_gene_models_from_file(file_name: str, file_format: str | None = None, gene_mapping_file_name: str | None = None, chrom_mapping_file_name: str | None = None) → GeneModels[source]: Load gene models from local filesystem.

dae.genomic_resources.gene_models.gene_models_factory.build_gene_models_from_resource(resource: GenomicResource | None) → GeneModels[source]: Load gene models from a genomic resource.

dae.genomic_resources.gene_models.gene_models_factory.build_gene_models_from_resource_id(resource_id: str, grr: GenomicResourceRepo | None = None) → GeneModels[source]: Load gene models from a genomic resource id.

dae.genomic_resources.gene_models.parsers module

dae.genomic_resources.gene_models.parsers.get_parser(fileformat: str) → Callable[[IO, dict[str, str] | None, int | None], dict[str, TranscriptModel] | None] | None[source]: Get gene models parser based on file format.

dae.genomic_resources.gene_models.parsers.infer_gene_model_parser(infile: IO, file_format: str | None = None) → str | None[source]: Infer gene models file format.

dae.genomic_resources.gene_models.parsers.load_gene_mapping(resource: GenomicResource) → dict[str, str][source]

Load alternative names for genes.

Assume that its first line has two column names

dae.genomic_resources.gene_models.parsers.load_transcript_models(resource: GenomicResource) → dict[str, TranscriptModel][source]: Load gene models.

dae.genomic_resources.gene_models.parsers.parse_ccds_gene_models_format(infile: IO, gene_mapping: dict[str, str] | None = None, nrows: int | None = None) → dict[str, TranscriptModel] | None[source]: Parse CCDS gene models file format.

dae.genomic_resources.gene_models.parsers.parse_default_gene_models_format(infile: IO, gene_mapping: dict[str, str] | None = None, nrows: int | None = None) → dict[str, TranscriptModel] | None[source]: Parse default gene models file format.

dae.genomic_resources.gene_models.parsers.parse_gtf_gene_models_format(infile: IO, gene_mapping: dict[str, str] | None = None, nrows: int | None = None) → dict[str, TranscriptModel] | None[source]: Parse GTF gene models file format.

dae.genomic_resources.gene_models.parsers.parse_known_gene_models_format(infile: IO, gene_mapping: dict[str, str] | None = None, nrows: int | None = None) → dict[str, TranscriptModel] | None[source]: Parse known gene models file format.

dae.genomic_resources.gene_models.parsers.parse_raw(infile: IO, expected_columns: list[str], nrows: int | None = None, comment: str | None = None) → DataFrame | None[source]: Parse raw gene models data based on expected columns.

dae.genomic_resources.gene_models.parsers.parse_ref_flat_gene_models_format(infile: IO, gene_mapping: dict[str, str] | None = None, nrows: int | None = None) → dict[str, TranscriptModel] | None[source]: Parse refFlat gene models file format.

dae.genomic_resources.gene_models.parsers.parse_ref_seq_gene_models_format(infile: IO, gene_mapping: dict[str, str] | None = None, nrows: int | None = None) → dict[str, TranscriptModel] | None[source]: Parse refSeq gene models file format.

dae.genomic_resources.gene_models.parsers.parse_ucscgenepred_models_format(infile: IO, gene_mapping: dict[str, str] | None = None, nrows: int | None = None) → dict[str, TranscriptModel] | None[source]

Parse UCSC gene prediction models file fomrat.

table genePred “A gene prediction.”

( string name; “Name of gene” string chrom; “Chromosome name” char[1] strand; “+ or - for strand” uint txStart; “Transcription start position” uint txEnd; “Transcription end position” uint cdsStart; “Coding region start” uint cdsEnd; “Coding region end” uint exonCount; “Number of exons” uint[exonCount] exonStarts; “Exon start positions” uint[exonCount] exonEnds; “Exon end positions” )

table genePredExt “A gene prediction with some additional info.”

( string name; “Name of gene (usually transcript_id from

GTF)”

string chrom; “Chromosome name” char[1] strand; “+ or - for strand” uint txStart; “Transcription start position” uint txEnd; “Transcription end position” uint cdsStart; “Coding region start” uint cdsEnd; “Coding region end” uint exonCount; “Number of exons” uint[exonCount] exonStarts; “Exon start positions” uint[exonCount] exonEnds; “Exon end positions” int score; “Score” string name2; “Alternate name (e.g. gene_id from GTF)” string cdsStartStat; “Status of CDS start annotation (none,

unknown, incomplete, or complete)”

string cdsEndStat; “Status of CDS end annotation
(none, unknown, incomplete, or complete)”

lstring exonFrames; “Exon frame offsets {0,1,2}” )

dae.genomic_resources.gene_models.parsers.probe_columns(infile: IO, expected_columns: list[str], comment: str | None = None) → bool[source]: Probe gene models file based on expected columns.

dae.genomic_resources.gene_models.parsers.probe_header(infile: IO, expected_columns: list[str], comment: str | None = None) → bool[source]: Probe gene models file header based on expected columns.

dae.genomic_resources.gene_models.serialization module

dae.genomic_resources.gene_models.serialization.build_gtf_record(transcript: TranscriptModel, feature: str, start: int, stop: int, attrs: str) → tuple[tuple[str, int, int, int], str][source]: Build an indexed GTF format record for a feature.

dae.genomic_resources.gene_models.serialization.calc_frame_for_gtf_cds_feature(transcript: TranscriptModel, region: BedRegion) → int[source]: Calculate frame for the given feature.

dae.genomic_resources.gene_models.serialization.collect_gtf_cds_regions(strand: str, cds_regions: list[BedRegion]) → list[BedRegion][source]: Returns list of all regions that represent the CDS.

dae.genomic_resources.gene_models.serialization.collect_gtf_start_codon_regions(strand: str, cds_regions: list[BedRegion]) → list[BedRegion][source]: Returns list of all regions that represent the start codon.

dae.genomic_resources.gene_models.serialization.collect_gtf_stop_codon_regions(strand: str, cds_regions: list[BedRegion]) → list[BedRegion][source]: Returns list of all regions that represent the stop codon.

dae.genomic_resources.gene_models.serialization.find_exon_cds_region_for_gtf_cds_feature(transcript: TranscriptModel, region: BedRegion) → tuple[Exon, BedRegion][source]: Find exon and CDS region that contains the given feature.

dae.genomic_resources.gene_models.serialization.gene_models_to_gtf(gene_models: GeneModels, *, sort_by_position: bool = True) → StringIO[source]: Output a GTF format string representation.

dae.genomic_resources.gene_models.serialization.get_exon_number_for(transcript: TranscriptModel, start: int, stop: int) → int[source]

Get the exon number for a genomic region.

Returns the exon number (in transcript order) that overlaps the given genomic coordinates.

Args:

start (int): Start position (1-based). stop (int): End position (1-based).

Returns:

int: Exon number (1-based) in transcript orientation.: Returns 0 if no overlapping exon found.

Example:

>>> # For a region within the second exon of a + strand transcript
>>> exon_num = transcript.get_exon_number_for(1000, 1050)
>>> print(f"Region is in exon {exon_num}")

Note:

Exon numbering is strand-aware: - Positive strand: numbered 5’ to 3’ (exon 1 is first) - Negative strand: numbered 5’ to 3’ (exon 1 is last in genome)

dae.genomic_resources.gene_models.serialization.gtf_canonical_index(index: tuple[str, int, int, int]) → tuple[source]

dae.genomic_resources.gene_models.serialization.save_as_default_gene_models(gene_models: GeneModels, output_filename: str, *, gzipped: bool = True) → None[source]: Save gene models in a file in default file format.

dae.genomic_resources.gene_models.serialization.transcript_to_gtf(transcript: TranscriptModel) → list[tuple[tuple[str, int, int, int], str]][source]: Output an indexed list of GTF-formatted features of a transcript.

dae.genomic_resources.gene_models.transcript_models module

class dae.genomic_resources.gene_models.transcript_models.Exon(start: int, stop: int, frame: int | None = None)[source]

Bases: object

Represent a single exon within a transcript.

An exon is a segment of a transcript that is retained in the mature RNA after splicing. This class stores the genomic coordinates and codon reading frame of an exon.

Attributes:

start (int): Genomic start position (1-based, inclusive). stop (int): Genomic end position (1-based, inclusive). frame (int | None): Codon reading frame (0, 1, or 2) for coding

exons, or None for non-coding exons or UTR regions.

Example:

>>> exon = Exon(start=100, stop=200, frame=0)
>>> print(exon.start, exon.stop)
100 200
>>> exon.contains((150, 160))
True

contains(region: tuple[int, int]) → bool[source]

Check if this exon fully contains a genomic region.

Args:

region (tuple[int, int]): A (start, stop) position tuple to check.

Returns:

bool: True if the region is fully contained within this exon.

Example:

>>> exon = Exon(100, 200)
>>> exon.contains((150, 160))
True
>>> exon.contains((50, 250))
False

property frame: int | None

property start: int

property stop: int

class dae.genomic_resources.gene_models.transcript_models.TranscriptModel(gene: str, tr_id: str, tr_name: str, chrom: str, strand: str, *, tx: tuple[int, int], cds: tuple[int, int], exons: list[Exon] | None = None, attributes: dict[str, Any] | None = None)[source]

Bases: object

Represent a transcript with all its structural features.

A transcript model contains complete information about a gene transcript, including its genomic location, exon structure, coding regions, and additional attributes from the source annotation.

Attributes:

gene (str): Gene name/symbol (e.g., “TP53”). tr_id (str): Transcript identifier, unique within the gene models. tr_name (str): Original transcript name from source annotation. chrom (str): Chromosome name (e.g., “chr17”, “17”). strand (str): Strand orientation (“+” or “-“). tx (tuple[int, int]): Transcript start and end positions

(1-based, closed interval).

cds (tuple[int, int]): Coding sequence start and end positions: (1-based, closed interval). For non-coding transcripts, cds[0] >= cds[1].

exons (list[Exon]): List of Exon objects in genomic order. attributes (dict[str, Any]): Additional annotation attributes

(e.g., gene_biotype, gene_version).

Example:

>>> from dae.genomic_resources.gene_models.transcript_models import \
...     TranscriptModel, Exon
>>> tm = TranscriptModel(
...     gene="TP53",
...     tr_id="ENST00000269305",
...     tr_name="TP53-201",
...     chrom="17",
...     strand="-",
...     tx=(7661779, 7687550),
...     cds=(7668402, 7687490),
...     exons=[Exon(7661779, 7661822), Exon(7668402, 7669690)],
...     attributes={"gene_biotype": "protein_coding"},
... )
>>> print(f"Coding: {tm.is_coding()}")
Coding: True
>>> regions = tm.cds_regions()
>>> print(f"CDS has {len(regions)} regions")

Note:

All coordinates use 1-based, closed intervals
CDS includes both start and stop codons
Exons should be in genomic order (not necessarily 5’ to 3’)

all_regions(ss_extend: int = 0, prom: int = 0) → list[BedRegion][source]

Get all transcript regions with optional extensions.

Returns all exonic regions, optionally extending into splice sites and promoter regions.

Args:

ss_extend (int): Number of bases to extend into splice sites: at coding exon boundaries. Default is 0.
prom (int): Number of bases to extend into promoter region: upstream of transcription start. Default is 0.

Returns:

list[BedRegion]: List of all transcript regions, potentially: extended based on parameters.

Example:

>>> # Basic exonic regions
>>> regions = transcript.all_regions()
>>> # With splice site extension
>>> regions = transcript.all_regions(ss_extend=3)
>>> # With promoter region
>>> regions = transcript.all_regions(prom=2000)

Note:

Promoter extension is strand-aware: extends upstream of the transcription start (before first exon for +, after last for -).

calc_frames() → list[int][source]

Calculate reading frame for each exon.

Computes the codon reading frame (0, 1, or 2) for each exon based on the CDS coordinates and strand orientation.

Returns:

list[int]: Reading frame for each exon. Values are:

0, 1, or 2 for coding exons (bases into current codon)
-1 for non-coding exons or non-coding transcripts

Example:

>>> frames = transcript.calc_frames()
>>> for exon, frame in zip(transcript.exons, frames):
...     if frame >= 0:
...         print(f"Exon {exon.start}-{exon.stop}: frame {frame}")

Note:

Frame calculation is strand-aware and considers exon order. Use update_frames() to set frame attribute on Exon objects.

cds_len() → int[source]

cds_regions(ss_extend: int = 0) → list[BedRegion][source]

Compute coding sequence (CDS) regions.

Extracts the portions of exons that contain coding sequence, optionally extending into splice sites.

Args:

ss_extend (int): Number of bases to extend into splice sites: at exon boundaries. Default is 0 (no extension).

Returns:

list[BedRegion]: List of BedRegion objects representing CDS: segments. Returns empty list for non-coding transcripts.

Example:

>>> cds_regions = transcript.cds_regions()
>>> for region in cds_regions:
...     print(f"{region.chrom}:{region.start}-{region.stop}")
>>> # With splice site extension
>>> extended = transcript.cds_regions(ss_extend=3)

Note:

CDS regions include both start and stop codons. Use collect_gtf_cds_regions() from serialization module to exclude the stop codon for GTF format.

is_coding() → bool[source]

Check if this transcript is protein-coding.

Returns:

bool: True if the transcript has a coding region (CDS),: False for non-coding transcripts.

Example:

>>> if transcript.is_coding():
...     cds_len = transcript.cds_len()
...     print(f"CDS length: {cds_len}bp")

test_frames() → bool[source]

Verify that exon frames are correctly set.

Compares the frame attribute of each exon with the calculated frame to ensure consistency.

Returns:

bool: True if all exon frames match calculated values,: False otherwise.

Example:

>>> transcript.update_frames()
>>> assert transcript.test_frames()

total_len() → int[source]

update_frames() → None[source]

Update the frame attribute of all exons.

Calculates reading frames using calc_frames() and updates the frame attribute of each Exon object.

Example:

>>> transcript.update_frames()
>>> for exon in transcript.exons:
...     print(f"Exon frame: {exon.frame}")

Note:

This modifies the Exon objects in place.

utr3_len() → int[source]

utr3_regions() → list[BedRegion][source]

Get 3’ untranslated region (3’ UTR) segments.

The 3’ UTR extends from the stop codon (translation end) to the transcription end. Strand orientation is considered.

Returns:

list[BedRegion]: List of 3’ UTR regions. Returns empty list for: non-coding transcripts.

Example:

>>> utr3 = transcript.utr3_regions()
>>> utr3_length = sum(r.stop - r.start + 1 for r in utr3)
>>> print(f"3' UTR: {utr3_length}bp")

Note:

For positive strand: regions after CDS end. For negative strand: regions before CDS start.

utr5_len() → int[source]

utr5_regions() → list[BedRegion][source]

Get 5’ untranslated region (5’ UTR) segments.

The 5’ UTR extends from the transcription start to the start codon (translation start). Strand orientation is considered.

Returns:

list[BedRegion]: List of 5’ UTR regions. Returns empty list for: non-coding transcripts.

Example:

>>> utr5 = transcript.utr5_regions()
>>> utr5_length = sum(r.stop - r.start + 1 for r in utr5)
>>> print(f"5' UTR: {utr5_length}bp")

Note:

For positive strand: regions before CDS start. For negative strand: regions after CDS end.

Module contents

class dae.genomic_resources.gene_models.Exon(start: int, stop: int, frame: int | None = None)[source]

Bases: object

Represent a single exon within a transcript.

An exon is a segment of a transcript that is retained in the mature RNA after splicing. This class stores the genomic coordinates and codon reading frame of an exon.

Attributes:

start (int): Genomic start position (1-based, inclusive). stop (int): Genomic end position (1-based, inclusive). frame (int | None): Codon reading frame (0, 1, or 2) for coding

exons, or None for non-coding exons or UTR regions.

Example:

>>> exon = Exon(start=100, stop=200, frame=0)
>>> print(exon.start, exon.stop)
100 200
>>> exon.contains((150, 160))
True

contains(region: tuple[int, int]) → bool[source]

Check if this exon fully contains a genomic region.

Args:

region (tuple[int, int]): A (start, stop) position tuple to check.

Returns:

bool: True if the region is fully contained within this exon.

Example:

>>> exon = Exon(100, 200)
>>> exon.contains((150, 160))
True
>>> exon.contains((50, 250))
False

property frame: int | None

property start: int

property stop: int

class dae.genomic_resources.gene_models.GeneModels(resource: GenomicResource)[source]

Bases: ResourceConfigValidationMixin

Manage and query gene model data from genomic resources.

This class provides access to gene models loaded from various file formats (GTF, refFlat, refSeq, CCDS, etc.) and offers efficient querying by gene name or genomic location.

The class maintains three internal data structures: - transcript_models: Dict mapping transcript IDs to TranscriptModel objects - gene_models: Dict mapping gene names to lists of TranscriptModel objects - _tx_index: IntervalTree index for fast location-based queries

Attributes:

resource (GenomicResource): The genomic resource containing gene models. config (dict): Validated configuration from the resource. reference_genome_id (str | None): ID of the reference genome. gene_models (dict[str, list[TranscriptModel]]): Gene name to

transcript models mapping.

transcript_models (dict[str, TranscriptModel]): Transcript ID to: transcript model mapping.

Example:

>>> from dae.genomic_resources.gene_models.gene_models_factory import \
...     build_gene_models_from_file
>>> gene_models = build_gene_models_from_file("genes.gtf")
>>> gene_models.load()
>>> # Query by gene name
>>> tp53_transcripts = gene_models.gene_models_by_gene_name("TP53")
>>> # Query by location
>>> transcripts = gene_models.gene_models_by_location("chr17", 7676592)

Note:

The gene models must be loaded using the load() method before queries can be performed. The class is thread-safe for concurrent access.

close() → None[source]

gene_models_by_gene_name(name: str) → list[TranscriptModel] | None[source]

Retrieve all transcript models for a specific gene.

Args:

name (str): The gene name/symbol to search for.

Returns:

list[TranscriptModel] | None: List of transcript models for the: gene, or None if the gene is not found.

Example:

>>> transcripts = gene_models.gene_models_by_gene_name("BRCA1")
>>> if transcripts:
...     print(f"BRCA1 has {len(transcripts)} transcript variants")

gene_models_by_location(chrom: str, pos_begin: int, pos_end: int | None = None) → list[TranscriptModel][source]

Retrieve transcripts overlapping a genomic position or region.

This method uses an interval tree index for efficient querying of transcripts by genomic coordinates.

Args:

chrom (str): The chromosome name (e.g., “chr1”, “17”). pos_begin (int): The start position (1-based, inclusive). pos_end (int | None): The end position (1-based, inclusive).

If None, queries a single position.

Returns:

list[TranscriptModel]: List of TranscriptModel objects whose: transcript regions overlap the query position/region. Returns empty list if no overlaps found.

Example:

>>> # Query single position
>>> models = gene_models.gene_models_by_location("chr17", 7676592)
>>> # Query region
>>> models = gene_models.gene_models_by_location(
...     "chr17", 7661779, 7687550
... )
>>> for tm in models:
...     print(f"{tm.gene}: {tm.tr_id}")

Note:

Positions are swapped automatically if pos_end < pos_begin.

gene_names() → list[str][source]

Get list of all gene names in the loaded gene models.

Returns:

list[str]: List of gene names (symbols).

Example:

>>> gene_models.load()
>>> genes = gene_models.gene_names()
>>> print(f"Loaded {len(genes)} genes")

static get_schema() → dict[str, Any][source]: Return schema to be used for config validation.

has_chromosome(chrom: str) → bool[source]

Check if a chromosome has any gene models.

Args:

chrom (str): The chromosome name to check.

Returns:

bool: True if the chromosome has gene models, False otherwise.

Example:

>>> if gene_models.has_chromosome("chr1"):
...     print("Chromosome 1 has gene annotations")

is_loaded() → bool[source]

Check if gene models have been loaded.

Returns:

bool: True if load() has been called and completed, False otherwise.

Example:

>>> if not gene_models.is_loaded():
...     gene_models.load()

static join_gene_models(*gene_models: GeneModels) → GeneModels[source]

Merge multiple gene models into a single GeneModels object.

This combines transcript models from multiple sources into one unified gene models object.

Args:

*gene_models (GeneModels): Two or more GeneModels objects to merge.

Returns:

GeneModels: New GeneModels object containing all transcripts.

Raises:

ValueError: If fewer than 2 gene models provided.

Example:

>>> gm1 = build_gene_models_from_file("genes1.gtf")
>>> gm2 = build_gene_models_from_file("genes2.gtf")
>>> merged = GeneModels.join_gene_models(gm1, gm2)

Note:

Transcript IDs should be unique across all input gene models.

load() → GeneModels[source]

Load gene models from the genomic resource.

This method parses the gene model file and builds internal indexes for efficient querying. It is thread-safe and will only load once.

Returns:

GeneModels: Self, for method chaining.

Example:

>>> gene_models = build_gene_models_from_file("genes.gtf")
>>> gene_models.load()
>>> num_transcripts = len(gene_models.transcript_models)
>>> print(f"Loaded {num_transcripts} transcripts")

Note:

Calling load() multiple times is safe - subsequent calls return immediately if already loaded.

property resource_id: str

class dae.genomic_resources.gene_models.TranscriptModel(gene: str, tr_id: str, tr_name: str, chrom: str, strand: str, *, tx: tuple[int, int], cds: tuple[int, int], exons: list[Exon] | None = None, attributes: dict[str, Any] | None = None)[source]

Bases: object

Represent a transcript with all its structural features.

A transcript model contains complete information about a gene transcript, including its genomic location, exon structure, coding regions, and additional attributes from the source annotation.

Attributes:

gene (str): Gene name/symbol (e.g., “TP53”). tr_id (str): Transcript identifier, unique within the gene models. tr_name (str): Original transcript name from source annotation. chrom (str): Chromosome name (e.g., “chr17”, “17”). strand (str): Strand orientation (“+” or “-“). tx (tuple[int, int]): Transcript start and end positions

(1-based, closed interval).

cds (tuple[int, int]): Coding sequence start and end positions: (1-based, closed interval). For non-coding transcripts, cds[0] >= cds[1].

exons (list[Exon]): List of Exon objects in genomic order. attributes (dict[str, Any]): Additional annotation attributes

(e.g., gene_biotype, gene_version).

Example:

>>> from dae.genomic_resources.gene_models.transcript_models import \
...     TranscriptModel, Exon
>>> tm = TranscriptModel(
...     gene="TP53",
...     tr_id="ENST00000269305",
...     tr_name="TP53-201",
...     chrom="17",
...     strand="-",
...     tx=(7661779, 7687550),
...     cds=(7668402, 7687490),
...     exons=[Exon(7661779, 7661822), Exon(7668402, 7669690)],
...     attributes={"gene_biotype": "protein_coding"},
... )
>>> print(f"Coding: {tm.is_coding()}")
Coding: True
>>> regions = tm.cds_regions()
>>> print(f"CDS has {len(regions)} regions")

Note:

All coordinates use 1-based, closed intervals
CDS includes both start and stop codons
Exons should be in genomic order (not necessarily 5’ to 3’)

all_regions(ss_extend: int = 0, prom: int = 0) → list[BedRegion][source]

Get all transcript regions with optional extensions.

Returns all exonic regions, optionally extending into splice sites and promoter regions.

Args:

ss_extend (int): Number of bases to extend into splice sites: at coding exon boundaries. Default is 0.
prom (int): Number of bases to extend into promoter region: upstream of transcription start. Default is 0.

Returns:

list[BedRegion]: List of all transcript regions, potentially: extended based on parameters.

Example:

>>> # Basic exonic regions
>>> regions = transcript.all_regions()
>>> # With splice site extension
>>> regions = transcript.all_regions(ss_extend=3)
>>> # With promoter region
>>> regions = transcript.all_regions(prom=2000)

Note:

Promoter extension is strand-aware: extends upstream of the transcription start (before first exon for +, after last for -).

calc_frames() → list[int][source]

Calculate reading frame for each exon.

Computes the codon reading frame (0, 1, or 2) for each exon based on the CDS coordinates and strand orientation.

Returns:

list[int]: Reading frame for each exon. Values are:

0, 1, or 2 for coding exons (bases into current codon)
-1 for non-coding exons or non-coding transcripts

Example:

>>> frames = transcript.calc_frames()
>>> for exon, frame in zip(transcript.exons, frames):
...     if frame >= 0:
...         print(f"Exon {exon.start}-{exon.stop}: frame {frame}")

Note:

Frame calculation is strand-aware and considers exon order. Use update_frames() to set frame attribute on Exon objects.

cds_len() → int[source]

cds_regions(ss_extend: int = 0) → list[BedRegion][source]

Compute coding sequence (CDS) regions.

Extracts the portions of exons that contain coding sequence, optionally extending into splice sites.

Args:

ss_extend (int): Number of bases to extend into splice sites: at exon boundaries. Default is 0 (no extension).

Returns:

list[BedRegion]: List of BedRegion objects representing CDS: segments. Returns empty list for non-coding transcripts.

Example:

>>> cds_regions = transcript.cds_regions()
>>> for region in cds_regions:
...     print(f"{region.chrom}:{region.start}-{region.stop}")
>>> # With splice site extension
>>> extended = transcript.cds_regions(ss_extend=3)

Note:

CDS regions include both start and stop codons. Use collect_gtf_cds_regions() from serialization module to exclude the stop codon for GTF format.

is_coding() → bool[source]

Check if this transcript is protein-coding.

Returns:

bool: True if the transcript has a coding region (CDS),: False for non-coding transcripts.

Example:

>>> if transcript.is_coding():
...     cds_len = transcript.cds_len()
...     print(f"CDS length: {cds_len}bp")

test_frames() → bool[source]

Verify that exon frames are correctly set.

Compares the frame attribute of each exon with the calculated frame to ensure consistency.

Returns:

bool: True if all exon frames match calculated values,: False otherwise.

Example:

>>> transcript.update_frames()
>>> assert transcript.test_frames()

total_len() → int[source]

update_frames() → None[source]

Update the frame attribute of all exons.

Calculates reading frames using calc_frames() and updates the frame attribute of each Exon object.

Example:

>>> transcript.update_frames()
>>> for exon in transcript.exons:
...     print(f"Exon frame: {exon.frame}")

Note:

This modifies the Exon objects in place.

utr3_len() → int[source]

utr3_regions() → list[BedRegion][source]

Get 3’ untranslated region (3’ UTR) segments.

The 3’ UTR extends from the stop codon (translation end) to the transcription end. Strand orientation is considered.

Returns:

list[BedRegion]: List of 3’ UTR regions. Returns empty list for: non-coding transcripts.

Example:

>>> utr3 = transcript.utr3_regions()
>>> utr3_length = sum(r.stop - r.start + 1 for r in utr3)
>>> print(f"3' UTR: {utr3_length}bp")

Note:

For positive strand: regions after CDS end. For negative strand: regions before CDS start.

utr5_len() → int[source]

utr5_regions() → list[BedRegion][source]

Get 5’ untranslated region (5’ UTR) segments.

The 5’ UTR extends from the transcription start to the start codon (translation start). Strand orientation is considered.

Returns:

list[BedRegion]: List of 5’ UTR regions. Returns empty list for: non-coding transcripts.

Example:

>>> utr5 = transcript.utr5_regions()
>>> utr5_length = sum(r.stop - r.start + 1 for r in utr5)
>>> print(f"5' UTR: {utr5_length}bp")

Note:

For positive strand: regions before CDS start. For negative strand: regions after CDS end.

dae.genomic_resources.gene_models.build_gene_models_from_file(file_name: str, file_format: str | None = None, gene_mapping_file_name: str | None = None, chrom_mapping_file_name: str | None = None) → GeneModels[source]: Load gene models from local filesystem.

dae.genomic_resources.gene_models.build_gene_models_from_resource(resource: GenomicResource | None) → GeneModels[source]: Load gene models from a genomic resource.

dae.genomic_resources.gene_models.build_gene_models_from_resource_id(resource_id: str, grr: GenomicResourceRepo | None = None) → GeneModels[source]: Load gene models from a genomic resource id.