dae.genomic_resources.gene_models package

Submodules

dae.genomic_resources.gene_models.gene_models module

class dae.genomic_resources.gene_models.gene_models.Exon(start: int, stop: int, frame: int | None = None)[source]

Bases: object

Provides exon model.

contains(region: tuple[int, int]) bool[source]
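
The containment check has no docstring, so the following is only a minimal sketch of one plausible reading. `ExonSketch` is a hypothetical stand-in, not the library class, and it assumes closed coordinates and that `contains` means the region lies fully inside the exon.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExonSketch:
    """Hypothetical stand-in for Exon; not the library implementation."""

    start: int
    stop: int
    frame: Optional[int] = None

    def contains(self, region: tuple[int, int]) -> bool:
        # Assumption: closed [start, stop] coordinates and full containment.
        begin, end = region
        return self.start <= begin and end <= self.stop


exon = ExonSketch(start=100, stop=200)
inside = exon.contains((120, 180))
outside = exon.contains((90, 150))
```
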
class dae.genomic_resources.gene_models.gene_models.GeneModels(resource: GenomicResource)[source]

Bases: ResourceConfigValidationMixin

Provides class for gene models.

add_transcript_model(transcript_model: TranscriptModel) None[source]

Add a transcript model to the gene models.

gene_models_by_gene_name(name: str) list[TranscriptModel] | None[source]
gene_models_by_location(chrom: str, pos1: int, pos2: int | None = None) list[TranscriptModel][source]

Retrieve TranscriptModel objects based on genomic position(s).

Args:
    chrom (str): The chromosome name.
    pos1 (int): The starting genomic position.
    pos2 (Optional[int]): The ending genomic position. If not provided, only models that contain pos1 will be returned.

Returns:
    list[TranscriptModel]: A list of TranscriptModel objects that match the given location criteria.
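
The lookup described above can be sketched with plain tuples. `Tx` and `by_location` are hypothetical names used for illustration only. The sketch assumes closed coordinates and that, when `pos2` is given, any transcript overlapping the interval counts as a match; the library may apply a different rule.

```python
from typing import NamedTuple, Optional


class Tx(NamedTuple):
    """Hypothetical stand-in carrying only the fields the lookup needs."""

    tr_id: str
    chrom: str
    tx: tuple[int, int]


def by_location(
    models: list[Tx], chrom: str, pos1: int, pos2: Optional[int] = None
) -> list[Tx]:
    # With a single position the query interval collapses to one base,
    # so the overlap test becomes "transcript contains pos1".
    if pos2 is None:
        pos2 = pos1
    lo, hi = min(pos1, pos2), max(pos1, pos2)
    return [
        m for m in models
        if m.chrom == chrom and m.tx[0] <= hi and lo <= m.tx[1]
    ]


models = [
    Tx("t1", "chr1", (100, 500)),
    Tx("t2", "chr1", (600, 900)),
    Tx("t3", "chr2", (100, 500)),
]
single = [m.tr_id for m in by_location(models, "chr1", 300)]
ranged = [m.tr_id for m in by_location(models, "chr1", 450, 650)]
nothing = by_location(models, "chr1", 550)
```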

gene_names() list[str][source]
static get_schema() dict[str, Any][source]

Return schema to be used for config validation.

is_loaded() bool[source]
load() GeneModels[source]

Load gene models.

relabel_chromosomes(relabel: dict[str, str] | None = None, map_file: str | None = None) None[source]

Relabel chromosomes in gene model.

reset() None[source]
property resource_id: str
update_indexes() None[source]
class dae.genomic_resources.gene_models.gene_models.TranscriptModel(gene: str, tr_id: str, tr_name: str, chrom: str, strand: str, tx: tuple[int, int], cds: tuple[int, int], exons: list[Exon] | None = None, attributes: dict[str, Any] | None = None)[source]

Bases: object

Provides transcript model.

all_regions(ss_extend: int = 0, prom: int = 0) list[BedRegion][source]

Build and return list of regions.

calc_frames() list[int][source]

Calculate codon frames.

cds_len() int[source]
cds_regions(ss_extend: int = 0) list[BedRegion][source]

Compute CDS regions.
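
As an illustration of what computing CDS regions involves, the sketch below clips each exon to the coding interval. It is not the library code: `clip_exons_to_cds` is a hypothetical helper, coordinates are assumed to be closed, and the `ss_extend` parameter is left out because its exact meaning is not documented here.

```python
def clip_exons_to_cds(
    exons: list[tuple[int, int]], cds: tuple[int, int]
) -> list[tuple[int, int]]:
    """Return the parts of each exon that fall inside the CDS interval."""
    cds_start, cds_stop = cds
    regions = []
    for start, stop in exons:
        lo, hi = max(start, cds_start), min(stop, cds_stop)
        if lo <= hi:  # the exon overlaps the coding interval
            regions.append((lo, hi))
    return regions


exons = [(100, 200), (300, 400), (500, 600)]
coding = clip_exons_to_cds(exons, cds=(150, 550))
# Total coding length under closed coordinates.
coding_len = sum(stop - start + 1 for start, stop in coding)
```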

get_exon_number_for(start: int, stop: int) int[source]
is_coding() bool[source]
test_frames() bool[source]
total_len() int[source]
update_frames() None[source]

Update codon frames.

utr3_len() int[source]
utr3_regions() list[BedRegion][source]

Build and return list of UTR3 regions.

utr5_len() int[source]
utr5_regions() list[BedRegion][source]

Build list of UTR5 regions.
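
The 5' UTR lies before the coding start in transcription direction, so its genomic side depends on the strand. The sketch below shows that idea with plain tuples. `utr5_sketch` is a hypothetical helper, closed coordinates are assumed, and the library's own handling of edge cases (for example non-coding transcripts) may differ.

```python
def utr5_sketch(
    strand: str, exons: list[tuple[int, int]], cds: tuple[int, int]
) -> list[tuple[int, int]]:
    """Return exon parts upstream of the CDS in transcription direction."""
    cds_start, cds_stop = cds
    regions = []
    for start, stop in exons:
        if strand == "+":
            # Upstream means genomic positions below the CDS start.
            lo, hi = start, min(stop, cds_start - 1)
        else:
            # On the minus strand, upstream means above the CDS end.
            lo, hi = max(start, cds_stop + 1), stop
        if lo <= hi:
            regions.append((lo, hi))
    return regions


exons = [(100, 200), (300, 400), (500, 600)]
plus_utr5 = utr5_sketch("+", exons, cds=(150, 550))
minus_utr5 = utr5_sketch("-", exons, cds=(150, 550))
```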

dae.genomic_resources.gene_models.gene_models.build_gene_models_from_file(file_name: str, file_format: str | None = None, gene_mapping_file_name: str | None = None) GeneModels[source]

Load gene models from local filesystem.
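
A possible usage pattern, built only from the signatures listed in this reference. `summarize_gene_models` is a hypothetical helper. Whether the returned object is already loaded is not stated here, so the sketch checks `is_loaded()` before calling `load()`.

```python
def summarize_gene_models(file_name: str) -> dict[str, int]:
    """Count transcripts per gene in a gene models file."""
    # Imported lazily so the sketch can be read without the package installed.
    from dae.genomic_resources.gene_models import build_gene_models_from_file

    gene_models = build_gene_models_from_file(file_name)
    # Assumption: loading may still be required after construction.
    if not gene_models.is_loaded():
        gene_models.load()

    counts: dict[str, int] = {}
    for name in gene_models.gene_names():
        # gene_models_by_gene_name may return None for unknown names.
        transcripts = gene_models.gene_models_by_gene_name(name) or []
        counts[name] = len(transcripts)
    return counts
```

Calling `summarize_gene_models` with the path to a gene models file would return a mapping from gene name to transcript count.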

dae.genomic_resources.gene_models.gene_models.build_gene_models_from_resource(resource: GenomicResource | None) GeneModels[source]

Load gene models from a genomic resource.

dae.genomic_resources.gene_models.gene_models.build_gene_models_from_resource_id(resource_id: str, grr: GenomicResourceRepo | None = None) GeneModels[source]
dae.genomic_resources.gene_models.gene_models.create_regions_from_genes(gene_models: GeneModels, genes: list[str], regions: list[Region] | None, gene_regions_heuristic_cutoff: int = 20, gene_regions_heuristic_extend: int = 20000) list[Region] | None[source]

Produce a list of regions from given gene symbols.

If given a list of regions, will merge the newly-created regions from the genes with the provided ones.

dae.genomic_resources.gene_models.gene_models.join_gene_models(*gene_models: GeneModels) GeneModels[source]

Join multiple gene models into a single gene models object.

dae.genomic_resources.gene_models.parsing module

dae.genomic_resources.gene_models.parsing.get_parser(fileformat: str) Callable[[GeneModels, IO, dict[str, str] | None, int | None], bool] | None[source]

Get gene models parser based on file format.

dae.genomic_resources.gene_models.parsing.infer_gene_model_parser(gene_models: GeneModels, infile: IO, file_format: str | None = None) str | None[source]

Infer gene models file format.

dae.genomic_resources.gene_models.parsing.load_gene_mapping(infile: IO) dict[str, str][source]

Load alternative names for genes.

Assumes that the first line of the file holds the two column names.
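
A minimal sketch of reading such a mapping file. `load_mapping_sketch` is a hypothetical helper, and the whitespace separator and the direction of the mapping (first column to second column) are assumptions rather than documented behaviour.

```python
import io
from typing import IO


def load_mapping_sketch(infile: IO) -> dict[str, str]:
    """Read a two-column file with a header line into a dict."""
    header = infile.readline().split()
    if len(header) != 2:
        raise ValueError("expected a header line with two column names")
    mapping: dict[str, str] = {}
    for line in infile:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        mapping[fields[0]] = fields[1]
    return mapping


sample = io.StringIO("from\tto\nNM_001\tGENE1\nNM_002\tGENE2\n")
mapping = load_mapping_sketch(sample)
```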

dae.genomic_resources.gene_models.parsing.load_gene_models(gene_models: GeneModels) GeneModels[source]

Load gene models.

dae.genomic_resources.gene_models.parsing.parse_ccds_gene_models_format(gene_models: GeneModels, infile: IO, gene_mapping: dict[str, str] | None = None, nrows: int | None = None) bool[source]

Parse CCDS gene models file format.

dae.genomic_resources.gene_models.parsing.parse_default_gene_models_format(gene_models: GeneModels, infile: IO, gene_mapping: dict[str, str] | None = None, nrows: int | None = None) bool[source]

Parse default gene models file format.

dae.genomic_resources.gene_models.parsing.parse_gtf_gene_models_format(gene_models: GeneModels, infile: IO, gene_mapping: dict[str, str] | None = None, nrows: int | None = None) bool[source]

Parse GTF gene models file format.

dae.genomic_resources.gene_models.parsing.parse_known_gene_models_format(gene_models: GeneModels, infile: IO, gene_mapping: dict[str, str] | None = None, nrows: int | None = None) bool[source]

Parse known gene models file format.

dae.genomic_resources.gene_models.parsing.parse_raw(infile: IO, expected_columns: list[str], nrows: int | None = None, comment: str | None = None) DataFrame | None[source]

Parse raw gene models data based on expected columns.

dae.genomic_resources.gene_models.parsing.parse_ref_flat_gene_models_format(gene_models: GeneModels, infile: IO, gene_mapping: dict[str, str] | None = None, nrows: int | None = None) bool[source]

Parse refFlat gene models file format.

dae.genomic_resources.gene_models.parsing.parse_ref_seq_gene_models_format(gene_models: GeneModels, infile: IO, gene_mapping: dict[str, str] | None = None, nrows: int | None = None) bool[source]

Parse refSeq gene models file format.

dae.genomic_resources.gene_models.parsing.parse_ucscgenepred_models_format(gene_models: GeneModels, infile: IO, gene_mapping: dict[str, str] | None = None, nrows: int | None = None) bool[source]

Parse UCSC gene prediction models file format.

table genePred “A gene prediction.”
    (
    string name;                 “Name of gene”
    string chrom;                “Chromosome name”
    char[1] strand;              “+ or - for strand”
    uint txStart;                “Transcription start position”
    uint txEnd;                  “Transcription end position”
    uint cdsStart;               “Coding region start”
    uint cdsEnd;                 “Coding region end”
    uint exonCount;              “Number of exons”
    uint[exonCount] exonStarts;  “Exon start positions”
    uint[exonCount] exonEnds;    “Exon end positions”
    )

table genePredExt “A gene prediction with some additional info.”
    (
    string name;                 “Name of gene (usually transcript_id from GTF)”
    string chrom;                “Chromosome name”
    char[1] strand;              “+ or - for strand”
    uint txStart;                “Transcription start position”
    uint txEnd;                  “Transcription end position”
    uint cdsStart;               “Coding region start”
    uint cdsEnd;                 “Coding region end”
    uint exonCount;              “Number of exons”
    uint[exonCount] exonStarts;  “Exon start positions”
    uint[exonCount] exonEnds;    “Exon end positions”
    int score;                   “Score”
    string name2;                “Alternate name (e.g. gene_id from GTF)”
    string cdsStartStat;         “Status of CDS start annotation (none, unknown, incomplete, or complete)”
    string cdsEndStat;           “Status of CDS end annotation (none, unknown, incomplete, or complete)”
    lstring exonFrames;          “Exon frame offsets {0,1,2}”
    )
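
To make the genePred layout concrete, the sketch below splits one tab-separated record into the first ten fields of the table above. `parse_genepred_line` is a hypothetical helper and not the library parser. It assumes the exon lists are comma-separated with an optional trailing comma, and the sample record is invented for illustration.

```python
def parse_genepred_line(line: str) -> dict:
    """Parse the ten basic genePred fields from one tab-separated line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 10:
        raise ValueError("a genePred record needs at least 10 fields")

    name, chrom, strand = fields[:3]
    tx_start, tx_end, cds_start, cds_end, exon_count = (
        int(value) for value in fields[3:8]
    )
    # Exon lists usually end with a trailing comma; drop empty items.
    exon_starts = [int(v) for v in fields[8].split(",") if v]
    exon_ends = [int(v) for v in fields[9].split(",") if v]
    if not (len(exon_starts) == len(exon_ends) == exon_count):
        raise ValueError("exonCount does not match the exon lists")

    return {
        "name": name,
        "chrom": chrom,
        "strand": strand,
        "tx": (tx_start, tx_end),
        "cds": (cds_start, cds_end),
        "exons": list(zip(exon_starts, exon_ends)),
    }


# Invented sample record, used only to exercise the sketch.
line = "NM_000001\tchr1\t+\t1000\t5000\t1200\t4800\t2\t1000,3000,\t2000,5000,\n"
record = parse_genepred_line(line)
```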

dae.genomic_resources.gene_models.parsing.probe_columns(infile: IO, expected_columns: list[str], comment: str | None = None) bool[source]

Probe gene models file based on expected columns.

dae.genomic_resources.gene_models.parsing.probe_header(infile: IO, expected_columns: list[str], comment: str | None = None) bool[source]

Probe gene models file header based on expected columns.
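
One way a header probe could work is sketched below. `header_matches` is a hypothetical helper. It assumes tab-separated columns, treats `comment` as a line prefix to skip, and rewinds the stream afterwards so a later parser sees the whole file; none of these details are confirmed by this reference.

```python
import io
from typing import IO, Optional


def header_matches(
    infile: IO, expected_columns: list[str], comment: Optional[str] = None
) -> bool:
    """Check whether the first non-comment line equals the expected header."""
    position = infile.tell()
    try:
        while True:
            line = infile.readline()
            if not line:
                return False  # reached end of file without a header
            if comment and line.startswith(comment):
                continue
            return line.rstrip("\n").split("\t") == expected_columns
    finally:
        # Rewind so the caller can read the file from where it started.
        infile.seek(position)


stream = io.StringIO("#comment\nchrom\tstart\tend\nchr1\t1\t2\n")
matches = header_matches(stream, ["chrom", "start", "end"], comment="#")
mismatch = header_matches(stream, ["chrom", "pos"], comment="#")
```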

dae.genomic_resources.gene_models.serialization module

dae.genomic_resources.gene_models.serialization.build_gtf_record(transcript: TranscriptModel, feature: str, start: int, stop: int, attrs: str) tuple[tuple[str, int, int, int], str][source]

Build an indexed GTF format record for a feature.

dae.genomic_resources.gene_models.serialization.calc_frame_for_gtf_cds_feature(transcript: TranscriptModel, region: BedRegion) int[source]

Calculate frame for the given feature.
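
In GTF, the frame of a CDS feature is the number of bases to skip at its 5' end to reach the first complete codon. The sketch below computes that value for every CDS segment of a transcript. `gtf_frames` is a hypothetical helper working on plain closed-coordinate tuples, whereas the library function takes a transcript and a single `BedRegion`.

```python
def gtf_frames(
    strand: str, cds_regions: list[tuple[int, int]]
) -> dict[tuple[int, int], int]:
    """Map each CDS segment to its GTF frame (0, 1 or 2)."""
    # Walk the segments in transcription order.
    ordered = sorted(cds_regions, reverse=(strand == "-"))
    frames = {}
    consumed = 0  # coding bases seen before the current segment
    for start, stop in ordered:
        frames[(start, stop)] = (3 - consumed % 3) % 3
        consumed += stop - start + 1
    return frames


plus = gtf_frames("+", [(100, 104), (200, 210)])
minus = gtf_frames("-", [(100, 104), (200, 210)])
```

In the plus-strand case the first segment holds five bases, so the second segment starts one base before the next codon boundary and gets frame 1.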

dae.genomic_resources.gene_models.serialization.collect_cds_regions(transcript: TranscriptModel) tuple[list[BedRegion], list[BedRegion], list[BedRegion]][source]

Returns a tuple of start codon regions, normal coding regions and stop codon regions for a given transcript.

Deprecated: This function was split into multiple specialized functions.

dae.genomic_resources.gene_models.serialization.collect_gtf_cds_regions(strand: str, cds_regions: list[BedRegion]) list[BedRegion][source]

Returns list of all regions that represent the CDS.

dae.genomic_resources.gene_models.serialization.collect_gtf_start_codon_regions(strand: str, cds_regions: list[BedRegion]) list[BedRegion][source]

Returns list of all regions that represent the start codon.
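
The start codon covers the first three coding bases in transcription direction and can be split across two CDS segments. The sketch below illustrates that with plain closed-coordinate tuples. `start_codon_sketch` is a hypothetical helper, and the library version operates on `BedRegion` objects and may treat edge cases differently.

```python
def start_codon_sketch(
    strand: str, cds_regions: list[tuple[int, int]]
) -> list[tuple[int, int]]:
    """Return the region(s) covering the first three coding bases."""
    ordered = sorted(cds_regions, reverse=(strand == "-"))
    remaining = 3
    result = []
    for start, stop in ordered:
        if remaining == 0:
            break
        take = min(remaining, stop - start + 1)
        if strand == "+":
            result.append((start, start + take - 1))
        else:
            # On the minus strand the codon starts at the high coordinate.
            result.append((stop - take + 1, stop))
        remaining -= take
    return result


whole = start_codon_sketch("+", [(100, 200), (300, 400)])
split_plus = start_codon_sketch("+", [(100, 101), (200, 300)])
split_minus = start_codon_sketch("-", [(100, 200), (300, 301)])
```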

dae.genomic_resources.gene_models.serialization.collect_gtf_stop_codon_regions(strand: str, cds_regions: list[BedRegion]) list[BedRegion][source]

Returns list of all regions that represent the stop codon.

dae.genomic_resources.gene_models.serialization.find_exon_cds_region_for_gtf_cds_feature(transcript: TranscriptModel, region: BedRegion) tuple[Exon, BedRegion][source]

Find exon and CDS region that contains the given feature.

dae.genomic_resources.gene_models.serialization.gene_models_to_gtf(gene_models: GeneModels, *, sort_by_position: bool = True) StringIO[source]

Output a GTF format string representation.
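
For orientation, a GTF line has nine tab-separated columns: seqname, source, feature, start, end, score, strand, frame and attributes. The sketch below formats one such line. `gtf_line` and the `"sketch"` source label are hypothetical, and the exact attribute set the library writes is not specified here.

```python
from typing import Optional


def gtf_line(
    chrom: str,
    source: str,
    feature: str,
    start: int,
    stop: int,
    strand: str,
    frame: Optional[int],
    attributes: dict[str, str],
) -> str:
    """Format a single GTF record; score is left as '.'."""
    attrs = " ".join(f'{key} "{value}";' for key, value in attributes.items())
    frame_field = "." if frame is None else str(frame)
    columns = [
        chrom, source, feature, str(start), str(stop),
        ".", strand, frame_field, attrs,
    ]
    return "\t".join(columns)


line = gtf_line(
    "chr1", "sketch", "CDS", 150, 200, "+", 0,
    {"gene_id": "G1", "transcript_id": "T1"},
)
```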

dae.genomic_resources.gene_models.serialization.gtf_canonical_index(index: tuple[str, int, int, int]) tuple[source]
dae.genomic_resources.gene_models.serialization.save_as_default_gene_models(gene_models: GeneModels, output_filename: str, *, gzipped: bool = True) None[source]

Save gene models in a file in default file format.

dae.genomic_resources.gene_models.serialization.transcript_to_gtf(transcript: TranscriptModel) list[tuple[tuple[str, int, int, int], str]][source]

Output an indexed list of GTF-formatted features of a transcript.

Module contents

class dae.genomic_resources.gene_models.Exon(start: int, stop: int, frame: int | None = None)[source]

Bases: object

Provides exon model.

contains(region: tuple[int, int]) bool[source]
class dae.genomic_resources.gene_models.GeneModels(resource: GenomicResource)[source]

Bases: ResourceConfigValidationMixin

Provides class for gene models.

add_transcript_model(transcript_model: TranscriptModel) None[source]

Add a transcript model to the gene models.

gene_models_by_gene_name(name: str) list[TranscriptModel] | None[source]
gene_models_by_location(chrom: str, pos1: int, pos2: int | None = None) list[TranscriptModel][source]

Retrieve TranscriptModel objects based on genomic position(s).

Args:
    chrom (str): The chromosome name.
    pos1 (int): The starting genomic position.
    pos2 (Optional[int]): The ending genomic position. If not provided, only models that contain pos1 will be returned.

Returns:
    list[TranscriptModel]: A list of TranscriptModel objects that match the given location criteria.

gene_names() list[str][source]
static get_schema() dict[str, Any][source]

Return schema to be used for config validation.

is_loaded() bool[source]
load() GeneModels[source]

Load gene models.

relabel_chromosomes(relabel: dict[str, str] | None = None, map_file: str | None = None) None[source]

Relabel chromosomes in gene model.

reset() None[source]
property resource_id: str
update_indexes() None[source]
class dae.genomic_resources.gene_models.TranscriptModel(gene: str, tr_id: str, tr_name: str, chrom: str, strand: str, tx: tuple[int, int], cds: tuple[int, int], exons: list[Exon] | None = None, attributes: dict[str, Any] | None = None)[source]

Bases: object

Provides transcript model.

all_regions(ss_extend: int = 0, prom: int = 0) list[BedRegion][source]

Build and return list of regions.

calc_frames() list[int][source]

Calculate codon frames.

cds_len() int[source]
cds_regions(ss_extend: int = 0) list[BedRegion][source]

Compute CDS regions.

get_exon_number_for(start: int, stop: int) int[source]
is_coding() bool[source]
test_frames() bool[source]
total_len() int[source]
update_frames() None[source]

Update codon frames.

utr3_len() int[source]
utr3_regions() list[BedRegion][source]

Build and return list of UTR3 regions.

utr5_len() int[source]
utr5_regions() list[BedRegion][source]

Build list of UTR5 regions.

dae.genomic_resources.gene_models.build_gene_models_from_file(file_name: str, file_format: str | None = None, gene_mapping_file_name: str | None = None) GeneModels[source]

Load gene models from local filesystem.

dae.genomic_resources.gene_models.build_gene_models_from_resource(resource: GenomicResource | None) GeneModels[source]

Load gene models from a genomic resource.

dae.genomic_resources.gene_models.build_gene_models_from_resource_id(resource_id: str, grr: GenomicResourceRepo | None = None) GeneModels[source]
dae.genomic_resources.gene_models.create_regions_from_genes(gene_models: GeneModels, genes: list[str], regions: list[Region] | None, gene_regions_heuristic_cutoff: int = 20, gene_regions_heuristic_extend: int = 20000) list[Region] | None[source]

Produce a list of regions from given gene symbols.

If given a list of regions, will merge the newly-created regions from the genes with the provided ones.

dae.genomic_resources.gene_models.gene_models_to_gtf(gene_models: GeneModels, *, sort_by_position: bool = True) StringIO[source]

Output a GTF format string representation.

dae.genomic_resources.gene_models.join_gene_models(*gene_models: GeneModels) GeneModels[source]

Join multiple gene models into a single gene models object.

dae.genomic_resources.gene_models.save_as_default_gene_models(gene_models: GeneModels, output_filename: str, *, gzipped: bool = True) None[source]

Save gene models in a file in default file format.