dae.genomic_resources.gene_models package
Subpackages
- dae.genomic_resources.gene_models.tests package
- Submodules
- dae.genomic_resources.gene_models.tests.test_gene_models module
- dae.genomic_resources.gene_models.tests.test_gene_models_gtf_serialization module
- dae.genomic_resources.gene_models.tests.test_gene_models_impl module
- dae.genomic_resources.gene_models.tests.test_gene_models_resource module
- Module contents
Submodules
dae.genomic_resources.gene_models.gene_models module
- class dae.genomic_resources.gene_models.gene_models.Exon(start: int, stop: int, frame: int | None = None)[source]
Bases: object
Provides exon model.
- class dae.genomic_resources.gene_models.gene_models.GeneModels(resource: GenomicResource)[source]
Bases: ResourceConfigValidationMixin
Provides class for gene models.
- add_transcript_model(transcript_model: TranscriptModel) None [source]
Add a transcript model to the gene models.
- gene_models_by_gene_name(name: str) list[TranscriptModel] | None [source]
- gene_models_by_location(chrom: str, pos1: int, pos2: int | None = None) list[TranscriptModel] [source]
Retrieve TranscriptModel objects based on genomic position(s).
- Args:
chrom (str): The chromosome name.
pos1 (int): The starting genomic position.
pos2 (Optional[int]): The ending genomic position. If not provided, only models that contain pos1 will be returned.
- Returns:
list[TranscriptModel]: A list of TranscriptModel objects that match the given location criteria.
- load() GeneModels [source]
Load gene models.
- relabel_chromosomes(relabel: dict[str, str] | None = None, map_file: str | None = None) None [source]
Relabel chromosomes in gene model.
- property resource_id: str
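The position-based lookup documented for gene_models_by_location can be illustrated with a minimal, library-independent sketch. The `Transcript` class and `transcripts_by_location` function are hypothetical stand-ins, and the closed-interval overlap test on `tx` is an assumption; the actual GeneModels implementation may index and compare coordinates differently.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    """Hypothetical stand-in for a transcript; not the library's TranscriptModel."""

    tr_id: str
    chrom: str
    tx: tuple  # (start, end); assumed closed on both ends


def transcripts_by_location(transcripts, chrom, pos1, pos2=None):
    """Return transcripts on `chrom` whose tx span overlaps [pos1, pos2].

    When pos2 is omitted the query collapses to the single position pos1,
    so only transcripts containing pos1 are returned.
    """
    if pos2 is None:
        pos2 = pos1
    start, end = min(pos1, pos2), max(pos1, pos2)
    return [
        t for t in transcripts
        if t.chrom == chrom and t.tx[0] <= end and start <= t.tx[1]
    ]


# Made-up transcripts, for illustration only.
models = [
    Transcript("tx1", "chr1", (100, 500)),
    Transcript("tx2", "chr1", (400, 900)),
    Transcript("tx3", "chr2", (100, 500)),
]
point_hits = [t.tr_id for t in transcripts_by_location(models, "chr1", 450)]
range_hits = [t.tr_id for t in transcripts_by_location(models, "chr1", 600, 1000)]
```

A linear scan keeps the sketch short; a real implementation would more likely use a per-chromosome index.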
- class dae.genomic_resources.gene_models.gene_models.TranscriptModel(gene: str, tr_id: str, tr_name: str, chrom: str, strand: str, tx: tuple[int, int], cds: tuple[int, int], exons: list[Exon] | None = None, attributes: dict[str, Any] | None = None)[source]
Bases: object
Provides transcript model.
- dae.genomic_resources.gene_models.gene_models.build_gene_models_from_file(file_name: str, file_format: str | None = None, gene_mapping_file_name: str | None = None) GeneModels [source]
Load gene models from local filesystem.
- dae.genomic_resources.gene_models.gene_models.build_gene_models_from_resource(resource: GenomicResource | None) GeneModels [source]
Load gene models from a genomic resource.
- dae.genomic_resources.gene_models.gene_models.build_gene_models_from_resource_id(resource_id: str, grr: GenomicResourceRepo | None = None) GeneModels [source]
- dae.genomic_resources.gene_models.gene_models.create_regions_from_genes(gene_models: GeneModels, genes: list[str], regions: list[Region] | None, gene_regions_heuristic_cutoff: int = 20, gene_regions_heuristic_extend: int = 20000) list[Region] | None [source]
Produce a list of regions from given gene symbols.
If given a list of regions, will merge the newly-created regions from the genes with the provided ones.
- dae.genomic_resources.gene_models.gene_models.join_gene_models(*gene_models: GeneModels) GeneModels [source]
Join multiple gene models into a single gene models object.
dae.genomic_resources.gene_models.parsing module
- dae.genomic_resources.gene_models.parsing.get_parser(fileformat: str) Callable[[GeneModels, IO, dict[str, str] | None, int | None], bool] | None [source]
Get gene models parser based on file format.
- dae.genomic_resources.gene_models.parsing.infer_gene_model_parser(gene_models: GeneModels, infile: IO, file_format: str | None = None) str | None [source]
Infer gene models file format.
- dae.genomic_resources.gene_models.parsing.load_gene_mapping(infile: IO) dict[str, str] [source]
Load alternative names for genes.
Assumes that the first line of the file contains the two column names.
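As a rough illustration of reading a mapping file whose first line holds two column names, the sketch below skips the header and builds a dict. `read_gene_mapping` is a hypothetical helper, not the library function; the whitespace-separated layout, the direction of the mapping (first column to second column), and the sample identifiers are all assumptions.

```python
import io


def read_gene_mapping(infile):
    """Read a two-column mapping, skipping the header line.

    Assumption: columns are whitespace-separated and map the name in the
    first column to the alternative name in the second column.
    """
    mapping = {}
    next(infile, None)  # discard the line holding the two column names
    for line in infile:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        mapping[fields[0]] = fields[1]
    return mapping


# Made-up content, for illustration only.
sample = io.StringIO("from\tto\nENSG0001\tGENE_A\nENSG0002\tGENE_B\n")
gene_mapping = read_gene_mapping(sample)
```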
- dae.genomic_resources.gene_models.parsing.load_gene_models(gene_models: GeneModels) GeneModels [source]
Load gene models.
- dae.genomic_resources.gene_models.parsing.parse_ccds_gene_models_format(gene_models: GeneModels, infile: IO, gene_mapping: dict[str, str] | None = None, nrows: int | None = None) bool [source]
Parse CCDS gene models file format.
- dae.genomic_resources.gene_models.parsing.parse_default_gene_models_format(gene_models: GeneModels, infile: IO, gene_mapping: dict[str, str] | None = None, nrows: int | None = None) bool [source]
Parse default gene models file format.
- dae.genomic_resources.gene_models.parsing.parse_gtf_gene_models_format(gene_models: GeneModels, infile: IO, gene_mapping: dict[str, str] | None = None, nrows: int | None = None) bool [source]
Parse GTF gene models file format.
- dae.genomic_resources.gene_models.parsing.parse_known_gene_models_format(gene_models: GeneModels, infile: IO, gene_mapping: dict[str, str] | None = None, nrows: int | None = None) bool [source]
Parse known gene models file format.
- dae.genomic_resources.gene_models.parsing.parse_raw(infile: IO, expected_columns: list[str], nrows: int | None = None, comment: str | None = None) DataFrame | None [source]
Parse raw gene models data based on expected columns.
- dae.genomic_resources.gene_models.parsing.parse_ref_flat_gene_models_format(gene_models: GeneModels, infile: IO, gene_mapping: dict[str, str] | None = None, nrows: int | None = None) bool [source]
Parse refFlat gene models file format.
- dae.genomic_resources.gene_models.parsing.parse_ref_seq_gene_models_format(gene_models: GeneModels, infile: IO, gene_mapping: dict[str, str] | None = None, nrows: int | None = None) bool [source]
Parse refSeq gene models file format.
- dae.genomic_resources.gene_models.parsing.parse_ucscgenepred_models_format(gene_models: GeneModels, infile: IO, gene_mapping: dict[str, str] | None = None, nrows: int | None = None) bool [source]
Parse UCSC gene prediction models file format.
table genePred “A gene prediction.”
(
string name; “Name of gene”
string chrom; “Chromosome name”
char[1] strand; “+ or - for strand”
uint txStart; “Transcription start position”
uint txEnd; “Transcription end position”
uint cdsStart; “Coding region start”
uint cdsEnd; “Coding region end”
uint exonCount; “Number of exons”
uint[exonCount] exonStarts; “Exon start positions”
uint[exonCount] exonEnds; “Exon end positions”
)
table genePredExt “A gene prediction with some additional info.”
(
string name; “Name of gene (usually transcript_id from GTF)”
string chrom; “Chromosome name”
char[1] strand; “+ or - for strand”
uint txStart; “Transcription start position”
uint txEnd; “Transcription end position”
uint cdsStart; “Coding region start”
uint cdsEnd; “Coding region end”
uint exonCount; “Number of exons”
uint[exonCount] exonStarts; “Exon start positions”
uint[exonCount] exonEnds; “Exon end positions”
int score; “Score”
string name2; “Alternate name (e.g. gene_id from GTF)”
string cdsStartStat; “Status of CDS start annotation (none, unknown, incomplete, or complete)”
string cdsEndStat; “Status of CDS end annotation (none, unknown, incomplete, or complete)”
lstring exonFrames; “Exon frame offsets {0,1,2}”
)
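The ten genePred columns listed above can be read from a single text line roughly as follows. This is a standalone sketch, not the parser shipped in this module: `parse_genepred_line` is a hypothetical name, and it assumes tab-separated fields and comma-separated exon lists that may carry a trailing comma. The sample record is made up.

```python
GENEPRED_COLUMNS = [
    "name", "chrom", "strand", "txStart", "txEnd",
    "cdsStart", "cdsEnd", "exonCount", "exonStarts", "exonEnds",
]
INT_COLUMNS = ("txStart", "txEnd", "cdsStart", "cdsEnd", "exonCount")


def parse_int_list(value):
    """Split a comma-separated list of integers, tolerating a trailing comma."""
    return [int(item) for item in value.split(",") if item]


def parse_genepred_line(line):
    """Turn one genePred line into a dict keyed by the column names above."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(GENEPRED_COLUMNS):
        raise ValueError(f"expected at least 10 fields, got {len(fields)}")
    # Extra genePredExt columns, if present, are ignored by zip().
    record = dict(zip(GENEPRED_COLUMNS, fields))
    for column in INT_COLUMNS:
        record[column] = int(record[column])
    starts = parse_int_list(record["exonStarts"])
    ends = parse_int_list(record["exonEnds"])
    if not (len(starts) == len(ends) == record["exonCount"]):
        raise ValueError("exonCount does not match the exon lists")
    record["exonStarts"] = starts
    record["exonEnds"] = ends
    record["exons"] = list(zip(starts, ends))
    return record


# Made-up record, for illustration only.
line = "tx_demo\tchr1\t+\t100\t900\t150\t850\t2\t100,500,\t400,900,\n"
record = parse_genepred_line(line)
```

Checking `exonCount` against both exon lists catches truncated lines early, before any transcript object would be built from them.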
dae.genomic_resources.gene_models.serialization module
- dae.genomic_resources.gene_models.serialization.build_gtf_record(transcript: TranscriptModel, feature: str, start: int, stop: int, attrs: str) tuple[tuple[str, int, int, int], str] [source]
Build an indexed GTF format record for a feature.
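For orientation, a GTF line has nine tab-separated columns: seqname, source, feature, start, end, score, strand, frame and attributes. The sketch below formats such a line and pairs it with a sortable key. `indexed_gtf_line` is a hypothetical helper; the three-element key `(chrom, start, stop)` is a simplification and does not reproduce the four-element index returned by build_gtf_record. The source label and attribute string are made up.

```python
def format_gtf_line(chrom, source, feature, start, stop, strand, attrs, frame="."):
    """Join the nine GTF columns with tabs; score is left as '.'."""
    columns = [
        chrom, source, feature, str(start), str(stop),
        ".", strand, str(frame), attrs,
    ]
    return "\t".join(columns)


def indexed_gtf_line(chrom, source, feature, start, stop, strand, attrs, frame="."):
    """Return (sort_key, line); the key layout is an assumption of this sketch."""
    line = format_gtf_line(chrom, source, feature, start, stop, strand, attrs, frame)
    return (chrom, start, stop), line


# Made-up feature, for illustration only.
key, gtf_line = indexed_gtf_line(
    "chr1", "demo", "exon", 100, 400, "+",
    'gene_id "GENE_A"; transcript_id "tx_demo";',
)
```

Keeping the key separate from the text lets records be sorted by position without re-parsing the formatted lines.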
- dae.genomic_resources.gene_models.serialization.calc_frame_for_gtf_cds_feature(transcript: TranscriptModel, region: BedRegion) int [source]
Calculate frame for the given feature.
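In GTF, the frame of a CDS feature is the number of bases to skip at its 5' end before the first complete codon. A minimal sketch of that rule is given below. `cds_frames` is a hypothetical function, not this module's implementation, and it assumes 1-based closed `(start, stop)` segments sorted by start and a CDS that begins with a complete codon.

```python
def cds_frames(cds_regions, strand):
    """Return the GTF frame of each CDS segment, in input order.

    Segments are walked in transcription order (reversed on the minus
    strand); each frame is derived from the coding bases already consumed.
    """
    ordered = cds_regions if strand == "+" else list(reversed(cds_regions))
    frames = []
    consumed = 0
    for start, stop in ordered:
        frames.append((3 - consumed % 3) % 3)
        consumed += stop - start + 1
    return frames if strand == "+" else list(reversed(frames))


# Made-up CDS segments of lengths 100, 50 and 60, for illustration only.
segments = [(101, 200), (301, 350), (401, 460)]
plus_frames = cds_frames(segments, "+")
minus_frames = cds_frames(segments, "-")
```

On the plus strand the second segment starts after 100 coding bases, one base into a codon, so two bases must be skipped and its frame is 2.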
- dae.genomic_resources.gene_models.serialization.collect_cds_regions(transcript: TranscriptModel) tuple[list[BedRegion], list[BedRegion], list[BedRegion]] [source]
Returns a tuple of start codon regions, normal coding regions and stop codon regions for a given transcript.
Deprecated: This function was split into multiple specialized functions.
- dae.genomic_resources.gene_models.serialization.collect_gtf_cds_regions(strand: str, cds_regions: list[BedRegion]) list[BedRegion] [source]
Returns list of all regions that represent the CDS.
- dae.genomic_resources.gene_models.serialization.collect_gtf_start_codon_regions(strand: str, cds_regions: list[BedRegion]) list[BedRegion] [source]
Returns list of all regions that represent the start codon.
- dae.genomic_resources.gene_models.serialization.collect_gtf_stop_codon_regions(strand: str, cds_regions: list[BedRegion]) list[BedRegion] [source]
Returns list of all regions that represent the stop codon.
- dae.genomic_resources.gene_models.serialization.find_exon_cds_region_for_gtf_cds_feature(transcript: TranscriptModel, region: BedRegion) tuple[Exon, BedRegion] [source]
Find exon and CDS region that contains the given feature.
- dae.genomic_resources.gene_models.serialization.gene_models_to_gtf(gene_models: GeneModels, *, sort_by_position: bool = True) StringIO [source]
Output a GTF format string representation.
- dae.genomic_resources.gene_models.serialization.gtf_canonical_index(index: tuple[str, int, int, int]) tuple [source]
- dae.genomic_resources.gene_models.serialization.save_as_default_gene_models(gene_models: GeneModels, output_filename: str, *, gzipped: bool = True) None [source]
Save gene models in a file in default file format.
- dae.genomic_resources.gene_models.serialization.transcript_to_gtf(transcript: TranscriptModel) list[tuple[tuple[str, int, int, int], str]] [source]
Output an indexed list of GTF-formatted features of a transcript.
Module contents
- class dae.genomic_resources.gene_models.Exon(start: int, stop: int, frame: int | None = None)[source]
Bases: object
Provides exon model.
- class dae.genomic_resources.gene_models.GeneModels(resource: GenomicResource)[source]
Bases: ResourceConfigValidationMixin
Provides class for gene models.
- add_transcript_model(transcript_model: TranscriptModel) None [source]
Add a transcript model to the gene models.
- gene_models_by_gene_name(name: str) list[TranscriptModel] | None [source]
- gene_models_by_location(chrom: str, pos1: int, pos2: int | None = None) list[TranscriptModel] [source]
Retrieve TranscriptModel objects based on genomic position(s).
- Args:
chrom (str): The chromosome name.
pos1 (int): The starting genomic position.
pos2 (Optional[int]): The ending genomic position. If not provided, only models that contain pos1 will be returned.
- Returns:
list[TranscriptModel]: A list of TranscriptModel objects that match the given location criteria.
- load() GeneModels [source]
Load gene models.
- relabel_chromosomes(relabel: dict[str, str] | None = None, map_file: str | None = None) None [source]
Relabel chromosomes in gene model.
- property resource_id: str
- class dae.genomic_resources.gene_models.TranscriptModel(gene: str, tr_id: str, tr_name: str, chrom: str, strand: str, tx: tuple[int, int], cds: tuple[int, int], exons: list[Exon] | None = None, attributes: dict[str, Any] | None = None)[source]
Bases: object
Provides transcript model.
- dae.genomic_resources.gene_models.build_gene_models_from_file(file_name: str, file_format: str | None = None, gene_mapping_file_name: str | None = None) GeneModels [source]
Load gene models from local filesystem.
- dae.genomic_resources.gene_models.build_gene_models_from_resource(resource: GenomicResource | None) GeneModels [source]
Load gene models from a genomic resource.
- dae.genomic_resources.gene_models.build_gene_models_from_resource_id(resource_id: str, grr: GenomicResourceRepo | None = None) GeneModels [source]
- dae.genomic_resources.gene_models.create_regions_from_genes(gene_models: GeneModels, genes: list[str], regions: list[Region] | None, gene_regions_heuristic_cutoff: int = 20, gene_regions_heuristic_extend: int = 20000) list[Region] | None [source]
Produce a list of regions from given gene symbols.
If given a list of regions, will merge the newly-created regions from the genes with the provided ones.
- dae.genomic_resources.gene_models.gene_models_to_gtf(gene_models: GeneModels, *, sort_by_position: bool = True) StringIO [source]
Output a GTF format string representation.
- dae.genomic_resources.gene_models.join_gene_models(*gene_models: GeneModels) GeneModels [source]
Join multiple gene models into a single gene models object.
- dae.genomic_resources.gene_models.save_as_default_gene_models(gene_models: GeneModels, output_filename: str, *, gzipped: bool = True) None [source]
Save gene models in a file in default file format.