gain.annotation package

Submodules

gain.annotation.annotatable module

class gain.annotation.annotatable.Annotatable(chrom: str, pos: int, pos_end: int, annotatable_type: Type)[source]

Bases: object

Base class for annotatables used in the annotation pipeline.

class Type(*values)[source]

Bases: Enum

Defines annotatable types.

COMPLEX = 5
LARGE_DELETION = 7
LARGE_DUPLICATION = 6
POSITION = 0
REGION = 1
SMALL_DELETION = 4
SMALL_INSERTION = 3
SUBSTITUTION = 2
static from_string(variant: str) Type[source]

Construct annotatable type from string argument.

property chrom: str
property chromosome: str
property end_position: int
static from_string(value: str) Annotatable[source]

Deserialize an Annotatable instance from a string value.

property pos: int
property pos_end: int
property position: int
abstractmethod to_dict() dict[source]

Serialize the annotatable to a dictionary.

static tokenize(value: str) tuple[str, list[str]][source]
class gain.annotation.annotatable.CNVAllele(chrom: str, pos_begin: int, pos_end: int, cnv_type: Type)[source]

Bases: Annotatable

Defines an annotatable for copy number variants.

static from_string(value: str) CNVAllele[source]

Deserialize an Annotatable instance from a string value.

to_dict() dict[source]

Serialize the annotatable to a dictionary.

class gain.annotation.annotatable.Position(chrom: str, pos: int)[source]

Bases: Annotatable

Annotatable class representing a single position in a chromosome.

static from_string(value: str) Position[source]

Deserialize an Annotatable instance from a string value.

to_dict() dict[source]

Serialize the annotatable to a dictionary.

class gain.annotation.annotatable.Region(chrom: str, pos_begin: int, pos_end: int)[source]

Bases: Annotatable

Annotatable class representing a region in a chromosome.

static from_string(value: str) Region[source]

Deserialize an Annotatable instance from a string value.

to_dict() dict[source]

Serialize the annotatable to a dictionary.

class gain.annotation.annotatable.VCFAllele(chrom: str, pos: int, ref: str, alt: str)[source]

Bases: Annotatable

Defines an annotatable for small variants.

property alt: str
property alternative: str
static from_string(value: str) VCFAllele[source]

Deserialize an Annotatable instance from a string value.

property ref: str
property reference: str
to_dict() dict[source]

Serialize the annotatable to a dictionary.

gain.annotation.annotate_columns module

Deprecated alias for gain.annotation.annotate_tabular.

gain.annotation.annotate_columns.annotate_columns(input_path: str, pipeline: AnnotationPipeline, output_path: str, args: dict[str, Any], *, reference_genome: ReferenceGenome | None = None, region: Region | None = None, attributes_to_delete: Sequence[str] | None = None) None

Annotate a tabular file using a processing pipeline.

gain.annotation.annotate_columns.cli(argv: list[str] | None = None) None[source]

Entry point for the deprecated annotate_columns CLI.

gain.annotation.annotate_doc module

gain.annotation.annotate_doc.cli(raw_args: list[str] | None = None) None[source]

Run the command line interface for the annotate_doc tool.

gain.annotation.annotate_doc.configure_argument_parser() ArgumentParser[source]

Construct and configure argument parser.

gain.annotation.annotate_tabular module

gain.annotation.annotate_tabular.annotate_tabular(input_path: str, pipeline: AnnotationPipeline, output_path: str, args: dict[str, Any], *, reference_genome: ReferenceGenome | None = None, region: Region | None = None, attributes_to_delete: Sequence[str] | None = None) None[source]

Annotate a tabular file using a processing pipeline.

gain.annotation.annotate_tabular.cli(argv: list[str] | None = None) None[source]

Entry point for running the tabular annotation tool.

gain.annotation.annotate_utils module

gain.annotation.annotate_utils.add_common_annotation_arguments(parser: ArgumentParser) None[source]

Add common arguments to an annotation command line parser.

gain.annotation.annotate_utils.add_input_files_to_task_graph(args: dict, task_graph: TaskGraph) None[source]
gain.annotation.annotate_utils.build_cli_genomic_context(cli_args: dict[str, Any]) GenomicContext[source]

Helper method to collect necessary objects from the genomic context.

gain.annotation.annotate_utils.build_output_path(raw_input_path: str, output_path: str | None) str[source]

Build an output filepath for an annotation tool’s output.

An explicit compression suffix (.gz/.bgz) on the output is preserved. An output named without one inherits (“mirrors”) the input’s compression suffix, so a .bgz input yields a .bgz output and a .gz input a .gz output.
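The suffix rule above can be sketched as follows. This is a minimal illustration and not the library's implementation: the helper name mirror_compression_suffix is hypothetical, and it covers only the case where an output path is given.

```python
# Hypothetical helper illustrating the suffix-mirroring rule; it is not
# the body of build_output_path.
COMPRESSION_SUFFIXES = (".gz", ".bgz")


def mirror_compression_suffix(input_path: str, output_path: str) -> str:
    """Return output_path, inheriting the input's compression suffix if needed."""
    # An explicit compression suffix on the output is kept as-is.
    if output_path.endswith(COMPRESSION_SUFFIXES):
        return output_path
    # Otherwise mirror whatever compression suffix the input carries.
    for suffix in COMPRESSION_SUFFIXES:
        if input_path.endswith(suffix):
            return output_path + suffix
    # Uncompressed input: there is nothing to mirror.
    return output_path
```

Under these assumptions, a .bgz input with an output named out.vcf gives out.vcf.bgz, while an output already named out.tsv.bgz is returned unchanged even when the input ends in .gz.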

gain.annotation.annotate_utils.cache_pipeline_resources(grr: GenomicResourceRepo, pipeline: AnnotationPipeline, *, workers: int | None = None, progress: bool = True) None[source]

Cache resources that the given pipeline will use.

gain.annotation.annotate_utils.check_resource_locality(pipeline: AnnotationPipeline, count_rows: Callable[[int], int], *, allow_remote: bool = False) None[source]

Guard against annotating many variants over non-local resources.

count_rows(limit) returns the number of input rows, capped at limit (short-circuiting so a huge input is never read in full).

Below LOCALITY_WARNING_THRESHOLD rows the guard is silent; between the warning and error thresholds it logs a warning and proceeds; above LOCALITY_ERROR_THRESHOLD it raises ValueError. Passing allow_remote disables the guard entirely.
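A rough sketch of this guard, under stated assumptions: the threshold values below are placeholders (the real values are not given here), the boundaries are treated as inclusive for the warning band, the guard is assumed to be a no-op when there are no non-local resources, and guard_locality and make_count_rows are hypothetical names.

```python
import itertools
import logging
from collections.abc import Callable, Iterable

logger = logging.getLogger(__name__)

# Placeholder values; the real thresholds are not stated in this reference.
LOCALITY_WARNING_THRESHOLD = 1_000
LOCALITY_ERROR_THRESHOLD = 100_000


def make_count_rows(rows: Iterable[object]) -> Callable[[int], int]:
    """Build a count_rows(limit) callable that stops reading at limit rows."""
    def count_rows(limit: int) -> int:
        # islice short-circuits, so at most `limit` rows are consumed.
        return sum(1 for _ in itertools.islice(rows, limit))
    return count_rows


def guard_locality(
    nonlocal_resources: list[str],
    count_rows: Callable[[int], int],
    *,
    allow_remote: bool = False,
) -> None:
    """Warn or fail when many rows would be annotated over remote resources."""
    # Assumption: with no non-local resources there is nothing to guard.
    if allow_remote or not nonlocal_resources:
        return
    # One row past the error threshold is enough to tell "above" from "at".
    rows = count_rows(LOCALITY_ERROR_THRESHOLD + 1)
    if rows > LOCALITY_ERROR_THRESHOLD:
        raise ValueError(
            f"more than {LOCALITY_ERROR_THRESHOLD} rows with non-local "
            f"resources: {nonlocal_resources}")
    if rows >= LOCALITY_WARNING_THRESHOLD:
        logger.warning(
            "%d rows will be annotated over non-local resources: %s",
            rows, nonlocal_resources)
```

Passing the cap to count_rows keeps the check cheap: the input is read only far enough to decide which band it falls in.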

gain.annotation.annotate_utils.emit_annotation_plan(args: dict[str, Any], pipeline: AnnotationPipeline, grr: GenomicResourceRepo) None[source]

Print the (re)annotation plan to stderr.

With --reannotate the previous pipeline is loaded and a ReannotationPipeline plan is rendered; otherwise the plain all-ADDED annotation plan is rendered. Printed with print (not a logger) so it is visible at the default WARNING log level.

gain.annotation.annotate_utils.find_nonlocal_resources(pipeline: AnnotationPipeline) list[tuple[str, str]][source]

Return (resource_id, scheme) for each non-local pipeline resource.

A resource is local when it is served by a caching protocol (its files are mirrored to disk) or by an fsspec protocol with a file or memory scheme. Everything else (http/https/s3) is non-local and would be queried over the network per variant.
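The classification can be illustrated with a small sketch. It assumes each resource is described by a URL plus a flag saying whether it is served through a caching protocol; that input shape, the names below, and treating a bare path as a file URL are all assumptions made for illustration.

```python
from urllib.parse import urlsplit

# Schemes treated as local, following the description above.
LOCAL_SCHEMES = {"file", "memory"}


def scheme_of(url: str) -> str:
    """Return the URL scheme; a bare path is assumed to mean 'file'."""
    return urlsplit(url).scheme or "file"


def nonlocal_resources(
    resources: dict[str, tuple[str, bool]],
) -> list[tuple[str, str]]:
    """Return (resource_id, scheme) for every resource that is not local.

    `resources` maps a resource id to (url, served_by_caching_protocol).
    """
    result = []
    for resource_id, (url, cached) in resources.items():
        if cached:
            # Files are mirrored to disk, so access is local.
            continue
        scheme = scheme_of(url)
        if scheme not in LOCAL_SCHEMES:
            result.append((resource_id, scheme))
    return result
```

For example, a resource at an https URL without caching is reported as non-local, while the same kind of resource behind a caching protocol is not.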

gain.annotation.annotate_utils.get_grr_from_context(context: GenomicContext) GenomicResourceRepo[source]

Get the genomic resource repository from the genomic context.

gain.annotation.annotate_utils.get_pipeline_from_context(context: GenomicContext) AnnotationPipeline[source]

Get the annotation pipeline from the genomic context.

gain.annotation.annotate_utils.handle_default_args(args: dict[str, Any]) dict[str, Any][source]

Handle default arguments for annotation command line tools.

gain.annotation.annotate_utils.maybe_remove_work_dir(args: dict[str, Any], *, result: bool) None[source]

Remove the working directory after a clean run, if the tool made it.

The directory is removed only when every condition holds:

  • the tool created it (it did not pre-exist; see work_dir_created),

  • the command actually ran annotation (not list/status),

  • the run succeeded (result is True – a --keep-going run that finished with task errors returns False and is preserved),

  • neither --keep-parts nor --keep-work-dir was requested,

  • the output file does not live inside the working directory.

Removal is best-effort: a failure to remove logs a warning and is not fatal, since the annotation has already succeeded.
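The conditions above amount to a single conjunction followed by a best-effort removal. The sketch below restates them with a hypothetical RunState record; the field names and helper functions are illustrative and do not reflect the layout of the args dictionary.

```python
import logging
import shutil
from dataclasses import dataclass
from pathlib import Path

logger = logging.getLogger(__name__)


@dataclass(frozen=True)
class RunState:
    """Hypothetical summary of a finished run."""
    work_dir: Path
    output: Path
    work_dir_created: bool  # the tool created the directory itself
    ran_annotation: bool    # not a list/status command
    result: bool            # the run succeeded
    keep_parts: bool = False
    keep_work_dir: bool = False


def should_remove_work_dir(state: RunState) -> bool:
    """Return True only when every removal condition holds."""
    output_inside = state.output.resolve().is_relative_to(
        state.work_dir.resolve())
    return (
        state.work_dir_created
        and state.ran_annotation
        and state.result
        and not (state.keep_parts or state.keep_work_dir)
        and not output_inside
    )


def remove_work_dir(state: RunState) -> None:
    """Remove the working directory; a failure is logged, not raised."""
    if not should_remove_work_dir(state):
        return
    try:
        shutil.rmtree(state.work_dir)
    except OSError as ex:
        logger.warning("could not remove %s: %s", state.work_dir, ex)
```

Keeping the decision in a pure predicate makes each condition easy to test separately from the filesystem side effect.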

gain.annotation.annotate_utils.maybe_wrap_reannotation(pipeline: AnnotationPipeline, args: dict[str, Any], grr: GenomicResourceRepo) AnnotationPipeline[source]

Wrap pipeline in a ReannotationPipeline if reannotating.

When --reannotate is not given the pipeline is returned unchanged. Otherwise the previous pipeline is loaded, the new pipeline is wrapped in a ReannotationPipeline, and the previous pipeline is closed – the wrapper reuses the live new-pipeline annotators and never touches the previous pipeline after construction.

gain.annotation.annotate_utils.produce_partfile_paths(input_file_path: str, regions: list[Region], work_dir: str) list[str][source]

Produce a list of file paths for output region part files.

gain.annotation.annotate_utils.produce_regions(pysam_file: TabixFile, region_size: int) list[Region][source]

Given a region size, produce contig regions to annotate by.
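One plausible way to split contigs into fixed-size regions is sketched below. It assumes contig lengths are already known and that coordinates are 1-based and inclusive; how the actual function obtains lengths from the TabixFile is not documented here, and split_contigs is a hypothetical name.

```python
def split_contigs(
    contig_lengths: dict[str, int], region_size: int,
) -> list[tuple[str, int, int]]:
    """Split each contig into consecutive (chrom, start, end) regions.

    Coordinates are assumed to be 1-based and inclusive.
    """
    if region_size <= 0:
        raise ValueError("region_size must be positive")
    regions = []
    for chrom, length in contig_lengths.items():
        for start in range(1, length + 1, region_size):
            # The last region of a contig may be shorter than region_size.
            end = min(start + region_size - 1, length)
            regions.append((chrom, start, end))
    return regions
```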

gain.annotation.annotate_utils.stringify(value: Any, *, vcf: bool = False) str[source]

Format the value to a string for human-readable output.

gain.annotation.annotate_vcf module

gain.annotation.annotate_vcf.annotate_vcf(input_path: str, pipeline: AnnotationPipeline, output_path: str, args: dict[str, Any], *, region: Region | None = None, attributes_to_delete: Sequence[str] | None = None) None[source]

Annotate a VCF file using a processing pipeline.

gain.annotation.annotate_vcf.cli(argv: list[str] | None = None) None[source]

Entry point for running the VCF annotation tool.

gain.annotation.annotation_config module

class gain.annotation.annotation_config.AnnotationConfigParser[source]

Bases: object

Parser for annotation configuration.

ANNOTATION_CONFIG_GRAMMAR = '\n?start: resource_id [filter]\n\n?resource_id: (resource_name | wildcard)\n\nwildcard: /[\\w\\d\\/_*]+/\n\nfilter: "[" (equals | and_)+ "]"\n\nand_: operation "and" operation\n\nequals: (name"=\\""value"\\"") | (name"=\'"value"\'")\n\nin: ("\\""value"\\"" " in " name) | ("\'"value"\'" " in " name)\n\nresource_name: /[\\w\\d\\/_\\-!@#$%^<>+]+/\n\n?name: /[\\w\\d\\/_\\-!@#$%^<>+*]+/\n\n?value: /[\\w\\d\\/ _\\-!@#$%^<>+*]+/\n\n?operation: equals | in | and_\n\n%ignore " "\n'
WILDCARD_LIMIT = 500
static build_labels_query(node: Any, labels_query: dict[str, Any] | None = None) dict[str, Callable[[str], bool]][source]

Build labels query from parsed tree node.

static has_wildcard(string: str) bool[source]

Ascertain whether a string contains a valid wildcard.

static match_labels_query(query: dict[str, Callable[[str], bool]], resource_labels: dict[str, str]) bool[source]

Check if the labels query for a wildcard matches.

static parse_complete(raw: dict[str, Any], idx: int, grr: GenomicResourceRepo | None = None) list[AnnotatorInfo][source]

Parse a full-form annotation config.

static parse_minimal(raw: str, idx: int) AnnotatorInfo[source]

Parse a minimal-form annotation config.

static parse_raw(pipeline_raw_config: list[dict[str, Any]] | RawFullConfig | None, grr: GenomicResourceRepo | None = None) tuple[AnnotationPreamble | None, list[AnnotatorInfo]][source]

Parse raw dictionary annotation pipeline configuration.

static parse_raw_attribute_config(raw_attribute_config: dict[str, Any]) AttributeConfig[source]

Parse annotation attribute raw configuration.

static parse_raw_attributes(raw_attributes_config: Any) list[AttributeConfig][source]

Parse annotator pipeline attribute configuration.

static parse_short(raw: dict[str, Any], idx: int, grr: GenomicResourceRepo | None = None) list[AnnotatorInfo][source]

Parse a short-form annotation config.

static parse_str(content: str, source_file_name: str | None = None, grr: GenomicResourceRepo | None = None) tuple[AnnotationPreamble | None, list[AnnotatorInfo]][source]

Parse annotation pipeline configuration string.

static query_resources(annotator_type: str, resource_id: str, grr: GenomicResourceRepo) list[str][source]

Collect resources matching a given query.

exception gain.annotation.annotation_config.AnnotationConfigurationError(message: str | None, other_error: Exception | None = None, error_mark: ErrorMark | None = None)[source]

Bases: Exception

Exception raised for errors in the annotation configuration.

error_mark: ErrorMark | None
message: str | None
class gain.annotation.annotation_config.AnnotationPreamble(summary: 'str', description: 'str', input_reference_genome: 'str', input_reference_genome_res: 'GenomicResource | None', metadata: 'dict[str, Any]')[source]

Bases: object

description: str
input_reference_genome: str
input_reference_genome_res: GenomicResource | None
metadata: dict[str, Any]
summary: str
class gain.annotation.annotation_config.AnnotatorInfo(_type: str, attributes: list[AttributeConfig], parameters: ParamsUsageMonitor | dict[str, Any], documentation: str = '', resources: list[GenomicResource] | None = None, annotator_id: str = 'N/A')[source]

Bases: object

Defines annotator configuration.

annotator_id: str
attributes: list[AttributeConfig]
documentation: str = ''
parameters: ParamsUsageMonitor
resources: list[GenomicResource]
to_dict() dict[str, Any][source]

Convert annotator info to a configuration dictionary.

type: str
class gain.annotation.annotation_config.Attribute(name: str, source: str, internal: bool | None = None, aggregator: AggregatorSource | None = None, parameters: ParamsUsageMonitor = <factory>, spec: AttributeSpec | None = None, _documentation: str | None = None)[source]

Bases: object

Runtime attribute instance produced by an annotator.

aggregator: AggregatorSource | None = None
property description: str
property documentation: str
internal: bool | None = None
name: str
parameters: ParamsUsageMonitor
source: str
spec: AttributeSpec | None = None
property value_type: str
class gain.annotation.annotation_config.AttributeConfig(name: str, source: str, internal: bool | None = None, aggregator: AggregatorDefinition | str | dict[str, ~typing.Any] | None=None, parameters: dict[str, ~typing.Any]=<factory>)[source]

Bases: object

Configuration for an annotator attribute (from pipeline YAML).

aggregator: AggregatorDefinition | str | dict[str, Any] | None = None
as_dict() dict[str, Any][source]

Serialize to a config dict, omitting fields that are unset.

internal: bool | None = None
name: str
parameters: dict[str, Any]
source: str
class gain.annotation.annotation_config.ErrorMark(row: int, column: int)[source]

Bases: object

Marks an error position in a file.

column: int
row: int
class gain.annotation.annotation_config.ParamsUsageMonitor(data: dict[str, Any])[source]

Bases: Mapping

Class to monitor usage of annotator parameters.

as_dict() dict[str, Any][source]

Return a plain copy of all parameters without tracking.

get_unused_keys() set[str][source]

Return the set of keys that have not been accessed.

get_used_keys() set[str][source]

Return the set of keys that have been accessed.

inject(key: str, value: Any) None[source]

Add a parameter and mark it as used (for framework injection).
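The usage-tracking idea can be sketched as a small Mapping subclass. This is an independent illustration of the technique, not the ParamsUsageMonitor source; the class name and the choice that membership tests do not count as use are assumptions.

```python
from collections.abc import Iterator, Mapping
from typing import Any


class UsageTrackingParams(Mapping[str, Any]):
    """Read-only mapping that remembers which keys were looked up."""

    def __init__(self, data: dict[str, Any]) -> None:
        self._data = dict(data)
        self._used: set[str] = set()

    def __getitem__(self, key: str) -> Any:
        value = self._data[key]  # KeyError for unknown keys, nothing recorded
        self._used.add(key)
        return value

    def __contains__(self, key: object) -> bool:
        # Assumption: a membership test alone does not count as use.
        return key in self._data

    def __iter__(self) -> Iterator[str]:
        return iter(self._data)

    def __len__(self) -> int:
        return len(self._data)

    def as_dict(self) -> dict[str, Any]:
        """Return a plain copy without marking anything as used."""
        return dict(self._data)

    def get_used_keys(self) -> set[str]:
        return set(self._used)

    def get_unused_keys(self) -> set[str]:
        return set(self._data) - self._used

    def inject(self, key: str, value: Any) -> None:
        """Add a parameter and mark it as used."""
        self._data[key] = value
        self._used.add(key)
```

A caller can read the parameters it understands and then inspect get_unused_keys() to report misspelled or unsupported configuration keys.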

class gain.annotation.annotation_config.RawFullConfig[source]

Bases: TypedDict

annotators: list[dict[str, Any]]
preamble: RawPreamble
class gain.annotation.annotation_config.RawPreamble[source]

Bases: TypedDict

description: str
input_reference_genome: str
metadata: dict[str, Any]
summary: str

gain.annotation.annotation_factory module

Factory for creation of annotation pipeline.

gain.annotation.annotation_factory.build_annotation_pipeline(config: list[dict[str, Any]] | RawFullConfig, grr: GenomicResourceRepo, *, allow_repeated_attributes: bool = False, work_dir: Path | None = None) AnnotationPipeline[source]

Build an annotation pipeline.

gain.annotation.annotation_factory.build_pipeline_annotator(pipeline: AnnotationPipeline, annotator_config: AnnotatorInfo, work_dir: Path) Annotator[source]

Build an annotator for the pipeline.

gain.annotation.annotation_factory.check_for_repeated_attributes_in_annotator(annotator_config: AnnotatorInfo) None[source]

Check for repeated attributes in annotator configuration.

gain.annotation.annotation_factory.check_for_repeated_attributes_in_pipeline(pipeline: AnnotationPipeline, *, allow_repeated_attributes: bool = False, annotator_config: AnnotatorInfo | None = None) None[source]

Check for repeated attributes in pipeline configuration.

gain.annotation.annotation_factory.check_for_unused_attribute_parameters(annotator: Annotator) None[source]

Check each attribute’s parameters for unused keys.

gain.annotation.annotation_factory.check_for_unused_parameters(info: AnnotatorInfo) None[source]

Check annotator configuration for unused parameters.

gain.annotation.annotation_factory.get_annotator_factory(annotator_type: str) Callable[[AnnotationPipeline, AnnotatorInfo], Annotator][source]

Find and return a factory function for creation of an annotator type.

If the specified annotator type is not found, this function raises a ValueError.

Returns:

the annotator factory for the specified annotator type.

Raises:

ValueError – when no annotator factory can be found for the specified annotator type.

gain.annotation.annotation_factory.get_available_annotator_types() list[str][source]

Return the list of all registered annotator factory types.

gain.annotation.annotation_factory.load_pipeline_from_file(raw_path: str, grr: GenomicResourceRepo, *, allow_repeated_attributes: bool = False, work_dir: Path | None = None) AnnotationPipeline[source]

Load an annotation pipeline from a configuration file.

gain.annotation.annotation_factory.load_pipeline_from_file_or_resource(arg: str, grr: GenomicResourceRepo, *, allow_repeated_attributes: bool = False, work_dir: Path | None = None) AnnotationPipeline[source]

Load a pipeline from a file path or a GRR resource id.

Tries to interpret arg as a filesystem path first; on miss, falls back to looking it up as a GRR resource of type annotation_pipeline.

gain.annotation.annotation_factory.load_pipeline_from_grr(grr: GenomicResourceRepo, resource: GenomicResource) AnnotationPipeline[source]

Load a pipeline from a GRR and a resource.

gain.annotation.annotation_factory.load_pipeline_from_yaml(raw: str, grr: GenomicResourceRepo, *, allow_repeated_attributes: bool = False, work_dir: Path | None = None) AnnotationPipeline[source]

Load an annotation pipeline from a YAML-formatted string.

gain.annotation.annotation_factory.register_annotator_factory(annotator_type: str, factory: Callable[[AnnotationPipeline, AnnotatorInfo], Annotator]) None[source]

Register additional annotator factory.

By default all annotator factories should be registered at the [gain.annotation.annotators] entry point. All registered factories are loaded automatically. This function should be used if you want to bypass the entry point mechanism and register an additional annotator factory programmatically.
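The programmatic registration path follows a plain registry pattern, sketched below in isolation. The names are hypothetical stand-ins, the pipeline and info arguments are left untyped, and whether the real registry overwrites or rejects a duplicate type is not stated here (this sketch overwrites).

```python
from collections.abc import Callable
from typing import Any

# A factory takes (pipeline, info) and returns an annotator.
AnnotatorFactory = Callable[[Any, Any], Any]

_FACTORIES: dict[str, AnnotatorFactory] = {}


def register_factory(annotator_type: str, factory: AnnotatorFactory) -> None:
    """Register a factory under an annotator type name."""
    _FACTORIES[annotator_type] = factory


def get_factory(annotator_type: str) -> AnnotatorFactory:
    """Look up a factory, raising ValueError for an unknown type."""
    try:
        return _FACTORIES[annotator_type]
    except KeyError:
        raise ValueError(
            f"unsupported annotator type: {annotator_type}") from None


def available_types() -> list[str]:
    """Return all registered annotator type names, sorted."""
    return sorted(_FACTORIES)
```

In this sketch a factory registered under "hello" is returned by get_factory("hello"), and any unregistered name raises ValueError, mirroring the documented behaviour of get_annotator_factory.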

gain.annotation.annotation_factory.resolve_repeated_attributes(pipeline: AnnotationPipeline, repeated_attributes: set[str]) None[source]

Resolve repeated attributes in pipeline configuration via renaming.

gain.annotation.annotation_genomic_context_cli module

Command line helpers for constructing annotation pipelines.

The utilities in this module complement the generic genomic context providers by supplying annotation pipeline objects. They enable CLI tools to load pipeline definitions from the file system or from genomic resource repositories, and to make the resulting AnnotationPipeline instances available through the shared genomic context mechanism.

class gain.annotation.annotation_genomic_context_cli.CLIAnnotationContextProvider[source]

Bases: GenomicContextProvider

Expose annotation pipeline configuration through CLI options.

The provider allows users to point to an annotation pipeline definition (either as a file path or a genomic resource identifier) and optionally tweak pipeline behaviour via command-line flags. When invoked without a pipeline argument the provider abstains from creating a context so that other providers can supply their default pipelines.

add_argparser_arguments(parser: ArgumentParser, **kwargs: Any) None[source]

Register arguments that describe the annotation pipeline source.

Parameters

parser

The parser that should receive the provider specific CLI options.

init(**kwargs: Any) GenomicContext | None[source]

Materialise a genomic context containing an annotation pipeline.

Parameters

**kwargs

Keyword arguments parsed from the command line. The provider looks at pipeline, allow_repeated_attributes, and work_dir.

Returns

GenomicContext | None

A context containing the annotation pipeline, or None when no pipeline could be created (for example when the pipeline argument is omitted).

gain.annotation.annotation_genomic_context_cli.get_context_pipeline(context: GenomicContext) AnnotationPipeline | None[source]

Extract a validated AnnotationPipeline from context.

Parameters

context

The genomic context from which to retrieve the pipeline object.

Returns

AnnotationPipeline | None

The pipeline instance or None when the context does not expose a pipeline.

Raises

TypeError

If the context entry is present but does not contain the expected AnnotationPipeline type.

gain.annotation.annotation_pipeline module

Provides annotation pipeline class.

class gain.annotation.annotation_pipeline.AnnotationPipeline(repository: GenomicResourceRepo)[source]

Bases: object

Provides annotation pipeline abstraction.

add_annotator(annotator: Annotator) None[source]
annotate(annotatable: Annotatable | None, context: dict | None = None) dict[source]

Apply all annotators to an annotatable.

batch_annotate(annotatables: Sequence[Annotatable | None], contexts: list[dict] | None = None, batch_work_dir: str | None = None) list[dict][source]

Apply all annotators to a list of annotatables.

close() None[source]

Close the annotation pipeline.

get_annotator_by_attribute_info(attribute_info: Attribute) Annotator | None[source]
get_attribute_info(attribute_name: str) Attribute | None[source]
get_attributes() list[Attribute][source]
get_attributes_by_type(attribute_type: str) list[Attribute][source]
get_info() list[AnnotatorInfo][source]
get_resource_ids() set[str][source]
open() AnnotationPipeline[source]

Open all annotators in the pipeline and mark it as open.

print() None[source]

Print the annotation pipeline.

class gain.annotation.annotation_pipeline.Annotator(pipeline: AnnotationPipeline | None, info: AnnotatorInfo)[source]

Bases: ABC

Annotator provides a set of attributes for a given Annotatable.

BASE_DOC_URL = 'https://iossifovlab.com/gaindocs/annotation_infrastructure.html'
abstractmethod annotate(annotatable: Annotatable | None, context: dict[str, Any]) dict[str, Any][source]

Produce annotation attributes for an annotatable.

abstract property attributes: list[Attribute]

Return the list of attributes this annotator produces.

batch_annotate(annotatables: Sequence[Annotatable | None], contexts: list[dict[str, Any]], batch_work_dir: str | None = None) Iterable[dict[str, Any]][source]
close() None[source]
abstractmethod get_attribute_specs() dict[str, AttributeSpec][source]

Get specs of all attributes the annotator can produce.

get_info() AnnotatorInfo[source]
is_open() bool[source]
open() Annotator[source]
property resource_ids: set[str]
property resources: list[GenomicResource]
property used_context_attributes: tuple[str, ...]
class gain.annotation.annotation_pipeline.AnnotatorDecorator(child: Annotator)[source]

Bases: Annotator

Defines annotator decorator base class.

property attributes: list[Attribute]

Return the list of attributes this annotator produces.

close() None[source]
get_attribute_specs() dict[str, AttributeSpec][source]

Get specs of all attributes the annotator can produce.

is_open() bool[source]
open() Annotator[source]
class gain.annotation.annotation_pipeline.AttributeSpec(source: str, value_type: str, description: str, is_default: bool = True, internal_default: bool = False, supports_aggregation: bool = True, attribute_type: str = 'attribute')[source]

Bases: object

Describes a single attribute an annotator can produce.

as_dict() dict[str, Any][source]

Serialize to a response dict.

attribute_type: str = 'attribute'
description: str
internal_default: bool = False
is_default: bool = True
source: str
supports_aggregation: bool = True
value_type: str
class gain.annotation.annotation_pipeline.InputAnnotableAnnotatorDecorator(child: Annotator)[source]

Bases: AnnotatorDecorator

Defines annotator decorator to use input annotatable if defined.

annotate(annotatable: Annotatable | None, context: dict[str, Any]) dict[str, Any][source]

Produce annotation attributes for an annotatable.

static decorate(child: Annotator) Annotator[source]
property used_context_attributes: tuple[str, ...]
class gain.annotation.annotation_pipeline.PlanEntry(name: str, internal: bool, annotator_id: str, reason: str | None = None)[source]

Bases: object

A single attribute entry in a reannotation/annotation plan.

annotator_id: str
internal: bool
name: str
reason: str | None = None
class gain.annotation.annotation_pipeline.ReannotationPipeline(pipeline_new: AnnotationPipeline, pipeline_previous: AnnotationPipeline, *, full_reannotation: bool = False)[source]

Bases: AnnotationPipeline

Provides functionality for reannotation.

annotators: list[Annotator]
format_plan(reference: str | None = None) str[source]

Render the reannotation plan as human-readable text.

get_attributes() list[Attribute][source]
infos_new: set[AnnotatorInfo]
infos_rerun: set[AnnotatorInfo]
print_plan(reference: str | None = None, file: IO[str] | None = None) None[source]

Print the reannotation plan.

rerun_triggers: dict[AnnotatorInfo, tuple[AnnotatorInfo, Attribute]]
class gain.annotation.annotation_pipeline.ReannotationPlan(copied: list[PlanEntry] = <factory>, added: list[PlanEntry] = <factory>, computed: list[PlanEntry] = <factory>, deleted: list[PlanEntry] = <factory>)[source]

Bases: object

Structured description of how a reannotation reuses/recomputes data.

Each bucket is a list of PlanEntry:

  • copied: attributes reused unchanged from the input;

  • added: attributes of annotators new to the pipeline;

  • computed: attributes of unchanged annotators forced to recompute (reason records the triggering dependency);

  • deleted: attributes present in the previous pipeline but no longer produced.

added: list[PlanEntry]
computed: list[PlanEntry]
copied: list[PlanEntry]
deleted: list[PlanEntry]
class gain.annotation.annotation_pipeline.ValueTransformAnnotatorDecorator(child: Annotator, value_transformers: dict[str, Callable[[Any], Any]])[source]

Bases: AnnotatorDecorator

Define value transformer annotator decorator.

annotate(annotatable: Annotatable | None, context: dict[str, Any]) dict[str, Any][source]

Produce annotation attributes for an annotatable.

static decorate(child: Annotator) Annotator[source]

Apply value transform decorator to an annotator.
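The decorator idea behind ValueTransformAnnotatorDecorator can be sketched with plain functions. This assumes the value_transformers dictionary is keyed by attribute name and that values without a transformer pass through unchanged; both points are assumptions, and with_value_transforms is a hypothetical name.

```python
from collections.abc import Callable
from typing import Any

# Shape of an annotate callable: (annotatable, context) -> attributes.
Annotate = Callable[[Any, dict[str, Any]], dict[str, Any]]


def with_value_transforms(
    annotate: Annotate,
    value_transformers: dict[str, Callable[[Any], Any]],
) -> Annotate:
    """Wrap annotate so selected attribute values are post-processed."""
    def wrapped(annotatable: Any, context: dict[str, Any]) -> dict[str, Any]:
        result = annotate(annotatable, context)
        return {
            # Attributes without a transformer are passed through as-is.
            name: value_transformers[name](value)
            if name in value_transformers else value
            for name, value in result.items()
        }
    return wrapped


def child_annotate(annotatable: Any, context: dict[str, Any]) -> dict[str, Any]:
    """Stand-in for a wrapped annotator's annotate method."""
    return {"score": 0.123456, "gene": "CHD8"}


rounded = with_value_transforms(
    child_annotate, {"score": lambda value: round(value, 2)})
```

The wrapped callable keeps the child's signature, so callers do not need to know whether a transformation layer is present.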

gain.annotation.annotation_pipeline.format_annotation_plan(pipeline: AnnotationPipeline) str[source]

Render a plain annotation pipeline as an all-ADDED plan.

gain.annotation.annotation_pipeline.print_annotation_plan(pipeline: AnnotationPipeline, file: IO[str] | None = None) None[source]

Print a plain annotation pipeline plan.

gain.annotation.annotator_base module

Provides base class for annotators.

class gain.annotation.annotator_base.AnnotatorBase(pipeline: AnnotationPipeline | None, info: AnnotatorInfo)[source]

Bases: Annotator

Base implementation of the Annotator class.

annotate(annotatable: Annotatable | None, context: dict[str, Any]) dict[str, Any][source]

Produce annotation attributes for an annotatable.

property attributes: list[Attribute]

Return the list of attributes this annotator produces.

batch_annotate(annotatables: Sequence[Annotatable | None], contexts: list[dict[str, Any]], batch_work_dir: str | None = None) list[dict[str, Any]][source]
get_attribute_defaults(spec: AttributeSpec) dict[str, Any][source]
open() Annotator[source]

gain.annotation.chrom_mapping_annotator module

class gain.annotation.chrom_mapping_annotator.ChromMappingAnnotator(pipeline: AnnotationPipeline, info: AnnotatorInfo)[source]

Bases: AnnotatorBase

Annotator for adjusting chromosome values.

get_attribute_specs() dict[str, AttributeSpec][source]

Get specs of all attributes the annotator can produce.

gain.annotation.chrom_mapping_annotator.build_chrom_mapping_annotator(pipeline: AnnotationPipeline, info: AnnotatorInfo) Annotator[source]

gain.annotation.cnv_collection_annotator module

class gain.annotation.cnv_collection_annotator.CnvCollectionAnnotator(pipeline: AnnotationPipeline, info: AnnotatorInfo)[source]

Bases: AnnotatorBase

CNV collection annotator class.

CNV_FILTER_GRAMMAR = '\n?start: filter | and_ | or\n\nand_: filter "and" filter\n\nor: filter "or" filter\n\n?filter: subject operator subject | or | and_\n\n?subject: variable | value\n\nvalue: "\\"" word "\\"" | number\n\nvariable: word\n\noperator: equals | greater_than | less_than | in\n\nequals: "=="\n\ngreater_than: ">"\n\nless_than: "<"\n\nin: "in"\n\nword: /[a-zA-Z!@#$%^&*()_+]+/\n\nnumber: /[0-9\\.]+/\n\n%ignore " "\n'
close() None[source]
get_attribute_defaults(spec: AttributeSpec) dict[str, Any][source]
get_attribute_specs() dict[str, AttributeSpec][source]

Get specs of all attributes the annotator can produce.

open() Annotator[source]
gain.annotation.cnv_collection_annotator.build_cnv_collection_annotator(pipeline: AnnotationPipeline, info: AnnotatorInfo) Annotator[source]

gain.annotation.debug_annotator module

class gain.annotation.debug_annotator.HelloWorldAnnotator(pipeline: AnnotationPipeline | None, info: AnnotatorInfo)[source]

Bases: AnnotatorBase

Defines example annotator.

get_attribute_specs() dict[str, AttributeSpec][source]

Get specs of all attributes the annotator can produce.

gain.annotation.debug_annotator.build_annotator(pipeline: AnnotationPipeline, info: AnnotatorInfo) Annotator[source]

Create an example hello world annotator.

gain.annotation.docker_annotator module

class gain.annotation.docker_annotator.DockerAnnotator(pipeline: AnnotationPipeline | None, info: AnnotatorInfo)[source]

Bases: AnnotatorBase

Base class for annotators that use docker containers.

open() Annotator[source]
abstractmethod run(**kwargs: Any) None[source]

gain.annotation.effect_annotator module

class gain.annotation.effect_annotator.EffectAnnotatorAdapter(pipeline: AnnotationPipeline, info: AnnotatorInfo)[source]

Bases: AnnotatorBase

Adapts effect annotator to be used in annotation infrastructure.

close() None[source]
get_attribute_defaults(spec: AttributeSpec) dict[str, Any][source]
get_attribute_specs() dict[str, AttributeSpec][source]

Get specs of all attributes the annotator can produce.

open() Annotator[source]
gain.annotation.effect_annotator.build_effect_annotator(pipeline: AnnotationPipeline, info: AnnotatorInfo) Annotator[source]

gain.annotation.gene_score_annotator module

Module containing the gene score annotator.

class gain.annotation.gene_score_annotator.GeneScoreAnnotator(pipeline: AnnotationPipeline | None, info: AnnotatorInfo, gene_score_resource: GenomicResource, input_gene_list: str)[source]

Bases: AnnotatorBase

Gene score annotator class.

annotate(annotatable: Annotatable | None, context: dict[str, Any]) dict[str, Any][source]

Produce annotation attributes for an annotatable.

batch_annotate(annotatables: Sequence[Annotatable | None], contexts: list[dict[str, Any]], batch_work_dir: str | None = None) list[dict[str, Any]][source]
get_attribute_specs() dict[str, AttributeSpec][source]

Get specs of all attributes the annotator can produce.

property used_context_attributes: tuple[str, ...]
gain.annotation.gene_score_annotator.build_gene_score_annotator(pipeline: AnnotationPipeline, info: AnnotatorInfo) Annotator[source]

Create a gene score annotator.

gain.annotation.gene_set_annotator module

class gain.annotation.gene_set_annotator.GeneSetAnnotator(pipeline: AnnotationPipeline | None, info: AnnotatorInfo, gene_set_resource: GenomicResource, input_gene_list: str)[source]

Bases: AnnotatorBase

Gene set annotator class.

DEFAULT_AGGREGATOR_TYPE = 'list'
get_attribute_defaults(spec: AttributeSpec) dict[str, Any][source]
get_attribute_specs() dict[str, AttributeSpec][source]

Get specs of all attributes the annotator can produce.

open() Annotator[source]
property used_context_attributes: tuple[str, ...]
gain.annotation.gene_set_annotator.build_gene_set_annotator(pipeline: AnnotationPipeline, info: AnnotatorInfo) Annotator[source]

Create a gene set annotator.

gain.annotation.liftover_annotator module

Provides a lift over annotator and helpers.

class gain.annotation.liftover_annotator.AbstractLiftoverAnnotator(pipeline: AnnotationPipeline, info: AnnotatorInfo, chain: LiftoverChain, source_genome: ReferenceGenome, target_genome: ReferenceGenome)[source]

Bases: AnnotatorBase

Liftover annotator class.

close() None[source]
get_attribute_specs() dict[str, AttributeSpec][source]

Get specs of all attributes the annotator can produce.

liftover_allele(allele: VCFAllele) VCFAllele | None[source]

Liftover an allele.

liftover_cnv(cnv_allele: Annotatable) Annotatable | None[source]

Liftover CNV allele annotatable.

liftover_position(position: Annotatable) Annotatable | None[source]

Liftover position annotatable.

liftover_region(region: Annotatable) Annotatable | None[source]

Liftover region annotatable.

open() Annotator[source]
class gain.annotation.liftover_annotator.BasicLiftoverAnnotator(pipeline: AnnotationPipeline, info: AnnotatorInfo, chain: LiftoverChain, source_genome: ReferenceGenome, target_genome: ReferenceGenome)[source]

Bases: AbstractLiftoverAnnotator

Basic liftover annotator class.

class gain.annotation.liftover_annotator.BcfLiftoverAnnotator(pipeline: AnnotationPipeline, info: AnnotatorInfo, chain: LiftoverChain, source_genome: ReferenceGenome, target_genome: ReferenceGenome)[source]

Bases: AbstractLiftoverAnnotator

BCF tools liftover re-implementation annotator class.

class gain.annotation.liftover_annotator.LiftoverFunction(*args, **kwargs)[source]

Bases: Protocol

Protocol for liftover function.

gain.annotation.liftover_annotator.basic_liftover_allele(chrom: str, pos: int, ref: str, alt: str, liftover_chain: LiftoverChain, *, source_genome: ReferenceGenome, target_genome: ReferenceGenome) tuple[str, int, str, str] | None[source]

Basic liftover of an allele.

gain.annotation.liftover_annotator.basic_liftover_variant(chrom: str, pos: int, ref: str, alts: list[str], liftover_chain: LiftoverChain, *, source_genome: ReferenceGenome, target_genome: ReferenceGenome) tuple[str, int, str, list[str]] | None[source]

Basic liftover variant utility function.

gain.annotation.liftover_annotator.bcf_liftover_allele(chrom: str, pos: int, ref: str, alt: str, liftover_chain: LiftoverChain, *, source_genome: ReferenceGenome, target_genome: ReferenceGenome) tuple[str, int, str, str] | None[source]

BCF liftover of a single allele.

gain.annotation.liftover_annotator.bcf_liftover_variant(chrom: str, pos: int, ref: str, alts: list[str], liftover_chain: LiftoverChain, *, source_genome: ReferenceGenome, target_genome: ReferenceGenome) tuple[str, int, str, list[str]] | None[source]

BCF liftover variant utility function.

gain.annotation.liftover_annotator.build_liftover_annotator(pipeline: AnnotationPipeline, info: AnnotatorInfo) Annotator[source]

Create a liftover annotator.

gain.annotation.normalize_allele_annotator module

Provides normalize allele annotator and helpers.

class gain.annotation.normalize_allele_annotator.NormalizeAlleleAnnotator(pipeline: AnnotationPipeline, info: AnnotatorInfo)[source]

Bases: AnnotatorBase

Annotator to normalize VCF alleles.

close() None[source]
get_attribute_specs() dict[str, AttributeSpec][source]

Get specs of all attributes the annotator can produce.

open() Annotator[source]
gain.annotation.normalize_allele_annotator.build_normalize_allele_annotator(pipeline: AnnotationPipeline, info: AnnotatorInfo) Annotator[source]
gain.annotation.normalize_allele_annotator.normalize_allele(allele: VCFAllele, genome: ReferenceGenome) VCFAllele[source]

Normalize an allele.

Uses the algorithm defined at https://genome.sph.umich.edu/wiki/Variant_Normalization
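The normalization procedure described at the linked page can be sketched as follows: trim shared trailing bases, pad to the left whenever an allele becomes empty, then trim shared leading bases. This is an illustrative sketch, not this module's implementation. The function normalize is hypothetical, and it takes the reference sequence as a plain string with a 1-based position instead of a VCFAllele and a ReferenceGenome.

```python
def normalize(seq: str, pos: int, ref: str, alt: str) -> tuple[int, str, str]:
    """Left-align and trim a ref/alt pair.

    `seq` is the reference sequence (assumption: a plain string) and
    `pos` is the 1-based position of the first base of `ref` in `seq`.
    """
    if ref == alt:
        raise ValueError("ref and alt must differ")

    changed = True
    while changed:
        changed = False
        # Drop the last base while both alleles end with the same base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both one base to the left.
        if not ref or not alt:
            if pos <= 1:
                raise ValueError("cannot extend left of the sequence start")
            pos -= 1
            base = seq[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True

    # Drop shared leading bases while both alleles keep at least one base.
    while len(ref) >= 2 and len(alt) >= 2 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1

    return pos, ref, alt
```

For a CA deletion inside the repeat of GGGCACACACAGGG written as position 8, CAC to C, the sketch shifts the variant to the left end of the repeat and reports position 3, GCA to G.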

gain.annotation.prepare_tabular module

Prepare a tabular file for parallel annotation.

Sorts a (possibly gzip-compressed) columnar file by genomic coordinates and produces a bgzip-compressed, tabix-indexed output that annotate_tabular can fan out across regions.

The same --col-* options as annotate_tabular select which input columns carry chromosome / position / etc., and the same RecordToAnnotable lookup is reused to derive the sort and tabix keys.
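The sorting step can be pictured with a small, self-contained sketch. It is not the tool's code: the function sort_by_coordinates and its arguments are hypothetical, the input is a tab-separated string with a header row, and chromosomes are ordered lexicographically as a simplifying assumption. Compression and indexing are left out.

```python
import csv
import io


def sort_by_coordinates(text: str, chrom_col: str, pos_col: str) -> str:
    """Sort tab-separated records by (chromosome, numeric position)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    # Positions are compared as integers so that 3 sorts before 20.
    rows = sorted(reader, key=lambda row: (row[chrom_col], int(row[pos_col])))

    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=reader.fieldnames,
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

A coordinate-sorted file of this kind is what bgzip compression and tabix indexing expect as input.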

gain.annotation.prepare_tabular.cli(argv: list[str] | None = None) None[source]

Entry point for the prepare_tabular tool.

gain.annotation.processing_pipeline module

class gain.annotation.processing_pipeline.Annotation(annotatable: Annotatable | None, context: dict[str, ~typing.Any]=<factory>)[source]

Bases: object

A pair of an annotatable and its relevant context.

The context can hold any key/value pair relevant to the annotatable and is typically used to store the results of annotators.

annotatable: Annotatable | None
context: dict[str, Any]
class gain.annotation.processing_pipeline.AnnotationPipelineAnnotatablesBatchFilter(annotation_pipeline: AnnotationPipeline)[source]

Bases: AnnotationsWithSourceBatchFilter, AnnotationPipelineContextManager

Filter that annotates an AnnotationWithSource batch using a pipeline.

class gain.annotation.processing_pipeline.AnnotationPipelineAnnotatablesFilter(annotation_pipeline: AnnotationPipeline)[source]

Bases: AnnotationsWithSourceFilter, AnnotationPipelineContextManager

Filter that annotates an AnnotationWithSource object using a pipeline.

class gain.annotation.processing_pipeline.AnnotationPipelineContextManager(annotation_pipeline: AnnotationPipeline)[source]

Bases: AbstractContextManager

A context manager for annotation pipelines.

class gain.annotation.processing_pipeline.AnnotationsWithSource(source: Any, annotations: list[Annotation])[source]

Bases: object

A pair of a list of Annotation instances and their source.

The source is typically a variant read from some format, with the ‘annotations’ attribute corresponding to its alleles.

annotations: list[Annotation]
source: Any
class gain.annotation.processing_pipeline.AnnotationsWithSourceBatchFilter[source]

Bases: Filter

Base class for filters that work on AnnotationsWithSource batches.

filter(data: Sequence[AnnotationsWithSource]) Sequence[AnnotationsWithSource][source]

Filter a batch of AnnotationsWithSource objects.

class gain.annotation.processing_pipeline.AnnotationsWithSourceFilter[source]

Bases: Filter

Base class for filters that work on AnnotationsWithSource objects.

filter(data: AnnotationsWithSource) AnnotationsWithSource[source]

Filter a single AnnotationsWithSource object.

class gain.annotation.processing_pipeline.DeleteAttributesFromAWSBatchFilter(attributes_to_remove: Sequence[str])[source]

Bases: Filter

Filter to remove items from AnnotationsWithSource (AWS) batches. Works in-place.

filter(data: Sequence[AnnotationsWithSource]) Sequence[AnnotationsWithSource][source]
class gain.annotation.processing_pipeline.DeleteAttributesFromAWSFilter(attributes_to_remove: Sequence[str])[source]

Bases: Filter

Filter to remove items from AnnotationsWithSource (AWS) objects. Works in-place.

filter(data: AnnotationsWithSource) AnnotationsWithSource[source]
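The data shapes described in this module, and the idea of an in-place deletion filter, can be sketched with simplified stand-ins. The two dataclasses below only mirror the documented field names and are not the library classes. The function delete_attributes is hypothetical, and it assumes that the items being removed are keys of each annotation's context.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Annotation:
    """Stand-in: an annotatable paired with its context."""
    annotatable: Any
    context: dict[str, Any] = field(default_factory=dict)


@dataclass
class AnnotationsWithSource:
    """Stand-in: a source object paired with its annotations."""
    source: Any
    annotations: list[Annotation]


def delete_attributes(
    aws: AnnotationsWithSource,
    attributes_to_remove: list[str],
) -> AnnotationsWithSource:
    """Remove the named keys from every context, modifying `aws` in place."""
    for annotation in aws.annotations:
        for name in attributes_to_remove:
            # Keys that are absent are ignored.
            annotation.context.pop(name, None)
    return aws
```

Returning the same object that was passed in keeps the filter chainable while making the in-place behaviour explicit.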

gain.annotation.record_to_annotatable module

class gain.annotation.record_to_annotatable.CSHLAlleleRecordToAnnotatable(columns: tuple, ref_genome: ReferenceGenome | None)[source]

Bases: RecordToAnnotable

Transform a CSHL variant record into a VCF allele annotatable.

build(record: dict[str, str]) Annotatable[source]

Constructs an annotatable from a record.

class gain.annotation.record_to_annotatable.DaeAlleleRecordToAnnotatable(columns: tuple, ref_genome: ReferenceGenome | None)[source]

Bases: RecordToAnnotable

Transform a DAE allele record into a VCF allele annotatable.

build(record: dict[str, str]) Annotatable[source]

Constructs an annotatable from a record.

class gain.annotation.record_to_annotatable.RecordToAnnotable(columns: tuple, ref_genome: ReferenceGenome | None)[source]

Bases: ABC

Base class for record to annotatable transformation.

abstractmethod build(record: dict[str, str]) Annotatable[source]

Constructs an annotatable from a record.

class gain.annotation.record_to_annotatable.RecordToCNVAllele(columns: tuple, ref_genome: ReferenceGenome | None)[source]

Bases: RecordToAnnotable

Transform a columns record into a CNV allele annotatable.

build(record: dict[str, str]) Annotatable[source]

Constructs an annotatable from a record.

class gain.annotation.record_to_annotatable.RecordToPosition(columns: tuple, ref_genome: ReferenceGenome | None)[source]

Bases: RecordToAnnotable

build(record: dict[str, str]) Annotatable[source]

Constructs an annotatable from a record.

class gain.annotation.record_to_annotatable.RecordToRegion(columns: tuple, ref_genome: ReferenceGenome | None)[source]

Bases: RecordToAnnotable

build(record: dict[str, str]) Annotatable[source]

Constructs an annotatable from a record.

class gain.annotation.record_to_annotatable.RecordToVcfAllele(columns: tuple, ref_genome: ReferenceGenome | None)[source]

Bases: RecordToAnnotable

build(record: dict[str, str]) Annotatable[source]

Constructs an annotatable from a record.

class gain.annotation.record_to_annotatable.VcfLikeRecordToVcfAllele(columns: tuple, ref_genome: ReferenceGenome | None)[source]

Bases: RecordToAnnotable

Transform a columns record into VCF allele annotatable.

build(record: dict[str, str]) Annotatable[source]

Constructs an annotatable from a record.

gain.annotation.record_to_annotatable.add_record_to_annotable_arguments(parser: ArgumentParser) None[source]
gain.annotation.record_to_annotatable.build_annotatable_from_dict(obj: dict[str, str], ref_genome: ReferenceGenome | None = None) Annotatable[source]

Build an annotatable from a dictionary of string values.

gain.annotation.record_to_annotatable.build_record_to_annotatable(renamed_columns: dict[str, str], available_columns: set[str], ref_genome: ReferenceGenome | None = None) RecordToAnnotable[source]

Transform a variant record into an annotatable.

Parameters

renamed_columns : dict[str, str]

Mapping from expected internal column identifiers (e.g. “col_<field>”) to the actual column names present in the input source. A column can be excluded by mapping its identifier to “-”. Example rename:

“col_<field>”: “<input source column name for the field>”

Example exclude:

“col_<field>”: “-”

available_columns : set[str]

The set of column names available in the input records.

ref_genome : ReferenceGenome | None, optional

Optional reference genome context used for creating annotatables. Not all annotatables require it.
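The rename and exclude conventions described for renamed_columns can be sketched as a small resolution step. This is an assumption-laden illustration, not the library's logic: the function resolve_columns, the defaults mapping, and the concrete identifiers such as col_chrom are hypothetical instances of the “col_<field>” pattern.

```python
def resolve_columns(
    defaults: dict[str, str],
    renamed_columns: dict[str, str],
    available_columns: set[str],
) -> dict[str, str]:
    """Map each identifier to the input column that should be used for it.

    `defaults` (hypothetical) gives the column name assumed when an
    identifier is not renamed.
    """
    resolved: dict[str, str] = {}
    for identifier, default_name in defaults.items():
        name = renamed_columns.get(identifier, default_name)
        if name == "-":
            # The identifier was explicitly excluded.
            continue
        if name in available_columns:
            resolved[identifier] = name
    return resolved


columns = resolve_columns(
    defaults={"col_chrom": "chrom", "col_pos": "pos", "col_ref": "ref"},
    renamed_columns={"col_chrom": "CHR", "col_ref": "-"},
    available_columns={"CHR", "pos", "ref"},
)
```

Here col_chrom is read from the CHR column, col_pos falls back to its default name, and col_ref is dropped even though a ref column exists in the input.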

gain.annotation.score_annotator module

This module contains the implementation of the three score annotators.

The genomic score annotators defined here are position_score_annotator, np_score_annotator, and allele_score_annotator.

class gain.annotation.score_annotator.AlleleScoreAnnotator(pipeline: AnnotationPipeline, info: AnnotatorInfo)[source]

Bases: GenomicScoreAnnotatorBase

Annotator for allele-level genomic scores (frequencies, pathogenicity…).

Operates in one of two modes, selected by the mode parameter:

  • allele (default): performs an exact chrom/pos/ref/alt lookup and returns the single matching line’s scores. The annotatable must be a VCFAllele; other types receive an empty result.

  • region: iterates all allele lines that overlap the annotatable’s span and aggregates their scores. Works with any Annotatable (VCFAllele, Region, CNV, …). An aggregator must be defined for every score attribute, either in the attribute config or as the score’s allele_aggregator default in the resource YAML.
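The difference between the two modes can be sketched with plain Python. Everything here is a hypothetical stand-in: AlleleLine is not the library's score line type, a single score named freq is assumed, overlap is simplified to the line's position falling inside the span, and an unmatched lookup is assumed to yield an empty result.

```python
from typing import Callable, NamedTuple


class AlleleLine(NamedTuple):
    """Stand-in for one allele line of a score resource."""
    chrom: str
    pos: int
    ref: str
    alt: str
    freq: float


def lookup_allele(
    lines: list[AlleleLine], chrom: str, pos: int, ref: str, alt: str,
) -> dict[str, float]:
    """Allele mode: exact chrom/pos/ref/alt match, single line's scores."""
    for line in lines:
        if (line.chrom, line.pos, line.ref, line.alt) == (chrom, pos, ref, alt):
            return {"freq": line.freq}
    return {}


def lookup_region(
    lines: list[AlleleLine], chrom: str, begin: int, end: int,
    aggregator: Callable[[list[float]], float],
) -> dict[str, float]:
    """Region mode: aggregate the scores of all lines inside the span."""
    values = [
        line.freq for line in lines
        if line.chrom == chrom and begin <= line.pos <= end
    ]
    return {"freq": aggregator(values)} if values else {}
```

The aggregator argument reflects the requirement that region mode needs an aggregator for every score attribute, since several lines may contribute to one value.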

Virtual allele attribute

All annotators expose a virtual attribute "allele" (is_default=False) that is synthesised rather than read from the data file.

  • In allele mode: returns ["chrom:pos:ref:alt"] for the matched line.

  • In region mode: returns the set of "chrom:pos:ref:alt" strings for all lines that pass the optional allele_filter.

Optionally append score values to each allele string with include_attributes.

allele_filter

An optional annotator-level boolean expression evaluated against each ScoreLine before it is included in the result. Supported operators: >, <, ==, in, and, or. Variables resolve via ScoreLine.get_score.

ALLELE_FILTER_GRAMMAR = '\n?start: filter | and_ | or\n\nand_: filter "and" filter\n\nor: filter "or" filter\n\n?filter: subject operator subject | or | and_\n\n?subject: variable | value\n\nvalue: "\\"" word "\\"" | number\n\nvariable: word\n\noperator: equals | greater_than | less_than | in\n\nequals: "=="\n\ngreater_than: ">"\n\nless_than: "<"\n\nin: "in"\n\nword: /[0-9]*[a-zA-Z_!@#$%^&*()_+][a-zA-Z0-9!@#$%^&*()_+]*/\n\nnumber: /-?[0-9]+\\.?[0-9]*/\n\n%ignore " "\n'
build_score_aggregator_documentation(attr: Attribute) list[str][source]

Collect score aggregator documentation.

get_attribute_defaults(spec: AttributeSpec) dict[str, Any][source]
get_attribute_specs() dict[str, AttributeSpec][source]

Return score attribute specs plus the virtual allele.

class gain.annotation.score_annotator.GenomicScoreAnnotatorBase(pipeline: AnnotationPipeline, info: AnnotatorInfo, score: GenomicScore)[source]

Bases: AnnotatorBase

Genomic score base annotator.

add_score_aggregator_documentation(attr: Attribute, aggregator: str, attribute_conf_agg: AggregatorDefinition | str | dict[str, Any] | None) None[source]

Collect score aggregator documentation.

build_attribute_help(attr: Attribute) str[source]

Build attribute help.

abstractmethod build_score_aggregator_documentation(attr: Attribute) list[str][source]

Construct score aggregator documentation.

close() None[source]
get_attribute_defaults(spec: AttributeSpec) dict[str, Any][source]
get_attribute_specs() dict[str, AttributeSpec][source]

Get specs of all attributes the annotator can produce.

is_open() bool[source]
open() Annotator[source]
simple_score_queries: list[str]
class gain.annotation.score_annotator.PositionScoreAnnotator(pipeline: AnnotationPipeline, info: AnnotatorInfo)[source]

Bases: GenomicScoreAnnotatorBase

This class implements the position_score_annotator.

The position_score_annotator requires the resource_id parameter, whose value must be an id of a genomic resource of type position_score.

The position_score resource provides a set of scores (see …) that the position_score_annotator uses as attributes to assign to the annotatable.

The position_score_annotator recognizes one attribute level parameter called aggregator that controls how the position scores are aggregated for annotatables that refer to a region of the reference genome. The deprecated name position_aggregator is still accepted.

build_score_aggregator_documentation(attr: Attribute) list[str][source]

Collect score aggregator documentation.

get_attribute_defaults(spec: AttributeSpec) dict[str, Any][source]
gain.annotation.score_annotator.build_allele_score_annotator(pipeline: AnnotationPipeline, info: AnnotatorInfo) Annotator[source]
gain.annotation.score_annotator.build_np_score_annotator(pipeline: AnnotationPipeline, info: AnnotatorInfo) Annotator[source]
gain.annotation.score_annotator.build_position_score_annotator(pipeline: AnnotationPipeline, info: AnnotatorInfo) Annotator[source]
gain.annotation.score_annotator.get_genomic_resource(pipeline: AnnotationPipeline, info: AnnotatorInfo, resource_types: set[str]) GenomicResource[source]

Return genomic score resource used for given genomic score annotator.

gain.annotation.simple_effect_annotator module

class gain.annotation.simple_effect_annotator.SimpleEffect(effect_type: str, transcript_id: str, gene: str)[source]

Bases: object

effect_type: str
gene: str
transcript_id: str
class gain.annotation.simple_effect_annotator.SimpleEffectAnnotator(pipeline: AnnotationPipeline, info: AnnotatorInfo)[source]

Bases: AnnotatorBase

Simple effect annotator class.

call_region(chrom: str, beg: int, end: int, tx: TranscriptModel, *, func_name: str, classification: str) SimpleEffect | None[source]

Call a region with a specific classification.

cds_intron_regions(transcript: TranscriptModel) list[Region][source]

Return the CDS intron regions of the transcript.

cds_regions(transcript: TranscriptModel) Sequence[Region][source]

Return the regions of the transcript classified as coding.

static effect_types() list[str][source]
get_attribute_defaults(spec: AttributeSpec) dict[str, Any][source]
get_attribute_specs() dict[str, AttributeSpec][source]

Get specs of all attributes the annotator can produce.

noncoding_regions(transcript: TranscriptModel) list[Region][source]

Return the noncoding regions of the transcript.

open() Annotator[source]
peripheral_regions(transcript: TranscriptModel) list[Region][source]

Return the peripheral regions of the transcript.

run_annotate(chrom: str, beg: int, end: int) dict[str, set[SimpleEffect]][source]

Return classification with a set of affected genes.

gain.annotation.simple_effect_annotator.build_simple_effect_annotator(pipeline: AnnotationPipeline, info: AnnotatorInfo) Annotator[source]

gain.annotation.utils module

gain.annotation.utils.find_annotator_gene_models(info: AnnotatorInfo, grr: GenomicResourceRepo) GeneModels[source]

Get gene models from the annotator info or genomic context.

gain.annotation.utils.find_annotator_reference_genome(info: AnnotatorInfo, gene_models: GeneModels, pipeline: AnnotationPipeline, grr: GenomicResourceRepo) ReferenceGenome[source]

Get reference genome from the annotator info or genomic context.

Module contents