GAIn Python interface ====================== The command-line workflow is designed for specific GAIn tasks, such as annotating variants, positions, or regions and inspecting resources in a GRR. This covers many common annotation use cases. However, users may also want to ask more open-ended questions, such as summarizing the contents of a repository, querying a resource over a custom interval, combining information from several resources, or generating a custom plot. The Python interface supports these workflows by allowing users to access GRR resources directly and process the results using standard Python tools. The first step is to connect to a Genomic Resource Repository (GRR). In the simplest case, a repository can be accessed without providing any arguments: .. code-block:: python from gain.genomic_resources.repository_factory import build_genomic_resource_repository grr = build_genomic_resource_repository() Calling ``build_genomic_resource_repository()`` without arguments uses the default GRR configuration available in the user environment. In a standard GAIn installation, the default configuration provides access to the public IossifovLab GRR. Users can also access local repositories or other public repositories by providing the repository definition explicitly. For example, the code below accesses the current working directory as a local GRR: .. code-block:: python import os from gain.genomic_resources.repository_factory import build_genomic_resource_repository grr = build_genomic_resource_repository({ "id": "local_grr", "type": "directory", "directory": os.getcwd(), }) After connecting to a repository, individual resources can be accessed using their resource IDs. For example, users can access a reference genome, a gene models resource, or a position score resource, and then call resource-specific methods on the resulting Python objects. This allows GRR resources to be queried directly in Python for custom analyses that are not part of a predefined annotation workflow. The examples below illustrate three uses of the Python interface: inspecting chromosome lengths from a reference genome, locating a gene and retrieving scores across its interval, and summarizing the number of resources available for different genome builds. `GAIn development page `_ provides more detail on the Python methods available for different resource types. 1: Chromosome lengths ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The following example shows how to access a reference genome and retrieve chromosome lengths for a selected set of canonical chromosomes (chromosomes 1-22, X, Y, and M). First, the default GRR is accessed and a reference genome resource is retrieved using its resource ID. The script then defines a set of canonical chromosome names and iterates over the chromosomes available in the genome. For each chromosome that matches the canonical set, the chromosome length is retrieved and printed. .. code-block:: python from gain.genomic_resources.repository_factory import build_genomic_resource_repository grr = build_genomic_resource_repository() from gain.genomic_resources.reference_genome import build_reference_genome_from_resource_id genome = build_reference_genome_from_resource_id("hg38/genomes/GRCh38-hg38", grr).open() # Define canonical chromosomes canonical_chroms = {f"chr{i}" for i in range(1, 23)} canonical_chroms.update(["chrX", "chrY", "chrM"]) for chrom in genome.chromosomes: if chrom in canonical_chroms: print(chrom, genome.get_chrom_length(chrom)) Save this file as ``python_1.py``, and run it with: .. code-block:: bash python python_1.py The output lists the canonical chromosome names and their lengths for the ``hg38/genomes/GRCh38-hg38`` genome resource. The same chromosome lengths are reported in the `HTML summary page `_ for this resource, allowing users to compare the Python output with the GAIn resource summary. .. code-block:: text chr1 248956422 chr2 242193529 chr3 198295559 chr4 190214555 chr5 181538259 chr6 170805979 chr7 159345973 chr8 145138636 chr9 138394717 chr10 133797422 chr11 135086622 chr12 133275309 chr13 114364328 chr14 107043718 chr15 101991189 chr16 90338345 chr17 83257441 chr18 80373285 chr19 58617616 chr20 64444167 chr21 46709983 chr22 50818468 chrX 156040895 chrY 57227415 chrM 16569 2: Position scores across a gene ^^^^^^^^^^ This example shows how to retrieve and visualize position scores across a gene (e.g. TP53). The script uses the MANE 1.5 gene models resource (``hg38/gene_models/MANE/1.5``) to obtain the genomic coordinates of the primary transcript. It then queries the ``phastCons100way`` position score resource (hg38/scores/phastCons100way) across that interval to retrieve conservation scores. The scores are plotted over the TP53 region, with genomic position on the ``x-axis`` and the position score on the ``y-axis``, and the resulting figure is saved as ``TP53_phastCons100way.png`` in the current working directory. This example illustrates how gene models and position score resources can be combined to extract and visualize signal over a biologically meaningful region. .. code-block:: python GENE_NAME = "TP53" from gain.genomic_resources.repository_factory import build_genomic_resource_repository grr = build_genomic_resource_repository() from gain.genomic_resources.gene_models import build_gene_models_from_resource_id gene_models = build_gene_models_from_resource_id("hg38/gene_models/MANE/1.5", grr).load() tx = gene_models.gene_models_by_gene_name(GENE_NAME)[0] chrom, start, end = tx.chrom, tx.tx[0], tx.tx[1] print(f"{GENE_NAME} is on {chrom}, from position {start} to {end}.") from gain.genomic_resources.genomic_scores import build_score_from_resource_id score = build_score_from_resource_id("hg38/scores/phastCons100way", grr).open() xs = [] ys = [] for pos_begin, pos_end, values in score.fetch_region(chrom, start, end): if values is not None: for p in range(pos_begin, pos_end + 1): xs.append(p) ys.append(values[0]) import matplotlib.pyplot as plt plt.figure(figsize=(10, 4)) plt.plot(xs, ys) plt.xlabel(f"Position on {chrom}") plt.ylabel("phastCons100way") plt.title(f"phastCons100way across {GENE_NAME}") plt.tight_layout() plt.savefig(f"{GENE_NAME}_phastCons100way.png", dpi=150) Save this file as ``python_2.py``, and run it as before: .. code-block:: bash python python_2.py This will produce the following image: .. figure:: figures/TP53_phastCons100way.png In this plot, higher ``phastCons100way`` values indicate positions that are more conserved across the 100-way vertebrate alignment used by the resource. Peaks in the plot mark parts of the TP53 interval that are under stronger evolutionary constraint, while lower values indicate less conserved positions. This provides a compact view of how conservation varies across the TP53 locus. 3: Resource counts by genome ^^^^^^^^^^^^^ This example summarizes how many resources of selected types are available for each genome build in the GRR. The script iterates over all resources, assigns each resource to a genome based on the prefix of its resource ID (e.g., ``hg19/``, ``hg38/``, ``t2t/``), filters by resource type, and counts how many resources of each type are present for each genome. The result is organized as a table (DataFrame). .. code-block:: python import pandas as pd from collections import defaultdict from gain.genomic_resources.repository_factory import build_genomic_resource_repository grr = build_genomic_resource_repository() genomes = ["hg19", "hg38", "t2t"] types = ["genome", "gene_models", "position_score", "allele_score", "cnv_collection"] # initialize counts counts = {g: defaultdict(int) for g in genomes} for resource in grr.get_all_resources(): rid = resource.get_id() rtype = resource.get_type() for g in genomes: if rid.startswith(f"{g}/") and rtype in types: counts[g][rtype] += 1 # build dataframe df = pd.DataFrame( {g: [counts[g][t] for t in types] for g in genomes}, index=types ) df.index.name = "resource_type" print(df) Save this file as ``python_3.py``, and run it as before: .. code-block:: bash python python_3.py The output is a table with resource types as rows and genome builds as columns, where each entry gives the number of resources of that type for the corresponding genome. .. csv-table:: :header-rows: 1 resource_type,hg19,hg38,t2t genome,1,3,1 gene_models,5,47,1 position_score,152,8,0 allele_score,5,31,0 cnv_collection,0,7,0