From mims-harvard-tooluniverse
Traces genetic variants (rsID/coordinates) from regulatory context (GWAS, eQTL, RegulomeDB, ENCODE) to target genes (GTEx, OpenTargets) and pathways (STRING, Reactome) for evidence-graded disease mechanisms.
npx claudepluginhub joshuarweaver/cascade-data-analytics --plugin mims-harvard-tooluniverseThis skill uses the workspace's default tool permissions.
Trace the full causal chain from a genetic variant to its disease mechanism: regulatory context,
Conducts multi-round deep research on GitHub repos via API and web searches, generating markdown reports with executive summaries, timelines, metrics, and Mermaid diagrams.
Dynamically discovers and combines enabled skills into cohesive, unexpected delightful experiences like interactive HTML or themed artifacts. Activates on 'surprise me', inspiration, or boredom cues.
Generates images from structured JSON prompts via Python script execution. Supports reference images and aspect ratios for characters, scenes, products, visuals.
Trace the full causal chain from a genetic variant to its disease mechanism: regulatory context, target gene(s), molecular pathways, and phenotypic consequences. Integrates 7+ databases across 3 evidence layers (regulatory, molecular, disease) to build an evidence-graded mechanistic model.
IMPORTANT: Always use English terms in tool calls. Respond in the user's language.
When uncertain about any scientific fact, SEARCH databases first (PubMed, UniProt, ChEMBL, ClinVar, etc.) rather than reasoning from memory. A database-verified answer is always more reliable than a guess.
Apply when users:
NOT for (use other skills instead):
tooluniverse-variant-interpretationtooluniverse-regulatory-genomicstooluniverse-gwas-snp-interpretationtooluniverse-pharmacogenomicstooluniverse-gene-disease-association| Parameter | Required | Description | Example |
|---|---|---|---|
| variant | Yes | rsID or genomic coordinates | "rs7903146" or "10:112998590:C:T" |
| trait | No | Disease/trait context (helps prioritize) | "type 2 diabetes" |
| tissue | No | Tissue of interest for eQTL/expression | "pancreas" |
Phase 1 Variant Characterization (VEP, population frequency, CADD) -> Phase 2 Regulatory Context (GWAS, RegulomeDB, ENCODE, cCREs) -> Phase 3 Target Gene Identification (GTEx eQTL, OpenTargets L2G) -> Phase 4 Gene Function & Pathways (STRING, Reactome, GO) -> Phase 5 Disease Connection (OpenTargets, GenCC, DisGeNET) -> Phase 6 Mechanistic Synthesis (causal chain, evidence grading, confidence scoring)
Resolve variant identity and get basic annotations.
from tooluniverse import ToolUniverse
tu = ToolUniverse()
tu.load_tools()
# Step 1: VEP annotation
vep = tu.tools.EnsemblVEP_annotate_rsid(variant_id="rs7903146")
# NOTE: param is `variant_id`, NOT `rsid`
# Response format variable: list, {data, metadata}, or {error} -- handle all three
# Extract: consequence_type, gene, chromosome, position, alleles
# Step 2: Population frequency and pathogenicity scores
myvar = tu.tools.MyVariant_query_variants(
query="rs7903146", # NOTE: param is `query`, NOT `variant_id`
fields="dbsnp,gnomad_genome,cadd,clinvar"
)
# Returns: {hits: [{_id, dbsnp, gnomad_genome: {af: {af, ...ancestry}}, cadd: {phred}, clinvar}]}
# gnomAD AF: frequency in general population
# CADD phred: >=20 = top 1% deleterious, >=30 = top 0.1%
# Step 3: Confirm gene context
gwas_snp = tu.tools.gwas_search_snps(rs_id="rs7903146")
# Returns: SNP location, mapped genes, functional class
Assess the variant's regulatory environment.
# All trait associations for this variant
gwas = tu.tools.gwas_search_associations(rs_id="rs7903146", size=50)
# Returns: {data: [{disease_trait, p_value, or_per_copy_number, beta, study_accession}]}
# NOTE: Use gwas_search_associations, NOT gwas_get_associations_for_trait (BROKEN)
regulome = tu.tools.RegulomeDB_query_variant(rsid="rs7903146")
# Returns: {status, data: {score/ranking, probability, features}}
# Score 1a-2a = strong regulatory evidence
# Uses GRCh38 (NOT hg19)
# Check enhancer (H3K27ac), promoter (H3K4me3), and accessibility in disease-relevant tissue
encode_h3k27ac = tu.tools.ENCODE_search_histone_experiments(
histone_mark="H3K27ac", biosample_term_name="pancreas", limit=10)
encode_h3k4me3 = tu.tools.ENCODE_search_histone_experiments(
histone_mark="H3K4me3", biosample_term_name="pancreas", limit=10)
encode_atac = tu.tools.ENCODE_search_chromatin_accessibility(
biosample_term_name="pancreas", limit=10)
# Check candidate cis-regulatory elements
ccres = tu.tools.UCSC_get_encode_cCREs(
chrom="chr10",
start=112998000,
end=112999000
)
# Returns: cCRE type (PLS=promoter, pELS=proximal enhancer, dELS=distal enhancer, CTCF-only)
# Resolve trait to ontology ID for downstream tools
ols = tu.tools.ols_search_terms(query="type 2 diabetes", ontology="efo")
# Returns: {status, data: [{iri, label, short_form, ontology_name}]}
# For OpenTargets: use MONDO_0005148 for T2D (NOT EFO_0001360)
Identify which gene(s) the variant regulates. This is the critical link.
# Find eQTL associations for the nearest gene
eqtls = tu.tools.GTEx_query_eqtl(gene_symbol="TCF7L2", size=100)
# Returns: {status, data: [{snpId, geneSymbol, tissueSiteDetailId, pValue, nes}]}
# Filter for the specific rsID in the results
# NES > 0: alt allele increases expression; NES < 0: decreases
# Check expression levels to prioritize tissues
expression = tu.tools.GTEx_get_median_gene_expression(gene_symbol="TCF7L2")
# Returns: median TPM across GTEx tissues
# IMPORTANT: GTEx uses v8 data (v10 endpoints may return empty)
# Get credible sets and L2G predictions
credible = tu.tools.OpenTargets_get_variant_credible_sets(
variant_id="10_112998590_C_T" # chr_pos_ref_alt format
)
# Returns: credible set membership, L2G scores for causal genes
# L2G > 0.5: high confidence causal gene
# L2G 0.1-0.5: moderate confidence
# Get gene details for the candidate target
gene_info = tu.tools.MyGene_query_genes(
query="symbol:TCF7L2",
species="human",
fields="symbol,ensembl.gene,entrezgene,name,summary,go",
size=5
)
# Extract Ensembl ID for OpenTargets queries
# Filter by exact symbol match (first hit may not be correct gene)
The nearest gene is often NOT the causal gene for non-coding variants. Regulatory elements can act over hundreds of kilobases, skipping intervening genes entirely. Before looking at tools, reason about which gene is the likely target:
Once target gene(s) identified, characterize molecular function.
# STRING PPI network — NOTE: param is `identifiers` (string), NOT `protein_ids`
ppi = tu.tools.STRING_get_interaction_partners(
identifiers="TCF7L2", species=9606, required_score=700)
# STRING functional enrichment (GO, KEGG, Reactome)
enrichment = tu.tools.STRING_get_functional_enrichment(
identifiers="TCF7L2", species=9606)
# Reactome pathway enrichment — space-separated STRING, NOT array
reactome = tu.tools.ReactomeAnalysis_pathway_enrichment(
identifiers="TCF7L2 CTNNB1 APC")
# UniProt function (returns list of strings, NOT dict)
uniprot_func = tu.tools.UniProt_get_function_by_accession(accession="Q9NQB0")
# PANTHER GO enrichment — GO:0008150=BP, GO:0003674=MF, GO:0005575=CC
panther = tu.tools.PANTHER_enrichment(
gene_list="TCF7L2,CTNNB1,APC,LEF1", organism=9606,
annotation_dataset="GO:0008150")
Link the molecular mechanism to disease phenotype.
# Disease associations for target gene (requires Ensembl ID)
ot_diseases = tu.tools.OpenTargets_get_diseases_phenotypes_by_target_ensembl(
ensemblId="ENSG00000148737")
# Specific gene-disease evidence (both IDs must be pre-resolved)
ot_evidence = tu.tools.OpenTargets_target_disease_evidence(
ensemblId="ENSG00000148737", efoId="MONDO_0005148")
# GenCC curated validity (Definitive/Strong/Moderate/Limited/Disputed/Refuted)
gencc = tu.tools.GenCC_search_gene(gene_symbol="TCF7L2")
# DisGeNET scored associations (requires DISGENET_API_KEY)
disgenet = tu.tools.DisGeNET_search_gene(gene="TCF7L2", limit=20)
# OpenTargets GWAS studies — use MONDO IDs, NOT EFO
ot_gwas = tu.tools.OpenTargets_search_gwas_studies_by_disease(
diseaseIds=["MONDO_0005148"], size=20)
The goal is an explicit chain: Variant -> changed protein or expression -> altered pathway -> disease phenotype. Each arrow is a link that needs evidence. A chain with all links supported is strong. A chain with a missing link is a hypothesis, not a conclusion.
Start with the variant's location. Is this coding or non-coding? Coding variants directly change the protein -- a missense variant swaps one amino acid, a nonsense variant truncates it. Non-coding variants change REGULATION -- they affect how much protein is made, not what protein is made. This distinction determines your entire analysis path.
For coding variants, ask: Is this position conserved across species? Is it in a functional domain (active site, binding interface, transmembrane helix)? A variant in a conserved active site residue is more likely pathogenic than one in a disordered loop. Check ClinVar and CADD for existing assessments, but reason about the structural context yourself.
For non-coding variants, ask: Is there an eQTL linking this variant to a nearby gene? Is the variant in a TFBS, enhancer, or promoter (from RegulomeDB and ENCODE)? A variant in open chromatin with an eQTL in disease-relevant tissue has a clear regulatory mechanism. A variant in a gene desert with no eQTL is much harder to interpret -- acknowledge this uncertainty.
Evaluate each link independently. For each arrow in the chain, ask: what is the evidence? Is it replicated? Is it from the right tissue? Then assess the chain as a whole:
The evidence hierarchy matters. Clinical observation (ClinVar pathogenic with clinical data) is stronger than functional assays (eQTL, chromatin marks), which are stronger than computational predictions (CADD score, L2G model). A CADD score alone is weak evidence. Multiple convergent computational predictions are moderate. An eQTL replicated across studies in disease-relevant tissue is strong.
Always consider alternative mechanisms. If eQTL points to Gene A but L2G favors Gene B, present both models. If the variant could act through splicing rather than expression, note that. Honest uncertainty is more useful than false confidence.
Grade each piece of evidence by how directly it supports a mechanistic link:
Parameter pitfalls (correct param names are shown in code blocks above):
EnsemblVEP_annotate_rsid uses variant_id, NOT rsidMyVariant_query_variants uses query, NOT variant_idReactomeAnalysis_pathway_enrichment takes space-separated STRING, NOT arrayqueryString, NOT querygwas_get_associations_for_trait is BROKEN; use gwas_search_associationsWhen data is missing, adapt the analysis:
When VEP shows an intronic or intergenic variant, GWAS shows strong trait associations, and RegulomeDB returns a high score (1a-2a), the path is straightforward: the variant disrupts a regulatory element. The key question becomes WHICH GENE it regulates. Check eQTL in disease-relevant tissue first, then L2G. When both converge on the same gene, follow the pathway analysis through to disease. This is the "happy path" and yields Established confidence.
When the variant is intergenic with multiple nearby genes and eQTL/L2G evidence is weak or conflicting, resist the temptation to pick the nearest gene. Instead: check eQTLs for ALL candidate genes in the locus, compare L2G scores, and consider whether one candidate has known disease-relevant biology. Present multiple models ranked by evidence. Report as Preliminary or Moderate.
When there is no eQTL, no L2G support, and minimal regulatory annotation, acknowledge the gap honestly. The variant may act through a mechanism not captured by current databases (cell-type-specific regulation, splicing QTL, 3D genome interactions). State what was checked, what was negative, and what experiments would resolve the question. This is a valid and useful finding.
When ToolUniverse tools return partial annotations, aggregate from multiple sources directly:
import requests, pandas as pd
rsid = "rs12345"
# Ensembl VEP REST — functional consequence prediction
vep = requests.get(f"https://rest.ensembl.org/vep/human/id/{rsid}?content-type=application/json").json()
# GTEx eQTL — tissue-specific expression effects
eqtl = requests.get(f"https://gtexportal.org/api/v2/association/singleTissueEqtl",
params={"snpId": rsid, "datasetId": "gtex_v8"}).json()
# ClinVar bulk (variant_summary.txt.gz, ~200MB, all variants)
# Download once, filter locally:
# df = pd.read_csv("https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz", sep="\t")
# hits = df[df["RS# (dbSNP)"] == int(rsid.replace("rs",""))]
# Combine into unified annotation DataFrame
annotations = {"rsid": rsid, "vep_consequence": vep[0]["most_severe_consequence"] if vep else None,
"eqtl_tissues": len(eqtl.get("singleTissueEqtl", [])),
"sources_checked": ["VEP", "GTEx", "ClinVar"]}
See tooluniverse-data-wrangling skill for API patterns and error handling.