Skill

tooluniverse-population-genetics

Analyzes population genetics: allele frequencies (gnomAD/1000 Genomes), HWE testing, Fst, GWAS associations, constraint scores. For cross-population variant comparison and ancestry-aware frequency lookups.

data-engineering

research

Popularity

Stars

1,546

Forks

235

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/mims-harvard-tooluniverse:tooluniverse-population-genetics

User invocable

Model invocation disabled

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

**MC Strategy**: Population genetics MC questions often test whether you know a specific theorem or result. COMPUTE the answer first (use popgen_calculator.py or write Python), then match to options. Don't try to reason about which option "sounds right."

Supporting Files

scripts/popgen_calculator.py

SKILL.md

363 lines · ~5.4k tokens (exceeds 5k compaction limit)

Stats

Language: Python

Maintenance: Excellent

Last Commit: Jul 7, 2026


Population Genetics Analysis

MC Strategy: Population genetics MC questions often test whether you know a specific theorem or result. COMPUTE the answer first (use popgen_calculator.py or write Python), then match to options. Don't try to reason about which option "sounds right."

Analyze population-level genetic variation, allele frequencies, GWAS associations, clinical significance, and evolutionary constraints using ToolUniverse tools.

When to Use

Activate this skill when the user asks about:

Allele frequencies across populations (gnomAD, 1000 Genomes)
GWAS associations for diseases/traits
Clinical variant interpretation (ClinVar, VEP)
Gene-level constraint metrics (pLI, LOEUF, o/e ratios)
Selection, drift, linkage disequilibrium, or population structure
Variant annotation and functional consequences

LOOK UP, DON'T GUESS

Query gnomAD/1000Genomes/GWAS Catalog FIRST for allele frequencies and associations. Preferred: use the PopGen_hwe_test, PopGen_fst, PopGen_inbreeding, and PopGen_haplotype_count tools for HWE, Fst, inbreeding, and haplotype calculations. Fallback: run popgen_calculator.py directly. For theoretical problems (delta-q, drift, LD decay), apply the formulas in the Theoretical Reasoning section below.

Tool Quick Reference

Tool	Key Parameters	Notes
`gnomad_search_variants`	`query` (REQUIRED)	Resolve rsID to variant_id format "CHR-POS-REF-ALT"
`gnomad_get_variant`	`variant_id` (REQUIRED), `dataset`	Population frequencies. Default dataset: gnomad_r3; use gnomad_r4 for latest
`gnomad_get_gene_constraints`	`gene_symbol` (REQUIRED)	pLI, o/e ratios. May timeout -- retry once
`MyVariant_query_variants`	`query` (REQUIRED)	Aggregated: ClinVar + dbSNP + gnomAD + CADD. Uses hg19 coordinates
`EnsemblVEP_annotate_rsid`	`variant_id` (REQUIRED)	Functional consequence, SIFT, PolyPhen. Param is "variant_id" NOT "rsid"
`EnsemblVEP_variant_recoder`	`variant_id` (REQUIRED)	Convert between rsID/HGVS/VCF/SPDI
`gwas_get_snps_for_gene`	`gene_symbol` (REQUIRED)	All GWAS SNPs for a gene
`gwas_search_associations`	`query` (REQUIRED)	GWAS for a disease/trait (NOT gene name -- use gwas_get_snps_for_gene for genes)
`gwas_get_variants_for_trait`	`trait` (REQUIRED)	Variants associated with a trait
`ClinVar_search_variants`	`gene`, `condition`, `significance`	At least one filter required
`RegulomeDB_query_variant`	`rsid` (REQUIRED)	Regulatory scoring (1a=strongest to 7=minimal)

Critical Gotchas

gnomAD variant_id: Format is "CHR-POS-REF-ALT" (no "chr" prefix). Always resolve rsIDs via gnomad_search_variants first.
gwas_search_associations: Takes disease/trait names ONLY. Gene names will fail. Use gwas_get_snps_for_gene for gene-based lookups.
gwas_search_snps: BROKEN (HTTP 500). Use gwas_get_snps_for_gene instead.
VEP/ClinVar responses: Format is variable (list, {data, metadata}, or {error}). Handle all three.
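
The two formatting gotchas above (the gnomAD id format and the three response shapes) can be handled with small helpers. This is a minimal sketch: `to_gnomad_variant_id` and `normalize_response` are hypothetical names, not part of ToolUniverse, and the exact payload inside `data` is an assumption.

```python
def to_gnomad_variant_id(chrom, pos, ref, alt):
    """Build a "CHR-POS-REF-ALT" id, dropping any leading "chr" prefix."""
    chrom = str(chrom)
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    return f"{chrom}-{int(pos)}-{ref}-{alt}"


def normalize_response(resp):
    """Return a list of records from any of the three documented shapes.

    Handles a bare list, a {data, metadata} dict, or an {error} dict.
    """
    if isinstance(resp, list):
        return resp
    if isinstance(resp, dict):
        if "error" in resp:
            raise RuntimeError(f"tool returned an error: {resp['error']}")
        if "data" in resp:
            data = resp["data"]
            # Assumption: "data" may hold one record or a list of records.
            return data if isinstance(data, list) else [data]
    raise TypeError(f"unexpected response shape: {type(resp).__name__}")


print(to_gnomad_variant_id("chr1", 55516888, "G", "A"))  # 1-55516888-G-A
```

Raising on the `{error}` shape, rather than returning an empty list, keeps a failed lookup from being mistaken for "no variants found".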

Workflow Patterns

Variant frequency: gnomad_search_variants -> gnomad_get_variant(dataset="gnomad_r4") -> MyVariant_query_variants (1000G pop breakdowns) -> EnsemblVEP_annotate_rsid

GWAS for disease: gwas_search_associations -> gwas_get_variants_for_trait -> gnomad_get_variant for top hits -> EuropePMC_search_articles

Gene characterization: gnomad_get_gene_constraints -> gwas_get_snps_for_gene -> ClinVar_search_variants -> PubMed_search_articles

Pathogenicity assessment: EnsemblVEP_annotate_rsid -> MyVariant_query_variants (CADD, ClinVar) -> gnomad_get_variant (frequency) -> RegulomeDB_query_variant (if non-coding)

Theoretical Reasoning (CRITICAL for computation problems)

These formulas are needed for quantitative population genetics problems. Work through step by step, showing intermediate values.

Allele Frequency Change Under Selection (delta-q)

For a recessive deleterious allele (fitness: AA=1, Aa=1, aa=1-s):

delta_q = -s * q^2 * p / (1 - s * q^2)

where p = freq(A), q = freq(a), s = selection coefficient.

For dominant deleterious (AA=1, Aa=1-s, aa=1-s):

delta_q = -s * q * p^2 / (1 - s * q * (2 - q))

For heterozygote advantage (AA=1-s1, Aa=1, aa=1-s2):

equilibrium: q_hat = s1 / (s1 + s2)

Example: with s1 = 0.1 and s2 = 0.3, q_hat = 0.1 / (0.1 + 0.3) = 0.25.

Selection against recessives is slow at low q because most a alleles hide in heterozygotes. Time to reduce q from q0 to qt: t ~ (1/qt - 1/q0) / s generations.
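
As a cross-check on the closed forms in this section, delta-q can be computed directly from the three genotype fitnesses with the standard one-locus recursion. This is a sketch with hypothetical function names; it assumes random mating and discrete generations, and q is the frequency of allele a.

```python
def delta_q(q, w_AA, w_Aa, w_aa):
    """One-generation change in freq(a) given genotype fitnesses."""
    p = 1.0 - q
    mean_w = p * p * w_AA + 2 * p * q * w_Aa + q * q * w_aa
    # Allele a is carried by all aa survivors and half of the Aa survivors.
    q_next = (q * q * w_aa + p * q * w_Aa) / mean_w
    return q_next - q


def generations_to_reduce(q0, qt, s):
    """Approximate generations to move a recessive allele from q0 to qt."""
    return (1.0 / qt - 1.0 / q0) / s


s, q = 0.5, 0.5
print(delta_q(q, 1.0, 1.0, 1.0 - s))        # recessive deleterious
print(delta_q(q, 1.0, 1.0 - s, 1.0 - s))    # dominant deleterious
print(delta_q(0.25, 0.9, 1.0, 0.7))         # at q_hat = s1/(s1+s2), about 0
print(generations_to_reduce(0.1, 0.01, 1.0))
```

Feeding in explicit fitnesses avoids picking the wrong closed form when a question uses an unusual dominance scheme.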

Genetic Drift in Small Populations

Variance in allele frequency per generation: Var(delta_p) = pq / (2Ne)

Probability of fixation of a new neutral mutation: 1/(2*Ne)

Time to fixation (given it fixes): ~4*Ne generations for neutral alleles

Heterozygosity decay: H_t = H_0 * (1 - 1/(2*Ne))^t

After t generations, fraction of heterozygosity lost ~ 1 - e^(-t/(2*Ne))

Effective population size (Ne) adjustments:

Unequal sex ratio: Ne = 4NfNm / (Nf + Nm)
Fluctuating size: Ne = harmonic mean of N across generations
Bottleneck: dominated by the smallest generation size

Drift vs selection: Drift dominates when |s| < 1/(2*Ne). A variant with s=0.01 behaves neutrally in a population of Ne < 50.
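
The drift formulas above translate directly into a few helper functions. A minimal sketch with hypothetical names, assuming an idealized Wright-Fisher population:

```python
def heterozygosity_after(h0, ne, t):
    """H_t = H_0 * (1 - 1/(2*Ne))^t."""
    return h0 * (1 - 1 / (2 * ne)) ** t


def ne_unequal_sex_ratio(n_females, n_males):
    """Ne = 4*Nf*Nm / (Nf + Nm)."""
    return 4 * n_females * n_males / (n_females + n_males)


def ne_fluctuating(sizes):
    """Harmonic mean of census sizes across generations."""
    return len(sizes) / sum(1 / n for n in sizes)


def drift_dominates(s, ne):
    """True when |s| < 1/(2*Ne), i.e. the variant is effectively neutral."""
    return abs(s) < 1 / (2 * ne)


print(ne_unequal_sex_ratio(10, 90))      # 36.0
print(ne_fluctuating([100, 10, 100]))    # about 25, pulled toward the bottleneck
print(heterozygosity_after(0.5, 50, 10))
print(drift_dominates(0.01, 40))         # True
```

The harmonic-mean example shows why a single bottleneck generation dominates: one generation at N=10 drags Ne from 100 down to about 25.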

Linkage Disequilibrium (LD) Decay

D = freq(AB) - freq(A)*freq(B), where A and B are alleles at two loci.

Decay with recombination: D_t = D_0 * (1 - r)^t, where r = recombination fraction, t = generations.

Half-life of LD: t_half = -ln(2) / ln(1-r) ~ 0.693/r generations (for small r).

r-squared (normalized LD): r^2 = D^2 / (pA * pa * pB * pb). Range 0-1.

Expected r^2 in finite population at equilibrium: E[r^2] = 1 / (1 + 4Ner) (for drift-recombination balance).

Practical implications:

Tightly linked loci (r < 0.01): the half-life of LD is about 69 generations or longer, so LD persists for hundreds of generations
Loosely linked (r = 0.5, independent assortment): LD halves every generation
GWAS tag SNPs work because LD extends over blocks; block size depends on Ne and recombination rate
African populations have shorter LD blocks (larger historical Ne) -> need denser SNP arrays
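
The LD quantities in this section can be computed as follows. This is a sketch with hypothetical function names; p_a and p_b denote the frequencies of alleles A and B at the two loci.

```python
import math


def ld_d(p_ab, p_a, p_b):
    """D = freq(AB) - freq(A)*freq(B)."""
    return p_ab - p_a * p_b


def ld_r2(d, p_a, p_b):
    """r^2 = D^2 / (pA * pa * pB * pb)."""
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


def ld_after(d0, r, t):
    """D_t = D_0 * (1 - r)^t."""
    return d0 * (1 - r) ** t


def ld_half_life(r):
    """Generations for D to halve: -ln(2) / ln(1 - r)."""
    return -math.log(2) / math.log(1 - r)


def expected_r2(ne, r):
    """Drift-recombination equilibrium: 1 / (1 + 4*Ne*r)."""
    return 1 / (1 + 4 * ne * r)


d = ld_d(0.5, 0.5, 0.5)
print(d, ld_r2(d, 0.5, 0.5))   # 0.25 1.0 (complete LD)
print(ld_half_life(0.01))      # about 69 generations
print(ld_half_life(0.5))       # about 1 generation
```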

Hardy-Weinberg Equilibrium

For alleles A (freq p) and a (freq q=1-p): expected genotypes AA=p^2, Aa=2pq, aa=q^2.

Chi-square test: df=1 (2 alleles). Preferred: use PopGen_hwe_test tool. Fallback: popgen_calculator.py --type hwe --AA N1 --Aa N2 --aa N3.

Causes of HWE departure: non-random mating, selection, migration, drift, genotyping error. Excess homozygotes -> inbreeding or population structure (Wahlund effect). Excess heterozygotes -> overdominant selection or negative assortative mating.
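
When neither the PopGen_hwe_test tool nor the bundled script is available, the test is short enough to write by hand. The sketch below is not the implementation in popgen_calculator.py; `hwe_chi_square` is a hypothetical name, and it relies on the fact that the chi-square survival function at df=1 equals erfc(sqrt(x/2)).

```python
import math


def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square HWE test (df=1) from observed genotype counts."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)   # allele frequency estimated from the data
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


chi2, p_value = hwe_chi_square(30, 40, 30)
print(round(chi2, 3), round(p_value, 4))   # 4.0 0.0455 (heterozygote deficit)
```

df is 1 rather than 2 because one parameter (p) is estimated from the same counts.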

Heritability

H^2 (broad-sense) = V_G / V_P; h^2 (narrow-sense) = V_A / V_P
V_G includes ALL genetic variance: additive + dominance + epistasis. Trap: "broad-sense" is not just additive.
Under HWE with two alleles (p, q): genotype frequencies are p^2, 2pq, q^2
Phenotype frequency from genotype: sum(genotype_freq * penetrance) for each genotype class
For quantitative traits: V_P = V_G + V_E (no covariance assumed)
With dominance: assign genotypic values (e.g., AA=a, Aa=d, aa=-a), compute mean, then V_G from freq-weighted squared deviations
PGS vs SNP-h² trap: PGS R² is NOT necessarily ≤ h²_SNP. With large GWAS, PGS can exceed SNP-h² by tagging rare causal variants through LD with common SNPs. The word "necessarily" makes this claim False. h²_SNP is estimated from common variants; PGS can capture additional variance.
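
The single-locus variance calculation described above (genotypic values AA=a, Aa=d, aa=-a) can be sketched as follows. Function names are hypothetical; the V_A and V_D expressions are the standard single-locus decomposition under HWE, with p the frequency of A, and epistasis is ignored because only one locus is modeled.

```python
def variance_components(p, a, d):
    """Mean, V_G, V_A and V_D for one biallelic locus under HWE."""
    q = 1 - p
    freqs = {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}
    values = {"AA": a, "Aa": d, "aa": -a}
    mean = sum(freqs[g] * values[g] for g in freqs)
    # V_G from frequency-weighted squared deviations, as in the text.
    v_g = sum(freqs[g] * (values[g] - mean) ** 2 for g in freqs)
    alpha = a + d * (q - p)            # average effect of allele substitution
    v_a = 2 * p * q * alpha ** 2
    v_d = (2 * p * q * d) ** 2
    return {"mean": mean, "V_G": v_g, "V_A": v_a, "V_D": v_d}


def heritabilities(v_a, v_d, v_e):
    """Return (H^2, h^2), assuming V_P = V_A + V_D + V_E."""
    v_p = v_a + v_d + v_e
    return (v_a + v_d) / v_p, v_a / v_p


comp = variance_components(p=0.5, a=1.0, d=1.0)   # complete dominance
print(comp)                                       # V_G = V_A + V_D = 0.5 + 0.25
print(heritabilities(comp["V_A"], comp["V_D"], v_e=0.25))   # (0.75, 0.5)
```

With complete dominance the broad-sense value exceeds the narrow-sense one, which is the gap the "broad-sense is not just additive" trap exploits.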

Path Analysis (Causal Diagrams)

Trace ALL paths from cause to effect through the diagram (direct + indirect)
Each path's contribution = product of path coefficients along that path
Total effect (correlation) = sum of contributions from all paths
Indirect effects can mask (suppression) or inflate (confounding) the direct effect
Unanalyzed correlations (double-headed arrows) count as valid path segments
Never ignore indirect paths — the total is rarely just the direct arrow
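
The tracing rules above reduce to a sum of products. A minimal sketch, assuming each path has already been written out as a list of its coefficients (`total_effect` is a hypothetical helper, and the coefficients are illustrative):

```python
from math import prod


def total_effect(paths):
    """Sum over paths of the product of coefficients along each path."""
    return sum(prod(coeffs) for coeffs in paths)


# Direct path X->Y (0.3) plus an indirect path X->M->Y (0.5 * 0.4).
print(total_effect([[0.3], [0.5, 0.4]]))    # about 0.5

# Suppression: a negative indirect path cancels the direct effect.
print(total_effect([[0.3], [0.5, -0.6]]))   # about 0
```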

Genetic Combinatorics (F2 crosses, haplotype counting)

For n SNPs between two inbred (homozygous) strains:

F1 is heterozygous at all n loci
F2 distinct haplotypes (unique chromosomes) = 2^n (each SNP contributes the parental A or B allele), e.g., 5 SNPs → 2^5 = 32; subtract the 2 parental haplotypes if "novel" is asked → 30
F2 distinct diploid genotypes = 3^n (AA, AB, BB at each locus)
ALWAYS write and run Python code (python3 -c "...") for these counts. Never enumerate by hand.
For specimens/counting from field data: parse the data into a structure and compute programmatically.
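
The counts above can be confirmed by brute-force enumeration rather than by trusting the exponent. A sketch with a hypothetical function name; it counts the possible combinations, assuming recombination can generate every one of them.

```python
from itertools import product


def f2_counts(n_snps):
    """Enumerate possible haplotypes and diploid genotypes for n SNPs."""
    haplotypes = set(product("AB", repeat=n_snps))
    parental = {tuple("A" * n_snps), tuple("B" * n_snps)}
    genotypes = set(product(("AA", "AB", "BB"), repeat=n_snps))
    return {
        "haplotypes": len(haplotypes),
        "novel_haplotypes": len(haplotypes - parental),
        "genotypes": len(genotypes),
    }


print(f2_counts(5))
# {'haplotypes': 32, 'novel_haplotypes': 30, 'genotypes': 243}
```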

Mutation-Selection Balance

Equilibrium frequency of a deleterious allele:

Recessive lethal: q_hat = sqrt(mu/s) ~ sqrt(mu) when s=1
Dominant lethal: q_hat = mu/s
Example: mu=1e-5, s=1 (recessive lethal) -> q_hat = sqrt(1e-5) ~ 0.0032 (carrier freq 2pq ~ 0.0063)
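
The equilibrium frequencies above are one-liners in Python. A sketch with hypothetical function names:

```python
import math


def q_hat_recessive(mu, s):
    """Mutation-selection balance for a recessive deleterious allele."""
    return math.sqrt(mu / s)


def q_hat_dominant(mu, s):
    """Mutation-selection balance for a dominant deleterious allele."""
    return mu / s


def carrier_frequency(q):
    """Heterozygote frequency 2pq under HWE."""
    return 2 * q * (1 - q)


q = q_hat_recessive(1e-5, 1.0)
print(round(q, 4), round(carrier_frequency(q), 4))   # 0.0032 0.0063
```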

F-statistics and Population Structure

Fis: Inbreeding within subpopulations (heterozygote deficit within demes)
Fst: Differentiation between subpopulations. Fst = Var(p) / (p_bar * q_bar)
Fit: Total inbreeding. (1-Fit) = (1-Fis)(1-Fst)
Fst interpretation: <0.05 little, 0.05-0.15 moderate, 0.15-0.25 great, >0.25 very great differentiation
Preferred: use PopGen_fst tool. Fallback: popgen_calculator.py --type fst --p1 X --p2 Y --n1 N1 --n2 N2
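
For a quick check of the definitional formula, Fst can be computed straight from subpopulation allele frequencies. This sketch implements Var(p) / (p_bar * q_bar) with equally weighted subpopulations. It is not the Weir-Cockerham estimator used by the tool and script, which corrects for sample size, so values may differ. Function names are hypothetical.

```python
def fst_from_frequencies(freqs):
    """Wright's Fst = Var(p) / (p_bar * q_bar), equal deme weights."""
    k = len(freqs)
    p_bar = sum(freqs) / k
    var_p = sum((p - p_bar) ** 2 for p in freqs) / k   # population variance
    return var_p / (p_bar * (1 - p_bar))


def fit_from(fis, fst):
    """(1 - Fit) = (1 - Fis) * (1 - Fst)."""
    return 1 - (1 - fis) * (1 - fst)


print(fst_from_frequencies([0.2, 0.8]))   # about 0.36
print(fit_from(0.1, 0.2))                 # about 0.28
```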

Mendelian Genetics Reasoning Framework

For any genetics cross problem, follow these steps IN ORDER. Do not skip steps.

Step 1: Identify genes, locations, and allele relationships

List every gene involved in the cross
Determine chromosomal location: autosomal vs X-linked (X-linked genes show different inheritance in males vs females)
Determine allele relationships: dominant/recessive, codominant, incomplete dominance
Note any epistasis, suppressor, or modifier interactions between genes

Step 2: Write parental genotypes explicitly

Use standard notation (e.g., Aa Bb for autosomal; X^w X^+ for X-linked)
For X-linked genes, males are hemizygous (X^w Y), not homozygous
If parental genotypes are not given, deduce them from phenotypes and pedigree context

Step 3: Draw Punnett square(s) for each gene

For multi-gene crosses, handle each gene independently (if unlinked) then combine
For linked genes, use recombination frequency to adjust gamete ratios
For X-linked genes, remember that fathers pass X to all daughters and Y to all sons

Step 4: Calculate expected phenotypic ratios

Multiply independent gene ratios (e.g., 3:1 x 3:1 = 9:3:3:1)
For X-linked: calculate male and female ratios separately, then combine or report separately as required

Step 5: Verify ratios sum to 1.0

Convert all ratios to fractions and confirm they sum to 1
If they don't sum to 1, there is an error in the Punnett square or gamete calculation

Step 6: Apply phenotype modification rules AFTER computing genotypic ratios

For epistasis: first compute the full genotypic ratios (e.g., 9:3:3:1), then collapse genotype classes that produce the same phenotype
For suppressor genes: a suppressor homozygote (su/su) restores wild-type in an otherwise mutant background. Apply suppression AFTER determining which individuals carry the mutant allele
Example: 9 A_B_ : 3 A_bb : 3 aaB_ : 1 aabb with recessive epistasis (aa masks B) becomes 9:3:4
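
Steps 3 to 6 can be sketched in code: enumerate the 16 gamete combinations of a dihybrid F1 intercross, then collapse genotypes into phenotypes with a rule supplied by the caller. Function names are hypothetical, and the sketch assumes two unlinked autosomal genes.

```python
from collections import Counter
from itertools import product


def f2_phenotype_counts(phenotype_of):
    """Count phenotypes (out of 16) from an AaBb x AaBb cross."""
    gametes = [a + b for a, b in product("Aa", "Bb")]   # AB, Ab, aB, ab
    counts = Counter()
    for g1, g2 in product(gametes, repeat=2):
        genotype = (g1[0] + g2[0], g1[1] + g2[1])       # (locus A, locus B)
        counts[phenotype_of(genotype)] += 1
    return dict(counts)


def recessive_epistasis(genotype):
    """aa masks the B locus; otherwise B is dominant over b."""
    locus_a, locus_b = genotype
    if "A" not in locus_a:
        return "aa (B masked)"
    return "A_B_" if "B" in locus_b else "A_bb"


print(f2_phenotype_counts(recessive_epistasis))
# {'A_B_': 9, 'A_bb': 3, 'aa (B masked)': 4}
```

Swapping in a different `phenotype_of` rule covers other modified ratios without redoing the Punnett square.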

E. coli Hfr Mapping Framework

For bacterial conjugation and Hfr mapping problems:

Core Principles

In Hfr x F- crosses, the Hfr chromosome is transferred linearly starting from the origin of transfer (oriT)
Gene transfer order = chromosomal order from the origin
Early markers (entering first) are closest to the origin of transfer
Late markers (entering last) are farthest from the origin

Interrupted Mating Experiments

Genes that appear in recombinants at earlier time points are closer to oriT
The time of entry gives the order and approximate distance between genes
Recombinants require integration by homologous recombination (double crossover)

Recombination Frequency Between Markers

KEY TRAP: Highest recombination frequency occurs between markers that are FARTHEST APART on the transferred segment
This is because more time elapses between entry of distant markers, providing more opportunity for recombination events between them
Conversely, markers that enter close together in time show LOW recombination between them
Do NOT confuse "highest recombination frequency" with "first markers to enter" -- these are opposite concepts

Ordering Markers from Hfr Data

Use time-of-entry data to establish gene order relative to oriT
Use recombination frequency data between pairs of selected markers to confirm/refine order
Multiple Hfr strains with different origins can be used to build a circular map
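
Ordering markers from interrupted-mating data amounts to sorting by time of entry. In the sketch below, the marker names and entry times are hypothetical illustration values, not data from a real Hfr strain, and `order_by_entry` is a hypothetical helper.

```python
def order_by_entry(entry_minutes):
    """Return gene order from oriT and the gaps between adjacent markers."""
    ordered = sorted(entry_minutes.items(), key=lambda kv: kv[1])
    order = [gene for gene, _ in ordered]
    gaps = {
        f"{g1}-{g2}": t2 - t1
        for (g1, t1), (g2, t2) in zip(ordered, ordered[1:])
    }
    return order, gaps


# Hypothetical times of first appearance in recombinants (minutes).
order, gaps = order_by_entry({"geneC": 25, "geneA": 8, "geneB": 17})
print(order)   # ['geneA', 'geneB', 'geneC']
print(gaps)    # {'geneA-geneB': 9, 'geneB-geneC': 8}
```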

MCQ Elimination Strategy for Genetics

General MCQ Protocol

ALWAYS evaluate ALL options before choosing an answer
Never select the first option that seems correct -- there may be a better or more precise answer
Read the question stem carefully for qualifiers: "MOST likely", "LEAST likely", "NOT true", "ALWAYS", "NEVER"

"Which is NOT true" Questions

Evaluate EACH statement independently as True or False
Mark each option with T or F before selecting
The answer is the statement marked F
Double-check: verify the "false" statement is genuinely false, not just misleadingly worded

"Which mechanism" Questions

Test each proposed mechanism against ALL observations given in the question
A correct mechanism must explain every observation, not just some
Eliminate mechanisms that contradict even one observation

Specific Traps to Watch For

Subfunctionalization vs neofunctionalization: Subfunctionalization = partitioning of EXISTING ancestral functions between duplicates (both copies needed to perform original function). Neofunctionalization = one copy acquires a genuinely NEW function not present in the ancestor
Copy-neutral LOH: Most often caused by mitotic recombination (segmental, affects part of a chromosome), as opposed to constitutional uniparental disomy (UPD) from trisomy or monosomy rescue, which typically involves a whole chromosome. The question may try to conflate these
Penetrance vs expressivity: Penetrance = fraction of individuals with genotype who show ANY phenotype. Expressivity = degree/severity of phenotype among those who show it. These are distinct concepts
Complementation vs recombination: Complementation = two mutations in DIFFERENT genes restore wild-type in trans. Recombination = exchange between two mutations in the SAME or different genes. Complementation is tested in F1 (heterozygote); recombination is tested in progeny

Common Genetics Reasoning Traps

These are specific patterns that have caused reasoning failures in hard genetics questions. Review before answering genetics MCQs.

Suppressor Genetics

A suppressor mutation, when homozygous, restores wild-type phenotype in an otherwise mutant background
In F2 crosses involving both the original mutation and an autosomal recessive suppressor:
- Treat as a dihybrid cross — the primary mutation and the suppressor segregate independently
- Only 1/4 of F2 are homozygous for the suppressor
- The suppressor only acts in individuals that are also homozygous for the primary mutation
- Use a Punnett square to enumerate all genotypic classes, then apply the suppression rule to determine phenotypes
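
The F2 expectation can be enumerated directly. This sketch assumes the F1 parents are double heterozygotes (m/+ ; su/+), both mutations are recessive, and the two loci are unlinked. The function name is hypothetical.

```python
from fractions import Fraction
from itertools import product


def f2_suppressor_fractions():
    """Return (wild-type, mutant) F2 fractions for m/+ ; su/+ intercross."""
    mutant = 0
    total = 0
    # One allele per parent at each locus; all 16 combinations equally likely.
    for m1, m2, s1, s2 in product("+m", "+m", ("+", "su"), ("+", "su")):
        total += 1
        homozygous_mutant = m1 == "m" and m2 == "m"
        homozygous_suppressor = s1 == "su" and s2 == "su"
        if homozygous_mutant and not homozygous_suppressor:
            mutant += 1
    return Fraction(total - mutant, total), Fraction(mutant, total)


wild_type, mutant = f2_suppressor_fractions()
print(wild_type, mutant)   # 13/16 3/16
```

The 13:3 ratio follows because only m/m individuals that are not su/su (1/4 x 3/4) show the mutant phenotype.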

Non-disjunction (Bridges' Experiments)

Bridges used non-disjunction to prove the chromosome theory of inheritance
X0 males arise from female meiosis non-disjunction events
Meiosis I non-disjunction: both X chromosomes go to one pole -> XX egg + O egg (nullo-X)
Meiosis II non-disjunction: sister chromatids fail to separate -> XX egg from one secondary oocyte
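The classic Bridges result: in a white-eyed female x red-eyed male cross, exceptional white-eyed females (X^w X^w Y, from XX eggs + Y sperm) and exceptional red-eyed males (X^+ 0, from nullo-X eggs + X-bearing sperm; these X0 males are sterile). XXX and Y0 zygotes do not survive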
Key distinction: know which type of non-disjunction (MI vs MII) produces which specific gamete types

GWAS LD Blocks

SNPs WITHIN the same LD block are correlated and can inflate false positive associations (one causal SNP drags along non-causal tag SNPs)
SNPs ACROSS different LD blocks are largely independent and do NOT create misleading cross-locus associations
LD block structure varies by population (shorter in African populations due to larger historical Ne)
Fine-mapping within an LD block is needed to distinguish the causal variant from hitchhiking tag SNPs

Gene Retention After Whole-Genome Duplication

Neofunctionalization: One copy acquires a NEW function -> most commonly cited reason for gene RETENTION after duplication (preserves both copies because each is now essential)
Subfunctionalization: Ancestral functions are PARTITIONED between copies -> explains DIVERGENCE of duplicate copies, but both copies must be retained to maintain the full ancestral function
Dosage balance: Some genes are retained in duplicate to maintain stoichiometric balance in protein complexes
Trap: Questions may ask "what explains retention" vs "what explains divergence" -- these have different best answers
For retention: neofunctionalization (new function makes both copies essential)
For divergence of expression/function: subfunctionalization (partitioning of ancestral roles)

Advanced Genetics Traps v2

PGS vs Heritability: "Necessarily True" Logic

For "necessarily true" questions about PGS and heritability: a statement is necessarily true only if it holds when V_D=0 AND when V_D=V_G. Test the extremes.

Path Diagram Sign Assignment Protocol

Do NOT guess path signs from general knowledge. Signs may differ from well-known systems. Follow this protocol:

Establish reference direction: What varies? What is increasing?
For each path X→Y: Ask ONLY "when X increases, does Y increase (+) or decrease (-)?"
Use the question's experimental context (knockout/control comparisons, provided data) to determine signs — not intuition
Expect negative paths: Path diagrams test your ability to identify negative relationships. All-positive is almost always wrong. Direct residual paths (e) often have opposite sign from expectation.

Chi-Square: "Most Likely to Reject" Protocol

Compute chi-square from the expected ratio given in the question. Compare to chi-square-critical at df = (number of phenotype classes - 1). Pick the answer choice with the highest chi-square, but also check which pattern is biologically diagnostic of the alternative hypothesis.
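
The protocol above can be sketched as a goodness-of-fit helper. The function name is hypothetical, the critical values are the standard alpha = 0.05 cutoffs for df 1 to 4, and the counts in the usage lines are illustrative.

```python
CRITICAL_005 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}   # alpha = 0.05


def chi_square_vs_ratio(observed, ratio):
    """Chi-square of observed class counts against an expected ratio."""
    total = sum(observed)
    ratio_sum = sum(ratio)
    expected = [total * r / ratio_sum for r in ratio]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    return chi2, df, chi2 > CRITICAL_005[df]


# Counts close to 9:3:3:1 -> small chi-square, do not reject.
print(chi_square_vs_ratio([315, 108, 101, 32], [9, 3, 3, 1]))

# A missing class inflates chi-square -> reject.
print(chi_square_vs_ratio([120, 20, 20, 0], [9, 3, 3, 1]))
```

Running each answer choice through the same function makes the "highest chi-square" comparison mechanical, leaving only the biological-pattern check to judgment.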

LD and Misleading GWAS Associations

LD block boundaries at recombination hotspots are a source of GWAS false localization — strong signal in the block does not guarantee the causal variant is in the block.

Low-Frequency Allele Detection

Duplex sequencing (unique molecular identifiers + double-strand consensus) can detect alleles at around 0.01% frequency, far below what standard NGS resolves at 80X depth. Simply increasing read depth does NOT help for ultra-rare variants: the per-base Illumina error rate (roughly 0.1-1%) sets a noise floor, so variants much rarer than ~1% cannot be separated from sequencing errors regardless of depth. Error correction methods (UMIs, duplex consensus) are needed to distinguish true rare variants from sequencing errors.

Bundled Computation Script

Script: skills/tooluniverse-population-genetics/scripts/popgen_calculator.py

Preferred: Use ToolUniverse tools (via MCP/SDK) instead of the script when possible:

PopGen_hwe_test tool -- HWE chi-square test. Fallback: popgen_calculator.py --type hwe
PopGen_fst tool -- Weir-Cockerham Fst. Fallback: popgen_calculator.py --type fst
PopGen_inbreeding tool -- Inbreeding coefficient from pedigree. Fallback: popgen_calculator.py --type inbreeding
PopGen_haplotype_count tool -- Expected haplotype diversity. Fallback: popgen_calculator.py --type haplotypes

Fallback script modes (all require --type):

hwe: --AA N --Aa N --aa N -- chi-square HWE test with p-value
fst: --p1 F --p2 F --n1 N --n2 N -- Weir-Cockerham Fst
inbreeding: --pedigree TYPE --generations G -- F from pedigree (self, full-sib, half-sib, first-cousin, etc.)
haplotypes: --snps N --generations G --recomb_rate R -- expected haplotype diversity

Key Concepts

MAF: Minor allele frequency. Common: >5%. Rare: <1%. Ultra-rare: <0.01%.
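pLI: Probability of being LoF-intolerant. >0.9 = LoF-intolerant (likely haploinsufficient) gene.
LOEUF: LoF observed/expected upper bound fraction (upper bound of the o/e confidence interval). <0.35 = highly constrained.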
CADD PHRED: >=10 top 10%, >=20 top 1%, >=30 top 0.1% most deleterious.
Genome-wide significance: GWAS p < 5e-8 (Bonferroni for ~1M independent tests).
Effect size: OR > 1 = risk allele, < 1 = protective. Beta > 0 = increases trait.

Evidence Grading

T1: ClinVar pathogenic/likely pathogenic, FDA pharmacogenomics
T2: gnomAD frequencies, GTEx eQTLs, GWAS genome-wide significant
T3: CADD/SIFT/PolyPhen predictions, RegulomeDB, constraint metrics
T4: VEP consequence terms, dbSNP annotations, literature mentions
