Skill: Population Genetics Analysis
MC Strategy: Population genetics MC questions often test whether you know a specific theorem or result. COMPUTE the answer first (use popgen_calculator.py or write Python), then match to options. Don't try to reason about which option "sounds right."
Analyze population-level genetic variation, allele frequencies, GWAS associations, clinical significance, and evolutionary constraints using ToolUniverse tools.
When to Use
Activate this skill when the user asks about:
- Allele frequencies across populations (gnomAD, 1000 Genomes)
- GWAS associations for diseases/traits
- Clinical variant interpretation (ClinVar, VEP)
- Gene-level constraint metrics (pLI, LOEUF, o/e ratios)
- Selection, drift, linkage disequilibrium, or population structure
- Variant annotation and functional consequences
LOOK UP, DON'T GUESS
Query gnomAD/1000Genomes/GWAS Catalog FIRST for allele frequencies and associations. Preferred: use the PopGen_hwe_test, PopGen_fst, PopGen_inbreeding, and PopGen_haplotype_count tools for HWE, Fst, inbreeding, and haplotype calculations. Fallback: run popgen_calculator.py directly. For theoretical problems (delta-q, drift, LD decay), apply the formulas in the Theoretical Reasoning section below.
Tool Quick Reference
| Tool | Key Parameters | Notes |
|---|---|---|
| gnomad_search_variants | query (REQUIRED) | Resolve rsID to variant_id format "CHR-POS-REF-ALT" |
| gnomad_get_variant | variant_id (REQUIRED), dataset | Population frequencies. Default dataset: gnomad_r3; use gnomad_r4 for latest |
| gnomad_get_gene_constraints | gene_symbol (REQUIRED) | pLI, o/e ratios. May timeout -- retry once |
| MyVariant_query_variants | query (REQUIRED) | Aggregated: ClinVar + dbSNP + gnomAD + CADD. Uses hg19 coordinates |
| EnsemblVEP_annotate_rsid | variant_id (REQUIRED) | Functional consequence, SIFT, PolyPhen. Param is "variant_id" NOT "rsid" |
| EnsemblVEP_variant_recoder | variant_id (REQUIRED) | Convert between rsID/HGVS/VCF/SPDI |
| gwas_get_snps_for_gene | gene_symbol (REQUIRED) | All GWAS SNPs for a gene |
| gwas_search_associations | query (REQUIRED) | GWAS for a disease/trait (NOT gene name -- use gwas_get_snps_for_gene for genes) |
| gwas_get_variants_for_trait | trait (REQUIRED) | Variants associated with a trait |
| ClinVar_search_variants | gene, condition, significance | At least one filter required |
| RegulomeDB_query_variant | rsid (REQUIRED) | Regulatory scoring (1a=strongest to 7=minimal) |
Critical Gotchas
- gnomAD variant_id: Format is "CHR-POS-REF-ALT" (no "chr" prefix). Always resolve rsIDs via gnomad_search_variants first.
- gwas_search_associations: Takes disease/trait names ONLY. Gene names will fail. Use gwas_get_snps_for_gene for gene-based lookups.
- gwas_search_snps: BROKEN (HTTP 500). Use gwas_get_snps_for_gene instead.
- VEP/ClinVar responses: Format is variable (list, {data, metadata}, or {error}). Handle all three.
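The three response shapes noted for VEP/ClinVar can be funneled through one small normalizer. This is a minimal sketch: the function name `normalize_response` and its `(records, error)` return convention are illustrative assumptions, not part of ToolUniverse.

```python
def normalize_response(resp):
    """Return (records, error) for a list, {data, metadata}, or {error} response.

    Hypothetical helper: the name and return shape are assumptions for illustration.
    """
    if isinstance(resp, list):
        return resp, None
    if isinstance(resp, dict):
        if "error" in resp:
            return [], str(resp["error"])
        data = resp.get("data", [])
        # A single record may arrive as a dict; wrap it so callers always get a list.
        return (data if isinstance(data, list) else [data]), None
    return [], f"unexpected response type: {type(resp).__name__}"
```

Callers can then branch once on `error` instead of checking the response type at every call site.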
Workflow Patterns
Variant frequency: gnomad_search_variants -> gnomad_get_variant(dataset="gnomad_r4") -> MyVariant_query_variants (1000G pop breakdowns) -> EnsemblVEP_annotate_rsid
GWAS for disease: gwas_search_associations -> gwas_get_variants_for_trait -> gnomad_get_variant for top hits -> EuropePMC_search_articles
Gene characterization: gnomad_get_gene_constraints -> gwas_get_snps_for_gene -> ClinVar_search_variants -> PubMed_search_articles
Pathogenicity assessment: EnsemblVEP_annotate_rsid -> MyVariant_query_variants (CADD, ClinVar) -> gnomad_get_variant (frequency) -> RegulomeDB_query_variant (if non-coding)
Theoretical Reasoning (CRITICAL for computation problems)
These formulas are needed for quantitative population genetics problems. Work through step by step, showing intermediate values.
Allele Frequency Change Under Selection (delta-q)
For a recessive deleterious allele (fitness: AA=1, Aa=1, aa=1-s):
delta_q = -s * q^2 * p / (1 - s * q^2)
where p = freq(A), q = freq(a), s = selection coefficient.
For dominant deleterious (AA=1, Aa=1-s, aa=1-s):
delta_q = -s * q * p^2 / (1 - s * q * (2 - q))
For heterozygote advantage (AA=1-s1, Aa=1, aa=1-s2):
equilibrium: q_hat = s1 / (s1 + s2)
Example: plug in s1 and s2 from the question; q_hat = s1/(s1+s2).
Selection against recessives is slow at low q because most a alleles hide in heterozygotes. Time to reduce q from q0 to qt: t ~ (1/qt - 1/q0) / s generations.
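The recessive-selection formula, the heterozygote-advantage equilibrium, and the time approximation above can be checked numerically. A minimal sketch; the function names are illustrative, and the input values are arbitrary examples rather than data from any question.

```python
def delta_q_recessive(q, s):
    """Per-generation change in freq(a); fitness AA=1, Aa=1, aa=1-s."""
    p = 1.0 - q
    return -s * q**2 * p / (1.0 - s * q**2)

def q_hat_overdominance(s1, s2):
    """Equilibrium freq(a); fitness AA=1-s1, Aa=1, aa=1-s2."""
    return s1 / (s1 + s2)

def generations_to_reduce(q0, qt, s):
    """Approximate generations for selection against a recessive to move q0 -> qt."""
    return (1.0 / qt - 1.0 / q0) / s

# Arbitrary illustrative inputs
print(delta_q_recessive(0.1, 1.0))          # small negative change at low q
print(q_hat_overdominance(0.1, 0.3))        # equilibrium frequency of a
print(generations_to_reduce(0.1, 0.01, 1.0))
```

With s = 1, reducing q from 0.1 to 0.01 takes about 90 generations, which illustrates how slowly selection removes a rare recessive.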
Genetic Drift in Small Populations
Variance in allele frequency per generation: Var(delta_p) = pq / (2Ne)
Probability of fixation of a new neutral mutation: 1/(2N), its initial frequency (equal to 1/(2*Ne) when N = Ne)
Time to fixation (given it fixes): ~4*Ne generations for neutral alleles
Heterozygosity decay: H_t = H_0 * (1 - 1/(2*Ne))^t
After t generations, fraction of heterozygosity lost ~ 1 - e^(-t/(2*Ne))
Effective population size (Ne) adjustments:
- Unequal sex ratio: Ne = 4NfNm / (Nf + Nm)
- Fluctuating size: Ne = harmonic mean of N across generations
- Bottleneck: dominated by the smallest generation size
Drift vs selection: Drift dominates when |s| < 1/(2*Ne). A variant with s=0.01 behaves neutrally in a population of Ne < 50.
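The drift formulas above are simple enough to compute directly. A minimal sketch using only the standard library; the function names and example numbers are illustrative assumptions.

```python
from statistics import harmonic_mean

def heterozygosity(h0, ne, t):
    """H_t = H_0 * (1 - 1/(2*Ne))^t."""
    return h0 * (1.0 - 1.0 / (2.0 * ne)) ** t

def ne_sex_ratio(nf, nm):
    """Ne = 4*Nf*Nm / (Nf + Nm)."""
    return 4.0 * nf * nm / (nf + nm)

def ne_fluctuating(sizes):
    """Ne across generations = harmonic mean of the per-generation sizes."""
    return harmonic_mean(sizes)

def drift_dominates(s, ne):
    """True when |s| < 1/(2*Ne), i.e. the variant behaves as effectively neutral."""
    return abs(s) < 1.0 / (2.0 * ne)

print(ne_sex_ratio(90, 10))              # far below the census size of 100
print(ne_fluctuating([1000, 10, 1000]))  # pulled toward the bottleneck size
print(drift_dominates(0.01, 40), drift_dominates(0.01, 100))
```

The harmonic mean example shows why a single bottleneck generation dominates long-term Ne.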
Linkage Disequilibrium (LD) Decay
D = freq(AB) - freq(A)*freq(B), where A and B are alleles at two loci.
Decay with recombination: D_t = D_0 * (1 - r)^t, where r = recombination fraction, t = generations.
Half-life of LD: t_half = -ln(2) / ln(1-r) ~ 0.693/r generations (for small r).
r-squared (normalized LD): r^2 = D^2 / (pA * pa * pB * pb). Range 0-1.
Expected r^2 in finite population at equilibrium: E[r^2] = 1 / (1 + 4Ner) (for drift-recombination balance).
Practical implications:
- Tightly linked loci (r < 0.01): LD persists for hundreds of generations
- Loosely linked (r = 0.5, independent assortment): LD halves every generation
- GWAS tag SNPs work because LD extends over blocks; block size depends on Ne and recombination rate
- African populations have shorter LD blocks (larger historical Ne) -> need denser SNP arrays
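The LD quantities defined above can be sketched as a handful of one-line functions. The names and example values are illustrative assumptions; `p_a` and `p_b` denote freq(A) and freq(B).

```python
import math

def ld_d(p_ab, p_a, p_b):
    """D = freq(AB) - freq(A)*freq(B)."""
    return p_ab - p_a * p_b

def ld_after(d0, r, t):
    """D_t = D_0 * (1 - r)^t."""
    return d0 * (1.0 - r) ** t

def ld_half_life(r):
    """Generations for D to halve: -ln(2) / ln(1 - r)."""
    return -math.log(2.0) / math.log(1.0 - r)

def r_squared(d, p_a, p_b):
    """r^2 = D^2 / (pA * pa * pB * pb)."""
    return d**2 / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))

def expected_r2(ne, r):
    """Drift-recombination balance: E[r^2] = 1 / (1 + 4*Ne*r)."""
    return 1.0 / (1.0 + 4.0 * ne * r)

print(ld_half_life(0.5))   # unlinked loci: one generation
print(ld_half_life(0.01))  # tightly linked: roughly 69 generations
```

Comparing the exact half-life with 0.693/r for a given r is a quick way to see when the small-r approximation is acceptable.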
Hardy-Weinberg Equilibrium
For alleles A (freq p) and a (freq q=1-p): expected genotypes AA=p^2, Aa=2pq, aa=q^2.
Chi-square test: df=1 (2 alleles). Preferred: use PopGen_hwe_test tool. Fallback: popgen_calculator.py --type hwe --AA N1 --Aa N2 --aa N3.
Causes of HWE departure: non-random mating, selection, migration, drift, genotyping error. Excess homozygotes -> inbreeding or population structure (Wahlund effect). Excess heterozygotes -> overdominant selection or negative assortative mating.
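When neither the PopGen_hwe_test tool nor the script is available, the test can be reproduced by hand. The following is a sketch under stated assumptions, not the implementation used by either tool: it uses the df=1 identity that the chi-square survival function equals erfc(sqrt(x/2)), and the function name is illustrative.

```python
import math

def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square HWE test (df=1). Returns (chi2, p_value, f_hat).

    Assumes both alleles are present; a monomorphic sample gives zero
    expected counts and would raise ZeroDivisionError.
    """
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)      # freq(A) from genotype counts
    q = 1.0 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))   # survival function for df=1
    f_hat = 1.0 - n_Aa / expected[1]     # > 0: heterozygote deficit; < 0: excess
    return chi2, p_value, f_hat

print(hwe_chi_square(298, 489, 213))
```

The sign of `f_hat` indicates the direction of the departure, which maps onto the causes listed above (deficit versus excess of heterozygotes).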
Heritability
- H^2 (broad-sense) = V_G / V_P; h^2 (narrow-sense) = V_A / V_P
- V_G includes ALL genetic variance: additive + dominance + epistasis. Trap: "broad-sense" is not just additive.
- Under HWE with two alleles (p, q): genotype frequencies are p^2, 2pq, q^2
- Phenotype frequency from genotype: sum(genotype_freq * penetrance) for each genotype class
- For quantitative traits: V_P = V_G + V_E (no covariance assumed)
- With dominance: assign genotypic values (e.g., AA=a, Aa=d, aa=-a), compute mean, then V_G from freq-weighted squared deviations
- PGS vs SNP-h² trap: PGS R² is NOT necessarily ≤ h²_SNP. With large GWAS, PGS can exceed SNP-h² by tagging rare causal variants through LD with common SNPs. The word "necessarily" makes this claim False. h²_SNP is estimated from common variants; PGS can capture additional variance.
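The dominance bullet above (genotypic values AA=a, Aa=d, aa=-a) can be worked through in code. This sketch also splits V_G into additive and dominance parts using the standard single-locus results V_A = 2pq[a + d(q - p)]^2 and V_D = (2pqd)^2, which are not stated in the text above; the function names are illustrative, and p is freq(A).

```python
def variance_components(p, a, d):
    """Single locus under HWE. Returns (mean, V_G, V_A, V_D)."""
    q = 1.0 - p
    freqs = (p * p, 2 * p * q, q * q)
    values = (a, d, -a)
    mean = sum(f * v for f, v in zip(freqs, values))
    # V_G as the frequency-weighted squared deviation from the mean
    v_g = sum(f * (v - mean) ** 2 for f, v in zip(freqs, values))
    alpha = a + d * (q - p)              # average effect of allele substitution
    v_a = 2 * p * q * alpha ** 2
    v_d = (2 * p * q * d) ** 2
    return mean, v_g, v_a, v_d

def heritabilities(v_a, v_d, v_e):
    """Returns (H^2, h^2), assuming no epistasis and no G-E covariance."""
    v_p = v_a + v_d + v_e
    return (v_a + v_d) / v_p, v_a / v_p

mean, v_g, v_a, v_d = variance_components(p=0.5, a=1.0, d=1.0)
print(mean, v_g, v_a, v_d)
print(heritabilities(v_a, v_d, v_e=0.25))
```

With complete dominance at p = 0.5, V_G exceeds V_A, so H^2 is larger than h^2, which is the distinction the broad-sense trap relies on.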
Path Analysis (Causal Diagrams)
- Trace ALL paths from cause to effect through the diagram (direct + indirect)
- Each path's contribution = product of path coefficients along that path
- Total effect (correlation) = sum of contributions from all paths
- Indirect effects can mask (suppression) or inflate (confounding) the direct effect
- Unanalyzed correlations (double-headed arrows) count as valid path segments
- Never ignore indirect paths — the total is rarely just the direct arrow
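The path-tracing rules above amount to enumerating paths and summing products of coefficients. A minimal sketch for directed paths only: it does not handle double-headed (unanalyzed correlation) arrows, and the diagram and coefficients are invented for illustration.

```python
def path_contributions(edges, source, target):
    """Enumerate directed paths source -> target in an acyclic diagram.

    edges: {(from_node, to_node): path_coefficient}
    Returns a list of (path_as_node_list, product_of_coefficients).
    """
    results = []

    def walk(node, product, visited):
        if node == target:
            results.append((visited, product))
            return
        for (u, v), coef in edges.items():
            if u == node and v not in visited:
                walk(v, product * coef, visited + [v])

    walk(source, 1.0, [source])
    return results

# Hypothetical diagram: direct X->Y plus an indirect path through M
edges = {("X", "Y"): 0.5, ("X", "M"): 0.6, ("M", "Y"): -0.4}
paths = path_contributions(edges, "X", "Y")
total = sum(product for _, product in paths)
print(paths)
print(total)
```

Here the negative indirect path partly cancels the direct effect, so the total is smaller than the direct arrow alone.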
Genetic Combinatorics (F2 crosses, haplotype counting)
For n SNPs between two inbred (homozygous) strains:
- F1 is heterozygous at all n loci
- F2 distinct haplotypes = 2^n (each SNP contributes parental A or B allele)
- F2 distinct diploid genotypes = 3^n (AA, AB, BB at each locus)
- F2 novel (non-parental) haplotypes = 2^n - 2 (e.g., 5 SNPs → 2^5 = 32 distinct haplotypes, of which 30 are novel once the 2 parental haplotypes are subtracted)
- ALWAYS write and run Python code (python3 -c "...") for these counts. Never enumerate by hand.
- For specimens/counting from field data: parse the data into a structure and compute programmatically.
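The counts above can be confirmed by brute-force enumeration rather than by formula. A minimal sketch; `f2_counts` is an illustrative name, and alleles are labeled A or B by parental strain.

```python
from itertools import product

def f2_counts(n_snps):
    """Return (distinct haplotypes, distinct diploid genotypes, novel haplotypes)."""
    haplotypes = set(product("AB", repeat=n_snps))
    genotypes = set(product(("AA", "AB", "BB"), repeat=n_snps))
    parental = {("A",) * n_snps, ("B",) * n_snps}
    return len(haplotypes), len(genotypes), len(haplotypes - parental)

print(f2_counts(5))  # (32, 243, 30)
```

Enumeration is only practical for small n, but it guards against off-by-two mistakes when a question asks for novel rather than total haplotypes.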
Mutation-Selection Balance
Equilibrium frequency of a deleterious allele:
- Recessive deleterious: q_hat = sqrt(mu/s); for a recessive lethal (s=1), q_hat = sqrt(mu)
- Dominant deleterious: q_hat = mu/s; for a dominant lethal (s=1), q_hat = mu
- Example: mu=1e-5, s=1 (recessive lethal) -> q_hat = sqrt(1e-5) ~ 0.0032 (carrier freq 2pq ~ 0.0063)
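The worked example can be reproduced in a few lines. A minimal sketch with illustrative function names; the carrier frequency is computed as 2pq under Hardy-Weinberg proportions.

```python
import math

def q_hat_recessive(mu, s):
    """Mutation-selection equilibrium for a recessive deleterious allele."""
    return math.sqrt(mu / s)

def q_hat_dominant(mu, s):
    """Mutation-selection equilibrium for a dominant deleterious allele."""
    return mu / s

q = q_hat_recessive(1e-5, 1.0)
carrier = 2 * q * (1 - q)
print(round(q, 4), round(carrier, 4))
```

For the same mu and s, the recessive equilibrium is orders of magnitude higher than the dominant one, because recessive alleles are shielded from selection in heterozygotes.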
F-statistics and Population Structure
- Fis: Inbreeding within subpopulations (heterozygote deficit within demes)
- Fst: Differentiation between subpopulations. Fst = Var(p) / (p_bar * q_bar)
- Fit: Total inbreeding. (1-Fit) = (1-Fis)(1-Fst)
- Fst interpretation: <0.05 little, 0.05-0.15 moderate, 0.15-0.25 great, >0.25 very great differentiation
- Preferred: use PopGen_fst tool. Fallback: popgen_calculator.py --type fst --p1 X --p2 Y --n1 N1 --n2 N2
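For textbook problems, the variance form Fst = Var(p) / (p_bar * q_bar) is often all that is needed. The sketch below implements that simple form with equally weighted subpopulations; it is not the Weir-Cockerham estimator that the tool and script report, and the function names are illustrative.

```python
from statistics import mean, pvariance

def fst_from_frequencies(freqs):
    """Fst = Var(p) / (p_bar * q_bar), equal weight per subpopulation."""
    p_bar = mean(freqs)
    return pvariance(freqs) / (p_bar * (1.0 - p_bar))

def fit_from(fis, fst):
    """Solve (1 - Fit) = (1 - Fis)(1 - Fst) for Fit."""
    return 1.0 - (1.0 - fis) * (1.0 - fst)

print(fst_from_frequencies([0.2, 0.8]))
print(fit_from(0.1, 0.2))
```

Because sample sizes are ignored here, expect small differences from the Weir-Cockerham value on the same data.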
Mendelian Genetics Reasoning Framework
For any genetics cross problem, follow these steps IN ORDER. Do not skip steps.
Step 1: Identify genes, locations, and allele relationships
- List every gene involved in the cross
- Determine chromosomal location: autosomal vs X-linked (X-linked genes show different inheritance in males vs females)
- Determine allele relationships: dominant/recessive, codominant, incomplete dominance
- Note any epistasis, suppressor, or modifier interactions between genes
Step 2: Write parental genotypes explicitly
- Use standard notation (e.g., Aa Bb for autosomal; X^w X^+ for X-linked)
- For X-linked genes, males are hemizygous (X^w Y), not homozygous
- If parental genotypes are not given, deduce them from phenotypes and pedigree context
Step 3: Draw Punnett square(s) for each gene
- For multi-gene crosses, handle each gene independently (if unlinked) then combine
- For linked genes, use recombination frequency to adjust gamete ratios
- For X-linked genes, remember that fathers pass X to all daughters and Y to all sons
Step 4: Calculate expected phenotypic ratios
- Multiply independent gene ratios (e.g., 3:1 x 3:1 = 9:3:3:1)
- For X-linked: calculate male and female ratios separately, then combine or report separately as required
Step 5: Verify ratios sum to 1.0
- Convert all ratios to fractions and confirm they sum to 1
- If they don't sum to 1, there is an error in the Punnett square or gamete calculation
Step 6: Apply phenotype modification rules AFTER computing genotypic ratios
- For epistasis: first compute the full genotypic ratios (e.g., 9:3:3:1), then collapse genotype classes that produce the same phenotype
- For suppressor genes: a suppressor homozygote (su/su) restores wild-type in an otherwise mutant background. Apply suppression AFTER determining which individuals carry the mutant allele
- Example: 9 A_B_ : 3 A_bb : 3 aaB_ : 1 aabb with recessive epistasis (aa masks B) becomes 9:3:4
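Steps 3 to 6 can be mirrored in code: enumerate gametes, build the F2, check that fractions sum to 1, and only then collapse genotype classes into phenotypes. A minimal sketch for an AaBb x AaBb cross with unlinked genes; the function names and phenotype labels are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def f2_phenotypes(classify):
    """Phenotype fractions from AaBb x AaBb, assuming independent assortment."""
    gametes = [a + b for a, b in product("Aa", "Bb")]   # AB, Ab, aB, ab
    counts = Counter()
    for g1, g2 in product(gametes, repeat=2):           # 16 equally likely zygotes
        genotype = (g1[0] + g2[0], g1[1] + g2[1])
        counts[classify(genotype)] += 1
    total = sum(counts.values())
    return {k: Fraction(v, total) for k, v in counts.items()}

def recessive_epistasis(genotype):
    """aa masks the B locus, collapsing aaB_ and aabb into one class."""
    gene_a, gene_b = genotype
    if "A" not in gene_a:
        return "aa (B masked)"
    return "A_B_" if "B" in gene_b else "A_bb"

ratios = f2_phenotypes(recessive_epistasis)
assert sum(ratios.values()) == 1    # Step 5: fractions must sum to 1
print(ratios)
```

Swapping in a different `classify` function covers other modified ratios without touching the enumeration.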
E. coli Hfr Mapping Framework
For bacterial conjugation and Hfr mapping problems:
Core Principles
- In Hfr x F- crosses, the Hfr chromosome is transferred linearly starting from the origin of transfer (oriT)
- Gene transfer order = chromosomal order from the origin
- Early markers (entering first) are closest to the origin of transfer
- Late markers (entering last) are farthest from the origin
Interrupted Mating Experiments
- Genes that appear in recombinants at earlier time points are closer to oriT
- The time of entry gives the order and approximate distance between genes
- Recombinants require integration by homologous recombination (double crossover)
Recombination Frequency Between Markers
- KEY TRAP: Highest recombination frequency occurs between markers that are FARTHEST APART on the transferred segment
- This is because more time elapses between entry of distant markers, providing more opportunity for recombination events between them
- Conversely, markers that enter close together in time show LOW recombination between them
- Do NOT confuse "highest recombination frequency" with "first markers to enter" -- these are opposite concepts
Ordering Markers from Hfr Data
- Use time-of-entry data to establish gene order relative to oriT
- Use recombination frequency data between pairs of selected markers to confirm/refine order
- Multiple Hfr strains with different origins can be used to build a circular map
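Ordering markers from interrupted-mating data is a sort on time of entry. A minimal sketch; the marker names and entry times are made-up illustrative values, not data from any experiment, and the function name is an assumption.

```python
def order_from_entry_times(entry_minutes):
    """Return (marker order from oriT, gaps in minutes between adjacent markers)."""
    ordered = sorted(entry_minutes.items(), key=lambda item: item[1])
    order = [marker for marker, _ in ordered]
    gaps = {
        f"{a}-{b}": tb - ta
        for (a, ta), (b, tb) in zip(ordered, ordered[1:])
    }
    return order, gaps

# Hypothetical entry times (minutes after mixing)
order, gaps = order_from_entry_times({"lac": 17, "azi": 8, "gal": 24, "ton": 10})
print(order)
print(gaps)
```

The gaps give approximate map distances in minutes; data from a second Hfr strain with a different origin would be sorted the same way and then aligned on shared markers.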
MCQ Elimination Strategy for Genetics
General MCQ Protocol
- ALWAYS evaluate ALL options before choosing an answer
- Never select the first option that seems correct -- there may be a better or more precise answer
- Read the question stem carefully for qualifiers: "MOST likely", "LEAST likely", "NOT true", "ALWAYS", "NEVER"
"Which is NOT true" Questions
- Evaluate EACH statement independently as True or False
- Mark each option with T or F before selecting
- The answer is the statement marked F
- Double-check: verify the "false" statement is genuinely false, not just misleadingly worded
"Which mechanism" Questions
- Test each proposed mechanism against ALL observations given in the question
- A correct mechanism must explain every observation, not just some
- Eliminate mechanisms that contradict even one observation
Specific Traps to Watch For
- Subfunctionalization vs neofunctionalization: Subfunctionalization = partitioning of EXISTING ancestral functions between duplicates (both copies needed to perform original function). Neofunctionalization = one copy acquires a genuinely NEW function not present in the ancestor
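- Copy-neutral LOH: In somatic cells this usually arises by mitotic recombination and is segmental (affects part of a chromosome). Constitutional uniparental disomy (UPD) arising from trisomy rescue usually involves a whole chromosome. Somatic copy-neutral LOH is also called acquired or segmental UPD, so check how the question uses the term before treating the two as mutually exclusive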
- Penetrance vs expressivity: Penetrance = fraction of individuals with genotype who show ANY phenotype. Expressivity = degree/severity of phenotype among those who show it. These are distinct concepts
- Complementation vs recombination: Complementation = two mutations in DIFFERENT genes restore wild-type in trans. Recombination = exchange between two mutations in the SAME or different genes. Complementation is tested in F1 (heterozygote); recombination is tested in progeny
Common Genetics Reasoning Traps
These are specific patterns that have caused reasoning failures in hard genetics questions. Review before answering genetics MCQs.
Suppressor Genetics
- A suppressor mutation, when homozygous, restores wild-type phenotype in an otherwise mutant background
- In F2 crosses involving both the original mutation and an autosomal recessive suppressor:
- Treat as a dihybrid cross — the primary mutation and the suppressor segregate independently
- Only 1/4 of F2 are homozygous for the suppressor
- The suppressor only acts in individuals that are also homozygous for the primary mutation
- Use a Punnett square to enumerate all genotypic classes, then apply the suppression rule to determine phenotypes
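The dihybrid treatment above can be enumerated directly. A minimal sketch, assuming the F1 is a double heterozygote, the primary mutation (m) and the suppressor (su) are both recessive and unlinked, and suppression is complete; the function name is illustrative.

```python
from fractions import Fraction
from itertools import product

def f2_mutant_fraction():
    """Fraction of F2 showing the mutant phenotype from m/+ ; su/+ x m/+ ; su/+."""
    alleles_m = ("+", "m")       # primary locus; m is the recessive mutant allele
    alleles_su = ("+", "su")     # suppressor locus; su is the recessive suppressor
    mutant = total = 0
    for m1, m2, s1, s2 in product(alleles_m, alleles_m, alleles_su, alleles_su):
        total += 1
        is_mm = m1 == "m" and m2 == "m"
        is_susu = s1 == "su" and s2 == "su"
        # Mutant phenotype only if m/m AND not rescued by su/su
        if is_mm and not is_susu:
            mutant += 1
    return Fraction(mutant, total)

print(f2_mutant_fraction())  # 3/16
```

Under these assumptions the F2 is 13 wild-type : 3 mutant, compared with 3 : 1 when no suppressor is segregating.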
Non-disjunction (Bridges' Experiments)
- Bridges used non-disjunction to prove the chromosome theory of inheritance
- X0 males arise from female meiosis non-disjunction events
- Meiosis I non-disjunction: both X chromosomes go to one pole -> XX egg + O egg (nullo-X)
- Meiosis II non-disjunction: sister chromatids fail to separate -> XX egg from one secondary oocyte
- The classic Bridges result (X^w X^w females x X^+ Y males): exceptional white-eyed females are X^w X^w Y (XX egg + Y sperm), and exceptional red-eyed males are X^+ 0 (nullo-X egg + X^+ sperm); these X0 males are sterile. XXX and Y0 zygotes die
- Key distinction: know which type of non-disjunction (MI vs MII) produces which specific gamete types
GWAS LD Blocks
- SNPs WITHIN the same LD block are correlated and can inflate false positive associations (one causal SNP drags along non-causal tag SNPs)
- SNPs ACROSS different LD blocks are largely independent and do NOT create misleading cross-locus associations
- LD block structure varies by population (shorter in African populations due to larger historical Ne)
- Fine-mapping within an LD block is needed to distinguish the causal variant from hitchhiking tag SNPs
Gene Retention After Whole-Genome Duplication
- Neofunctionalization: One copy acquires a NEW function -> most commonly cited reason for gene RETENTION after duplication (preserves both copies because each is now essential)
- Subfunctionalization: Ancestral functions are PARTITIONED between copies -> explains DIVERGENCE of duplicate copies, but both copies must be retained to maintain the full ancestral function
- Dosage balance: Some genes are retained in duplicate to maintain stoichiometric balance in protein complexes
- Trap: Questions may ask "what explains retention" vs "what explains divergence" -- these have different best answers
- For retention: neofunctionalization (new function makes both copies essential)
- For divergence of expression/function: subfunctionalization (partitioning of ancestral roles)
Advanced Genetics Traps v2
PGS vs Heritability: "Necessarily True" Logic
For "necessarily true" questions about PGS and heritability: a statement is necessarily true only if it holds when V_D=0 AND when V_D=V_G. Test the extremes.
Path Diagram Sign Assignment Protocol
Do NOT guess path signs from general knowledge. Signs may differ from well-known systems. Follow this protocol:
- Establish reference direction: What varies? What is increasing?
- For each path X→Y: Ask ONLY "when X increases, does Y increase (+) or decrease (-)?"
- Use the question's experimental context (knockout/control comparisons, provided data) to determine signs — not intuition
- Expect negative paths: Path diagrams test your ability to identify negative relationships. All-positive is almost always wrong. Direct residual paths (e) often have opposite sign from expectation.
Chi-Square: "Most Likely to Reject" Protocol
Compute chi-square from the expected ratio given in the question. Compare to chi-square-critical at df = (number of phenotype classes - 1). Pick the answer choice with the highest chi-square, but also check which pattern is biologically diagnostic of the alternative hypothesis.
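The goodness-of-fit computation described above can be scripted so that each answer choice is scored the same way. A minimal sketch; the function name and observed counts are illustrative, and the critical values are the standard alpha = 0.05 cutoffs for df 1 to 3.

```python
def chi_square_gof(observed, ratio):
    """Chi-square of observed counts against an expected ratio (e.g. 9:3:3:1).

    Returns (chi2, df) with df = number of phenotype classes - 1.
    """
    n = sum(observed)
    ratio_total = sum(ratio)
    expected = [n * r / ratio_total for r in ratio]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, len(observed) - 1

CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815}   # alpha = 0.05

# Hypothetical counts tested against 9:3:3:1
chi2, df = chi_square_gof([100, 20, 30, 10], [9, 3, 3, 1])
print(chi2, df, chi2 > CRITICAL_05[df])
```

Running every answer choice through the same function makes it straightforward to rank them by chi-square before considering which deviation pattern is biologically diagnostic.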
LD and Misleading GWAS Associations
LD block boundaries at recombination hotspots are a source of GWAS false localization — strong signal in the block does not guarantee the causal variant is in the block.
Low-Frequency Allele Detection
Duplex sequencing (unique molecular identifiers + double-strand consensus) detects alleles at 0.01% frequency — far below standard NGS even at 80X depth. Simply increasing read depth does NOT help for ultra-rare variants because the Illumina error rate (~0.1%) masks variants rarer than ~1% regardless of depth. Error correction methods (UMIs, duplex consensus) are needed to distinguish true rare variants from sequencing errors.
Bundled Computation Script
Script: skills/tooluniverse-population-genetics/scripts/popgen_calculator.py
Preferred: Use ToolUniverse tools (via MCP/SDK) instead of the script when possible:
PopGen_hwe_test tool -- HWE chi-square test. Fallback: popgen_calculator.py --type hwe
PopGen_fst tool -- Weir-Cockerham Fst. Fallback: popgen_calculator.py --type fst
PopGen_inbreeding tool -- Inbreeding coefficient from pedigree. Fallback: popgen_calculator.py --type inbreeding
PopGen_haplotype_count tool -- Expected haplotype diversity. Fallback: popgen_calculator.py --type haplotypes
Fallback script modes (all require --type):
hwe: --AA N --Aa N --aa N -- chi-square HWE test with p-value
fst: --p1 F --p2 F --n1 N --n2 N -- Weir-Cockerham Fst
inbreeding: --pedigree TYPE --generations G -- F from pedigree (self, full-sib, half-sib, first-cousin, etc.)
haplotypes: --snps N --generations G --recomb_rate R -- expected haplotype diversity
Key Concepts
- MAF: Minor allele frequency. Common: >5%. Rare: <1%. Ultra-rare: <0.01%.
- pLI: P(LoF intolerant). >0.9 = haploinsufficient gene.
- LOEUF: Loss-of-function observed/expected upper bound fraction (upper bound of the 90% CI on LoF o/e). <0.35 = highly constrained.
- CADD PHRED: >=10 top 10%, >=20 top 1%, >=30 top 0.1% most deleterious.
- Genome-wide significance: GWAS p < 5e-8 (Bonferroni for ~1M independent tests).
- Effect size: OR > 1 = risk allele, < 1 = protective. Beta > 0 = increases trait.
Evidence Grading
- T1: ClinVar pathogenic/likely pathogenic, FDA pharmacogenomics
- T2: gnomAD frequencies, GTEx eQTLs, GWAS genome-wide significant
- T3: CADD/SIFT/PolyPhen predictions, RegulomeDB, constraint metrics
- T4: VEP consequence terms, dbSNP annotations, literature mentions