From sciagent-skills
Guides quality filtering raw VCF files with bcftools on QUAL scores before Ts/Tv ratios and variant stats. Detects raw VCFs via FILTER/QUAL and explains when not to filter.
npx claudepluginhub jaechang-hits/sciagent-skills --plugin sciagent-skillsThis skill uses the workspace's default tool permissions.
Raw VCF files produced by variant callers (GATK HaplotypeCaller, bcftools mpileup, DeepVariant, etc.) contain a mixture of true variants and artifacts from sequencing errors, alignment issues, and low-coverage regions. Computing summary statistics -- Ts/Tv ratio, variant counts, allele frequency distributions -- on unfiltered data yields unreliable results because false-positive calls dispropor...
Manipulates VCF/BCF files: filters by quality, merges multi-samples, annotates rsIDs, normalizes indels, extracts genotypes, computes stats. For post-variant-calling genomics pipelines.
Processes VCF files: parses large/streaming files, classifies mutations/SVs, filters VAF/depth/quality, annotates ClinVar/gnomAD/CADD, interprets CNV pathogenicity. For bioinformatics variant analysis.
Annotate VCF variants with Ensembl VEP REST, ClinVar significance, gnomAD/population frequency context, and prioritized variant ranking.
Share bugs, ideas, or general feedback.
Raw VCF files produced by variant callers (GATK HaplotypeCaller, bcftools mpileup, DeepVariant, etc.) contain a mixture of true variants and artifacts from sequencing errors, alignment issues, and low-coverage regions. Computing summary statistics -- Ts/Tv ratio, variant counts, allele frequency distributions -- on unfiltered data yields unreliable results because false-positive calls disproportionately inflate transversion counts and depress the Ts/Tv ratio. This guide covers how to detect whether a VCF is raw, how to apply appropriate quality filters, when filtering is not appropriate, and how to interpret the resulting statistics correctly.
The QUAL column in a VCF file represents the Phred-scaled probability that the variant site is polymorphic. A QUAL score of 30 means a 1-in-1000 chance the call is wrong; a QUAL of 20 means 1-in-100. Variant callers assign QUAL scores based on read evidence, base qualities, and mapping qualities. Low-QUAL variants (below 20-30) are enriched for sequencing errors and alignment artifacts. Filtering on QUAL is the simplest and most widely used first-pass quality control step.
Common QUAL thresholds and their interpretation:
| QUAL Score | Error Probability | Typical Use |
|---|---|---|
| 10 | 1 in 10 | Very permissive; rarely appropriate for final calls |
| 20 | 1 in 100 | Lenient filtering; useful for somatic calling with supporting evidence |
| 30 | 1 in 1,000 | Standard default for germline variant filtering |
| 50 | 1 in 100,000 | Stringent; used in clinical or high-confidence applications |
| 100+ | Extremely low | Highly supported variants; may over-filter low-coverage regions |
For GATK-based workflows, VQSR or hard filtering on INFO annotations (QD, FS, MQ, etc.) may supplement or replace QUAL filtering. GATK's recommended hard filters for SNPs include QD < 2.0, FS > 60.0, MQ < 40.0, MQRankSum < -12.5, and ReadPosRankSum < -8.0.
The transition-to-transversion (Ts/Tv) ratio is a key quality metric for variant call sets. Transitions (A<->G, C<->T) are chemically favored over transversions (all other substitutions) due to the molecular structure of nucleotide bases. Expected Ts/Tv values serve as benchmarks:
A Ts/Tv ratio significantly below the expected range indicates contamination by false-positive transversion calls, which are the hallmark of sequencing errors. After proper quality filtering, the Ts/Tv ratio should rise to the expected range for the assay type.
A "raw" VCF is the direct output of a variant caller before any quality filtering has been applied. A "filtered" VCF has had quality thresholds applied, either by hard filtering (QUAL, DP, QD, etc.) or by model-based filtering (GATK VQSR, CNN). Distinguishing between the two is critical because computing statistics on raw data without disclosure leads to incorrect conclusions.
Indicators that a VCF is raw or unfiltered:
sample_raw_variants.vcf). (missing) for all recordsIndicators that a VCF has already been filtered:
PASS, LowQual, VQSRTrancheSNP99.90to100.00)##FILTER= and ##bcftools_viewCommand= lines)The FILTER column in VCF format has specific semantics defined by the VCF specification:
. (dot) -- filter status has not been applied; the variant is unassessedPASS -- the variant passed all filtersA common misconception is that . means the variant passed. In reality, . means no filter has been evaluated, so the variant's quality is unknown. Another misconception is that PASS in every row means the file is unfiltered -- some callers (e.g., DeepVariant) mark all emitted variants as PASS because they only output high-confidence calls.
To inspect the FILTER column programmatically:
# Count occurrences of each FILTER value
bcftools query -f '%FILTER\n' input.vcf | sort | uniq -c | sort -rn | head
# Check if any non-'.' FILTER values exist
bcftools query -f '%FILTER\n' input.vcf | grep -v '^\.$' | head
Understanding the FILTER column is the first step in any VCF quality assessment. Always inspect it before deciding whether additional filtering is needed.
Is the VCF raw or unfiltered?
├── Yes (FILTER='.', many low-QUAL variants)
│ ├── Does the user ask for RAW statistics specifically?
│ │ ├── Yes → Do NOT filter; compute on data as-is, note it is raw
│ │ └── No → Apply QUAL>=30 filter before computing statistics
│ └── Does the user specify a custom threshold?
│ ├── Yes → Use their threshold
│ └── No → Default to QUAL>=30
├── No (FILTER has PASS/other values, filtered header present)
│ ├── Does the user ask for additional filtering?
│ │ ├── Yes → Apply requested filter
│ │ └── No → Compute statistics on existing filtered set
│ └── Does the Ts/Tv ratio look suspicious despite filtering?
│ └── Yes → Investigate; may need stricter thresholds
└── Uncertain
└── Inspect FILTER column and QUAL distribution to determine status
| Scenario | Action | Rationale |
|---|---|---|
| Raw VCF, general statistics requested | Apply QUAL>=30 before computing | Low-QUAL variants are enriched for errors and bias all statistics |
| Raw VCF, user explicitly asks for raw stats | Compute as-is, report that data is unfiltered | Respect the user's intent; they may be evaluating caller performance |
| Raw VCF, user specifies threshold (e.g., QUAL>=50) | Use the user-specified threshold | User has domain-specific requirements |
| Pre-filtered VCF (FILTER=PASS present) | Compute without additional QUAL filter | Double-filtering may remove true variants unnecessarily |
| VCF with mixed FILTER values | Filter to PASS-only with bcftools view -f PASS | Non-PASS variants were flagged for a reason |
| VQSR-filtered VCF | Respect existing tranche filtering | VQSR is a more sophisticated filter than simple QUAL thresholds |
| Small panel or targeted sequencing | Consider lower QUAL threshold (e.g., 20) | Small panels have different error profiles and fewer variants |
Always inspect the VCF before computing statistics. Check the FILTER column distribution and QUAL score distribution before running any summary. A 30-second inspection prevents publishing misleading numbers. Use bcftools query -f '%FILTER\n' input.vcf | sort | uniq -c | sort -rn | head to get a quick overview.
Use established CLI tools instead of custom parsers. bcftools, vcftools, and similar domain-specific tools handle edge cases that custom parsers typically miss: multi-allelic sites, complex variants, missing genotypes, and proper transition/transversion classification. Writing a Python parser to count Ts/Tv is error-prone and unnecessary.
# Preferred: use bcftools for Ts/Tv computation
bcftools stats input.vcf | grep ^TSTV
# Preferred: use bcftools for variant counts
bcftools stats input.vcf | grep ^SN
Report what filtering was applied (or not). Every time you present VCF summary statistics, state clearly whether the data was filtered, what threshold was used, and how many variants passed. This is essential for reproducibility and for the reader to assess the validity of the results.
Use QUAL>=30 as a sensible default, not a universal rule. QUAL>=30 (1-in-1000 error rate) is a widely used default for first-pass filtering of germline variant calls. However, different applications may warrant different thresholds: somatic variant calling may use lower thresholds with additional evidence, while clinical applications may demand stricter cutoffs. When in doubt, QUAL>=30 is a defensible starting point.
Combine filtering and statistics in a single pipeline. Piping the filtered output directly into the statistics command avoids creating intermediate files and ensures you never accidentally compute statistics on the wrong file.
# Filter and compute statistics in one pipeline
bcftools view -i 'QUAL>=30' input.vcf | bcftools stats - > filtered_stats.txt
Validate filtering by checking the Ts/Tv ratio. After filtering, the Ts/Tv ratio should be in the expected range for the assay type. If it is still low, consider that the variant caller may have systematic issues, the sample may have quality problems, or a stricter filter is needed.
Preserve the original VCF. Never overwrite the raw VCF with filtered output. Keep the raw file for auditability and in case you need to re-filter with different parameters.
Computing statistics on raw VCF files without any filtering. This is the most common and most consequential mistake. Raw Ts/Tv ratios can be 1.5 or lower, suggesting much worse data quality than actually exists, and variant counts will be inflated by thousands of false positives.
. for all records, apply QUAL>=30 filtering first.Assuming FILTER='.' means the variant passed. The dot in the FILTER column means "not assessed," not "passed." Treating unassessed variants as high-quality leads to inclusion of many false positives.
. (unassessed), PASS (passed all filters), and named filters (failed). Use bcftools query -f '%FILTER\n' | sort | uniq -c to inspect.Writing custom Python parsers for Ts/Tv calculation. Custom parsers frequently mishandle multi-allelic sites, complex variants, or edge cases in the VCF format. They also tend to be slower than optimized C-based tools.
bcftools stats for Ts/Tv computation. It is well-tested, fast, and handles all VCF edge cases correctly.Double-filtering a VCF that was already filtered. Applying QUAL>=30 to a VCF that already went through VQSR or hard filtering can remove true variants that were assigned lower QUAL scores by the caller but passed model-based assessment.
##bcftools_viewCommand, ##GATKCommandLine) and the FILTER column for non-dot values before adding more filters.Using the wrong Ts/Tv expectation for the assay type. Comparing a WES Ts/Tv of 3.0 against a WGS expectation of 2.1 would incorrectly suggest the data is too good, while comparing a WGS Ts/Tv of 2.0 against WES expectations would incorrectly suggest poor quality.
Forgetting to report the filtering status in results. Presenting a Ts/Tv ratio without stating whether or how the data was filtered makes the number uninterpretable and non-reproducible.
Filtering on QUAL alone when richer annotations are available. GATK and other callers provide INFO-level annotations (QD, FS, MQ, SOR, MQRankSum, ReadPosRankSum) that are more informative than QUAL alone. Relying solely on QUAL when these are available leaves quality on the table.
Inspect the VCF metadata and FILTER column
# View header for caller and filter information
bcftools view -h input.vcf | grep -E '##(FILTER|source|GATKCommandLine|bcftools)'
# Check FILTER column values
bcftools query -f '%FILTER\n' input.vcf | sort | uniq -c | sort -rn | head
Assess QUAL score distribution
# Check QUAL distribution
bcftools query -f '%QUAL\n' input.vcf | awk '{if($1<30) low++; else high++} END {print "QUAL<30:", low+0, "QUAL>=30:", high+0}'
. and many low-QUAL variants exist, proceed to Step 3.Apply quality filtering
# Apply QUAL>=30 filter
bcftools view -i 'QUAL>=30' input.vcf -Oz -o filtered.vcf.gz
bcftools index filtered.vcf.gz
Compute summary statistics
# Ts/Tv ratio from filtered VCF
bcftools view -i 'QUAL>=30' input.vcf | bcftools stats - | grep ^TSTV | cut -f5
# Full summary numbers
bcftools view -i 'QUAL>=30' input.vcf | bcftools stats - | grep ^SN
# Per-sample statistics
bcftools stats -s - input.vcf | grep ^PSC
Compare before and after filtering
# Count variants before filtering
bcftools view -H input.vcf | wc -l
# Count variants after filtering
bcftools view -i 'QUAL>=30' input.vcf | bcftools view -H | wc -l
# Compare Ts/Tv before and after
echo "=== Before filtering ==="
bcftools stats input.vcf | grep ^TSTV
echo "=== After QUAL>=30 filtering ==="
bcftools view -i 'QUAL>=30' input.vcf | bcftools stats - | grep ^TSTV
Validate and report
Pre-analysis inspection is non-negotiable. Before any VCF analysis, run bcftools query -f '%FILTER\n' and examine the QUAL distribution. This takes seconds and prevents hours of wasted analysis on unreliable data.
Default to QUAL>=30 for unlabeled VCFs. When the filtering status is ambiguous and the user has not specified a threshold, QUAL>=30 is the standard community default. Document this choice explicitly in the output.
Never silently filter. If you apply filtering, always report it. If you choose not to filter, state why. Transparency in filtering decisions is fundamental to reproducible genomics.
Use piped commands for efficiency. Chain bcftools view and bcftools stats in a pipe rather than writing intermediate files. This is faster, uses less disk, and reduces the chance of analyzing the wrong file.
Verify with Ts/Tv as a sanity check. After any filtering operation, compute Ts/Tv as a quality control metric. If the ratio does not improve or remains outside the expected range, reassess the filtering strategy or investigate upstream data quality issues.
bcftools-variant-manipulation -- Detailed bcftools usage for subsetting, merging, annotating, and manipulating VCF files beyond simple filteringgatk-variant-calling -- Upstream variant calling with GATK HaplotypeCaller, including VQSR and hard filtering configurationsamtools-bam-processing -- BAM file quality control and processing steps that precede variant calling and affect downstream VCF quality