From sciagent-skills
Guides BUSCO output interpretation: why Duplicated BUSCOs count as complete, parsing files, computing/comparing completeness across proteomes/genomes, common mistakes. For QC, assembly comparison, reporting.
npx claudepluginhub jaechang-hits/sciagent-skills --plugin sciagent-skills

This skill uses the workspace's default tool permissions.
BUSCO (Benchmarking Universal Single-Copy Orthologs) is the standard tool for assessing genome, transcriptome, and proteome completeness by searching for conserved single-copy orthologs from the OrthoDB database. Correct interpretation of BUSCO output is essential for genome quality assessment, comparative genomics, and publication-ready reporting. The most common analytical error is excluding Duplicated BUSCOs from completeness counts, which artificially penalizes polyploid organisms and assemblies with legitimate gene duplications.
This guide covers BUSCO status categories, output file formats, parsing strategies, cross-proteome comparisons, lineage dataset selection, and common pitfalls in BUSCO interpretation.
BUSCO assigns each searched ortholog one of four statuses:
| Status | Abbreviation | Meaning | Count as Complete? |
|---|---|---|---|
| Complete (single-copy) | S | Found exactly once in the genome/proteome | YES |
| Duplicated | D | Found more than once (multiple copies) | YES |
| Fragmented | F | Partial match, likely incomplete gene model | NO |
| Missing | M | Not detected at all | NO |
The headline completeness percentage (C%) reported by BUSCO is always S + D combined. Individual category counts (S, D, F, M) are reported for transparency and should be included in publications.
A Duplicated BUSCO means the ortholog is present and fully intact in the genome or proteome -- it simply exists in more than one copy. This can occur through whole-genome duplication (polyploidy), recent gene duplication, or isoform-level annotation in a proteome.
The gene is not incomplete or absent. Excluding Duplicated BUSCOs from completeness counts would incorrectly penalize polyploid organisms, recently duplicated genomes, and proteomes that include isoform-level annotations. The correct completeness formula is always:
Completeness (%) = (Complete_single_copy + Duplicated) / Total_BUSCOs * 100
A high Duplicated fraction is not inherently problematic -- it is biologically informative. For example, the zebrafish genome (a teleost with an ancient whole-genome duplication) routinely shows 15-25% Duplicated BUSCOs, and this is expected.
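As a worked instance of the completeness formula above, the following minimal sketch computes C% from category counts. The function name and the counts are hypothetical, chosen only to illustrate the arithmetic.

```python
def completeness_pct(single_copy: int, duplicated: int, total: int) -> float:
    """Return completeness (%) with Duplicated BUSCOs counted as complete."""
    if total <= 0:
        raise ValueError("total must be a positive number of BUSCO groups")
    return 100.0 * (single_copy + duplicated) / total


# Hypothetical counts for a 255-group dataset.
single_copy, duplicated, total = 230, 12, 255

print(round(completeness_pct(single_copy, duplicated, total), 1))  # 94.9
# Counting single-copy only understates completeness:
print(round(100.0 * single_copy / total, 1))  # 90.2
```

The gap between the two printed values is exactly the Duplicated fraction that a single-copy-only count would drop.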
BUSCO produces two primary output formats relevant to downstream analysis:
Short summary format -- a single-line notation found in short_summary.*.txt:
C:95.0%[S:90.0%,D:5.0%],F:3.0%,M:2.0%,n:255
Where C = Complete (S + D), S = Single-copy, D = Duplicated, F = Fragmented, M = Missing, and n = total BUSCO groups searched.
Full table format -- a TSV file (full_table.tsv) with per-ortholog results containing columns for BUSCO ID, Status, Sequence, Score, and Length. This file enables detailed per-gene analysis, filtering, and cross-species comparisons.
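To make the short summary notation concrete, here is a small sketch that builds the same one-line string from category counts. The function name is hypothetical, and the one-decimal rounding is an assumption; BUSCO's own rounding may differ in edge cases.

```python
def format_busco_summary(single: int, duplicated: int,
                         fragmented: int, missing: int) -> str:
    """Build a BUSCO-style summary string from per-category counts."""
    total = single + duplicated + fragmented + missing
    if total == 0:
        raise ValueError("at least one BUSCO group is required")

    def pct(count: int) -> str:
        # One decimal place, matching the notation shown above
        return f"{100.0 * count / total:.1f}%"

    return (
        f"C:{pct(single + duplicated)}"
        f"[S:{pct(single)},D:{pct(duplicated)}],"
        f"F:{pct(fragmented)},M:{pct(missing)},n:{total}"
    )


# Hypothetical counts for illustration only
print(format_busco_summary(180, 10, 6, 4))
# C:95.0%[S:90.0%,D:5.0%],F:3.0%,M:2.0%,n:200
```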
When deciding whether and how to use BUSCO for quality assessment:
```
Question: What are you assessing?
├── Genome assembly completeness
│   ├── Draft assembly → Run BUSCO in genome mode
│   └── Polished/final assembly → Run BUSCO in genome mode, report in publication
├── Transcriptome completeness
│   └── De novo assembly → Run BUSCO in transcriptome mode (expect higher D%)
├── Proteome / annotation completeness
│   └── Predicted proteins → Run BUSCO in protein mode
└── Comparing multiple assemblies
    └── Same lineage dataset across all → Use compare_proteome_completeness pattern
```
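The first three branches of the decision tree above reduce to a lookup from input type to BUSCO mode. The sketch below assumes hypothetical input labels; the mode strings are the values BUSCO accepts for its `-m` option.

```python
# Hypothetical input labels (keys) mapped to BUSCO -m/--mode values.
MODE_BY_INPUT = {
    "genome_assembly": "genome",
    "transcriptome_assembly": "transcriptome",
    "predicted_proteins": "proteins",
}


def choose_busco_mode(input_type: str) -> str:
    """Return the BUSCO mode for a given kind of input data."""
    try:
        return MODE_BY_INPUT[input_type]
    except KeyError:
        valid = ", ".join(sorted(MODE_BY_INPUT))
        raise ValueError(
            f"unknown input type {input_type!r}; expected one of: {valid}"
        ) from None


print(choose_busco_mode("predicted_proteins"))  # proteins
```

Raising on an unknown label, rather than defaulting to a mode, avoids silently running BUSCO with the wrong search strategy.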
| Organism type | Recommended lineage | Example dataset | Notes |
|---|---|---|---|
| Broad eukaryotic screen | eukaryota | eukaryota_odb10 | Low resolution, useful for initial checks |
| Vertebrate | vertebrata or class-level | mammalia_odb10, actinopterygii_odb10 | Class-level gives better resolution |
| Insect | insecta or order-level | diptera_odb10, hymenoptera_odb10 | Order-level preferred when available |
| Plant | viridiplantae or more specific | embryophyta_odb10, eudicots_odb10 | Plants often show high D% due to polyploidy |
| Fungus | fungi or division-level | ascomycota_odb10, basidiomycota_odb10 | Match to known phylogenetic placement |
| Bacterium | bacteria or phylum-level | proteobacteria_odb10 | Use --auto-lineage-prok for unknown bacteria |
General rule: Use the most specific lineage dataset that encompasses your organism. More specific datasets contain more BUSCOs and provide higher resolution, but using a dataset that does not include your organism will produce misleadingly low scores.
Always report all four categories (S, D, F, M): Do not report only the headline C% value. Reviewers and readers need the breakdown to assess whether high completeness comes from single-copy genes (expected for most organisms without a recent whole-genome duplication) or duplicated genes (expected for polyploids). This is now a standard expectation in genome papers.
Use the same lineage dataset for all comparisons: When comparing assemblies or proteomes, every run must use the identical lineage dataset and BUSCO version. Mixing lineage datasets (e.g., comparing one assembly run with eukaryota_odb10 against another with metazoa_odb10) produces incomparable results.
Choose the most specific lineage available: More specific lineage datasets provide more BUSCO markers and finer resolution. A vertebrate genome assessed with eukaryota_odb10 (255 markers) gives a much coarser picture than one assessed with mammalia_odb10 (9,226 markers).
Interpret Duplicated percentage in biological context: High D% in plants or teleost fish (especially salmonids) is expected due to known whole-genome duplication events. High D% in a bacterium, however, may indicate contamination or a mixed-strain assembly, and in a diploid eukaryote with no known duplication history it may indicate uncollapsed haplotypes.
Run BUSCO on the correct input type: Use genome mode for assemblies (FASTA of contigs/scaffolds), transcriptome mode for de novo transcriptome assemblies, and protein mode for predicted proteomes. Using the wrong mode produces misleading results because BUSCO applies different search strategies for each.
Include BUSCO version and dataset in methods sections: Reproducibility requires reporting the exact BUSCO version, OrthoDB dataset version, and any non-default parameters used. Example: "Completeness was assessed with BUSCO v5.4.7 using the mammalia_odb10 dataset."
Validate with BUSCO's built-in plotting: Use generate_plot.py to create the standard BUSCO stacked bar chart for visual comparison across assemblies. This standardized visualization is widely recognized by reviewers.
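The requirement above that compared runs share one lineage dataset and one BUSCO version can be checked mechanically before building a comparison table. This is a minimal sketch: the record keys (`name`, `lineage`, `busco_version`) and the run metadata are hypothetical, not fields BUSCO produces in this form.

```python
def comparability_problems(runs):
    """Return reasons the given BUSCO runs should not be compared directly.

    runs: list of dicts with hypothetical keys
          'name', 'lineage', and 'busco_version'.
    """
    problems = []
    lineages = sorted({run["lineage"] for run in runs})
    versions = sorted({run["busco_version"] for run in runs})
    if len(lineages) > 1:
        problems.append("mixed lineage datasets: " + ", ".join(lineages))
    if len(versions) > 1:
        problems.append("mixed BUSCO versions: " + ", ".join(versions))
    return problems


# Hypothetical run metadata for illustration
runs = [
    {"name": "assembly_a", "lineage": "eukaryota_odb10", "busco_version": "5.4.7"},
    {"name": "assembly_b", "lineage": "metazoa_odb10", "busco_version": "5.4.7"},
]
for problem in comparability_problems(runs):
    print(problem)
# mixed lineage datasets: eukaryota_odb10, metazoa_odb10
```

An empty list means the runs pass both checks.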
Counting only single-copy BUSCOs as "complete": This is the most frequent error. Filtering for Status == 'Complete' alone misses all Duplicated entries, which are fully intact orthologs.
Use df['Status'].isin(['Complete', 'Duplicated']) instead. Verify your total matches the C% in the short summary.
Comparing results across different lineage datasets: BUSCO scores from eukaryota_odb10 (255 groups) and insecta_odb10 (1,367 groups) are not comparable because they search for different sets of orthologs with different expected counts.
Interpreting high Duplicated percentage as an assembly error: For polyploid organisms (many plants, some fish, some amphibians), high D% is biologically correct. Flagging it as an error can lead to unnecessary reassembly or incorrect filtering.
Using a lineage dataset that does not encompass the organism: Running a fungal genome through insecta_odb10 will produce near-zero completeness, not because the assembly is poor but because the wrong orthologs are being searched.
Use --auto-lineage for unknown organisms, or verify phylogenetic placement before selecting a dataset. Check the OrthoDB taxonomy browser.
Ignoring Fragmented BUSCOs during troubleshooting: A high Fragmented percentage often indicates real problems -- truncated gene models, poor assembly in genic regions, or incomplete polishing -- that are actionable.
Not accounting for BUSCO version differences: BUSCO v3, v4, and v5 use different algorithms, datasets, and scoring thresholds. Results are not directly comparable across major versions.
Reporting completeness without the total BUSCO count (n): Saying "95% complete" is ambiguous without knowing whether that is 95% of 255 BUSCOs (eukaryota) or 95% of 9,226 BUSCOs (mammalia).
Report the full summary string instead, e.g., C:95.0%[S:90.0%,D:5.0%],F:3.0%,M:2.0%,n:255.
Select lineage dataset
When the lineage is uncertain, run busco --auto-lineage first.
Run BUSCO
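A sketch of how the Run BUSCO step might be assembled as a command line. It only builds the argument list and does not execute BUSCO. The function name, file names, and run label are hypothetical; `-i`, `-o`, `-m`, `-l`, `-c`, and `--auto-lineage` are BUSCO's standard options.

```python
import shlex


def build_busco_command(input_fasta, out_name, mode, lineage=None, cpus=4):
    """Build a BUSCO argument list; fall back to --auto-lineage if no lineage is given."""
    cmd = ["busco", "-i", input_fasta, "-o", out_name, "-m", mode, "-c", str(cpus)]
    if lineage is None:
        # No dataset chosen: let BUSCO pick the lineage itself
        cmd.append("--auto-lineage")
    else:
        cmd += ["-l", lineage]
    return cmd


# Hypothetical file and run names
cmd = build_busco_command("proteins.faa", "busco_run", "proteins", "mammalia_odb10")
print(shlex.join(cmd))
# busco -i proteins.faa -o busco_run -m proteins -c 4 -l mammalia_odb10
```

The returned list can be passed to `subprocess.run` unchanged, which avoids shell quoting problems with unusual file names.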
Parse short summary
```python
import re

def parse_busco_summary(filepath):
    """Parse BUSCO short summary file."""
    with open(filepath) as f:
        text = f.read()

    # Extract the summary line
    match = re.search(
        r'C:(\d+\.?\d*)%\[S:(\d+\.?\d*)%,D:(\d+\.?\d*)%\],'
        r'F:(\d+\.?\d*)%,M:(\d+\.?\d*)%,n:(\d+)',
        text
    )
    if match:
        return {
            'complete_pct': float(match.group(1)),  # S + D
            'single_copy_pct': float(match.group(2)),
            'duplicated_pct': float(match.group(3)),
            'fragmented_pct': float(match.group(4)),
            'missing_pct': float(match.group(5)),
            'total': int(match.group(6))
        }
    return None
```
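Once a summary has been parsed into a dictionary of this shape, a quick sanity check is that S + D reproduces C and that C + F + M sums to 100. The sketch below is a hypothetical helper; the 0.2-point tolerance is an assumption that allows for each percentage being rounded to one decimal place independently.

```python
def summary_is_consistent(summary: dict, tol: float = 0.2) -> bool:
    """Check that S + D matches C and that C + F + M is about 100%."""
    s_plus_d = summary["single_copy_pct"] + summary["duplicated_pct"]
    all_parts = (summary["complete_pct"]
                 + summary["fragmented_pct"]
                 + summary["missing_pct"])
    return (abs(s_plus_d - summary["complete_pct"]) <= tol
            and abs(all_parts - 100.0) <= tol)


# Values taken from the example summary line C:95.0%[S:90.0%,D:5.0%],F:3.0%,M:2.0%,n:255
example = {
    "complete_pct": 95.0, "single_copy_pct": 90.0, "duplicated_pct": 5.0,
    "fragmented_pct": 3.0, "missing_pct": 2.0, "total": 255,
}
print(summary_is_consistent(example))  # True
```

A `False` result usually points to a parsing problem or to a completeness value that was recomputed without the Duplicated category.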
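```python
import pandas as pd

def parse_busco_full_table(filepath):
    """Parse BUSCO full_table.tsv output."""
    # Assumes the five-column layout described above; tables with extra
    # columns (e.g., genome-mode coordinates) need a matching `names` list.
    df = pd.read_csv(filepath, sep='\t', comment='#',
                     names=['Busco_id', 'Status', 'Sequence', 'Score', 'Length'])

    # A Duplicated BUSCO occupies one row per copy, so count each ID once
    counts = df.drop_duplicates(subset='Busco_id')['Status'].value_counts()
    print(counts)

    # Complete = Complete + Duplicated
    n_complete = counts.get('Complete', 0) + counts.get('Duplicated', 0)
    print(f"\nTotal complete (S+D): {n_complete}")

    return df
```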
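One detail worth illustrating when counting from the full table: a Duplicated BUSCO is typically listed on one row per copy, so counting rows overstates both the Duplicated count and the total. The following standard-library sketch counts each BUSCO ID once. The function name, the BUSCO IDs, and the toy table are hypothetical, and only the first two columns are assumed to be present on every row.

```python
import csv
import io
from collections import Counter


def count_unique_statuses(tsv_text: str) -> Counter:
    """Count BUSCO groups per status, counting each BUSCO ID once."""
    status_by_id = {}
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        # Skip blank lines, comment/header lines, and malformed rows
        if len(row) < 2 or row[0].startswith("#"):
            continue
        busco_id, status = row[0], row[1]
        # Repeated rows for one Duplicated ID collapse to a single entry
        status_by_id[busco_id] = status
    return Counter(status_by_id.values())


# Toy table: one Duplicated BUSCO spread over two rows, one Missing row
example_table = "\n".join([
    "# Busco id\tStatus\tSequence\tScore\tLength",
    "1001at0000\tComplete\tprotA\t812.4\t410",
    "1002at0000\tDuplicated\tprotB\t640.1\t355",
    "1002at0000\tDuplicated\tprotC\t633.8\t350",
    "1003at0000\tFragmented\tprotD\t120.5\t98",
    "1004at0000\tMissing",
])

counts = count_unique_statuses(example_table)
n_total = sum(counts.values())
n_complete = counts["Complete"] + counts["Duplicated"]
print(n_complete, n_total)  # 2 4
```

Counting rows instead would give 3 complete out of 5, which would not match the C% in the short summary.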
```python
def count_complete_buscos(busco_results):
    """Count complete BUSCOs (single-copy + duplicated).

    Args:
        busco_results: DataFrame with columns including 'Status'
            (and, ideally, 'Busco_id')
            Status values: 'Complete', 'Duplicated', 'Fragmented', 'Missing'

    Returns:
        int: Count of complete orthologs
    """
    # Duplicated BUSCOs occupy one row per copy; count each BUSCO ID once
    if 'Busco_id' in busco_results.columns:
        busco_results = busco_results.drop_duplicates(subset='Busco_id')

    complete_statuses = ['Complete', 'Duplicated']
    n_complete = busco_results['Status'].isin(complete_statuses).sum()

    n_single = (busco_results['Status'] == 'Complete').sum()
    n_duplicated = (busco_results['Status'] == 'Duplicated').sum()
    n_fragmented = (busco_results['Status'] == 'Fragmented').sum()
    n_missing = (busco_results['Status'] == 'Missing').sum()

    print(f"Complete (single-copy): {n_single}")
    print(f"Duplicated: {n_duplicated}")
    print(f"Total complete: {n_complete} (single + duplicated)")
    print(f"Fragmented: {n_fragmented}")
    print(f"Missing: {n_missing}")

    return n_complete
```
```python
# WRONG: Only counting single-copy as "complete"
n_complete = (busco_results['Status'] == 'Complete').sum()  # Misses duplicated!

# CORRECT: Count both single-copy and duplicated
n_complete = busco_results['Status'].isin(['Complete', 'Duplicated']).sum()
```
```python
def compare_proteome_completeness(busco_results_dict):
    """Compare BUSCO completeness across multiple proteomes.

    Args:
        busco_results_dict: {proteome_name: busco_dataframe}
    """
    summary = []
    for name, df in busco_results_dict.items():
        # Count each BUSCO ID once; Duplicated entries span several rows
        unique = df.drop_duplicates(subset='Busco_id')
        n_complete = unique['Status'].isin(['Complete', 'Duplicated']).sum()
        n_total = len(unique)
        pct = 100 * n_complete / n_total
        summary.append({
            'Proteome': name,
            'Complete': n_complete,
            'Total': n_total,
            'Completeness_pct': round(pct, 1)
        })

    summary_df = pd.DataFrame(summary).sort_values('Completeness_pct', ascending=False)
    print(summary_df.to_string(index=False))
    return summary_df
```
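Since the guidance in this document is to report all four categories alongside C% and n, a comparison can also carry the full breakdown per proteome. The sketch below uses only the standard library and assumes its input is one status string per unique BUSCO ID. The function name, the proteome names, and the counts are hypothetical.

```python
from collections import Counter

CATEGORIES = ("Complete", "Duplicated", "Fragmented", "Missing")


def breakdown_rows(statuses_by_proteome):
    """Return per-proteome S/D/F/M counts, n, and C%, ranked by C%."""
    rows = []
    for name, statuses in statuses_by_proteome.items():
        counts = Counter(statuses)
        n = sum(counts[c] for c in CATEGORIES)
        if n == 0:
            raise ValueError(f"no BUSCO statuses for {name!r}")
        rows.append({
            "Proteome": name,
            "S": counts["Complete"],
            "D": counts["Duplicated"],
            "F": counts["Fragmented"],
            "M": counts["Missing"],
            "n": n,
            # C% counts single-copy and duplicated together
            "C_pct": round(100.0 * (counts["Complete"] + counts["Duplicated"]) / n, 1),
        })
    return sorted(rows, key=lambda row: row["C_pct"], reverse=True)


# Hypothetical toy data: proteome_B has many duplicated BUSCOs
toy = {
    "proteome_A": ["Complete"] * 7 + ["Duplicated"] + ["Fragmented"] + ["Missing"],
    "proteome_B": ["Complete"] * 5 + ["Duplicated"] * 4 + ["Missing"],
}
for row in breakdown_rows(toy):
    print(row)
```

In this toy data proteome_B ranks first at 90.0% against 80.0% for proteome_A. A single-copy-only count would reverse the order (50% against 70%), which is the error this document warns about.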
prokka-genome-annotation -- Prokaryotic genome annotation pipeline; BUSCO is commonly run on Prokka-predicted proteomes to assess annotation completeness
samtools-bam-processing -- BAM file processing; alignment quality metrics complement BUSCO completeness for assembly QC
multiqc-qc-reports -- Aggregated QC reporting; MultiQC can incorporate BUSCO results into unified quality reports across samples