From sciagent-skills
Guides sgRNA design for CRISPR experiments using validated Addgene sequences, CRISPick pre-computed datasets, or de novo rules.
npx claudepluginhub jaechang-hits/sciagent-skills --plugin sciagent-skills
This skill uses the workspace's default tool permissions.
---
Short Description: Comprehensive guide for finding or designing sgRNAs using validated sequences, CRISPick datasets, or de novo design tools.
Authors: Ohagent Team
Version: 1.0
Last Updated: November 2025
License: CC BY 4.0
Commercial Use: Allowed
This guide provides a three-tiered approach to sgRNA design, prioritizing validated sequences before moving to computational predictions. Always start with Option 1 and proceed to subsequent options only if needed.
Validated sgRNAs are guide sequences that have been experimentally tested in published work, with documented cell line, cutting efficiency, and (often) off-target characterization. Addgene curates such sequences alongside the deposited plasmids that carry them, and a downloadable CSV (addgene_grna_sequences.csv) provides a searchable index keyed by gene symbol, target species, and application (cut / activate / RNA targeting). Using a validated sgRNA is preferred because the largest source of CRISPR experimental failure is poor on-target activity that would have been detected during the original validation.
When no validated sgRNA exists, pre-computed genome-scale designs from the Broad Institute's CRISPick service are the next best option. CRISPick provides 238 datasets covering multiple genomes, Cas variants, and applications, with each candidate guide ranked by on-target efficiency (how reliably it cuts), off-target specificity (how unlikely it is to cut elsewhere), and a combined rank that balances both. The on-target models behind CRISPick are Sanson 2018 (Cas9) and DeWeirdt 2021 (Cas12a). Combined Rank is the recommended default; use On-Target or Off-Target rank only when the experiment specifically prioritizes one over the other.
PAM (Protospacer Adjacent Motif) requirements differ by Cas variant: SpCas9 needs NGG (3'), SaCas9 needs NNGRRT (3'), AsCas12a and enAsCas12a need TTTV (5'). Critically, AsCas12a and enAsCas12a are different enzymes: enAsCas12a is an engineered variant with broadened activity, and guides optimized for one will not perform identically on the other. Always match the CRISPick dataset to the exact Cas variant used in the lab. Off-target stringency thresholds (typically Off-Target Rank or a CFD-style score) trade specificity against the size of the candidate pool — tighter thresholds yield fewer but cleaner guides.
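The PAM requirements above can be checked mechanically. The following is a minimal sketch, assuming the PAM has already been extracted from the target sequence as a separate string; the names `IUPAC`, `PAMS`, and `pam_matches` are illustrative and not part of this skill or of CRISPick.

```python
import re

# IUPAC degenerate bases that appear in the PAM motifs listed above.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "N": "[ACGT]", "R": "[AG]", "V": "[ACG]"}

# PAM motif and the side of the protospacer it sits on, as stated in the text.
PAMS = {
    "SpCas9": ("NGG", "3'"),
    "SaCas9": ("NNGRRT", "3'"),
    "AsCas12a": ("TTTV", "5'"),
    "enAsCas12a": ("TTTV", "5'"),
}

def pam_matches(cas: str, pam_seq: str) -> bool:
    """Return True if pam_seq satisfies the PAM motif for the given Cas variant."""
    motif, _side = PAMS[cas]
    pattern = "".join(IUPAC[base] for base in motif)
    return re.fullmatch(pattern, pam_seq.upper()) is not None

print(pam_matches("SpCas9", "AGG"))     # True: N=A, then GG
print(pam_matches("SaCas9", "ACGAGT"))  # True: N N G R R T
print(pam_matches("AsCas12a", "TTTT"))  # False: V excludes T
```

Sharing a PAM does not make guides interchangeable: the sketch returns the same answer for AsCas12a and enAsCas12a, yet the dataset must still match the exact enzyme.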
When neither validated sequences nor pre-computed datasets cover the target (non-model organism, custom locus, etc.), apply rule-based design: 20 bp protospacer for SpCas9/SaCas9 (23–25 bp for Cas12a), GC content 40–60%, avoid TTTT (Pol III terminator) and homopolymer runs >4. Target location matters: early exons (first 50% of coding sequence) for knockout, −200 to +1 from TSS for CRISPRa, −50 to +300 from TSS for CRISPRi.
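As a rough illustration of the sequence rules above (length, GC content, TTTT, homopolymer runs), the sketch below screens a candidate protospacer. The function name `check_protospacer` and the example sequences are hypothetical; target-location rules (early exons, TSS windows) are not covered because they need gene annotation.

```python
import re

def check_protospacer(seq: str, expected_len: int = 20) -> list[str]:
    """Return a list of rule violations; an empty list means the candidate passes."""
    seq = seq.upper()
    problems = []
    if len(seq) != expected_len:
        problems.append(f"length {len(seq)} != {expected_len}")
    gc = 100 * sum(base in "GC" for base in seq) / max(len(seq), 1)
    if not 40 <= gc <= 60:
        problems.append(f"GC content {gc:.0f}% outside 40-60%")
    if "TTTT" in seq:
        problems.append("contains TTTT (Pol III terminator)")
    # Five or more identical bases in a row, i.e. a homopolymer run >4.
    if re.search(r"(.)\1{4,}", seq):
        problems.append("homopolymer run longer than 4")
    return problems

# Made-up sequences, used only to exercise the checks.
print(check_protospacer("GACTGACTGACTGACTGACT"))  # []
print(check_protospacer("GCTTTTAGCATCGATCGATC"))  # ['contains TTTT (Pol III terminator)']
```

For Cas12a, pass the guide length in use via `expected_len`. A passing candidate still needs experimental validation.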
sgRNA design decision tree
└── Does Addgene database (Tier 1) contain a validated sgRNA for your gene + species + application?
    ├── Yes -> Use the validated sgRNA(s); cite the original PubMed reference
    └── No -> Run advanced literature search (mandatory before Tier 2)
        ├── Found in literature -> Use the literature sgRNA; cite the paper
        └── Still no match -> Is the target organism + Cas variant covered by a CRISPick dataset (Tier 2)?
            ├── Yes -> Download dataset, filter by gene, sort by Combined Rank
            │   └── Pick top 3-4 sgRNAs (ideally from different exons for redundancy)
            └── No -> De novo design (Tier 3) using 20 bp / PAM / GC / avoid-TTTT rules
                └── Validate experimentally (Sanger / T7E1 / amplicon-seq)
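The decision tree can also be written as a small function, which makes the mandatory ordering explicit. This is only a sketch: the function name `choose_tier` and its three boolean inputs are hypothetical, and each input stands for the result of a check performed elsewhere.

```python
def choose_tier(addgene_hit: bool, literature_hit: bool, crispick_covered: bool) -> str:
    """Walk the decision tree in order: Addgene, literature, CRISPick, de novo."""
    if addgene_hit:
        return "Tier 1: use the validated Addgene sgRNA and cite its PubMed reference"
    if literature_hit:
        return "Tier 1: use the literature sgRNA and cite the paper"
    if crispick_covered:
        return "Tier 2: CRISPick dataset, sort by Combined Rank, pick top 3-4"
    return "Tier 3: de novo design, then validate experimentally"

print(choose_tier(addgene_hit=False, literature_hit=False, crispick_covered=True))
```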
| Situation | Recommended tier | Rationale |
|---|---|---|
| Common human/mouse gene with prior CRISPR publications | Tier 1 (Addgene + literature) | Validated sequences come with measured cutting efficiency; lowest experimental risk |
| Genome-scale screen of a model organism | Tier 2 (CRISPick) | Pre-computed datasets cover whole genomes with consistent scoring |
| Single human gene knockout, no Addgene hit | Tier 2 (CRISPick GRCh38 SpCas9 CRISPRko), filter by Combined Rank | Best balance of efficiency and specificity for one-off knockouts |
| CRISPRa / CRISPRi experiment | Tier 2 with the matching CRISPRa / CRISPRi dataset | Activation/inhibition models target proximal-promoter windows, not coding exons |
| Cas12a (AsCas12a vs enAsCas12a) | Tier 2 with the exact Cas12a variant dataset | Guides for one variant are not interchangeable with the other |
| Non-model organism not covered by CRISPick | Tier 3 (de novo rules) | No high-throughput training data exists; rule-based design + experimental validation |
| Small custom locus (e.g., engineered cassette) | Tier 3 | Locus is not in any reference genome; design directly against the target sequence |
| Maximum specificity required (e.g., therapeutic application) | Tier 2 sorted by Off-Target Rank | Prioritize guides with the fewest predicted off-target sites |
| Maximum cutting efficiency required (e.g., difficult cell line) | Tier 2 sorted by On-Target Rank | Accept slightly higher off-target risk in exchange for activity |
If addgene_grna_sequences.csv returns no rows, run the literature search step (Method 2). Many validated sgRNAs are published in supplementary materials but never deposited at Addgene.
Use On-Target Rank or Off-Target Rank only when the experiment has a specific reason to prioritize one.
Pitfall: Skipping the literature search step (Method 2) when the Addgene CSV returns no hits.
Pitfall: Using AsCas12a guides with enAsCas12a (or vice versa) because both share the TTTV PAM.
How to avoid: Match the dataset (AsCas12a vs enAsCas12a) to the exact enzyme variant used at the bench.
Pitfall: Sorting by On-Target Rank only and ending up with high-activity, low-specificity guides that produce off-target editing.
Pitfall: Designing a single sgRNA per gene and concluding "the guide doesn't work" when one fails.
Pitfall: Targeting late exons or 3' UTRs for knockout. Truncated proteins from late-exon edits are often partially functional and confound phenotype interpretation.
Pitfall: Using a CRISPRko dataset for a CRISPRa or CRISPRi experiment. The protospacer windows are different: coding exons for knockout, proximal promoter for activation/inhibition.
How to avoid: Confirm that the dataset application (CRISPRko / CRISPRa / CRISPRi) matches the experiment.
Pitfall: Designing guides containing TTTT. This sequence terminates RNA Pol III transcription, so the sgRNA will not be expressed from a U6 promoter.
How to avoid: Reject candidates containing TTTT and homopolymer runs >4.
Pitfall: Failing to cite the original publication for a validated sgRNA.
How to avoid: Record the PubMed_ID from the Addgene CSV (or the DOI from the literature search) as part of the design record and include it in the methods section.
IMPORTANT: You MUST complete BOTH Method 1 AND Method 2 before proceeding to Option 2. Do not skip Method 2 even if Method 1 finds no results.
We maintain a curated database of 300+ validated sgRNA sequences from Addgene with experimental evidence.
Location: resource/addgene_grna_sequences.csv (relative to this skill directory)
Search the database:
import pandas as pd
# Load the database (path is relative to this skill directory)
df = pd.read_csv('resource/addgene_grna_sequences.csv')
# Search for your gene (case-insensitive match on the gene symbol)
gene_name = "TP53"
results = df[df['Target_Gene'].str.upper() == gene_name.upper()]
# Filter by species and application
results_filtered = results[
    (results['Target_Species'] == 'H. sapiens') &
    (results['Application'] == 'cut')  # or 'activate', 'RNA targeting'
]
# Display results with references
print(results_filtered[['Target_Gene', 'Target_Sequence',
                        'Plasmid_ID', 'PubMed_ID', 'Depositor']])
Database columns:
Target_Gene: Gene symbol
Target_Species: Organism (H. sapiens, M. musculus, etc.)
Target_Sequence: 20bp sgRNA sequence (5' to 3')
Application: cut (knockout), activate (CRISPRa), RNA targeting (CRISPRi)
Cas9_Species: S. pyogenes, S. aureus, etc.
Plasmid_ID: Addgene plasmid number
Plasmid_URL: Direct link to plasmid page
PubMed_ID: Publication reference (cite this in your work)
PubMed_URL: Direct link to paper
Depositor: Research lab that contributed the sequence
CRITICAL: Even if Method 1 found no results, you MUST perform this literature search before moving to Option 2. Many validated sgRNAs are published in literature but not in the Addgene database.
Use advanced_web_search_claude from ohagent.tool.literature to find validated sgRNAs from literature and databases:
from ohagent.tool.literature import advanced_web_search_claude
# Example usage
results = advanced_web_search_claude("sgRNA TP53 validated H. sapiens experimental")
Search queries to try (use multiple):
"sgRNA" OR "guide RNA" "[GENE_NAME]" validated experimental
"CRISPR knockout" "[GENE_NAME]" sgRNA sequence validated
"[GENE_NAME]" sgRNA "cutting efficiency" OR "on-target"
"[GENE_NAME]" "guide sequence" CRISPR validated
Example for TP53:
"sgRNA" "TP53" validated "H. sapiens" experimental
"CRISPR knockout" "TP53" guide sequence validated
What to search for in results:
IMPORTANT: Spend adequate time searching literature. Look through at least the first 10-15 search results and check supplementary materials of relevant papers.
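Since several query variants should be tried per gene, the templates listed above can be filled in programmatically. The sketch below only builds the query strings; `QUERY_TEMPLATES` and `build_queries` are illustrative names, and how the strings are passed to the search tool is left to the caller.

```python
# Query templates copied from the list above, with the gene name as a placeholder.
QUERY_TEMPLATES = [
    '"sgRNA" OR "guide RNA" "{gene}" validated experimental',
    '"CRISPR knockout" "{gene}" sgRNA sequence validated',
    '"{gene}" sgRNA "cutting efficiency" OR "on-target"',
    '"{gene}" "guide sequence" CRISPR validated',
]

def build_queries(gene: str) -> list[str]:
    """Fill each template with the gene symbol."""
    return [template.format(gene=gene) for template in QUERY_TEMPLATES]

for query in build_queries("TP53"):
    print(query)
```

Each resulting string can then be passed to advanced_web_search_claude in turn, reviewing the results of every query before concluding that no validated sgRNA exists.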
If you find matching sgRNAs (from either method):
Record the PubMed_ID to cite the original paper
Example result format:
Gene: TP53
sgRNA sequence: GAGGTTGTGAGGCGCTGCCC
Species: H. sapiens (human)
Application: Knockout (cut)
Reference: PubMed ID 24336569 (Ran et al., 2013)
Validation: Tested in HEK293T cells, 85% cutting efficiency
If no matches found in BOTH Method 1 AND Method 2: Only then proceed to Option 2: Download CRISPick Dataset
CRISPick (from Broad Institute GPP) provides pre-computed sgRNA designs for entire genomes with 238 available datasets covering:
All 238 download links are available in: resource/CRISPick_download_links.txt (relative to this skill directory)
Files are named: sgRNA_design_{TAXID}_{GENOME}_{CAS}_{APPLICATION}_{ALGORITHM}_{SOURCE}_{DATE}.txt.gz
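To check that a downloaded file matches the intended organism, Cas variant, and application, the name can be split into the components of the pattern above. This is a sketch that assumes no component contains an underscore; `CrispickFile` and `parse_crispick_filename` are hypothetical helper names.

```python
from typing import NamedTuple

class CrispickFile(NamedTuple):
    taxid: str
    genome: str
    cas: str
    application: str
    algorithm: str
    source: str
    date: str

def parse_crispick_filename(name: str) -> CrispickFile:
    """Split a CRISPick file name into the seven fields of the naming pattern."""
    stem = name
    for suffix in (".gz", ".txt"):  # strip .gz first, then .txt
        if stem.endswith(suffix):
            stem = stem[: -len(suffix)]
    prefix = "sgRNA_design_"
    if not stem.startswith(prefix):
        raise ValueError(f"unexpected file name: {name}")
    fields = stem[len(prefix):].split("_")
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {name}")
    return CrispickFile(*fields)

info = parse_crispick_filename(
    "sgRNA_design_9606_GRCh38_SpyoCas9_CRISPRko_RS3seq-Chen2013+RS3target_NCBI_20241104.txt.gz"
)
print(info.taxid, info.cas, info.application)  # 9606 SpyoCas9 CRISPRko
```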
Common datasets:
| Organism | Genome | Cas9 | Application | Search Pattern |
|---|---|---|---|---|
| Human | GRCh38 | SpCas9 | Knockout | 9606_GRCh38_SpyoCas9_CRISPRko |
| Human | GRCh38 | SpCas9 | Activation | 9606_GRCh38_SpyoCas9_CRISPRa |
| Human | GRCh38 | SaCas9 | Knockout | 9606_GRCh38_SaurCas9_CRISPRko |
| Mouse | GRCm38 | SpCas9 | Knockout | 10090_GRCm38_SpyoCas9_CRISPRko |
| Mouse | GRCm38 | SpCas9 | Activation | 10090_GRCm38_SpyoCas9_CRISPRa |
Key components:
TAXID: 9606 (Human), 10090 (Mouse), 9615 (Dog), 9913 (Cow)
CAS: SpyoCas9 (SpCas9, NGG PAM), SaurCas9 (SaCas9, NNGRRT PAM), AsCas12a (Wild-type Cas12a, TTTV PAM), enAsCas12a (Enhanced Cas12a, TTTV PAM)
APPLICATION: CRISPRko (knockout), CRISPRa (activation), CRISPRi (inhibition)
# Search the download links file
grep "9606_GRCh38_SpyoCas9_CRISPRko" resource/CRISPick_download_links.txt
# Or for mouse
grep "10090_GRCm38_SpyoCas9_CRISPRko" resource/CRISPick_download_links.txt
# Copy the URL from download_links.txt, then:
wget [PASTE_URL_HERE]
# Extract the file
gunzip sgRNA_design_*.txt.gz
File sizes: Knockout (300-700 MB), Activation (50-100 MB), Summary files (1-3 MB)
The .txt file is tab-delimited. Column names differ between knockout and activation/inhibition datasets.
Essential Columns (All files):
Knockout-specific columns:
Activation/Inhibition-specific columns:
import glob
import pandas as pd
# Load the dataset; pandas does not expand wildcards, so resolve the file name first
path = glob.glob('sgRNA_design_9606_GRCh38_SpyoCas9_CRISPRko_*.txt')[0]
df = pd.read_csv(path, sep='\t', low_memory=False)
# Filter for your gene
gene_name = "TP53"
gene_sgrnas = df[df['Target Gene Symbol'] == gene_name].copy()
print(f"Found {len(gene_sgrnas)} sgRNAs for {gene_name}")
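Knockout files run to several hundred megabytes, so loading one whole may be slow or memory-hungry. The sketch below reads the file in chunks and keeps only the rows for one gene, assuming the 'Target Gene Symbol' column used above; `load_gene_rows` and the chunk size are illustrative choices.

```python
import pandas as pd

def load_gene_rows(path: str, gene: str, chunksize: int = 200_000) -> pd.DataFrame:
    """Read a tab-delimited CRISPick file in chunks and keep rows for one gene."""
    matches = []
    for chunk in pd.read_csv(path, sep="\t", chunksize=chunksize, low_memory=False):
        matches.append(chunk[chunk["Target Gene Symbol"] == gene])
    if not matches:  # file yielded no chunks at all
        return pd.DataFrame()
    return pd.concat(matches, ignore_index=True)
```

Calling `load_gene_rows(path, "TP53")` returns a frame that can be used in place of `gene_sgrnas` in the steps that follow. Because pandas infers compression from the file extension, the same call should also work on the `.txt.gz` file without running gunzip first.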
Default: Use Combined Rank (balances efficiency and specificity)
# Sort by Combined Rank (lower is better)
top_sgrnas = gene_sgrnas.nsmallest(10, 'Combined Rank')
print(top_sgrnas[['sgRNA Sequence', 'Combined Rank',
'Exon Number', 'sgRNA Cut Position (1-based)']])
Option A: Prioritize On-Target Efficiency
# Sort by On-Target Rank (for maximum cutting efficiency)
efficient_sgrnas = gene_sgrnas.nsmallest(10, 'On-Target Rank')
Option B: Prioritize Off-Target Specificity
# Sort by Off-Target Rank (for maximum specificity)
specific_sgrnas = gene_sgrnas.nsmallest(10, 'Off-Target Rank')
Filter by Exon Number:
# Target specific exon (e.g., exon 5)
exon5_sgrnas = gene_sgrnas[gene_sgrnas['Exon Number'] == 5]
top_exon5 = exon5_sgrnas.nsmallest(5, 'Combined Rank')
Filter by Genomic Position:
# Target specific genomic range
position_filtered = gene_sgrnas[
(gene_sgrnas['sgRNA Cut Position (1-based)'] >= 7572000) &
(gene_sgrnas['sgRNA Cut Position (1-based)'] <= 7575000)
]
Target Early Exons for Knockout:
# Get sgRNAs from first 3 exons
early_exons = gene_sgrnas[gene_sgrnas['Exon Number'] <= 3]
top_early = early_exons.nsmallest(10, 'Combined Rank')
Filter by Target Cut Percentage:
# Target sgRNAs that affect significant portion of protein
high_impact = gene_sgrnas[gene_sgrnas['Target Cut %'] <= 50] # Cut in first 50%
top_high_impact = high_impact.nsmallest(10, 'Combined Rank')
# Get top 4 sgRNAs from different exons for redundancy
final_selection = gene_sgrnas.sort_values('Combined Rank').groupby('Exon Number').head(1).head(4)
# Save results
final_selection.to_csv(f'{gene_name}_selected_sgRNAs.csv', index=False)
print("\nSelected sgRNAs:")
print(final_selection[['sgRNA Sequence', 'Exon Number', 'Combined Rank']])
Once you have selected sgRNAs:
If dataset doesn't cover your gene or organism: Proceed to Option 3: De Novo sgRNA Design
Step 1: Check Addgene
import pandas as pd
df = pd.read_csv('resource/addgene_grna_sequences.csv')
tp53_results = df[(df['Target_Gene'] == 'TP53') &
                  (df['Target_Species'] == 'H. sapiens') &
                  (df['Application'] == 'cut')]
# Result: Found 0 entries -> Run the literature search (Method 2); if it also finds nothing, proceed to Option 2
Step 2: Download CRISPick dataset
# Download human GRCh38 SpCas9 knockout dataset
wget https://portals.broadinstitute.org/gppx/public/sgrna_design/api/downloads/\
sgRNA_design_9606_GRCh38_SpyoCas9_CRISPRko_RS3seq-Chen2013+RS3target_NCBI_20241104.txt.gz
gunzip sgRNA_design_9606_GRCh38_SpyoCas9_CRISPRko_*.txt.gz
Step 3: Extract TP53 sgRNAs
import glob
import pandas as pd
# Resolve the wildcard first; pandas does not expand it
path = glob.glob('sgRNA_design_9606_GRCh38_SpyoCas9_CRISPRko_*.txt')[0]
df = pd.read_csv(path, sep='\t', low_memory=False)
tp53 = df[df['Target Gene Symbol'] == 'TP53']
# Combined Rank is the recommended default (lower is better)
top_sgrnas = tp53.nsmallest(4, 'Combined Rank')
print(top_sgrnas[['sgRNA Sequence', 'Combined Rank', 'Exon Number']])
Step 1: Check Addgene
oct4_results = df[(df['Target_Gene'] == 'OCT4') &
(df['Application'] == 'activate')]
# Found 1 validated sgRNA!
print(oct4_results['Target_Sequence'].values[0])
# Use this sequence
Remember: Always start with validated sequences (Option 1), then move to pre-computed designs (Option 2), and only use de novo design (Option 3) when necessary. Testing 3-4 sgRNAs per gene is standard practice regardless of prediction scores.