From sciagent-skills
Accesses AlphaFold DB's 200M+ AI-predicted protein structures by UniProt ID. Downloads PDB/mmCIF files, analyzes pLDDT/PAE confidence scores, bulk-fetches proteomes via Google Cloud.
npx claudepluginhub jaechang-hits/sciagent-skills --plugin sciagent-skills
This skill uses the workspace's default tool permissions.
AlphaFold DB is a public repository of AI-predicted 3D protein structures for over 200 million proteins, maintained by DeepMind and EMBL-EBI. Access predictions via BioPython or REST API, download coordinate files in multiple formats, analyze confidence metrics, and retrieve bulk proteome datasets via Google Cloud.
# Core (BioPython for structure access)
pip install biopython requests numpy matplotlib scipy pandas
# Optional: Google Cloud for bulk access
pip install google-cloud-bigquery google-cloud-storage
from Bio.PDB import alphafold_db, MMCIFParser
import requests, numpy as np
# 1. Get prediction for a protein
uniprot_id = "P00520" # ABL1 kinase
predictions = list(alphafold_db.get_predictions(uniprot_id))
af_id = predictions[0]['entryId'] # AF-P00520-F1
# 2. Download structure
cif_file = alphafold_db.download_cif_for(predictions[0], directory="./structures")
# 3. Check confidence
conf = requests.get(f"https://alphafold.ebi.ac.uk/files/{af_id}-confidence_v4.json").json()
scores = conf['confidenceScore']
print(f"Mean pLDDT: {np.mean(scores):.1f}, High-conf residues: {sum(1 for s in scores if s > 90)}/{len(scores)}")
BioPython (recommended for single proteins):
from Bio.PDB import alphafold_db
# Get prediction metadata
predictions = list(alphafold_db.get_predictions("P00520"))
pred = predictions[0]
print(f"AlphaFold ID: {pred['entryId']}")
print(f"Gene: {pred['gene']}, Species: {pred['organismScientificName']}")
# Get Structure objects directly
structures = list(alphafold_db.get_structural_models_for("P00520"))
REST API (for metadata or integration):
import requests
uniprot_id = "P00520"
url = f"https://alphafold.ebi.ac.uk/api/prediction/{uniprot_id}"
response = requests.get(url)
data = response.json()
# Response includes download URLs for all file types
pred = data[0]
print(f"CIF: {pred['cifUrl']}")
print(f"PDB: {pred['pdbUrl']}")
print(f"PAE: {pred['paeDocUrl']}")
3D-Beacons federated API (query multiple structure providers):
url = f"https://www.ebi.ac.uk/pdbe/pdbe-kb/3dbeacons/api/uniprot/summary/{uniprot_id}.json"
data = requests.get(url).json()
# Each entry wraps its fields in a 'summary' object; the provider name lives there
af_structures = [s for s in data['structures'] if s['summary']['provider'] == 'AlphaFold DB']
import requests
af_id = "AF-P00520-F1"
version = "v4"
base = "https://alphafold.ebi.ac.uk/files"
# mmCIF (recommended: full metadata, supports large structures)
cif = requests.get(f"{base}/{af_id}-model_{version}.cif")
cif.raise_for_status()  # fail early on 404 instead of saving an error page
with open(f"{af_id}.cif", "w") as f:
    f.write(cif.text)
# PDB format (legacy: limited to 99,999 atoms)
pdb = requests.get(f"{base}/{af_id}-model_{version}.pdb")
pdb.raise_for_status()
with open(f"{af_id}.pdb", "wb") as f:
    f.write(pdb.content)
# Confidence JSON (per-residue pLDDT scores)
conf = requests.get(f"{base}/{af_id}-confidence_{version}.json").json()
# PAE matrix JSON (inter-residue confidence)
pae = requests.get(f"{base}/{af_id}-predicted_aligned_error_{version}.json").json()
pLDDT (per-residue confidence, 0–100):
import numpy as np
conf_url = f"https://alphafold.ebi.ac.uk/files/{af_id}-confidence_v4.json"
conf = requests.get(conf_url).json()
scores = conf['confidenceScore']
# Classify residues by confidence
very_high = sum(1 for s in scores if s > 90)
high = sum(1 for s in scores if 70 < s <= 90)
low = sum(1 for s in scores if 50 < s <= 70)
very_low = sum(1 for s in scores if s <= 50)
print(f"Very high (>90): {very_high}, High (70-90): {high}, Low (50-70): {low}, Very low (<50): {very_low}")
PAE (Predicted Aligned Error) visualization:
import matplotlib.pyplot as plt
pae_url = f"https://alphafold.ebi.ac.uk/files/{af_id}-predicted_aligned_error_v4.json"
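pae = requests.get(pae_url).json()
# In v4 files the JSON is a one-element list and the N×N matrix is stored under
# 'predicted_aligned_error' (the flat 'distance' list belongs to the older format)
pae_matrix = np.array(pae[0]['predicted_aligned_error'])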
plt.figure(figsize=(10, 8))
plt.imshow(pae_matrix, cmap='viridis_r', vmin=0, vmax=30)
plt.colorbar(label='PAE (Å)')
plt.title(f'Predicted Aligned Error: {af_id}')
plt.xlabel('Residue')
plt.ylabel('Residue')
plt.savefig(f'{af_id}_pae.png', dpi=300, bbox_inches='tight')
# Low PAE (<5 Å) = confident relative positioning; >15 Å = uncertain domain arrangement
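To turn the PAE thresholds noted above into a number for a specific pair of regions, one option is to average the PAE over both alignment directions between two residue ranges. The following is a minimal sketch: the function name `mean_pae_between` and the 1-indexed, inclusive range convention are assumptions made for illustration, and it accepts either a nested list or a NumPy array.

```python
def mean_pae_between(pae_matrix, range_a, range_b):
    """Mean PAE between two residue ranges, averaged over both directions.

    Ranges are (start, end) tuples, 1-indexed and inclusive (an assumption
    chosen to match residue numbering, not an AlphaFold DB convention).
    """
    (a_start, a_end), (b_start, b_end) = range_a, range_b
    total, count = 0.0, 0
    for i in range(a_start - 1, a_end):
        for j in range(b_start - 1, b_end):
            # PAE is asymmetric, so include both [i][j] and [j][i]
            total += pae_matrix[i][j] + pae_matrix[j][i]
            count += 2
    if count == 0:
        raise ValueError("Both ranges must contain at least one residue")
    return total / count


# Toy 4-residue matrix: residues 1-2 and 3-4 form two confident blocks
toy = [
    [0, 1, 20, 22],
    [1, 0, 21, 23],
    [18, 19, 0, 2],
    [20, 21, 2, 0],
]
print(mean_pae_between(toy, (1, 2), (3, 4)))  # 20.5
```

A value well above 15 Å for the two blocks, as in the toy matrix, would suggest their relative orientation should not be trusted.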
# List available data
gsutil ls gs://public-datasets-deepmind-alphafold-v4/
# Download entire proteome by taxonomy ID
gsutil -m cp gs://public-datasets-deepmind-alphafold-v4/proteomes/proteome-tax_id-9606-*_v4.tar .
# Download accession index
gsutil cp gs://public-datasets-deepmind-alphafold-v4/accession_ids.csv .
BigQuery metadata queries:
from google.cloud import bigquery
client = bigquery.Client()
query = """
    SELECT entryId, uniprotAccession, gene, organismScientificName,
           globalMetricValue, fractionPlddtVeryHigh
    FROM `bigquery-public-data.deepmind_alphafold.metadata`
    WHERE organismScientificName = 'Homo sapiens'
      AND fractionPlddtVeryHigh > 0.8
      AND isReviewed = TRUE
    LIMIT 100
"""
df = client.query(query).to_dataframe()
print(f"Found {len(df)} high-confidence human proteins")
from Bio.PDB import MMCIFParser
import numpy as np
from scipy.spatial.distance import pdist, squareform
parser = MMCIFParser(QUIET=True)
structure = parser.get_structure("protein", f"{af_id}-model_v4.cif")
# Extract alpha-carbon coordinates
coords, plddt_scores = [], []
for model in structure:
    for chain in model:
        for residue in chain:
            if 'CA' in residue:
                coords.append(residue['CA'].get_coord())
                plddt_scores.append(residue['CA'].get_bfactor())  # pLDDT stored as B-factor
coords = np.array(coords)
print(f"Residues: {len(coords)}, Mean pLDDT: {np.mean(plddt_scores):.1f}")
# Contact map (Cα-Cα < 8 Å)
dist_matrix = squareform(pdist(coords))
contacts = np.where((dist_matrix > 0) & (dist_matrix < 8))
print(f"Contacts: {len(contacts[0]) // 2}")
| Metric | Confidence | Interpretation | Suitable For |
|---|---|---|---|
| pLDDT >90 | Very high | Backbone + side-chain reliable | Detailed analysis, docking |
| pLDDT 70–90 | High | Backbone generally reliable | Fold analysis, domain ID |
| pLDDT 50–70 | Low | Use with caution | May be flexible/disordered |
| pLDDT <50 | Very low | Likely disordered | Exclude from analysis |
| PAE <5 Å | Confident | Reliable relative domain positions | Multi-domain assembly |
| PAE 5–10 Å | Moderate | Uncertain arrangement | Treat domains independently |
| PAE >15 Å | Uncertain | Domains may be mobile | Do not trust orientation |
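The pLDDT rows of the table above can be applied programmatically. The sketch below maps a score to one of the four bands and reports the fraction of residues in each; the function names and the category labels are hypothetical, and boundary values (exactly 90, 70, or 50) are assigned to the lower band, which is an assumption the table itself does not settle.

```python
from collections import Counter

CATEGORIES = ("very_high", "high", "low", "very_low")  # illustrative labels


def plddt_category(score):
    """Map one pLDDT score (0-100) to a confidence band."""
    if score > 90:
        return "very_high"
    if score > 70:
        return "high"
    if score > 50:
        return "low"
    return "very_low"


def summarize_plddt(scores):
    """Return the fraction of residues falling in each confidence band."""
    if not scores:
        raise ValueError("scores must not be empty")
    counts = Counter(plddt_category(s) for s in scores)
    return {name: counts.get(name, 0) / len(scores) for name in CATEGORIES}


print(summarize_plddt([95.2, 88.0, 61.5, 42.3]))
```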
Format: AF-{UniProt_accession}-F{fragment_number} (e.g., AF-P00520-F1). Large proteins may be split into fragments (F1, F2, ...). Current database version: v4 — include version suffix in all file URLs.
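When fragment numbers matter, it can help to split an entry ID into its parts rather than slicing strings by hand. This is a small sketch of the format described in the preceding paragraph; `parse_af_id` and `AlphaFoldId` are hypothetical names, and treating the accession as any run of uppercase letters and digits is a simplifying assumption rather than a full UniProt accession check.

```python
import re
from typing import NamedTuple

# AF-{UniProt_accession}-F{fragment_number}; the accession pattern is deliberately loose
_AF_ID = re.compile(r"^AF-([A-Z0-9]+)-F(\d+)$")


class AlphaFoldId(NamedTuple):
    accession: str
    fragment: int


def parse_af_id(af_id):
    """Split an AlphaFold entry ID into accession and fragment number."""
    match = _AF_ID.match(af_id)
    if match is None:
        raise ValueError(f"Not an AlphaFold entry ID: {af_id!r}")
    return AlphaFoldId(match.group(1), int(match.group(2)))


print(parse_af_id("AF-P00520-F1"))  # AlphaFoldId(accession='P00520', fragment=1)
```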
| File | URL Suffix | Format | Use |
|---|---|---|---|
| Model coordinates | -model_v4.cif | mmCIF | Structural analysis (recommended) |
| Model coordinates | -model_v4.pdb | PDB | Legacy tools (<99,999 atoms) |
| Model coordinates | -model_v4.bcif | Binary CIF | Compressed (~70% smaller) |
| Confidence | -confidence_v4.json | JSON | Per-residue pLDDT array |
| Aligned error | -predicted_aligned_error_v4.json | JSON | N×N PAE matrix |
| PAE image | -predicted_aligned_error_v4.png | PNG | Quick visual assessment |
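The suffixes in the table above combine with the entry ID and the base URL used elsewhere in this document. One way to keep them in a single place is a small lookup, sketched below; the dictionary keys and the `file_url` helper are hypothetical names chosen for illustration, and no request is made.

```python
BASE_URL = "https://alphafold.ebi.ac.uk/files"

# Suffix templates copied from the file table; the keys are illustrative labels
FILE_SUFFIXES = {
    "cif": "-model_{v}.cif",
    "pdb": "-model_{v}.pdb",
    "bcif": "-model_{v}.bcif",
    "confidence": "-confidence_{v}.json",
    "pae": "-predicted_aligned_error_{v}.json",
    "pae_image": "-predicted_aligned_error_{v}.png",
}


def file_url(af_id, kind, version="v4"):
    """Build the download URL for one file type of an AlphaFold entry."""
    if kind not in FILE_SUFFIXES:
        raise ValueError(f"Unknown file kind {kind!r}; expected one of {sorted(FILE_SUFFIXES)}")
    return f"{BASE_URL}/{af_id}{FILE_SUFFIXES[kind].format(v=version)}"


print(file_url("AF-P00520-F1", "confidence"))
# https://alphafold.ebi.ac.uk/files/AF-P00520-F1-confidence_v4.json
```

Passing `version` explicitly makes it easy to record which database version a download came from.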
from Bio.PDB import alphafold_db, MMCIFParser
import requests, numpy as np
uniprot_id = "P04637" # p53 tumor suppressor
# Retrieve and download
predictions = list(alphafold_db.get_predictions(uniprot_id))
cif_file = alphafold_db.download_cif_for(predictions[0], directory="./structures")
af_id = predictions[0]['entryId']
# Parse structure
parser = MMCIFParser(QUIET=True)
structure = parser.get_structure("p53", cif_file)
# Extract pLDDT from B-factors
plddt = [r['CA'].get_bfactor() for m in structure for c in m for r in c if 'CA' in r]
print(f"Length: {len(plddt)}, Mean pLDDT: {np.mean(plddt):.1f}")
# Identify high-confidence regions for docking
high_conf_regions = [(i+1, s) for i, s in enumerate(plddt) if s > 90]
print(f"High-confidence residues: {len(high_conf_regions)}/{len(plddt)}")
from Bio.PDB import alphafold_db
import requests, numpy as np, pandas as pd, time
uniprot_ids = ["P00520", "P12931", "P04637", "P38398"]
results = []
for uid in uniprot_ids:
    try:
        preds = list(alphafold_db.get_predictions(uid))
        if not preds:
            continue
        af_id = preds[0]['entryId']
        conf = requests.get(f"https://alphafold.ebi.ac.uk/files/{af_id}-confidence_v4.json").json()
        scores = conf['confidenceScore']
        results.append({
            'uniprot': uid, 'alphafold_id': af_id,
            'length': len(scores), 'mean_plddt': np.mean(scores),
            'frac_high_conf': sum(1 for s in scores if s > 90) / len(scores)
        })
        time.sleep(0.2)  # Rate limit: 100-200ms between requests
    except Exception as e:
        print(f"Error {uid}: {e}")
df = pd.DataFrame(results)
print(df.to_string(index=False))
| Parameter | Module | Default | Range | Effect |
|---|---|---|---|---|
| uniprot_id | All | n/a | UniProt accession | Primary query identifier |
| version | Download | v4 | v1–v4 | Database version (always use latest) |
| directory | BioPython | "." | Path | Download destination |
| QUIET | MMCIFParser | False | bool | Suppress parser warnings |
| vmin/vmax | PAE plot | 0/30 | Å | PAE colormap range |
| taxonomy_id | GCS bulk | n/a | NCBI tax ID | Species for proteome download |
| fractionPlddtVeryHigh | BigQuery | n/a | 0.0–1.0 | Filter by high-confidence fraction |
| Concurrent requests | API | n/a | ≤10 | Max parallel API requests |
| Request delay | API | n/a | 100–200ms | Delay between sequential requests |
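The concurrency limit in the table above can be enforced with a bounded thread pool. The sketch below caps the worker count at 10 and collects failures separately instead of aborting the batch; `fetch_all` is a hypothetical helper, and `fetch_one` stands in for whatever per-ID function is used (for example one that wraps `requests.get`).

```python
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL = 10  # upper bound taken from the parameter table


def fetch_all(fetch_one, uniprot_ids, workers=4):
    """Run fetch_one(uid) for each ID with at most MAX_PARALLEL threads.

    Returns (results, errors): two dicts keyed by UniProt ID.
    """
    workers = max(1, min(workers, MAX_PARALLEL))

    def safe(uid):
        # Capture exceptions so one bad ID does not cancel the others
        try:
            return uid, fetch_one(uid), None
        except Exception as exc:
            return uid, None, exc

    with ThreadPoolExecutor(max_workers=workers) as pool:
        outcomes = list(pool.map(safe, uniprot_ids))

    results = {uid: value for uid, value, err in outcomes if err is None}
    errors = {uid: err for uid, value, err in outcomes if err is not None}
    return results, errors


# Example with a stand-in fetcher that makes no network calls
results, errors = fetch_all(lambda uid: len(uid), ["P00520", "P04637"])
print(results)  # {'P00520': 6, 'P04637': 6}
```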
Include the version suffix _v4 in URLs and document which version was used.
import os
import subprocess
def download_proteome(taxonomy_id: int, output_dir: str = "./proteomes"):
    """Download all AlphaFold predictions for a species via GCS."""
    if not isinstance(taxonomy_id, int):
        raise ValueError("taxonomy_id must be an integer")
    os.makedirs(output_dir, exist_ok=True)  # destination directory must exist for a multi-file copy
    pattern = f"gs://public-datasets-deepmind-alphafold-v4/proteomes/proteome-tax_id-{taxonomy_id}-*_v4.tar"
    subprocess.run(["gsutil", "-m", "cp", pattern, f"{output_dir}/"], check=True)
# Human (9606), E. coli (83333), Mouse (10090)
download_proteome(9606)
def extract_high_conf_residues(plddt_scores, threshold=90):
    """Extract contiguous high-confidence regions."""
    regions, start = [], None
    for i, score in enumerate(plddt_scores):
        if score > threshold and start is None:
            start = i
        elif score <= threshold and start is not None:
            regions.append((start + 1, i, i - start))  # 1-indexed
            start = None
    if start is not None:
        regions.append((start + 1, len(plddt_scores), len(plddt_scores) - start))
    return regions
# Usage: regions = extract_high_conf_residues(plddt_scores)
# Returns: [(start_res, end_res, length), ...]
import numpy as np
def segment_domains(pae_matrix, threshold=10.0):
    """Simple domain segmentation from PAE matrix."""
    n = pae_matrix.shape[0]
    # Average PAE[i, j] and PAE[j, i] to get a symmetric matrix
    sym_pae = (pae_matrix + pae_matrix.T) / 2
    # Sequential scan, not a full clustering: start a new domain wherever
    # the PAE between consecutive residues exceeds the threshold
    domains, current_domain = [0] * n, 0
    for i in range(1, n):
        if sym_pae[i-1, i] > threshold:
            current_domain += 1
        domains[i] = current_domain
    return domains
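A per-residue label list such as the one returned by `segment_domains` is often easier to read as residue ranges. The following sketch collapses consecutive identical labels into `(label, start, end)` tuples; `domain_ranges` is a hypothetical helper, and the 1-indexed, inclusive numbering is an assumption chosen to match the region tuples used elsewhere in this document.

```python
from itertools import groupby


def domain_ranges(labels):
    """Collapse per-residue domain labels into (label, start, end) ranges.

    Positions are 1-indexed and inclusive.
    """
    ranges, position = [], 1
    for label, group in groupby(labels):
        size = len(list(group))
        ranges.append((label, position, position + size - 1))
        position += size
    return ranges


print(domain_ranges([0, 0, 0, 1, 1, 2]))  # [(0, 1, 3), (1, 4, 5), (2, 6, 6)]
```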
| Problem | Cause | Solution |
|---|---|---|
| 404 Not Found from API | No AlphaFold prediction for this UniProt ID | Check if protein is in AlphaFold DB; some organisms not covered |
| 429 Too Many Requests | Exceeded rate limit | Add time.sleep(0.2) between requests; use GCS for bulk |
| Empty predictions list | UniProt ID not in database | Verify ID at alphafold.ebi.ac.uk; try canonical isoform |
| Large protein split into fragments | Protein >2700 residues | Check all fragments (F1, F2, ...); stitch manually if needed |
| pLDDT values all low (<50) | Intrinsically disordered protein | Expected behavior; structure prediction unreliable for IDPs |
| PAE matrix asymmetric | PAE[i][j] ≠ PAE[j][i] by design | PAE measures error when aligned on residue i; symmetrize for clustering |
| ModuleNotFoundError: Bio.PDB.alphafold_db | BioPython version too old | Upgrade: pip install --upgrade "biopython>=1.80" (quote the specifier so the shell does not treat > as a redirect) |
| GCS download fails | gsutil not configured | Run gcloud auth login or use anonymous access for public data |
| BigQuery quota exceeded | Free tier limit (1 TB/month) | Optimize queries with LIMIT; use GCS for bulk file access instead |
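For the 429 row above, a fixed delay can be combined with a retry that backs off when the server still refuses. The sketch below is one possible shape, not a documented AlphaFold DB client behavior: `get_with_retry` is a hypothetical helper, `get` is any callable returning an object with a `status_code` attribute (such as `requests.get`), and the doubling delay starting at 0.2 s is an assumption based on the delay recommended in this document.

```python
import time


def get_with_retry(get, url, max_attempts=4, base_delay=0.2, sleep=time.sleep):
    """Call get(url), retrying with exponential backoff while the status is 429.

    Any other status (including 404) is returned immediately for the caller
    to handle. The last response is returned if every attempt gets a 429.
    """
    response = None
    for attempt in range(max_attempts):
        response = get(url)
        if response.status_code != 429:
            return response
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 0.2 s, 0.4 s, 0.8 s, ...
    return response
```

With the `requests` library this would be called as `get_with_retry(requests.get, url)`; the `sleep` parameter exists so the delay can be replaced in tests.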
The reference file holds detailed API data schemas and lookup tables: REST API response fields, mmCIF data categories, confidence JSON schema, PAE JSON schema, BigQuery metadata table fields, HTTP error codes, rate limiting guidelines, and version history (v1–v4). Consult it for field-level details when parsing API responses or building custom queries. The original skill included no scripts. Parts of the original api_reference.md were relocated to Core API (endpoints, common code patterns) and Key Concepts (confidence thresholds, file types table); schemas, field catalogs, and error codes remain in the reference.