Skill

alphafold-database-access

Accesses AlphaFold DB's 200M+ AI-predicted protein structures by UniProt ID. Downloads PDB/mmCIF files, analyzes pLDDT/PAE confidence scores, bulk-fetches proteomes via Google Cloud.

Python

GCP

ai-ml

database

npx claudepluginhub jaechang-hits/sciagent-skills --plugin sciagent-skills

Tool Access

This skill uses the workspace's default tool permissions.

Preview

AlphaFold DB is a public repository of AI-predicted 3D protein structures for over 200 million proteins, maintained by DeepMind and EMBL-EBI. Access predictions via BioPython or REST API, download coordinate files in multiple formats, analyze confidence metrics, and retrieve bulk proteome datasets via Google Cloud.

Supporting Assets

references/api_schemas_reference.md

SKILL.md

Stats

Stars: 135

Forks: 16

Last Commit: Apr 28, 2026

AlphaFold Database Access

Overview

When to Use

Retrieving AI-predicted protein structures by UniProt accession
Downloading PDB/mmCIF coordinate files for structural analysis or docking
Analyzing prediction confidence (pLDDT per-residue, PAE domain-level)
Bulk-downloading entire proteome predictions via Google Cloud
Comparing predicted structures with experimental PDB structures
Building structural models for proteins lacking experimental data
Identifying high-confidence binding sites for drug discovery
For experimental structures only → use PDB directly
For running AlphaFold predictions → use ColabFold or local AlphaFold

Prerequisites

# Core (BioPython for structure access; SciPy and pandas are used in the analysis examples)
pip install biopython requests numpy matplotlib scipy pandas

# Optional: Google Cloud for bulk access
pip install google-cloud-bigquery google-cloud-storage

Quick Start

from Bio.PDB import alphafold_db, MMCIFParser
import requests, numpy as np

# 1. Get prediction for a protein
uniprot_id = "P00520"  # ABL1 kinase (mouse)
predictions = list(alphafold_db.get_predictions(uniprot_id))
af_id = predictions[0]['entryId']  # AF-P00520-F1

# 2. Download structure
cif_file = alphafold_db.download_cif_for(predictions[0], directory="./structures")

# 3. Check confidence
conf = requests.get(f"https://alphafold.ebi.ac.uk/files/{af_id}-confidence_v4.json").json()
scores = conf['confidenceScore']
print(f"Mean pLDDT: {np.mean(scores):.1f}, High-conf residues: {sum(1 for s in scores if s > 90)}/{len(scores)}")

Core API

1. Prediction Retrieval

BioPython (recommended for single proteins):

from Bio.PDB import alphafold_db

# Get prediction metadata
predictions = list(alphafold_db.get_predictions("P00520"))
pred = predictions[0]
print(f"AlphaFold ID: {pred['entryId']}")
print(f"Gene: {pred['gene']}, Species: {pred['organismScientificName']}")

# Get Structure objects directly
structures = list(alphafold_db.get_structural_models_for("P00520"))

REST API (for metadata or integration):

import requests

uniprot_id = "P00520"
url = f"https://alphafold.ebi.ac.uk/api/prediction/{uniprot_id}"
response = requests.get(url)
data = response.json()

# Response includes download URLs for all file types
pred = data[0]
print(f"CIF: {pred['cifUrl']}")
print(f"PDB: {pred['pdbUrl']}")
print(f"PAE: {pred['paeDocUrl']}")

3D-Beacons federated API (query multiple structure providers):

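url = f"https://www.ebi.ac.uk/pdbe/pdbe-kb/3dbeacons/api/uniprot/summary/{uniprot_id}.json"
data = requests.get(url).json()
# The provider name is normally nested under each entry's 'summary' object;
# fall back to the entry itself in case the response is flat
af_structures = [s for s in data['structures']
                 if s.get('summary', s).get('provider') == 'AlphaFold DB']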

2. Structure File Download

import requests

af_id = "AF-P00520-F1"
version = "v4"
base = "https://alphafold.ebi.ac.uk/files"

# mmCIF (recommended: full metadata, supports large structures)
cif = requests.get(f"{base}/{af_id}-model_{version}.cif")
cif.raise_for_status()  # fail early on 404/429 instead of saving an error page
with open(f"{af_id}.cif", "wb") as f:
    f.write(cif.content)

# PDB format (legacy: limited to 99,999 atoms)
pdb = requests.get(f"{base}/{af_id}-model_{version}.pdb")
pdb.raise_for_status()
with open(f"{af_id}.pdb", "wb") as f:
    f.write(pdb.content)

# Confidence JSON (per-residue pLDDT scores)
conf = requests.get(f"{base}/{af_id}-confidence_{version}.json").json()

# PAE matrix JSON (inter-residue confidence)
pae = requests.get(f"{base}/{af_id}-predicted_aligned_error_{version}.json").json()

3. Confidence Metrics Analysis

pLDDT (per-residue confidence, 0–100):

import numpy as np

conf_url = f"https://alphafold.ebi.ac.uk/files/{af_id}-confidence_v4.json"
conf = requests.get(conf_url).json()
scores = conf['confidenceScore']

# Classify residues by confidence
very_high = sum(1 for s in scores if s > 90)
high = sum(1 for s in scores if 70 < s <= 90)
low = sum(1 for s in scores if 50 < s <= 70)
very_low = sum(1 for s in scores if s <= 50)
print(f"Very high (>90): {very_high}, High (70-90): {high}, Low (50-70): {low}, Very low (<=50): {very_low}")

PAE (Predicted Aligned Error) visualization:

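import matplotlib.pyplot as plt
import numpy as np
import requests

pae_url = f"https://alphafold.ebi.ac.uk/files/{af_id}-predicted_aligned_error_v4.json"
pae = requests.get(pae_url).json()
# v4 files hold a one-element list; the N×N matrix is under 'predicted_aligned_error'
pae_matrix = np.array(pae[0]['predicted_aligned_error'])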

plt.figure(figsize=(10, 8))
plt.imshow(pae_matrix, cmap='viridis_r', vmin=0, vmax=30)
plt.colorbar(label='PAE (Å)')
plt.title(f'Predicted Aligned Error: {af_id}')
plt.xlabel('Residue')
plt.ylabel('Residue')
plt.savefig(f'{af_id}_pae.png', dpi=300, bbox_inches='tight')
# Low PAE (<5 Å) = confident relative positioning; >15 Å = uncertain domain arrangement
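
To apply the PAE thresholds above to one specific pair of domains, the PAE block between two residue ranges can be averaged. This is a minimal sketch, not part of the skill: `mean_interdomain_pae` and the toy matrix are hypothetical, and residue ranges are assumed to be 1-based and inclusive.

```python
def mean_interdomain_pae(pae_matrix, range_a, range_b):
    """Mean PAE between two residue ranges, averaged over both alignment directions.

    pae_matrix: N x N nested list (or array) of PAE values in Å.
    range_a, range_b: (start, end) residue numbers, 1-based and inclusive.
    """
    rows = range(range_a[0] - 1, range_a[1])
    cols = range(range_b[0] - 1, range_b[1])
    total, count = 0.0, 0
    for i in rows:
        for j in cols:
            # PAE is asymmetric, so include both (i, j) and (j, i)
            total += pae_matrix[i][j] + pae_matrix[j][i]
            count += 2
    return total / count


# Toy 4-residue matrix (made-up values): residues 1-2 and 3-4 form two tight blocks
toy_pae = [
    [0.5, 1.0, 20.0, 22.0],
    [1.0, 0.5, 18.0, 20.0],
    [16.0, 18.0, 0.5, 1.0],
    [18.0, 16.0, 1.0, 0.5],
]
inter = mean_interdomain_pae(toy_pae, (1, 2), (3, 4))
print(f"Mean inter-domain PAE: {inter:.1f} Å")  # 18.5 Å, above the 15 Å threshold
```

The same function accepts the `pae_matrix` array from the plotting code, since it only uses `matrix[i][j]` indexing.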

4. Bulk Data Access (Google Cloud)

# List available data
gsutil ls gs://public-datasets-deepmind-alphafold-v4/

# Download entire proteome by taxonomy ID
gsutil -m cp gs://public-datasets-deepmind-alphafold-v4/proteomes/proteome-tax_id-9606-*_v4.tar .

# Download accession index
gsutil cp gs://public-datasets-deepmind-alphafold-v4/accession_ids.csv .

BigQuery metadata queries:

from google.cloud import bigquery

client = bigquery.Client()
query = """
SELECT entryId, uniprotAccession, gene, organismScientificName,
       globalMetricValue, fractionPlddtVeryHigh
FROM `bigquery-public-data.deepmind_alphafold.metadata`
WHERE organismScientificName = 'Homo sapiens'
  AND fractionPlddtVeryHigh > 0.8
  AND isReviewed = TRUE
LIMIT 100
"""
df = client.query(query).to_dataframe()
print(f"Found {len(df)} high-confidence human proteins")

5. Structure Parsing & Analysis

from Bio.PDB import MMCIFParser
import numpy as np
from scipy.spatial.distance import pdist, squareform

parser = MMCIFParser(QUIET=True)
structure = parser.get_structure("protein", f"{af_id}-model_v4.cif")

# Extract alpha-carbon coordinates
coords, plddt_scores = [], []
for model in structure:
    for chain in model:
        for residue in chain:
            if 'CA' in residue:
                coords.append(residue['CA'].get_coord())
                plddt_scores.append(residue['CA'].get_bfactor())  # pLDDT stored as B-factor

coords = np.array(coords)
print(f"Residues: {len(coords)}, Mean pLDDT: {np.mean(plddt_scores):.1f}")

# Contact map (Cα-Cα < 8 Å)
dist_matrix = squareform(pdist(coords))
contacts = np.where((dist_matrix > 0) & (dist_matrix < 8))
print(f"Contacts: {len(contacts[0]) // 2}")
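
The contact count above includes sequence neighbours: consecutive Cα atoms sit about 3.8 Å apart, so every residue is within 8 Å of the next one. Below is a sketch of a variant that keeps only pairs separated by a minimum number of positions along the chain. `long_range_contacts` and `min_separation` are hypothetical names, and the coordinates are a made-up hairpin rather than real data.

```python
import math


def long_range_contacts(coords, cutoff=8.0, min_separation=3):
    """Return (i, j) index pairs closer than cutoff with j - i >= min_separation."""
    pairs = []
    for i in range(len(coords)):
        for j in range(i + min_separation, len(coords)):
            if math.dist(coords[i], coords[j]) < cutoff:
                pairs.append((i, j))
    return pairs


# Toy hairpin: three residues out, three residues back, 3.8 Å spacing
toy_coords = [
    (0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
    (7.6, 3.8, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0),
]
contacts_lr = long_range_contacts(toy_coords)
print(contacts_lr)  # [(0, 4), (0, 5), (1, 4), (1, 5)]
```

The double loop is quadratic in chain length; for long proteins, masking the `dist_matrix` from the code above by index separation would be the faster route.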

Key Concepts

Confidence Interpretation

Metric and range	Confidence	Interpretation	Recommended use
pLDDT >90	Very high	Backbone + side-chain reliable	Detailed analysis, docking
pLDDT 70–90	High	Backbone generally reliable	Fold analysis, domain ID
pLDDT 50–70	Low	Use with caution	May be flexible/disordered
pLDDT <50	Very low	Likely disordered	Exclude from analysis
PAE <5 Å	Confident	Reliable relative domain positions	Multi-domain assembly
PAE 5–10 Å	Moderate	Uncertain arrangement	Treat domains independently
PAE >15 Å	Uncertain	Domains may be mobile	Do not trust orientation
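
The pLDDT bands in the table above can be wrapped in a small helper for reuse across workflows. This is a sketch with hypothetical function and label names. The table does not say where values of exactly 50, 70, or 90 belong, so assigning them to the lower band (as the classification code in Core API does) is an assumption.

```python
from collections import Counter

BANDS = ("very_high", "high", "low", "very_low")


def classify_plddt(score):
    """Map one pLDDT value (0-100) to a confidence band."""
    if score > 90:
        return "very_high"
    if score > 70:
        return "high"
    if score > 50:
        return "low"
    return "very_low"  # assumption: exactly 50 counts as very low


def summarize_plddt(scores):
    """Return the fraction of residues in each band."""
    counts = Counter(classify_plddt(s) for s in scores)
    n = max(len(scores), 1)  # avoid division by zero for empty input
    return {band: counts.get(band, 0) / n for band in BANDS}


example_scores = [95.2, 91.0, 88.4, 72.1, 65.0, 49.9, 30.3, 90.0]  # made-up values
summary = summarize_plddt(example_scores)
print(summary)
# {'very_high': 0.25, 'high': 0.375, 'low': 0.125, 'very_low': 0.25}
```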

AlphaFold ID Format

Format: AF-{UniProt_accession}-F{fragment_number} (e.g., AF-P00520-F1). Large proteins may be split into fragments (F1, F2, ...). Current database version: v4 — include version suffix in all file URLs.
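
A sketch of splitting an AlphaFold ID into its parts, which helps when iterating over fragments of a large protein. The helper names are hypothetical, and the regular expression is a deliberately loose assumption (uppercase letters and digits), not a full UniProt accession validator.

```python
import re

# Loose pattern: "AF-", an accession of uppercase letters/digits, "-F", a fragment number
AF_ID_PATTERN = re.compile(r"^AF-(?P<accession>[A-Z0-9]+)-F(?P<fragment>\d+)$")


def parse_alphafold_id(af_id):
    """Return (uniprot_accession, fragment_number) for an ID such as AF-P00520-F1."""
    match = AF_ID_PATTERN.match(af_id)
    if match is None:
        raise ValueError(f"Not an AlphaFold ID: {af_id!r}")
    return match.group("accession"), int(match.group("fragment"))


def build_alphafold_id(accession, fragment=1):
    """Inverse of parse_alphafold_id."""
    return f"AF-{accession}-F{fragment}"


print(parse_alphafold_id("AF-P00520-F1"))  # ('P00520', 1)
print(build_alphafold_id("P00520", 2))     # AF-P00520-F2
```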

File Types

File	URL Suffix	Format	Use
Model coordinates	`-model_v4.cif`	mmCIF	Structural analysis (recommended)
Model coordinates	`-model_v4.pdb`	PDB	Legacy tools (≤99,999 atoms)
Model coordinates	`-model_v4.bcif`	Binary CIF	Compressed (~70% smaller)
Confidence	`-confidence_v4.json`	JSON	Per-residue pLDDT array
Aligned error	`-predicted_aligned_error_v4.json`	JSON	N×N PAE matrix
PAE image	`-predicted_aligned_error_v4.png`	PNG	Quick visual assessment
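
The suffixes in this table can be kept in one place so the version is pinned once rather than repeated in every URL. A minimal sketch: the dictionary keys and `file_url` are hypothetical names; the base URL and suffixes are the ones listed in this document.

```python
BASE_URL = "https://alphafold.ebi.ac.uk/files"

# Keys are illustrative labels; values mirror the URL Suffix column
FILE_SUFFIXES = {
    "cif": "-model_{version}.cif",
    "pdb": "-model_{version}.pdb",
    "bcif": "-model_{version}.bcif",
    "confidence": "-confidence_{version}.json",
    "pae": "-predicted_aligned_error_{version}.json",
    "pae_image": "-predicted_aligned_error_{version}.png",
}


def file_url(af_id, kind, version="v4"):
    """Build the download URL for one file type of one AlphaFold entry."""
    if kind not in FILE_SUFFIXES:
        raise ValueError(f"Unknown file kind {kind!r}; expected one of {sorted(FILE_SUFFIXES)}")
    return f"{BASE_URL}/{af_id}{FILE_SUFFIXES[kind].format(version=version)}"


print(file_url("AF-P00520-F1", "cif"))
# https://alphafold.ebi.ac.uk/files/AF-P00520-F1-model_v4.cif
```

Where the REST API response is already at hand, its `cifUrl`, `pdbUrl`, and `paeDocUrl` fields avoid building URLs by hand.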

Common Workflows

Workflow 1: Single Protein Structure Analysis

from Bio.PDB import alphafold_db, MMCIFParser
import requests, numpy as np

uniprot_id = "P04637"  # p53 tumor suppressor

# Retrieve and download
predictions = list(alphafold_db.get_predictions(uniprot_id))
cif_file = alphafold_db.download_cif_for(predictions[0], directory="./structures")
af_id = predictions[0]['entryId']

# Parse structure
parser = MMCIFParser(QUIET=True)
structure = parser.get_structure("p53", cif_file)

# Extract pLDDT from B-factors
plddt = [r['CA'].get_bfactor() for m in structure for c in m for r in c if 'CA' in r]
print(f"Length: {len(plddt)}, Mean pLDDT: {np.mean(plddt):.1f}")

# Identify high-confidence regions for docking
high_conf_regions = [(i+1, s) for i, s in enumerate(plddt) if s > 90]
print(f"High-confidence residues: {len(high_conf_regions)}/{len(plddt)}")

Workflow 2: Batch Protein Processing

from Bio.PDB import alphafold_db
import requests, numpy as np, pandas as pd, time

uniprot_ids = ["P00520", "P12931", "P04637", "P38398"]
results = []

for uid in uniprot_ids:
    try:
        preds = list(alphafold_db.get_predictions(uid))
        if not preds:
            continue
        af_id = preds[0]['entryId']
        conf = requests.get(f"https://alphafold.ebi.ac.uk/files/{af_id}-confidence_v4.json").json()
        scores = conf['confidenceScore']
        results.append({
            'uniprot': uid, 'alphafold_id': af_id,
            'length': len(scores), 'mean_plddt': np.mean(scores),
            'frac_high_conf': sum(1 for s in scores if s > 90) / len(scores)
        })
        time.sleep(0.2)  # Rate limit: 100-200ms between requests
    except Exception as e:
        print(f"Error {uid}: {e}")

df = pd.DataFrame(results)
print(df.to_string(index=False))

Key Parameters

Parameter	Module	Default	Range	Effect
`uniprot_id`	All	—	UniProt accession	Primary query identifier
`version`	Download	`v4`	v1–v4	Database version (always use latest)
`directory`	BioPython	`None` (current working directory)	Path	Download destination
`QUIET`	MMCIFParser	`False`	bool	Suppress parser warnings
`vmin/vmax`	PAE plot	0/30	Å	PAE colormap range
`taxonomy_id`	GCS bulk	—	NCBI tax ID	Species for proteome download
`fractionPlddtVeryHigh`	BigQuery	—	0.0–1.0	Filter by high-confidence fraction
Concurrent requests	API	—	≤10	Max parallel API requests
Request delay	API	—	100–200ms	Delay between sequential requests
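
The last two rows can be combined in a bounded worker pool. The sketch below is one way to do it, not the skill's own method: `fetch_all` is a hypothetical helper, `fetch_one` is any callable you supply (for example a wrapper around the per-protein code in Workflow 2), and applying the delay per worker thread is an assumption about how to stay within the limits.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_all(ids, fetch_one, max_workers=10, delay=0.1):
    """Apply fetch_one to every ID with at most max_workers calls in flight.

    Returns a list of (id, result, error) tuples in input order.
    """
    def polite_fetch(uid):
        try:
            return uid, fetch_one(uid), None
        except Exception as exc:  # keep going; report the failure per ID
            return uid, None, exc
        finally:
            time.sleep(delay)  # pause before this worker picks up its next ID

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(polite_fetch, ids))


# Demo with a stand-in fetcher so nothing touches the network
def fake_fetch(uid):
    if uid == "BAD":
        raise LookupError("no prediction")
    return {"uniprot": uid}


results = fetch_all(["P00520", "BAD", "P04637"], fake_fetch, delay=0)
for uid, result, error in results:
    print(uid, result if error is None else f"failed: {error}")
```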

Best Practices

Use BioPython for single proteins, Google Cloud for bulk — individual API downloads are slow for >100 proteins; GCS parallel download is orders of magnitude faster
Always check pLDDT before downstream analysis — low-confidence regions (pLDDT <50) are likely disordered and should be excluded from docking, contact analysis, or binding site prediction
Anti-pattern — trusting all regions equally: AlphaFold predictions lack ligands, PTMs, cofactors, and multi-chain context. High pLDDT does not guarantee functional accuracy
Cache downloaded files locally — avoid re-downloading the same structures; AlphaFold files are static per version
Use PAE for multi-domain proteins — pLDDT tells you per-residue confidence, but PAE reveals whether domain orientations are reliable. Low inter-domain PAE (<5 Å) = trust the arrangement; high PAE (>15 Å) = treat domains independently
Pin database version in reproducible analyses — include _v4 in URLs and document which version was used
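
The caching practice above can be sketched with the standard library alone. `cached_download` and the `./af_cache` default are hypothetical; the `fetch` parameter exists only so the download step can be swapped out or tested offline. Because AlphaFold file names carry the version suffix, caching by file name keeps versions separate.

```python
from pathlib import Path
from urllib.request import urlopen


def cached_download(url, cache_dir="./af_cache", fetch=None):
    """Return the local path for url, downloading only when it is not cached yet."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / url.rsplit("/", 1)[-1]  # e.g. AF-P00520-F1-model_v4.cif
    if target.exists() and target.stat().st_size > 0:
        return target  # cache hit: skip the network entirely

    if fetch is None:
        def fetch(u):
            with urlopen(u, timeout=30) as response:
                return response.read()

    data = fetch(url)
    # Write to a temporary name first so an interrupted download is never cached
    partial = target.with_suffix(target.suffix + ".part")
    partial.write_bytes(data)
    partial.replace(target)
    return target
```

Usage: `cached_download("https://alphafold.ebi.ac.uk/files/AF-P00520-F1-model_v4.cif")` returns a `Path` that can be passed to `MMCIFParser.get_structure`; a second call with the same URL reads from disk.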

Common Recipes

Recipe 1: Proteome Download by Species

import os
import subprocess

def download_proteome(taxonomy_id: int, output_dir: str = "./proteomes"):
    """Download all AlphaFold predictions for a species via GCS."""
    if not isinstance(taxonomy_id, int):
        raise ValueError("taxonomy_id must be an integer")
    os.makedirs(output_dir, exist_ok=True)  # make sure the destination directory exists
    pattern = f"gs://public-datasets-deepmind-alphafold-v4/proteomes/proteome-tax_id-{taxonomy_id}-*_v4.tar"
    # Argument list (no shell) so the wildcard is expanded by gsutil, not the shell
    subprocess.run(["gsutil", "-m", "cp", pattern, f"{output_dir}/"], check=True)

# Human (9606), E. coli (83333), Mouse (10090)
download_proteome(9606)

Recipe 2: High-Confidence Region Extraction

def extract_high_conf_residues(plddt_scores, threshold=90):
    """Extract contiguous high-confidence regions."""
    regions, start = [], None
    for i, score in enumerate(plddt_scores):
        if score > threshold and start is None:
            start = i
        elif score <= threshold and start is not None:
            regions.append((start + 1, i, i - start))  # 1-indexed
            start = None
    if start is not None:
        regions.append((start + 1, len(plddt_scores), len(plddt_scores) - start))
    return regions

# Usage: regions = extract_high_conf_residues(plddt_scores)
# Returns: [(start_res, end_res, length), ...]

Recipe 3: PAE-Based Domain Segmentation

import numpy as np

def segment_domains(pae_matrix, threshold=10.0):
    """Greedy sequential domain segmentation from a PAE matrix (simple heuristic)."""
    pae = np.asarray(pae_matrix, dtype=float)
    n = pae.shape[0]
    # PAE is asymmetric; average both directions so each residue pair has one value
    sym_pae = (pae + pae.T) / 2
    # Walk along the chain: start a new domain when residue i has a high mean PAE
    # to the residues already assigned to the current domain
    domains, current_domain, start = [0] * n, 0, 0
    for i in range(1, n):
        if sym_pae[i, start:i].mean() > threshold:
            current_domain += 1
            start = i
        domains[i] = current_domain
    return domains
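
The per-residue labels returned by `segment_domains` are easier to read as residue ranges. A small sketch of that conversion; `domain_ranges` is a hypothetical helper and the label list is made up.

```python
from itertools import groupby


def domain_ranges(domains):
    """Convert per-residue domain labels into (label, start, end) tuples.

    Residue numbers are 1-indexed and inclusive.
    """
    ranges, position = [], 1
    for label, group in groupby(domains):
        length = len(list(group))
        ranges.append((label, position, position + length - 1))
        position += length
    return ranges


labels = [0, 0, 0, 1, 1, 2]  # e.g. output of segment_domains for a 6-residue toy case
print(domain_ranges(labels))  # [(0, 1, 3), (1, 4, 5), (2, 6, 6)]
```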

Troubleshooting

Problem	Cause	Solution
`404 Not Found` from API	No AlphaFold prediction for this UniProt ID	Check if protein is in AlphaFold DB; some organisms not covered
`429 Too Many Requests`	Exceeded rate limit	Add `time.sleep(0.2)` between requests; use GCS for bulk
Empty predictions list	UniProt ID not in database	Verify ID at alphafold.ebi.ac.uk; try canonical isoform
Large protein split into fragments	Protein >2700 residues	Check all fragments (F1, F2, ...); stitch manually if needed
pLDDT values all low (<50)	Intrinsically disordered protein	Expected behavior; structure prediction unreliable for IDPs
PAE matrix asymmetric	PAE[i][j] ≠ PAE[j][i] by design	PAE measures error when aligned on residue i; symmetrize for clustering
`ModuleNotFoundError: Bio.PDB.alphafold_db`	BioPython version too old	Upgrade: `pip install --upgrade "biopython>=1.80"` (quote the requirement so the shell does not treat `>` as a redirect)
GCS download fails	gsutil not configured	Run `gcloud auth login` or use anonymous access for public data
BigQuery quota exceeded	Free tier limit (1 TB/month)	Select only the columns you need (`LIMIT` alone does not reduce the bytes scanned); use GCS for bulk file access instead
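
The 404 and 429 rows above can be handled in one place. This is a sketch under stated assumptions: `get_with_retry` is a hypothetical helper, `fetch` is any callable returning a `(status_code, body)` pair, and exponential backoff starting at 0.2 s is one reasonable policy rather than a documented AlphaFold DB requirement.

```python
import time


def get_with_retry(fetch, url, max_attempts=4, base_delay=0.2, sleep=time.sleep):
    """Call fetch(url) and retry on HTTP 429 with exponential backoff.

    Returns the body on 200 and None on 404; raises RuntimeError otherwise.
    """
    for attempt in range(max_attempts):
        status, body = fetch(url)
        if status == 200:
            return body
        if status == 404:
            return None  # no prediction for this accession
        if status == 429 and attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 0.2 s, 0.4 s, 0.8 s, ...
            continue
        raise RuntimeError(f"GET {url} failed with HTTP {status}")
    raise RuntimeError(f"GET {url} was not attempted (max_attempts={max_attempts})")


# Demo with canned responses instead of real HTTP calls
canned = iter([(429, b""), (429, b""), (200, b"ok")])
waits = []
body = get_with_retry(lambda u: next(canned), "https://example.org/x", sleep=waits.append)
print(body, waits)  # b'ok' [0.2, 0.4]
```

With `requests`, `fetch` can be a small wrapper that calls `requests.get(url, timeout=30)` and returns `(response.status_code, response.content)`.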

Bundled Resources

references/api_schemas_reference.md

Detailed API data schemas and lookup tables: REST API response fields, mmCIF data categories, confidence JSON schema, PAE JSON schema, BigQuery metadata table fields, HTTP error codes, rate limiting guidelines, and version history (v1–v4). Consult it for field-level details when parsing API responses or building custom queries. No scripts are bundled. Content from the original api_reference.md was split: endpoints and common code patterns now live in Core API, confidence thresholds and the file types table live in Key Concepts, and schemas, field catalogs, and error codes remain in this reference.

Related Skills

autodock-vina — molecular docking using AlphaFold structures as receptor
biopython — general protein structure parsing and analysis beyond AlphaFold

References

AlphaFold DB: https://alphafold.ebi.ac.uk/
API Documentation: https://alphafold.ebi.ac.uk/api-docs
Jumper et al. (2021) Nature 596, 583–589: https://doi.org/10.1038/s41586-021-03819-2
Varadi et al. (2024) Nucleic Acids Res. 52, D368–D375: https://doi.org/10.1093/nar/gkad1011
BioPython AlphaFold module: https://biopython.org/docs/dev/api/Bio.PDB.alphafold_db.html
Google Cloud AlphaFold: https://console.cloud.google.com/marketplace/product/bigquery-public-data/deepmind-alphafold