Industrial-strength NLP with spaCy 3.x for text processing and custom classifier training. Use when "installing spaCy", "selecting model for nlp" (en_core_web_sm/md/lg/trf), "tokenization", "POS tagging", "named entity recognition" (NER), "dependency parsing", "training TextCategorizer models", "troubleshooting spaCy errors" (E050/E941 model errors, E927 version mismatch, memory issues), "batch processing with nlp.pipe", or "deploying nlp models to production". Includes data preparation scripts, config templates, and FastAPI serving examples.
```
/plugin marketplace add SpillwaveSolutions/spacy-nlp-agentic-skill
/plugin install spacy-nlp@spacy-nlp-agentic-skill
```

This skill inherits all available tools. When active, it can use any tool Claude has access to.
Bundled files:

- `assets/config_textcat.cfg`
- `assets/training_data_template.json`
- `references/basic-usage.md`
- `references/installation.md`
- `references/production.md`
- `references/text-classification.md`
- `references/troubleshooting.md`
- `scripts/evaluate_model.py`
- `scripts/generate_config.py`
- `scripts/prepare_training_data.py`
- `scripts/serve_model.py`

Production-ready NLP with spaCy 3.x. This skill covers installation through deployment.
In Scope: installing spaCy and selecting models, tokenization, POS tagging, NER, dependency parsing, training TextCategorizer models, batch processing with `nlp.pipe`, troubleshooting common spaCy errors, and deploying models to production.
Out of Scope (use other tools/skills):
```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")

# Entities
for ent in doc.ents:
    print(ent.text, ent.label_)

# Tokens with attributes
for token in doc:
    print(token.text, token.pos_, token.dep_)
```
```bash
pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download en_core_web_sm
```
| Model | Size | Speed | Use Case |
|---|---|---|---|
| `en_core_web_sm` | 12 MB | Fastest | Prototyping, speed-critical |
| `en_core_web_md` | 40 MB | Fast | General use with word vectors |
| `en_core_web_lg` | 560 MB | Fast | Semantic similarity tasks |
| `en_core_web_trf` | 438 MB | Slow | Maximum accuracy (GPU) |
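The md and lg models ship with word vectors, which is what enables similarity comparisons. A quick sketch, assuming `en_core_web_md` has been downloaded:

```python
import spacy

# Similarity needs real word vectors: use md or lg, not sm
nlp = spacy.load("en_core_web_md")

doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")

# Cosine similarity of the averaged word vectors
print(doc1.similarity(doc2))
```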
```python
import spacy

print(spacy.__version__)
nlp = spacy.load("en_core_web_sm")
doc = nlp("Test sentence.")
print(f"Tokens: {len(doc)}")
```
For detailed installation options (conda, GPU, transformers): See references/installation.md
```python
nlp = spacy.load("en_core_web_sm")
doc = nlp("The striped bats are hanging on their feet.")

# Tokenization + attributes
for token in doc:
    print(f"{token.text:10} | {token.lemma_:10} | {token.pos_:6} | {token.dep_}")

# NER needs text that actually contains entities
doc = nlp("Apple Inc. was founded by Steve Jobs.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # "Apple Inc." ORG, "Steve Jobs" PERSON
```
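Entities are `Span` objects, so filtering by label and reading character offsets is straightforward; a short sketch:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple Inc. was founded by Steve Jobs in California.")

# Entities are Span objects: filter by label, read char offsets
orgs = [ent for ent in doc.ents if ent.label_ == "ORG"]
for ent in orgs:
    print(ent.text, ent.start_char, ent.end_char)
```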
For entity types, filtering, and span details: See references/basic-usage.md
```python
# WRONG - slow
for text in texts:
    doc = nlp(text)  # Don't do this

# CORRECT - fast
for doc in nlp.pipe(texts, batch_size=50):
    process(doc)

# With multiprocessing
docs = list(nlp.pipe(texts, n_process=4))
```
```python
# Only need NER? Disable the rest for ~2x speed
nlp = spacy.load("en_core_web_sm", disable=["parser", "tagger", "lemmatizer"])
```
For Doc/Token/Span details, noun chunks, similarity: See references/basic-usage.md
Train custom text classifiers with TextCategorizer.
Workflow:

1. Prepare data: `scripts/prepare_training_data.py`
2. Create a config: `scripts/generate_config.py`, or use `assets/config_textcat.cfg`
3. Validate: `python -m spacy debug data config.cfg` (catches issues before training)
4. Train: `python -m spacy train config.cfg --output ./output`
5. Evaluate: `scripts/evaluate_model.py`
6. Load: `nlp = spacy.load("./output/model-best")`

Training data uses spaCy's DocBin format. Example input (JSON):
```json
[
  {"text": "Quarterly revenue exceeded expectations", "label": "Business"},
  {"text": "Fixed null pointer exception in parser", "label": "Programming"},
  {"text": "Kubernetes deployment manifest updated", "label": "DevOps"}
]
```
Convert with script:
```bash
python scripts/prepare_training_data.py \
    --input data.json \
    --output-train train.spacy \
    --output-dev dev.spacy \
    --split 0.8
```
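Under the hood, the conversion amounts to building `Doc` objects with `cats` set and serializing them as a `DocBin`. A minimal sketch of the idea (the bundled script additionally handles the train/dev split and validation; the label list here is just the one from the example data):

```python
import json
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
labels = ["Business", "Programming", "DevOps"]  # taken from your data

db = DocBin()
with open("data.json") as f:
    for record in json.load(f):
        doc = nlp.make_doc(record["text"])
        # One-hot cats dict: every label present, gold label set to 1.0
        doc.cats = {label: float(label == record["label"]) for label in labels}
        db.add(doc)

db.to_disk("train.spacy")
```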
```bash
# Generate optimized config
python scripts/generate_config.py --categories "Business,Technology,Programming,DevOps"

# Or use template
cp assets/config_textcat.cfg config.cfg

# Train
python -m spacy train config.cfg --output ./output

# With GPU
python -m spacy train config.cfg --output ./output --gpu-id 0
```
```python
nlp = spacy.load("./output/model-best")
doc = nlp("Deploy the application to Kubernetes cluster")

predicted = max(doc.cats, key=doc.cats.get)
confidence = doc.cats[predicted]
print(f"{predicted}: {confidence:.1%}")  # DevOps: 94.2%
```
For detailed training guide: See references/text-classification.md
```
OSError: [E050] Can't find model 'en_core_web_sm'
```

Fix:

```bash
python -m spacy download en_core_web_sm
```

Alternative (avoids path issues):

```python
import en_core_web_sm
nlp = en_core_web_sm.load()
```
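For scripts that must run on fresh environments, a common pattern (a sketch, not something spaCy requires) is to download on first failure:

```python
import spacy
from spacy.cli import download

def load_model(name: str = "en_core_web_sm"):
    """Load a pipeline, downloading the package first if it's missing."""
    try:
        return spacy.load(name)
    except OSError:
        download(name)  # fetch the model package, then retry
        return spacy.load(name)

nlp = load_model()
```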
Symptoms: OOM errors, slow processing
Fixes:
```python
# 1. Disable unused components
nlp = spacy.load("en_core_web_sm", exclude=["parser", "ner"])

# 2. Process in chunks
for chunk in chunk_text(large_text, max_length=100000):
    doc = nlp(chunk)

# 3. Use memory zones (spaCy 3.8+)
with nlp.memory_zone():
    for doc in nlp.pipe(batch):
        process(doc)
```
```python
import spacy

# Must call BEFORE loading the model
if spacy.prefer_gpu():
    print("Using GPU")
else:
    print("GPU not available")

nlp = spacy.load("en_core_web_trf")  # Now loads on GPU
```
spaCy 2.x models do not work with spaCy 3.x. Check compatibility:
```bash
python -m spacy validate
```
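You can also inspect a model's declared compatibility from Python; a small sketch:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# meta.json records the spaCy version range the model was built for
print(nlp.meta["spacy_version"])  # e.g. ">=3.7.0,<3.8.0"
print(spacy.__version__)
```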
For more troubleshooting: See references/troubleshooting.md
```bash
python -m spacy package ./output/model-best ./packages \
    --name my_classifier \
    --version 1.0.0

pip install ./packages/en_my_classifier-1.0.0/
```
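After installation, the packaged pipeline loads by its `<lang>_<name>` package name like any stock model:

```python
import spacy

# Package names are <lang>_<name>, hence "en_my_classifier"
nlp = spacy.load("en_my_classifier")
print(nlp.pipe_names)
```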
Use the production template:
```bash
python scripts/serve_model.py --model ./output/model-best --port 8000
```
Or customize from template:
```python
from fastapi import FastAPI
from pydantic import BaseModel
import spacy

app = FastAPI()
nlp = spacy.load("en_my_classifier")

class ClassifyRequest(BaseModel):
    text: str  # request body model, so text arrives as JSON, not a query param

@app.post("/classify")
async def classify(req: ClassifyRequest):
    # memory_zone (spaCy 3.8+) frees per-request allocations on exit
    with nlp.memory_zone():
        doc = nlp(req.text)
        return {
            "category": max(doc.cats, key=doc.cats.get),
            "scores": doc.cats,
        }
```
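A quick client-side check against the running server (a sketch, assuming the default localhost setup and the `requests` library):

```python
import requests

# Assumes the server above is running locally on port 8000
resp = requests.post(
    "http://localhost:8000/classify",
    json={"text": "Deploy the application to Kubernetes cluster"},
)
print(resp.json())  # {"category": "DevOps", "scores": {...}}
```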
| Technique | Speedup | When to Use |
|---|---|---|
| Disable components | 2-3x | Don't need all annotations |
| `nlp.pipe()` | 5-10x | Processing multiple texts |
| Multiprocessing | 2-4x | CPU-bound, many cores |
| GPU | 2-5x | Transformer models |
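When components should be disabled for only one pass rather than at load time, `nlp.select_pipes` works as a context manager and restores the pipeline afterwards; a short sketch:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
texts = ["Apple is buying a U.K. startup.", "Berlin is lovely in spring."]

# Temporarily skip components this pass doesn't need; restored on exit
with nlp.select_pipes(disable=["parser", "tagger", "lemmatizer"]):
    for doc in nlp.pipe(texts, batch_size=50):
        print([(ent.text, ent.label_) for ent in doc.ents])
```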
For evaluation metrics and hyperparameter tuning: See references/production.md
| Script | Purpose | Usage |
|---|---|---|
| `prepare_training_data.py` | Convert JSON to DocBin | `python scripts/prepare_training_data.py --input data.json` |
| `generate_config.py` | Create training config | `python scripts/generate_config.py --categories "A,B,C"` |
| `evaluate_model.py` | Detailed metrics | `python scripts/evaluate_model.py --model ./output/model-best` |
| `serve_model.py` | FastAPI server | `python scripts/serve_model.py --model ./model --port 8000` |
| Asset | Purpose | Usage |
|---|---|---|
| `config_textcat.cfg` | Base training config | Copy and customize for your labels |
| `training_data_template.json` | Data format example | Reference for preparing your data |