Industrial-strength NLP with spaCy 3.x for text processing and custom classifier training. Use when "installing spaCy", "selecting model for nlp" (en_core_web_sm/md/lg/trf), "tokenization", "POS tagging", "named entity recognition" (NER), "dependency parsing", "training TextCategorizer models", "troubleshooting spaCy errors" (E050/E941 model errors, E927 version mismatch, memory issues), "batch processing with nlp.pipe", or "deploying nlp models to production". Includes data preparation scripts, config templates, and FastAPI serving examples.
```
/plugin marketplace add SpillwaveSolutions/spacy-nlp-agentic-skill
/plugin install spacy-nlp@spacy-nlp-agentic-skill
```

This skill inherits all available tools. When active, it can use any tool Claude has access to.
Bundled files:

- `assets/config_textcat.cfg`
- `assets/training_data_template.json`
- `references/basic-usage.md`
- `references/installation.md`
- `references/production.md`
- `references/text-classification.md`
- `references/troubleshooting.md`
- `scripts/evaluate_model.py`
- `scripts/generate_config.py`
- `scripts/prepare_training_data.py`
- `scripts/serve_model.py`

Production-ready NLP with spaCy 3.x. This skill covers installation through deployment.
In Scope: installing spaCy and selecting models, tokenization, POS tagging, NER, dependency parsing, training TextCategorizer models, batch processing with `nlp.pipe`, troubleshooting common spaCy errors, and deploying models to production.
Out of Scope (use other tools/skills):
```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")

# Entities
for ent in doc.ents:
    print(ent.text, ent.label_)

# Tokens with attributes
for token in doc:
    print(token.text, token.pos_, token.dep_)
```
```bash
pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download en_core_web_sm
```
| Model | Size | Speed | Use Case |
|---|---|---|---|
| `en_core_web_sm` | 12 MB | Fastest | Prototyping, speed-critical |
| `en_core_web_md` | 40 MB | Fast | General use with word vectors |
| `en_core_web_lg` | 560 MB | Fast | Semantic similarity tasks |
| `en_core_web_trf` | 438 MB | Slow | Maximum accuracy (GPU) |
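The md and lg models ship with word vectors, which is what enables similarity comparisons. A quick sketch, assuming `en_core_web_md` has been downloaded:

```python
import spacy

# Similarity needs real word vectors: use md or lg, not sm
nlp = spacy.load("en_core_web_md")

doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")

# Cosine similarity of the averaged word vectors
print(doc1.similarity(doc2))
```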
```python
import spacy

print(spacy.__version__)
nlp = spacy.load("en_core_web_sm")
doc = nlp("Test sentence.")
print(f"Tokens: {len(doc)}")
```
For detailed installation options (conda, GPU, transformers): See references/installation.md
```python
nlp = spacy.load("en_core_web_sm")
doc = nlp("The striped bats are hanging on their feet.")

# Tokenization + attributes
for token in doc:
    print(f"{token.text:10} | {token.lemma_:10} | {token.pos_:6} | {token.dep_}")

# NER needs text that actually contains entities
doc = nlp("Apple Inc. was founded by Steve Jobs.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # "Apple Inc." ORG, "Steve Jobs" PERSON
```
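Entities are `Span` objects, so filtering by label and reading character offsets is straightforward; a short sketch:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple Inc. was founded by Steve Jobs in California.")

# Entities are Span objects: filter by label, read char offsets
orgs = [ent for ent in doc.ents if ent.label_ == "ORG"]
for ent in orgs:
    print(ent.text, ent.start_char, ent.end_char)
```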
For entity types, filtering, and span details: See references/basic-usage.md
```python
# WRONG - slow
for text in texts:
    doc = nlp(text)  # Don't do this

# CORRECT - fast
for doc in nlp.pipe(texts, batch_size=50):
    process(doc)

# With multiprocessing
docs = list(nlp.pipe(texts, n_process=4))
```
```python
# Only need NER? Disable the rest for ~2x speed
nlp = spacy.load("en_core_web_sm", disable=["parser", "tagger", "lemmatizer"])
```
For Doc/Token/Span details, noun chunks, similarity: See references/basic-usage.md
Train custom text classifiers with TextCategorizer.
Workflow:

1. Prepare data: `scripts/prepare_training_data.py`
2. Create a config: `scripts/generate_config.py`, or use `assets/config_textcat.cfg`
3. Validate: `python -m spacy debug data config.cfg` (catches issues before training)
4. Train: `python -m spacy train config.cfg --output ./output`
5. Evaluate: `scripts/evaluate_model.py`
6. Load: `nlp = spacy.load("./output/model-best")`

Training data uses spaCy's DocBin format. Example input (JSON):
```json
[
  {"text": "Quarterly revenue exceeded expectations", "label": "Business"},
  {"text": "Fixed null pointer exception in parser", "label": "Programming"},
  {"text": "Kubernetes deployment manifest updated", "label": "DevOps"}
]
```
Convert with script:
```bash
python scripts/prepare_training_data.py \
    --input data.json \
    --output-train train.spacy \
    --output-dev dev.spacy \
    --split 0.8
```
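Under the hood, the conversion amounts to building `Doc` objects with `cats` set and serializing them as a `DocBin`. A minimal sketch of the idea (the bundled script additionally handles the train/dev split and validation; the label list here is just the one from the example data):

```python
import json
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
labels = ["Business", "Programming", "DevOps"]  # taken from your data

db = DocBin()
with open("data.json") as f:
    for record in json.load(f):
        doc = nlp.make_doc(record["text"])
        # One-hot cats dict: every label present, gold label set to 1.0
        doc.cats = {label: float(label == record["label"]) for label in labels}
        db.add(doc)

db.to_disk("train.spacy")
```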
```bash
# Generate optimized config
python scripts/generate_config.py --categories "Business,Technology,Programming,DevOps"

# Or use template
cp assets/config_textcat.cfg config.cfg

# Train
python -m spacy train config.cfg --output ./output

# With GPU
python -m spacy train config.cfg --output ./output --gpu-id 0
```
```python
nlp = spacy.load("./output/model-best")
doc = nlp("Deploy the application to Kubernetes cluster")

predicted = max(doc.cats, key=doc.cats.get)
confidence = doc.cats[predicted]
print(f"{predicted}: {confidence:.1%}")  # DevOps: 94.2%
```
For detailed training guide: See references/text-classification.md
```
OSError: [E050] Can't find model 'en_core_web_sm'
```

Fix:

```bash
python -m spacy download en_core_web_sm
```

Alternative (avoids path issues):

```python
import en_core_web_sm
nlp = en_core_web_sm.load()
```
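For scripts that must run on fresh environments, a common pattern (a sketch, not something spaCy requires) is to download on first failure:

```python
import spacy
from spacy.cli import download

def load_model(name: str = "en_core_web_sm"):
    """Load a pipeline, downloading the package first if it's missing."""
    try:
        return spacy.load(name)
    except OSError:
        download(name)  # fetch the model package, then retry
        return spacy.load(name)

nlp = load_model()
```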
Symptoms: OOM errors, slow processing
Fixes:
```python
# 1. Disable unused components
nlp = spacy.load("en_core_web_sm", exclude=["parser", "ner"])

# 2. Process in chunks
for chunk in chunk_text(large_text, max_length=100000):
    doc = nlp(chunk)

# 3. Use memory zones (spaCy 3.8+)
with nlp.memory_zone():
    for doc in nlp.pipe(batch):
        process(doc)
```
```python
import spacy

# Must call BEFORE loading the model
if spacy.prefer_gpu():
    print("Using GPU")
else:
    print("GPU not available")

nlp = spacy.load("en_core_web_trf")  # Now loads on GPU
```
spaCy 2.x models do not work with spaCy 3.x. Check compatibility:
```bash
python -m spacy validate
```
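You can also inspect a model's declared compatibility from Python; a small sketch:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# meta.json records the spaCy version range the model was built for
print(nlp.meta["spacy_version"])  # e.g. ">=3.7.0,<3.8.0"
print(spacy.__version__)
```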
For more troubleshooting: See references/troubleshooting.md
```bash
python -m spacy package ./output/model-best ./packages \
    --name my_classifier \
    --version 1.0.0

pip install ./packages/en_my_classifier-1.0.0/
```
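After installation, the packaged pipeline loads by its `<lang>_<name>` package name like any stock model:

```python
import spacy

# Package names are <lang>_<name>, hence "en_my_classifier"
nlp = spacy.load("en_my_classifier")
print(nlp.pipe_names)
```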
Use the production template:
```bash
python scripts/serve_model.py --model ./output/model-best --port 8000
```
Or customize from template:
```python
from fastapi import FastAPI
from pydantic import BaseModel
import spacy

app = FastAPI()
nlp = spacy.load("en_my_classifier")

class ClassifyRequest(BaseModel):
    text: str  # request body model, so text arrives as JSON, not a query param

@app.post("/classify")
async def classify(req: ClassifyRequest):
    # memory_zone (spaCy 3.8+) frees per-request allocations on exit
    with nlp.memory_zone():
        doc = nlp(req.text)
        return {
            "category": max(doc.cats, key=doc.cats.get),
            "scores": doc.cats,
        }
```
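A quick client-side check against the running server (a sketch, assuming the default localhost setup and the `requests` library):

```python
import requests

# Assumes the server above is running locally on port 8000
resp = requests.post(
    "http://localhost:8000/classify",
    json={"text": "Deploy the application to Kubernetes cluster"},
)
print(resp.json())  # {"category": "DevOps", "scores": {...}}
```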
| Technique | Speedup | When to Use |
|---|---|---|
| Disable components | 2-3x | Don't need all annotations |
| `nlp.pipe()` | 5-10x | Processing multiple texts |
| Multiprocessing | 2-4x | CPU-bound, many cores |
| GPU | 2-5x | Transformer models |
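When components should be disabled for only one pass rather than at load time, `nlp.select_pipes` works as a context manager and restores the pipeline afterwards; a short sketch:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
texts = ["Apple is buying a U.K. startup.", "Berlin is lovely in spring."]

# Temporarily skip components this pass doesn't need; restored on exit
with nlp.select_pipes(disable=["parser", "tagger", "lemmatizer"]):
    for doc in nlp.pipe(texts, batch_size=50):
        print([(ent.text, ent.label_) for ent in doc.ents])
```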
For evaluation metrics and hyperparameter tuning: See references/production.md
| Script | Purpose | Usage |
|---|---|---|
| `prepare_training_data.py` | Convert JSON to DocBin | `python scripts/prepare_training_data.py --input data.json` |
| `generate_config.py` | Create training config | `python scripts/generate_config.py --categories "A,B,C"` |
| `evaluate_model.py` | Detailed metrics | `python scripts/evaluate_model.py --model ./output/model-best` |
| `serve_model.py` | FastAPI server | `python scripts/serve_model.py --model ./model --port 8000` |
| Asset | Purpose | Usage |
|---|---|---|
| `config_textcat.cfg` | Base training config | Copy and customize for your labels |
| `training_data_template.json` | Data format example | Reference for preparing your data |