Skill

There Is No Spoon — ML Primer Skill

Explains ML fundamentals to transformers for engineers using first-principles analogies like polarizing filters for neurons and Python scripts for SVG visualizations.

Python

ai-ml

Install

npx claudepluginhub joshuarweaver/cascade-ai-ml-agents-misc-1 --plugin aradotso-trending-skills-37

Tool Access

This skill uses the workspace's default tool permissions.

Preview

```markdown

SKILL.md

Similar Skills

cache-components

Guides Next.js Cache Components and Partial Prerendering (PPR) with cacheComponents enabled. Implements 'use cache', cacheLife(), cacheTag(), revalidateTag(), static/dynamic optimization, and cache debugging.

cache-components

139.2k

mcp-builder

9 files

Guides building MCP servers enabling LLMs to interact with external services via tools. Covers best practices, TypeScript/Node (MCP SDK), Python (FastMCP).

anthropics-skills-13

124.2k

canvas-design

20 files

Generates original PNG/PDF visual art via design philosophy manifestos for posters, graphics, and static designs on user request.

anthropics-skills-13

124.2k

Stats

Stars36

Forks8

Last CommitMar 31, 2026

Actions

View Source View Plugin View on GitHub View README

Generate all figures

Requires only matplotlib and numpy:

pip install matplotlib numpy

Then run each script individually:

python3 scripts/01_neuron_hyperplane.py
python3 scripts/02_activation_functions.py
python3 scripts/03_paper_folding.py
python3 scripts/04_derivatives.py
python3 scripts/05_chain_rule.py
python3 scripts/06_attention.py
python3 scripts/07_ffn_volumetric.py
python3 scripts/08_residual_connections.py
python3 scripts/09_dot_products.py
python3 scripts/10_loss_landscapes.py
python3 scripts/11_combination_rules.py
python3 scripts/12_gating_operations.py

Or regenerate all at once:

for f in scripts/*.py; do python3 "$f"; done

Figures are written to figures/.

Project Structure

thereisnospoon/
├── ml-primer.md          # The full primer — primary content
├── SYLLABUS.md           # Full topic map / table of contents
├── figures/              # SVG/PNG visualizations (auto-generated)
│   ├── logo.svg
│   ├── 01_neuron_hyperplane.*
│   └── ...
└── scripts/              # Python figure-generation scripts
    ├── 01_neuron_hyperplane.py
    ├── 02_activation_functions.py
    └── ...

Key Concepts & Navigation

Part 1 — Fundamentals

Section	Core Analogy	Key Insight
The Neuron	Polarizing filter	Dot product as directional agreement
Composition	Paper folding	Depth = exponential crease capacity
Learning	Pipeline valves	Gradient flow through the network
Generalization	Occam's razor	Why overparameterized nets generalize
Representation	Shadows/directions	Superposition in feature space

Part 2 — Architectures

Section	Core Analogy	When to Reach For It
Convolution	Sliding template	Spatial/local structure, translation invariance
Attention	Weighted spotlight	Long-range dependencies, variable-length sequences
Recurrence	State machine	Sequential state with bounded compute
Graph ops	Message passing	Relational / graph-structured data
SSMs	Continuous dynamics	Long sequences, efficient inference
Transformer	Full assembly	General-purpose sequence modeling

Part 3 — Gates as Control Systems

Gate primitives (scalar, vector, matrix), soft logic composition, branching, routing, recursion within a forward pass.

Code Examples

Neuron from scratch (the primer's core primitive)

import numpy as np

def neuron(x: np.ndarray, w: np.ndarray, b: float) -> float:
    """
    Single neuron: dot product + bias + nonlinearity.
    Conceptually: how much does input x align with direction w?
    """
    pre_activation = np.dot(w, x) + b   # directional agreement
    return np.maximum(0, pre_activation) # ReLU nonlinearity

# Example: 3-dimensional input
x = np.array([0.5, -0.3, 0.8])
w = np.array([1.0,  0.0, 0.5])  # "cares about" dims 0 and 2
b = -0.2

output = neuron(x, w, b)
print(f"Neuron output: {output:.4f}")

Dense layer (width and composition)

import numpy as np

def dense_layer(X: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """
    X: (batch, in_features)
    W: (in_features, out_features)
    b: (out_features,)
    Returns: (batch, out_features) after ReLU
    """
    return np.maximum(0, X @ W + b)

# Two-layer MLP: paper folding twice
np.random.seed(42)
X  = np.random.randn(8, 4)      # 8 examples, 4 features
W1 = np.random.randn(4, 16) * 0.1
b1 = np.zeros(16)
W2 = np.random.randn(16, 2) * 0.1
b2 = np.zeros(2)

hidden = dense_layer(X, W1, b1)   # fold once
output = dense_layer(hidden, W2, b2)  # fold again
print(f"Output shape: {output.shape}")  # (8, 2)

Scaled dot-product attention (the transformer's core op)

import numpy as np

def scaled_dot_product_attention(
    Q: np.ndarray,
    K: np.ndarray,
    V: np.ndarray,
    mask: np.ndarray = None
) -> tuple[np.ndarray, np.ndarray]:
    """
    Q: (seq, d_k)   — queries: what am I looking for?
    K: (seq, d_k)   — keys:    what do I offer?
    V: (seq, d_v)   — values:  what do I actually contain?

    Analogy: attention scores = spotlight intensity
             softmax = normalized routing weights
             output = weighted sum of values
    """
    d_k = Q.shape[-1]
    
    # Alignment scores (how much each query matches each key)
    scores = Q @ K.T / np.sqrt(d_k)
    
    # Causal mask for autoregressive decoding
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    
    # Softmax: turn scores into a probability distribution
    scores_exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn_weights = scores_exp / scores_exp.sum(axis=-1, keepdims=True)
    
    # Weighted aggregation of values
    output = attn_weights @ V
    return output, attn_weights

# Example: 4-token sequence, d_k=8, d_v=8
seq_len, d_k, d_v = 4, 8, 8
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_v)

# Causal mask: position i can only attend to positions <= i
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

out, weights = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
print(f"Attention output shape: {out.shape}")       # (4, 8)
print(f"Attention weights shape: {weights.shape}")  # (4, 4)

Numerical gradient check (backprop intuition)

import numpy as np

def numerical_gradient(f, x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """
    Approximate gradient using finite differences.
    Useful for verifying analytic gradients.
    Analogy: tilt-and-measure — how does output change per unit nudge?
    """
    grad = np.zeros_like(x)
    for i in range(x.size):
        x_plus  = x.copy(); x_plus.flat[i]  += eps
        x_minus = x.copy(); x_minus.flat[i] -= eps
        grad.flat[i] = (f(x_plus) - f(x_minus)) / (2 * eps)
    return grad

# Test: gradient of sum-of-squares loss
def loss(w):
    return np.sum(w ** 2)

w = np.array([1.0, 2.0, -0.5])
grad_numerical = numerical_gradient(loss, w)
grad_analytic  = 2 * w  # d/dw sum(w^2) = 2w

print(f"Numerical gradient: {grad_numerical}")
print(f"Analytic gradient:  {grad_analytic}")
print(f"Max error: {np.max(np.abs(grad_numerical - grad_analytic)):.2e}")

Residual connection (the transformer's training enabler)

import numpy as np

def residual_block(x: np.ndarray, sublayer_fn, *args) -> np.ndarray:
    """
    x + sublayer(x): skip connection guarantees identity path.
    Analogy: bypass valve — gradient can always flow through unchanged.
    Critical for training deep networks (solves vanishing gradient).
    """
    return x + sublayer_fn(x, *args)

# Simulate: residual attention block
def mock_attention(x, W):
    """Simplified: project, attend, project back."""
    return np.tanh(x @ W) * 0.1  # small update

x = np.random.randn(4, 8)
W = np.random.randn(8, 8) * 0.1

out = residual_block(x, mock_attention, W)
print(f"Input norm:  {np.linalg.norm(x):.4f}")
print(f"Output norm: {np.linalg.norm(out):.4f}")
# Output is close to input — the residual preserves signal

Scalar gate (Part 3 primitive)

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def scalar_gate(value: np.ndarray, gate_logit: float) -> np.ndarray:
    """
    g = sigmoid(logit) ∈ (0, 1)
    output = g * value
    
    Analogy: dimmer switch — how much of this value passes through?
    Used in: LSTMs, GRUs, mixture-of-experts routing
    """
    g = sigmoid(gate_logit)
    return g * value

# Interpolation gate (LSTM-style)
def interpolate_gate(
    new_val: np.ndarray,
    old_val: np.ndarray,
    gate_logit: float
) -> np.ndarray:
    """How much to update vs. retain state."""
    g = sigmoid(gate_logit)
    return g * new_val + (1 - g) * old_val

state   = np.array([0.8, -0.3, 0.5])
new_info = np.array([0.1,  0.9, 0.2])

# gate_logit=2.0 → mostly update; gate_logit=-2.0 → mostly retain
updated = interpolate_gate(new_info, state, gate_logit=2.0)
retained = interpolate_gate(new_info, state, gate_logit=-2.0)
print(f"Mostly update: {updated.round(3)}")
print(f"Mostly retain: {retained.round(3)}")

Regenerating a Single Figure

Each script in scripts/ is self-contained. To modify and regenerate figure 06 (attention):

# Edit the script
$EDITOR scripts/06_attention.py

# Regenerate
python3 scripts/06_attention.py
# Output written to figures/06_attention.*

Common Patterns & Design Decisions

Matching architecture to problem (from Topology section)

Grid / spatial data (images)       → Convolution
Variable-length sequences          → Transformer (attention)
Sequential state, bounded compute  → RNN / SSM
Relational / graph structure       → GNN (message passing)
Tabular, low-dim, no structure     → MLP
Everything else at scale           → Transformer

Choosing depth vs width

More depth  → more folds in representation space (exponential capacity)
             → better for hierarchical features
             → harder to train (use residuals + normalization)

More width  → more directions per layer (linear capacity)
             → better for parallel feature detection at same level
             → easier to train, diminishing returns faster

Loss curve diagnostics (from Appendix)

Symptom	Likely Cause	Fix
Loss not decreasing	LR too low, dead ReLUs, bad init	Raise LR, check activations
Loss exploding	LR too high, no gradient clipping	Lower LR, add clipping
Train ↓ / Val ↑ (overfitting)	Too much capacity, too little data	Dropout, weight decay, more data
Train stuck high	Underfitting	More capacity, more epochs, lower LR
Loss oscillates	LR too high	LR schedule, lower base LR

Interactive Use with an AI Agent

Feed the primer to any AI coding assistant for conversational exploration:

Read ml-primer.md. I'm an engineer learning ML fundamentals.
Walk me through the section on [topic]. I want to understand
it well enough to reason about design decisions, not just
recite definitions. Push back if I get something wrong.

Effective question patterns:

"Why does X work? What would break if we removed it?"
"How do X and Y differ in terms of inductive bias?"
"Give me a concrete example where I'd choose X over Y."
"What's the failure mode of this approach?"

Contributing

PRs welcome. Keep the tone:

Direct, concrete — no hedging
Analogies over notation — analogy is the primary explanation
When-to-use over how-it-works — design decision focus

# Fork, clone, branch
git checkout -b improve/section-name

# Make changes to ml-primer.md or scripts/
# Regenerate affected figures if scripts changed
python3 scripts/XX_affected_figure.py

git commit -m "improve: clearer analogy for [concept]"
git push origin improve/section-name
# Open PR

License

MIT — see LICENSE.