From aradotso-trending-skills-37
Build and experiment with Recurrent-Depth Transformer models using OpenMythos, featuring looped recurrent blocks, MLA/GQA attention, and sparse MoE for inference-time scaling.
npx claudepluginhub joshuarweaver/cascade-ai-ml-agents-misc-1 --plugin aradotso-trending-skills-37This skill uses the workspace's default tool permissions.
```markdown
Guides Next.js Cache Components and Partial Prerendering (PPR) with cacheComponents enabled. Implements 'use cache', cacheLife(), cacheTag(), revalidateTag(), static/dynamic optimization, and cache debugging.
Guides building MCP servers enabling LLMs to interact with external services via tools. Covers best practices, TypeScript/Node (MCP SDK), Python (FastMCP).
Generates original PNG/PDF visual art via design philosophy manifestos for posters, graphics, and static designs on user request.
---
name: openmythos-recurrent-transformer
description: Build and experiment with Recurrent-Depth Transformer (RDT) models using OpenMythos, a theoretical reconstruction of the Claude Mythos architecture with looped transformers, MLA/GQA attention, and sparse MoE.
triggers:
- implement a looped transformer model
- build a recurrent depth transformer
- use OpenMythos for inference time scaling
- configure MLA or GQA attention in OpenMythos
- set up mixture of experts with recurrent blocks
- train a model with adaptive loop iterations
- generate text with variable reasoning depth
- explore compute-adaptive transformer architectures
---
# OpenMythos Recurrent-Depth Transformer
> Skill by [ara.so](https://ara.so) — Daily 2026 Skills collection.
OpenMythos is an open-source theoretical implementation of a Recurrent-Depth Transformer (RDT) inspired by the Claude Mythos architecture. It divides computation into three stages: **Prelude** (standard transformer layers run once), a **Recurrent Block** (looped up to `max_loop_iters` times with stable injection), and a **Coda** (standard transformer layers run once). Attention is switchable between MLA (Multi-head Latent Attention) and GQA (Grouped Query Attention), and feed-forward layers use sparse MoE with routed and shared experts.
## Installation
```bash
git clone https://github.com/The-Swarm-Corporation/OpenMythos.git
cd OpenMythos
pip install -r requirements.txt
Input → Embedding
↓
[Prelude] — N standard transformer layers, run once
↓
[Recurrent Block] — looped T times per forward pass
↑_________↓ h_{t+1} = A·h_t + B·e + Transformer(h_t, e)
↓
[Coda] — N standard transformer layers, run once
↓
Output Logits
e = encoded input from Prelude (injected every loop to prevent drift)A, B = learned stable injection parameters (spectral radius ρ(A) < 1 enforced)"mla"): Multi-head Latent Attention — uses KV LoRA compression, separate RoPE and NoPE head dimensions"gqa"): Grouped Query Attention — fewer KV heads than Q heads, simpler configMythosConfig Parameters| Parameter | Type | Description |
|---|---|---|
vocab_size | int | Vocabulary size |
dim | int | Model hidden dimension |
n_heads | int | Number of attention query heads |
n_kv_heads | int | Number of KV heads (GQA ratio = n_heads/n_kv_heads) |
max_seq_len | int | Maximum sequence length |
max_loop_iters | int | Maximum recurrent loop iterations |
prelude_layers | int | Number of Prelude transformer layers |
coda_layers | int | Number of Coda transformer layers |
n_experts | int | Total number of MoE routed experts |
n_shared_experts | int | Always-active shared experts |
n_experts_per_tok | int | Top-K experts selected per token |
expert_dim | int | Hidden dim inside each expert FFN |
lora_rank | int | LoRA rank for injection parameters |
attn_type | str | "mla" or "gqa" |
kv_lora_rank | int | MLA only: KV compression rank |
q_lora_rank | int | MLA only: Q compression rank |
qk_rope_head_dim | int | MLA only: RoPE head dimension |
qk_nope_head_dim | int | MLA only: NoPE head dimension |
v_head_dim | int | MLA only: Value head dimension |
import torch
from open_mythos.main import OpenMythos, MythosConfig
cfg = MythosConfig(
vocab_size=32000,
dim=512,
n_heads=8,
n_kv_heads=2, # GQA: 4x fewer KV heads
max_seq_len=2048,
max_loop_iters=8,
prelude_layers=2,
coda_layers=2,
n_experts=16,
n_shared_experts=2,
n_experts_per_tok=2,
expert_dim=256,
lora_rank=16,
attn_type="gqa",
)
model = OpenMythos(cfg)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
# Forward pass with 4 reasoning loops
ids = torch.randint(0, cfg.vocab_size, (2, 64))
logits = model(ids, n_loops=4)
print(f"Logits shape: {logits.shape}") # (2, 64, 32000)
import torch
from open_mythos.main import OpenMythos, MythosConfig
cfg = MythosConfig(
vocab_size=32000,
dim=512,
n_heads=8,
n_kv_heads=8, # MLA uses full heads
max_seq_len=2048,
max_loop_iters=16,
prelude_layers=2,
coda_layers=2,
n_experts=16,
n_shared_experts=2,
n_experts_per_tok=2,
expert_dim=256,
lora_rank=16,
attn_type="mla",
# MLA-specific compression dims
kv_lora_rank=64,
q_lora_rank=128,
qk_rope_head_dim=32,
qk_nope_head_dim=32,
v_head_dim=32,
)
model = OpenMythos(cfg)
ids = torch.randint(0, cfg.vocab_size, (1, 128))
# More loops = deeper implicit reasoning
logits_shallow = model(ids, n_loops=2)
logits_deep = model(ids, n_loops=16)
print(f"Shallow logits: {logits_shallow.shape}")
print(f"Deep logits: {logits_deep.shape}")
import torch
from open_mythos.main import OpenMythos, MythosConfig
cfg = MythosConfig(
vocab_size=50257,
dim=768,
n_heads=12,
n_kv_heads=4,
max_seq_len=1024,
max_loop_iters=12,
prelude_layers=2,
coda_layers=2,
n_experts=8,
n_shared_experts=1,
n_experts_per_tok=2,
expert_dim=512,
lora_rank=32,
attn_type="gqa",
)
model = OpenMythos(cfg)
model.eval()
prompt_ids = torch.randint(0, cfg.vocab_size, (1, 16))
# Simple task: fewer loops
with torch.no_grad():
easy_output = model.generate(
prompt_ids,
max_new_tokens=32,
n_loops=2, # shallow reasoning
)
# Complex task: more loops (same model, more compute)
with torch.no_grad():
hard_output = model.generate(
prompt_ids,
max_new_tokens=32,
n_loops=12, # deep reasoning
)
print(f"Easy output shape: {easy_output.shape}")
print(f"Hard output shape: {hard_output.shape}")
Always verify the spectral radius constraint after initialization and training:
import torch
from open_mythos.main import OpenMythos, MythosConfig
cfg = MythosConfig(
vocab_size=1000,
dim=256,
n_heads=8,
n_kv_heads=2,
max_seq_len=128,
max_loop_iters=8,
prelude_layers=1,
coda_layers=1,
n_experts=8,
n_shared_experts=1,
n_experts_per_tok=2,
expert_dim=64,
lora_rank=8,
attn_type="gqa",
)
model = OpenMythos(cfg)
# Check injection matrix spectral radius — must be < 1
A = model.recurrent.injection.get_A()
spectral_radius = A.max().item()
print(f"Spectral radius ρ(A): {spectral_radius:.6f}")
assert spectral_radius < 1.0, "UNSTABLE: spectral radius >= 1!"
print("✓ Stability constraint satisfied")
import torch
import torch.nn.functional as F
from open_mythos.main import OpenMythos, MythosConfig
cfg = MythosConfig(
vocab_size=1000,
dim=256,
n_heads=8,
n_kv_heads=2,
max_seq_len=64,
max_loop_iters=16,
prelude_layers=1,
coda_layers=1,
n_experts=8,
n_shared_experts=1,
n_experts_per_tok=2,
expert_dim=64,
lora_rank=8,
attn_type="gqa",
)
model = OpenMythos(cfg)
model.eval()
ids = torch.randint(0, cfg.vocab_size, (1, 32))
targets = torch.randint(0, cfg.vocab_size, (1, 32))
results = {}
with torch.no_grad():
for n_loops in [1, 2, 4, 8, 16]:
logits = model(ids, n_loops=n_loops)
loss = F.cross_entropy(
logits.view(-1, cfg.vocab_size),
targets.view(-1)
)
results[n_loops] = loss.item()
print(f"n_loops={n_loops:2d} → loss={loss.item():.4f}")
# Expect diminishing returns (saturating exponential decay)
print("\nDelta losses (should decrease):")
loop_counts = sorted(results.keys())
for i in range(1, len(loop_counts)):
delta = results[loop_counts[i-1]] - results[loop_counts[i]]
print(f" {loop_counts[i-1]}→{loop_counts[i]} loops: Δloss={delta:.4f}")
import torch
import torch.nn.functional as F
from torch.optim import AdamW
from open_mythos.main import OpenMythos, MythosConfig
cfg = MythosConfig(
vocab_size=10000,
dim=512,
n_heads=8,
n_kv_heads=2,
max_seq_len=256,
max_loop_iters=8,
prelude_layers=2,
coda_layers=2,
n_experts=8,
n_shared_experts=1,
n_experts_per_tok=2,
expert_dim=256,
lora_rank=16,
attn_type="gqa",
)
model = OpenMythos(cfg)
optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
def train_step(model, optimizer, input_ids, target_ids, n_loops=4):
model.train()
optimizer.zero_grad()
logits = model(input_ids, n_loops=n_loops)
loss = F.cross_entropy(
logits.view(-1, cfg.vocab_size),
target_ids.view(-1),
)
loss.backward()
# Gradient clipping recommended for looped models
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
return loss.item()
def check_stability(model):
A = model.recurrent.injection.get_A()
rho = A.max().item()
return rho
# Example training iterations
for step in range(10):
batch_size, seq_len = 4, 64
input_ids = torch.randint(0, cfg.vocab_size, (batch_size, seq_len))
target_ids = torch.randint(0, cfg.vocab_size, (batch_size, seq_len))
# Curriculum: start with fewer loops, increase over training
n_loops = min(2 + step // 3, cfg.max_loop_iters)
loss = train_step(model, optimizer, input_ids, target_ids, n_loops=n_loops)
rho = check_stability(model)
print(f"Step {step:3d} | loss={loss:.4f} | ρ(A)={rho:.4f} | loops={n_loops}")
if rho >= 1.0:
print("WARNING: Spectral radius constraint violated!")
import torch
from open_mythos.main import OpenMythos, MythosConfig
def build_model(attn_type: str) -> OpenMythos:
base = dict(
vocab_size=8000,
dim=256,
n_heads=8,
max_seq_len=128,
max_loop_iters=6,
prelude_layers=1,
coda_layers=1,
n_experts=8,
n_shared_experts=1,
n_experts_per_tok=2,
expert_dim=64,
lora_rank=8,
attn_type=attn_type,
)
if attn_type == "gqa":
cfg = MythosConfig(**base, n_kv_heads=2)
else: # mla
cfg = MythosConfig(
**base,
n_kv_heads=8,
kv_lora_rank=32,
q_lora_rank=64,
qk_rope_head_dim=16,
qk_nope_head_dim=16,
v_head_dim=16,
)
return OpenMythos(cfg)
for attn_type in ["gqa", "mla"]:
model = build_model(attn_type)
param_count = sum(p.numel() for p in model.parameters())
ids = torch.randint(0, 8000, (2, 32))
logits = model(ids, n_loops=4)
rho = model.recurrent.injection.get_A().max().item()
print(f"{attn_type.upper()}: params={param_count:,} | "
f"logits={logits.shape} | ρ(A)={rho:.4f}")
OpenMythos(config)Main model class.
forward(input_ids, n_loops=None)
input_ids: LongTensor of shape (batch, seq_len)n_loops: int, number of recurrent iterations (default: config.max_loop_iters)FloatTensor of shape (batch, seq_len, vocab_size)generate(input_ids, max_new_tokens, n_loops=None)
input_ids: LongTensor of shape (batch, seq_len)max_new_tokens: intn_loops: int, loops per forward step during generationLongTensor of shape (batch, seq_len + max_new_tokens)model.recurrent.injection.get_A()Returns the learned injection matrix A. Check A.max().item() < 1.0 for stability.
Start with fewer loops and increase gradually — helps with initial stability:
n_loops = min(max_loops, 2 + global_step // 1000)
Use fewer loops for simple/fast tasks, more for complex reasoning — same weights, adaptive compute:
# Classification / simple completion
logits = model(ids, n_loops=2)
# Multi-step reasoning / hard math
logits = model(ids, n_loops=model.cfg.max_loop_iters)
Always clip gradients when training looped models — prevents the rare instability when gradients backpropagate through many loop unrolls:
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
The router uses dynamic bias adjustment — no special loss term needed, but monitor expert utilization during training to detect collapse.
| Problem | Cause | Fix |
|---|---|---|
ρ(A) >= 1.0 after training | Injection constraint violated | Lower learning rate; the constraint should be enforced by construction — check model version |
| Loss NaN / explosion | Residual stream diverging | Enable grad clipping (max_norm=1.0); reduce n_loops early in training |
RuntimeError on MLA config | Missing MLA-specific params | Ensure kv_lora_rank, q_lora_rank, qk_rope_head_dim, qk_nope_head_dim, v_head_dim are all set |
OOM with high n_loops | Memory scales with loop unrolls during training | Use torch.no_grad() for eval; reduce batch size during high-loop training |
GQA n_kv_heads error | Must divide n_heads evenly | Ensure n_heads % n_kv_heads == 0 |
| Slow generation | Generating with max loops | Reduce n_loops in generate() for faster inference |
OpenMythos/
├── open_mythos/
│ └── main.py # OpenMythos, MythosConfig, all sub-modules
├── docs/
│ └── open_mythos.md # Full API reference
├── requirements.txt
└── README.md