Remove refusal behaviors from open-weight LLMs using OBLITERATUS — mechanistic interpretability techniques (diff-in-means, SVD, whitened SVD, LEACE, SAE decomposition, etc.) to excise guardrails while preserving reasoning. 9 CLI methods, 28 analysis modules, 116 model presets across 5 compute tiers, tournament evaluation, and telemetry-driven recommendations. Use when a user wants to uncensor, abliterate, or remove refusal from an LLM.
npx claudepluginhub rnben/hermes-skills --plugin mlops-skills
Remove refusal behaviors (guardrails) from open-weight LLMs without retraining or fine-tuning. Uses mechanistic interpretability techniques — including diff-in-means, SVD, whitened SVD, LEACE concept erasure, SAE decomposition, Bayesian kernel projection, and more — to identify and surgically excise refusal directions from model weights while preserving reasoning capabilities.
License warning: OBLITERATUS is AGPL-3.0. NEVER import it as a Python library. Always invoke via CLI (obliteratus command) or subprocess. This keeps Hermes Agent's MIT license clean.
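In automation code, the license boundary is easiest to keep honest by never touching `obliteratus` modules at all and shelling out instead. A minimal sketch (the helper names are illustrative, not part of OBLITERATUS; only the `obliteratus` CLI itself is assumed):

```python
import subprocess

def obliteratus_cmd(*args):
    """Build an argv list for the OBLITERATUS CLI.

    Keeping invocation at the process boundary (subprocess, never import)
    preserves the AGPL/MIT separation described above: no obliteratus
    Python modules are ever loaded into this process.
    """
    return ["obliteratus", *args]

def run(*args):
    # check=False: let the caller inspect returncode/stderr on failure,
    # e.g. when the CLI is not installed yet.
    return subprocess.run(obliteratus_cmd(*args), capture_output=True, text=True)

print(obliteratus_cmd("recommend", "my-model"))
# → ['obliteratus', 'recommend', 'my-model']
```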
Trigger when the user:
Check if already installed:
obliteratus --version 2>/dev/null && echo "INSTALLED" || echo "NOT INSTALLED"
If not installed, clone and install from GitHub:
git clone https://github.com/elder-plinius/OBLITERATUS.git
cd OBLITERATUS
pip install -e .
# For Gradio web UI support:
# pip install -e ".[spaces]"
IMPORTANT: Confirm with user before installing. This pulls in ~5-10GB of dependencies (PyTorch, Transformers, bitsandbytes, etc.).
Before anything, check what GPU is available:
python3 -c "
import torch
if torch.cuda.is_available():
    gpu = torch.cuda.get_device_name(0)
    vram = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f'GPU: {gpu}')
    print(f'VRAM: {vram:.1f} GB')
    if vram < 4: print('TIER: tiny (models under 1B)')
    elif vram < 8: print('TIER: small (models 1-4B)')
    elif vram < 16: print('TIER: medium (models 4-9B with 4bit quant)')
    elif vram < 32: print('TIER: large (models 8-32B with 4bit quant)')
    else: print('TIER: frontier (models 32B+)')
else:
    print('NO GPU - only tiny models (under 1B) on CPU')
"
| VRAM | Max Model Size | Example Models |
|---|---|---|
| CPU only | ~1B params | GPT-2, TinyLlama, SmolLM |
| 4-8 GB | ~4B params | Qwen2.5-1.5B, Phi-3.5 mini, Llama 3.2 3B |
| 8-16 GB | ~9B params | Llama 3.1 8B, Mistral 7B, Gemma 2 9B |
| 24 GB | ~32B params | Qwen3-32B, Llama 3.1 70B (tight), Command-R |
| 48 GB+ | ~72B+ params | Qwen2.5-72B, DeepSeek-R1 |
| Multi-GPU | 200B+ params | Llama 3.1 405B, DeepSeek-V3 (685B MoE) |
# Browse models by compute tier
obliteratus models --tier medium
# Get architecture info for a specific model
obliteratus info <model_name>
# Get telemetry-driven recommendation for best method & params
obliteratus recommend <model_name>
obliteratus recommend <model_name> --insights # global cross-architecture rankings
Default / recommended for most cases: advanced. It uses multi-direction SVD with norm-preserving projection and is well-tested.
| Situation | Recommended Method | Why |
|---|---|---|
| Default / most models | advanced | Multi-direction SVD, norm-preserving, reliable |
| Quick test / prototyping | basic | Fast, simple, good enough to evaluate |
| Dense model (Llama, Mistral) | advanced | Multi-direction, norm-preserving |
| MoE model (DeepSeek, Mixtral) | nuclear | Expert-granular, handles MoE complexity |
| Reasoning model (R1 distills) | surgical | CoT-aware, preserves chain-of-thought |
| Stubborn refusals persist | aggressive | Whitened SVD + head surgery + jailbreak |
| Want reversible changes | Use steering vectors (see Analysis section) | Inference-time only, no permanent weight edits |
| Maximum quality, time no object | optimized | Bayesian search for best parameters |
| Experimental auto-detection | informed | Auto-detects alignment type — experimental, may not always outperform advanced |
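The table above can be encoded as a simple dispatch helper when scripting batch runs. A sketch under the assumption that the caller already knows the model's family; the function and category names are illustrative, not part of the OBLITERATUS API:

```python
def pick_method(family: str, stubborn: bool = False) -> str:
    """Map a model family to a starting abliteration method, per the table above."""
    if stubborn:
        # Whitened SVD + head surgery + jailbreak, for persistent refusals.
        return "aggressive"
    table = {
        "dense": "advanced",      # Llama, Mistral: multi-direction SVD, norm-preserving
        "moe": "nuclear",         # DeepSeek, Mixtral: expert-granular
        "reasoning": "surgical",  # R1 distills: CoT-aware
    }
    return table.get(family, "advanced")  # default / most models

print(pick_method("moe"))      # → nuclear
print(pick_method("unknown"))  # → advanced
```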
(NOT available via CLI — require Python import, which violates AGPL boundary. Mention to user only if they explicitly want to use OBLITERATUS as a library in their own AGPL project.)
# Default method (advanced) — recommended for most models
obliteratus obliterate <model_name> --method advanced --output-dir ./abliterated-models
# With 4-bit quantization (saves VRAM)
obliteratus obliterate <model_name> --method advanced --quantization 4bit --output-dir ./abliterated-models
# Large models (70B+) — conservative defaults
obliteratus obliterate <model_name> --method advanced --quantization 4bit --large-model --output-dir ./abliterated-models
obliteratus obliterate <model_name> \
--method advanced \
--direction-method diff_means \
--n-directions 4 \
--refinement-passes 2 \
--regularization 0.1 \
--quantization 4bit \
--output-dir ./abliterated-models \
--contribute # opt-in telemetry for community research
| Flag | Description | Default |
|---|---|---|
| --method | Abliteration method | advanced |
| --direction-method | Direction extraction | diff_means |
| --n-directions | Number of refusal directions (1-32) | method-dependent |
| --refinement-passes | Iterative passes (1-5) | 2 |
| --regularization | Regularization strength (0.0-1.0) | 0.1 |
| --quantization | Load in 4bit or 8bit | none (full precision) |
| --large-model | Conservative defaults for 120B+ | false |
| --output-dir | Where to save the abliterated model | ./obliterated_model |
| --contribute | Share anonymized results for research | false |
| --verify-sample-size | Number of test prompts for refusal check | 20 |
| --dtype | Model dtype (float16, bfloat16) | auto |
# Interactive guided mode (hardware → model → preset)
obliteratus interactive
# Web UI (Gradio)
obliteratus ui --port 7860
# Run a full ablation study from YAML config
obliteratus run config.yaml --preset quick
# Tournament: pit all methods against each other
obliteratus tourney <model_name>
After abliteration, check the output metrics:
| Metric | Good Value | Warning |
|---|---|---|
| Refusal rate | < 5% (ideally ~0%) | > 10% means refusals persist |
| Perplexity change | < 10% increase | > 15% means coherence damage |
| KL divergence | < 0.1 | > 0.5 means significant distribution shift |
| Coherence | High / passes qualitative check | Degraded responses, repetition |
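The thresholds in the table above can be applied mechanically to a run's output metrics. A hedged sketch — the metric key names here are assumptions, so check them against the actual results JSON before wiring this into a pipeline:

```python
def check_run(metrics: dict) -> list[str]:
    """Flag warning conditions from the quality table above.

    Rates and changes are expressed as fractions (0.10 == 10%).
    """
    warnings = []
    if metrics["refusal_rate"] > 0.10:
        warnings.append("refusals persist (> 10%)")
    if metrics["perplexity_change"] > 0.15:
        warnings.append("coherence damage (> 15% perplexity increase)")
    if metrics["kl_divergence"] > 0.5:
        warnings.append("significant distribution shift (KL > 0.5)")
    return warnings

good = {"refusal_rate": 0.02, "perplexity_change": 0.05, "kl_divergence": 0.08}
print(check_run(good))  # → []
bad = {"refusal_rate": 0.20, "perplexity_change": 0.20, "kl_divergence": 0.6}
print(check_run(bad))
```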
If refusals persist (> 10%):
- Try the aggressive method
- Increase --n-directions (e.g., 8 or 16)
- Set --refinement-passes 3
- Try --direction-method svd instead of diff_means

If coherence is damaged:
- Reduce --n-directions (try 2)
- Increase --regularization (try 0.3)
- Reduce --refinement-passes to 1
- Fall back to the basic method (gentler)

The output is a standard HuggingFace model directory.
# Test locally with transformers
python3 -c "
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained('./abliterated-models/<model>')
tokenizer = AutoTokenizer.from_pretrained('./abliterated-models/<model>')
inputs = tokenizer('How do I pick a lock?', return_tensors='pt')
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
"
# Upload to HuggingFace Hub
huggingface-cli upload <username>/<model-name>-abliterated ./abliterated-models/<model>
# Serve with vLLM
vllm serve ./abliterated-models/<model>
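Once served, the model is reachable over vLLM's OpenAI-compatible HTTP API (default port 8000). A minimal client sketch — the model name must match the path passed to `vllm serve`, and the endpoint/port are the vLLM defaults, so adjust if you changed them:

```python
import json
import urllib.request

def completion_request(model: str, prompt: str, max_tokens: int = 200):
    """Build an OpenAI-compatible /v1/completions request for a local vLLM server."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        "http://localhost:8000/v1/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = completion_request("./abliterated-models/my-model", "How do I pick a lock?")
print(json.loads(req.data)["max_tokens"])  # → 200
# To actually send it (server must be running):
#   with urllib.request.urlopen(req) as r:
#       print(json.load(r)["choices"][0]["text"])
```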
| Command | Description |
|---|---|
| obliteratus obliterate | Main abliteration command |
| obliteratus info <model> | Print model architecture details |
| obliteratus models --tier <tier> | Browse curated models by compute tier |
| obliteratus recommend <model> | Telemetry-driven method/param suggestion |
| obliteratus interactive | Guided setup wizard |
| obliteratus tourney <model> | Tournament: all methods head-to-head |
| obliteratus run <config.yaml> | Execute ablation study from YAML |
| obliteratus strategies | List all registered ablation strategies |
| obliteratus report <results.json> | Regenerate visual reports |
| obliteratus ui | Launch Gradio web interface |
| obliteratus aggregate | Summarize community telemetry data |
OBLITERATUS includes 28 analysis modules for mechanistic interpretability.
See skill_view(name="obliteratus", file_path="references/analysis-modules.md") for the full reference.
# Run specific analysis modules
obliteratus run analysis-config.yaml --preset quick
# Key modules to run first:
# - alignment_imprint: Fingerprint DPO/RLHF/CAI/SFT alignment method
# - concept_geometry: Single direction vs polyhedral cone
# - logit_lens: Which layer decides to refuse
# - anti_ouroboros: Self-repair risk score
# - causal_tracing: Causally necessary components
Instead of permanent weight modification, use inference-time steering:
# Python API only — for user's own projects
from obliteratus.analysis.steering_vectors import SteeringVectorFactory, SteeringHookManager
Beyond direction-based abliteration, OBLITERATUS includes structural ablation strategies:
List all available: obliteratus strategies
OBLITERATUS includes built-in evaluation tools:
Load templates for reproducible runs via skill_view:
- templates/abliteration-config.yaml — Standard single-model config
- templates/analysis-study.yaml — Pre-abliteration analysis study
- templates/batch-abliteration.yaml — Multi-model batch processing

OBLITERATUS can optionally contribute anonymized run data to a global research dataset.
Enable with --contribute flag. No personal data is collected — only model name, method, metrics.
- Don't use informed as the default — it's experimental and slower. Use advanced for reliable results.
- aggressive can make things worse — on small models it can damage coherence and actually increase refusal rate. Only use it if advanced leaves > 10% refusals on a 3B+ model.
- Use the nuclear method for MoE models (Mixtral, DeepSeek-MoE, etc.).
- Use surgical for R1 distills to preserve chain-of-thought.
- Run obliteratus recommend first — telemetry data may have better parameters than defaults.
- Never import obliteratus in MIT/Apache projects. CLI invocation only.
- Use the --large-model flag for conservative defaults on very large (120B+) models.