From cybersecurity-skills
Detects model stealing, model inversion, and membership inference via inference-API abuse using query monitoring, output perturbation, and red-teaming.
How this skill is triggered — by the user, by Claude, or both
Slash command
/cybersecurity-skills:detecting-model-extraction-attacksThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
> **Authorized Use Only:** The extraction, inversion, and membership-inference techniques described here are intended for defenders testing their own models and for red teams operating under written authorization. Querying a third-party model to clone it, reconstruct its training data, or infer membership without permission may violate terms of service, copyright, and privacy law.
Authorized Use Only: The extraction, inversion, and membership-inference techniques described here are intended for defenders testing their own models and for red teams operating under written authorization. Querying a third-party model to clone it, reconstruct its training data, or infer membership without permission may violate terms of service, copyright, and privacy law.
Model extraction is the family of attacks in which an adversary abuses a model's inference API to steal value that the model owner intended to keep private. MITRE ATLAS catalogs these under AML.T0024 — Exfiltration via AI Inference API, in the Exfiltration tactic, with three sub-techniques:
All three share a common signal: an attacker must send many queries, often crafted to probe the decision boundary (high-entropy, near-boundary, synthetic, or systematically grid-sampled inputs), and frequently requests full confidence vectors / logits rather than just the top label. Detection therefore centers on per-principal query monitoring, input-distribution analysis, and confidence-exposure controls, while defense centers on rate limiting, output perturbation, and reducing the information returned per query. This skill follows the MITRE ATLAS technique definition for AML.T0024 (https://atlas.mitre.org/techniques/AML.T0024) and the NIST AI RMF MEASURE function (MEASURE-2.6, security and resilience of the AI system).
pip install adversarial-robustness-toolbox scikit-learn numpy
| ID | Name (MITRE ATLAS) | Tactic |
|---|---|---|
| AML.T0024 | Exfiltration via AI Inference API | Exfiltration |
| AML.T0024.000 | Infer Training Data Membership | Exfiltration |
| AML.T0024.001 | Invert AI Model | Exfiltration |
| AML.T0024.002 | Extract ML Model | Exfiltration |
Capture the fields a detector needs. Per request, log the principal (API key / IP / account), timestamp, an input fingerprint, and whether the caller requested probabilities/logits.
import hashlib, json, time
def log_inference(principal, features, returned_probs):
record = {
"ts": time.time(),
"principal": principal,
# hash inputs so logs don't store raw sensitive data
"input_hash": hashlib.sha256(json.dumps(features, sort_keys=True).encode()).hexdigest(),
"wants_probs": returned_probs,
"n_features": len(features),
}
with open("inference_audit.jsonl", "a") as f:
f.write(json.dumps(record) + "\n")
Score each principal on the three signals that distinguish extraction from normal use: high query volume in a window, high unique-input ratio (attackers rarely repeat), and a high rate of full-probability requests.
import collections, json
def score_principals(audit_path="inference_audit.jsonl", window_qps_threshold=100):
by_principal = collections.defaultdict(lambda: {"q": 0, "uniq": set(), "probs": 0})
for line in open(audit_path):
r = json.loads(line)
p = by_principal[r["principal"]]
p["q"] += 1
p["uniq"].add(r["input_hash"])
p["probs"] += int(r["wants_probs"])
findings = []
for principal, p in by_principal.items():
uniq_ratio = len(p["uniq"]) / max(p["q"], 1)
prob_ratio = p["probs"] / max(p["q"], 1)
suspicious = p["q"] > window_qps_threshold and uniq_ratio > 0.9 and prob_ratio > 0.8
findings.append({"principal": principal, "queries": p["q"],
"unique_ratio": round(uniq_ratio, 3),
"prob_request_ratio": round(prob_ratio, 3),
"suspected_extraction": suspicious})
return sorted(findings, key=lambda x: -x["queries"])
Use ART's CopycatCNN (or KnockoffNets) to train a surrogate from black-box queries and report fidelity at a given query budget. Low query budget + high agreement = high risk.
import numpy as np
from art.estimators.classification import SklearnClassifier
from art.attacks.extraction import KnockoffNets
from sklearn.ensemble import RandomForestClassifier
# victim is your already-trained model wrapped for ART
victim = SklearnClassifier(model=trained_model) # your production model
thief_model = RandomForestClassifier(n_estimators=100)
thief = SklearnClassifier(model=thief_model)
attack = KnockoffNets(classifier=victim, batch_size_fit=64,
batch_size_query=64, nb_epochs=10, nb_stolen=2000)
stolen = attack.extract(x=x_pool, thief_classifier=thief) # 2000-query budget
agreement = np.mean(stolen.predict(x_test).argmax(1) == victim.predict(x_test).argmax(1))
print(f"Surrogate fidelity (agreement with victim): {agreement:.2%} at 2000 queries")
Run ART's black-box membership-inference attack. An accuracy meaningfully above 50% indicates the model leaks membership (AML.T0024.000).
from art.attacks.inference.membership_inference import MembershipInferenceBlackBox
mia = MembershipInferenceBlackBox(victim, attack_model_type="rf")
# fit the attack on a labeled split of known members / non-members
mia.fit(x_train[:500], y_train[:500], x_test[:500], y_test[:500])
member_pred = mia.infer(x_train[500:1000], y_train[500:1000])
nonmember_pred = mia.infer(x_test[500:1000], y_test[500:1000])
acc = (member_pred.mean() + (1 - nonmember_pred.mean())) / 2
print(f"Membership-inference accuracy: {acc:.2%} (0.50 = no leakage)")
Reduce the information returned and the query economics. Re-run steps 3 and 4 after each control to confirm extractability drops.
# (a) Label-only responses: never return full probability vectors to untrusted callers.
def respond(probs, trusted):
return int(probs.argmax()) if not trusted else probs.tolist()
# (b) Confidence rounding / output perturbation (raises queries needed for inversion):
def perturb(probs, decimals=2, noise=0.01):
p = np.round(probs, decimals) + np.random.normal(0, noise, probs.shape)
p = np.clip(p, 0, None)
return p / p.sum()
Defense in depth combines these with strict per-principal rate limiting, anomaly alerting from step 2, ART's ReverseSigmoid / prediction-poisoning postprocessor, and watermarking so an extracted surrogate remains attributable.
Wire step-2 findings into your SIEM. On a confirmed extraction pattern: throttle or revoke the API key, switch the principal to label-only responses, preserve the audit log as evidence, and assess membership-inference exposure for any sensitive training data.
| Resource | Link |
|---|---|
| MITRE ATLAS AML.T0024 — Exfiltration via AI Inference API | https://atlas.mitre.org/techniques/AML.T0024 |
| Adversarial Robustness Toolbox (ART) | https://github.com/Trusted-AI/adversarial-robustness-toolbox |
| ART extraction attacks (CopycatCNN, KnockoffNets) | https://adversarial-robustness-toolbox.readthedocs.io/ |
| MITRE ATLAS Matrix | https://atlas.mitre.org/matrices/ATLAS |
| NIST AI RMF (MEASURE function) | https://www.nist.gov/itl/ai-risk-management-framework |
| Signal | Normal use | Extraction behavior |
|---|---|---|
| Query volume per principal | Bounded, bursty | Very high, sustained |
| Unique-input ratio | Repeats common inputs | Near-1.0 (rarely repeats) |
| Confidence-vector requests | Mostly top label | Demands full probs/logits |
| Input distribution | In-distribution | Near-boundary / synthetic / grid |
| Inter-query timing | Human-paced | Automated, regular |
npx claudepluginhub mukul975/anthropic-cybersecurity-skills --plugin cybersecurity-skillsDetects model stealing, model inversion, and membership inference via inference-API abuse using query monitoring, output perturbation, and red-teaming.
Audits AI/ML models for privacy leakage via training data extraction, membership inference attacks, model inversion, and attribute inference using ML Privacy Meter, ART, and Foolbox.
Detects inference endpoints missing authentication or rate limiting, enabling model theft via systematic queries. Use when building or auditing LLM-serving infrastructure.