Skill

model-dos

Flags model DoS vulnerabilities in LLM API handlers, such as unbounded prompts, missing max_tokens, unbounded context, and missing rate limiting. Suggests fixes and a verification checklist.

security

api-development

npx claudepluginhub thejefflarson/soundcheck --plugin soundcheck

Tool Access

This skill uses the workspace's default tool permissions.

Stats

Stars: 13

Forks: 0

Last Commit: Apr 18, 2026

Model Denial of Service Security Check (OWASP LLM10:2025 Unbounded Consumption, formerly LLM04 Model Denial of Service)

What this checks

Protects against resource exhaustion caused by unbounded prompts, missing token caps, or absent rate limiting. Attackers can submit enormous or recursive inputs that inflate inference costs, saturate GPU/CPU, and deny service to legitimate users.

Vulnerable patterns

LLM API calls with no max_tokens parameter — model generates until its internal limit
No input length validation before sending to the inference endpoint
Multi-turn chat that accumulates context indefinitely across turns
No per-user or per-IP rate limiting on the prompt endpoint

Fix immediately

Flag the vulnerable code and explain the risk. Then suggest a fix that establishes these properties:

Every LLM call sets an explicit output cap — max_tokens, max_output_tokens, or the provider equivalent. Leaving it at the provider default lets a single request run for minutes and rack up dollars in tokens.
Prompt input is length-capped at the handler boundary before it reaches the inference client. The cap can be measured in chars, bytes, or tokens; the exact unit doesn't matter as long as the check runs before the upstream call.
Conversation context is bounded. Either the handler is stateless single-turn, or accumulated history is trimmed to a fixed turn or token budget before every call. Unbounded history is an attacker's favorite amplifier.
Per-identifier throttling (per user, per API key, per IP) runs on every LLM endpoint. Any mechanism works (in-process token bucket, framework middleware, or reverse-proxy rule) as long as it cannot be sidestepped by aliasing or batching requests and it prevents one caller from pinning the endpoint.
Every inference call has a deadline. SDK timeout, HTTP client timeout, request-context cancellation — a hung upstream must not be able to indefinitely occupy a worker.
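
The per-identifier throttle named above can take many forms. As one illustration, here is a minimal in-process token bucket in Python. The names `TokenBucket` and `PerCallerLimiter`, the default capacity, and the refill rate are assumptions made for this sketch, not part of the skill. The clock is injectable only so the behavior can be tested without sleeping.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Allows bursts of up to `capacity` requests, refilled at `refill_per_sec`."""

    def __init__(self, capacity: float, refill_per_sec: float, now: float) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity      # start with a full bucket
        self.updated = now

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerCallerLimiter:
    """One bucket per caller identifier (user ID, API key, or IP)."""

    def __init__(
        self,
        capacity: float = 5.0,           # hypothetical burst size
        refill_per_sec: float = 0.5,     # hypothetical sustained rate
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, caller_id: str) -> bool:
        now = self.clock()
        bucket = self._buckets.get(caller_id)
        if bucket is None:
            bucket = TokenBucket(self.capacity, self.refill_per_sec, now)
            self._buckets[caller_id] = bucket
        return bucket.allow(now)
```

This sketch keeps one bucket per identifier forever, which is itself an unbounded allocation. A production version would likely evict idle buckets or move the state to shared storage so the limit holds across workers.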

Anchor — shape, not implementation:

require(len(user_text) <= MAX_CHARS)
require(rate_limiter.allow(user_id))
history = trim(history, MAX_TURNS)
resp = llm.call(history + [user_text], max_tokens=512, timeout=30)
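
For readers who want something runnable, the same shape might look like the Python sketch below. Everything here is illustrative: `handle_prompt`, `RequestRejected`, the limit constants, and the `allow` and `call_llm` callables are hypothetical stand-ins, and `call_llm` is assumed to accept `max_tokens` and `timeout` keyword arguments the way the pseudocode does. A real handler would pass its own SDK client and rate limiter.

```python
from typing import Callable, List

MAX_CHARS = 4_000          # assumed input cap, in characters
MAX_TURNS = 10             # assumed history budget, in turns
MAX_OUTPUT_TOKENS = 512    # explicit output cap, as in the anchor
TIMEOUT_SECONDS = 30.0     # explicit deadline, as in the anchor


class RequestRejected(Exception):
    """Raised before any upstream call is made."""


def trim_history(history: List[str], max_turns: int) -> List[str]:
    """Keep only the most recent `max_turns` entries."""
    # history[-0:] would return the whole list, so handle zero explicitly.
    return history[-max_turns:] if max_turns > 0 else []


def handle_prompt(
    user_id: str,
    user_text: str,
    history: List[str],
    allow: Callable[[str], bool],
    call_llm: Callable[..., str],
) -> str:
    # 1. Length cap at the handler boundary.
    if len(user_text) > MAX_CHARS:
        raise RequestRejected("prompt too long")
    # 2. Per-identifier throttle.
    if not allow(user_id):
        raise RequestRejected("rate limit exceeded")
    # 3. Bounded context.
    messages = trim_history(history, MAX_TURNS) + [user_text]
    # 4. Output cap and deadline on the upstream call.
    return call_llm(messages, max_tokens=MAX_OUTPUT_TOKENS, timeout=TIMEOUT_SECONDS)
```

Both rejections happen before `call_llm` runs, so an oversized or throttled request never consumes inference capacity.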

Verification

Confirm the following properties hold (language-agnostic):

Every LLM API call sets an explicit output cap on generated tokens — never left to the provider default
Caller-supplied prompt text is length-capped (chars, bytes, or tokens) and rejected at the handler boundary before reaching the inference client
Conversation context fed to the model is bounded: either the handler is single-turn and stores no history at all, or any accumulated history is trimmed to a fixed turn/token budget before the call
Every LLM endpoint enforces a per-identifier throttle (per user, per API key, or per IP) through any mechanism — in-process bucket, framework middleware, reverse-proxy rule — not just a global concurrency cap
Every inference call runs under an explicit deadline expressed through any available mechanism — SDK timeout parameter, HTTP client read/write timeout, request context or cancellation deadline, or framework-level request timeout — so a hung upstream cannot pin a worker indefinitely
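
Part of this checklist can be approximated mechanically. The sketch below uses Python's standard `ast` module to flag calls that look like LLM invocations but pass no output-cap keyword. It is a rough heuristic covering only the first property: the method names in `LLM_CALL_NAMES` and the keyword names in `OUTPUT_CAP_KWARGS` are assumptions to adapt to the SDK in use, and calls that splat `**kwargs` are skipped because the cap cannot be determined statically.

```python
import ast
from typing import List, Tuple

# Assumed names; adjust to the client library under review.
LLM_CALL_NAMES = {"create", "generate", "complete"}
OUTPUT_CAP_KWARGS = {"max_tokens", "max_output_tokens", "max_completion_tokens"}


def find_uncapped_calls(source: str) -> List[Tuple[int, str]]:
    """Return (line, name) for LLM-looking calls with no output-cap keyword."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Attribute):
            name = func.attr            # e.g. client.messages.create
        elif isinstance(func, ast.Name):
            name = func.id              # e.g. generate(...)
        else:
            continue
        if name not in LLM_CALL_NAMES:
            continue
        kwarg_names = {kw.arg for kw in node.keywords}
        if None in kwarg_names:
            # A **kwargs splat hides the arguments; leave it for manual review.
            continue
        if not (kwarg_names & OUTPUT_CAP_KWARGS):
            findings.append((node.lineno, name))
    return sorted(findings)
```

The other properties (input cap, bounded history, throttling, deadlines) depend on control flow and deployment configuration, so they still need review by hand or by tests.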

References

CWE-400 (Uncontrolled Resource Consumption)
CWE-770 (Allocation of Resources Without Limits or Throttling)
OWASP LLM10:2025 Unbounded Consumption (formerly LLM04 Model Denial of Service in the earlier OWASP Top 10 for LLM Applications)