token-budget
Plan, implement, and monitor token budgets for LLM-based applications. Covers three areas — context budgets (fitting content into context windows), cost budgets (tracking spend per user/service/month), and quota/rate budgets (rate limiting LLM calls). Use this skill whenever the user mentions token budgets, token limits, LLM costs, context window management, token tracking, API budgets, rate limiting for LLMs, token usage monitoring, or wants to control how much context is sent to an LLM. Also trigger when the user is building a ChatModelListener, token counter, or usage tracker for LangChain4j/Quarkus.
Install via:

```shell
npx claudepluginhub mgoericke/javamark-claude-plugins --plugin token-budget
```

This skill uses the workspace's default tool permissions.
Help plan, implement, and monitor token budgets in LLM-based applications. Token budgets come in three flavors, often used in combination:

- Context budgets: fit content into the model's context window
- Cost budgets: track spend per user, service, and month
- Quota/rate budgets: rate-limit LLM calls
Each flavor addresses a different concern but they share infrastructure (token counting, tracking, configuration). This skill guides the user through an interview to understand their needs, then generates the appropriate Quarkus/LangChain4j code.
Before generating code, ask these questions to understand the scope. Skip questions that are already answered from context.
- Which use case? Context / Cost / Quota / Combination?
- Which LLM provider? Ollama / LM Studio / OpenAI / Anthropic / Azure OpenAI / other?
- How many services/agents make LLM calls?
- Is there a monthly monetary budget?
- Which framework? Quarkus + LangChain4j (primary) / Spring + LangChain4j / other?
- Should token usage be persisted?
After the interview, determine which budget types to generate and read the corresponding reference file(s) from references/ for implementation details.
All three budget types share a common entry point — the ChatModelListener from LangChain4j. This CDI bean intercepts every LLM call and provides access to input/output token counts.
```java
import dev.langchain4j.model.chat.listener.ChatModelListener;
import dev.langchain4j.model.chat.listener.ChatModelResponseContext;
import dev.langchain4j.model.output.TokenUsage;
import jakarta.enterprise.context.ApplicationScoped;

@ApplicationScoped
public class TokenTrackingListener implements ChatModelListener {

    @Override
    public void onResponse(ChatModelResponseContext context) {
        TokenUsage usage = context.response().tokenUsage();
        int inputTokens = usage.inputTokenCount();
        int outputTokens = usage.outputTokenCount();
        // Route to the appropriate budget handler(s)
    }
}
```
This listener is the foundation. Each budget type adds its own logic on top.
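In Quarkus, the routing is typically done by firing a CDI event that each budget handler observes. The framework-free sketch below illustrates the idea with a hypothetical `TokenUsageEvent` and a hand-rolled bus standing in for CDI's `Event<T>.fire(...)`; all names here are illustrative assumptions, not LangChain4j or Quarkus API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical event carrying the per-call token counts (names are illustrative)
record TokenUsageEvent(String model, int inputTokens, int outputTokens) {
    int totalTokens() { return inputTokens + outputTokens; }
}

// Minimal in-process dispatcher standing in for CDI's Event<TokenUsageEvent>
class TokenUsageBus {
    private final List<Consumer<TokenUsageEvent>> handlers = new ArrayList<>();

    // A budget handler (cost tracker, rate limiter, ...) registers itself here
    void subscribe(Consumer<TokenUsageEvent> handler) {
        handlers.add(handler);
    }

    // The listener publishes one event per LLM response; every handler sees it
    void publish(TokenUsageEvent event) {
        handlers.forEach(h -> h.accept(event));
    }
}
```

With CDI, the same shape falls out naturally: the listener injects `Event<TokenUsageEvent>` and each budget service declares an `@Observes TokenUsageEvent` method, so new budget types plug in without touching the listener.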
Before sending a request, you often need to estimate how many tokens the content will use. The skill generates a TokenEstimator utility:
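A minimal sketch of such a utility, assuming a rough four-characters-per-token heuristic for English text; the generated version may instead use a provider-specific tokenizer, and the class shape here is illustrative, not the skill's exact output.

```java
// Rough pre-flight token estimator (assumption: ~4 characters per token,
// a common heuristic for English text; not exact for any real tokenizer)
final class TokenEstimator {

    private static final double CHARS_PER_TOKEN = 4.0;

    static int estimate(String text) {
        if (text == null || text.isEmpty()) {
            return 0;
        }
        // Round up so the estimate errs on the safe side of the budget
        return (int) Math.ceil(text.length() / CHARS_PER_TOKEN);
    }

    // Convenience check before adding content to a prompt
    static boolean fitsBudget(String text, int maxTokens) {
        return estimate(text) <= maxTokens;
    }
}
```

Because the heuristic overestimates or underestimates depending on language and content, leave headroom (e.g. budget only 80-90% of the window) rather than filling the context to the estimated limit.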
Based on the interview answers, read the appropriate reference file(s) and generate code:
| Use Case | Reference File | Key Artifacts |
|---|---|---|
| Context Budget | references/context-budget.md | TokenBudgetService, TokenEstimator, prioritization config |
| Cost Budget | references/cost-budget.md | TokenUsageTracker, REST endpoints, alert events, Flyway migration |
| Quota Budget | references/quota-budget.md | TokenRateLimiter, priority queue, Prometheus metrics |
- `AtomicLong` / `ConcurrentHashMap` for in-memory tracking
- `ChatModelListener` abstracts the provider
- `quarkus-micrometer-registry-prometheus` for metrics
- `application.properties` with sensible defaults

Generated code follows these patterns:
- Package `{project.package}.token` (or `{module}.control` if using BCE)
- `@ApplicationScoped` for all services
- `@ConfigProperty` with prefix `token-budget`
- `Log.infof()` for budget consumption, `Log.warnf()` for threshold alerts
- `@QuarkusTest` with REST Assured for endpoint testing
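As a hedged illustration of the `token-budget` prefix in practice, a generated `application.properties` might look like the fragment below; the key names under the prefix and the default values are assumptions, not the skill's exact output.

```properties
# Token-budget configuration (key names and values below are illustrative)
token-budget.context.max-tokens=8000
token-budget.cost.monthly-limit-usd=100.00
token-budget.cost.alert-threshold-percent=80
token-budget.quota.tokens-per-minute=10000
```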