Investigates PostHog LLM evaluations (Hog code-based and LLM-judge): lists/inspects/manages configs, runs and tests them against generations, queries pass/fail results, and summarizes patterns for debugging failures and common failure modes.
`npx claudepluginhub anthropics/claude-plugins-official --plugin posthog`

This skill uses the workspace's default tool permissions.
PostHog evaluations score $ai_generation events. Each evaluation is one of two types,
both first-class:
- hog — deterministic Hog code that returns true/false (and optionally N/A).
  Best for objective rule-based checks: format validation (JSON parses, schema matches),
  length limits, keyword presence/absence, regex patterns, structural assertions, latency
  thresholds, cost guards. Cheap, fast, reproducible — no LLM call per run. Prefer this
  when the criterion can be expressed as code.
- llm_judge — an LLM scores generations against a prompt you write. Best for
  subjective or fuzzy checks: tone, helpfulness, hallucination detection, off-topic
  drift, instruction-following. Costs an LLM call per run and requires AI data
  processing approval at the org level.

Results from both types land in ClickHouse as $ai_evaluation events with the same
schema, so the read/query/summary workflows are identical regardless of evaluator type —
the only thing that changes is whether $ai_evaluation_reasoning was written by Hog
code or by an LLM.
This skill covers the full lifecycle: list/inspect/manage evaluation configs (Hog or LLM judge), run them on specific generations, query individual results, and get an AI-generated summary of pass/fail/N/A patterns across many runs.
| Tool | Purpose |
|---|---|
| posthog:llma-evaluation-list | List/search evaluation configs (filter by name, enabled flag) |
| posthog:llma-evaluation-get | Get a single evaluation config by UUID |
| posthog:llma-evaluation-create | Create a new llm_judge or hog evaluation |
| posthog:llma-evaluation-update | Update an existing evaluation (name, prompt, enabled, …) |
| posthog:llma-evaluation-delete | Soft-delete an evaluation |
| posthog:llma-evaluation-run | Run an evaluation against a specific $ai_generation event |
| posthog:llma-evaluation-test-hog | Dry-run Hog source against recent generations (no save) |
| posthog:llma-evaluation-summary-create | AI-powered summary of pass/fail/N/A patterns across runs |
| posthog:execute-sql | Ad-hoc HogQL over $ai_evaluation events |
| posthog:query-llm-trace | Drill into the underlying generation that an evaluation scored |
All llma-evaluation-* tools are defined in products/llm_analytics/mcp/tools.yaml.
Every run of an evaluation emits an $ai_evaluation event. Key properties:
| Property | Meaning |
|---|---|
| $ai_evaluation_id | UUID of the evaluation config |
| $ai_evaluation_name | Human-readable name |
| $ai_target_event_id | UUID of the $ai_generation event being scored |
| $ai_trace_id | Parent trace ID (for jumping to the trace UI) |
| $ai_evaluation_result | true = pass, false = fail |
| $ai_evaluation_reasoning | Free-text explanation (set by the LLM judge or Hog code) |
| $ai_evaluation_applicable | false when the evaluator decided the generation is N/A |
When $ai_evaluation_applicable = false, the run counts as N/A regardless of $ai_evaluation_result.
For evaluations that don't support N/A, this property may be null — treat null as "applicable".
This works the same way for llm_judge and hog evaluations — the differences only matter
when you eventually go to fix the evaluator (edit the prompt vs. edit the Hog source).
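For example, to see how recent runs split across pass/fail/N/A with that rule applied, a query along these lines should work (placeholder UUID; multiIf is a standard ClickHouse/HogQL function):

posthog:execute-sql
SELECT
    multiIf(
        -- explicit false means N/A; null falls through and counts as applicable
        properties.$ai_evaluation_applicable = false, 'na',
        properties.$ai_evaluation_result = true, 'pass',
        'fail'
    ) AS bucket,
    count() AS runs
FROM events
WHERE event = '$ai_evaluation'
  AND properties.$ai_evaluation_id = '<evaluation_uuid>'
  AND timestamp >= now() - INTERVAL 7 DAY
GROUP BY bucket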
posthog:llma-evaluation-list
{ "search": "hallucination", "enabled": true }
Look at the returned id, name, evaluation_type, and either:
- evaluation_config.prompt for an llm_judge
- evaluation_config.source for a hog evaluator

The Hog source is the ground truth for why a hog evaluator passes or fails — read it before assuming the failure is in the generation.
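If you already have the config's UUID (for example from its URL in the PostHog UI), fetch it directly with llma-evaluation-get; the parameter name below is an assumption, so check the tool schema if the call is rejected:

posthog:llma-evaluation-get
{ "evaluationId": "<evaluation_uuid>" }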
posthog:llma-evaluation-summary-create
{
"evaluation_id": "<uuid>",
"filter": "fail"
}
Returns:
- overall_assessment — natural-language summary
- fail_patterns — grouped patterns with title, description, frequency, and example_generation_ids
- pass_patterns and na_patterns — same shape, populated when filter includes them
- recommendations — actionable next steps
- statistics — total_analyzed, pass_count, fail_count, na_count

The endpoint analyses the most recent ~250 runs (EVALUATION_SUMMARY_MAX_RUNS).
Results are cached for one hour per (evaluation_id, filter, set_of_generation_ids).
Pass force_refresh: true to recompute.
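For example, to recompute the failure summary instead of reusing a cached result:

posthog:llma-evaluation-summary-create
{
  "evaluation_id": "<uuid>",
  "filter": "fail",
  "force_refresh": true
}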
Compare filters in two calls to spot what's distinctive about failures vs passes:
posthog:llma-evaluation-summary-create
{ "evaluation_id": "<uuid>", "filter": "pass" }
Then diff the pass_patterns against the fail_patterns from the failure summary above.
Each pattern surfaces example_generation_ids. Pull the underlying trace for the most
representative example:
posthog:query-llm-trace
{ "traceId": "<trace_id>", "dateRange": {"date_from": "-30d"} }
(If you only have a generation ID, query for it via execute-sql first to find the
parent trace ID — see below.)
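A minimal sketch of that lookup, reusing the $ai_evaluation schema above (the run that produced the example already carries both IDs):

posthog:execute-sql
SELECT properties.$ai_trace_id AS trace_id
FROM events
WHERE event = '$ai_evaluation'
  AND properties.$ai_target_event_id = '<generation_id>'
  AND timestamp >= now() - INTERVAL 30 DAY
LIMIT 1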
The summary is LLM-generated and should be verified. Use execute-sql to count and
spot-check:
posthog:execute-sql
SELECT
properties.$ai_target_event_id AS generation_id,
properties.$ai_trace_id AS trace_id,
properties.$ai_evaluation_reasoning AS reasoning,
timestamp
FROM events
WHERE event = '$ai_evaluation'
AND properties.$ai_evaluation_id = '<evaluation_uuid>'
AND properties.$ai_evaluation_result = false
AND (
properties.$ai_evaluation_applicable IS NULL
OR properties.$ai_evaluation_applicable != false
)
AND timestamp >= now() - INTERVAL 7 DAY
ORDER BY timestamp DESC
LIMIT 25
The N/A guard (IS NULL OR != false) is important — it matches the same logic the
backend uses to bucket runs.
Use this when the user pastes a trace/generation URL and asks "what would evaluation X say about this?".
posthog:llma-evaluation-run
{
"evaluationId": "<eval_uuid>",
"target_event_id": "<generation_event_uuid>",
"timestamp": "2026-04-01T19:39:20Z",
"event": "$ai_generation"
}
The timestamp is required for an efficient ClickHouse lookup of the target event.
Pass distinct_id if you have it — it speeds up the lookup further.
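For example, the same call with the optional distinct_id included:

posthog:llma-evaluation-run
{
  "evaluationId": "<eval_uuid>",
  "target_event_id": "<generation_event_uuid>",
  "timestamp": "2026-04-01T19:39:20Z",
  "event": "$ai_generation",
  "distinct_id": "<person_distinct_id>"
}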
Reach for a Hog evaluator first when the criterion is rule-based — it's cheaper, faster, and
reproducible. Prototype with llma-evaluation-test-hog (no save):
posthog:llma-evaluation-test-hog
{
"source": "return event.properties.$ai_output_choices[1].content contains 'sorry';",
"sample_count": 5,
"allows_na": false
}
The handler returns the boolean result for each of the most recent N $ai_generation
events. Iterate on the source until it behaves as expected, then promote it via
llma-evaluation-create:
posthog:llma-evaluation-create
{
"name": "Output is valid JSON",
"description": "Fails when the assistant message can't be parsed as JSON",
"evaluation_type": "hog",
"evaluation_config": {
"source": "let raw := event.properties.$ai_output_choices[1].content; try { jsonParseStr(raw); return true; } catch { return false; }"
},
"output_type": "boolean",
"enabled": true
}
Hog evaluators have full access to the event and its properties — common patterns include schema validation, length/token limits, regex matches, and tool-call shape checks. Because they're deterministic, results are reproducible across reruns and trivially diff-able.
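As another sketch, a latency guard can be prototyped the same way; this assumes your generations carry the standard $ai_latency property (in seconds), so verify the property name against your data first:

posthog:llma-evaluation-test-hog
{
  "source": "return event.properties.$ai_latency < 5;",
  "sample_count": 5,
  "allows_na": false
}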
Use an LLM judge when the criterion is fuzzy and a code rule would be brittle (tone, factuality,
helpfulness, on-topic-ness). There's no equivalent of llma-evaluation-test-hog for LLM
judges — the typical loop is to create the evaluator with enabled: false, run it
manually against a handful of representative generations via llma-evaluation-run, inspect
the results, refine the prompt with llma-evaluation-update, and then flip enabled: true
when you're satisfied:
posthog:llma-evaluation-create
{
"name": "Response stays on-topic",
"description": "LLM judge — fails if the assistant changes topic from the user's question",
"evaluation_type": "llm_judge",
"evaluation_config": {
"prompt": "You are evaluating whether the assistant's reply stays on-topic relative to the user's most recent question. Return true if it does, false if the assistant changed the subject. Return N/A if the user did not actually ask a question."
},
"output_type": "boolean",
"output_config": { "allows_na": true },
"model_configuration": {
"provider": "openai",
"model": "gpt-5-mini"
},
"enabled": false
}
Then dry-run against a known-good and a known-bad generation:
posthog:llma-evaluation-run
{
"evaluationId": "<new_eval_uuid>",
"target_event_id": "<generation_uuid>",
"timestamp": "2026-04-01T19:39:20Z"
}
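Once the prompt behaves the way you want, flip the evaluator on. A minimal sketch, assuming llma-evaluation-update takes the same ID parameter as llma-evaluation-run (check the tool schema if not):

posthog:llma-evaluation-update
{
  "evaluationId": "<new_eval_uuid>",
  "enabled": true
}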
LLM judges require organisation AI data processing approval. Hog evaluators do not.
| Action | Tool |
|---|---|
| Add a Hog evaluator | llma-evaluation-create with evaluation_type: "hog" and evaluation_config.source |
| Add an LLM-judge evaluator | llma-evaluation-create with evaluation_type: "llm_judge", evaluation_config.prompt, and a model_configuration |
| Tweak the source or prompt | llma-evaluation-update (edits evaluation_config.source for Hog, evaluation_config.prompt for LLM judge) |
| Toggle N/A handling | llma-evaluation-update with output_config.allows_na |
| Disable temporarily | llma-evaluation-update with enabled: false |
| Remove | llma-evaluation-delete (soft-delete via PATCH {deleted: true}) |
llm_judge evaluations require AI data processing approval at the org level
(is_ai_data_processing_approved). The same gate applies to
llma-evaluation-summary-create. Hog evaluations do not require this gate
— they run as plain code on the ingestion pipeline.
Reach for Hog by default. Switch to LLM judge only when the criterion can't be expressed as code.
| Use Hog when… | Use LLM judge when… |
|---|---|
| The check is structural (JSON parses, schema matches) | The check is about meaning (on-topic, helpful, factual) |
| You need a deterministic, reproducible result | A small amount of judgement variability is acceptable |
| The criterion is cheap to compute | The criterion requires reading and understanding text |
| You can't get AI data processing approval | You have approval and the criterion is genuinely fuzzy |
| You need to enforce a hard limit (length, cost, etc.) | You need to rate a quality dimension |
| You want sub-millisecond evaluation | A few hundred milliseconds + LLM cost are acceptable |
A common pattern is to layer them: a Hog evaluator gates obvious format/length
violations cheaply, and an LLM-judge evaluator only fires on the generations that pass
the Hog gate (via conditions).
The summarisation tool works the same way regardless of whether the evaluator is hog
or llm_judge — it analyses the resulting $ai_evaluation events, not the evaluator
itself. The fix path differs (edit Hog source vs. edit prompt) but the diagnosis is
identical.
A typical debugging pass:

1. llma-evaluation-list — confirm the evaluation is still enabled and unchanged
   (compare evaluation_config.source or evaluation_config.prompt to the version you expect)
2. llma-evaluation-summary-create with filter: "fail" — get the dominant
   failure patterns and example IDs
3. execute-sql — count fails per day to confirm the regression window:
SELECT toDate(timestamp) AS day, count() AS fails
FROM events
WHERE event = '$ai_evaluation'
AND properties.$ai_evaluation_id = '<uuid>'
AND properties.$ai_evaluation_result = false
AND timestamp >= now() - INTERVAL 30 DAY
GROUP BY day
ORDER BY day
4. query-llm-trace — drill into a representative trace per pattern
filter: "pass", one with filter: "fail"pass_patterns and fail_patterns describe similar content:
llm_judge: the prompt or rubric is probably ambiguous — reword
evaluation_config.prompt and use llma-evaluation-updatehog evaluator: the rule is probably under- or over-matching — read the
source via llma-evaluation-get, narrow the predicate, and retest with
llma-evaluation-test-hog before pushing the fix via llma-evaluation-updateHog evaluators are reproducible — if the source hasn't changed, identical inputs should yield identical outputs. When fail rates jump for a Hog evaluator:
llma-evaluation-get — note the current source and updated_atllma-evaluation-test-hog with a
modified conditions filter that targets themposthog:llma-evaluation-summary-create
{ "evaluation_id": "<uuid>", "filter": "na" }
Inspect na_patterns to see whether the N/A logic is doing the right thing. If a
pattern in na_patterns looks like something that should have been scored:
llm_judge: the applicability instruction in the prompt is too broad — narrow
ithog evaluator with output_config.allows_na: true: the source is returning
null (or whatever the N/A signal is) too eagerly — tighten the preconditionllma-evaluation-run with the trace's generation ID and timestamp. Useful for spot-checking
or wiring evaluations into a larger agent loop.
Useful UI links:

- https://app.posthog.com/llm-analytics/evaluations
- https://app.posthog.com/llm-analytics/evaluations/<evaluation_id>
- For trace URLs, follow the exploring-llm-traces skill's URL conventions

Always surface the relevant link so the user can verify in the UI.
- Summary results are cached, so repeat calls with the same (evaluation_id, filter) are cheap; use
  force_refresh: true only when you genuinely need fresh analysis
- Pass generation_ids: [...] to scope a summary to a specific cohort of runs (max 250) (see the example after this list)
- The statistics block in the summary response is computed from raw data, not the LLM
  — trust those counts even if a pattern's frequency field is qualitative
- If you need to filter configs beyond what llma-evaluation-list supports (e.g. by author or model
  configuration), fall back to execute-sql against the evaluations Postgres table or
  the $ai_evaluation ClickHouse events
- llma-evaluation-* tools use evaluation:read for read tools and evaluation:write for
  mutating tools; llma-evaluation-summary-create uses llm_analytics:write
- Re-running llma-evaluation-test-hog
  with the suspect source against the failing generations is the fastest way to bisect
  whether the change is in the evaluator or in the producer of the generations
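For example, a cohort-scoped, freshly computed failure summary (hypothetical generation IDs):

posthog:llma-evaluation-summary-create
{
  "evaluation_id": "<uuid>",
  "filter": "fail",
  "generation_ids": ["<generation_id_1>", "<generation_id_2>"],
  "force_refresh": true
}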