Enables AI agents to use Braintrust for LLM evaluation, logging, and observability. Includes scripts for querying logs with SQL, running evals, and logging data.
/plugin marketplace add braintrustdata/braintrust-claude-plugin
/plugin install braintrust@braintrust-claude-plugin

This skill inherits all available tools. When active, it can use any tool Claude has access to.

Bundled scripts:
scripts/log_data.py
scripts/query_logs.py
scripts/run_eval.py

Braintrust is a platform for evaluating, logging, and monitoring LLM applications.
Use the query_logs.py script to run SQL queries against Braintrust logs.
Always share the SQL query you used when reporting results, so the user understands what was executed.
Script location: scripts/query_logs.py (relative to this file)
Run from the user's project directory (where .env with BRAINTRUST_API_KEY exists):
uv run /path/to/scripts/query_logs.py --project "Project Name" --query "SQL_QUERY"
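For example, a full invocation might look like this (the project name is hypothetical, and /path/to/ stays whatever your actual install path is):
uv run /path/to/scripts/query_logs.py --project "My Chatbot" --query "SELECT input, output, created FROM logs ORDER BY created DESC LIMIT 5"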
Count logs from last 24 hours:
SELECT count(*) as count FROM logs WHERE created > now() - interval 1 day
Get recent logs:
SELECT input, output, created FROM logs ORDER BY created DESC LIMIT 10
Filter by metadata:
SELECT input, output FROM logs WHERE metadata.user_id = 'user123' LIMIT 20
Filter by time range:
SELECT * FROM logs WHERE created > now() - interval 7 day LIMIT 50
Aggregate by field:
SELECT metadata.model, count(*) as count FROM logs GROUP BY metadata.model
Group by hour:
SELECT hour(created) as hr, count(*) as count FROM logs GROUP BY hour(created)
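These pieces compose. Here is a sketch that combines a time filter, hourly grouping, and an aggregation; it assumes your logs have a scores.Factuality field, as in the nested-field examples below:
SELECT hour(created) as hr, count(*) as count, avg(scores.Factuality) as avg_factuality FROM logs WHERE created > now() - interval 1 day GROUP BY hour(created)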
Syntax notes:
Time functions: use hour(), day(), month(), year() instead of date_trunc(). Write hour(created), not date_trunc('hour', created).
Intervals: interval 1 day, interval 7 day, interval 1 hour (no quotes, singular unit).
Nested fields: metadata.user_id, scores.Factuality, metrics.duration.
Table: always FROM logs (the script handles project scoping).
Operators: =, !=, >, <, >=, <=, IS NULL, IS NOT NULL, LIKE 'pattern%', AND, OR, NOT.
Aggregations: count(*), count(field), avg(field), sum(field), min(field), max(field).
Time filters: created > now() - interval 1 day, created > now() - interval 7 day, created > now() - interval 1 hour.

Use scripts/log_data.py to log data to a project:
uv run /path/to/scripts/log_data.py --project "Project Name" --input "query" --output "response"
With metadata:
--input "query" --output "response" --metadata '{"user_id": "123"}'
Batch from JSON:
--data '[{"input": "a", "output": "b"}, {"input": "c", "output": "d"}]'
Use scripts/run_eval.py to run evaluations:
uv run /path/to/scripts/run_eval.py --project "Project Name" --data '[{"input": "test", "expected": "test"}]'
From file:
--data-file test_cases.json --scorer factuality
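A sketch of what test_cases.json for --data-file might contain, mirroring the inline --data format shown above (the input/expected field names come from that example; the specific questions are made up for illustration):
[
  {"input": "What is 2+2?", "expected": "4"},
  {"input": "What is the capital of France?", "expected": "Paris"}
]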
Create a .env file in your project directory:
BRAINTRUST_API_KEY=your-api-key-here
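The scripts are documented here as reading the key from a .env file; if you would rather not create one, exporting the variable in your shell should typically work as well, since the Braintrust SDK also picks up BRAINTRUST_API_KEY from the environment (this alternative is an assumption, not something the scripts guarantee):
export BRAINTRUST_API_KEY=your-api-key-here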
For custom evaluation logic, use the SDK directly.
IMPORTANT: First argument to Eval() is the project name (positional).
import braintrust
from autoevals import Factuality
braintrust.Eval(
    "My Project",  # Project name (required, positional)
    data=lambda: [{"input": "What is 2+2?", "expected": "4"}],
    task=lambda input: my_llm_call(input),  # my_llm_call stands in for your own LLM call
    scores=[Factuality],
)
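For custom evaluation logic, a scorer can also be a plain function passed in scores. This is a minimal sketch, assuming the usual Braintrust convention of returning a score between 0 and 1 (the exact_match scorer here is made up for illustration):

import braintrust

def exact_match(input, output, expected):
    # Hypothetical scorer: 1.0 if the output matches the expected answer exactly, else 0.0
    return 1.0 if output == expected else 0.0

braintrust.Eval(
    "My Project",  # Project name is still the first positional argument
    data=lambda: [{"input": "What is 2+2?", "expected": "4"}],
    task=lambda input: my_llm_call(input),  # your own LLM call
    scores=[exact_match],
)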
Common mistakes:
Eval(project_name="My Project", ...) - Wrong!
Eval(name="My Project", ...) - Wrong!
Eval("My Project", data=..., task=..., scores=...) - Correct!

To log directly with the SDK:

import braintrust
logger = braintrust.init_logger(project="My Project")
logger.log(input="query", output="response", metadata={"user_id": "123"})
logger.flush() # Always flush!
Remember to call logger.flush() after logging, and keep a .env file with BRAINTRUST_API_KEY=your-key in your project directory.