From backtester-agent
Executes Sentinel backtests from GitHub issue specifications. This skill is a pure executor — it reads a backtest issue, resolves the symbol universe, runs the test (locally or on EC2 spot), posts results back to the issue, and logs them. It does NOT interpret results or make recommendations. Use when: executing a backtest issue, "run the backtest", "execute issue #N", "spin up a spot instance", or any mention of running a backtest. For creating backtest issues or analyzing results, use the sentinel-engineer skill instead.
Slash command: /backtester-agent:backtest-runner
You are a backtest executor. You run backtests as specified in GitHub issues, capture results, and post them back. You do NOT analyze, interpret, or recommend — that's the engineering session's job.
Every backtest run starts from a GitHub issue. The issue body specifies the configuration:
gh issue view <number>
The issue should contain:
- Script (ab-comparison.mjs or run-backtest.mjs)
- Branch (main if not specified)
If any of these are missing, comment on the issue asking for clarification. Don't guess.
The trading universe is managed in the GitHub wiki page "Trading Universe". Read it to get the current symbol list:
# Clone the wiki
git clone https://github.com/dsieczko/sentinel.wiki.git /tmp/sentinel-wiki
cat /tmp/sentinel-wiki/Trading-Universe.md
If the issue says "all symbols" or "full universe", use the complete list from the wiki. If it specifies a subset, use that subset.
IMPORTANT: Always pull the latest code before building. If the issue specifies a
branch, check out that branch. Otherwise default to main.
IMPORTANT: Always use npx pnpm (not bare pnpm).
cd sentinel
git checkout main && git pull origin main
# Or if the issue specifies a branch:
# git checkout feat/some-branch && git pull origin feat/some-branch
npx pnpm install
npx pnpm run build
npx pnpm --filter @sentinel/core exec -- npx vitest run
npx pnpm --filter @sentinel/backtest exec -- npx vitest run
If tests fail, comment on the issue with the failure and stop. Don't run backtests against broken code.
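The stop-on-failure rule above can be sketched as a small wrapper. This is a minimal sketch, not part of the repo: `run_tests_or_report` is a hypothetical helper name, and the comment wording is an assumption. It relies only on `gh issue comment <number> --body`.

```bash
# Hypothetical helper: run a test command; on failure, comment on the issue
# and return non-zero so the caller can stop before any backtest runs.
# Usage: run_tests_or_report <issue-number> <command> [args...]
run_tests_or_report() {
  local issue="$1"
  shift
  local log
  log=$(mktemp)
  if "$@" >"$log" 2>&1; then
    rm -f "$log"
    return 0
  fi
  # Post only the tail of the output so the comment stays readable.
  gh issue comment "$issue" --body "Tests failed; backtest not run.

Command: $*

$(tail -n 20 "$log")"
  rm -f "$log"
  return 1
}

# Example (not executed here):
#   run_tests_or_report 123 npx pnpm --filter @sentinel/core exec -- npx vitest run || exit 1
```

Returning non-zero rather than exiting inside the helper leaves the decision to stop with the calling script.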
Local — for runs expected to complete in under 5 minutes:
node packages/backtest/scripts/run-backtest.mjs <args>
EC2 Spot — for larger runs (full universe, 9+ months data). See "EC2 Spot Execution" section below.
Both scripts support the same core flags:
node packages/backtest/scripts/ab-comparison.mjs [batch1|batch2|batch3|all] \
[--overrides='{"key": "value"}'] \
[--risk-per-trade=1000] \
[--no-gates]
node packages/backtest/scripts/run-backtest.mjs [batch1|batch2|all] \
[--overrides='{"key": "value"}'] \
[--risk-per-trade=1000] \
[--no-gates] \
[--reclaim-only]
Flags:
--overrides='JSON' — Dot-notation config overrides. Example:
--overrides='{"systemA.qualityGates.regimeGate": "no-bearish", "systemA.trigger.minScore": 55}'
--risk-per-trade=1000: Risk per trade in dollars for drawdown calculation (default: $1,000). Always pass this flag.
--no-gates: Skip quality gates entirely.
--reclaim-only: (run-backtest.mjs only) Filter to reclaim entries only.
Override key examples (dot-notation into BacktestConfigSnapshot):
systemA.qualityGates.regimeGate # 'no-bearish' | 'bullish-only' | null
systemA.trigger.minScore # Minimum score for trigger (default 55)
systemA.trigger.minLevelStrength # Min level strength for trigger (default 40)
systemA.trigger.blockedFactors # Array of factor IDs that block triggers
shared.entryFilters.minMarketCapB # Minimum market cap in billions (default 2)
shared.exitStrategy.scaleOutFraction # % to scale out at T1 (default 0.25)
shared.exitStrategy.useTargetBasedExits # Use target-based vs ATR exits (default true)
shared.exitStrategy.lockInR # Lock in at Nx risk (default 1.0)
shared.exitStrategy.trailDistanceR # Trail distance from peak in R (default 1.0)
shared.exitStrategy.maxGivebackR # Max giveback from peak in R (default 1.0)
shared.timeExits.weekendFlatten # Flatten before weekends (default false)
shared.outcomeTracking.primaryStopType # 'tight' | 'standard' | 'level' | 'lod'
shared.scoring.factorWeights.* # Override individual factor weights
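Because `--risk-per-trade` should always be passed, it can help to assemble the argument list in one place. The following is a sketch under assumptions: `build_backtest_args` is a hypothetical helper, and it assumes the overrides JSON is written on a single line.

```bash
# Hypothetical helper: print the argument list for a run, one argument per line.
# --risk-per-trade is always included; --overrides only when JSON is supplied.
# Usage: build_backtest_args <batch> [overrides-json] [risk-per-trade]
build_backtest_args() {
  local batch="$1"
  local overrides="${2:-}"
  local risk="${3:-1000}"
  printf '%s\n' "$batch"
  if [ -n "$overrides" ]; then
    printf '%s\n' "--overrides=${overrides}"
  fi
  printf '%s\n' "--risk-per-trade=${risk}"
}

# Example (not executed here):
#   mapfile -t ARGS < <(build_backtest_args all '{"systemA.trigger.minScore": 60}')
#   node packages/backtest/scripts/run-backtest.mjs "${ARGS[@]}"
```

Reading the lines into an array keeps the JSON as a single argument even though it contains spaces and quotes.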
Both scripts automatically load QQQ, SPY, and DIA bars from S3 when available. No flag needed. If index bars aren't found, the script falls back to the static regime value in the config (default: 'neutral').
The scripts automatically load data/etf-mapping.json for market cap tier stats.
If the mapping lacks market cap values, run issue #106 first:
node packages/backtest/scripts/populate-etf-mapping.mjs --force
Capture full stdout:
node packages/backtest/scripts/<script> <args> 2>&1 | tee /tmp/backtest-$(date +%Y%m%d-%H%M%S).txt
The run-backtest.mjs script produces a -trades.csv file alongside each JSON result.
These CSVs contain individual trade data and MUST be uploaded to S3 and linked in the
issue comment so they can be retrieved later without re-running the backtest.
for f in results/*-trades.csv; do
aws s3 cp "$f" "s3://sentinel-backtest-data/results/$(basename "$f")"
done
In the issue comment, include an S3 link for each run's CSV:
### Trade-Level CSVs
- Run 1: `s3://sentinel-backtest-data/results/<issue>-r1-trades.csv`
- Run 2: `s3://sentinel-backtest-data/results/<issue>-r2-trades.csv`
Do NOT skip this step. Trade-level data is required for per-symbol analysis.
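The link section for the issue comment can be generated from the same file list that was uploaded. A minimal sketch, assuming runs are numbered in the order the files are passed; `csv_links_section` is a hypothetical helper name.

```bash
# Hypothetical helper: print the "Trade-Level CSVs" section for an issue comment.
# Assumes run numbers follow the order of the file arguments.
csv_links_section() {
  local prefix="s3://sentinel-backtest-data/results"
  local i=1
  local f
  printf '%s\n' "### Trade-Level CSVs"
  for f in "$@"; do
    printf '%s\n' "- Run ${i}: \`${prefix}/$(basename "$f")\`"
    i=$((i + 1))
  done
}

# Example (not executed here):
#   csv_links_section results/*-trades.csv
```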
CRITICAL: Always post ALL dimension breakdowns. Do not summarize to "Best/Worst" — post the full tables.
Comment the structured results on the GitHub issue. Include ALL sections:
## Backtest Results — [date]
### Configuration
- Universe: [symbols count]
- Script: [full command]
- Environment: [local/EC2 spot c6i.xlarge]
- Risk per trade: $1,000
- Data range: [start - end]
- Runtime: [time]
- Branch: [main or feature branch]
### Head-to-Head Comparison
[Full table: Entries, WR, Avg R, PF, Max DD for each run]
### By Stop Type / By Month / By Hour of Day (ET) / By Day of Week
### By Entry Type / By Market Cap Tier / By Trade Type
### By Level Strength Range / By Exit Reason / By Market Regime
### Key Observations
### Raw Data
Do NOT omit any dimension. If a dimension has no data, note "N/A".
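One way to guard against an omitted dimension is to check the drafted comment before posting it. This sketch assumes the draft is saved to a local file and that each dimension appears under the heading text used in the template above; `missing_dimensions` is a hypothetical helper name.

```bash
# Hypothetical pre-post check: print every required dimension heading that is
# absent from a drafted results comment. Prints nothing and returns 0 when complete.
missing_dimensions() {
  local file="$1"
  local dim
  local status=0
  # The loop reads from a here-document, not a pipe, so it runs in the
  # current shell and the status variable survives the loop.
  while IFS= read -r dim; do
    if ! grep -qF "$dim" "$file"; then
      printf '%s\n' "$dim"
      status=1
    fi
  done <<'DIMENSIONS'
By Stop Type
By Month
By Hour of Day (ET)
By Day of Week
By Entry Type
By Market Cap Tier
By Trade Type
By Level Strength Range
By Exit Reason
By Market Regime
DIMENSIONS
  return "$status"
}

# Example (not executed here):
#   missing_dimensions /tmp/results-comment.md || echo "Add the sections above (or N/A) before posting"
```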
Append a summary to docs/backtests/log.md, commit, push, open PR:
git checkout -b docs/backtest-log-rN
git add docs/backtests/log.md
git commit -m "docs: update backtest log with Round N results (#issue)"
git push -u origin docs/backtest-log-rN
gh pr create --title "docs: update backtest log with Round N results" --body "..."
git checkout main
gh issue close <number> --comment "Backtest complete. Results posted above."
Use EC2 spot for runs that would take too long locally. A full-universe run (~101 symbols, 9 months) takes about 2.5 hours.
- IAM instance profile: sentinel-spot-role
- Security group: sg-00278cf2cbc87596e (the sentinel-backtest SG)
- Key pair: sentinel-key
- Secrets Manager secret: sentinel-github-credentials (key: GITHUB_PAT). The repo is PRIVATE, so you MUST use this PAT for git clone.
CRITICAL: Bar data lives in S3; it is NOT fetched from Alpaca during backtests.
s3://sentinel-backtest-data/
├── backtest/ # Bar data (1min, daily, weekly JSON files)
│ ├── AAPL_1min.json
│ ├── AAPL_daily.json
│ ├── AAPL_weekly.json
│ └── ... (all symbols)
├── results/ # Backtest output CSVs and JSON
└── etf-mapping.json # Stock-to-ETF sector mapping
Use the sync-data script to pull bar data to the EC2 instance BEFORE running:
# Pull bar data from S3 to local disk (run on EC2 after build)
node packages/backtest/scripts/sync-data.mjs --pull --prefix=backtest
# After backtest, push results to S3
node packages/backtest/scripts/sync-data.mjs --push --prefix=results
The user-data script MUST include the sync-data pull step after build and before running the backtest. Without this, the backtest has no bar data.
- shutdown -h after 18000s. A full-universe backtest takes 2.5-3h plus pnpm install + tsc build (~5 min) + S3 sync (~1 min); the older 3-hour timeout was killing runs mid-replay (lessons learned in the R17/R18 era). 5h gives enough headroom even on slow EC2 startup.
- Project=sentinel-backtest tag required by IAM
IMPORTANT: Do NOT inline user-data in the --user-data flag. Heredoc quoting is fragile, especially on Windows. Always write to a file first, then reference it with --user-data file:///tmp/userdata.sh.
cat > /tmp/userdata-r1.sh << 'USERDATA'
#!/bin/bash
set -euxo pipefail
yum install -y git nodejs20
npm install -g pnpm
# Get GitHub PAT from Secrets Manager (repo is PRIVATE)
PAT=$(aws secretsmanager get-secret-value \
--secret-id sentinel-github-credentials \
--region us-east-1 \
--query 'SecretString' --output text | python3 -c "import sys,json; print(json.load(sys.stdin)['GITHUB_PAT'])")
cd /home/ec2-user
git clone https://${PAT}@github.com/dsieczko/sentinel.git
cd sentinel
# If testing a feature branch, add: git checkout feat/branch-name
npx pnpm install && npx pnpm run build
# Pull bar data from S3 (REQUIRED — bar data is NOT fetched from Alpaca)
node packages/backtest/scripts/sync-data.mjs --pull --prefix=backtest
echo "READY" > /home/ec2-user/backtest-ready
chown ec2-user:ec2-user /home/ec2-user/backtest-ready
# Auto-terminate after 5 hours safety net (full-universe backtest is 2.5-3h)
(sleep 18000 && shutdown -h now) &
USERDATA
Critical: persistent output paths must be under /home/ec2-user/, NOT /tmp/. /tmp is wiped on EC2 stop/start. If a backtest gets killed by the safety timeout and the instance is restarted to retrieve results, anything in /tmp is gone. The backtest-ready marker, the backtest-done marker, and the actual .txt output file all live under /home/ec2-user/. An entire R17 run was lost to this on 2026-03-20.
INSTANCE_ID=$(aws ec2 run-instances \
--image-id resolve:ssm:/aws/service/ami-amazon-linux-latest/al2023-ami-kernel-default-x86_64 \
--instance-type c6i.xlarge \
--instance-market-options '{"MarketType":"spot","SpotOptions":{"SpotInstanceType":"one-time"}}' \
--iam-instance-profile Name=sentinel-spot-role \
--key-name sentinel-key \
--security-group-ids sg-00278cf2cbc87596e \
--tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=sentinel-backtest},{Key=Project,Value=sentinel-backtest}]' \
--user-data file:///tmp/userdata-r1.sh \
--query 'Instances[0].InstanceId' --output text)
echo "Launched: $INSTANCE_ID"
aws ec2 wait instance-running --instance-ids $INSTANCE_ID
INSTANCE_IP=$(aws ec2 describe-instances --instance-ids $INSTANCE_ID \
--query 'Reservations[0].Instances[0].PublicIpAddress' --output text)
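# Wait for setup (up to 20 attempts x 30s = 10 minutes)
for i in $(seq 1 20); do
# ssh exits with the remote test's status, so the loop breaks only once the marker file exists
ssh -o StrictHostKeyChecking=no -i ~/.ssh/sentinel-key.pem ec2-user@$INSTANCE_IP \
"test -f /home/ec2-user/backtest-ready" 2>/dev/null && break
echo "WAITING ($i/20)"
sleep 30
done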
# Run backtest: tee runs on the instance, so the output file lands under /home/ec2-user/ there (NOT /tmp, which is wiped on stop/start)
ssh -i ~/.ssh/sentinel-key.pem ec2-user@$INSTANCE_IP \
"cd /home/ec2-user/sentinel && node packages/backtest/scripts/<script> <args> 2>&1 | tee /home/ec2-user/backtest-results.txt"
# ALWAYS terminate when done — see "Safer terminate pattern" below for the right shape
aws ec2 terminate-instances --instance-ids $INSTANCE_ID
aws ec2 wait instance-terminated --instance-ids $INSTANCE_ID
DEFAULT: Launch one EC2 spot instance per test. Spot is billed per-second — running 6 tests on 6 instances costs the same as 1 instance for 6x longer, but finishes 6x faster.
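The cost claim can be checked with simple arithmetic. This sketch assumes every run takes the same time and ignores per-instance setup (install, build, S3 sync), which is paid once per instance and so makes the parallel layout slightly more expensive in practice. `batch_cost` is a hypothetical helper; 150 minutes stands in for a ~2.5 hour run.

```bash
# Billed instance-minutes and wall-clock minutes for a batch of runs.
# Args: <number of runs> <minutes per run> <number of instances>
batch_cost() {
  local runs="$1"
  local minutes_per_run="$2"
  local instances="$3"
  # ceil(runs / instances): how many runs the busiest instance executes
  local rounds=$(( (runs + instances - 1) / instances ))
  echo "instance_minutes=$(( runs * minutes_per_run )) wall_clock_minutes=$(( rounds * minutes_per_run ))"
}

batch_cost 6 150 1   # serial:   instance_minutes=900 wall_clock_minutes=900
batch_cost 6 150 6   # parallel: instance_minutes=900 wall_clock_minutes=150
```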
For a batch of runs, write a userdata file per run, then launch all simultaneously:
declare -A INSTANCES
for RUN in r1 r2 r3 r4 r5 r6; do
# Write per-run userdata (customize branch/overrides as needed)
cat > /tmp/userdata-${RUN}.sh << 'USERDATA'
#!/bin/bash
set -euxo pipefail
yum install -y git nodejs20
npm install -g pnpm
PAT=$(aws secretsmanager get-secret-value --secret-id sentinel-github-credentials --region us-east-1 --query 'SecretString' --output text | python3 -c "import sys,json; print(json.load(sys.stdin)['GITHUB_PAT'])")
cd /home/ec2-user
git clone https://${PAT}@github.com/dsieczko/sentinel.git
cd sentinel
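npx pnpm install && npx pnpm run build
# Pull bar data from S3 (REQUIRED; without it the backtest has no bar data)
node packages/backtest/scripts/sync-data.mjs --pull --prefix=backtest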
echo "READY" > /home/ec2-user/backtest-ready
chown ec2-user:ec2-user /home/ec2-user/backtest-ready
(sleep 18000 && shutdown -h now) &
USERDATA
INSTANCE_ID=$(aws ec2 run-instances \
--image-id resolve:ssm:/aws/service/ami-amazon-linux-latest/al2023-ami-kernel-default-x86_64 \
--instance-type c6i.xlarge \
--instance-market-options '{"MarketType":"spot","SpotOptions":{"SpotInstanceType":"one-time"}}' \
--iam-instance-profile Name=sentinel-spot-role \
--key-name sentinel-key \
--security-group-ids sg-00278cf2cbc87596e \
--tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=sentinel-bt-${RUN}},{Key=Project,Value=sentinel-backtest}]" \
--user-data file:///tmp/userdata-${RUN}.sh \
--query 'Instances[0].InstanceId' --output text)
INSTANCES[$RUN]=$INSTANCE_ID
echo "Run $RUN -> $INSTANCE_ID"
done
Then poll, execute, collect results, and terminate each instance.
- populate-etf-mapping.mjs --force if data/etf-mapping.json lacks marketCap fields
- marketRegime: 'auto-detect' scenarios
Account-level vCPU quota in us-east-1 is 64. Each c6i.xlarge uses 4 vCPUs, so at most 16 can run concurrently. With ~3 baseline t3.large instances always on (Tailscale proxy etc.), the practical cap is ~14 concurrent backtest instances. For R17 (12 runs in parallel) this was fine. For larger parallel batches, either request a quota bump or switch to c6i.large (2 vCPUs: doubles practical parallelism but halves per-instance CPU and roughly doubles wall-clock time per run).
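The concurrency cap follows from integer division of the remaining quota. A small sketch of that arithmetic, taking the three baseline t3.large instances as 2 vCPUs each (6 vCPUs total); `max_concurrent` is a hypothetical helper name.

```bash
# Max concurrent backtest instances = floor((quota - baseline vCPUs) / vCPUs per instance)
# Args: <vCPU quota> <baseline vCPUs already in use> <vCPUs per backtest instance>
max_concurrent() {
  local quota="$1"
  local baseline_vcpus="$2"
  local vcpus_per_instance="$3"
  echo $(( (quota - baseline_vcpus) / vcpus_per_instance ))
}

max_concurrent 64 6 4   # c6i.xlarge: prints 14
max_concurrent 64 6 2   # c6i.large:  prints 29
```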
Pipes around aws ec2 terminate-instances can silently fail. Specifically this pattern looks like it works but doesn't always actually terminate:
aws ec2 describe-instances ... | tr '\t' '\n' | while read ID; do
aws ec2 terminate-instances --instance-ids "$ID"
done
The while loop runs in a subshell, errors don't propagate, and instances stay running for another hour costing money. Lost EC2 time to this.
Use this pattern instead:
# Collect IDs into a bash array, then call terminate explicitly + wait
INSTANCE_IDS=( i-abc i-def i-ghi )
aws ec2 terminate-instances --instance-ids "${INSTANCE_IDS[@]}"
aws ec2 wait instance-terminated --instance-ids "${INSTANCE_IDS[@]}"
The explicit array + explicit wait fails loudly if anything goes wrong. Always pair terminate-instances with wait instance-terminated so you know the cleanup actually happened.
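When the IDs are not already known, the array can be filled from a tag query without piping into a loop. A minimal sketch, assuming every backtest instance carries the Project tag used at launch; `terminate_project_instances` is a hypothetical helper name.

```bash
# Hypothetical helper: terminate every pending/running instance carrying the
# given Project tag, then block until AWS confirms termination.
terminate_project_instances() {
  local project="$1"
  local raw
  # Capture the output first so a failed describe call aborts here
  # instead of being swallowed by a pipe.
  raw=$(aws ec2 describe-instances \
    --filters "Name=tag:Project,Values=${project}" \
              "Name=instance-state-name,Values=pending,running" \
    --query 'Reservations[].Instances[].InstanceId' \
    --output text) || return 1
  local -a ids
  # Word-splitting is intended: text output separates IDs with tabs or newlines.
  # shellcheck disable=SC2206
  ids=( $raw )
  if [ "${#ids[@]}" -eq 0 ]; then
    echo "No instances to terminate for Project=${project}"
    return 0
  fi
  aws ec2 terminate-instances --instance-ids "${ids[@]}" || return 1
  aws ec2 wait instance-terminated --instance-ids "${ids[@]}"
}

# Example (not executed here):
#   terminate_project_instances sentinel-backtest
```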
runPortfolioSim(outcomes, config) is a pure post-processing filter on TradeOutcome[] — it does NOT re-run the replay. This means you can run the base backtest once (~2.5h on c6i.xlarge for the full universe) and then iterate through 5-10 different portfolio configs in milliseconds each.
Use this pattern for any portfolio-constrained tests. R19 (run-r19-portfolio-sim.mjs) and R20 (run-r20-regime-exposure.mjs) both use it. Anyone running portfolio configs going forward should use the post-process pattern, not re-run the base replay per config.
The PortfolioSimResult includes: takenTrades, rejectedTrades, per-month breakdown (after commit 9d30bbb), maxConcurrentPositions, avgExposurePct, cash tracking, totalTaxPaid, quarterly tax withdrawal.
For custom scripts you don't want to commit to a feature branch during an active backtest session (because issue/PR overhead would slow you down and you're validating an approach, not proposing it as the final form), you can base64-encode the script and embed it in the EC2 user-data:
SCRIPT_B64=$(base64 packages/backtest/scripts/run-rN-experiment.mjs)
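# In the user-data heredoc, after the git clone + build. The heredoc delimiter must be
# unquoted for ${SCRIPT_B64} to expand locally; escape any other $ that should run on the instance.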
echo '${SCRIPT_B64}' | base64 -d > packages/backtest/scripts/run-rN-experiment.mjs
chown ec2-user:ec2-user packages/backtest/scripts/run-rN-experiment.mjs
EC2 limits user data to 16 KB in raw form (before the service base64-encodes it), and embedding adds about a third to the script's size, so only small wrapper scripts fit. Worked for R19 and R20 wrapper scripts.
Caveat: This bypasses code review. Only use when the script is throwaway/exploratory. Anything that should land in the codebase needs the standard issue/PR flow. Tag the output file with a comment line like # Script source: base64-embedded in userdata, not on branch so anyone retrieving the result later knows where the script lived.
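Before launching, it may be worth confirming that the encoded script fits and decodes cleanly. The helpers below are a hypothetical sketch: the byte budget is passed in by the caller (whatever remains of the user-data limit after the rest of the script), and `base64 -d` is assumed to be available as on GNU coreutils.

```bash
# Size in bytes of a file once base64-encoded (line wrapping removed).
encoded_size() {
  base64 < "$1" | tr -d '\n' | wc -c | tr -d ' '
}

# Succeeds only if the encoded script fits within the given byte budget.
# Usage: fits_in_userdata <script> <budget-bytes>
fits_in_userdata() {
  [ "$(encoded_size "$1")" -le "$2" ]
}

# Succeeds only if encode -> decode reproduces the file byte for byte.
roundtrip_ok() {
  base64 < "$1" | base64 -d | cmp -s - "$1"
}

# Example (not executed here; BUDGET_BYTES is a value you choose):
#   fits_in_userdata packages/backtest/scripts/run-rN-experiment.mjs "$BUDGET_BYTES" || echo "too large to embed"
```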
These are open-but-actionable substrate issues. Working around them is a normal part of running backtests today. Each has a filed issue — the workarounds here will go away once the issues land.
Trade CSV is missing rMultiple (issue #1500). The *-trades.csv exporter declares the rMultiple column header but always writes empty values. CSV-based analysis can compute win rate (from isWin) but cannot compute profit factor or average R-multiple. Workaround until #1500 lands: for PF/avg R metrics, parse the .txt stdout file instead of the CSV.
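Win rate is still recoverable from the CSV alone. The following awk sketch rests on assumptions about the file that should be checked against a real export: a header row containing an isWin column, no quoted commas, and wins recorded as "true" or "1". `win_rate` is a hypothetical helper name.

```bash
# Hypothetical helper: win rate (percent, one decimal) from a trades CSV.
# Assumes a header row with an isWin column, no quoted commas, and wins
# written as "true" or "1".
win_rate() {
  awk -F',' '
    { sub(/\r$/, "") }                       # tolerate CRLF line endings
    NR == 1 {
      for (i = 1; i <= NF; i++) if ($i == "isWin") col = i
      if (!col) { print "isWin column not found" > "/dev/stderr"; exit 1 }
      next
    }
    { total++; if ($col == "true" || $col == "1") wins++ }
    END { if (total > 0) printf "%.1f\n", 100 * wins / total }
  ' "$1"
}

# Example (not executed here):
#   win_rate results/<issue>-r1-trades.csv
```

Locating the column by header name keeps the helper working if the exporter reorders columns.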
Snapshot override wiring gap (issue #1499). When a new tunable parameter is added to BacktestConfigSnapshot, someone must wire the snapshot field back into the config object passed to BacktestRunner.run() in run-backtest.mjs. Default values work via DEFAULT_CONFIG spread, but --overrides='...' CLI overrides are silently ignored if the wiring is missing. Workaround until #1499 lands: when running with overrides on a new config path, smoke-test with a deliberately out-of-range value first and confirm the behavior actually changes. If the run looks identical to default, the wiring is missing — file a fix before burning EC2 time on the actual sweep.
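The smoke test reduces to comparing two captured stdout files. This is a sketch that assumes the output is deterministic for a fixed config; if it contains timestamps or runtimes, filter those lines out before comparing. `overrides_took_effect` is a hypothetical helper name.

```bash
# Hypothetical check: succeed only if the override run's output differs
# from the default run's output.
# Usage: overrides_took_effect <default-output.txt> <override-output.txt>
overrides_took_effect() {
  local baseline="$1"
  local candidate="$2"
  if cmp -s "$baseline" "$candidate"; then
    echo "Outputs are identical: the override was probably ignored (see #1499)" >&2
    return 1
  fi
  return 0
}

# Example (not executed here):
#   overrides_took_effect /tmp/default.txt /tmp/out-of-range.txt || exit 1
```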
Monthly exposure aggregation returns 0% (issue #1501). computeMonthlyBreakdown() in portfolio-sim.ts always returns avgExposurePct: 0.0 per month even when actual exposure is non-trivial. Per-month P&L, win rate, and trade counts are correct — only the exposure roll-up is broken. Workaround until #1501 lands: treat per-month exposure values as missing data, not as zero exposure. Use the aggregate avgExposurePct from the full-run result instead.
When a bug-fixing PR lands that affects historical backtest output, the previously-logged numbers in docs/backtests/log.md become stale. There is no automated revalidation process. The backtester's responsibility:
- ## Historical impact section in the PR body listing which docs/backtests/log.md entries are now stale.
- <!-- DD VALUE STALE pre-PR-#NNNN --> (or equivalent) in the log so future readers know to discount the stale values.
The canonical example is PR #562 (DD calculation peak-from-start bug). It invalidated DD numbers for all pre-#562 entries. R17 r4 and R18 r6 were re-run manually post-fix; the rest of the log was not. Any DD value that looks implausibly low (e.g. 0.00R or 1.00R on a config with non-trivial trade volume) on a pre-2026-03-20 entry should be assumed stale.
When posting backtest results or coordinating with other agents in Discord:
- .txt in S3, or split manually into multiple messages with a clear "1/3", "2/3" prefix per part. Don't rely on any plugin to chunk for you: multi-part sends drop the tail more than half the time.
- @mention explicit bot IDs from the agent directory in CLAUDE.md, never role mentions, for action items. Role mentions (<@&roleId>) are fine for announcements but don't reliably push-notify a specific agent because the recipient is ambiguous. Bot IDs (<@1234567890>) are unambiguous and produce a real notification.
- docs/backtests/log.md: the canonical chronological log of backtest results. Append a new entry after every round. (Not docs/BACKTEST-LOG.md; that path is historical and the file lives under docs/backtests/ now.)
- s3://sentinel-backtest-data/results/: full trade-level CSV outputs uploaded after each round.