# Run a Benchmark
You shipped a skill-pack change — a new kata-spec rule, a tweak to a
fit-pathway profile, an updated tool allowlist. The next question is the
hard one: did agents get better at writing code? A single agent run is a
coin flip, and a passing eval doesn't generalise. `fit-benchmark` runs
each coding task N times against a versioned skill-set manifest, grades
each run with tests the agent never sees, and aggregates pass@k using
the OpenAI HumanEval unbiased estimator.
## Prerequisites

- Node.js 18+
- `ANTHROPIC_API_KEY` set in the environment
- `@forwardimpact/libeval` (ships `fit-eval`, `fit-trace`, and `fit-benchmark`). Install globally with `npm install -g @forwardimpact/libeval`, or invoke ephemerally in CI with `npx --yes @forwardimpact/libeval fit-benchmark ...`
## Author a Task Family
A task family is a directory of related coding tasks plus the skill-set under test:
```
my-coding-family/
  .env                  # family env vars (committed defaults)
  .env.local            # family secrets (gitignored)
  apm.yml               # optional — skill-pack dependencies
  apm.lock.yaml         # skill-set manifest (hashed)
  .claude/              # pre-staged skills + agents
    skills/...
    agents/judge.md
  tasks/todo-api/
    .env                # task env vars — loaded + rendered
    .env.local          # task secrets — loaded + rendered (gitignored)
    agent.task.md       # what the agent should build (required)
    judge.task.md       # optional — judge prompt (see § judge.task.md)
    supervisor.task.md  # optional — supervisor context
    hooks/              # harness-only — never copied to agent CWD
      preflight.sh      # optional — smoke probe
      score.sh          # optional — hidden grader; fd 3 = $RESULTS_FD
    specs/              # copied into the agent CWD
    workdir/            # copied into the agent CWD
```
Task IDs are directory names under `tasks/` (e.g. `todo-api`). The
directory splits into what the agent sees (`workdir/`, `specs/`,
`.claude/`) and what the harness keeps hidden (`hooks/`). The agent
never receives the scoring script — that is the structural guarantee
that it cannot peek at the tests.
## What the agent sees

### agent.task.md

Plain markdown — the prompt the agent receives.
```markdown
Build a TODO API matching the spec under `specs/`. Listen on the port
exposed via the environment variable `PORT`. Respond to `GET /todos`
with a JSON array of TODO objects.
```
### workdir/

Whatever scaffolding the agent should start with: a `package.json`, a
README, sample data — everything here is copied into the per-task CWD.
## What the harness controls — hooks/

The `hooks/` directory holds lifecycle scripts the harness runs at
specific phases. Both scripts receive `$WORKDIR` (the per-task agent
CWD) and `$PORT` (a pre-allocated free TCP port) as environment
variables. Neither is ever copied to the agent's working directory.
### hooks/preflight.sh

Optional. Runs before the agent starts. Exit 0 means "scaffold is
healthy, hand off to the agent." A non-zero exit short-circuits the run
and produces a `preflightError` result record (cost zero, no agent
invoked). When the script is absent, the harness proceeds without a
pre-flight probe.
A preflight that starts a background service for the scoring probe to test against:
```sh
#!/bin/sh
node "$WORKDIR/app.js" >/dev/null 2>&1 &
sleep 0.2
exit 0
```
The harness spawns the preflight in its own process group and tears down the entire group (SIGTERM, grace period, SIGKILL) after scoring completes — background processes do not leak across runs.
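The same own-process-group / group-signal pattern can be sketched in Python — an illustration of the teardown strategy described above, not the harness's actual code:

```python
import os
import signal
import subprocess

# start_new_session=True puts the script and anything it backgrounds
# into a fresh process group whose pgid equals the child's pid.
proc = subprocess.Popen(["sh", "-c", "sleep 30 & sleep 30"],
                        start_new_session=True)

# ... agent runs, scoring completes ...

os.killpg(proc.pid, signal.SIGTERM)      # polite: signal the whole group
try:
    proc.wait(timeout=5)                 # grace period
except subprocess.TimeoutExpired:
    os.killpg(proc.pid, signal.SIGKILL)  # hard stop if anything lingers
```

Because the signal targets the group rather than the single child pid, backgrounded grandchildren (like the `node ... &` in the preflight above) are torn down too.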
### hooks/score.sh

Runs after the agent finishes. In addition to `$WORKDIR` and `$PORT`,
it receives `$RESULTS_FD=3` — a file descriptor for structured per-test
rows. The exit code is authoritative: 0 is pass, anything else is fail.
Rows written to fd 3 are stored on the result record's
`scoring.details` for diagnostics; they cannot override the verdict.
Three grading surfaces are in scope:
```sh
# Running-service probe
RESP="$(curl -sf --max-time 2 "http://127.0.0.1:$PORT/todos")"
test "$RESP" = '[]' && exit 0 || exit 1
```

```sh
# Repository state
sha256sum "$WORKDIR/dist/build.tar.gz" \
  | grep -q '^expected-sha256-prefix' && exit 0 || exit 1
```

```sh
# Process exit
( cd "$WORKDIR" && bun test ) && exit 0 || exit 1
```
### Writing to fd 3 from non-bash interpreters

Bash makes fd-3 writing trivial via `>&"$RESULTS_FD"`. From other
languages you open fd 3 explicitly:
```python
import json, os

fd = int(os.environ["RESULTS_FD"])
with os.fdopen(fd, "w") as f:
    f.write(json.dumps({"test": "t1", "pass": True}) + "\n")
```
```js
const fs = require("node:fs");

const fd = Number(process.env.RESULTS_FD);
fs.writeSync(fd, JSON.stringify({ test: "t1", pass: true }) + "\n");
```
## What the judge uses — judge.task.md

The post-hoc judge's prompt. The harness substitutes these template
variables before sending the prompt to the judge:
| Variable | Description |
|---|---|
| `{{AGENT_INSTRUCTIONS}}` | Contents of `agent.task.md` |
| `{{AGENT_PROFILE}}` | Agent profile body (empty string if none) |
| `{{AGENT_TRACE_PATH}}` | Absolute path to `agent.ndjson` |
| `{{SCORING_RESULT}}` | JSON scoring object (verdict, details, exitCode) |
| `{{SKILL_SET_HASH}}` | SHA-256 fingerprint from `apm.lock.yaml` |
| `{{TASK_ID}}` | Task name (directory under `tasks/`) |
| `{{TASK_DIR}}` | Agent working directory path |
`{{SCORING}}` is accepted as a legacy alias for `{{SCORING_RESULT}}`.
An example `judge.task.md`:

````markdown
Scoring outcome:

```json
{{SCORING_RESULT}}
```

The agent's full trace is at `{{AGENT_TRACE_PATH}}` — read it before
deciding. The agent was given task `{{TASK_ID}}` with these instructions:

{{AGENT_INSTRUCTIONS}}

Call `Conclude` with `verdict='success'` when both:

1. `scoring.verdict === "pass"`, and
2. the agent did not violate the test contract (e.g. by editing the
   test file).
````
The judge is a separate session — not the live supervisor. Mixing the "help the agent finish" incentive with the "grade fairly" incentive is what the design avoids.
## What identifies the skill set — .claude/ and apm.lock.yaml

The pre-staged `.claude/` tree carries the skills and agent profiles
the agent will see. `apm.lock.yaml` is the manifest under test — the
harness hashes its bytes (LF-normalised) into `skillSetHash` on every
result record. A one-byte change to the lockfile produces a different
hash, which is how comparing "before-skill-change" vs
"after-skill-change" runs becomes apples-to-apples.
**Caveat.** `skillSetHash` covers the lockfile bytes only. If you edit
`.claude/` directly without regenerating the lockfile, the hash won't
reflect the change. Always re-run your packing tool after editing
`.claude/`.
## Environment Variables

The harness auto-discovers `.env` and `.env.local` files in the family
root and each task directory. Every discovered file is loaded into
`process.env` and rendered into the agent's working directory before
`preflight.sh` runs. `process.env` always wins — existing values are
never overwritten.
- Locally: put credentials in `.env.local` (gitignored).
- In CI: set secrets as repository env vars — no files needed.
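The "existing values are never overwritten" rule can be illustrated with a minimal loader sketch (`load_env_file` is a hypothetical helper, not the harness's implementation):

```python
def load_env_file(path: str, env: dict) -> dict:
    """Load KEY=VALUE lines, but never overwrite a value already in env."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blanks, comments, and malformed lines
            key, _, value = line.partition("=")
            env.setdefault(key.strip(), value.strip())  # existing wins
    return env
```

Under this precedence, a CI secret already present in the process environment shadows the committed default from the file.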
### Example
A task that calls an LLM proxy:
```sh
# tasks/my-rag-task/.env.local (gitignored)
LLMHUB_NONPROD_API_KEY=your-key-here
LLMHUB_PROD_API_KEY=your-key-here
```
The harness renders this into the agent's CWD as `.env.local` with
values resolved from `process.env` (CI secrets override file defaults).
The task's `preflight.sh` can validate that the file exists; the
agent's application reads credentials from it. All discovered variable
names are added to the trace redaction allowlist.
## Run It
```sh
npx fit-benchmark run \
  --family=./my-coding-family \
  --output=./runs/2026-05-11 \
  --runs=5 \
  --agent-profile=coder \
  --judge-profile=judge \
  --max-turns=80
```
Output:
- `./runs/2026-05-11/results.jsonl` — append-only, one record per `(task, runIndex)`. Survives partial failures.
- `./runs/2026-05-11/runs/<task-name>/<runIndex>/` — per-run artifacts: the agent CWD, the agent trace, the judge trace, the scoring stderr log.
- `./runs/2026-05-11/.apm-staging/.claude/` — staged skills/agents.
Each result record carries `skillSetHash`, `familyRevision`, the
combined verdict, scoring details, judge verdict + summary, cost, turn
count, and the absolute paths to both NDJSON traces. The record's
schema is validated at write time, so a malformed write is caught
before the report stage trips over it.
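Because the file is append-only JSONL, a finished (or partially failed) run can be eyeballed by streaming it line by line. A sketch — the `task` and `scoring.verdict` field names follow the descriptions above but are assumptions to check against your actual records:

```python
import json

def summarise(results_path: str) -> dict:
    """Tally (runs, passes) per task from a results.jsonl file."""
    tally = {}
    with open(results_path) as f:
        for line in f:
            rec = json.loads(line)
            task = rec["task"]                            # assumed field
            passed = rec["scoring"]["verdict"] == "pass"  # assumed shape
            n, c = tally.get(task, (0, 0))
            tally[task] = (n + 1, c + passed)
    return tally
```

The per-task `(n, c)` pairs are exactly the inputs the pass@k estimator needs.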
## Score One Task at a Time
For ad-hoc grading without an agent run:
```sh
npx fit-benchmark score \
  --family=./my-coding-family \
  --task=todo-api \
  --workdir=./runs/2026-05-11/runs/todo-api/0 \
  --output=score.jsonl
```
Useful when iterating on a `hooks/score.sh` script: re-grade an
existing post-run workdir without burning agent cost.
## Aggregate Into pass@k
```sh
npx fit-benchmark report \
  --input=./runs/2026-05-11 \
  --k=1,3,5 \
  --format=text
```
With `--format=text`, the report renders a full markdown document:

- **Summary** — overall pass rate, model, skill-set hash, cost, median duration, median turns.
- **Pass@k table** — one row per task with the unbiased HumanEval estimator: `pass@k = 1 - C(n-c, k) / C(n, k)`.
- **Task details** — per-task sections with a runs table, scoring check results, judge commentary (blockquoted), and any agent or preflight errors.
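The estimator in the table above, with `n` runs and `c` passes, is usually computed as a product rather than with raw binomial coefficients to avoid overflow. A sketch:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k), in numerically stable form."""
    if n - c < k:
        return 1.0  # fewer than k failures: every k-sample contains a pass
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))
```

At `k=1` this reduces to the plain pass rate `c / n`; larger `k` estimates the chance that at least one of `k` sampled runs passes.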
With `--format=json` (the default), the output is the aggregated pass@k
data only — suitable for machine consumption and before/after diffs.

A `k > n` value emits a structured error row rather than a misleading
number.
## Compare Before and After
The reproducibility claim is the heart of the tool. Run the family twice — once with the old skill manifest, once with the new — and compare:
```sh
# Before
npx fit-benchmark run --family=./my-coding-family --output=./runs/before --runs=10
npx fit-benchmark report --input=./runs/before --format=json > before.json

# After (manifest changed)
npx fit-benchmark run --family=./my-coding-family --output=./runs/after --runs=10
npx fit-benchmark report --input=./runs/after --format=json > after.json
```
Each record carries `skillSetHash`, so any cross-comparison script can
verify that the two reports came from materially different skill sets
before declaring an improvement.
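Such a guard can be as small as the sketch below. It assumes a top-level `skillSetHash` field in each report JSON, which is an assumption about the report shape — adjust the lookup if your reports nest it differently; `guard_comparison` is a hypothetical helper:

```python
import json

def guard_comparison(before_path: str, after_path: str):
    """Refuse to compare two reports that share a skill-set hash."""
    before = json.load(open(before_path))
    after = json.load(open(after_path))
    if before["skillSetHash"] == after["skillSetHash"]:  # assumed field
        raise SystemExit("Same skill set on both sides; nothing to compare.")
    return before, after
```

Running it before any pass@k diff turns "we compared the wrong runs" from a silent error into a hard stop.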
## What's Next

- **Automate with GitHub Actions**: run fit-benchmark in CI with the forwardimpact/fit-benchmark composite action — step summaries, artifact upload, and PR-triggered benchmarks.
- **Run an Eval**: know whether agent changes improved outcomes — an agent-as-judge eval wired into CI with traceable results.
- **Analyze Traces**: see exactly what an agent did and why — download traces, query turns, filter by tool or error, and measure token cost.