Run a Benchmark

You shipped a skill-pack change — a new kata-spec rule, a tweak to a fit-pathway profile, an updated tool allowlist. The next question is the hard one: did agents get better at writing code? A single agent run is a coin flip, and a passing eval doesn't generalise. fit-benchmark runs each coding task N times against a versioned skill-set manifest, grades each run with tests the agent never sees, and aggregates pass@k using the OpenAI HumanEval unbiased estimator.

Prerequisites

  • Node.js 18+
  • ANTHROPIC_API_KEY set in the environment
  • @forwardimpact/libeval (ships fit-eval, fit-trace, and fit-benchmark). Install globally with npm install -g @forwardimpact/libeval, or invoke ephemerally in CI with npx --yes @forwardimpact/libeval fit-benchmark ...

Author a Task Family

A task family is a directory of related coding tasks plus the skill-set under test:

my-coding-family/
  .env                                   # family env vars (committed defaults)
  .env.local                             # family secrets (gitignored)
  apm.yml                                # optional — skill-pack dependencies
  apm.lock.yaml                          # skill-set manifest (hashed)
  .claude/                               # pre-staged skills + agents
    skills/...
    agents/judge.md
  tasks/todo-api/
    .env                                 # task env vars — loaded + rendered
    .env.local                           # task secrets — loaded + rendered (gitignored)
    agent.task.md                        # what the agent should build (required)
    judge.task.md                        # optional — judge prompt (see § judge.task.md)
    supervisor.task.md                   # optional — supervisor context
    hooks/                               # harness-only — never copied to agent CWD
      preflight.sh                       # optional — smoke probe
      score.sh                           # optional — hidden grader; fd 3 = $RESULTS_FD
    specs/                               # copied into the agent CWD
    workdir/                             # copied into the agent CWD

Task IDs are directory names under tasks/ (e.g. todo-api). The directory splits into what the agent sees (workdir/, specs/, .claude/) and what the harness keeps hidden (hooks/). The agent never receives the scoring script — that is the structural guarantee it cannot peek at the tests.

What the agent sees

agent.task.md

Plain markdown — the prompt the agent receives.

Build a TODO API matching the spec under `specs/`. Listen on the port
exposed via the environment variable `PORT`. Respond to `GET /todos`
with a JSON array of TODO objects.

workdir/

Whatever scaffolding the agent should start with: a package.json, a README, sample data — everything here is copied into the per-task CWD.

What the harness controls — hooks/

The hooks/ directory holds lifecycle scripts the harness runs at specific phases. Both scripts receive $WORKDIR (the per-task agent CWD) and $PORT (a pre-allocated free TCP port) as environment variables. Neither is ever copied to the agent's working directory.

hooks/preflight.sh

Optional. Runs before the agent starts. Exit 0 means "scaffold is healthy, hand off to the agent." A non-zero exit short-circuits the run and produces a preflightError result record (cost zero, no agent invoked). When the script is absent, the harness proceeds without a pre-flight probe.

A preflight that starts a background service for the scoring probe to test against:

#!/bin/sh
# Start the scaffold app in the background; the scoring probe hits it later.
node "$WORKDIR/app.js" >/dev/null 2>&1 &
# Give it a moment to bind $PORT before handing off to the agent.
sleep 0.2
exit 0

The harness spawns the preflight in its own process group and tears down the entire group (SIGTERM, grace period, SIGKILL) after scoring completes — background processes do not leak across runs.
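A fixed sleep can race a slow start. A variant that polls until the service answers fails fast instead — a sketch, assuming the scaffold already serves GET /todos on $PORT as in this task's spec:

#!/bin/sh
node "$WORKDIR/app.js" >/dev/null 2>&1 &
# Poll instead of guessing: up to 2 s in 100 ms steps.
for _ in $(seq 1 20); do
  curl -sf --max-time 1 "http://127.0.0.1:$PORT/todos" >/dev/null && exit 0
  sleep 0.1
done
exit 1   # service never came up: the harness records a preflightError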

hooks/score.sh

Runs after the agent finishes. In addition to $WORKDIR and $PORT, it receives $RESULTS_FD=3 — a file descriptor for structured per-test rows.

The exit code is authoritative: 0 is pass, anything else is fail. Rows written to fd 3 are stored on the result record's scoring.details for diagnostics; they cannot override the verdict.

Three grading surfaces are in scope:

# Running-service probe
RESP="$(curl -sf --max-time 2 "http://127.0.0.1:$PORT/todos")"
test "$RESP" = '[]' && exit 0 || exit 1
# Repository state
sha256sum "$WORKDIR/dist/build.tar.gz" \
  | grep -q '^expected-sha256-prefix' && exit 0 || exit 1
# Process exit
( cd "$WORKDIR" && bun test ) && exit 0 || exit 1

Writing to fd 3 from non-bash interpreters

Bash makes fd-3 writing trivial via >&"$RESULTS_FD". From other languages you open fd 3 explicitly:

# Python
import json, os
fd = int(os.environ["RESULTS_FD"])
with os.fdopen(fd, "w") as f:
    f.write(json.dumps({"test": "t1", "pass": True}) + "\n")

// Node.js
const fs = require("node:fs");
const fd = Number(process.env.RESULTS_FD);
fs.writeSync(fd, JSON.stringify({ test: "t1", pass: true }) + "\n");

What the judge uses — judge.task.md

The post-hoc judge's prompt. The harness substitutes these template variables before sending the prompt to the judge:

Variable                 Description
{{AGENT_INSTRUCTIONS}}   Contents of agent.task.md
{{AGENT_PROFILE}}        Agent profile body (empty string if none)
{{AGENT_TRACE_PATH}}     Absolute path to agent.ndjson
{{SCORING_RESULT}}       JSON scoring object (verdict, details, exitCode)
{{SKILL_SET_HASH}}       SHA-256 fingerprint from apm.lock.yaml
{{TASK_ID}}              Task name (directory under tasks/)
{{TASK_DIR}}             Agent working directory path

{{SCORING}} is accepted as a legacy alias for {{SCORING_RESULT}}.

Scoring outcome:

```json
{{SCORING_RESULT}}
```

The agent's full trace is at `{{AGENT_TRACE_PATH}}` — read it before
deciding. The agent was given task `{{TASK_ID}}` with these instructions:

{{AGENT_INSTRUCTIONS}}

Call `Conclude` with `verdict='success'` when both:

1. `scoring.verdict === "pass"`, and
2. the agent did not violate the test contract (e.g. by editing the
   test file).

The judge is a separate session — not the live supervisor. Mixing the "help the agent finish" incentive with the "grade fairly" incentive is what the design avoids.

What identifies the skill set — .claude/ and apm.lock.yaml

The pre-staged .claude/ tree carries the skills and agent profiles the agent will see. apm.lock.yaml is the manifest under test — the harness hashes its bytes (LF-normalised) into skillSetHash on every result record. A one-byte change to the lockfile produces a different hash, which is how comparing "before-skill-change" vs "after-skill-change" runs becomes apples-to-apples.

Caveat. skillSetHash covers the lockfile bytes only. If you edit .claude/ directly without regenerating the lockfile, the hash won't reflect the change. Always re-run your packing tool after editing .claude/.
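To predict whether two checkouts will agree on skillSetHash without burning a run, hash the lockfile the same way — a sketch; the harness hashes LF-normalised bytes, so strip CR first in case of a CRLF checkout:

tr -d '\r' < apm.lock.yaml | sha256sum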

Environment Variables

The harness auto-discovers .env and .env.local files in the family root and each task directory. Every discovered file is loaded into process.env and rendered into the agent's working directory before preflight.sh runs. process.env always wins — existing values are never overwritten.

  • Locally: put credentials in .env.local (gitignored).
  • In CI: set secrets as repository env vars — no files needed.

Example

A task that calls an LLM proxy:

# tasks/my-rag-task/.env.local (gitignored)
LLMHUB_NONPROD_API_KEY=your-key-here
LLMHUB_PROD_API_KEY=your-key-here

The harness renders this into the agent's CWD as .env.local with values resolved from process.env (CI secrets override file defaults). The task's preflight.sh can validate the file exists; the agent's application reads credentials from it.
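A minimal guard for that preflight — a sketch, assuming the key name from the example above:

#!/bin/sh
# Short-circuit with a preflightError if credentials were not rendered.
test -f "$WORKDIR/.env.local" || exit 1
grep -q '^LLMHUB_NONPROD_API_KEY=' "$WORKDIR/.env.local"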

All discovered var names are added to the trace redaction allowlist.

Run It

npx fit-benchmark run \
  --family=./my-coding-family \
  --output=./runs/2026-05-11 \
  --runs=5 \
  --agent-profile=coder \
  --judge-profile=judge \
  --max-turns=80

Output:

  • ./runs/2026-05-11/results.jsonl — append-only, one record per (task, runIndex). Survives partial failures.
  • ./runs/2026-05-11/runs/<task-name>/<runIndex>/ — per-run artifacts: the agent CWD, the agent trace, the judge trace, the scoring stderr log.
  • ./runs/2026-05-11/.apm-staging/.claude/ — staged skills/agents.

Each result record carries skillSetHash, familyRevision, the combined verdict, scoring details, judge verdict + summary, cost, turn count, and the absolute paths to both NDJSON traces. The record's schema is validated at write time, so a malformed write is caught before the report stage trips over it.
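Because the file is append-only NDJSON, it can be tailed mid-run. A sketch that prints only the two identity fields named above (other field names are not documented here):

tail -f ./runs/2026-05-11/results.jsonl | jq '{skillSetHash, familyRevision}'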

Score One Task at a Time

For ad-hoc grading without an agent run:

npx fit-benchmark score \
  --family=./my-coding-family \
  --task=todo-api \
  --workdir=./runs/2026-05-11/runs/todo-api/0 \
  --output=score.jsonl

Useful when iterating on a hooks/score.sh script: re-grade an existing post-run workdir without burning agent cost.
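For example, re-grading every run of a task after a grader fix — flags as documented above; the indices assume --runs=5:

for i in 0 1 2 3 4; do
  npx fit-benchmark score \
    --family=./my-coding-family \
    --task=todo-api \
    --workdir="./runs/2026-05-11/runs/todo-api/$i" \
    --output="score-$i.jsonl"
done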

Aggregate Into pass@k

npx fit-benchmark report \
  --input=./runs/2026-05-11 \
  --k=1,3,5 \
  --format=text

With --format=text, the report renders a full markdown document:

  • Summary — overall pass rate, model, skill-set hash, cost, median duration, median turns.
  • Pass@k table — one row per task with the unbiased HumanEval estimator: pass@k = 1 - C(n-c, k) / C(n, k) (sanity-check sketch below).
  • Task details — per-task sections with a runs table, scoring check results, judge commentary (blockquoted), and any agent or preflight errors.

With --format=json (default), the output is the aggregated pass@k data only — suitable for machine consumption and before/after diffs.

A k > n value emits a structured error row rather than a misleading number.
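The estimator is cheap to verify by hand. A sketch in awk, using the telescoping-product form of the same formula:

awk -v n=10 -v c=3 -v k=5 'BEGIN {
  if (k > n) { print "error: k > n"; exit 1 }        # mirrors the report guard
  p = 1
  for (i = 0; i < k; i++) p *= (n - c - i) / (n - i)  # = C(n-c,k) / C(n,k)
  printf "pass@%d = %.4f\n", k, 1 - p
}'

With n=10 runs and c=3 passes, pass@5 comes out to roughly 0.9167.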

Compare Before and After

The reproducibility claim is the heart of the tool. Run the family twice — once with the old skill manifest, once with the new — and compare:

# Before
npx fit-benchmark run --family=./my-coding-family --output=./runs/before --runs=10
npx fit-benchmark report --input=./runs/before --format=json > before.json

# After (manifest changed)
npx fit-benchmark run --family=./my-coding-family --output=./runs/after --runs=10
npx fit-benchmark report --input=./runs/after --format=json > after.json

Each record carries skillSetHash, so any cross-comparison script can verify the two reports came from materially different skill sets before declaring an improvement.
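A sketch of that guard, keyed on the skillSetHash field:

# Each results file should collapse to exactly one distinct hash.
BEFORE="$(jq -r '.skillSetHash' runs/before/results.jsonl | sort -u)"
AFTER="$(jq -r '.skillSetHash' runs/after/results.jsonl | sort -u)"
if test "$BEFORE" = "$AFTER"; then
  echo "identical skill sets: any delta is noise" >&2
  exit 1
fi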

What's Next