# Run a Benchmark

You shipped a skill-pack change — a new `kata-spec` rule, a tweak to a
`fit-pathway` profile, an updated tool allowlist. The next question is the
hard one: did agents get better at writing code? A single agent run is a
coin flip, and a passing eval doesn't generalise. `fit-benchmark` runs each
coding task **N times** against a **versioned skill-set manifest**, grades
each run with tests the agent never sees, and aggregates pass@k using the
OpenAI HumanEval unbiased estimator.

## Prerequisites

- Node.js 18+
- `ANTHROPIC_API_KEY` set in the environment
- `@forwardimpact/libeval` (ships `fit-eval`, `fit-trace`, and
  `fit-benchmark`). Install globally with
  `npm install -g @forwardimpact/libeval`, or invoke ephemerally in CI
  with `npx --yes @forwardimpact/libeval fit-benchmark ...`
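A quick sanity check of both before the first run:

```sh
node --version                                  # expect v18 or newer
test -n "$ANTHROPIC_API_KEY" || echo "ANTHROPIC_API_KEY is not set" >&2
```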

## Author a Task Family

A task family is a directory of related coding tasks plus the skill-set
under test:

```
my-coding-family/
  .env                                   # family env vars (committed defaults)
  .env.local                             # family secrets (gitignored)
  apm.yml                                # optional — skill-pack dependencies
  apm.lock.yaml                          # skill-set manifest (hashed)
  .claude/                               # pre-staged skills + agents
    skills/...
    agents/judge.md
  tasks/todo-api/
    .env                                 # task env vars — loaded + rendered
    .env.local                           # task secrets — loaded + rendered (gitignored)
    agent.task.md                        # what the agent should build (required)
    judge.task.md                        # optional — judge prompt (see § judge.task.md)
    supervisor.task.md                   # optional — supervisor context
    hooks/                               # harness-only — never copied to agent CWD
      preflight.sh                       # optional — smoke probe
      score.sh                           # optional — hidden grader; fd 3 = $RESULTS_FD
    specs/                               # copied into the agent CWD
    workdir/                             # copied into the agent CWD
```

Task IDs are directory names under `tasks/` (e.g. `todo-api`). The directory
splits into what the agent sees (`workdir/`, `specs/`, `.claude/`) and what the
harness keeps hidden (`hooks/`). The agent never receives the scoring script —
that is the structural guarantee it cannot peek at the tests.
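A minimal by-hand scaffold matching that split (a sketch; per the tree
above, only `agent.task.md` is required):

```sh
# Agent-visible directories plus the hidden hooks/ directory.
mkdir -p my-coding-family/tasks/todo-api/hooks \
         my-coding-family/tasks/todo-api/specs \
         my-coding-family/tasks/todo-api/workdir
# The one required file: the prompt the agent receives.
cat > my-coding-family/tasks/todo-api/agent.task.md <<'EOF'
Build a TODO API matching the spec under `specs/`.
EOF
```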

### What the agent sees

#### `agent.task.md`

Plain markdown — the prompt the agent receives.

```md
Build a TODO API matching the spec under `specs/`. Listen on the port
exposed via the environment variable `PORT`. Respond to `GET /todos`
with a JSON array of TODO objects.
```

#### `workdir/`

Whatever scaffolding the agent should start with: a `package.json`, a
README, sample data — everything here is copied into the per-task CWD.

### What the harness controls — `hooks/`

The `hooks/` directory holds lifecycle scripts the harness runs at
specific phases. Both scripts receive `$WORKDIR` (the per-task agent
CWD) and `$PORT` (a pre-allocated free TCP port) as environment
variables. Neither is ever copied to the agent's working directory.

#### `hooks/preflight.sh`

Optional. Runs before the agent starts. Exit `0` means "scaffold is
healthy, hand off to the agent." A non-zero exit short-circuits the run
and produces a `preflightError` result record (cost zero, no agent
invoked). When the script is absent, the harness proceeds without a
pre-flight probe.

A preflight that starts a background service for the scoring probe to
test against:

```sh
#!/bin/sh
# Start the scaffold's service so the scoring probe has something to hit;
# the harness tears the whole process group down after scoring completes.
node "$WORKDIR/app.js" >/dev/null 2>&1 &
sleep 0.2  # give the server a moment to come up
exit 0
```

The harness spawns the preflight in its own process group and tears down
the entire group (SIGTERM, grace period, SIGKILL) after scoring
completes — background processes do not leak across runs.

#### `hooks/score.sh`

Runs after the agent finishes. In addition to `$WORKDIR` and `$PORT`,
it receives `$RESULTS_FD=3` — a file descriptor for structured per-test
rows.

The **exit code is authoritative**: `0` is pass, anything else is fail.
Rows written to fd 3 are stored on the result record's `scoring.details`
for diagnostics; they cannot override the verdict.

Three grading surfaces are in scope:

```sh
# Running-service probe
RESP="$(curl -sf --max-time 2 "http://127.0.0.1:$PORT/todos")"
test "$RESP" = '[]' && exit 0 || exit 1
```

```sh
# Repository state
sha256sum "$WORKDIR/dist/build.tar.gz" \
  | grep -q '^expected-sha256-prefix' && exit 0 || exit 1
```

```sh
# Process exit
( cd "$WORKDIR" && bun test ) && exit 0 || exit 1
```
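Putting the pieces together, a minimal complete `score.sh` (the endpoint
and the row fields are illustrative, not a harness contract):

```sh
#!/usr/bin/env bash
# Probe the service started by preflight.sh; record one structured row,
# then let the exit code carry the authoritative verdict.
if curl -sf --max-time 2 "http://127.0.0.1:$PORT/todos" >/dev/null; then
  printf '{"test":"get-todos","pass":true}\n' >&"$RESULTS_FD"
  exit 0
fi
printf '{"test":"get-todos","pass":false}\n' >&"$RESULTS_FD"
exit 1
```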

#### Writing to fd 3 from non-bash interpreters

Bash makes fd-3 writing trivial via `>&"$RESULTS_FD"`. From other
languages you open fd 3 explicitly:

```python
import json, os

# RESULTS_FD holds the descriptor number ("3"); write one JSON row per line.
fd = int(os.environ["RESULTS_FD"])
with os.fdopen(fd, "w") as f:
    f.write(json.dumps({"test": "t1", "pass": True}) + "\n")
```

```js
const fs = require("node:fs");

// RESULTS_FD holds the descriptor number ("3"); write one JSON row per line.
const fd = Number(process.env.RESULTS_FD);
fs.writeSync(fd, JSON.stringify({ test: "t1", pass: true }) + "\n");
```

### What the judge uses — `judge.task.md`

The post-hoc judge's prompt. The harness substitutes these template
variables before sending the prompt to the judge:

| Variable | Description |
| --- | --- |
| `{{AGENT_INSTRUCTIONS}}` | Contents of `agent.task.md` |
| `{{AGENT_PROFILE}}` | Agent profile body (empty string if none) |
| `{{AGENT_TRACE_PATH}}` | Absolute path to `agent.ndjson` |
| `{{SCORING_RESULT}}` | JSON scoring object (verdict, details, exitCode) |
| `{{SKILL_SET_HASH}}` | SHA-256 fingerprint from `apm.lock.yaml` |
| `{{TASK_ID}}` | Task name (directory under `tasks/`) |
| `{{TASK_DIR}}` | Agent working directory path |

`{{SCORING}}` is accepted as a legacy alias for `{{SCORING_RESULT}}`.

````md
Scoring outcome:

```json
{{SCORING_RESULT}}
```

The agent's full trace is at `{{AGENT_TRACE_PATH}}` — read it before
deciding. The agent was given task `{{TASK_ID}}` with these instructions:

{{AGENT_INSTRUCTIONS}}

Call `Conclude` with `verdict='success'` when both:

1. `scoring.verdict === "pass"`, and
2. the agent did not violate the test contract (e.g. by editing the
   test file).
````

The judge is a separate session — not the live supervisor. The design
deliberately avoids mixing the "help the agent finish" incentive with
the "grade fairly" incentive.

### What identifies the skill set — `.claude/` and `apm.lock.yaml`

The pre-staged `.claude/` tree carries the skills and agent profiles the
agent will see. `apm.lock.yaml` is the **manifest under test** — the
harness hashes its bytes (LF-normalised) into `skillSetHash` on every
result record. A one-byte change to the lockfile produces a different
hash, so "before-skill-change" and "after-skill-change" runs can be
compared apples-to-apples.

> **Caveat.** `skillSetHash` covers the lockfile bytes only. If you edit
> `.claude/` directly without regenerating the lockfile, the hash won't
> reflect the change. Always re-run your packing tool after editing
> `.claude/`.
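For a quick local fingerprint check (assuming the lockfile already has
LF line endings; the harness's digest encoding is not guaranteed to
match `sha256sum` output exactly):

```sh
sha256sum apm.lock.yaml   # any byte-level diff here implies a new skillSetHash
```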

## Environment Variables

The harness auto-discovers `.env` and `.env.local` files in the family
root and each task directory. Every discovered file is loaded into
`process.env` and rendered into the agent's working directory before
`preflight.sh` runs. `process.env` always wins — existing values are
never overwritten.

- **Locally:** put credentials in `.env.local` (gitignored).
- **In CI:** set secrets as repository env vars — no files needed.

### Example

A task that calls an LLM proxy:

```sh
# tasks/my-rag-task/.env.local (gitignored)
LLMHUB_NONPROD_API_KEY=your-key-here
LLMHUB_PROD_API_KEY=your-key-here
```

The harness renders this into the agent's CWD as `.env.local` with
values resolved from `process.env` (CI secrets override file defaults).
The task's `preflight.sh` can validate the file exists; the agent's
application reads credentials from it.
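A preflight fragment for that check (a sketch):

```sh
#!/bin/sh
# A non-zero exit becomes a preflightError record; no agent is invoked.
test -f "$WORKDIR/.env.local" || exit 1
exit 0
```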

All discovered var names are added to the trace redaction allowlist.

## Run It

```sh
npx fit-benchmark run \
  --family=./my-coding-family \
  --output=./runs/2026-05-11 \
  --runs=5 \
  --agent-profile=coder \
  --judge-profile=judge \
  --max-turns=80
```

Output:

- `./runs/2026-05-11/results.jsonl` — append-only, one record per
  `(task, runIndex)`. Survives partial failures.
- `./runs/2026-05-11/runs/<task-name>/<runIndex>/` — per-run artifacts:
  the agent CWD, the agent trace, the judge trace, the scoring stderr
  log.
- `./runs/2026-05-11/.apm-staging/.claude/` — staged skills/agents.

Each result record carries `skillSetHash`, `familyRevision`, the
combined verdict, scoring details, judge verdict + summary, cost, turn
count, and the absolute paths to both NDJSON traces. The record's
schema is validated at write time, so a malformed write is caught
before the report stage trips over it.
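To skim the raw records, a `jq` one-liner along these lines works; the
key names below are assumptions, so check a real `results.jsonl` line
for the actual schema:

```sh
# Per-(task, runIndex) verdict and cost; key names assumed, not documented.
jq -r '[.taskId, .runIndex, .verdict, .cost] | @tsv' ./runs/2026-05-11/results.jsonl
```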

## Score One Task at a Time

For ad-hoc grading without an agent run:

```sh
npx fit-benchmark score \
  --family=./my-coding-family \
  --task=todo-api \
  --workdir=./runs/2026-05-11/runs/todo-api/0 \
  --output=score.jsonl
```

Useful when iterating on a `hooks/score.sh` script: re-grade an existing
post-run workdir without burning agent cost.

## Aggregate Into pass@k

```sh
npx fit-benchmark report \
  --input=./runs/2026-05-11 \
  --k=1,3,5 \
  --format=text
```

With `--format=text`, the report renders a full markdown document:

- **Summary** — overall pass rate, model, skill-set hash, cost, median
  duration, median turns.
- **Pass@k table** — one row per task with the unbiased HumanEval
  estimator: `pass@k = 1 - C(n-c, k) / C(n, k)` (see the worked example
  after this list).
- **Task details** — per-task sections with a runs table, scoring check
  results, judge commentary (blockquoted), and any agent or preflight
  errors.
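For intuition, take a task with `n = 5` runs of which `c = 1` passed:
`pass@1 = 1 - C(4,1)/C(5,1) = 0.2`, exactly the raw pass rate, while
`pass@3 = 1 - C(4,3)/C(5,3) = 1 - 4/10 = 0.6`.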

With `--format=json` (default), the output is the aggregated pass@k
data only — suitable for machine consumption and before/after diffs.

A `k > n` value emits a structured error row rather than a misleading
number.

## Compare Before and After

The reproducibility claim is the heart of the tool. Run the family
twice — once with the old skill manifest, once with the new — and
compare:

```sh
# Before
npx fit-benchmark run --family=./my-coding-family --output=./runs/before --runs=10
npx fit-benchmark report --input=./runs/before --format=json > before.json

# After (manifest changed)
npx fit-benchmark run --family=./my-coding-family --output=./runs/after --runs=10
npx fit-benchmark report --input=./runs/after --format=json > after.json
```

Each record carries `skillSetHash`, so any cross-comparison script can
verify the two reports came from materially different skill sets before
declaring an improvement.
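A guard along those lines, reading the hash from the result records (a
sketch):

```sh
# Refuse to compare runs generated from the same manifest.
before="$(jq -r '.skillSetHash' ./runs/before/results.jsonl | sort -u)"
after="$(jq -r '.skillSetHash' ./runs/after/results.jsonl | sort -u)"
if [ "$before" = "$after" ]; then
  echo "identical skill sets: any pass@k delta is noise" >&2
  exit 1
fi
```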

## What's Next

<div class="grid">

<a href="ci-workflow/">
<h3>Automate with GitHub Actions</h3>
<p>Run fit-benchmark in CI with the forwardimpact/fit-benchmark composite action — step summaries, artifact upload, and PR-triggered benchmarks.</p>
</a>
<a href="../run-eval/">
<h3>Run an Eval</h3>
<p>Know whether agent changes improved outcomes — an agent-as-judge eval wired into CI with traceable results.</p>
</a>
<a href="../trace-analysis/">
<h3>Analyze Traces</h3>
<p>See exactly what an agent did and why — download traces, query turns, filter by tool or error, and measure token cost.</p>
</a>

</div>
