Run an Eval
You changed an agent profile, a tool allowlist, or a system prompt -- and now you need to know whether things got better or worse. `fit-eval supervise` runs a judge agent alongside a target agent in a relay loop: the judge watches the target work turn-by-turn and calls `Conclude` with a verdict. The exit code (0 pass, 1 fail) drops into GitHub Actions like any other check. The NDJSON trace captures every turn so you can inspect what happened with `fit-trace`.
Prerequisites
- Node.js 18+
- `ANTHROPIC_API_KEY` set in the environment
- `@forwardimpact/libeval` (ships both `fit-eval` and `fit-trace`). Install globally with `npm install -g @forwardimpact/libeval`, or invoke ephemerally in CI with `npx --yes @forwardimpact/libeval fit-eval ...`
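A quick sanity check before the first run can save a wasted API call. This is a minimal sketch that only verifies the environment variable; `check_env` is a helper written for this example, not part of the toolchain:

```shell
#!/bin/sh
# Sanity-check prerequisites before the first eval run.
# check_env is a local helper: returns 0 when the variable named by $1
# is set and non-empty.
check_env() {
  eval "val=\${$1:-}"
  [ -n "$val" ]
}

if check_env ANTHROPIC_API_KEY; then
  echo "ANTHROPIC_API_KEY: set"
else
  echo "ANTHROPIC_API_KEY: missing" >&2
fi
```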
Write the task
A task file is a plain markdown prompt -- what the target agent should do. Keep it specific and measurable.
```markdown
<!-- evals/refactor-utils/task.md -->
Refactor `src/utils/format.js` so that `formatDate` and `formatCurrency`
share a single locale-resolution helper. Do not change the public API of
either function. Add unit tests covering the en-US, en-GB, and de-DE
locales. Run the test suite and confirm it passes before finishing.
```
Write the judge profile
The judge is an agent profile at `.claude/agents/<name>.md`. The runtime appends an orchestration trailer explaining the available tools -- your profile only needs to define what good looks like.
```markdown
<!-- .claude/agents/refactor-judge.md -->
---
name: refactor-judge
description: Judge a refactor of shared formatting utilities.
---

You are evaluating a refactor of `src/utils/format.js`. Watch the agent's
work and call `Conclude` when the session is finished.

Pass criteria -- all must hold:

- `formatDate` and `formatCurrency` share a single locale-resolution helper.
- The public signatures of both functions are unchanged.
- New tests exist for en-US, en-GB, and de-DE.
- The full test suite passes on the agent's final run.

If the agent strays, use `Redirect` to bring it back on task. If it claims
to be done, verify the criteria yourself with `Read` and `Bash` before
calling `Conclude`. Conclude with `success: false` if any criterion fails;
include a one-paragraph summary of the gap.
```
Give the judge read-only tools via `--supervisor-allowed-tools` (typically `Read,Grep,Bash`). A judge with `Edit` access can rewrite the target's work and mask failures.
Run the eval locally
```shell
npx fit-eval supervise \
  --task-file=evals/refactor-utils/task.md \
  --supervisor-profile=refactor-judge \
  --supervisor-cwd=. \
  --supervisor-allowed-tools=Read,Grep,Bash \
  --agent-cwd=/tmp/refactor-sandbox \
  --max-turns=20 \
  --output=trace.ndjson
```
`--agent-cwd` should be a sandbox copy of your repo, since the target agent edits files there. When omitted, `fit-eval` creates a temporary directory. The judge stays in `--supervisor-cwd` to inspect the target's work without writing to it.
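Preparing that sandbox is a one-liner with `mktemp`. A minimal sketch -- the `$REPO` directory and its contents stand in for your actual checkout:

```shell
#!/bin/sh
# Prepare a disposable sandbox for --agent-cwd: the target edits files here,
# while the judge reads the pristine checkout from --supervisor-cwd.
# REPO is an illustrative stand-in for your real repository.
REPO=$(mktemp -d)
SANDBOX=$(mktemp -d)
echo 'module.exports = {}' > "$REPO/format.js"   # stand-in source file
cp -R "$REPO"/. "$SANDBOX"                        # copy the working tree
echo "sandbox: $SANDBOX"
```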
Exit code 0 means the judge concluded with `success: true`. Exit code 1 means `success: false`, the turn limit was reached, or an error occurred.
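That makes shell-level gating straightforward. A sketch -- `run_eval` here is a stand-in for the real `npx fit-eval supervise ...` invocation:

```shell
#!/bin/sh
# Gate on fit-eval's exit code: 0 = judge passed, 1 = fail/limit/error.
# run_eval stands in for the real invocation.
run_eval() { return 1; }   # simulate a failing eval

if run_eval; then
  echo "PASS: judge concluded success: true"
else
  echo "FAIL: inspect trace.ndjson with fit-trace"
fi
```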
Run the eval in GitHub Actions
A two-step workflow is enough: run the eval, then split and upload the trace.
```yaml
# .github/workflows/eval.yml
name: Agent eval

on:
  push:
    branches: [main]
  pull_request:

jobs:
  refactor-utils:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - name: Run eval
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          mkdir -p /tmp/sandbox /tmp/trace
          cp -r . /tmp/sandbox
          npx --yes @forwardimpact/libeval fit-eval supervise \
            --task-file=evals/refactor-utils/task.md \
            --supervisor-profile=refactor-judge \
            --supervisor-cwd=. \
            --supervisor-allowed-tools=Read,Grep,Bash \
            --agent-cwd=/tmp/sandbox \
            --max-turns=20 \
            --output=/tmp/trace/trace.ndjson
      - name: Split trace
        if: always()
        run: |
          npx --yes @forwardimpact/libeval fit-trace split \
            /tmp/trace/trace.ndjson \
            --mode=supervise \
            --output-dir=/tmp/trace
      - name: Upload trace
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: eval-trace
          path: /tmp/trace/*.ndjson
```
`if: always()` on the split and upload steps preserves the trace even when the eval fails -- which is when you most need it. `split --mode=supervise` produces `trace-agent.ndjson` and `trace-supervisor.ndjson` alongside the original combined trace.
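Conceptually, the split is a partition of the combined trace by event source. This sketch assumes each NDJSON line tags its origin with a `role` field -- the real trace schema may differ, so treat the field names as hypothetical:

```shell
#!/bin/sh
# Conceptual sketch of what `fit-trace split --mode=supervise` does.
# The "role" field is an assumed schema, not the documented one.
printf '%s\n' \
  '{"role":"agent","turn":1,"event":"tool_call"}' \
  '{"role":"supervisor","turn":1,"event":"observation"}' \
  '{"role":"agent","turn":2,"event":"message"}' > trace.ndjson

grep '"role":"agent"'      trace.ndjson > trace-agent.ndjson
grep '"role":"supervisor"' trace.ndjson > trace-supervisor.ndjson
```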
Read the results
When an eval fails, download the artifact and start with `overview` and `timeline` to orient, then drill into the verdict.
```shell
npx fit-trace runs                            # find the failed run
npx fit-trace download <run-id>               # downloads and auto-converts
npx fit-trace overview structured.json
npx fit-trace timeline structured.json
npx fit-trace tool structured.json Conclude
```
The `Conclude` tool call carries the judge's verdict and summary. From there, follow the timeline backwards to find the turn where the agent went wrong.
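If you want to script the verdict check rather than eyeball it, a plain `grep` over the structured trace works in a pinch. The `"tool":"Conclude"` and `"success"` field names below are hypothetical placeholders -- check your own trace for the actual shape:

```shell
#!/bin/sh
# Pull the judge's verdict out of a structured trace without fit-trace.
# The JSON shape here is a fabricated stand-in for illustration only.
cat > structured.json <<'EOF'
{"events":[{"tool":"Read"},{"tool":"Conclude","input":{"success":false}}]}
EOF
grep -o '"tool":"Conclude"[^}]*}' structured.json
```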
Run `npx fit-trace --help` for the full command surface.
Scale to a suite
Each eval is a `task.md` plus a judge profile. Add a matrix to fan them out:
```yaml
strategy:
  fail-fast: false
  matrix:
    eval:
      - { task: refactor-utils, judge: refactor-judge }
      - { task: fix-flaky-test, judge: test-judge }
      - { task: add-rate-limiter, judge: ratelimit-judge }
```
`fail-fast: false` ensures every eval runs and produces a trace, not just the first failure.
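Wiring the matrix values into the run step uses standard Actions expression syntax. A sketch of the eval step under that matrix -- flags mirror the single-eval workflow, and the trace filename keys off the task name so artifacts don't collide:

```yaml
# Sketch: consume the matrix entries in the eval step.
- name: Run eval
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
  run: |
    mkdir -p /tmp/sandbox /tmp/trace
    cp -r . /tmp/sandbox
    npx --yes @forwardimpact/libeval fit-eval supervise \
      --task-file=evals/${{ matrix.eval.task }}/task.md \
      --supervisor-profile=${{ matrix.eval.judge }} \
      --supervisor-cwd=. \
      --supervisor-allowed-tools=Read,Grep,Bash \
      --agent-cwd=/tmp/sandbox \
      --max-turns=20 \
      --output=/tmp/trace/${{ matrix.eval.task }}.ndjson
```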
Tips
- `--max-turns=0` removes the turn cap. Use it for exploratory local runs; always set a real budget in CI.
- `--task-amend` appends extra text to the task without editing the task file -- useful for parameterizing the same task across a matrix.
- The judge profile is a system prompt, not a contract. It steers the judge but does not bind it. Treat eval verdicts like a code review from a strong but fallible reviewer -- useful signal, not ground truth.