Agent Evaluations
`fit-eval` is the plumbing for agent-as-judge evaluations. You write a judge agent and a target agent, then `fit-eval supervise` runs them together in a relay loop: the judge sees the target's work turn by turn and signals the verdict by calling its `Conclude` tool. The exit code (0 pass, 1 fail) lets GitHub Actions surface eval results like any other check, and the NDJSON trace captures the full session for inspection with `fit-trace`.
This guide walks from a single eval definition to a CI workflow that runs an eval suite on every push.
Prerequisites
- Node.js 18+
- `ANTHROPIC_API_KEY` available to the workflow
- A repository where you want the eval to run
- The `fit-eval` and `fit-trace` CLIs (both ship in `@forwardimpact/libeval`) — install once with `npm install -g @forwardimpact/libeval` and call `fit-eval`/`fit-trace` directly, or invoke ephemerally in CI with `npx --yes @forwardimpact/libeval fit-eval ...` (no install step needed)
1. Write the task
The task file is a plain markdown prompt — what you want the target agent to do. Keep it specific and measurable; an eval is only as good as the task it asks the agent to perform.
```markdown
<!-- evals/refactor-utils/task.md -->
Refactor `src/utils/format.js` so that `formatDate` and `formatCurrency`
share a single locale-resolution helper. Do not change the public API of
either function. Add unit tests covering the en-US, en-GB, and de-DE
locales. Run the test suite and confirm it passes before finishing.
```
2. Write the judge profile
The judge is an agent profile under `.claude/agents/<name>.md`. The supervisor runtime appends an orchestration trailer that explains how to use `Ask`, `Announce`, `Redirect`, and `Conclude` — your profile only needs to specify what good looks like for this task.
```markdown
<!-- .claude/agents/refactor-judge.md -->
---
name: refactor-judge
description: Judge a refactor of shared formatting utilities.
---

You are evaluating a refactor of `src/utils/format.js`. Watch the agent's
work and call `Conclude` when the session is finished.

Pass criteria — all must hold:

- `formatDate` and `formatCurrency` share a single locale-resolution helper.
- The public signatures of both functions are unchanged.
- New tests exist for en-US, en-GB, and de-DE.
- The full test suite passes on the agent's final run.

If the agent strays, use `Redirect` to bring it back on task. If it claims
to be done, verify the criteria yourself with `Read` and `Bash` before
calling `Conclude`. Conclude with `success: false` if any criterion fails;
include a one-paragraph summary of the gap.
```
The judge has its own working directory (`--supervisor-cwd`) and tool allowlist (`--supervisor-allowed-tools`). Give it whatever it needs to verify the work — typically `Read`, `Grep`, and `Bash` — but not `Write` or `Edit`, since the judge should not be doing the work.
3. Run the eval locally
```shell
npx fit-eval supervise \
  --task-file=evals/refactor-utils/task.md \
  --supervisor-profile=refactor-judge \
  --supervisor-cwd=. \
  --supervisor-allowed-tools=Read,Grep,Bash \
  --agent-cwd=/tmp/refactor-sandbox \
  --allowed-tools=Read,Edit,Write,Bash,Grep,Glob \
  --max-turns=50 \
  --output=trace.ndjson
```
`--agent-cwd` should usually be a sandbox copy of your repo, since the target agent will edit files there. The judge stays in `--supervisor-cwd` to inspect the agent's work without writing to it.
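A minimal way to set that sandbox up locally, using the same path as the command above (any throwaway directory works):

```shell
# create a fresh sandbox copy of the repo for the target agent
rm -rf /tmp/refactor-sandbox
mkdir -p /tmp/refactor-sandbox
cp -r . /tmp/refactor-sandbox
```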
Exit code 0 means the judge concluded with `success: true`; exit code 1 means it concluded with `success: false`, ran out of turns, or errored.
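When scripting around the eval locally, branching on that exit code is all you need. A minimal sketch, where `run_eval` is a hypothetical stand-in for the full `fit-eval supervise` invocation:

```shell
# run_eval stands in for the real `fit-eval supervise` call;
# `false` simulates a run where the judge concluded success: false
run_eval() { false; }

if run_eval; then
  echo "eval passed"
else
  echo "eval failed: inspect trace.ndjson with fit-trace"
fi
```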
4. Run it in GitHub Actions
A two-step workflow is enough: run the eval, then split and upload the trace so you can inspect it later. The eval's exit code is the job's exit code.
```yaml
# .github/workflows/eval.yml
name: Agent eval

on:
  push:
    branches: [main]
  pull_request:

jobs:
  refactor-utils:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - name: Run eval
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          mkdir -p /tmp/sandbox /tmp/trace
          cp -r . /tmp/sandbox
          npx --yes @forwardimpact/libeval fit-eval supervise \
            --task-file=evals/refactor-utils/task.md \
            --supervisor-profile=refactor-judge \
            --supervisor-cwd=. \
            --supervisor-allowed-tools=Read,Grep,Bash \
            --agent-cwd=/tmp/sandbox \
            --allowed-tools=Read,Edit,Write,Bash,Grep,Glob \
            --max-turns=50 \
            --output=/tmp/trace/trace.ndjson
      - name: Split trace
        if: always()
        run: |
          npx --yes @forwardimpact/libeval fit-trace split \
            /tmp/trace/trace.ndjson \
            --mode=supervise \
            --output-dir=/tmp/trace
      - name: Upload trace
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: eval-trace
          path: /tmp/trace/*.ndjson
```
`if: always()` on the split and upload steps ensures the trace is preserved even when the eval fails — which is when you most need it. `split --mode=supervise` produces `trace-agent.ndjson` and `trace-supervisor.ndjson` alongside the original combined trace.
5. Read the results
When an eval fails, download the artifact and start with `overview` and `timeline` to orient. Then drill in.
```shell
npx fit-trace runs               # find the failed run
npx fit-trace download <run-id>  # downloads to /tmp/trace-<run-id>/
npx fit-trace overview /tmp/trace-<run-id>/structured.json
npx fit-trace timeline /tmp/trace-<run-id>/structured.json
npx fit-trace tool /tmp/trace-<run-id>/structured.json Conclude
```
The `Conclude` tool call carries the judge's verdict and summary — that's usually where you start when an eval fails. From there, follow the timeline backwards to find the turn where the agent went wrong.
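For a quick look without `fit-trace`, you can also grep the raw NDJSON for the verdict event. The event shape below is hypothetical — check your trace's actual schema before relying on field names:

```shell
# write a one-line stand-in trace (hypothetical schema), then pull out
# the Conclude event by name
printf '%s\n' '{"type":"tool_call","name":"Conclude","input":{"success":false}}' > /tmp/demo-trace.ndjson
grep '"name":"Conclude"' /tmp/demo-trace.ndjson
```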
See the CLI Reference for the full `fit-trace` command surface.
Scaling to a suite
Each eval is a `task.md` plus one or more judge profiles. Add a matrix to fan them out:
```yaml
strategy:
  fail-fast: false
  matrix:
    eval:
      - { task: refactor-utils, judge: refactor-judge }
      - { task: fix-flaky-test, judge: test-judge }
      - { task: add-rate-limiter, judge: ratelimit-judge }
```
`fail-fast: false` is important — you want every eval's trace, not just the first failure's.
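Wired into the run step, the matrix values select the task file and judge profile. A sketch, assuming one directory per task under `evals/` and the same sandbox setup as step 4:

```yaml
# sketch: each matrix entry picks a task and a judge profile
- name: Run eval
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
  run: |
    npx --yes @forwardimpact/libeval fit-eval supervise \
      --task-file=evals/${{ matrix.eval.task }}/task.md \
      --supervisor-profile=${{ matrix.eval.judge }} \
      --supervisor-cwd=. \
      --supervisor-allowed-tools=Read,Grep,Bash \
      --agent-cwd=/tmp/sandbox \
      --allowed-tools=Read,Edit,Write,Bash,Grep,Glob \
      --max-turns=50 \
      --output=/tmp/trace/trace.ndjson
```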
Notes
- `--max-turns=0` removes the turn cap. Use it for exploratory runs; always set a real budget in CI.
- `--task-amend` appends extra steering text to the task without editing the task file — useful for parameterising the same task across a matrix.
- Judge tool allowlist matters. A judge with `Edit` access can rewrite the agent's work and mask failures. Restrict it to read-only tools.
- The judge's profile is a system prompt, not a contract. It steers the judge but doesn't bind it. Treat eval verdicts as you would a code review from a strong but fallible reviewer — useful signal, not ground truth.
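As a concrete use of `--task-amend`, one task file can be fanned out across a matrix dimension. A sketch of the relevant fragments, assuming they slot into the job from step 4 (other flags unchanged):

```yaml
# fragment: fan one task across locales via --task-amend
strategy:
  fail-fast: false
  matrix:
    locale: [en-US, en-GB, de-DE]
steps:
  - name: Run eval
    env:
      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
    run: |
      npx --yes @forwardimpact/libeval fit-eval supervise \
        --task-file=evals/refactor-utils/task.md \
        --task-amend="Concentrate on the ${{ matrix.locale }} locale." \
        --supervisor-profile=refactor-judge \
        --output=trace.ndjson
```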
Related
- Trace Analysis — read the NDJSON traces this guide produces, with worked examples including a failed eval.
- fit-trace — full `fit-trace` CLI command surface.
- Agent Teams Guide — how agent profiles are authored and what they contain.