Run an Eval

You changed an agent profile, a tool allowlist, or a system prompt -- and now you need to know whether things got better or worse. fit-eval supervise runs a judge agent alongside a target agent in a relay loop: the judge watches the target work turn-by-turn and calls Conclude with a verdict. The exit code (0 pass, 1 fail) drops into GitHub Actions like any other check. The NDJSON trace captures every turn so you can inspect what happened with fit-trace.

Prerequisites

Node.js 18+
ANTHROPIC_API_KEY set in the environment
@forwardimpact/libeval (ships both fit-eval and fit-trace). Install globally with npm install -g @forwardimpact/libeval, or invoke ephemerally in CI with npx --yes @forwardimpact/libeval fit-eval ...

Write the task

A task file is a plain markdown prompt -- what the target agent should do. Keep it specific and measurable.

<!-- evals/refactor-utils/task.md -->
Refactor `src/utils/format.js` so that `formatDate` and `formatCurrency`
share a single locale-resolution helper. Do not change the public API of
either function. Add unit tests covering the en-US, en-GB, and de-DE
locales. Run the test suite and confirm it passes before finishing.

Write the judge profile

The judge is an agent profile at .claude/agents/<name>.md. The runtime appends an orchestration trailer explaining the available tools -- your profile only needs to define what good looks like.

<!-- .claude/agents/refactor-judge.md -->
---
name: refactor-judge
description: Judge a refactor of shared formatting utilities.
---

You are evaluating a refactor of `src/utils/format.js`. Watch the agent's
work and call `Conclude` when the session is finished.

Pass criteria -- all must hold:

- `formatDate` and `formatCurrency` share a single locale-resolution helper.
- The public signatures of both functions are unchanged.
- New tests exist for en-US, en-GB, and de-DE.
- The full test suite passes on the agent's final run.

If the agent strays, use `Redirect` to bring it back on task. If it claims
to be done, verify the criteria yourself with `Read` and `Bash` before
calling `Conclude`. Conclude with `success: false` if any criterion fails;
include a one-paragraph summary of the gap.

Give the judge read-only tools via --supervisor-allowed-tools (typically Read,Grep,Bash). A judge with Edit access can rewrite the target's work and mask failures.

Run the eval locally

npx fit-eval supervise \
  --task-file=evals/refactor-utils/task.md \
  --supervisor-profile=refactor-judge \
  --supervisor-cwd=. \
  --supervisor-allowed-tools=Read,Grep,Bash \
  --agent-cwd=/tmp/refactor-sandbox \
  --max-turns=20 \
  --output=trace.ndjson

--agent-cwd should be a sandbox copy of your repo since the target agent edits files there. When omitted, fit-eval creates a temporary directory. The judge stays in --supervisor-cwd to inspect the target's work without writing to it.

Exit code 0 means the judge concluded with success: true. Exit code 1 means success: false, the turn limit was reached, or an error occurred.

Run the eval in GitHub Actions

A two-step workflow is enough: run the eval, then split and upload the trace.

# .github/workflows/eval.yml
name: Agent eval

on:
  push:
    branches: [main]
  pull_request:

jobs:
  refactor-utils:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: "20"

      - name: Run eval
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          mkdir -p /tmp/sandbox /tmp/trace
          cp -r . /tmp/sandbox
          npx --yes @forwardimpact/libeval fit-eval supervise \
            --task-file=evals/refactor-utils/task.md \
            --supervisor-profile=refactor-judge \
            --supervisor-cwd=. \
            --supervisor-allowed-tools=Read,Grep,Bash \
            --agent-cwd=/tmp/sandbox \
            --max-turns=20 \
            --output=/tmp/trace/trace.ndjson

      - name: Split trace
        if: always()
        run: |
          npx --yes @forwardimpact/libeval fit-trace split \
            /tmp/trace/trace.ndjson \
            --mode=supervise \
            --output-dir=/tmp/trace

      - name: Upload trace
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: eval-trace
          path: /tmp/trace/*.ndjson

if: always() on the split and upload steps preserves the trace even when the eval fails -- which is when you most need it. split --mode=supervise produces trace-agent.ndjson and trace-supervisor.ndjson alongside the original combined trace.

Read the results

When an eval fails, download the artifact and start with overview and timeline to orient, then drill into the verdict.

npx fit-trace runs                              # find the failed run
npx fit-trace download <run-id>                 # downloads and auto-converts
npx fit-trace overview structured.json
npx fit-trace timeline structured.json
npx fit-trace tool structured.json Conclude

The Conclude tool call carries the judge's verdict and summary. From there, follow the timeline backwards to find the turn where the agent went wrong.

Run npx fit-trace --help for the full command surface.

Scale to a suite

Each eval is a task.md plus a judge profile. Add a matrix to fan them out:

strategy:
  fail-fast: false
  matrix:
    eval:
      - { task: refactor-utils, judge: refactor-judge }
      - { task: fix-flaky-test, judge: test-judge }
      - { task: add-rate-limiter, judge: ratelimit-judge }

fail-fast: false ensures every eval runs and produces a trace, not just the first failure.

Tips

--max-turns=0 removes the turn cap. Use it for exploratory local runs; always set a real budget in CI.
--task-amend appends extra text to the task without editing the task file -- useful for parameterizing the same task across a matrix.
The judge profile is a system prompt, not a contract. It steers the judge but does not bind it. Treat eval verdicts like a code review from a strong but fallible reviewer -- useful signal, not ground truth.

What's next

Prove Agent Changes

Reproducible evidence that agent changes improved outcomes — from dataset generation through evaluation to trace analysis.

Analyze Traces

See exactly what an agent did and why — download traces, query turns, filter by tool or error, and measure token cost.