# Run an Eval


You changed an agent profile, a tool allowlist, or a system prompt -- and now
you need to know whether things got better or worse. `fit-eval supervise` runs a
**judge agent** alongside a **target agent** in a relay loop: the judge watches
the target work turn-by-turn and calls `Conclude` with a verdict. The exit code
(`0` pass, `1` fail) drops into GitHub Actions like any other check. The NDJSON
trace captures every turn so you can inspect what happened with `fit-trace`.

## Prerequisites

- Node.js 18+
- `ANTHROPIC_API_KEY` set in the environment
- `@forwardimpact/libeval` (ships both `fit-eval` and `fit-trace`). Install
  globally with `npm install -g @forwardimpact/libeval`, or invoke ephemerally
  in CI with `npx --yes @forwardimpact/libeval fit-eval ...` (a quick sanity
  check follows this list)
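
A quick sanity check before the first run (the `--help` flag on `fit-eval` is
an assumption here, mirroring the `fit-trace --help` shown later):

```sh
node --version                                       # should print v18 or later
test -n "$ANTHROPIC_API_KEY" && echo "API key is set"
npx --yes @forwardimpact/libeval fit-eval --help     # assumed to exist
```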

## Write the task

A task file is a plain markdown prompt -- what the target agent should do. Keep
it specific and measurable.

```md
<!-- evals/refactor-utils/task.md -->
Refactor `src/utils/format.js` so that `formatDate` and `formatCurrency`
share a single locale-resolution helper. Do not change the public API of
either function. Add unit tests covering the en-US, en-GB, and de-DE
locales. Run the test suite and confirm it passes before finishing.
```

## Write the judge profile

The judge is an agent profile at `.claude/agents/<name>.md`. The runtime appends
an orchestration trailer explaining the available tools -- your profile only
needs to define **what good looks like**.

```md
<!-- .claude/agents/refactor-judge.md -->
---
name: refactor-judge
description: Judge a refactor of shared formatting utilities.
---

You are evaluating a refactor of `src/utils/format.js`. Watch the agent's
work and call `Conclude` when the session is finished.

Pass criteria -- all must hold:

- `formatDate` and `formatCurrency` share a single locale-resolution helper.
- The public signatures of both functions are unchanged.
- New tests exist for en-US, en-GB, and de-DE.
- The full test suite passes on the agent's final run.

If the agent strays, use `Redirect` to bring it back on task. If it claims
to be done, verify the criteria yourself with `Read` and `Bash` before
calling `Conclude`. Conclude with `success: false` if any criterion fails;
include a one-paragraph summary of the gap.
```

Scope the judge's tools with `--supervisor-allowed-tools` (typically
`Read,Grep,Bash` -- enough to read code and run the tests). A judge with
`Edit` access can rewrite the target's work and mask failures.

## Run the eval locally

```sh
npx fit-eval supervise \
  --task-file=evals/refactor-utils/task.md \
  --supervisor-profile=refactor-judge \
  --supervisor-cwd=. \
  --supervisor-allowed-tools=Read,Grep,Bash \
  --agent-cwd=/tmp/refactor-sandbox \
  --max-turns=20 \
  --output=trace.ndjson
```

`--agent-cwd` should point at a sandbox copy of your repo, since the target
agent edits files there. If the flag is omitted, `fit-eval` creates a
temporary directory. The judge stays in `--supervisor-cwd` to inspect the
target's work without writing to it.
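
A sandbox can be as simple as a fresh copy of the working tree -- the same
thing the CI workflow below does:

```sh
# Disposable working copy for the target agent to edit.
rm -rf /tmp/refactor-sandbox
cp -r . /tmp/refactor-sandbox
```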

Exit code `0` means the judge concluded with `success: true`. Exit code `1`
means `success: false`, the turn limit was reached, or an error occurred.
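
Because the verdict is just an exit code, a local script can gate on it the
same way CI does -- a minimal sketch, reusing the invocation above:

```sh
if npx fit-eval supervise \
     --task-file=evals/refactor-utils/task.md \
     --supervisor-profile=refactor-judge \
     --supervisor-cwd=. \
     --supervisor-allowed-tools=Read,Grep,Bash \
     --agent-cwd=/tmp/refactor-sandbox \
     --max-turns=20 \
     --output=trace.ndjson; then
  echo "judge concluded success: true"
else
  echo "verdict was success: false, turn limit hit, or an error occurred"
fi
```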

## Run the eval in GitHub Actions

A single job is enough: check out the repo, run the eval, then split and
upload the trace.

```yaml
# .github/workflows/eval.yml
name: Agent eval

on:
  push:
    branches: [main]
  pull_request:

jobs:
  refactor-utils:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: "20"

      - name: Run eval
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          mkdir -p /tmp/sandbox /tmp/trace
          cp -r . /tmp/sandbox
          npx --yes @forwardimpact/libeval fit-eval supervise \
            --task-file=evals/refactor-utils/task.md \
            --supervisor-profile=refactor-judge \
            --supervisor-cwd=. \
            --supervisor-allowed-tools=Read,Grep,Bash \
            --agent-cwd=/tmp/sandbox \
            --max-turns=20 \
            --output=/tmp/trace/trace.ndjson

      - name: Split trace
        if: always()
        run: |
          npx --yes @forwardimpact/libeval fit-trace split \
            /tmp/trace/trace.ndjson \
            --mode=supervise \
            --output-dir=/tmp/trace

      - name: Upload trace
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: eval-trace
          path: /tmp/trace/*.ndjson
```

`if: always()` on the split and upload steps preserves the trace even when the
eval fails -- which is when you most need it. `split --mode=supervise` produces
`trace-agent.ndjson` and `trace-supervisor.ndjson` alongside the original
combined trace.
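
After the split step the trace directory holds all three files:

```sh
ls /tmp/trace
# trace-agent.ndjson  trace-supervisor.ndjson  trace.ndjson
```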

## Read the results

When an eval fails, download the artifact, start with `overview` and
`timeline` to get oriented, then drill into the verdict.

```sh
npx fit-trace runs                              # find the failed run
npx fit-trace download <run-id>                 # fetch and convert to structured.json
npx fit-trace overview structured.json
npx fit-trace timeline structured.json
npx fit-trace tool structured.json Conclude
```

The `Conclude` tool call carries the judge's verdict and summary. From there,
follow the timeline backwards to find the turn where the agent went wrong.

Run `npx fit-trace --help` for the full command surface.

## Scale to a suite

Each eval is a `task.md` plus a judge profile. Add a matrix to fan them out:

```yaml
strategy:
  fail-fast: false
  matrix:
    eval:
      - { task: refactor-utils, judge: refactor-judge }
      - { task: fix-flaky-test, judge: test-judge }
      - { task: add-rate-limiter, judge: ratelimit-judge }
```

`fail-fast: false` lets every eval run and produce a trace instead of
cancelling the remaining matrix jobs at the first failure.
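
The run step then consumes the matrix entries -- a sketch, assuming the
`evals/<task>/task.md` layout used above:

```yaml
- name: Run eval
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
  run: |
    mkdir -p /tmp/sandbox /tmp/trace
    cp -r . /tmp/sandbox
    npx --yes @forwardimpact/libeval fit-eval supervise \
      --task-file=evals/${{ matrix.eval.task }}/task.md \
      --supervisor-profile=${{ matrix.eval.judge }} \
      --supervisor-cwd=. \
      --supervisor-allowed-tools=Read,Grep,Bash \
      --agent-cwd=/tmp/sandbox \
      --max-turns=20 \
      --output=/tmp/trace/trace.ndjson
```

With `actions/upload-artifact@v4`, artifact names must be unique across
matrix jobs, so name the upload per entry, e.g.
`eval-trace-${{ matrix.eval.task }}`.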

## Tips

- **`--max-turns=0`** removes the turn cap. Use it for exploratory local runs;
  always set a real budget in CI.
- **`--task-amend`** appends extra text to the task without editing the task
  file -- useful for parameterizing the same task across a matrix (see the
  sketch after this list).
- **The judge profile is a system prompt, not a contract.** It steers the judge
  but does not bind it. Treat eval verdicts like a code review from a strong but
  fallible reviewer -- useful signal, not ground truth.
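
A sketch of the `--task-amend` tip -- the amendment text is illustrative, and
the flag is assumed to take the extra text as its value:

```sh
npx fit-eval supervise \
  --task-file=evals/refactor-utils/task.md \
  --task-amend="Also cover the fr-FR locale in the new tests." \
  --supervisor-profile=refactor-judge \
  --supervisor-allowed-tools=Read,Grep,Bash \
  --max-turns=20 \
  --output=trace.ndjson
```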

## What's next

<div class="grid">

<a href="../">
<h3>Prove Agent Changes</h3>
<p>Reproducible evidence that agent changes improved outcomes — from dataset generation through evaluation to trace analysis.</p>
</a>
<a href="../trace-analysis/">
<h3>Analyze Traces</h3>
<p>See exactly what an agent did and why — download traces, query turns, filter by tool or error, and measure token cost.</p>
</a>

</div>
