Prove Agent Changes
You changed an agent profile, tightened a tool allowlist, or rewrote
a system prompt. The question is whether the change actually helped.
Answering that question requires a dataset you can regenerate when
the schema changes, a session that captures every turn, and an
analysis method that connects observed behavior to actionable
findings. This guide walks the full arc with
fit-terrain and fit-eval, then hands off
to fit-trace for the reading.
Prerequisites
- Node.js 18+
- ANTHROPIC_API_KEY set in the shell (used by both fit-terrain generate and fit-eval)
- A repository where agents will work
- The three CLIs ship in two packages -- install once:
npm install -g @forwardimpact/libeval @forwardimpact/libterrain
Or invoke ephemerally with npx:
npx --yes @forwardimpact/libterrain fit-terrain --help
npx --yes @forwardimpact/libeval fit-eval --help
npx --yes @forwardimpact/libeval fit-trace --help
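Both fit-terrain generate and fit-eval read ANTHROPIC_API_KEY from the environment. If the key is not in your shell profile, export it per session before running anything (the value below is a placeholder):
export ANTHROPIC_API_KEY="sk-ant-..."   # placeholder -- substitute your own key
test -n "$ANTHROPIC_API_KEY" && echo "key set"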
1. Define the dataset in a DSL file
fit-terrain reads a single .dsl file that
declares everything the pipeline needs: an organization graph,
people distribution, projects, scenarios, an engineering standard,
content types, and external datasets. The pipeline parses the DSL,
generates entities, resolves prose through an LLM-backed cache,
renders output in multiple formats, and validates the result -- all
from one source.
Create a minimal DSL file. This example declares a small organization with one team, a people distribution, and an engineering standard:
// evals/terrain/story.dsl
terrain Acme {
  domain "acme.example"
  industry "fintech"
  seed 42
  org headquarters {
    name "Acme HQ"
    location "London, UK"
  }
  department engineering {
    name "Engineering"
    parent headquarters
    headcount 20
    team payments {
      name "Payments Team"
      size 8
      repos ["payments-api", "ledger-service"]
    }
  }
  people {
    count 20
    distribution {
      J060 50%
      J070 30%
      J080 20%
    }
    disciplines {
      software_engineering 80%
      data_engineering 20%
    }
  }
  standard {
    proficiencies [awareness, foundational, working, practitioner, expert]
    maturities [emerging, developing, practicing, role_modeling, exemplifying]
    levels {
      J060 { title "Engineer" rank 2 experience "2-4 years" }
      J070 { title "Senior Engineer" rank 3 experience "4-7 years" }
      J080 { title "Lead Engineer" rank 4 experience "7-10 years" }
    }
    capabilities {
      delivery {
        name "Delivery"
        skills [full_stack_development, problem_discovery]
      }
      reliability {
        name "Reliability"
        skills [sre_practices, incident_management]
      }
    }
    behaviours {
      outcome_ownership { name "Own the Outcome" }
      systems_thinking { name "Think in Systems" }
    }
    disciplines {
      software_engineering {
        roleTitle "Software Engineer"
        specialization "Software Engineering"
        core [full_stack_development, sre_practices]
        supporting [problem_discovery]
        broad [incident_management]
        validTracks [null]
      }
      data_engineering {
        roleTitle "Data Engineer"
        specialization "Data Engineering"
        core [problem_discovery, incident_management]
        supporting [full_stack_development]
        broad [sre_practices]
        validTracks [null]
      }
    }
    tracks {}
    drivers {}
  }
}
The DSL supports additional blocks -- project,
scenario, snapshots, content,
dataset, and output -- that add projects,
time-based scenarios, external tool-generated datasets (Synthea,
SDV, Faker), and rendered output files. Start small and add blocks
as your evaluation demands more context.
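As a purely illustrative sketch -- the block name comes from the list above, but the fields inside are assumptions, so check the DSL reference for the real keys -- a project block might look something like:
// illustrative only -- field names below are assumptions, not the real schema
project checkout_revamp {
  name "Checkout Revamp"
  team payments
}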
2. Generate and validate the dataset
The pipeline has four verbs. Use them in sequence during the initial setup; on subsequent runs, use only the ones you need:
npx fit-terrain check --story=evals/terrain/story.dsl
check parses the DSL, generates entities, and reports
prose cache completeness. On a fresh file every key will be a miss
-- that is expected.
npx fit-terrain generate --story=evals/terrain/story.dsl
generate fills the prose cache via the LLM, then
renders and validates all content. The cache is persisted to
data/synthetic/prose-cache.json by default (override
with --cache). Subsequent runs with the same DSL reuse
cached prose, so only new or changed keys cost API calls.
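If you prefer to keep the cache next to the story rather than under data/synthetic/, pass the path explicitly, for example:
npx fit-terrain generate --story=evals/terrain/story.dsl \
  --cache=evals/terrain/prose-cache.json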
npx fit-terrain validate --story=evals/terrain/story.dsl
validate runs entity and cross-content checks without
writing files. Use it after editing the DSL to catch structural
errors before a full build.
npx fit-terrain build --story=evals/terrain/story.dsl
build renders and writes all content types. Add
--only=pathway to render only the engineering standard
YAML, or --only=html for knowledge-base documents. The
output lands under data/ in the working directory.
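For example, to rebuild a single content type after a DSL edit:
npx fit-terrain build --story=evals/terrain/story.dsl --only=pathway
npx fit-terrain build --story=evals/terrain/story.dsl --only=html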
After the build, the data/ tree contains everything the
eval needs: engineering standard definitions, knowledge-base
documents, activity records, and any external datasets declared in
the DSL.
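A quick way to confirm the build wrote what you expect -- the exact layout depends on the content types your DSL declares:
find data -maxdepth 2 -type d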
3. Write the eval task and profiles
With the dataset in place, write the task file and agent profiles
that will exercise the change you want to evaluate. The task is a
markdown prompt; the profiles live under
.claude/agents/.
A task for evaluating a refactored formatting utility:
<!-- evals/refactor-utils/task.md -->
Refactor `src/utils/format.js` so that `formatDate` and `formatCurrency`
share a single locale-resolution helper. Do not change the public API of
either function. Add unit tests covering the en-US, en-GB, and de-DE
locales. Run the test suite and confirm it passes before finishing.
A judge profile for supervised evaluation:
<!-- .claude/agents/refactor-judge.md -->
---
name: refactor-judge
description: Evaluate a refactor of shared formatting utilities.
---
You are evaluating a refactor of `src/utils/format.js`. Watch the agent's
work and call `Conclude` when the session is finished.
Pass criteria -- all must hold:
- `formatDate` and `formatCurrency` share a single locale-resolution helper.
- The public signatures of both functions are unchanged.
- New tests exist for en-US, en-GB, and de-DE.
- The full test suite passes on the agent's final run.
If the agent strays, use `Redirect` to bring it back on task. If it claims
to be done, verify the criteria yourself with `Read` and `Bash` before
calling `Conclude`. Conclude with `success: false` if any criterion fails;
include a one-paragraph summary of the gap.
For facilitated sessions with multiple specialists, write a
facilitator profile and one profile per participant. Each
participant only needs to describe its specialism -- the runtime
appends the orchestration tools (Ask,
Answer, Announce, RollCall,
Conclude) automatically.
<!-- .claude/agents/release-facilitator.md -->
---
name: release-facilitator
description: Coordinate a release-readiness review across specialist agents.
---
You are facilitating a release-readiness review. The participants are
`security-engineer`, `release-engineer`, and `technical-writer`.
1. `Announce` the goal: confirm whether the current release is ready to ship.
2. `Ask` each participant for their go/no-go, one at a time.
3. If any participant reports a blocker, `Announce` the blocker so the
others can react, then ask whether they want to revise their position.
4. `Conclude` with `success: true` if all three are go; otherwise
`success: false` with a one-paragraph summary of the blocker.
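A participant profile stays minimal. As a sketch -- the exact criteria are yours to define -- the security specialist referenced above might read:
<!-- .claude/agents/security-engineer.md -->
---
name: security-engineer
description: Assess release readiness from a security perspective.
---
You are the security specialist in a release-readiness review. When asked for
a go/no-go, check for unresolved vulnerability reports, secrets or credentials
in the diff, and missing security sign-offs. Answer with a clear go or no-go
and one sentence of justification.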
4. Run the eval
For a supervised evaluation (one agent, one judge):
npx fit-eval supervise \
--task-file=evals/refactor-utils/task.md \
--supervisor-profile=refactor-judge \
--supervisor-cwd=. \
--supervisor-allowed-tools=Read,Grep,Bash \
--agent-cwd=/tmp/refactor-sandbox \
--allowed-tools=Read,Edit,Write,Bash,Grep,Glob \
--max-turns=50 \
--output=trace.ndjson
Exit code 0 means the judge concluded with
success: true; exit code 1 means it
concluded success: false, ran out of turns, or errored.
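Because the verdict is carried in the exit code, a before/after comparison can be scripted directly. A sketch that repeats the same eval and counts passing runs -- run count is arbitrary, flags as in the command above:
# Count passing runs for the current profiles; repeat after the change and compare
passes=0
for i in 1 2 3; do
  if npx fit-eval supervise \
      --task-file=evals/refactor-utils/task.md \
      --supervisor-profile=refactor-judge \
      --supervisor-cwd=. \
      --supervisor-allowed-tools=Read,Grep,Bash \
      --agent-cwd=/tmp/refactor-sandbox \
      --allowed-tools=Read,Edit,Write,Bash,Grep,Glob \
      --max-turns=50 \
      --output=trace-run$i.ndjson; then
    passes=$((passes + 1))
  fi
done
echo "$passes/3 runs passed"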
For a facilitated session (one facilitator, N participants):
npx fit-eval facilitate \
--task-file=sessions/release-review/task.md \
--facilitator-profile=release-facilitator \
--facilitator-cwd=. \
--agent-profiles=security-engineer,release-engineer,technical-writer \
--agent-cwd=. \
--max-turns=20 \
--output=trace.ndjson
Participants share --agent-cwd by default. If two
participants might edit the same file, give each its own working
directory or restrict tool allowlists so only one can write.
--max-turns=20 is the default for
facilitate -- always set a budget so a stuck
participant cannot run the session indefinitely.
The --task-file content is visible to every agent in
the session as the opening prompt. The facilitator profile steers
how the goal is pursued; the participants apply their specialisms.
5. Verify the trace
After the run, confirm the trace file exists and contains the expected structure before investing time in analysis:
npx fit-trace overview trace.ndjson
npx fit-trace timeline trace.ndjson
npx fit-trace stats trace.ndjson
overview reports metadata, turn count, and tool usage
frequency. timeline
prints one line per turn so you can see the shape of the session at
a glance.
stats breaks down token usage and cost.
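The trace is plain NDJSON, so standard tools work too if you want to peek at the raw records before reaching for the subcommands (field names vary by event type; jq is assumed to be installed):
head -n 3 trace.ndjson | jq .
wc -l trace.ndjson   # one JSON object per line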
For supervised and facilitated runs, split the combined trace into per-source files:
npx fit-trace split trace.ndjson --mode=supervise
npx fit-trace split trace.ndjson --mode=facilitate
This produces trace-agent.ndjson and
trace-supervisor.ndjson (for supervise) or
trace-facilitator.ndjson and
trace-<participant>.ndjson (for
facilitate). Per-source traces are essential when
participants disagreed -- you can read each one's view
independently.
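The per-source files are themselves valid traces, so the same subcommands apply to each one, for example:
npx fit-trace timeline trace-facilitator.ndjson
npx fit-trace timeline trace-security-engineer.ndjson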
6. Analyze traces for findings
The trace is qualitative data. The most useful analysis comes from reading it like a researcher, not running a checklist. Drill into specific tools and message exchanges:
npx fit-trace tool trace.ndjson Conclude
npx fit-trace tool trace.ndjson Ask
npx fit-trace tool trace.ndjson Announce
npx fit-trace filter trace.ndjson --tool Edit
npx fit-trace search trace.ndjson 'error|fail' --context 1
npx fit-trace reasoning trace.ndjson
The Conclude call carries the verdict -- start there
when an eval fails, then follow the timeline backwards. For
facilitated sessions, walk Announce (broadcasts) and
Ask/Answer (targeted exchanges) to see how
the participants converged or where they diverged.
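A typical first pass on a failed supervised run, using the split files from the previous step:
# Read the verdict, then walk backwards through the agent's work
npx fit-trace tool trace-supervisor.ndjson Conclude
npx fit-trace search trace-agent.ndjson 'error|fail' --context 1
npx fit-trace timeline trace-agent.ndjson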
For the full analysis method -- grounded-theory coding, pattern identification, and writing findings that are grounded, testable, and actionable -- see the Trace Analysis guide.
What's next
This guide covered the full arc from dataset definition through session execution to trace verification. Each stage has a dedicated guide for deeper work:
- Agent Evaluations -- write judge profiles, wire evals into CI with GitHub Actions, and scale to a matrix suite.
- Trace Analysis -- the grounded-theory analysis method with two worked examples: an eval that failed and a multi-agent session that stalled.