Generate an Eval Dataset
You need to produce a dataset for an agent evaluation. The dataset
must include an organization graph, people, an engineering standard,
knowledge-base documents, and activity records -- and you need to
regenerate the whole thing when the schema changes.
`fit-terrain generate` does all of that from a single `.dsl` file.
For the end-to-end workflow that connects dataset generation to evaluation sessions and trace analysis, see Prove Whether Agent Changes Improved Outcomes.
Prerequisites
- Node.js 18+
- `ANTHROPIC_API_KEY` set in the shell (the `generate` verb calls an LLM to produce realistic prose for each entity)
- `@forwardimpact/libterrain` installed:

```shell
npm install -g @forwardimpact/libterrain
```

Or invoke it ephemerally:

```shell
npx --yes @forwardimpact/libterrain fit-terrain --help
```
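Before the first run, it can be worth confirming the key is actually exported. A minimal POSIX-shell check (generic, not part of the tool):

```shell
# Fail fast if ANTHROPIC_API_KEY is not exported; the generate verb
# needs it for its LLM calls.
if [ -z "${ANTHROPIC_API_KEY:-}" ]; then
  echo "ANTHROPIC_API_KEY is not set" >&2
else
  echo "ANTHROPIC_API_KEY is set"
fi
```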
Write the DSL file
Create a `.dsl` file that declares the organization, people distribution, and engineering standard. The minimum viable DSL needs four top-level blocks:
```
// evals/terrain/story.dsl
terrain Acme {
  domain "acme.example"
  industry "fintech"
  seed 42

  org headquarters {
    name "Acme HQ"
    location "London, UK"
  }

  department engineering {
    name "Engineering"
    parent headquarters
    headcount 20

    team payments {
      name "Payments Team"
      size 8
      repos ["payments-api", "ledger-service"]
    }
  }

  people {
    count 20
    distribution { J060 50% J070 30% J080 20% }
    disciplines { software_engineering 80% data_engineering 20% }
  }

  standard {
    // Full standard block: proficiencies, maturities, levels,
    // capabilities, behaviours, disciplines, tracks, drivers.
    // See the complete example in the end-to-end guide.
  }
}
```
A complete `standard` block with capabilities, behaviours, disciplines, and levels is shown in the end-to-end guide. The `seed` field makes the entity graph deterministic -- the same seed produces the same people, assignments, and proficiency ratings on every run.
Generate the dataset
Run `generate` to fill the prose cache and build all output:

```shell
npx fit-terrain generate --story=evals/terrain/story.dsl
```
The pipeline walks a DAG of stages in dependency order:
| Stage | What it does |
|---|---|
| `parse` | Reads and parses the DSL file |
| `entities` | Generates the organization graph, people, and assignments |
| `prose-keys` | Collects every key that needs prose (bios, summaries, reviews) |
| `cache-lookup` | Resolves each key through an LLM, caching results to disk |
| `skeleton` | Renders deterministic HTML structure for knowledge documents |
| `enriched` | Fills the skeleton with cached prose |
| `raw` | Renders raw activity documents |
| `markdown` | Renders personal markdown documents |
| `pathway` | Renders engineering standard YAML from the `standard` block |
| `datasets` | Runs any external dataset tools (Faker, Synthea, SDV) |
| `validate` | Checks entity consistency and HTML structure |
| `write` | Merges all output and writes to disk |
The prose cache persists to `data/synthetic/prose-cache.json` by default. Subsequent runs with the same DSL reuse cached prose, so only new or changed keys cost API calls.
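To see why reruns are cheap, here is a toy model of the lookup (the real cache format is not shown here): only keys absent from the cache would trigger new LLM calls.

```shell
# Toy model of cache reuse (not the real prose-cache.json format):
# diff the keys already cached against the keys the DSL now needs.
printf '%s\n' "bio:alice" "summary:payments" | sort > cached.txt
printf '%s\n' "bio:alice" "bio:bob" "summary:payments" | sort > needed.txt
comm -13 cached.txt needed.txt   # lines only in needed.txt are the misses
```

Here only `bio:bob` is a miss, so only that key would cost an API call.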
After the run completes, the `data/` directory contains the full dataset:

```
data/
  pathway/     Engineering standard YAML (capabilities, levels, disciplines)
  knowledge/   HTML knowledge-base documents with microdata
  personal/    Personal markdown documents
  activity/    Activity records and evidence
  synthetic/   Prose cache
```
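A quick sanity pass over the output tree (a sketch; directory names are taken from the listing above) can confirm a run wrote everything:

```shell
# Sketch: confirm each expected output directory exists after a run.
for d in pathway knowledge personal activity synthetic; do
  if [ -d "data/$d" ]; then echo "data/$d ok"; else echo "data/$d missing"; fi
done
```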
Verify without regenerating
Two verbs let you check the dataset without making LLM calls.

Check cache completeness -- `check` reports how many prose keys are cached versus missing, and exits with code 1 if any key is missing:

```shell
npx fit-terrain check --story=evals/terrain/story.dsl
```
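Because `check` signals misses through its exit status, it slots into CI gating. A sketch, where `fit_check` is a stand-in for the real `npx fit-terrain check` invocation:

```shell
# Sketch of gating a CI step on the check verb's exit status.
# fit_check stands in for: npx fit-terrain check --story=evals/terrain/story.dsl
fit_check() { return 1; }   # simulate an incomplete cache (exit code 1)
if fit_check; then
  echo "prose cache complete"
else
  echo "cache incomplete: run fit-terrain generate" >&2
fi
```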
Validate structure -- `validate` runs entity and cross-content checks without writing files. Use it after editing the DSL to catch errors before a full rebuild:

```shell
npx fit-terrain validate --story=evals/terrain/story.dsl
```
Rebuild a subset
When only part of the dataset needs refreshing, use `build` with `--only` to render a single content type:

```shell
npx fit-terrain build --story=evals/terrain/story.dsl --only=pathway
```

Valid `--only` values: `html`, `pathway`, `raw`, `markdown`. Omitting `--only` renders everything.
The `build` verb uses the existing prose cache but does not call the LLM. If the cache has misses, the output includes a warning:

```
⚠ 12 prose cache misses — run "fit-terrain generate" to fill the cache.
```
Override defaults
| Option | Default | Purpose |
|---|---|---|
| `--story` | `data/synthetic/story.dsl` | Path to the DSL file |
| `--cache` | `data/synthetic/prose-cache.json` | Path to the prose cache file |
| `--model` | `claude-opus-4-7` (via config) | LLM model for `generate` |
All paths are relative to the working directory.
Inspect a pipeline stage
To debug or understand the intermediate output of any stage, use
inspect:
npx fit-terrain inspect entities --story=evals/terrain/story.dsl
This prints the stage's output as formatted JSON. Valid stage names match the pipeline table above: `parse`, `entities`, `prose-keys`, `cache-lookup`, `skeleton`, `enriched`, `raw`, `markdown`, `pathway`, `datasets`, `validate`, `write`.
What's next
- Prove Whether Agent Changes Improved Outcomes -- the full workflow from dataset generation through evaluation sessions to trace analysis.
- Terrain Internals -- architecture, DSL grammar, and node table design.