Generate an Eval Dataset

You need to produce a dataset for an agent evaluation. The dataset must include an organization graph, people, an engineering standard, knowledge-base documents, and activity records -- and you need to regenerate the whole thing when the schema changes. fit-terrain generate does all of that from a single .dsl file.

For the end-to-end workflow that connects dataset generation to evaluation sessions and trace analysis, see Prove Whether Agent Changes Improved Outcomes.

Prerequisites

Node.js 18+
ANTHROPIC_API_KEY set in the shell (the generate verb calls an LLM to produce realistic prose for each entity)
@forwardimpact/libterrain installed:

npm install -g @forwardimpact/libterrain

Or invoke ephemerally:

npx --yes @forwardimpact/libterrain fit-terrain --help
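
The two environment prerequisites can be checked before a run. The following is a minimal sketch, not something fit-terrain ships: node_major_ok and preflight are hypothetical helper names introduced here, and the only external call is node --version.

```shell
# Hypothetical preflight helpers; not part of fit-terrain.

# Succeeds when a version string such as "v18.19.0" has a major version >= 18.
node_major_ok() {
  version=${1#v}          # strip the leading "v" that `node --version` prints
  major=${version%%.*}    # keep everything before the first dot
  [ "$major" -ge 18 ] 2>/dev/null
}

# Checks both environment prerequisites and reports what is missing.
preflight() {
  rc=0
  if ! node_major_ok "$(node --version 2>/dev/null)"; then
    echo "Node.js 18+ is required" >&2
    rc=1
  fi
  if [ -z "${ANTHROPIC_API_KEY:-}" ]; then
    echo "ANTHROPIC_API_KEY is not set" >&2
    rc=1
  fi
  return "$rc"
}
```

Calling preflight before generate fails fast with a readable message instead of failing partway through the pipeline.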

Write the DSL file

Create a .dsl file that declares the organization, people distribution, and engineering standard. The minimum viable DSL needs four blocks inside the terrain declaration (org, department, people, and standard):

// evals/terrain/story.dsl

terrain Acme {
  domain "acme.example"
  industry "fintech"
  seed 42

  org headquarters {
    name "Acme HQ"
    location "London, UK"
  }

  department engineering {
    name "Engineering"
    parent headquarters
    headcount 20

    team payments {
      name "Payments Team"
      size 8
      repos ["payments-api", "ledger-service"]
    }
  }

  people {
    count 20
    distribution { J060 50%  J070 30%  J080 20% }
    disciplines  { software_engineering 80%  data_engineering 20% }
  }

  standard {
    // Full standard block: proficiencies, maturities, levels,
    // capabilities, behaviours, disciplines, tracks, drivers.
    // See the complete example in the end-to-end guide.
  }
}

A complete standard block with capabilities, behaviours, disciplines, and levels is shown in the end-to-end guide. The seed field makes the entity graph deterministic -- the same seed produces the same people, assignments, and proficiency ratings on every run.
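
One way to confirm the determinism claim for a given DSL file is to print the entities stage twice with the inspect verb (described later in this guide) and compare the results. This is a sketch under assumptions: entities_are_stable is a hypothetical helper, and it assumes inspect writes only the stage output to stdout, with no timestamps or log lines mixed in.

```shell
# Hypothetical helper; not part of fit-terrain.
# FIT_TERRAIN can be overridden, for example to point at a local install.
FIT_TERRAIN=${FIT_TERRAIN:-"npx fit-terrain"}

# Succeeds when two consecutive runs of the entities stage print identical
# output. Returns 2 if either inspect call itself fails.
entities_are_stable() {
  story=$1
  first=$($FIT_TERRAIN inspect entities --story="$story") || return 2
  second=$($FIT_TERRAIN inspect entities --story="$story") || return 2
  [ "$first" = "$second" ]
}
```

For example, entities_are_stable evals/terrain/story.dsl && echo "entity graph is stable" prints the message only when both runs match.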

Generate the dataset

Run generate to fill the prose cache and build all output:

npx fit-terrain generate --story=evals/terrain/story.dsl

The pipeline walks a DAG of stages in dependency order:

Stage	What it does
`parse`	Reads and parses the DSL file
`entities`	Generates the organization graph, people, and assignments
`prose-keys`	Collects every key that needs prose (bios, summaries, reviews)
`cache-lookup`	Resolves each key through an LLM, caching results to disk
`skeleton`	Renders deterministic HTML structure for knowledge documents
`enriched`	Fills the skeleton with cached prose
`raw`	Renders raw activity documents
`markdown`	Renders personal markdown documents
`pathway`	Renders engineering standard YAML from the `standard` block
`datasets`	Runs any external dataset tools (Faker, Synthea, SDV)
`validate`	Checks entity consistency and HTML structure
`write`	Merges all output and writes to disk
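
"Dependency order" means a stage runs only after every stage it consumes has finished. The sketch below illustrates the idea with the standard tsort utility. The edges are assumptions inferred from the stage descriptions in the table, not the graph fit-terrain actually uses, and stage_order is a hypothetical name.

```shell
# Illustration only: these edges are guesses based on the stage descriptions,
# not the dependency graph fit-terrain actually uses.
# Each line reads "<stage> <stage that depends on it>".
stage_order() {
  tsort <<'EOF'
parse entities
entities prose-keys
prose-keys cache-lookup
entities skeleton
skeleton enriched
cache-lookup enriched
entities raw
entities markdown
entities pathway
entities datasets
enriched validate
raw validate
markdown validate
pathway validate
datasets validate
validate write
EOF
}

stage_order
```

tsort prints one valid ordering, one stage per line. Several orderings satisfy these edges, but under them parse always comes first and write always comes last.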

The prose cache persists to data/synthetic/prose-cache.json by default. Subsequent runs with the same DSL reuse cached prose, so only new or changed keys cost API calls.

After the run completes, the data/ directory contains the full dataset:

data/
  pathway/          Engineering standard YAML (capabilities, levels, disciplines)
  knowledge/        HTML knowledge-base documents with microdata
  personal/         Personal markdown documents
  activity/         Activity records and evidence
  synthetic/        Prose cache

Verify without regenerating

Two verbs let you check the dataset without making LLM calls.

Check cache completeness -- reports how many prose keys are cached versus missing, and exits with code 1 if any key is missing:

npx fit-terrain check --story=evals/terrain/story.dsl

Validate structure -- runs entity and cross-content checks without writing files. Use it after editing the DSL to catch errors before a full rebuild:

npx fit-terrain validate --story=evals/terrain/story.dsl

Rebuild a subset

When only part of the dataset needs refreshing, use build with --only to render a single content type:

npx fit-terrain build --story=evals/terrain/story.dsl --only=pathway

Valid --only values: html, pathway, raw, markdown. Omitting --only renders everything.

The build verb reads the existing prose cache and never calls the LLM. If the cache has misses, the output includes a warning:

⚠ 12 prose cache misses — run "fit-terrain generate" to fill the cache.
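
The exit code of check and the cache-only behaviour of build can be combined so that LLM calls happen only when the cache is incomplete. This is a sketch, not a fit-terrain feature: refresh_dataset is a hypothetical wrapper, and it treats any non-zero exit from check as "cache has misses", which is an assumption.

```shell
# Hypothetical wrapper; not part of fit-terrain.
FIT_TERRAIN=${FIT_TERRAIN:-"npx fit-terrain"}

# Rebuilds from the cache when it is complete; otherwise runs generate,
# which needs ANTHROPIC_API_KEY and makes LLM calls.
# Assumption: any non-zero exit from check means the cache has misses.
refresh_dataset() {
  story=$1
  if $FIT_TERRAIN check --story="$story"; then
    $FIT_TERRAIN build --story="$story"
  else
    $FIT_TERRAIN generate --story="$story"
  fi
}
```

Running refresh_dataset evals/terrain/story.dsl after a DSL edit therefore costs API calls only when new or changed prose keys exist.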

Override defaults

Option	Default	Purpose
`--story`	`data/synthetic/story.dsl`	Path to the DSL file
`--cache`	`data/synthetic/prose-cache.json`	Path to the prose cache file
`--model`	`claude-opus-4-7` (via config)	LLM model for `generate`

All paths are relative to the working directory.
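
Passing --story and --cache together lets each scenario keep its own prose cache. The sketch below assumes --cache accepts the same --flag=value form shown for --story elsewhere in this guide. generate_scenario, the directory layout, and the file naming are illustrative choices, not conventions of the tool.

```shell
# Hypothetical helper; the directory layout and file names are illustrative.
# Only the --story and --cache options come from the table above.
FIT_TERRAIN=${FIT_TERRAIN:-"npx fit-terrain"}

# Generates one scenario with a prose cache stored next to its DSL file.
generate_scenario() {
  name=$1
  $FIT_TERRAIN generate \
    --story="evals/terrain/$name.dsl" \
    --cache="evals/terrain/$name.prose-cache.json"
}
```

With this layout, generate_scenario story reads evals/terrain/story.dsl and caches prose in evals/terrain/story.prose-cache.json, so editing one scenario never invalidates another scenario's cache.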

Inspect a pipeline stage

To debug or understand the intermediate output of any stage, use inspect:

npx fit-terrain inspect entities --story=evals/terrain/story.dsl

This prints the stage's output as formatted JSON. Valid stage names match the pipeline table above: parse, entities, prose-keys, cache-lookup, skeleton, enriched, raw, markdown, pathway, datasets, validate, write.
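
Because inspect prints JSON, stage output can be saved before and after a DSL edit and compared with an ordinary diff. A minimal sketch follows. snapshot_stages is a hypothetical helper, and it assumes inspect writes only the stage's JSON to stdout and exits non-zero on failure.

```shell
# Hypothetical helper; not part of fit-terrain.
FIT_TERRAIN=${FIT_TERRAIN:-"npx fit-terrain"}

# Usage: snapshot_stages <story> <out-dir> <stage>...
# Writes <out-dir>/<stage>.json for each stage named on the command line.
snapshot_stages() {
  story=$1
  out_dir=$2
  shift 2
  mkdir -p "$out_dir" || return 1
  for stage in "$@"; do
    $FIT_TERRAIN inspect "$stage" --story="$story" > "$out_dir/$stage.json" || return 1
  done
}
```

For example, run snapshot_stages evals/terrain/story.dsl /tmp/before parse entities, edit the DSL, run it again with /tmp/after, then compare the two directories with diff -r /tmp/before /tmp/after.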

What's next

Prove Whether Agent Changes Improved Outcomes -- the full workflow from dataset generation through evaluation sessions to trace analysis.
Terrain Internals -- architecture, DSL grammar, and node table design.