Trace Analysis

Once fit-eval has produced an NDJSON trace, the work shifts from running to understanding. fit-trace is the query interface — but the trace is qualitative data, and the most useful analysis comes from reading it like a researcher, not running a checklist.

This guide walks through the method: orient with summary commands, read the full trace, code observations, look for patterns, and synthesize findings that are grounded, testable, and actionable. Two worked examples — an eval that failed and a multi-agent session that stalled — show the method on real-shaped data.

Prerequisites

  • Node.js 18+
  • A trace file (either --output from fit-eval, or downloaded from CI with fit-trace download)
  • Time to read the full trace — skimming produces shallow findings

1. Get the trace

Local runs already produce a trace at --output. For CI runs, list and download:

npx fit-trace runs                    # find the run you want
npx fit-trace download <run-id>       # downloads to /tmp/trace-<run-id>/

For supervised or facilitated runs, split the combined trace into per-source files so you can see what each agent saw:

npx fit-trace split /tmp/trace-<run-id>/structured.json --mode=facilitate

This produces trace-facilitator.ndjson, trace-<participant>.ndjson, etc., which is essential when participants disagreed — you can read each one's view independently.

Either trace form works as input — *.ndjson files from fit-eval --output and structured.json from fit-trace download are interchangeable for every fit-trace query command.
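
Because every query command accepts the split files too, a quick way to compare participants is to run the same summary over each per-source file. A minimal sketch, using the overview command introduced in the next step:

# Summarize each per-source trace from a split in one pass
for f in trace-*.ndjson; do
  echo "== $f =="
  npx fit-trace overview "$f"
done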

2. Orient

Start with the bird's-eye view before drilling in:

npx fit-trace overview <file>     # metadata, summary, turn count, tool usage
npx fit-trace timeline <file>     # one line per turn
npx fit-trace stats <file>        # tokens and cost

overview tells you how the run ended; timeline shows the shape of the session at a glance. If the timeline is dominated by one tool, that's a hypothesis. If it shows clusters of errors, that's another. Note them — but don't commit to them yet.
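
To turn "dominated by one tool" into a number, you can count tool calls straight from the NDJSON. A minimal jq sketch; the .tool field name is an assumption, so check a raw line of your trace for the actual field:

# Count events per tool name (assumes each NDJSON event carries a `tool` field; adjust to your schema)
jq -r 'select(.tool != null) | .tool' trace.ndjson | sort | uniq -c | sort -rn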

3. Read the full trace

The temptation is to jump to errors or search and confirm the obvious. Resist it. Subtle agent failures live in interactions between turns that look fine in isolation.

Open the trace and walk it turn by turn, or in chunks with batch <from> <to> (a batching sketch follows the list below). As you read, follow four practices borrowed from grounded-theory research:

  • Begin with no hypothesis. Read before forming opinions about what went wrong. The trace will tell you something you didn't expect — but only if you let it.
  • Use the trace's own language. Label observations with terms from the actual output — error messages, tool names, status codes — not abstract categories you bring to the analysis.
  • Write memos as you go. Short notes on why something surprised you, or connections between observations. Memos written during analysis are far more valuable than retrospective summaries.
  • Read the full trace. Every turn matters. The cause of a turn-30 failure is often visible at turn 8.
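
Reading in fixed-size chunks keeps the pace deliberate. A minimal sketch for a roughly 40-turn trace, assuming batch takes the trace file followed by a turn range like the other commands here:

# Read the trace ten turns at a time (assumed signature: fit-trace batch <file> <from> <to>)
for start in 1 11 21 31; do
  npx fit-trace batch trace.ndjson "$start" "$((start + 9))"
done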

4. Look for patterns

As you read, assign short labels (codes) to meaningful events: claimed done without verification, silent retry, redirect ignored, tool error swallowed. Group related codes into categories by asking:

  • What caused this? What happened? What was the context?
  • How did the agent react?
  • What were the consequences?

Then look across codes for:

  • Causal chains — A failed at turn 8, which led B to assume X at turn 15, which produced the wrong result at turn 28.
  • Repeated patterns — the same shape of mistake more than once.
  • Contrasts — the same operation succeeded in one context but failed in another. The difference is the lever.
  • Temporal patterns — early-run vs late-run behavior. Agents often degrade as context fills.
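
A plain text file is enough to keep codes queryable while you read: one line per observation with the turn index, the code, and a short memo. The file name and layout below are only a suggestion; standard shell tools then surface repeats and clusters:

# codes.tsv: turn <TAB> code <TAB> memo (one line per observation)
# Which codes recur?
cut -f2 codes.tsv | sort | uniq -c | sort -rn
# Everything filed under one code, in turn order
grep $'\tsilent retry\t' codes.tsv | sort -n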

5. Synthesize findings

Strong findings share three traits:

  • Grounded — traceable to specific turns. Cite turn indices.
  • Testable — future traces can confirm or refute them.
  • Actionable — they imply a concrete change to a profile, prompt, tool allowlist, or workflow.

Aim for a central explanation that connects multiple observations, not a bug list. A bug list says what went wrong; a central explanation says why this kind of thing keeps going wrong.
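
One low-ceremony way to hold a finding to all three traits is a short findings file per run; the shape below is only a suggestion:

# A one-page findings file keeps the synthesis honest
cat > findings.md <<'EOF'
Finding: <central explanation that connects the observations>
Evidence: <turn indices and codes, e.g. T8, T15, T28>
Test: <what the next trace should show if this explanation is right>
Action: <the concrete profile, prompt, tool allowlist, or workflow change>
EOF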

Worked example: supervised eval of a coding agent

A supervise run evaluated a coding agent on the task "add input validation to the user registration endpoint: reject empty emails and reject passwords shorter than 8 characters; add tests." The agent finished, the judge concluded success: false, and CI surfaced the failure. The question is why — the agent's own tests passed.

npx fit-trace overview trace.ndjson
# 22 turns, Conclude success=false
npx fit-trace tool trace.ndjson Conclude
# "validation present on JSON path only; form-encoded path unchanged"
npx fit-trace filter trace.ndjson --tool Read
npx fit-trace filter trace.ndjson --tool Edit

Walking the trace, codes emerge:

  • T2: agent reads the endpoint handler, sees a JSON body parser at the top.
  • T3–T5: agent adds validation inside the JSON branch.
  • T7: agent writes tests that POST JSON. All pass. (Code: tests confirm what was changed, not what was needed.)
  • T9: agent calls Conclude proposing success.
  • T10: judge inspects the route definition and finds the same handler registered for both application/json and application/x-www-form-urlencoded. The form-encoded branch is untouched.
  • T11–T20: judge Redirects, agent investigates, but never reads the form parser path before re-asserting the change is complete. (Code: narrow scope of investigation.)
  • T22: judge concludes failure with the gap noted explicitly.
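
The gap the judge found is cheap to reproduce outside the trace. A hedged sketch, assuming the endpoint is /register on a local dev server (path and port are placeholders):

# JSON path: validation should reject the empty email
curl -s -X POST localhost:3000/register \
  -H 'Content-Type: application/json' \
  -d '{"email": "", "password": "short"}'
# Form-encoded path: if the judge is right, the same bad input passes unvalidated
curl -s -X POST localhost:3000/register \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  -d 'email=&password=short'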

Central explanation: the agent treats the first input shape it encounters as the full input surface. Its tests reinforce the narrowing because they exercise only what the agent already considered. The agent is locally correct on the path it explored but blind to parallel paths.

Action: add a judge criterion that requires enumerating the request content types before implementing. Or, at the agent profile level, instruct the coder to list every entry path into the handler before editing it. The fix isn't about validation logic — it's about scope discovery before implementation.

Worked example: facilitated triage of a support ticket

A facilitate session triaged an incoming support ticket: "Login broken for users on iOS Safari. Started this morning. 12 customers reporting." The participants were support-engineer (assesses customer impact), platform-engineer (checks recent deploys and infra), and mobile-engineer (checks platform-specific issues). The session concluded by routing to the mobile team for a Safari-version-specific workaround. Three days later the real cause turned out to be a backend deploy that broke user-agent parsing for iOS Safari specifically. The session had the right evidence; it reasoned to the wrong cause.

npx fit-trace split trace.ndjson --mode=facilitate
npx fit-trace timeline trace-facilitator.ndjson
npx fit-trace tool trace-facilitator.ndjson Announce
npx fit-trace reasoning trace-platform-engineer.ndjson

Reading the per-source traces side by side:

  • T3: support-engineer reports 12 affected customers, all iOS Safari, all reporting since 8am. P1 severity.
  • T5: mobile-engineer notes Safari 18 was released yesterday with WebKit changes. Hypothesis: Safari regression. (Code: first plausible cause becomes anchor.)
  • T7: platform-engineer reports a deploy to the auth service at 6am that morning. Notes the timing matches.
  • T9: mobile-engineer responds that the iOS-specific symptom makes a Safari root cause more likely than a backend cause that would affect other browsers. (Code: domain expertise dismisses cross-cutting evidence.)
  • T11: platform-engineer agrees and downgrades the deploy hypothesis. The reasoning trace shows the agent had not yet inspected what the deploy changed. (Code: deferred to confidence, not evidence.)
  • T14: facilitator concludes — assign to mobile team, action is a user-agent workaround for Safari 18.

Central explanation: when one participant has obvious domain expertise, the others defer to it even when their own findings carry equally strong evidence. The session converges on the most-confident voice rather than the most-supported hypothesis. Two independent signals (iOS-specific symptom and matching deploy time) collapsed into one because the participants deliberated sequentially rather than independently.

Action: require participants to state their leading hypothesis with a confidence level before deliberation begins, then surface disagreement explicitly rather than letting it dissolve. In facilitator profiles, add a step that asks each participant what evidence would change their mind — a deploy diff would have caught this one in the room.

What to measure

When the question is quantitative — is the agent getting better? — the metrics are:

  • Token usage — stats breaks down input vs. output tokens and cost.
  • Retry counts — search for repeated identical tool calls.
  • Wasted turns — turns that produced no useful progress; count them while reading.
  • Error recovery — did the agent diagnose and adapt, or retry blindly? Compare errors against the immediately following turns.
  • Intent vs. execution — reasoning shows what the agent said it would do; tool shows what it did. Mismatches are findings.
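
For retry counts specifically, you can tally repeated identical tool calls straight from the NDJSON rather than eyeballing search output. A sketch; the .tool and .input field names are assumptions, so match them to your trace schema:

# Group tool calls by (tool, input) and report any combination that appears more than once
jq -r 'select(.tool != null) | "\(.tool)\t\(.input | tostring)"' trace.ndjson \
  | sort | uniq -c | sort -rn | awk '$1 > 1'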

Track these across runs over time. A single trace is a snapshot; a series shows whether changes are landing.
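
A minimal sketch for that series: print the token and cost summary for every downloaded run, assuming they all sit under the /tmp/trace-<run-id>/ paths that fit-trace download uses:

# One stats summary per downloaded run (adjust the glob to wherever your traces live)
for dir in /tmp/trace-*/; do
  echo "== $dir =="
  npx fit-trace stats "${dir}structured.json"
done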

See also

  • Agent Evaluations — produce traces with fit-eval supervise; the trace is what you analyze here.
  • Agent Collaboration — produce traces with fit-eval facilitate; the per-source split is essential for multi-agent traces.
  • CLI Reference — the full fit-trace command surface.