Trace Analysis
Once fit-eval has produced an NDJSON trace, the work
shifts from running to understanding. fit-trace is the
query interface — but the trace is qualitative data, and the most
useful analysis comes from reading it like a researcher, not running
a checklist.
This guide walks through the method: orient with summary commands, read the full trace, code observations, look for patterns, and synthesize findings that are grounded, testable, and actionable. Two worked examples — an eval that failed and a multi-agent session that stalled — show the method on real-shaped data.
Prerequisites
- Node.js 18+
- A trace file (either --output from fit-eval, or downloaded from CI with fit-trace download)
- Time to read the full trace — skimming produces shallow findings
1. Get the trace
Local runs already produce a trace at --output. For CI
runs, list and download:
npx fit-trace runs # find the run you want
npx fit-trace download <run-id> # downloads to /tmp/trace-<run-id>/
For supervised or facilitated runs, split the combined trace into per-source files so you can see what each agent saw:
npx fit-trace split /tmp/trace-<run-id>/structured.json --mode=facilitate
This produces trace-facilitator.ndjson,
trace-<participant>.ndjson, etc., which is
essential when participants disagreed — you can read each one's
view independently.
Either trace form works as input — *.ndjson files from
fit-eval --output and structured.json from
fit-trace download are interchangeable for every
fit-trace query command.
2. Orient
Start with the bird's-eye view before drilling in:
npx fit-trace overview <file> # metadata, summary, turn count, tool usage
npx fit-trace timeline <file> # one line per turn
npx fit-trace stats <file> # tokens and cost
overview tells you how the run ended;
timeline shows the shape of the session at a glance. If
the timeline is dominated by one tool, that's a hypothesis. If
it shows clusters of errors, that's another. Note them — but
don't commit to them yet.
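One cheap way to pressure-test a "dominated by one tool" hypothesis is to redirect the timeline to a file and count mentions. The timeline lines below are fabricated stand-ins for real output, since the exact per-line format depends on your fit-trace version; inspect yours first:

```shell
# Fabricated stand-in for `npx fit-trace timeline <file> > timeline.txt`;
# the real per-line format may differ from this sketch.
printf '%s\n' \
  'T1  Read  src/routes.js' \
  'T2  Edit  src/routes.js' \
  'T3  Edit  src/routes.js' \
  'T4  error: tool call rejected' > timeline.txt

grep -c 'Edit' timeline.txt    # turns that touched Edit
grep -ci 'error' timeline.txt  # turns that mention an error
```

Counts like these are orientation only; they tell you where to read closely, not what went wrong.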
3. Read the full trace
The temptation is to jump to errors or
search and confirm the obvious. Resist it. Subtle agent
failures live in interactions between turns that look fine in
isolation.
Open the trace and walk it turn by turn (or
batch <from> <to> for chunks). As you read,
follow four practices borrowed from grounded-theory research:
- Begin with no hypothesis. Read before forming opinions about what went wrong. The trace will tell you something you didn't expect — but only if you let it.
- Use the trace's own language. Label observations with terms from the actual output — error messages, tool names, status codes — not abstract categories you bring to the analysis.
- Write memos as you go. Short notes on why something surprised you, or connections between observations. Memos written during analysis are far more valuable than retrospective summaries.
- Read the full trace. Every turn matters. The cause of a turn-30 failure is often visible at turn 8.
4. Look for patterns
As you read, assign short labels (codes) to meaningful events:
claimed done without verification,
silent retry, redirect ignored,
tool error swallowed. Group related codes into
categories by asking:
- What happened? What caused it? What was the context?
- How did the agent react?
- What were the consequences?
Then look across codes for:
- Causal chains — A failed at turn 8, which led B to assume X at turn 15, which produced the wrong result at turn 28.
- Repeated patterns — the same shape of mistake more than once.
- Contrasts — the same operation succeeded in one context but failed in another. The difference is the lever.
- Temporal patterns — early-run vs late-run behavior. Agents often degrade as context fills.
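If you keep codes in a plain text file (one observation per line, turn index first), ordinary text tools will surface the repeats. The layout here is just one convention, not something fit-trace prescribes:

```shell
# One observation per line: turn index, then a short code.
printf '%s\n' \
  'T5 silent retry' \
  'T9 claimed done without verification' \
  'T12 silent retry' \
  'T18 redirect ignored' > codes.txt

# Strip the turn index and count repeated codes; repeats are
# candidate patterns worth a closer read.
awk '{$1=""; print}' codes.txt | sort | uniq -c | sort -rn
```

A code that appears once is an observation; a code that appears three times is the start of a category.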
5. Synthesize findings
Strong findings share three traits:
- Grounded — traceable to specific turns. Cite turn indices.
- Testable — future traces can confirm or refute them.
- Actionable — they imply a concrete change to a profile, prompt, tool allowlist, or workflow.
Aim for a central explanation that connects multiple observations, not a bug list. A bug list says what went wrong; a central explanation says why this kind of thing keeps going wrong.
Worked example: supervised eval of a coding agent
A supervise run evaluated a coding agent on the task
"add input validation to the user registration endpoint:
reject empty emails and reject passwords shorter than 8
characters; add tests."
The agent finished, the judge concluded success: false,
and CI surfaced the failure. The question is why — the agent's
own tests passed.
npx fit-trace overview trace.ndjson
# 22 turns, Conclude success=false
npx fit-trace tool trace.ndjson Conclude
# "validation present on JSON path only; form-encoded path unchanged"
npx fit-trace filter trace.ndjson --tool Read
npx fit-trace filter trace.ndjson --tool Edit
Walking the trace, codes emerge:
- T2: agent reads the endpoint handler, sees a JSON body parser at the top.
- T3–T5: agent adds validation inside the JSON branch.
- T7: agent writes tests that POST JSON. All pass. (Code: tests confirm what was changed, not what was needed.)
- T9: agent calls Conclude proposing success.
- T10: judge inspects the route definition and finds the same handler registered for both application/json and application/x-www-form-urlencoded. The form-encoded branch is untouched.
- T11–T20: judge Redirects, agent investigates, but never reads the form parser path before re-asserting the change is complete. (Code: narrow scope of investigation.)
- T22: judge concludes failure with the gap noted explicitly.
Central explanation: the agent treats the first input shape it encounters as the full input surface. Its tests reinforce the narrowing because they exercise only what the agent already considered. The agent is locally correct on the path it explored but blind to parallel paths.
Action: add a judge criterion that requires enumerating the request content types before implementing. Or, at the agent profile level, instruct the coder to list every entry path into the handler before editing it. The fix isn't about validation logic — it's about scope discovery before implementation.
Worked example: facilitated triage of a support ticket
A facilitate session triaged an incoming support
ticket:
"Login broken for users on iOS Safari. Started this morning.
12 customers reporting."
The participants were support-engineer (assesses
customer impact), platform-engineer (checks recent
deploys and infra), and mobile-engineer (checks
platform-specific issues). The session concluded by routing to the
mobile team for a Safari-version-specific workaround. Three days
later the real cause turned out to be a backend deploy that broke
user-agent parsing for iOS Safari specifically. The session had the
right evidence; it reasoned to the wrong cause.
npx fit-trace split trace.ndjson --mode=facilitate
npx fit-trace timeline trace-facilitator.ndjson
npx fit-trace tool trace-facilitator.ndjson Announce
npx fit-trace reasoning trace-platform-engineer.ndjson
Reading the per-source traces side by side:
- T3: support-engineer reports 12 affected customers, all iOS Safari, all reporting since 8am. P1 severity.
- T5: mobile-engineer notes Safari 18 was released yesterday with WebKit changes. Hypothesis: Safari regression. (Code: first plausible cause becomes anchor.)
- T7: platform-engineer reports a deploy to the auth service at 6am that morning. Notes the timing matches.
- T9: mobile-engineer responds that the iOS-specific symptom makes a Safari root cause more likely than a backend cause that would affect other browsers. (Code: domain expertise dismisses cross-cutting evidence.)
- T11: platform-engineer agrees and downgrades the deploy hypothesis. The reasoning trace shows the agent had not yet inspected what the deploy changed. (Code: deferred to confidence, not evidence.)
- T14: facilitator concludes — assign to mobile team, action is a user-agent workaround for Safari 18.
Central explanation: when one participant has obvious domain expertise, the others defer to it even when their own findings carry equally strong evidence. The session converges on the most-confident voice rather than the most-supported hypothesis. Two independent signals (iOS-specific symptom and matching deploy time) collapsed into one because the participants deliberated sequentially rather than independently.
Action: require participants to state their leading hypothesis with a confidence level before deliberation begins, then surface disagreement explicitly rather than letting it dissolve. In facilitator profiles, add a step that asks each participant "what evidence would change your mind?" — a deploy diff would have caught this one in the room.
What to measure
When the question is quantitative — is the agent getting better? — the metrics are:
- Token usage — stats breaks down input vs. output tokens and cost.
- Retry counts — search for repeated identical tool calls.
- Wasted turns — turns that produced no useful progress; count them while reading.
- Error recovery — did the agent diagnose and adapt, or retry blindly? Compare errors against the immediately following turns.
- Intent vs. execution — reasoning shows what the agent said it would do; tool shows what it did. Mismatches are findings.
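Retry counts in particular are easy to approximate mechanically. The jq pipeline below shows the shape on a fabricated three-line trace; the field names ("type", "tool") are assumptions, so check them against your own NDJSON before relying on this:

```shell
# Fabricated three-line trace standing in for fit-eval --output;
# the "type" and "tool" field names are assumptions, not a
# documented schema.
printf '%s\n' \
  '{"type":"tool_call","tool":"Read"}' \
  '{"type":"tool_call","tool":"Read"}' \
  '{"type":"tool_call","tool":"Edit"}' > sample.ndjson

# Repeated calls to the same tool are a cheap retry-count proxy;
# a true retry check would also compare arguments.
jq -r 'select(.type == "tool_call") | .tool' sample.ndjson \
  | sort | uniq -c | sort -rn
```

A count alone is not a finding; pair it with a read of the surrounding turns to see whether the repeats were diagnosis or blind retries.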
Track these across runs over time. A single trace is a snapshot; a series shows whether changes are landing.
Related
- Agent Evaluations — produce traces with fit-eval supervise; the trace is what you analyze here.
- Agent Collaboration — produce traces with fit-eval facilitate; the per-source split is essential for multi-agent traces.
- CLI Reference — the full fit-trace command surface.