# Automate with GitHub Actions
You have a task family that works locally. Now you want benchmarks to run automatically: on pull requests that touch your skills, on a weekly schedule, or on demand. The `forwardimpact/fit-benchmark` GitHub Action wraps the CLI, adds step summaries and artifact upload, and handles timeout control.
## Prerequisites
- A task family (see Run a Benchmark)
- `ANTHROPIC_API_KEY` stored as a repository secret
## Minimal Workflow
```yaml
name: Benchmark

on:
  workflow_dispatch:
  pull_request:
    paths:
      - ".claude/skills/**"
      - "benchmarks/my-family/**"

permissions:
  contents: read

jobs:
  benchmark:
    runs-on: ubuntu-latest
    env:
      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - uses: forwardimpact/fit-benchmark@v1
        with:
          family: ./benchmarks/my-family
          runs: "5"
          judge-profile: judge
```
The action handles everything after checkout: install dependencies, run each task N times, append the pass@k report to the GitHub step summary, and upload `results.jsonl` as a workflow artifact.
## What the Action Does
- **Install apm**: downloads and caches the apm binary if not already present.
- **Resolve CLI**: uses a local `fit-benchmark` if available, falls back to `bunx`, then `npx`.
- **Run**: executes `fit-benchmark run` with the provided inputs.
- **Report**: appends the text report to `GITHUB_STEP_SUMMARY` (when `summary` is `"true"`).
- **Upload**: uploads `results.jsonl` as a workflow artifact (when `upload-results` is `"true"`).
## Inputs
All `fit-benchmark run` CLI flags are exposed as action inputs. The action adds CI-specific inputs that have no CLI equivalent:
| Input | Default | Description |
|---|---|---|
| `family` | (required) | Path or git URL to a task family |
| `output` | `"benchmark-runs"` | Run-output directory |
| `runs` | `"5"` | Runs per task |
| `agent-model` | `"claude-sonnet-4-6"` | Claude model for the agent-under-test |
| `supervisor-model` | `"claude-opus-4-7"` | Claude model for the supervisor |
| `judge-model` | `"claude-opus-4-7"` | Claude model for the judge |
| `agent-profile` | | Agent-under-test profile name |
| `judge-profile` | | Judge profile name |
| `max-turns` | `"50"` | Agent turn budget (`0` = unlimited) |
| `allowed-tools` | `"Bash,Read,Glob,Grep,Write,Edit,Agent,TodoWrite"` | Agent tool allowlist |
| `k` | `"1,3,5"` | Comma-separated k values for pass@k |
| `format` | `"text"` | Report output format |
| `summary` | `"true"` | Append report to `GITHUB_STEP_SUMMARY` |
| `upload-results` | `"true"` | Upload `results.jsonl` as artifact |
| `artifact-name` | `"benchmark-results"` | Name for the uploaded artifact |
| `timeout-minutes` | `"60"` | Maximum minutes before cancellation |
## Outputs
| Output | Description |
|---|---|
| `results-path` | Absolute path to `results.jsonl` |
Use `results-path` in downstream steps to consume or compare results programmatically, as in the sketch below.
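A minimal sketch of a downstream step reading the output. The `id: bench` label is ours, and the `passed` field in the `jq` filter is an assumption about the `results.jsonl` schema; adjust it to the fields your results actually contain:

```yaml
steps:
  - uses: forwardimpact/fit-benchmark@v1
    id: bench
    with:
      family: ./benchmarks/my-family
  - name: Count failing runs
    if: always() # inspect results even when the benchmark step fails
    run: |
      # 'passed' is an assumed field name; check your results schema
      jq -s '[.[] | select(.passed == false)] | length' \
        "${{ steps.bench.outputs.results-path }}"
```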
## Task Secrets
Tasks that declare `.env` or `.env.local` files resolve their variables from the runner environment. Add the required secrets alongside `ANTHROPIC_API_KEY`:
```yaml
jobs:
  benchmark:
    runs-on: ubuntu-latest
    env:
      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      LLMHUB_NONPROD_API_KEY: ${{ secrets.LLMHUB_NONPROD_API_KEY }}
      LLMHUB_PROD_API_KEY: ${{ secrets.LLMHUB_PROD_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - uses: forwardimpact/fit-benchmark@v1
        with:
          family: ./benchmarks/my-family
          runs: "5"
```
The harness reads the task's `.env.local` for variable names, resolves each from `process.env` (where the GitHub secrets live), and renders the file into the agent's working directory. No `prepare.sh` or manual staging needed.
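For reference, a hypothetical `.env.local` for the task above would just name the variables to resolve; the exact dotenv syntax the harness expects may differ from this sketch:

```dotenv
# Names only; values are resolved from the runner environment,
# i.e. the GitHub secrets exported in the job's `env` block.
LLMHUB_NONPROD_API_KEY=
LLMHUB_PROD_API_KEY=
```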
## Scheduled Runs
Add a cron trigger to track outcomes over time:
```yaml
on:
  schedule:
    - cron: "0 6 * * 1" # 06:00 UTC every Monday
  workflow_dispatch:
```
Scheduled runs on `main` create a weekly baseline. Compare the latest `results.jsonl` artifact against a previous week's to detect regressions; one way to do that is sketched below.
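One possible shape for the comparison, assuming a `workflow_dispatch` input named `baseline-run-id` (hypothetical) and the cross-run download support in `actions/download-artifact@v4`; the `passed` field is again an assumption about the results schema:

```yaml
- uses: actions/download-artifact@v4
  with:
    name: benchmark-results
    path: current
- uses: actions/download-artifact@v4
  with:
    name: benchmark-results
    path: baseline
    run-id: ${{ inputs.baseline-run-id }} # a prior week's workflow run
    github-token: ${{ secrets.GITHUB_TOKEN }}
- name: Compare pass rates
  run: |
    # prints the fraction of passing runs for each week
    for d in baseline current; do
      jq -s '([.[] | select(.passed)] | length) / length' "$d/results.jsonl"
    done
```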
## Cost Control
Each run invokes Claude for the agent-under-test, scoring, and judging. Control cost with the levers below (combined in the sketch after this list):

- `runs`: fewer runs means lower cost but weaker statistical signal. Five runs is a reasonable floor for pass@k.
- `max-turns`: caps agent turns per run. Tasks that finish fast rarely need more than 25.
- `timeout-minutes`: hard cancellation. The default is 60; adjust based on family size.
- PR path filters: only run when relevant files change.
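A budget-leaning configuration combining these levers might look like this (values are illustrative, not prescriptive):

```yaml
- uses: forwardimpact/fit-benchmark@v1
  with:
    family: ./benchmarks/my-family
    runs: "5"              # the floor for a usable pass@k signal
    max-turns: "25"        # enough for tasks that finish fast
    timeout-minutes: "30"  # half the default, for a small family
```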
## Matrix Workflows
When running benchmarks across multiple families in a matrix, use `artifact-name` to avoid upload collisions:
```yaml
strategy:
  matrix:
    family:
      - { path: "./benchmarks/kata-skills", name: "kata" }
      - { path: "./benchmarks/fit-skills", name: "fit" }
steps:
  - uses: forwardimpact/fit-benchmark@v1
    with:
      family: ${{ matrix.family.path }}
      artifact-name: benchmark-${{ matrix.family.name }}
```
## Verify
After the workflow runs, confirm:

- The step summary shows a pass@k table.
- The `benchmark-results` artifact is downloadable from the workflow run.
- The exit code reflects the aggregate verdict: `0` when all tasks pass, `1` otherwise.
## What's Next
- **Run a Benchmark**: prove a skill-pack change improved coding outcomes by running a task family across N runs, grading with hidden tests, and reporting pass@k.
- **Run an Eval**: know whether agent changes improved outcomes, with an agent-as-judge eval wired into CI and traceable results.
- **Analyze Traces**: see exactly what an agent did and why by downloading traces, querying turns, filtering by tool or error, and measuring token cost.