# Automate with GitHub Actions

You have a task family that works locally. Now you want benchmarks to run
automatically — on pull requests that touch your skills, on a weekly schedule,
or on demand. The `forwardimpact/fit-benchmark` GitHub Action wraps the CLI,
adds step summaries and artifact upload, and handles timeout control.

## Prerequisites

- A task family (see [Run a Benchmark](/docs/libraries/prove-changes/run-benchmark/))
- `ANTHROPIC_API_KEY` stored as a repository secret

## Minimal Workflow

```yaml
name: Benchmark

on:
  workflow_dispatch:
  pull_request:
    paths:
      - ".claude/skills/**"
      - "benchmarks/my-family/**"

permissions:
  contents: read

jobs:
  benchmark:
    runs-on: ubuntu-latest
    env:
      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - uses: forwardimpact/fit-benchmark@v1
        with:
          family: ./benchmarks/my-family
          runs: "5"
          judge-profile: judge
```

The action handles everything after checkout: installing dependencies, running
each task N times, appending the pass@k report to the GitHub step summary, and
uploading `results.jsonl` as a workflow artifact.

## What the Action Does

1. **Install apm** — downloads and caches the apm binary if not already present.
2. **Resolve CLI** — uses a local `fit-benchmark` if available, falls back to
   `bunx`, then `npx`.
3. **Run** — executes `fit-benchmark run` with the provided inputs.
4. **Report** — appends the text report to `GITHUB_STEP_SUMMARY` (when
   `summary` is `"true"`).
5. **Upload** — uploads `results.jsonl` as a workflow artifact (when
   `upload-results` is `"true"`).
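
Both reporting steps are opt-out. A minimal sketch that keeps the benchmark run
but skips the summary and the artifact upload:

```yaml
- uses: forwardimpact/fit-benchmark@v1
  with:
    family: ./benchmarks/my-family
    summary: "false"         # skip the GITHUB_STEP_SUMMARY report
    upload-results: "false"  # skip the results.jsonl artifact
```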

## Inputs

All `fit-benchmark run` CLI flags are exposed as action inputs; the table below
also includes the CI-specific inputs that have no CLI equivalent:

| Input | Default | Description |
| --- | --- | --- |
| `family` | *(required)* | Path or git URL to a task family |
| `output` | `"benchmark-runs"` | Run-output directory |
| `runs` | `"5"` | Runs per task |
| `agent-model` | `"claude-sonnet-4-6"` | Claude model for the agent-under-test |
| `supervisor-model` | `"claude-opus-4-7"` | Claude model for the supervisor |
| `judge-model` | `"claude-opus-4-7"` | Claude model for the judge |
| `agent-profile` | | Agent-under-test profile name |
| `judge-profile` | | Judge profile name |
| `max-turns` | `"50"` | Agent turn budget (`0` = unlimited) |
| `allowed-tools` | `"Bash,Read,Glob,Grep,Write,Edit,Agent,TodoWrite"` | Agent tool allowlist |
| `k` | `"1,3,5"` | Comma-separated k values for pass@k |
| `format` | `"text"` | Report output format |
| `summary` | `"true"` | Append report to `GITHUB_STEP_SUMMARY` |
| `upload-results` | `"true"` | Upload `results.jsonl` as artifact |
| `artifact-name` | `"benchmark-results"` | Name for the uploaded artifact |
| `timeout-minutes` | `"60"` | Maximum minutes before cancellation |
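
For example, a step that raises the run count, tightens the turn budget, and
widens the pass@k report. All values are illustrative:

```yaml
- uses: forwardimpact/fit-benchmark@v1
  with:
    family: ./benchmarks/my-family
    runs: "10"
    judge-profile: judge
    max-turns: "25"       # tighter per-run turn budget
    k: "1,5,10"           # pass@1, pass@5, pass@10 over ten runs
    artifact-name: benchmark-main
```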

## Outputs

| Output | Description |
| --- | --- |
| `results-path` | Absolute path to `results.jsonl` |

Use `results-path` in downstream steps to consume or compare results
programmatically.
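
Give the action step an `id` and read the output through the `steps` context.
A minimal sketch that only inspects the file:

```yaml
- uses: forwardimpact/fit-benchmark@v1
  id: benchmark
  with:
    family: ./benchmarks/my-family
- name: Inspect results
  run: |
    # results.jsonl is line-delimited JSON: one record per line.
    wc -l "${{ steps.benchmark.outputs.results-path }}"
```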

## Task Secrets

Tasks that declare `.env` or `.env.local` files resolve their variables
from the runner environment. Add the required secrets alongside
`ANTHROPIC_API_KEY`:

```yaml
jobs:
  benchmark:
    runs-on: ubuntu-latest
    env:
      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      LLMHUB_NONPROD_API_KEY: ${{ secrets.LLMHUB_NONPROD_API_KEY }}
      LLMHUB_PROD_API_KEY: ${{ secrets.LLMHUB_PROD_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - uses: forwardimpact/fit-benchmark@v1
        with:
          family: ./benchmarks/my-family
          runs: "5"
```

The harness reads the task's `.env.local` for variable names, resolves each
from `process.env` (where the GitHub secrets live), and renders the file into
the agent's working directory. No `prepare.sh` or manual staging is needed.
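
For reference, a hypothetical `.env.local` for the task above. Only the
variable names matter; the values are resolved from the runner environment at
run time:

```
# Hypothetical task .env.local; values come from GitHub secrets in CI.
LLMHUB_NONPROD_API_KEY=
LLMHUB_PROD_API_KEY=
```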

## Scheduled Runs

Add a cron trigger to track outcomes over time:

```yaml
on:
  schedule:
    - cron: "0 6 * * 1"
  workflow_dispatch:
```

Scheduled runs on `main` (Mondays at 06:00 UTC with the cron above) create a
weekly baseline. Compare the latest `results.jsonl` artifact against a previous
week's to detect regressions.
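
One way to fetch the previous baseline is the GitHub CLI preinstalled on the
runner. A sketch, assuming the default artifact name and a token granted
`actions: read` in addition to `contents: read`:

```yaml
- name: Download last week's baseline
  env:
    GH_TOKEN: ${{ github.token }}
  run: |
    # The second-most-recent successful run of this workflow is last week's.
    prev=$(gh run list --workflow "${{ github.workflow }}" \
      --status success --limit 2 --json databaseId \
      --jq '.[1].databaseId')
    gh run download "$prev" --name benchmark-results --dir baseline
    # baseline/results.jsonl is now ready to diff against the current run.
```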

## Cost Control

Each run invokes Claude for the agent-under-test, scoring, and judging. Control
cost with:

- **`runs`** — fewer runs means lower cost but weaker statistical signal.
  Five runs is a reasonable floor: with the default `k` of `1,3,5`, pass@5
  needs at least five samples per task.
- **`max-turns`** — caps agent turns per run. Tasks that finish fast rarely
  need more than 25.
- **`timeout-minutes`** — hard cancellation. Default is 60; adjust based on
  family size.
- **PR path filters** — only run when relevant files change.
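
Putting those levers together, a cost-conscious step might look like this
(values are illustrative; path filters live in the `on:` block, as in the
minimal workflow):

```yaml
- uses: forwardimpact/fit-benchmark@v1
  with:
    family: ./benchmarks/my-family
    runs: "5"               # floor for pass@5
    max-turns: "25"         # cap agent turns per run
    timeout-minutes: "30"   # hard stop for a small family
```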

## Matrix Workflows

When running benchmarks across multiple families in a matrix, use
`artifact-name` to avoid upload collisions:

```yaml
strategy:
  matrix:
    family:
      - { path: "./benchmarks/kata-skills", name: "kata" }
      - { path: "./benchmarks/fit-skills", name: "fit" }
steps:
  - uses: actions/checkout@v4
  - uses: forwardimpact/fit-benchmark@v1
    with:
      family: ${{ matrix.family.path }}
      artifact-name: benchmark-${{ matrix.family.name }}
```

## Verify

After the workflow runs, confirm:

1. The step summary shows a pass@k table.
2. The `benchmark-results` artifact is downloadable from the workflow run.
3. The exit code reflects the aggregate verdict — `0` when all tasks pass,
   `1` otherwise.

## What's Next

<div class="grid">

<a href="../">
<h3>Run a Benchmark</h3>
<p>Prove a skill-pack change improved coding outcomes — run a task family across N runs, grade with hidden tests, and report pass@k.</p>
</a>
<a href="../../run-eval/">
<h3>Run an Eval</h3>
<p>Know whether agent changes improved outcomes — an agent-as-judge eval wired into CI with traceable results.</p>
</a>
<a href="../../trace-analysis/">
<h3>Analyze Traces</h3>
<p>See exactly what an agent did and why — download traces, query turns, filter by tool or error, and measure token cost.</p>
</a>

</div>
