This page covers the whole point of eval-harness: making the exit code of a run
block a bad merge the same way a failing PHPUnit test does.

The exit-code contract

eval-harness:run translates the report into a process exit code:

| Condition | Exit code |
| --- | --- |
| Every metric scored cleanly | 0 |
| Any metric recorded a SampleFailure (provider/metric contract error) | non-zero |
| Strict mode (EVAL_HARNESS_RAISE_EXCEPTIONS=true) hits the first MetricException | non-zero |

By default, failures are captured, not fatal — one timeout does not abort the
suite — but the captured failure still flips the exit code so CI fails. A strict
lane can opt into aborting on the first error.
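
As a minimal sketch of how a CI step might branch on that contract, the helper below runs a command and reports pass or fail from its exit code. `run_gate` is a hypothetical wrapper, not part of eval-harness; it relies only on the rule above that 0 means every metric scored and anything else means a failure.

```shell
# Hypothetical wrapper: runs the command it is given and reports the outcome
# according to the exit-code contract (0 = clean, non-zero = failure).
run_gate() {
  local status=0
  "$@" || status=$?
  if [ "$status" -eq 0 ]; then
    echo "gate: pass"
  else
    echo "gate: fail (exit $status)"
  fi
  return "$status"
}

# Default lane (failures are captured, but the exit code still flips):
#   run_gate php artisan eval-harness:run rag.factuality.fy2026 \
#     --registrar="App\\Console\\EvalRegistrar"
# Strict lane (abort on the first MetricException):
#   run_gate env EVAL_HARNESS_RAISE_EXCEPTIONS=true \
#     php artisan eval-harness:run rag.factuality.fy2026 \
#     --registrar="App\\Console\\EvalRegistrar"
```

Because the wrapper returns the original status, the CI step still fails exactly when the run does.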

The exit code reflects execution health (did every metric score), not a
quality threshold by itself. To gate on a quality bar (e.g. macro-F1 ≥ 0.8),
read the JSON report and assert on it; see Gating on a quality threshold
below and Regression gating.

A PR-safe workflow

Run the gate on the same events that change AI behavior — app code, config,
datasets, prompts:

# .github/workflows/eval-gate.yml
name: AI Regression Gate

on:
  pull_request:
    paths:
      - 'app/**'
      - 'config/**'
      - 'eval/**'
      - 'resources/**'

jobs:
  eval-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: shivammathur/setup-php@v2
        with:
          php-version: '8.3'
          tools: composer:v2
      - name: Install dependencies
        run: composer install --no-interaction --prefer-dist --no-progress
      - name: Run eval gate
        env:
          EVAL_HARNESS_JUDGE_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          php artisan eval-harness:run rag.factuality.fy2026 \
            --registrar="App\\Console\\EvalRegistrar" \
            --json --out=eval-report.json --raw-path
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: eval-report
          path: eval-report.json

The if: always() on the upload step is deliberate — you want the report
especially when the gate fails, so the diff is one click away in the PR.

Note the --raw-path flag. A relative --out without it resolves against
the configured reports disk (eval-harness/reports by default), not the
GitHub Actions workspace, so actions/upload-artifact and a later
jq eval-report.json would not find the file. --raw-path makes --out a literal
workspace path (the parent directory must already exist; the workspace root
does). Alternatively, drop --raw-path and point the upload/read at the
configured storage path instead.
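
A small guard can make a misplaced report obvious before the upload or a later jq step runs. This is a sketch under the assumption that the report should sit at the path passed to --out; `require_report` is a hypothetical helper name, and `::error::` is the standard GitHub Actions annotation prefix.

```shell
# Hypothetical guard: fail early with a clear message if the report
# is missing or empty at the path later steps expect.
require_report() {
  local path="$1"
  if [ ! -s "$path" ]; then
    echo "::error::report not found or empty at $path (was --raw-path passed?)"
    return 1
  fi
  echo "report found: $path"
}

# Usage after the eval step:
#   require_report eval-report.json || exit 1
```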

Keep the gate fast and cheap

The PR gate runs on every push; it must be quick and ideally free.

Prefer offline metrics on the PR lane

exact-match, contains, regex, rouge-l, citation-groundedness,
ordinal-distance, and the entire retrieval-ranking family need no
provider and no network. A gate built from these costs nothing and runs in
seconds.

Push expensive judges to a nightly lane

llm-as-judge and refusal-quality call a provider per sample. Keep them on
a smaller nightly or release dataset, not the per-push PR gate.

Use a batch profile

--batch-profile=ci applies sane lazy-parallel defaults; --batch-profile=smoke
stays serial for a fast pre-merge check. See
Batch execution.

Gating on a quality threshold

The exit code catches execution failures. To also fail when quality drops below
a bar, read the JSON report’s macro_f1 and assert on it:

php artisan eval-harness:run rag.factuality.fy2026 \
  --registrar="App\\Console\\EvalRegistrar" \
  --json --out=eval-report.json --raw-path

MACRO_F1=$(jq '.macro_f1' eval-report.json)
echo "macro-F1 = $MACRO_F1"
awk "BEGIN { exit !($MACRO_F1 >= 0.80) }" \
  || { echo "::error::macro-F1 $MACRO_F1 below 0.80 floor"; exit 1; }
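
If the report is missing or macro_f1 is null, the awk comparison above still fails, but the error message blames the score. One possible hardening, sketched here with a hypothetical `meets_floor` helper, is to reject non-numeric values explicitly before comparing. The sketch assumes macro_f1 is a plain non-negative decimal; scientific notation would be rejected too, which fails closed.

```shell
# Hypothetical helper: succeeds only if $1 is a plain number >= floor $2.
# Returns 2 for empty or non-numeric input (e.g. "null"), 1 when below the floor.
meets_floor() {
  local value="$1" floor="$2"
  case "$value" in
    ''|*[!0-9.]*) return 2 ;;  # empty, "null", negatives, anything non-numeric
  esac
  awk -v v="$value" -v f="$floor" 'BEGIN { exit !(v + 0 >= f + 0) }'
}

# Usage:
#   MACRO_F1=$(jq -r '.macro_f1' eval-report.json)
#   meets_floor "$MACRO_F1" 0.80 \
#     || { echo "::error::macro-F1 '$MACRO_F1' missing or below 0.80 floor"; exit 1; }
```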

For drift-relative gating (fail when the score drops more than N points from a
stored baseline) the adversarial lane ships a first-class --regression-gate;
see Regression gating.
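
For lanes that do not use the built-in gate, the same idea can be approximated by hand. The sketch below is not the --regression-gate implementation; it assumes a baseline value taken from an earlier report that exposes the same macro_f1 field, and `within_drop` is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds if the current score has not dropped more than
# max_drop below the baseline. Improvements always pass.
# Returns 2 if any argument is empty or non-numeric.
within_drop() {
  local current="$1" baseline="$2" max_drop="$3" n
  for n in "$current" "$baseline" "$max_drop"; do
    case "$n" in
      ''|*[!0-9.]*) return 2 ;;
    esac
  done
  # The small epsilon keeps floating-point rounding from failing an exact-boundary drop.
  awk -v c="$current" -v b="$baseline" -v d="$max_drop" \
    'BEGIN { exit !((b - c) <= d + 1e-9) }'
}

# Usage (baseline-report.json is an assumed, previously stored report):
#   CURRENT=$(jq -r '.macro_f1' eval-report.json)
#   BASELINE=$(jq -r '.macro_f1' baseline-report.json)
#   within_drop "$CURRENT" "$BASELINE" 0.02 \
#     || { echo "::error::macro-F1 dropped from $BASELINE to $CURRENT"; exit 1; }
```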

What to gate, and where

flowchart TB
  PR["Pull request push"] --> FAST["PR gate:<br/>offline metrics, macro-F1 floor<br/>fast · free · blocking"]
  MERGE["Merge to main"] --> NIGHT["Nightly:<br/>judge metrics, larger dataset<br/>adversarial lane, regression gate"]
  NIGHT --> ALERT["alert on regression"]

Regression gating

Baselines, manifests, and the adversarial regression gate.


Online monitoring

Catch regressions the dataset never anticipated, in production.
