Scoring saved outputs : eval-harness

Sometimes generation and scoring should be decoupled. A nightly job, an
offline batch, or an expensive GPU pipeline produces model responses once; you
want to score those saved outputs in CI without re-running the agent. The
--outputs mode does exactly that.

Why decouple

Expensive or slow generation

If producing answers takes minutes per sample or needs hardware CI doesn’t
have, generate once and score the artifact cheaply on every push.

Deterministic CI

Scoring a fixed outputs file removes provider variance from the gate
entirely: with deterministic metrics, the same outputs always produce the
same report.

Reproduce a production incident

Capture the exact answers a model gave in prod, save them, and score them
against your golden dataset to confirm the regression offline.

Compare two systems

Score outputs from model A and model B against the same dataset and metrics,
then diff the two reports.

The outputs file

--outputs accepts JSON or YAML in two shapes.

Map form, keyed by sample id:

{
  "outputs": {
    "capital-france": "Paris",
    "refund-policy": "30 days from delivery."
  }
}
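
A generation job could emit the map form with a few lines of plain PHP. This is a minimal sketch, not part of eval-harness: `encodeOutputsMap` is a hypothetical helper and the target path is illustrative. Only the `{"outputs": {...}}` shape comes from this guide.

```php
<?php

declare(strict_types=1);

/**
 * Encode id => answer pairs as a map-form outputs document.
 * Hypothetical helper for illustration; not an eval-harness API.
 *
 * @param array<string, string> $answers
 */
function encodeOutputsMap(array $answers): string
{
    // Casting to object keeps "outputs" a JSON object even when the array
    // is empty or happens to have sequential numeric keys.
    return json_encode(
        ['outputs' => (object) $answers],
        JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE | JSON_THROW_ON_ERROR
    );
}

$json = encodeOutputsMap([
    'capital-france' => 'Paris',
    'refund-policy'  => '30 days from delivery.',
]);

// Write wherever your pipeline keeps artifacts; this path is only an example.
$path = sys_get_temp_dir() . '/factuality.json';
file_put_contents($path, $json);
```

The resulting file can then be passed to --outputs.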

List form, a list of {id, actual_output} objects:

{
  "outputs": [
    { "id": "capital-france", "actual_output": "Paris" },
    { "id": "refund-policy", "actual_output": "30 days from delivery." }
  ]
}

Each entry is aligned to its dataset sample by id, then scored with the
dataset’s declared metrics.
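
To make the alignment step concrete, here is a sketch of how both shapes reduce to the same id-to-output map, plus a pre-flight check for sample ids that have no saved output. `normalizeOutputs` and the `$sampleIds` list are hypothetical and only illustrate the idea; this is not the harness's own loader, and this guide does not say how the harness treats missing ids.

```php
<?php

declare(strict_types=1);

/**
 * Normalize a decoded outputs document (map or list form) to id => output.
 * Hypothetical helper for illustration; not part of eval-harness.
 *
 * @param array<string, mixed> $document
 * @return array<string, string>
 */
function normalizeOutputs(array $document): array
{
    $outputs = $document['outputs'] ?? null;
    if (!is_array($outputs)) {
        throw new InvalidArgumentException('Missing top-level "outputs" key.');
    }

    $byId = [];
    foreach ($outputs as $key => $entry) {
        if (is_array($entry)) {
            // List form: each entry is an {id, actual_output} object.
            if (!isset($entry['id'], $entry['actual_output'])) {
                throw new InvalidArgumentException('List entries need "id" and "actual_output".');
            }
            $byId[(string) $entry['id']] = (string) $entry['actual_output'];
        } else {
            // Map form: the key is the sample id, the value is the output.
            $byId[(string) $key] = (string) $entry;
        }
    }

    return $byId;
}

$map = normalizeOutputs(json_decode(
    '{"outputs": {"capital-france": "Paris"}}',
    true, 512, JSON_THROW_ON_ERROR
));
$list = normalizeOutputs(json_decode(
    '{"outputs": [{"id": "capital-france", "actual_output": "Paris"}]}',
    true, 512, JSON_THROW_ON_ERROR
));

// Both shapes yield ['capital-france' => 'Paris'].
var_dump($map === $list); // bool(true)

// Pre-flight check: dataset ids (assumed known here) without a saved output.
$sampleIds = ['capital-france', 'refund-policy'];
$missing = array_values(array_diff($sampleIds, array_keys($map)));
```

Running a check like this before the scoring step gives a clear error when an outputs file has drifted from the dataset.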

Running it

php artisan eval-harness:run rag.factuality.fy2026 \
  --registrar="App\Console\EvalRegistrar" \
  --outputs=eval/outputs/factuality.json \
  --json --out=factuality.json

The registrar still registers the dataset and its metrics, but no
eval-harness.sut binding is required in this mode: there is no SUT (system
under test) to invoke.

When --outputs is set, the run goes straight to scoreOutputs() and
bypasses the batch contract entirely. --batch, --batch-profile,
--concurrency, --rate-limit, and the other dispatch flags are ignored —
there is nothing to dispatch, only saved outputs to score.

Programmatic scoring

The same path is available through the facade — useful inside a test or a custom
command:

use Padosoft\EvalHarness\Datasets\DatasetSample;
use Padosoft\EvalHarness\Facades\EvalFacade;

// The facade class is EvalFacade because `Eval` is a reserved word in PHP and
// cannot be used as a class alias (`use ... as Eval` is a parse error).
EvalFacade::dataset('rag.smoke')
    ->withSamples([
        new DatasetSample(id: 's1', input: ['q' => 'hi'], expectedOutput: 'hello'),
        new DatasetSample(id: 's2', input: ['q' => 'bye'], expectedOutput: 'goodbye'),
    ])
    ->withMetrics(['exact-match'])
    ->register();

$report = EvalFacade::scoreOutputs('rag.smoke', [
    's1' => 'hello',
    's2' => 'wrong answer',
]);

$report is the same EvalReport type a full run produces, with the same
aggregates, cohorts, histograms, and JSON shape.

Path handling

Relative --out paths resolve against the configured reports disk and prefix
(eval-harness/reports by default). Pass --raw-path only when you want --out
treated as a literal filesystem path, and only when that path's parent
directory already exists.
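
Because --raw-path expects the parent directory to exist, a wrapper script might create it before invoking the command. The sketch below uses only standard PHP functions. `ensureParentDirectory` is a hypothetical helper and the temp-directory path is illustrative.

```php
<?php

declare(strict_types=1);

/**
 * Create the parent directory of $path if it does not exist yet.
 * Hypothetical helper for illustration; not part of eval-harness.
 */
function ensureParentDirectory(string $path): void
{
    $dir = dirname($path);

    // The second is_dir() covers the case where another process created
    // the directory between the first check and mkdir().
    if (!is_dir($dir) && !mkdir($dir, 0775, true) && !is_dir($dir)) {
        throw new RuntimeException("Could not create directory: {$dir}");
    }
}

// Example literal path to later pass as --out together with --raw-path.
$out = sys_get_temp_dir() . '/eval-harness-demo/reports/factuality.json';
ensureParentDirectory($out);
```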

Running evaluations

The full SUT-driven run for comparison.

Report contract

The JSON shape both modes produce.
