PHP API
The package is primarily driven through Artisan, but the same engine is exposed
as normal Laravel services for test suites, custom dashboards, and host
application workflows.
All examples assume the package service provider has been loaded by Laravel’s
package discovery and the host app has published config/eval-harness.php.
Eval engine
Resolve Padosoft\EvalHarness\EvalEngine from the container when you need to
register datasets and run a system under test directly.
use Padosoft\EvalHarness\EvalEngine;

$report = app(EvalEngine::class)
    ->registerDataset($dataset)
    ->run('rag.factuality', fn (string $input) => app(RagAgent::class)->answer($input));
The system under test (SUT) can be a callable, a SampleRunner, or a
SampleInvocation-aware callable. Use the command path when you want exit-code
behavior; use the service path when another workflow owns orchestration.
Dataset builder
Programmatic datasets are useful for generated adversarial seeds, test fixtures,
or host applications that already store canonical cases elsewhere.
use Padosoft\EvalHarness\Datasets\DatasetBuilder;

$dataset = DatasetBuilder::make('rag.factuality')
    ->sample(
        id: 'refund-policy',
        input: 'How long do I have to request a refund?',
        expectedOutput: 'Refunds are available within 30 days.',
        metadata: ['tags' => ['policy', 'refunds']],
    )
    ->metric('contains')
    ->metric('llm-as-judge')
    ->build();
Metrics
Custom metrics implement Padosoft\EvalHarness\Metrics\Metric and return a
MetricScore between 0.0 and 1.0.
use Padosoft\EvalHarness\Datasets\DatasetSample;
use Padosoft\EvalHarness\Metrics\Metric;
use Padosoft\EvalHarness\Metrics\MetricScore;

final class StartsWithExpectedMetric implements Metric
{
    public function name(): string
    {
        return 'starts-with-expected';
    }

    public function score(DatasetSample $sample, mixed $actual): MetricScore
    {
        // 1.0 when the actual output begins with the expected output, otherwise 0.0.
        return new MetricScore(str_starts_with((string) $actual, (string) $sample->expectedOutput) ? 1.0 : 0.0);
    }
}
Register an instance in the dataset, pass an FQCN string, or bind it behind a
container alias consumed by MetricResolver.
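Scores do not have to be binary. As a minimal sketch of a partial-credit calculation that stays inside the 0.0 to 1.0 range, the function below computes the fraction of expected tokens that appear in the actual output. The name `expectedTokenRecall` and the tokenization rule are assumptions made for illustration; they are not part of the package.

```php
<?php

/**
 * Hypothetical helper (not part of the package): fraction of distinct
 * expected tokens that also occur in the actual output, in [0.0, 1.0].
 */
function expectedTokenRecall(string $expected, string $actual): float
{
    // Lowercase, split on non-word characters, drop empties and duplicates.
    $tokenize = static fn (string $text): array => array_values(array_unique(
        preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY)
    ));

    $expectedTokens = $tokenize($expected);
    if ($expectedTokens === []) {
        // Nothing to match against; treat as no credit.
        return 0.0;
    }

    $hits = count(array_intersect($expectedTokens, $tokenize($actual)));

    return $hits / count($expectedTokens);
}
```

Inside a custom metric, the returned float could be wrapped the same way as in the class above, for example `new MetricScore(expectedTokenRecall((string) $sample->expectedOutput, (string) $actual))`.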
Saved outputs
Use SavedOutputsLoader when generation and scoring are deliberately separated.
The loaded outputs can be passed through the same metrics and report renderers as
a live run.
use Padosoft\EvalHarness\Outputs\SavedOutputsLoader;

$outputs = app(SavedOutputsLoader::class)->load(base_path('storage/eval/outputs.json'));
$report = app(EvalEngine::class)->scoreOutputs('rag.factuality', $outputs);
Online monitor
OnlineMonitor::capture() samples production traffic according to config and
queues JudgeLiveSampleJob when the sample is selected.
use Padosoft\EvalHarness\Online\OnlineMonitor;

app(OnlineMonitor::class)->capture(
    dataset: 'rag.factuality',
    input: $question,
    expectedOutput: $expectedAnswer,
    actualOutput: $agentAnswer,
    metadata: ['tenant' => $tenantId],
);
Do not put secrets, raw provider payloads, or customer PII in metadata. Reports
and trend APIs are intentionally optimized for quality signals, not audit-log
retention.
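One way to enforce the metadata rule is to allow-list keys before calling capture(). The sketch below is plain PHP; the helper name `safeEvalMetadata` and the example keys are hypothetical and not provided by the package.

```php
<?php

/**
 * Hypothetical helper (not part of the package): keep only metadata keys
 * that are explicitly allow-listed, dropping everything else.
 */
function safeEvalMetadata(array $metadata, array $allowedKeys): array
{
    // array_flip turns the allow-list into keys so it can be intersected by key.
    return array_intersect_key($metadata, array_flip($allowedKeys));
}

$metadata = safeEvalMetadata(
    ['tenant' => 'acme', 'email' => 'jane@example.com', 'api_key' => 'sk-123'],
    ['tenant', 'route'],
);
// $metadata now holds only the 'tenant' entry.
```

The filtered array can then be passed as the `metadata` argument of capture(). An allow-list is used rather than a deny-list so that newly added fields are excluded by default.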
Worked example
Build a tiny dataset
Define one or two cases with DatasetBuilder inside a PHPUnit test.
Register a deterministic SUT
Pass a closure that returns fixed outputs so the test is network-free.
Assert the aggregate
Read the returned EvalReport and assert the metric aggregate or macro-F1
threshold that matters for the code path under test.
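The deterministic SUT from the second step can be as small as a lookup table wrapped in a closure. This is a sketch in plain PHP; the variable names and the empty-string fallback for unknown inputs are assumptions chosen for illustration.

```php
<?php

// Fixed answers keyed by input; no network or model call is involved.
$cannedAnswers = [
    'How long do I have to request a refund?' => 'Refunds are available within 30 days.',
];

// Deterministic SUT: the same input always yields the same output,
// and inputs without a canned answer yield an empty string.
$sut = static fn (string $input): string => $cannedAnswers[$input] ?? '';
```

The closure can then be passed as the second argument to run(), in place of the RagAgent closure shown under Eval engine.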
Gotchas and limits
The service path does not fail CI by itself
Artisan commands translate report failures into process exit codes. Direct PHP
calls return a report object, so the host test or job must make the assertion.
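If a host job needs the same pass/fail signal that the commands provide, it has to derive it from the report itself. The sketch below shows only the gating step and takes the aggregate as a plain float, because how that value is read from the report object is not covered in this section; the function name `evalExitCode` is hypothetical.

```php
<?php

/**
 * Hypothetical gate (not part of the package): map an aggregate score to a
 * process exit code, 0 when the threshold is met and 1 otherwise.
 */
function evalExitCode(float $aggregate, float $threshold): int
{
    return $aggregate >= $threshold ? 0 : 1;
}
```

In a PHPUnit test the equivalent is an assertion on the aggregate; in a standalone script the result could be handed to `exit()`.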
Provider-backed metrics still need configured clients
llm-as-judge, refusal-quality, cosine-embedding, and bertscore-like
resolve judge or embedding clients from the container. Fake those bindings in
unit tests.
Saved outputs bypass batch options
When scoring precomputed outputs, the package does not dispatch the SUT, so
batch profile, queue, concurrency, timeout, and rate-limit flags are irrelevant.