From Scores to Receipts: Introducing ReceiptBench, a Typed-Receipt Protocol for Governance-Ready Evaluation

Mathine: Large Language Field Instrumentation Machine
Link: https://doi.org/10.5281/zenodo.18661829 [1]

Agent benchmarks are drifting away from “one number to rule them all.” That shift is not cosmetic; it is a structural response to deployment reality. Tool-using agents live inside regimes, each defined by its admissible evidence, budgeted verification, executable tool boundaries, and an explicit definition of closure. When a benchmark compresses all that into a pooled mean, it hides tail-risk and produces false closure: the agent looks “better” while still failing in the slices that dominate governance outcomes.

This paper makes that regime shift executable by proposing ReceiptBench, an evaluation protocol that outputs typed receipts — minimal, structured, time-scoped claims issued by an evaluator/verifier about what occurred during a run — rather than collapsing outcomes into a single scalar score. The point is not to abolish scores; it is to make scores derivative. In ReceiptBench, scores become explicit policy projections over receipt-space, so you can see which governance rule created which number, and you can change the rule without rewriting the benchmark. [1]
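One way to picture “scores as policy projections over receipt-space” is the minimal sketch below. It is an illustration under stated assumptions, not the paper's implementation: the encoding of receipts as (type, passed) pairs, the function names, and the 0.9 threshold are all hypothetical. The same receipts produce different numbers depending on which policy is applied, and the policy can be swapped without touching the receipts.

```python
def pass_rates(receipts):
    """Per-type pass rate from (receipt_type, passed) pairs (hypothetical encoding)."""
    totals, passes = {}, {}
    for rtype, passed in receipts:
        totals[rtype] = totals.get(rtype, 0) + 1
        passes[rtype] = passes.get(rtype, 0) + int(passed)
    return {t: passes[t] / totals[t] for t in totals}


def mean_policy(rates):
    """Policy 1: average the per-type pass rates into one scalar."""
    return sum(rates.values()) / len(rates)


def gated_policy(rates, gates):
    """Policy 2: score 0.0 if any gated type misses its threshold, else the mean."""
    for rtype, threshold in gates.items():
        if rates.get(rtype, 0.0) < threshold:
            return 0.0
    return mean_policy(rates)


# Illustrative receipts for a single run (made-up data).
receipts = [
    ("evidence", True),
    ("evidence", True),
    ("tool_execution", True),
    ("tool_execution", False),
    ("closure", True),
]

rates = pass_rates(receipts)
score_mean = mean_policy(rates)
score_gated = gated_policy(rates, {"tool_execution": 0.9})  # assumed threshold
print(rates, score_mean, score_gated)
```

Here the averaging policy reports a fairly high score, while the gated policy reports zero because half of the tool-execution receipts failed. Both numbers are traceable to the rule that produced them.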

We formalize this with two objects. First, the receipt object: a typed claim with clear scope, an issuing verifier, and an interpretation contract. Second, the receiptboard: a vector representation of a run’s receipts that preserves structure (evidence/grounding, tool execution, state-diff verification, closure/contestability) and supports aggregation without erasing regime meaning. This turns “evaluation” into a governed interface: an audit-ready artifact stream rather than a storytelling exercise. [1]
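The two objects can be sketched as plain data structures. What follows is a hedged illustration only: the field names (`run_id`, `step`, `verifier`, `contract`), the four-member enum, and the (passed, total) encoding of the receiptboard are assumptions chosen to mirror the description above, not the schema defined in the paper.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class ReceiptType(Enum):
    # The four structural categories named in the text.
    EVIDENCE = "evidence"
    TOOL_EXECUTION = "tool_execution"
    STATE_DIFF = "state_diff"
    CLOSURE = "closure"


@dataclass(frozen=True)
class Receipt:
    """A typed claim issued by a verifier about one run (hypothetical fields)."""
    receipt_type: ReceiptType
    run_id: str      # scope: which run the claim is about
    step: int        # time scope within the run
    verifier: str    # who issued the claim
    contract: str    # identifier of the interpretation contract
    passed: bool     # the claim itself


def receiptboard(receipts):
    """Vector of (passed, total) counts with one component per receipt type."""
    passed, total = Counter(), Counter()
    for r in receipts:
        total[r.receipt_type] += 1
        if r.passed:
            passed[r.receipt_type] += 1
    return {t: (passed[t], total[t]) for t in ReceiptType}


def aggregate(boards):
    """Sum receiptboards component-wise, so each type keeps its own counts."""
    return {
        t: (sum(b[t][0] for b in boards), sum(b[t][1] for b in boards))
        for t in ReceiptType
    }


# Illustrative receipts for one run (made-up data).
receipts = [
    Receipt(ReceiptType.EVIDENCE, "run-001", 1, "verifier-a", "contract-v1", True),
    Receipt(ReceiptType.TOOL_EXECUTION, "run-001", 2, "verifier-a", "contract-v1", True),
    Receipt(ReceiptType.TOOL_EXECUTION, "run-001", 3, "verifier-a", "contract-v1", False),
    Receipt(ReceiptType.STATE_DIFF, "run-001", 4, "verifier-b", "contract-v1", True),
]
board = receiptboard(receipts)
combined = aggregate([board, board])
print(board[ReceiptType.TOOL_EXECUTION], combined[ReceiptType.TOOL_EXECUTION])
```

Keeping one component per receipt type is the design choice that matters here: aggregation adds counts within a type but never mixes types, so a tool-execution failure cannot be averaged away by evidence successes.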

ReceiptBench also sharpens the Large Language Fields (LLFs) thesis by making convergence measurable. In the LLF view, reliability is governed above the model by a reusable, versioned contract layer defining meaning, admissibility, and what counts as “done.” ReceiptBench operationalizes “benchmark convergence” as convergence in receipt distributions by field, where deployability constraints naturally appear as worst-subfield gates and cascade signatures, not as minor score deltas. In other words, tail-risk becomes visible because the benchmark output is already structured like a governance object. [1], [3]–[5]
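As a rough illustration of “convergence in receipt distributions by field,” the sketch below compares two sets of receipt outcomes field by field. The choice of total variation distance is an assumption made for this example, as are the function names, the outcome labels, and the field names; the paper may define convergence differently.

```python
from collections import Counter


def receipt_distribution(outcomes):
    """Turn a list of receipt outcome labels into a probability dict."""
    counts = Counter(outcomes)
    n = sum(counts.values())
    if n == 0:
        return {}
    return {label: c / n for label, c in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(l, 0.0) - q.get(l, 0.0)) for l in labels)


def distance_by_field(bench_a, bench_b):
    """Per-field distance for fields present in both inputs (assumed metric)."""
    shared = bench_a.keys() & bench_b.keys()
    return {
        f: total_variation(
            receipt_distribution(bench_a[f]), receipt_distribution(bench_b[f])
        )
        for f in shared
    }


# Made-up receipt outcomes from two benchmarks, grouped by hypothetical field.
bench_a = {"billing": ["pass", "pass", "fail", "pass"], "search": ["pass", "pass"]}
bench_b = {"billing": ["pass", "fail", "fail", "pass"], "search": ["pass", "pass"]}

distances = distance_by_field(bench_a, bench_b)
print(distances)
```

A distance near zero in every field would suggest the two benchmarks issue similar receipts; a single field with a large distance is the kind of localized disagreement that a pooled score would not show.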

If your operational question is “Can we deploy this agent under our regime?”, ReceiptBench reframes the answer. You do not want a single number that averages away the boundary. You want receipts that survive sampling, replay, and third-party checking. This connects directly to Field-Driven Design (FDD) as the operational discipline for running a field in production, and to Field Definition Language (FDL) as the contract mechanism that makes checks-before-effect and replayable receipts first-class at the API boundary. [2], [5]

The paper’s falsifiable predictions follow from this structure: receipt-complete benchmarks should predict failures better than pooled means; worst-subfield performance and dispersion should dominate the mean as governance predictors; and receipt audit sampling should explain a large fraction of false closure in real agent deployments, especially where tool boundaries and evidence admissibility are the true failure modes. [1]
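To make the second prediction concrete, the sketch below computes the mean, the worst subfield, and the dispersion from per-subfield pass rates. The subfield names and numbers are invented for illustration, and using the population standard deviation as the dispersion measure is an assumption rather than the paper's definition.

```python
from statistics import mean, pstdev


def governance_summary(subfield_pass_rates):
    """Summarize per-subfield pass rates as mean, worst subfield, and dispersion."""
    rates = list(subfield_pass_rates.values())
    worst = min(subfield_pass_rates, key=subfield_pass_rates.get)
    return {
        "mean": mean(rates),
        "worst_subfield": worst,
        "worst_rate": subfield_pass_rates[worst],
        "dispersion": pstdev(rates),  # assumed dispersion measure
    }


# Made-up pass rates for two hypothetical agents.
agent_a = {"retrieval": 0.90, "tool_calls": 0.90, "state_changes": 0.90}
agent_b = {"retrieval": 1.00, "tool_calls": 1.00, "state_changes": 0.76}

summary_a = governance_summary(agent_a)
summary_b = governance_summary(agent_b)
print(summary_a)
print(summary_b)
```

In this invented example, agent B has the higher mean but the lower worst-subfield rate and the larger dispersion. A ranking by mean and a ranking by worst subfield therefore disagree, which is the situation the prediction is about.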

References
[1] R. Figurelli, “From Scores to Receipts: Introducing ReceiptBench, a Typed-Receipt Protocol for Governance-Ready Evaluation”. Zenodo, Feb. 16, 2026. https://doi.org/10.5281/zenodo.18661829
[2] R. Figurelli, “Field-Driven Design (FDD): The Operational Extension of Large Language Fields (LLFs)”. Zenodo, Oct. 13, 2025. https://doi.org/10.5281/zenodo.17342856
[3] R. Figurelli, “Large Language Fields (LLFs): The Invisible Layer Above LLMs”. Zenodo, Oct. 03, 2025. https://doi.org/10.5281/zenodo.17254137
[4] R. Figurelli, “Field Intelligence: Designing Decisions in Large Language Fields (LLFs)”. Zenodo, Oct. 08, 2025. https://doi.org/10.5281/zenodo.17290823
[5] R. Figurelli, “Field Definition Language (FDL): A Proposal to Evolve APIs into Governed Fields”. Zenodo, Oct. 18, 2025. https://doi.org/10.5281/zenodo.17382665

— © 2026 Rogério Figurelli. This article is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). You are free to share and adapt this material for any purpose, even commercially, provided that appropriate credit is given to the author and the source. To explore more on this and other related topics and books, visit the author’s page (Amazon).
