Benchmarks-as-Contracts: A ReceiptBench Spec Template for Regimes and Closure

Mathine: Contract-First Receipt Spec Machine
Link: https://doi.org/10.5281/zenodo.18675035 [1]

Benchmarks for tool-using agents are moving away from pooled scalar scores and converging on regime-explicit evaluation: admissible evidence, checkable closure, and verifier-bounded verification budgets. In practice, a benchmark is no longer just a scoreboard; it is a governed interface that must declare what counts as evidence, which actions are admissible, and what “done” means.

This paper introduces the missing practical layer: a specification template that turns any benchmark into an explicit contract artifact Π_b with declared admissibility rules, budgets, and closure predicates, plus mandatory receiptboard reporting so that subfield heterogeneity is preserved rather than averaged away.
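To make the idea of a contract artifact concrete, here is a minimal Python sketch of what a Π_b record might look like. It is not the paper's schema: the class names (Evidence, BenchmarkContract), the field names, the budget key "verifier_calls", and the example regime label are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Mapping, Sequence


@dataclass(frozen=True)
class Evidence:
    kind: str      # hypothetical evidence type, e.g. "unit_test" or "self_report"
    payload: str   # hypothetical result string


@dataclass(frozen=True)
class BenchmarkContract:
    """Illustrative stand-in for a contract artifact (all fields are assumptions)."""
    benchmark_id: str
    regime: str
    admissible_evidence: FrozenSet[str]            # admissibility rule
    budgets: Mapping[str, int]                     # declared budget limits
    closure: Callable[[Sequence[Evidence]], bool]  # closure predicate

    def admit(self, evidence: Sequence[Evidence]) -> List[Evidence]:
        # Keep only evidence whose kind the contract declares admissible.
        return [e for e in evidence if e.kind in self.admissible_evidence]

    def is_closed(self, evidence: Sequence[Evidence], spent: Mapping[str, int]) -> bool:
        # "Done" holds only within budget and only on admissible evidence.
        within_budget = all(spent.get(k, 0) <= limit for k, limit in self.budgets.items())
        return within_budget and self.closure(self.admit(evidence))


contract = BenchmarkContract(
    benchmark_id="demo-bench",
    regime="offline-tools",
    admissible_evidence=frozenset({"unit_test", "tool_log"}),
    budgets={"verifier_calls": 3},
    closure=lambda ev: any(e.kind == "unit_test" and e.payload == "pass" for e in ev),
)

evidence = [Evidence("self_report", "done"), Evidence("unit_test", "pass")]
closed = contract.is_closed(evidence, spent={"verifier_calls": 2})
over_budget = contract.is_closed(evidence, spent={"verifier_calls": 5})
self_report_only = contract.is_closed(evidence[:1], spent={"verifier_calls": 1})
```

In this sketch an agent's own report never counts toward closure, and a passing test obtained after overspending the declared budget does not count either. Both outcomes follow from the declared contract rather than from scoring code.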

ReceiptBench makes the shift executable by treating outputs as typed receipts (minimal, time-scoped claims emitted by evaluators/verifiers). With a Π_b spec, scores become secondary: a score is an explicit policy projection over receipt-space, rather than the primary object that hides regime structure and tail risk.
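The relation between receipts and scores can be illustrated with a small Python sketch. The Receipt fields, the claim name "closure_verified", and the two policies below are assumptions made for the example, not definitions taken from ReceiptBench. The point is only that the same receipts yield different scores under different declared policies.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Sequence


@dataclass(frozen=True)
class Receipt:
    """Hypothetical typed receipt: a minimal, time-scoped claim."""
    claim: str
    subject: str
    issuer: str
    valid_from: datetime
    valid_until: datetime
    holds: bool

    def valid_at(self, when: datetime) -> bool:
        return self.valid_from <= when <= self.valid_until


def live_closures(receipts: Sequence[Receipt], when: datetime) -> List[Receipt]:
    # Only closure receipts whose validity window covers `when` are considered.
    return [r for r in receipts if r.claim == "closure_verified" and r.valid_at(when)]


def closure_rate(receipts: Sequence[Receipt], when: datetime) -> float:
    # Policy 1: fraction of live closure receipts that hold.
    live = live_closures(receipts, when)
    return sum(r.holds for r in live) / len(live) if live else 0.0


def all_or_nothing(receipts: Sequence[Receipt], when: datetime) -> float:
    # Policy 2: 1.0 only if every live closure receipt holds.
    live = live_closures(receipts, when)
    return 1.0 if live and all(r.holds for r in live) else 0.0


def utc(year: int, month: int, day: int) -> datetime:
    return datetime(year, month, day, tzinfo=timezone.utc)


receipts = [
    Receipt("closure_verified", "task-1", "verifier-a", utc(2026, 1, 1), utc(2026, 12, 31), True),
    Receipt("closure_verified", "task-2", "verifier-a", utc(2026, 1, 1), utc(2026, 12, 31), False),
    # Expired receipt: ignored by both policies at the evaluation time below.
    Receipt("closure_verified", "task-3", "verifier-b", utc(2025, 1, 1), utc(2025, 12, 31), True),
]

now = utc(2026, 6, 1)
scores = {policy.__name__: policy(receipts, now) for policy in (closure_rate, all_or_nothing)}
```

Here `scores` maps each policy name to its projection of the same receipt set. The receipts stay the primary record, and a score is meaningful only together with the policy that produced it.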

The deeper claim is architectural: a benchmark spec template is naturally a Metafield—a design-time contract family that instantiates benchmark contracts across tasks and domains while enforcing migration semantics to prevent silent semantic drift. The result is governance-ready evaluation: “done” becomes auditable, tail risk becomes visible through worst-subfield and closure-integrity signals, and comparability becomes a property of declared regimes rather than leaderboard folklore.
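One possible reading of "migration semantics" is a check that refuses to accept a changed contract unless the change is versioned and explicitly declared. The Python sketch below follows that reading. The names ContractVersion, SilentDriftError, and check_migration, and the choice of which fields count as semantic, are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional

# Fields whose change alters what a result means (an assumption of this sketch).
SEMANTIC_FIELDS = ("closure_rule", "admissible_evidence")


@dataclass(frozen=True)
class ContractVersion:
    benchmark_id: str
    version: int
    closure_rule: str
    admissible_evidence: FrozenSet[str]


class SilentDriftError(ValueError):
    """Raised when semantics change without a version bump and a declared note."""


def check_migration(old: ContractVersion, new: ContractVersion,
                    migration_note: Optional[str] = None) -> List[str]:
    # Return the semantic fields that changed, or raise if the change is undeclared.
    changed = [name for name in SEMANTIC_FIELDS if getattr(old, name) != getattr(new, name)]
    if changed and new.version <= old.version:
        raise SilentDriftError(f"{changed} changed without a version bump")
    if changed and not migration_note:
        raise SilentDriftError(f"{changed} changed without a migration note")
    return changed


v1 = ContractVersion("demo-bench", 1, "all unit tests pass", frozenset({"unit_test"}))
v2 = ContractVersion("demo-bench", 2, "all unit tests pass twice", frozenset({"unit_test"}))

declared = check_migration(v1, v2, migration_note="closure now requires a repeated run")

try:
    check_migration(v1, v2)  # same change, but undeclared
    undeclared_rejected = False
except SilentDriftError:
    undeclared_rejected = True
```

Under this reading, results produced under v1 and v2 are comparable only through the declared migration, never by default.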

Falsifiable predictions follow directly: regime-explicit, receiptboard-based benchmarks should predict operational failures better than pooled means; worst-subfield and dispersion should dominate mean as governance predictors; and closure-integrity signals should explain a meaningful share of false closure observed in real agent deployments.
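A small Python sketch shows why worst-subfield and dispersion can separate systems that a pooled mean treats as identical. The subfield names and pass rates are invented for illustration, the "pooled mean" is simplified to an unweighted mean over subfields, and the receiptboard function is a hypothetical summary, not the paper's reporting format.

```python
from statistics import mean, pstdev
from typing import Dict


def receiptboard(pass_rates: Dict[str, float]) -> Dict[str, object]:
    # Summarize per-subfield pass rates without averaging the structure away.
    worst = min(pass_rates, key=pass_rates.get)
    values = list(pass_rates.values())
    return {
        "pooled_mean": mean(values),     # unweighted mean over subfields
        "worst_subfield": worst,
        "worst_rate": pass_rates[worst],
        "dispersion": pstdev(values),    # population standard deviation
    }


# Invented numbers: both agents have (approximately) the same mean of 0.7.
agent_a = {"billing": 0.9, "search": 0.9, "refunds": 0.3}
agent_b = {"billing": 0.7, "search": 0.7, "refunds": 0.7}

board_a = receiptboard(agent_a)
board_b = receiptboard(agent_b)
```

Both boards report nearly the same pooled mean, but only agent_a's board shows a weak "refunds" subfield and non-zero dispersion. A leaderboard built on the mean alone would rank the two agents as equal.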

References
[1] R. Figurelli, “Benchmarks-as-Contracts: A ReceiptBench Spec Template for Regimes and Closure”. Zenodo, Feb. 17, 2026. https://doi.org/10.5281/zenodo.18675035
[2] R. Figurelli, “ReceiptBench for Field Networks: Recursively Composable Governance via Typed Receipts”. Zenodo, Feb. 17, 2026. https://doi.org/10.5281/zenodo.18665376
[3] R. Figurelli, “From Scores to Receipts: Introducing ReceiptBench, a Typed-Receipt Protocol for Governance-Ready Evaluation”. Zenodo, Feb. 16, 2026. https://doi.org/10.5281/zenodo.18661829
[4] R. Figurelli, “Benchmark Convergence As Operational Confirmation Of Large Language Fields (LLFs)”. Zenodo, Feb. 15, 2026. https://doi.org/10.5281/zenodo.18653012
[5] R. Figurelli, “Large Language Fields (LLFs): The Invisible Layer Above LLMs”. Zenodo, Oct. 03, 2025. https://doi.org/10.5281/zenodo.17254137

© 2026 Rogério Figurelli. This article is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). You are free to share and adapt this material for any purpose, even commercially, provided that appropriate credit is given to the author and the source.