Executable Tests Are the Benchmark: Why Secure Coding Needs Receipts, Not Scores
Math Machine: PoC-Receipt Secure-Code Closure Machine
License: CC BY 4.0
Source: https://www.arxiv.org/abs/2602.15485
The source introduces SecCodeBench-V2, a publicly released benchmark that evaluates coding assistants on secure code generation and secure code fixing. It says the benchmark contains 98 scenarios spanning 22 common weakness categories across five programming languages (including a JavaScript runtime), and that scenarios are derived from industrial production code. It uses a function-level task formulation: each scenario provides a project scaffold and asks the model to implement or patch a designated target function under fixed interfaces and dependencies. For verification, the source says each scenario includes executable proof-of-concept test cases for both functional validation and security verification, authored and double-reviewed by security experts, and that evaluation is primarily dynamic execution in isolated environments; for cases that cannot be decided by deterministic tests, the source adds an LLM-as-judge oracle. It also describes a Pass@K-based scoring protocol with aggregation over scenarios and severity to make comparisons more holistic.
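The Pass@K protocol mentioned above is commonly computed with the standard unbiased estimator over n sampled generations, of which c pass. This is a sketch under that assumption; the report's exact aggregation may differ:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimate: the probability that at least one of k
    samples drawn (without replacement) from n generations, c of which
    pass, is a passing one. Standard form: 1 - C(n - c, k) / C(n, k).
    """
    if n - c < k:
        # Fewer than k failing generations exist, so any draw of k
        # must contain at least one pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 generations, 3 pass both functional and security tests.
print(round(pass_at_k(10, 3, 1), 3))  # 0.3
```

For a joint predicate benchmark, c would count only generations that pass both the functional and the security PoC tests, which is stricter than counting either alone.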
What we add is a “fields-like” reading of what this benchmark is really doing: it’s turning “secure code” from a narrative property into an admissible transform with a verifier. The invariant is not “the code looks right,” but “the code passes a declared battery of functional and security checks under a fixed scaffold,” which is exactly a ReceiptBench/LSF move: the unit of trust becomes a receipt bundle (tests + environment + outcome), not a claim. In Field Network terms, the subfield is executable verification, the field is secure-code evaluation, the metafield is identity+drift of scaffolds and dependencies, and the overfield is governance of AI coding assistants where “safe enough” must be checkable rather than persuasive. This is also W = I ^ C in practice: capability (I) is irrelevant unless governance (C) forces closure to be proven in a way that survives drift.
Zero-trust for this case is concrete: don't trust model narratives, and don't trust single numbers. Trust only replayable verification artifacts that are cheaper to check than to argue about: scaffold identity, dependency locks, the exact PoC tests, the execution isolation rules, and the pass/fail transcript. TTOkay becomes "okay to ship this assistant in a workflow" only when the verifier receipts demonstrate bounded failure modes across weakness classes, not just an average pass rate. GeoIT clarifies the coordination burden: developers want speed, security teams want coverage, operators want reproducibility, and auditors want attributable evidence, so the Circle of Realization breaks unless all parties can replay the same closure object. Multitime is why this is hard: the attacker clock evolves fast, the dependency clock drifts continuously, the product clock pushes releases, and the audit clock lags. Closure must therefore be defined as a living predicate under drift, not as a one-off benchmark win.
A clean operational equation for what this benchmark is implicitly optimizing is a severity-weighted, executable closure rate:
Score = ( Σₛ w(sevₛ) · 𝟙[ FuncTestsₛ = pass ∧ SecTestsₛ = pass ] ) / ( Σₛ w(sevₛ) )
Here s indexes scenarios, sevₛ is the scenario's declared severity, w(·) is the weight assigned by the severity aggregation, FuncTestsₛ are the functional PoC tests, and SecTestsₛ are the security PoC tests. This formulation matches operational practice because it treats "secure and correct" as a joint predicate that must hold under execution, not as a style preference.
H1: Security evaluation becomes transportable only when it is framed as zero-trust closure (replayable verifiers plus a typed HOLD state), not as textual judgment. [1] Falsifier: narrative-based evaluation predicts real-world secure behavior as well as replayable verifier receipts do.
H2: Survivable progress requires worst-slice reporting across weakness classes and severities; means alone will systematically understate risk. [2] Falsifier: the mean score predicts failure risk as well as worst-slice plus dispersion across classes.
H3: A Cub³ gate (compute feasibility of replay, cMth survivability constraints, and hPhy regime flips under drift) prevents "score theater" without slowing genuine improvement. [3] Falsifier: adding these gates either fails to reduce regressions or increases time-to-iteration without safety gains.
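The severity-weighted closure score above, together with the worst-slice reporting of H2, can be sketched as follows. The severity weights, field names, and CWE labels here are illustrative assumptions, not the benchmark's actual values:

```python
from collections import defaultdict

# Hypothetical severity weights; the benchmark's real aggregation may differ.
SEVERITY_WEIGHT = {"low": 1.0, "medium": 2.0, "high": 4.0, "critical": 8.0}

def closure_score(scenarios):
    """Severity-weighted joint pass rate: a scenario counts only if
    BOTH its functional and its security PoC tests pass."""
    num = sum(SEVERITY_WEIGHT[s["severity"]]
              for s in scenarios if s["func_pass"] and s["sec_pass"])
    den = sum(SEVERITY_WEIGHT[s["severity"]] for s in scenarios)
    return num / den if den else 0.0

def worst_slice(scenarios):
    """Per-weakness-class closure rates; the minimum is the worst slice,
    which mean scores alone would hide."""
    by_class = defaultdict(list)
    for s in scenarios:
        by_class[s["cwe"]].append(s["func_pass"] and s["sec_pass"])
    rates = {cwe: sum(r) / len(r) for cwe, r in by_class.items()}
    return min(rates.values()), rates

scenarios = [
    {"cwe": "CWE-89", "severity": "high",   "func_pass": True, "sec_pass": True},
    {"cwe": "CWE-89", "severity": "high",   "func_pass": True, "sec_pass": False},
    {"cwe": "CWE-79", "severity": "medium", "func_pass": True, "sec_pass": True},
]
print(closure_score(scenarios))   # (4 + 2) / (4 + 4 + 2) = 0.6
print(worst_slice(scenarios)[0])  # CWE-89 slice: 1/2 = 0.5
```

Note how a respectable aggregate (0.6) coexists with a 50% failure rate on one weakness class, which is exactly the gap worst-slice reporting is meant to expose.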
Closure for this benchmark class is "done" only when the community can treat results as portable evidence. That requires: declared subfields (task formulation, scaffold identity, execution isolation, oracle usage), explicit closure predicates (what constitutes pass, what is oracle-adjudicated, what is HOLD), a receipt schema (scaffold hash, dependency locks, PoC tests, runtime isolation spec, pass/fail transcript, severity labels), and an audit budget (sampling rules for reruns, frequency of drift revalidation, and how the worst slice is monitored). It must also name the regime flips where conclusions invert: deterministic PoC checks vs. oracle judgments, stable dependencies vs. drift, and fix tasks vs. generation tasks. Those boundaries decide whether a score is a closure object or merely a snapshot.
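One way to make the receipt schema and typed verdicts concrete is a record whose fields must all be present before a result counts as a closure object. The field names and the PASS/FAIL/HOLD encoding below are illustrative assumptions, not a schema the source defines:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Receipt:
    """Illustrative receipt bundle: the replayable evidence unit,
    mirroring the schema sketched in the text (hypothetical fields)."""
    scaffold_hash: str          # identity of the project scaffold
    dependency_locks: tuple     # pinned dependency versions
    poc_tests: tuple            # functional + security PoC test ids
    isolation_spec: str         # runtime isolation configuration
    transcript: dict            # per-test pass/fail outcomes
    severity: str               # declared severity label

def closure_status(receipt: Receipt, oracle_needed: bool = False) -> str:
    """Typed verdict: PASS only when every deterministic test passed;
    oracle-adjudicated cases stay HOLD until independently replayed."""
    if oracle_needed:
        return "HOLD"
    return "PASS" if all(receipt.transcript.values()) else "FAIL"

r = Receipt("sha256:deadbeef", ("lib==1.2.3",), ("func_1", "sec_1"),
            "container, no network", {"func_1": True, "sec_1": True}, "high")
print(closure_status(r))  # PASS
```

Freezing the dataclass is a deliberate choice: a receipt is evidence, so it should be immutable once emitted, and any change to scaffold, locks, or tests should produce a new receipt rather than mutate an old one.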
References
[1] R. Figurelli, “Zero-Trust Science: A New Architecture for Scientific Closure (Beyond Peer Review),” Preprint, 2026.
[2] R. Figurelli, “Collapse Mathematics (cMth): A New Frontier in Symbolic Structural Survivability,” Preprint, 2026.
[3] R. Figurelli, “Cub³: A New Heuristic Architecture for Cross-Domain Convergence,” Preprint, 2026.
[4] L. Chen et al., “SecCodeBench-V2 Technical Report,” Preprint, 2026.
[5] NIST, Secure Software Development Framework (SSDF), SP 800-218, 2022.
[6] ISO/IEC, ISO/IEC 27034 Application Security, ISO/IEC, 2011.
[7] MITRE, Common Weakness Enumeration (CWE) List, Technical Catalog, 2025.
[8] OWASP, Application Security Verification Standard (ASVS), 2023.
[9] S. McConnell, Code Complete, 2nd ed., Microsoft Press, 2004.
[10] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed., MIT Press, 2018.
