Transcripts as Receipts: Bounding Coding-Agent Uplift Without an RCT

Math Machine: Transcript-Uplift Upper-Bound Receipt Machine
Source: https://metr.org/notes/2026-02-17-exploratory-transcript-analysis-for-estimating-time-savings-from-coding-agents/

Facts
The source (dated February 17, 2026) describes an exploratory method for estimating productivity uplift from coding agents using transcripts, as a cheaper alternative to human uplift studies. It was prototyped on 5,305 coding-agent transcripts generated in January 2026 by seven technical staff. The method uses an LLM judge to estimate how long an experienced software engineer would take to complete the same tasks without AI, compares that counterfactual estimate to the estimated time spent with AI (based on windows containing user-typed messages), and reports an estimated “time savings factor” of roughly 1.5× to 13× on these AI-assisted tasks, while emphasizing substantial caveats and framing the factor as a soft upper bound on true uplift. The source lists limitations including task substitution, task selection effects (transcripts reflect tasks people choose to do with AI), worker specialization effects, limited judge validation (34 labels), limited data sources (one month, one tool, seven people), and potential ambiguity in identifying truly human-typed messages in advanced workflows; external validation against an RCT is described as future work.

What we add / What’s new

  • Field Network (subfield→field→metafield→overfield→metaoverfield): subfield = transcript segmentation + “human-typed” windows; field = per-task counterfactual time estimates; metafield = uplift claim admissibility; overfield = organizational productivity decisions; metaoverfield = trust in “speedup headlines” as governance input. [1], [6]
  • GeoIT: the Circle of Realization must close as measurement → decision → rollout → audit; transcript-derived uplift is only decision-grade if the audit loop can replay how estimates were produced. [1]
  • TTOkay: “okay-to-operate” for uplift claims is not the magnitude; it is whether the estimate is receipt-backed, bounded, and stable under declared regime flips (selection, substitution, specialization). [1], [2]
  • Multitime: clocks diverge across the session clock (minutes), the task clock (net successful output), the project clock (value delivered), and the organizational clock (portfolio outcomes); “faster” on one clock can be neutral or negative on another. [4]
  • ReceiptBench/LLF/LSF link: transcripts are partial receipts; without contract-level closure (what counts as “task complete” and “net value”), speedup becomes narrative drift across signals. [1]
  • Zero-trust in this case: do not trust uplift narratives; verify uplift with replayable estimation recipes, sampling, and explicit HOLD states when counterfactuals are not identifiable under budget. [1]
  • cMth: the fragile part is the headline multiplier; the survivable core is a bounded statement: “upper bound on time savings for a selected slice of AI-chosen tasks, under a stated judge and segmentation regime.” [2]
  • hPhy: treat uplift as a heuristic law inside a regime (tooling, task mix, user behavior); outside it, small perturbations in selection or segmentation can create phase-like jumps in measured multipliers. [4]
  • Cub³: compute (replay cost and determinism), math (bounds, dispersion, worst-slice), and physics (behavioral feedback loops: people choose different tasks when assisted) must cohere, or the claim is not decision-grade. [3]

Why it matters
Organizations increasingly need uplift signals, but rigorous RCT-style studies are expensive and slow; transcript-based approaches can be attractive because they are “already there.” The operational risk is false closure: a large time-savings factor can be real for a narrow slice while still overstating total productivity impact due to selection and substitution—so decision-makers need bounded, replayable, slice-aware evidence rather than a single multiplier.

Hypotheses
H1 — Transcript-based uplift estimates become decision-grade only when treated as zero-trust claims: replayable recipes, explicit admissibility, and typed HOLD when counterfactuals cannot be defended under budget. [1] Falsifier: A transcript-only pipeline without replayability/HOLD delivers the same audit reliability and policy stability as a zero-trust pipeline across repeated replications.
H2 — Under collapse pressure (more autonomy, more parallelism, more task substitution), the headline uplift collapses first; survivable truth is a bounded upper bound on a declared slice with explicit regime flips. [2] Falsifier: The headline multiplier remains stable (low dispersion) across major regime flips in task mix, autonomy, and selection behavior.
H3 — Cub³ coherence (compute replay + math bounds + behavioral physics) predicts which uplift claims generalize to planning decisions better than a single point estimate. [3] Falsifier: Point estimates predict downstream planning accuracy (staffing, timelines) as well as Cub³ coherence checks across multiple settings.

Where it flips (regimes)
Conclusions invert across: (1) “tasks chosen because AI helps” vs representative work distributions, (2) time saved vs value created when substitution shifts toward lower-value tasks, (3) individual specialization vs a generic “experienced engineer” counterfactual, and (4) single-session work vs concurrent multi-agent workflows where segmentation and attribution rules change the measured denominator.

Math behind it (without math)
The inference trap is that transcripts are not a neutral sample: they are a filtered view of work people decide to do with AI, and the counterfactual time estimate is itself a model. A large measured time-savings factor can simultaneously be (a) real for the observed slice and (b) misleading for overall productivity because the missing data is exactly the work people did not choose to do with AI.
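The selection trap above can be made concrete with a toy simulation (the uplift distribution and the selection threshold are invented for illustration, not taken from the source): when transcripts oversample tasks chosen precisely because AI helps, the median uplift on the observed slice exceeds the population median even though every per-task number is "real."

```python
import random
import statistics

random.seed(0)

# Synthetic population of tasks; true_uplift is how much faster AI makes
# each specific task. All parameters here are illustrative assumptions.
tasks = [{"true_uplift": random.lognormvariate(0.3, 0.6)} for _ in range(10_000)]

# Selection rule (assumed): people route a task to the agent mainly when
# AI clearly helps, so transcripts oversample high-uplift tasks.
observed = [t for t in tasks if t["true_uplift"] > 1.2]

pop_median = statistics.median(t["true_uplift"] for t in tasks)
obs_median = statistics.median(t["true_uplift"] for t in observed)

print(f"population median uplift: {pop_median:.2f}x")
print(f"transcript-slice median:  {obs_median:.2f}x")  # strictly larger
```

Because the low-uplift tail never generates a transcript, the slice median is an upper-bound-style proxy, which is exactly how the source frames its factor.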

Math behind it (with math)
Û = median_i ( T̂_noAI,i / T̂_withAI,i ) with HOLD if Ident(i)=0 [4], [6]

  • Û: reported time-savings factor (an upper-bound style uplift proxy) for a population slice.
  • i: an indexed session/task bundle defined by a declared segmentation regime.
  • T̂_noAI,i: estimated time an experienced engineer would need without AI (judge-based counterfactual).
  • T̂_withAI,i: estimated time spent with AI (segmentation/attribution rule over activity windows).
  • Ident(i): an identifiability flag (1 if segmentation + counterfactual are defensible under the declared regime; 0 otherwise).
    Rationale: operational truth requires bounding and explicit non-closure. If identifiability fails for a slice, a typed HOLD is safer than forcing a number that will be reinterpreted as a global fact.
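The estimator above can be sketched in a few lines, with HOLD treated as a typed outcome rather than a silent drop. The session values and the identifiability rule here are assumptions for illustration, not the source's pipeline:

```python
from dataclasses import dataclass
from statistics import median

HOLD = "HOLD"  # typed non-closure: no number is reported for this slice

@dataclass
class Session:
    t_no_ai: float      # judge-estimated counterfactual minutes (a model, not ground truth)
    t_with_ai: float    # minutes attributed under the declared segmentation rule
    identifiable: bool  # Ident(i): segmentation + counterfactual defensible?

def uplift_proxy(sessions):
    """Û = median ratio over identifiable sessions; HOLD counts are surfaced,
    never silently dropped."""
    ratios = [s.t_no_ai / s.t_with_ai for s in sessions if s.identifiable]
    held = sum(1 for s in sessions if not s.identifiable)
    if not ratios:
        return HOLD, held
    return median(ratios), held

# Illustrative numbers only.
sessions = [
    Session(120, 20, True),   # 6.0x
    Session(60, 40, True),    # 1.5x
    Session(90, 10, False),   # attribution ambiguous -> held out
]
u_hat, n_hold = uplift_proxy(sessions)
print(u_hat, n_hold)  # 3.75 1
```

Reporting the HOLD count alongside Û keeps the denominator honest: a high multiplier computed over a shrinking identifiable subset is a different claim than the same multiplier over all sessions.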

Millennium-problem alignment (and why it matters here)
This is “verification under budget”: it is easy to produce a multiplier, harder to certify that it generalizes across regimes without expensive ground truth (P vs NP as an operational analogy; no formal reduction). A second lens is Yang–Mills existence and mass gap as an intuition for measurement: decision-grade uplift needs a stable “gap” above the noise floor—an invariant that survives behavioral feedback, selection, and segmentation changes—otherwise you are measuring a moving field, not a property. In coevolution terms, as agents change what people choose to do, governance must evolve from point estimates to ledgers of what was verified vs assumed; P + NP = 1 becomes a measurement rule across levels and time: either you pay for verification (ground truth, stronger validation, replay) or you accept unverified space (assumptions about counterfactuals and selection), but you must record that trade explicitly. [1], [2], [6]

Multitime + TTOkay (when ‘done’ depends on which clock you trust)
Key clocks include: user clock (session minutes), task clock (net successful output), concurrency clock (parallel sessions counted once vs many), project clock (integration and maintenance), organizational clock (value delivery), and audit clock (replay of segmentation + judge recipe). TTOkay fails when closure follows the session clock (“we’re 10× faster”) while the project clock absorbs rework or while the audit clock cannot replay how “with AI time” and “without AI time” were computed.
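A back-of-envelope illustration of clock divergence (all minutes are invented): the session clock records a large saving while the project clock absorbs most of it as rework and review.

```python
# Minimal sketch: one work item scored on two clocks (numbers assumed).
work = {
    "session_minutes_saved": 50,  # where the "10x faster" headline lives
    "rework_minutes_later": 35,   # project clock: integration and fixes
    "review_minutes_added": 10,   # project clock: extra review of agent output
}

net_project_minutes = (
    work["session_minutes_saved"]
    - work["rework_minutes_later"]
    - work["review_minutes_added"]
)
print(net_project_minutes)  # 5: large on the session clock, nearly neutral on the project clock
```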

Closure target
“Settled/done” means:

  • Declared subfields: segmentation rule, concurrency counting, definition of “net successful output,” counterfactual judge regime, population slice, and exclusion rules.
  • Explicit closure predicates: a replayable recipe; dispersion reported; worst-slice behavior disclosed; regime flips stated for selection, substitution, and specialization; a validation plan stated with budget.
  • A receipt schema: transcript IDs, time-window attribution outputs, compressed summaries used for judging, judge version and prompt regime, per-i estimates, identifiability flags, and sampling design.

Closure must also include budgeted validation (expanding ground-truth labels beyond a small set and comparing against an external uplift design when feasible), worst-slice and dispersion reporting rather than only a headline range, and explicit HOLD outcomes wherever counterfactuals or attribution cannot be defended.
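One way the receipt schema could be typed, as a hypothetical sketch (the field names and the admissibility rule are assumptions, not the source's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class UpliftReceipt:
    """Per-session receipt; every field name here is an illustrative assumption."""
    transcript_id: str
    segmentation_rule: str               # e.g. "user-typed-window-v1" (assumed name)
    judge_version: str                   # counterfactual judge + prompt regime
    population_slice: str                # declared slice this session belongs to
    t_no_ai_est_min: float               # judge-based counterfactual minutes
    t_with_ai_est_min: float             # minutes attributed under the segmentation rule
    identifiable: bool                   # Ident(i) flag
    exclusions: list[str] = field(default_factory=list)

    def admissible(self) -> bool:
        # A receipt supports a reported ratio only if it is identifiable
        # and both time estimates are positive.
        return (
            self.identifiable
            and self.t_no_ai_est_min > 0
            and self.t_with_ai_est_min > 0
        )

r = UpliftReceipt("t-001", "user-typed-window-v1", "judge-v3", "jan-2026-slice",
                  120.0, 20.0, True)
print(r.admissible())  # True
```

A schema like this is what makes the audit clock runnable: any reported Û can be recomputed from the receipts alone, and inadmissible receipts surface as HOLD rather than disappearing.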

References
[1] R. Figurelli, “Zero-Trust Science: A New Architecture for Scientific Closure (Beyond Peer Review),” Preprint, 2026.
[2] R. Figurelli, “Collapse Mathematics (cMth): A New Frontier in Symbolic Structural Survivability,” Preprint, 2026.
[3] R. Figurelli, “Cub³: A New Heuristic Architecture for Cross-Domain Convergence,” Preprint, 2026.
[4] R. Figurelli, “Heuristic Physics: Foundations for a Semantic and Computational Architecture of Physics,” Preprint, 2026.
[5] J. Pearl, Causality: Models, Reasoning, and Inference, 2nd ed., Cambridge Univ. Press, 2009.
[6] A. Gelman and J. Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models, Cambridge Univ. Press, 2007.
[7] J. D. Angrist and J.-S. Pischke, Mostly Harmless Econometrics, Princeton Univ. Press, 2009.
[8] D. Freedman, Statistical Models: Theory and Practice, Cambridge Univ. Press, 2009.
[9] D. Kahneman, Thinking, Fast and Slow, Farrar, Straus and Giroux, 2011.
[10] W. J. Youden, “Index for Rating Diagnostic Tests,” Cancer, 1950.

— © 2026 Rogério Figurelli. This article is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). You are free to share and adapt this material for any purpose, even commercially, provided that appropriate credit is given to the author and the source. To explore more on this and other related topics and books, visit the author’s page (Amazon).