AI Agents Reliability: Turning “Accuracy” Into a Motif Atlas

Mathine: Reliability Motif Atlas & Replay-Budget Closure Machine
Link: https://arxiv.org/html/2602.16666v1

The source argues that compressing agent behavior into a single success metric hides the operational flaws that matter in deployment. It proposes a reliability profile decomposed into four dimensions—consistency, robustness, predictability, and safety—implemented via a suite of concrete metrics, and it reports an evaluation of 14 agentic models across two benchmarks, finding that recent capability gains translate into only small reliability improvements.
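The four-dimension decomposition can be made concrete with a small sketch. This is not the paper's metric suite; the formulas below are illustrative stand-ins, assuming each run records a success flag, a success flag under perturbation, a stated confidence, and a severity label:

```python
from statistics import mean, pstdev

def reliability_profile(runs):
    """runs: list of dicts with keys 'success' (0/1),
    'perturbed_success' (0/1), 'confidence' (0..1),
    'severity' (0 = benign .. 3 = catastrophic)."""
    succ = [r["success"] for r in runs]
    # consistency: penalize run-to-run variance in outcomes
    consistency = 1.0 - pstdev(succ)
    # robustness: success rate under superficial perturbations
    robustness = mean(r["perturbed_success"] for r in runs)
    # predictability: gap between stated confidence and actual outcome
    predictability = 1.0 - mean(abs(r["confidence"] - r["success"]) for r in runs)
    # safety: driven by the worst observed severity, not the average
    safety = 1.0 - max(r["severity"] for r in runs) / 3.0
    return {"consistency": consistency, "robustness": robustness,
            "predictability": predictability, "safety": safety}
```

Note that `safety` uses a max, not a mean: one catastrophic run should dominate the score rather than be averaged away.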

This is fields-like because “agent reliability” is a regime question: which behaviors are admissible under nominal conditions, under perturbations, under uncertainty, and under harm constraints. The closure problem is that two agents with the same average task success can differ radically in what makes them governable: one may fail on a stable, diagnosable slice, while another fails unpredictably or with unbounded severity. A regime that certifies by mean accuracy cannot close on “okay-to-operate” in the presence of tail risk.
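A toy example (numbers invented for illustration, not from the paper) shows how mean accuracy erases exactly this distinction between the two agents:

```python
# Two agents with identical mean accuracy but very different tail risk.
# Each trace entry is (outcome, failure_severity).
agent_a = [("ok", 0)] * 8 + [("fail", 1)] * 2  # stable, low-severity failures
agent_b = [("ok", 0)] * 8 + [("fail", 3)] * 2  # rare but catastrophic failures

def accuracy(trace):
    return sum(outcome == "ok" for outcome, _ in trace) / len(trace)

def worst_severity(trace):
    return max(severity for _, severity in trace)

accuracy(agent_a), accuracy(agent_b)            # 0.8, 0.8: indistinguishable
worst_severity(agent_a), worst_severity(agent_b)  # 1 vs 3: radically different
```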

PAMPA treats the source as a specimen for extracting closure motifs—repeatable failure patterns that recur across domains and scaffolds. Here, the motifs are explicit: run-to-run variance, sensitivity to superficial instruction changes, miscalibrated confidence, and rare but high-severity failure modes. The goal is not to “win the benchmark,” but to build a motif library that lets teams (and auditors) recognize the same closure failure mechanics when they reappear elsewhere, even if the surface task changes.

Zero-trust, in PAMPA terms, means we do not trust a single score as evidence; we require receipts that let independent reviewers replay why reliability failed. Minimal receipts include: multiple-run traces (to measure variance), perturbation protocols (to prove robustness claims), confidence statements bound to outcomes (to test predictability), and severity labels for failures (to prevent benign/catastrophic averaging). If these receipts are missing, the correct outcome is HOLD—not because the agent is “bad,” but because the regime cannot verify what matters.
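The HOLD rule can be sketched as a gate over the receipt set. The receipt names and verdict strings below are hypothetical, chosen to mirror the four receipts listed above:

```python
# Zero-trust gate: no receipts, no verdict. Missing evidence yields HOLD,
# not a pass or a fail.
REQUIRED_RECEIPTS = {"multi_run_traces", "perturbation_protocol",
                     "confidence_outcomes", "severity_labels"}

def gate(receipts: set) -> str:
    missing = REQUIRED_RECEIPTS - receipts
    if missing:
        return "HOLD: missing " + ", ".join(sorted(missing))
    return "REVIEW"  # receipts present; reliability can now be assessed

gate({"multi_run_traces", "severity_labels"})
# -> "HOLD: missing confidence_outcomes, perturbation_protocol"
```

The point of returning HOLD rather than a score is that the gate never substitutes its own judgment for missing evidence.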

Closure mechanics become “pairwise labs”: compare two environments (e.g., structured vs open-ended tasks) and ask which motifs persist, which vanish, and what minimal receipt schema separates them.
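A pairwise lab reduces to set comparisons over observed motifs. The motif names and environments here are illustrative:

```python
# Motifs observed in two environments of a pairwise lab (example data).
structured = {"variance", "miscalibration"}
open_ended = {"variance", "brittleness", "high_severity"}

persist = structured & open_ended          # motifs that survive the regime change
only_structured = structured - open_ended  # motifs that vanish on open-ended tasks
only_open = open_ended - structured        # motifs that appear only open-ended
```

The receipt schema that separates the two environments is whatever evidence is needed to attribute each motif in `only_structured` and `only_open` to its regime.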

A simple PAMPA representation is a motif vector m = (m₁, …, mₖ), where each mᵢ flags a reliability failure class (variance, brittleness, miscalibration, high-severity breach).
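One minimal encoding of this vector, using the four failure classes named in the text as the fixed index order:

```python
# Ordered failure classes; index i of the vector corresponds to class i.
MOTIF_CLASSES = ("variance", "brittleness", "miscalibration",
                 "high_severity_breach")

def motif_vector(observed: set) -> tuple:
    """Map the set of observed failure classes to the flag vector m."""
    return tuple(int(c in observed) for c in MOTIF_CLASSES)

motif_vector({"variance", "miscalibration"})  # -> (1, 0, 1, 0)
```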

“Repair” is successful when the receipt set is sufficient to (i) detect the worst-slice motif early, and (ii) constrain the system so the same motif cannot recur unnoticed under the declared regime.

— © 2026 Rogério Figurelli. This article is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0). You are free to share and adapt this material for any purpose, even commercially, provided that appropriate credit is given to the author and the source. To explore more on this and other related topics and books, visit the author’s page (Amazon).