Reward Integrity: Deterministic, Replayable Audits of RL-Environment Reward Quality
Six model-free instruments, a difficulty signal validated against pass-rate and an independent oracle, a constructive account of difficulty via action-space decomposition, and a continuous Reward Integrity Score — unified by epiplexity.
Don't trust us — re-run the seed.
Contents
Abstract
Reinforcement learning from verifiable rewards (RLVR) makes the environment's reward — not the model — the load-bearing artifact of a training run, yet that reward ships unaudited: no third-party hack-resistance check, no reproducible difficulty standard. We present a deterministic, seed-pinned audit of reward quality — the Reward Integrity Index — built from six model-free instruments, and validate its difficulty signal against both LLM pass-rate (within-model Spearman −0.924 / −0.870) and an independent oracle (+0.872 on 90 Sokoban levels). We give a constructive account of difficulty via action-space decomposition: an LM that proposes rule-edit macros collapses a search that flat move-search cannot crack by up to 1372×. All three results are unified by epiplexityThe information a *computationally-bounded* observer can extract from a task — its structural content, minus the random entropy no bounded solver can use. — information extractable by a computationally-bounded observer — and we operationalize it as a single continuous score per environment.
1 · Introduction
Under RLVR the reward function is the specification: a policy is optimized to whatever the environment scores, so a reward satisfiable without doing the task teaches the policy to not do the task1J. Skalse, N. Howe, D. Krasheninnikov & D. Krueger. Defining and Characterizing Reward Hacking. NeurIPS 2022; arxiv.org/abs/2209.13085. — and the more capable the optimizer, the harder it exploits the gap2A. Pan, K. Bhatia & J. Steinhardt. The Effects of Reward Misspecification. ICLR 2022; arxiv.org/abs/2201.03544.. Reward-hacking, not low accuracy, is the failure mode that silently corrupts a run; over-optimizing any imperfect proxy is Goodhart's law in miniature3L. Gao, J. Schulman & J. Hilton. Scaling Laws for Reward Model Overoptimization. arxiv.org/abs/2210.10760 (2022).. The market for RL environments sells hack-resistance and calibrated difficulty as claims; we found none of them independently checked.
The escape is structural. A measurement built from deterministic instruments can certify without a conflict of interest, because the customer recomputes the result rather than trusting the auditor — “run the checker” replaces “trust me.” This paper develops the instruments, the theory that unifies them, and two new quantitative readings the theory affords: a continuous Reward Integrity Score, and a bounded-compute search-effort estimator of difficulty (in bits).
2 · The epiplexity frame
Finzi et al. introduce epiplexity as the information available to a computationally-bounded observer — structural content, excluding the random entropy a time-bounded solver cannot exploit4M. Finzi, S. Qiu, Y. Jiang, P. Izmailov, J. Z. Kolter & A. G. Wilson. From Entropy to Epiplexity. arxiv.org/abs/2601.03220 (2026).. The idea sits in an established line of bounded-compute information — resource-bounded complexity such as Bennett's logical depth5C. H. Bennett. Logical Depth and Physical Complexity. In The Universal Turing Machine: A Half-Century Survey, 227–257 (1988)., and especially *usable information under computational constraints* (V-information), which likewise lets computation create information6Y. Xu, S. Zhao, J. Song, R. Stewart & S. Ermon. A Theory of Usable Information Under Computational Constraints (predictive V-information). ICLR 2020; arxiv.org/abs/2002.10689. — a tradition epiplexity refines and that we adopt rather than claim. Two of its consequences are load-bearing here: that information can be created by computation, and that it depends on ordering. The frame unifies our three claims:
- Difficulty is bounded-compute structure. A task is hard to the degree that extracting the structure which solves it costs search; the solver simulations to first solve are an operational estimator of that cost.
- Gameability is low-epiplexity reward satisfaction. A reward is hackable exactly when a near-zero-structure output — content-free filler — clears its
pass_threshold. The degenerate-probe and exploit-search instruments search for that condition directly. - Decomposition is structure injected by computation. When an LM proposes a macro-action targeting a postcondition, it *creates* usable structure and re-orders the search, collapsing the bounded-compute cost — “information created by computation” made concrete.
We make the difficulty reading explicit. Writing s(t) for the simulations a fixed seeded solver takes to first solve task t, we measure its bounded-compute search-effort bits — an operational proxy for the epiplexity cost, not an epiplexity estimator — as
3 · Deterministic reward-quality auditing
The Index rates an environment's reward/verifier, not a model's capability. Every instrument is deterministic, model-free, derived from a fixed seed, and replayable. Grades are deliberately coarse: A clean, C one signature, F two or more or a confirmed exploit, ERR the environment did not load under our imageERR is a compatibility signal, first-class and never a silent pass.. Six instruments produce them:
- Degenerate-probe — a fixed battery of content-free completions (empty, filler, “I don't know”, prompt-echo); any scoring at or above
pass_thresholdis *accepted garbage*. The same superficial “master-key” tokens have been shown to fool even reference-based LLM judges at false-positive rates up to ~80%7Y. Zhao, et al. (Tencent AI Lab). One Token to Fool LLM-as-a-Judge — superficial “master-key” tokens elicit false-positive rewards.arxiv.org/abs/2507.08794(2025).. - Isomorphic-Perturbation Testing (IPT) — meaning-preserving cosmetic re-renderings (whitespace, casing, punctuation) must not move the score; a move is a fragility bug. It is the headline finding on LLM-judge rewards — judges that are persuadable by position, verbosity, and surface form8L. Zheng, W.-L. Chiang, Y. Sheng, et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023;
arxiv.org/abs/2306.05685.9P. Wang, L. Li, L. Chen, et al. Large Language Models are not Fair Evaluators.arxiv.org/abs/2305.17926(2023). — that a deterministic grader avoids by construction. The name IPT and its model-side form are concurrent, independent work10L. Helff, Q. Delfosse, J. Steinmann, et al. LLMs Gaming Verifiers: RLVR can Lead to Reward Hacking.arxiv.org/abs/2604.15149(2026).; we apply the same invariance principle environment-side and deterministically, in the metamorphic-testing tradition11S. Cho, A. Ruberto & V. Terragni. Metamorphic Testing of Large Language Models for NLP. ICSME 2025;arxiv.org/abs/2511.02108. Foundations: Chen, Cheung & Yiu, 1998.. - Verifier-completeness — from ground truth: a dressed-up wrong answer that scores ≥ threshold is a soundness gap; a valid equivalent form (42 vs 42.0) scored below threshold is a completeness gap. Abstains without usable ground truth.
- Exploit-search (the flagship) — a seeded search drives the reward *up* while a held-out intent check fails; a grader-accepted non-attempt is a confirmed exploit — a constructed, replayable reward-hack that forces F and ships as a sealed transcript (“gamed in N sims”)The sims count is defined by the pinned hacker spec that produced it and ranks hardness only *within* a verifier family — no normalization we tested makes it a calibrated cross-family score..
- Multi-turn tiers — scripted adversarial policies through the real rollout, plus reward property-tests (determinism, duplication, length-padding) and trajectory-perturbation checks, for stateful environments.
- Difficulty certification — search-effort sims-to-solve against an independent oracle, with a published rank-correlation.
pass_threshold — any dot at or above it is accepted garbage (red). Gameability is a probe clearing the bar.The seal is the brand. Each record is sealed by a record_sha256 over its deterministic fields (source, verifiers_version, harness_version); the only per-run-varying field, audited_at, lives in a sidecar, so the evidence is byte-identical across runsStronger than pinning a git revision: a revision records *what code ran*; the seal proves *the result itself reproduces*, bit for bit.12S. Biderman, H. Schoelkopf, L. Sutawika, et al. Lessons from the Trenches on Reproducible Evaluation of Language Models. arxiv.org/abs/2405.14782 (2024).. We do not cry wolf — instruments err toward false-negatives: the exploit intent-check leans “genuine” when unsure, verifier-completeness abstains without ground truth, and IPT skips structured gold where its text-invariance ops are not meaning-preserving. A finding is always framed precisely — *at pass_threshold X this reward accepts Y, seed-reproducible* — never “this environment is broken.”
On a live board of 20 audited artifacts, the flagship finding is a third-party single-turn environment graded F: sixteen content-free completions cleared the bar, eighteen soundness gaps fired, and exploit-search constructed a grader-accepted non-attempt in a single simulation — sealed and replayable. Clean controls grade A under the identical battery.
4 · Solver-effort difficulty certification
If difficulty is bounded-compute structure, then solver effort should track it — the same premise under compute-optimal test-time scaling, where harder problems are *allocated* more search13C. Snell, J. Lee, K. Xu & A. Kumar. Scaling LLM Test-Time Compute Optimally. arxiv.org/abs/2408.03314 (2024).. We test this two ways. First, against model behavior: on a 42-task constrained-text ladder, best-of-N samples-to-first-accept rank-correlates with within-model pass-rate at −0.924 (Qwen3-4B) and −0.870 (Qwen3-32B)Within-model: the more search effort a task needs, the lower the LLM pass-rate (Spearman ρ ≈ −0.92). — while a static constraint-count baseline correlates at the wrong sign on this ladder (+0.60 / +0.29)A property of this correlated ladder, not a general law: on a decorrelated battery constraint *count* is the strongest static predictor of LLM constrained-generation difficulty (§10) — measured effort still beats static features on both batteries.. Difficulty bands measured on the 4B transfer to the 32B (pass Spearman 0.656). Notably the prevailing difficulty signals are model-dependent — pass-rate / learnability filtering near p≈0.514S. Bae, et al. Online Difficulty Filtering for Reasoning-Oriented RL — pass-rate learnability around p≈0.5. arxiv.org/abs/2504.03380 (2025)., item-response-theory fits15H. Zhou, et al. Lost in Benchmarks? Rethinking LLM Benchmarking with Item Response Theory (PSN-IRT). AAAI 2026; arxiv.org/abs/2505.15055., and difficulty read as a model's lack of *V-usable information*16K. Ethayarajh, Y. Choi & S. Swayamdipta. Understanding Dataset Difficulty with V-Usable Information. ICML 2022 (Outstanding Paper); arxiv.org/abs/2110.08420. — whereas seeded-solver effort is model-free, which is what lets one certificate transfer across solvers rather than be re-estimated per model.
| signal | ρ vs pass-rate |
|---|---|
| solver effort · Qwen3-4B | −0.924 |
| solver effort · Qwen3-32B | −0.870 |
| static constraint-count | +0.60 / +0.29 |
Second, against ground truth where progress decomposes. On 90 reverse-generated Sokoban levels (CPU-only, every (level, seed, budget) bit-exactly replayable), seeded-MCTS sims-to-solve rank-correlates with the BFS-optimal oracle at +0.872 (moves) and +0.843 (pushes); all 90 solve, and band-median effort is monotone (easy 6.7 → medium 57.9 → hard 178.9 sims). The seeded MCTS here is the same test-time tree search that powers LLM reasoning17S. Hao, Y. Gu, H. Ma, et al. Reasoning with Language Model is Planning with World Model (RAP). EMNLP 2023; arxiv.org/abs/2305.14992.18S. Yao, D. Yu, J. Zhao, et al. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. NeurIPS 2023; arxiv.org/abs/2305.10601., repurposed to measure difficulty rather than to maximize a reward.
The instrument has an honest scope, and the scope is itself a result. A more powerful, replayable line-level tree search solves 0 of 42 creative-text tasks (Spearman 0.000, every task censored)A poem does not decompose into independently-scored lines, so partial progress carries no signal — there is nothing for tree search to climb. — yet it succeeds on Sokoban (ρ = +0.87). The refutation is a map: tree-search difficulty belongs in domains where partial progress is meaningful — games, puzzles, code, math.
5 · Action-space decomposition
If holistic structure defeats flat tree search, can an LM supply the missing decomposition? We test it on Baba Is You, a rewritable formal grammar in which NOUN IS PROPERTY sentences on the board *are* the rules19M. Charity & J. Togelius. Keke AI Competition: solving Baba Is You levels — the canonical Baba benchmark; baselines are flat primitive-move tree search. IEEE CoG 2022; arxiv.org/abs/2209.04911.. On a BFS-certified rule-assembly ladder — the only win is to push the WIN word d cells to form FLAG IS WIN — we compare a FLAT searcher over four primitive moves against a MACRO searcher over LM-proposed rule-edits:
| distance d | FLAT solve | FLAT sims | MACRO sims | ratio |
|---|---|---|---|---|
| 1 | 1.00 | 7 | 1 | 7× |
| 2 | 1.00 | 23 | 1 | 23× |
| 4 | 1.00 | 469 | 1 | 469× |
| 6 | 0.62 | 1372 | 1 | 1372× |
FLAT sims-to-solve grows super-exponentially and begins to censor at d = 6 (only 62% of seeds solve within budget — the games analog of the 0/42 creative-text dead-end)Flat primitive-move search (1372 sims) collapses to one grounded macro.; MACRO holds one simulation and 100% solve at every distance. The mechanism is exactly what epiplexity predicts: before the rule forms, the progress signal is a flat plateau ending in a one-cell needle, so the flat searcher has no value gradient and must stumble onto the precise length-d sequence by exploration; a macro that targets the postcondition FLAG IS WIN, grounded by one bounded local search, never faces the plateau. The LM injects the structure that collapses the cost — information created by computation. An offline-distilled, pure-Python student keeps the collapse with no LM and no search at inference, the cheapest arm measured (≈10.3× under flat compute at sims = 1; 0.50 held-out solve rate bounds the off-policy gap).
The ingredients have deep prior art — temporal abstraction and the options framework20R. S. Sutton, D. Precup & S. Singh. Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction (the options framework). Artificial Intelligence 112:181–211 (1999)., learned macro-operators that coarsen a search21R. E. Korf. Macro-Operators: A Weak Method for Learning. Artificial Intelligence 26:35–77 (1985)., and LLMs that propose subgoals or skills for RL22C. L. Shek & P. Tokekar. Option Discovery using LLM-guided Semantic Hierarchical RL (LDSC). arxiv.org/abs/2503.19007 (2025) — reports +55.9% average reward, no simulations-to-solve metric (sub-2× regime). — but the specific combination, an LM-proposed rule-edit macro grounded by one bounded local search that collapses a goal-directed search by up to 1372× *measured in simulations-to-solve*, is to our knowledge unattested: the closest neighbour reports only sub-2× gains on maze navigation, with no search-effort metric, and LLM-guided search more broadly tops out near ~2× without a simulations-to-solve operationalization23S. Meng, et al. LLM-A*: LLM-guided heuristic search — reports a search-cost metric but only ~2× fewer operations, no order-of-magnitude collapse. EMNLP 2024; arxiv.org/abs/2407.02511..
The most active adjacent line is program-synthesis-for-planning, where an LM synthesises reusable abstractions that a hierarchical planner then grounds — the closest relatives to our LM-proposes-the-structure / search-grounds-it design. But the nearest members abstract *spatial* move-to subroutines grounded by a learned transition model, so rule changes again arrive only as a side-effect of primitive moves, and they report sample-efficiency in tokens or reward, never a search-tree collapse24A. Ahmed, J. B. Tenenbaum, C. J. Bates & S. J. Gershman. Synthesizing World Models for Bilevel Planning (TheoryCoder). arxiv.org/abs/2503.20124 (2025); successor — Learning Abstractions for Hierarchical Planning in Program-Synthesis Agents, ICLR 2026, arxiv.org/abs/2602.00929. Abstracts spatial move-to subroutines grounded by a learned transition model; reports token-cost / solution-rate, not a search-tree metric (Baba is motivation only).; the classical macro-operator ancestor that *does* report search cost tops out near a 10× planner-time speed-up, not an order of magnitude25A. Botea, M. Enzenberger, M. Müller & J. Schaeffer. Macro-FF: Improving AI Planning with Automatically Learned Macro-Operators. JAIR 24:581–621 (2005); arxiv.org/abs/1109.2154 — classical macros prune planner search by ≤10× CPU-time, never an order of magnitude., and program-synthesis-as-policy searches a learned program embedding for high return, not for a smaller search26D. Trivedi, Y. Zhang, S.-H. Sun & J. J. Lim. Learning to Synthesize Programs as Interpretable and Generalizable Policies (LEAPS). NeurIPS 2021; arxiv.org/abs/2108.13643. Descendant HPRL (G.-T. Liu, et al., ICML 2023; arxiv.org/abs/2301.12950). Both search a learned program embedding for high return, not for fewer search nodes.. Across the whole family the reported quantity is *performance* — reward, accuracy, success, token cost — not simulations-to-solve; the metric kind itself, not just the magnitude, is where our claim lives.
The Baba domain itself is not new — it is an established rule-manipulation benchmark27D. Paglieri, et al. BALROG: Benchmarking Agentic LLM/VLM Reasoning On Games (a Baba Is AI env; flat primitive action set). ICLR 2025; arxiv.org/abs/2411.13543. — but prior agents act through flat primitive moves, where rule changes are an emergent side-effect of pushing word-tiles, never a deliberately selected rule-edit action. The nearest precedent to expose a high-level make/break rule primitive scores VLM planning accuracy on static images and runs no search at all28A. Cloos, et al. Baba Is AI — agents manipulate objects and rules; exposes high-level make/break NOUN-IS-PROPERTY primitives but scores VLM planning accuracy and runs no search. ICML 2024 wksp; arxiv.org/abs/2407.13729., so it cannot exhibit — and does not pre-empt — the search-effort collapse that is our claim. The deeper rules-as-formal-objects tradition — general game playing and its Game Description Language — treats rules as a specification the designer writes or a system repairs, not an action an agent selects mid-play29M. R. Genesereth, N. Love & B. Pell. General Game Playing: Overview of the AAAI Competition (the Game Description Language). AI Magazine 26(2):62–72 (2005); incomplete-information extension GDL-II, M. Thielscher, AAAI 2010. In the GGP/GDL tradition, rules are a formal specification the designer writes or a system repairs, not an action an agent selects mid-play.. We adopt the domain; the LM-proposed rule-edit macro and its measured collapse are the contribution.
6 · The Reward Integrity Score
A letter grade is a verdict; a measurement layer wants a number. We define the Reward Integrity Score (RIS) — a continuous reading in [0, 100], derived from the *same* signals as the grade so it is monotone with it: the grade selects a non-overlapping band, and the worst content-free margin — how far the most-accepting degenerate probe sits below (clean) or above (gamed) pass_threshold — places the row within itA confirmed exploit floors the score below the F band, lower the fewer simulations it took to construct — “gamed in 1 sim” is the worst reading the Index produces. The sims count is a coarse flooring signal under the pinned hacker spec, not a calibrated cross-family hardness number..
Because RIS is a deterministic function of the published evidence, it is itself replayable, and it sorts the board into a clean gradientReward Integrity Scores across the live board — F rows cluster low, A rows high.: the exploit-bearing F row scores in the single digits, single-signature C rows sit mid-scale, and clean A controls cluster near the top. The score is a *presentation* of the existing audit — not a new judgment — so it inherits the same false-negative bias and the same seed-reproducibility.
7 · A bounded-compute search-effort estimator
Section 2 defined Ê(t) = log₂(1 + s(t)). It is the log-time term of Levin's Kt complexity — Kt(x) being the minimum over programs of (program length + log run-time)30L. A. Levin. Universal sequential search problems. Problems of Information Transmission 9(3):265–266 (1973); textbook treatment in M. Li & P. Vitányi, An Introduction to Kolmogorov Complexity and Its Applications (Springer, 2008). — but read for one *fixed* solver rather than minimised over all programs, which makes it a measurable, solver-relative proxy for the run-time that gates bounded-compute difficulty. That distinguishes it from logical and *computational* depth, which subtract the plain Kolmogorov complexity to isolate the run-time of the *minimal* program31L. Antunes, L. Fortnow, D. van Melkebeek & N. V. Vinodchandran. Computational Depth: Concept and Applications. Theoretical Computer Science 354(3):391–404 (2006) — depth as time-bounded minus classical Kolmogorov complexity, capturing Bennett's logical depth.; a formal reduction of Ê to Kt — or to epiplexity itself — is open, and indeed the epiplexity authors list exactly these relationships (Kt, logical depth, resource-bounded complexity) among their open problemsso Ê is offered as an operational analog of the Kt time term, not a theorem; the cross-solver band-invariance thread below is the empirical face of the same open question.. Here we read it off the certified Sokoban corpus. The difficulty bands, expressed in bits of bounded-compute structure, are monotone and modest — a hard level demands only a few bits more than an easy oneeasy 6.7 → 2.94 bits · medium 57.9 → 5.88 · hard 178.9 → 7.49; the log scale is the point — effort explodes, structure accrues gently. — and the ordering is the oracle's (ρ = +0.872):
Reading difficulty in bits matters for *transfer*: a quantity that is ordinal and model-free is a candidate for a cross-solver, cross-model difficulty band, the object the calibration result (−0.92 within model, 0.656 across tiers) gestures at. Establishing band-invariance across solver families is the first open thread.
8 · Related work
Positioning and novelty. Two findings sharpen the claim. The closest concurrent instrument to ours — isomorphic-perturbation testing — is applied to a *model's output* on a single domain (Helff et al.); we run the same principle environment-side, deterministically, as one of six instruments alongside difficulty certification, action decomposition, and the epiplexity unification. And to our knowledge no public reward-quality index or integrity gate for RL environments exists: the crowdsourced environment hubs auto-publish contributions with no rating, leaderboard, or reward-hack screen, and the nearest benchmark scores *models* for exploit rates rather than auditing the environments' rewards32K. Thaman. Reward Hacking Benchmark: Measuring Exploits in LLM Agents with Tool Use — scores models, not environments. arxiv.org/abs/2605.02964 (2026).. The niche the Reward Integrity Index occupies is, as of writing, unoccupied.
Reward hacking and overoptimization. The phenomenon we audit is well characterized: reward hacking has a formal definition (Skalse et al.), it worsens with the optimizer's capability and can arrive as a phase transition (Pan et al.), and over-optimizing any imperfect proxy traces a Goodhart curve away from the true objective (Gao et al.). Those results describe the failure model-side and in aggregate; we make it an environment-side, per-row, deterministic measurement.
Invariance testing and hack detection. Perturb-and-check appears independently model-side (Helff et al.), and contrastive model-based detectors report detection rates near 63% on a strong reasoner33D. Deshpande, A. Kannappan & R. Qian (Patronus AI). Benchmarking Reward Hack Detection in Code Environments via Contrastive Analysis — the TRACE benchmark. arxiv.org/abs/2601.20103 (2026).. Our IPT is environment-side, and exploit-search instead *constructs* the hack with a seeded search and ships a deterministic transcript — every score a reproducible construction, not a probability.
Judge gaming and verifier robustness. Model-based judges are strong but gameable — persuadable by position, verbosity, and surface form (Zheng et al.; Wang et al.) — which is the headline fragility a deterministic grader sidesteps. Studies of rule- versus model-based verifiers34From Accuracy to Robustness: A Study of Rule- and Model-based Verifiers in Mathematical Reasoning. arxiv.org/abs/2505.22203 (2025). and of RL under imperfect or noisy rewards35An Imperfect Verifier is Good Enough: Learning with Noisy Rewards. arxiv.org/abs/2604.07666 (2026).36Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers. arxiv.org/abs/2510.00915 (2025). show that verifier quality bounds RLVR; our verifier-completeness instrument is the *measurement* those lines presume.
Search effort, test-time compute, and difficulty. Test-time tree search is how reasoning systems spend compute (Tree of Thoughts; RAP), and compute-optimal scaling already routes more of it to harder problems (Snell et al.); we read that same effort as a *difficulty signal* rather than a way to raise a score. Frontier small-model RL weights updates toward the p ≈ 0.5 boundary, estimated per-model by rollout37Xu et al. VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models. arxiv.org/abs/2606.16140 (2026). — a static, model-independent certificate would remove that estimation.
Decomposition and temporal abstraction. Our search-collapsing macros inherit a long lineage: automatic option discovery — including options chosen to *minimize planning time*, which is NP-hard and therefore motivates a heuristic proposer38Y. Jinnai, D. Abel, D. Hershkowitz, M. Littman & G. Konidaris. Finding Options that Minimize Planning Time. ICML 2019; arxiv.org/abs/1810.07311. — and the LLM-as-planner line that proposes skills or subgoals, from learned skill libraries39G. Wang, et al. Voyager: An Open-Ended Embodied Agent with Large Language Models — a learned skill library. NeurIPS 2023; arxiv.org/abs/2305.16291. and affordance-grounded skill selection40M. Ahn, A. Brohan, N. Brown, et al. Do As I Can, Not As I Say: Grounding Language in Robotic Affordances (SayCan). arxiv.org/abs/2204.01691 (2022). to library learning that compresses a synthesis search41G. Grand, L. Wong, et al. LILO: Learning Interpretable Libraries by Compressing and Documenting Code. NeurIPS 2023; arxiv.org/abs/2310.19791.. These operate within an RL/MDP search or grow a skill set; none reports an LM-proposed macro that collapses a goal-directed solver search by orders of magnitude — the gap §5 fills. The testbed is honest about its lineage: Baba Is You is an established rule-manipulation benchmark (§5), but its published solvers search flat primitive moves, and the nearest LLM-guided-search neighbors report only ~2× effort reductions — none in simulations-to-solve, none on an LM-proposed rule-edit.
Reproducible evaluation. The community push toward seed-controlled, reproducible evaluation (Biderman et al.) is the methodological neighbor of our byte-identical seal; and a deterministic, data-oblivious quantization substrate42A. Zandieh, M. Daliri, A. Hadian & V. Mirrokni. TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate. arxiv.org/abs/2504.19874 (2025). is the layer on which a portable, customer-verifiable distilled grader would sit.
9 · Extensions & open problems
The instruments and the two estimators above open a concrete research agenda; each item is a measurement the field currently assumes but does not have.
- From correlation to estimator. Establish whether Ê — bounded-compute search-effort bits, today an operational proxy only — admits a principled reduction to epiplexity, and test whether its difficulty band is *model-invariant* — stable across solver families (MCTS vs best-of-N vs A*), not merely solver-relative.
- A predictive theory of decomposition. Characterize the FLAT/MACRO collapse as a function of plateau length, needle width, and macro-grounding cost — predicting the ratio from a reward's gradient structure *before* running the search.
- Exploit-search as a hardness benchmark. Turn “gamed in N sims” into a standard verifier-robustness score — as rank-within-family hardness bands under a pinned, versioned hacker spec (a sims number is meaningless without its hacker); cross-family comparability would require a *validated* per-family calibration, which normalization alone does not provide. The intent-check (the false-positive guard) is what gets calibrated against human labels.
- Soundness/completeness at Hub scale. Audit verifier-completeness across the environment catalogue and test whether a reward's soundness gaps predict downstream RLVR hacking.
- Difficulty-certified curricula. Substitute a static, pre-certified band for the per-model pass-rate estimate frontier recipes compute at train time, and measure sample-efficiency.
- Byte-exact distilled graders. Ship a verifier as a single quantized file a customer re-verifies bit-for-bit — the artifact the replayability brand presumes.
10 · Limitations
We state the boundaries plainly. The constrained-text calibration uses one gold reference per form, and the sonnet grader cannot be tightened without rejecting genuine Shakespeare; the tree-search difficulty instrument is scoped to decomposable domains (the line-MCTS dead-end on holistic text is a feature of the domain, not a bug to fix). The model-adversary and real-LM macro-proposer arms are gated on managed-API and GPU budget, so several decomposition results use deterministic mock proposers; live LLM-judge fragility needs an egress path the network-free sandbox deliberately withholds. Finally, pass-rate is model behavior, not pure difficulty — which is exactly why the oracle-anchored Sokoban result carries the load.
Robustness — constraint count, not constrainedness, drives LLM difficulty. The natural companion certificate — exact Gent–Walsh constrainedness / solution *density* of a task's constraint set — is refuted as the driver of LLM constrained-generation difficulty: on a battery built to decorrelate density from constraint count (an exact-DFA regex battery plus a fixed-length rarity-spanning acrostic battery; 16 models, 8 families), constraint count recovers 67.5% of the fittable 2PL signal, output length 55%, density only 44.5% (held-out R² 0.20 / 0.16 / 0.09); at fixed count the density signal scatters around zero. The CSP phase-transition story does not transfer to LLM generation. Measured search effort, by contrast, *survives* the same count-decorrelation control (cross-family partial ρ = −0.671, negative in 30/30 pairs).
References
Inline citations above carry full detail in the margin; the consolidated list follows. arXiv identifiers are given as text — the page is self-contained and embeds no links.
- Finzi, M., Qiu, S., Jiang, Y., Izmailov, P., Kolter, J. Z. & Wilson, A. G. *From Entropy to Epiplexity: Rethinking Information for Computationally Bounded Intelligence.*
arxiv.org/abs/2601.03220(2026). - Zandieh, A., Daliri, M., Hadian, A. & Mirrokni, V. *TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate.*
arxiv.org/abs/2504.19874(2025). - Deshpande, D., Kannappan, A. & Qian, R. (Patronus AI). Benchmarking Reward Hack Detection in Code Environments via Contrastive Analysis (the TRACE benchmark).
arxiv.org/abs/2601.20103(2026). - Helff, L., Delfosse, Q., Steinmann, J., Härle, M., Shindo, H., Schramowski, P., Stammer, W., Kersting, K. & Friedrich, F. *LLMs Gaming Verifiers: RLVR can Lead to Reward Hacking.*
arxiv.org/abs/2604.15149(2026). - Xu et al. *VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models.*
arxiv.org/abs/2606.16140(2026). - *From Accuracy to Robustness: A Study of Rule- and Model-based Verifiers in Mathematical Reasoning.*
arxiv.org/abs/2505.22203(2025). - *An Imperfect Verifier is Good Enough: Learning with Noisy Rewards.*
arxiv.org/abs/2604.07666(2026). - *Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers.*
arxiv.org/abs/2510.00915(2025). - Skalse, J., Howe, N., Krasheninnikov, D. & Krueger, D. Defining and Characterizing Reward Hacking. NeurIPS 2022;
arxiv.org/abs/2209.13085. - Pan, A., Bhatia, K. & Steinhardt, J. The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models. ICLR 2022;
arxiv.org/abs/2201.03544. - Gao, L., Schulman, J. & Hilton, J. Scaling Laws for Reward Model Overoptimization.
arxiv.org/abs/2210.10760(2022). - Zheng, L., Chiang, W.-L., Sheng, Y., et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023;
arxiv.org/abs/2306.05685. - Wang, P., Li, L., Chen, L., et al. Large Language Models are not Fair Evaluators.
arxiv.org/abs/2305.17926(2023). - Snell, C., Lee, J., Xu, K. & Kumar, A. Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters.
arxiv.org/abs/2408.03314(2024). - Yao, S., Yu, D., Zhao, J., et al. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. NeurIPS 2023;
arxiv.org/abs/2305.10601. - Hao, S., Gu, Y., Ma, H., et al. Reasoning with Language Model is Planning with World Model. EMNLP 2023;
arxiv.org/abs/2305.14992. - Biderman, S., Schoelkopf, H., Sutawika, L., et al. Lessons from the Trenches on Reproducible Evaluation of Language Models.
arxiv.org/abs/2405.14782(2024). - Zhao, Y., et al. (Tencent AI Lab). One Token to Fool LLM-as-a-Judge.
arxiv.org/abs/2507.08794(2025). - Cho, S., Ruberto, A. & Terragni, V. Metamorphic Testing of Large Language Models for Natural Language Processing. ICSME 2025;
arxiv.org/abs/2511.02108. - Thaman, K. Reward Hacking Benchmark: Measuring Exploits in LLM Agents with Tool Use.
arxiv.org/abs/2605.02964(2026). - Bae, S., Hong, S., Lee, J., et al. Online Difficulty Filtering for Reasoning-Oriented Reinforcement Learning.
arxiv.org/abs/2504.03380(2025). - Zhou, H., Huang, J., Zhao, Y., et al. Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory. AAAI 2026;
arxiv.org/abs/2505.15055. - Xu, Y., Zhao, S., Song, J., Stewart, R. & Ermon, S. A Theory of Usable Information Under Computational Constraints. ICLR 2020;
arxiv.org/abs/2002.10689. - Ethayarajh, K., Choi, Y. & Swayamdipta, S. Understanding Dataset Difficulty with V-Usable Information. ICML 2022;
arxiv.org/abs/2110.08420. - Bennett, C. H. Logical Depth and Physical Complexity. In The Universal Turing Machine: A Half-Century Survey, 227–257 (1988).
- Levin, L. A. Universal sequential search problems. Problems of Information Transmission 9(3):265–266 (1973); see M. Li & P. Vitányi, An Introduction to Kolmogorov Complexity and Its Applications (Springer, 2008).
- Antunes, L., Fortnow, L., van Melkebeek, D. & Vinodchandran, N. V. Computational Depth: Concept and Applications. Theoretical Computer Science 354(3):391–404 (2006).
- Sutton, R. S., Precup, D. & Singh, S. Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning. Artificial Intelligence 112:181–211 (1999).
- Korf, R. E. Macro-Operators: A Weak Method for Learning. Artificial Intelligence 26:35–77 (1985).
- Jinnai, Y., Abel, D., Hershkowitz, D., Littman, M. & Konidaris, G. Finding Options that Minimize Planning Time. ICML 2019;
arxiv.org/abs/1810.07311. - Shek, J. & Tokekar, P. Option Discovery using LLM-guided Semantic Hierarchical Reinforcement Learning.
arxiv.org/abs/2503.19007(2025). - Wang, G., et al. Voyager: An Open-Ended Embodied Agent with Large Language Models. NeurIPS 2023;
arxiv.org/abs/2305.16291. - Ahn, M., Brohan, A., Brown, N., et al. Do As I Can, Not As I Say: Grounding Language in Robotic Affordances (SayCan).
arxiv.org/abs/2204.01691(2022). - Grand, G., Wong, L., et al. LILO: Learning Interpretable Libraries by Compressing and Documenting Code. NeurIPS 2023;
arxiv.org/abs/2310.19791. - Charity, M. & Togelius, J. Keke AI Competition: solving Baba Is You levels. IEEE CoG 2022;
arxiv.org/abs/2209.04911. - Cloos, A., et al. Baba Is AI: a benchmark where agents manipulate both objects and rules. ICML 2024 Workshop on LLMs & Cognition;
arxiv.org/abs/2407.13729. - Paglieri, D., et al. BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games. ICLR 2025;
arxiv.org/abs/2411.13543. - van Wetten, M., Plaat, A. & van Duijn, M. Baba is LLM: reasoning in a game with dynamic rules. Leiden 2025;
arxiv.org/abs/2506.19095. - Meng, S., et al. LLM-A*: large language model-enhanced incremental heuristic search on path planning. EMNLP 2024 (Findings);
arxiv.org/abs/2407.02511. - Gonzalez-Pumariega, G., Chen, S., Kedia, A. & Choudhury, S. Query-Efficient Planning with Language Models.
arxiv.org/abs/2412.06162(2024). - Ahmed, A., Tenenbaum, J. B., Bates, C. J. & Gershman, S. J. Synthesizing World Models for Bilevel Planning (TheoryCoder).
arxiv.org/abs/2503.20124(2025). - Ahmed, A., Irie, K., Tenenbaum, J. B., Bates, C. J. & Gershman, S. J. Learning Abstractions for Hierarchical Planning in Program-Synthesis Agents. ICLR 2026;
arxiv.org/abs/2602.00929. - Botea, A., Enzenberger, M., Müller, M. & Schaeffer, J. Macro-FF: Improving AI Planning with Automatically Learned Macro-Operators. Journal of Artificial Intelligence Research 24:581–621 (2005);
arxiv.org/abs/1109.2154. - Trivedi, D., Zhang, Y., Sun, S.-H. & Lim, J. J. Learning to Synthesize Programs as Interpretable and Generalizable Policies (LEAPS). NeurIPS 2021;
arxiv.org/abs/2108.13643. - Liu, G.-T., Hu, E.-P., Cheng, P.-J., Lee, H.-Y. & Sun, S.-H. Hierarchical Programmatic Reinforcement Learning via Learning to Compose Programs (HPRL). ICML 2023;
arxiv.org/abs/2301.12950. - Genesereth, M. R., Love, N. & Pell, B. General Game Playing: Overview of the AAAI Competition. AI Magazine 26(2):62–72 (2005).
- Thielscher, M. A General Game Description Language for Incomplete Information Games. AAAI 2010.
Audit your environment
Shipping a verifiers-format environment? We run this exact deterministic, model-free audit — gaming & difficulty — and hand back a sealed, replayable verifier card your buyers re-run themselves. No trust required.