starscry · technical paper

Reward Integrity: Deterministic, Replayable Audits of RL-Environment Reward Quality

Six model-free instruments, a difficulty signal validated against pass-rate and an independent oracle, a constructive account of difficulty via action-space decomposition, and a continuous Reward Integrity Score — unified by epiplexity.

Don't trust us — re-run the seed.

Contents

Abstract

Reinforcement learning from verifiable rewards (RLVR) makes the environment's reward — not the model — the load-bearing artifact of a training run, yet that reward ships unaudited: no third-party hack-resistance check, no reproducible difficulty standard. We present a deterministic, seed-pinned audit of reward quality — the Reward Integrity Index — built from six model-free instruments, and validate its difficulty signal against both LLM pass-rate (within-model Spearman −0.924 / −0.870) and an independent oracle (+0.872 on 90 Sokoban levels). We give a constructive account of difficulty via action-space decomposition: an LM that proposes rule-edit macros collapses a search that flat move-search cannot crack by up to 1372×. All three results are unified by epiplexityThe information a *computationally-bounded* observer can extract from a task — its structural content, minus the random entropy no bounded solver can use. — information extractable by a computationally-bounded observer — and we operationalize it as a single continuous score per environment.

1 · Introduction

Under RLVR the reward function is the specification: a policy is optimized to whatever the environment scores, so a reward satisfiable without doing the task teaches the policy to not do the task1J. Skalse, N. Howe, D. Krasheninnikov & D. Krueger. Defining and Characterizing Reward Hacking. NeurIPS 2022; arxiv.org/abs/2209.13085. — and the more capable the optimizer, the harder it exploits the gap2A. Pan, K. Bhatia & J. Steinhardt. The Effects of Reward Misspecification. ICLR 2022; arxiv.org/abs/2201.03544.. Reward-hacking, not low accuracy, is the failure mode that silently corrupts a run; over-optimizing any imperfect proxy is Goodhart's law in miniature3L. Gao, J. Schulman & J. Hilton. Scaling Laws for Reward Model Overoptimization. arxiv.org/abs/2210.10760 (2022).. The market for RL environments sells hack-resistance and calibrated difficulty as claims; we found none of them independently checked.

The escape is structural. A measurement built from deterministic instruments can certify without a conflict of interest, because the customer recomputes the result rather than trusting the auditor — “run the checker” replaces “trust me.” This paper develops the instruments, the theory that unifies them, and two new quantitative readings the theory affords: a continuous Reward Integrity Score, and a bounded-compute search-effort estimator of difficulty (in bits).

2 · The epiplexity frame

Finzi et al. introduce epiplexity as the information available to a computationally-bounded observer — structural content, excluding the random entropy a time-bounded solver cannot exploit4M. Finzi, S. Qiu, Y. Jiang, P. Izmailov, J. Z. Kolter & A. G. Wilson. From Entropy to Epiplexity. arxiv.org/abs/2601.03220 (2026).. The idea sits in an established line of bounded-compute information — resource-bounded complexity such as Bennett's logical depth5C. H. Bennett. Logical Depth and Physical Complexity. In The Universal Turing Machine: A Half-Century Survey, 227–257 (1988)., and especially *usable information under computational constraints* (V-information), which likewise lets computation create information6Y. Xu, S. Zhao, J. Song, R. Stewart & S. Ermon. A Theory of Usable Information Under Computational Constraints (predictive V-information). ICLR 2020; arxiv.org/abs/2002.10689. — a tradition epiplexity refines and that we adopt rather than claim. Two of its consequences are load-bearing here: that information can be created by computation, and that it depends on ordering. The frame unifies our three claims:

Ê(t) = log2(1 + s(t))an ordinal, model-free estimator of the bounded-compute structure t demands of the solver

We make the difficulty reading explicit. Writing s(t) for the simulations a fixed seeded solver takes to first solve task t, we measure its bounded-compute search-effort bits — an operational proxy for the epiplexity cost, not an epiplexity estimator — as

3 · Deterministic reward-quality auditing

The Index rates an environment's reward/verifier, not a model's capability. Every instrument is deterministic, model-free, derived from a fixed seed, and replayable. Grades are deliberately coarse: A clean, C one signature, F two or more or a confirmed exploit, ERR the environment did not load under our imageERR is a compatibility signal, first-class and never a silent pass.. Six instruments produce them:

3 of 6 accepted00.51
0.50
Illustrative content-free probe scores; drag pass_threshold — any dot at or above it is accepted garbage (red). Gameability is a probe clearing the bar.

The seal is the brand. Each record is sealed by a record_sha256 over its deterministic fields (source, verifiers_version, harness_version); the only per-run-varying field, audited_at, lives in a sidecar, so the evidence is byte-identical across runsStronger than pinning a git revision: a revision records *what code ran*; the seal proves *the result itself reproduces*, bit for bit.12S. Biderman, H. Schoelkopf, L. Sutawika, et al. Lessons from the Trenches on Reproducible Evaluation of Language Models. arxiv.org/abs/2405.14782 (2024).. We do not cry wolf — instruments err toward false-negatives: the exploit intent-check leans “genuine” when unsure, verifier-completeness abstains without ground truth, and IPT skips structured gold where its text-invariance ops are not meaning-preserving. A finding is always framed precisely — *at pass_threshold X this reward accepts Y, seed-reproducible* — never “this environment is broken.”

On a live board of 20 audited artifacts, the flagship finding is a third-party single-turn environment graded F: sixteen content-free completions cleared the bar, eighteen soundness gaps fired, and exploit-search constructed a grader-accepted non-attempt in a single simulation — sealed and replayable. Clean controls grade A under the identical battery.

4 · Solver-effort difficulty certification

If difficulty is bounded-compute structure, then solver effort should track it — the same premise under compute-optimal test-time scaling, where harder problems are *allocated* more search13C. Snell, J. Lee, K. Xu & A. Kumar. Scaling LLM Test-Time Compute Optimally. arxiv.org/abs/2408.03314 (2024).. We test this two ways. First, against model behavior: on a 42-task constrained-text ladder, best-of-N samples-to-first-accept rank-correlates with within-model pass-rate at −0.924 (Qwen3-4B) and −0.870 (Qwen3-32B)Within-model: the more search effort a task needs, the lower the LLM pass-rate (Spearman ρ ≈ −0.92). — while a static constraint-count baseline correlates at the wrong sign on this ladder (+0.60 / +0.29)A property of this correlated ladder, not a general law: on a decorrelated battery constraint *count* is the strongest static predictor of LLM constrained-generation difficulty (§10) — measured effort still beats static features on both batteries.. Difficulty bands measured on the 4B transfer to the 32B (pass Spearman 0.656). Notably the prevailing difficulty signals are model-dependent — pass-rate / learnability filtering near p≈0.514S. Bae, et al. Online Difficulty Filtering for Reasoning-Oriented RL — pass-rate learnability around p≈0.5. arxiv.org/abs/2504.03380 (2025)., item-response-theory fits15H. Zhou, et al. Lost in Benchmarks? Rethinking LLM Benchmarking with Item Response Theory (PSN-IRT). AAAI 2026; arxiv.org/abs/2505.15055., and difficulty read as a model's lack of *V-usable information*16K. Ethayarajh, Y. Choi & S. Swayamdipta. Understanding Dataset Difficulty with V-Usable Information. ICML 2022 (Outstanding Paper); arxiv.org/abs/2110.08420. — whereas seeded-solver effort is model-free, which is what lets one certificate transfer across solvers rather than be re-estimated per model.

Within-model solver-effort vs LLM pass-rate (Spearman ρ); the constraint-count baseline is wrong-signed on this ladder (see §10 — on a decorrelated battery count is the strongest static predictor).
signalρ vs pass-rate
solver effort · Qwen3-4B−0.924
solver effort · Qwen3-32B−0.870
static constraint-count+0.60 / +0.29

Second, against ground truth where progress decomposes. On 90 reverse-generated Sokoban levels (CPU-only, every (level, seed, budget) bit-exactly replayable), seeded-MCTS sims-to-solve rank-correlates with the BFS-optimal oracle at +0.872 (moves) and +0.843 (pushes); all 90 solve, and band-median effort is monotone (easy 6.7 → medium 57.9 → hard 178.9 sims). The seeded MCTS here is the same test-time tree search that powers LLM reasoning17S. Hao, Y. Gu, H. Ma, et al. Reasoning with Language Model is Planning with World Model (RAP). EMNLP 2023; arxiv.org/abs/2305.14992.18S. Yao, D. Yu, J. Zhao, et al. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. NeurIPS 2023; arxiv.org/abs/2305.10601., repurposed to measure difficulty rather than to maximize a reward.

The instrument has an honest scope, and the scope is itself a result. A more powerful, replayable line-level tree search solves 0 of 42 creative-text tasks (Spearman 0.000, every task censored)A poem does not decompose into independently-scored lines, so partial progress carries no signal — there is nothing for tree search to climb. — yet it succeeds on Sokoban (ρ = +0.87). The refutation is a map: tree-search difficulty belongs in domains where partial progress is meaningful — games, puzzles, code, math.

5 · Action-space decomposition

If holistic structure defeats flat tree search, can an LM supply the missing decomposition? We test it on Baba Is You, a rewritable formal grammar in which NOUN IS PROPERTY sentences on the board *are* the rules19M. Charity & J. Togelius. Keke AI Competition: solving Baba Is You levels — the canonical Baba benchmark; baselines are flat primitive-move tree search. IEEE CoG 2022; arxiv.org/abs/2209.04911.. On a BFS-certified rule-assembly ladder — the only win is to push the WIN word d cells to form FLAG IS WIN — we compare a FLAT searcher over four primitive moves against a MACRO searcher over LM-proposed rule-edits:

FLAT primitive-move search vs an LM-proposed MACRO, by push distance (median sims-to-solve).
distance dFLAT solveFLAT simsMACRO simsratio
11.0071
21.0023123×
41.004691469×
60.62137211372×
collapse 7×FLAT d=17MACRO1
Real certified ladder at the measured d ∈ {1,2,4,6}: FLAT primitive-move search explodes 7→23→469→1372 sims and censors at d=6, while the MACRO arm holds 1.
The live engine, embedded — press Solve · FLAT vs Solve · MACRO on a hard rung and watch the sims diverge. Open the full explorer →

FLAT sims-to-solve grows super-exponentially and begins to censor at d = 6 (only 62% of seeds solve within budget — the games analog of the 0/42 creative-text dead-end)FLATMACRO13721Flat primitive-move search (1372 sims) collapses to one grounded macro.; MACRO holds one simulation and 100% solve at every distance. The mechanism is exactly what epiplexity predicts: before the rule forms, the progress signal is a flat plateau ending in a one-cell needle, so the flat searcher has no value gradient and must stumble onto the precise length-d sequence by exploration; a macro that targets the postcondition FLAG IS WIN, grounded by one bounded local search, never faces the plateau. The LM injects the structure that collapses the cost — information created by computation. An offline-distilled, pure-Python student keeps the collapse with no LM and no search at inference, the cheapest arm measured (≈10.3× under flat compute at sims = 1; 0.50 held-out solve rate bounds the off-policy gap).

The ingredients have deep prior art — temporal abstraction and the options framework20R. S. Sutton, D. Precup & S. Singh. Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction (the options framework). Artificial Intelligence 112:181–211 (1999)., learned macro-operators that coarsen a search21R. E. Korf. Macro-Operators: A Weak Method for Learning. Artificial Intelligence 26:35–77 (1985)., and LLMs that propose subgoals or skills for RL22C. L. Shek & P. Tokekar. Option Discovery using LLM-guided Semantic Hierarchical RL (LDSC). arxiv.org/abs/2503.19007 (2025) — reports +55.9% average reward, no simulations-to-solve metric (sub-2× regime). — but the specific combination, an LM-proposed rule-edit macro grounded by one bounded local search that collapses a goal-directed search by up to 1372× *measured in simulations-to-solve*, is to our knowledge unattested: the closest neighbour reports only sub-2× gains on maze navigation, with no search-effort metric, and LLM-guided search more broadly tops out near ~2× without a simulations-to-solve operationalization23S. Meng, et al. LLM-A*: LLM-guided heuristic search — reports a search-cost metric but only ~2× fewer operations, no order-of-magnitude collapse. EMNLP 2024; arxiv.org/abs/2407.02511..

The most active adjacent line is program-synthesis-for-planning, where an LM synthesises reusable abstractions that a hierarchical planner then grounds — the closest relatives to our LM-proposes-the-structure / search-grounds-it design. But the nearest members abstract *spatial* move-to subroutines grounded by a learned transition model, so rule changes again arrive only as a side-effect of primitive moves, and they report sample-efficiency in tokens or reward, never a search-tree collapse24A. Ahmed, J. B. Tenenbaum, C. J. Bates & S. J. Gershman. Synthesizing World Models for Bilevel Planning (TheoryCoder). arxiv.org/abs/2503.20124 (2025); successor — Learning Abstractions for Hierarchical Planning in Program-Synthesis Agents, ICLR 2026, arxiv.org/abs/2602.00929. Abstracts spatial move-to subroutines grounded by a learned transition model; reports token-cost / solution-rate, not a search-tree metric (Baba is motivation only).; the classical macro-operator ancestor that *does* report search cost tops out near a 10× planner-time speed-up, not an order of magnitude25A. Botea, M. Enzenberger, M. Müller & J. Schaeffer. Macro-FF: Improving AI Planning with Automatically Learned Macro-Operators. JAIR 24:581–621 (2005); arxiv.org/abs/1109.2154 — classical macros prune planner search by ≤10× CPU-time, never an order of magnitude., and program-synthesis-as-policy searches a learned program embedding for high return, not for a smaller search26D. Trivedi, Y. Zhang, S.-H. Sun & J. J. Lim. Learning to Synthesize Programs as Interpretable and Generalizable Policies (LEAPS). NeurIPS 2021; arxiv.org/abs/2108.13643. Descendant HPRL (G.-T. Liu, et al., ICML 2023; arxiv.org/abs/2301.12950). Both search a learned program embedding for high return, not for fewer search nodes.. Across the whole family the reported quantity is *performance* — reward, accuracy, success, token cost — not simulations-to-solve; the metric kind itself, not just the magnitude, is where our claim lives.

The Baba domain itself is not new — it is an established rule-manipulation benchmark27D. Paglieri, et al. BALROG: Benchmarking Agentic LLM/VLM Reasoning On Games (a Baba Is AI env; flat primitive action set). ICLR 2025; arxiv.org/abs/2411.13543. — but prior agents act through flat primitive moves, where rule changes are an emergent side-effect of pushing word-tiles, never a deliberately selected rule-edit action. The nearest precedent to expose a high-level make/break rule primitive scores VLM planning accuracy on static images and runs no search at all28A. Cloos, et al. Baba Is AI — agents manipulate objects and rules; exposes high-level make/break NOUN-IS-PROPERTY primitives but scores VLM planning accuracy and runs no search. ICML 2024 wksp; arxiv.org/abs/2407.13729., so it cannot exhibit — and does not pre-empt — the search-effort collapse that is our claim. The deeper rules-as-formal-objects tradition — general game playing and its Game Description Language — treats rules as a specification the designer writes or a system repairs, not an action an agent selects mid-play29M. R. Genesereth, N. Love & B. Pell. General Game Playing: Overview of the AAAI Competition (the Game Description Language). AI Magazine 26(2):62–72 (2005); incomplete-information extension GDL-II, M. Thielscher, AAAI 2010. In the GGP/GDL tradition, rules are a formal specification the designer writes or a system repairs, not an action an agent selects mid-play.. We adopt the domain; the LM-proposed rule-edit macro and its measured collapse are the contribution.

6 · The Reward Integrity Score

A letter grade is a verdict; a measurement layer wants a number. We define the Reward Integrity Score (RIS) — a continuous reading in [0, 100], derived from the *same* signals as the grade so it is monotone with it: the grade selects a non-overlapping band, and the worst content-free margin — how far the most-accepting degenerate probe sits below (clean) or above (gamed) pass_threshold — places the row within itA confirmed exploit floors the score below the F band, lower the fewer simulations it took to construct — “gamed in 1 sim” is the worst reading the Index produces. The sims count is a coarse flooring signal under the pinned hacker spec, not a calibrated cross-family hardness number..

RIS = band(grade) + (ceiling − floor) · headroomA ∈ [72,100], C ∈ [45,70], F ∈ [20,44]; a confirmed exploit ∈ [2,14]
RIS 94.4 · AFCA
The Reward Integrity Score, live — the same formula the Index publishes. Signatures set the band (A/C/F); the worst margin places the row within it; a confirmed exploit floors it.

Because RIS is a deterministic function of the published evidence, it is itself replayable, and it sorts the board into a clean gradientReward Integrity Score050100Reward Integrity Scores across the live board — F rows cluster low, A rows high.: the exploit-bearing F row scores in the single digits, single-signature C rows sit mid-scale, and clean A controls cluster near the top. The score is a *presentation* of the existing audit — not a new judgment — so it inherits the same false-negative bias and the same seed-reproducibility.

7 · A bounded-compute search-effort estimator

Section 2 defined Ê(t) = log₂(1 + s(t)). It is the log-time term of Levin's Kt complexity — Kt(x) being the minimum over programs of (program length + log run-time)30L. A. Levin. Universal sequential search problems. Problems of Information Transmission 9(3):265–266 (1973); textbook treatment in M. Li & P. Vitányi, An Introduction to Kolmogorov Complexity and Its Applications (Springer, 2008). — but read for one *fixed* solver rather than minimised over all programs, which makes it a measurable, solver-relative proxy for the run-time that gates bounded-compute difficulty. That distinguishes it from logical and *computational* depth, which subtract the plain Kolmogorov complexity to isolate the run-time of the *minimal* program31L. Antunes, L. Fortnow, D. van Melkebeek & N. V. Vinodchandran. Computational Depth: Concept and Applications. Theoretical Computer Science 354(3):391–404 (2006) — depth as time-bounded minus classical Kolmogorov complexity, capturing Bennett's logical depth.; a formal reduction of Ê to Kt — or to epiplexity itself — is open, and indeed the epiplexity authors list exactly these relationships (Kt, logical depth, resource-bounded complexity) among their open problemsso Ê is offered as an operational analog of the Kt time term, not a theorem; the cross-solver band-invariance thread below is the empirical face of the same open question.. Here we read it off the certified Sokoban corpus. The difficulty bands, expressed in bits of bounded-compute structure, are monotone and modest — a hard level demands only a few bits more than an easy oneeasy 6.7 → 2.94 bits · medium 57.9 → 5.88 · hard 178.9 → 7.49; the log scale is the point — effort explodes, structure accrues gently. — and the ordering is the oracle's (ρ = +0.872):

sims 58 → Ê 5.88 bitseasymedhard
Ê = log₂(1 + sims), the exact curve. Effort explodes (log x-axis) while structure accrues gently; the three Sokoban bands sit where the oracle (ρ = +0.872) places them.

Reading difficulty in bits matters for *transfer*: a quantity that is ordinal and model-free is a candidate for a cross-solver, cross-model difficulty band, the object the calibration result (−0.92 within model, 0.656 across tiers) gestures at. Establishing band-invariance across solver families is the first open thread.

Positioning and novelty. Two findings sharpen the claim. The closest concurrent instrument to ours — isomorphic-perturbation testing — is applied to a *model's output* on a single domain (Helff et al.); we run the same principle environment-side, deterministically, as one of six instruments alongside difficulty certification, action decomposition, and the epiplexity unification. And to our knowledge no public reward-quality index or integrity gate for RL environments exists: the crowdsourced environment hubs auto-publish contributions with no rating, leaderboard, or reward-hack screen, and the nearest benchmark scores *models* for exploit rates rather than auditing the environments' rewards32K. Thaman. Reward Hacking Benchmark: Measuring Exploits in LLM Agents with Tool Use — scores models, not environments. arxiv.org/abs/2605.02964 (2026).. The niche the Reward Integrity Index occupies is, as of writing, unoccupied.

Reward hacking and overoptimization. The phenomenon we audit is well characterized: reward hacking has a formal definition (Skalse et al.), it worsens with the optimizer's capability and can arrive as a phase transition (Pan et al.), and over-optimizing any imperfect proxy traces a Goodhart curve away from the true objective (Gao et al.). Those results describe the failure model-side and in aggregate; we make it an environment-side, per-row, deterministic measurement.

Invariance testing and hack detection. Perturb-and-check appears independently model-side (Helff et al.), and contrastive model-based detectors report detection rates near 63% on a strong reasoner33D. Deshpande, A. Kannappan & R. Qian (Patronus AI). Benchmarking Reward Hack Detection in Code Environments via Contrastive Analysis — the TRACE benchmark. arxiv.org/abs/2601.20103 (2026).. Our IPT is environment-side, and exploit-search instead *constructs* the hack with a seeded search and ships a deterministic transcript — every score a reproducible construction, not a probability.

Judge gaming and verifier robustness. Model-based judges are strong but gameable — persuadable by position, verbosity, and surface form (Zheng et al.; Wang et al.) — which is the headline fragility a deterministic grader sidesteps. Studies of rule- versus model-based verifiers34From Accuracy to Robustness: A Study of Rule- and Model-based Verifiers in Mathematical Reasoning. arxiv.org/abs/2505.22203 (2025). and of RL under imperfect or noisy rewards35An Imperfect Verifier is Good Enough: Learning with Noisy Rewards. arxiv.org/abs/2604.07666 (2026).36Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers. arxiv.org/abs/2510.00915 (2025). show that verifier quality bounds RLVR; our verifier-completeness instrument is the *measurement* those lines presume.

Search effort, test-time compute, and difficulty. Test-time tree search is how reasoning systems spend compute (Tree of Thoughts; RAP), and compute-optimal scaling already routes more of it to harder problems (Snell et al.); we read that same effort as a *difficulty signal* rather than a way to raise a score. Frontier small-model RL weights updates toward the p ≈ 0.5 boundary, estimated per-model by rollout37Xu et al. VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models. arxiv.org/abs/2606.16140 (2026). — a static, model-independent certificate would remove that estimation.

Decomposition and temporal abstraction. Our search-collapsing macros inherit a long lineage: automatic option discovery — including options chosen to *minimize planning time*, which is NP-hard and therefore motivates a heuristic proposer38Y. Jinnai, D. Abel, D. Hershkowitz, M. Littman & G. Konidaris. Finding Options that Minimize Planning Time. ICML 2019; arxiv.org/abs/1810.07311. — and the LLM-as-planner line that proposes skills or subgoals, from learned skill libraries39G. Wang, et al. Voyager: An Open-Ended Embodied Agent with Large Language Models — a learned skill library. NeurIPS 2023; arxiv.org/abs/2305.16291. and affordance-grounded skill selection40M. Ahn, A. Brohan, N. Brown, et al. Do As I Can, Not As I Say: Grounding Language in Robotic Affordances (SayCan). arxiv.org/abs/2204.01691 (2022). to library learning that compresses a synthesis search41G. Grand, L. Wong, et al. LILO: Learning Interpretable Libraries by Compressing and Documenting Code. NeurIPS 2023; arxiv.org/abs/2310.19791.. These operate within an RL/MDP search or grow a skill set; none reports an LM-proposed macro that collapses a goal-directed solver search by orders of magnitude — the gap §5 fills. The testbed is honest about its lineage: Baba Is You is an established rule-manipulation benchmark (§5), but its published solvers search flat primitive moves, and the nearest LLM-guided-search neighbors report only ~2× effort reductions — none in simulations-to-solve, none on an LM-proposed rule-edit.

Reproducible evaluation. The community push toward seed-controlled, reproducible evaluation (Biderman et al.) is the methodological neighbor of our byte-identical seal; and a deterministic, data-oblivious quantization substrate42A. Zandieh, M. Daliri, A. Hadian & V. Mirrokni. TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate. arxiv.org/abs/2504.19874 (2025). is the layer on which a portable, customer-verifiable distilled grader would sit.

9 · Extensions & open problems

The instruments and the two estimators above open a concrete research agenda; each item is a measurement the field currently assumes but does not have.

10 · Limitations

We state the boundaries plainly. The constrained-text calibration uses one gold reference per form, and the sonnet grader cannot be tightened without rejecting genuine Shakespeare; the tree-search difficulty instrument is scoped to decomposable domains (the line-MCTS dead-end on holistic text is a feature of the domain, not a bug to fix). The model-adversary and real-LM macro-proposer arms are gated on managed-API and GPU budget, so several decomposition results use deterministic mock proposers; live LLM-judge fragility needs an egress path the network-free sandbox deliberately withholds. Finally, pass-rate is model behavior, not pure difficulty — which is exactly why the oracle-anchored Sokoban result carries the load.

Robustness — constraint count, not constrainedness, drives LLM difficulty. The natural companion certificate — exact Gent–Walsh constrainedness / solution *density* of a task's constraint set — is refuted as the driver of LLM constrained-generation difficulty: on a battery built to decorrelate density from constraint count (an exact-DFA regex battery plus a fixed-length rarity-spanning acrostic battery; 16 models, 8 families), constraint count recovers 67.5% of the fittable 2PL signal, output length 55%, density only 44.5% (held-out R² 0.20 / 0.16 / 0.09); at fixed count the density signal scatters around zero. The CSP phase-transition story does not transfer to LLM generation. Measured search effort, by contrast, *survives* the same count-decorrelation control (cross-family partial ρ = −0.671, negative in 30/30 pairs).

References

Inline citations above carry full detail in the margin; the consolidated list follows. arXiv identifiers are given as text — the page is self-contained and embeds no links.

Audit your environment

Shipping a verifiers-format environment? We run this exact deterministic, model-free audit — gaming & difficulty — and hand back a sealed, replayable verifier card your buyers re-run themselves. No trust required.

Audit your environment →