Run the Checker: Replayable Audits, Difficulty Certificates, and the Capability Floor

Cody Lee Walker · StarOS / stardata · 2026-07

Every empirical number in this post is seed-pinned and reproducible from a cited harness command or a sealed transcript (record_sha256). Don't trust me — re-run the seed.

Abstract

Reinforcement learning from verifiable rewards has made the reward environment, not the model, the load-bearing artifact of a training run — and the market for these environments trades on self-attestation. This post describes a deterministic, seed-pinned auditing system for that market (six instruments, a public Environment Quality Index of 19 audited environments, every row sealed by a content hash) and reports what a season of pointing those instruments at my own hypotheses produced: three refutations with named mechanisms (likelihood-based difficulty is a length meter; information-theoretic constraint density is constraint count in disguise; base-checkpoint few-shot is the wrong substrate for emergence), and two results that survived the same treatment — a capability floor between 1.7B and 4B parameters below which difficulty proxies mislead, replicated across three model families, and a solver-effort difficulty instrument that keeps its signal after the decorrelation test that killed density. The refutations are not caveats to the method; they are the method. An instrument that cannot kill its owner's thesis cannot certify anyone else's.

1. The problem

The load-bearing artifact of a modern training run has changed. RLVR made the model transient and the environment's reward function the specification — the thing the policy will optimize to the letter, and past it. A gameable reward does not show up as low accuracy on a dashboard; it shows up as a policy that confidently learned the wrong thing, discovered in a post-mortem after the compute is spent.

An economy has formed around these environments — hubs, vendors, real procurement budgets — and it trades almost entirely on self-attestation. Labs screen environment vendors on exactly two technical criteria, resistance to reward-hacking and difficulty calibration, and have an independent instrument for neither. The existing answer is model-based red-teaming, which is stochastic, non-reproducible, and itself gameable: an LLM judge auditing LLM-judge rewards inherits the conflict of interest it is supposed to referee.

The structural claim this work is built on: only deterministic measurement can certify without a conflict of interest, because a deterministic result is one the customer recomputes. A measurement is objective not because the measurer is neutral but because the full apparatus travels with the result — seed, versions, thresholds, sealed hash — so anyone can re-enact the same cut and obtain the same phenomenon, byte for byte. "Trust me" is replaced by "run the checker."

2. The apparatus

The Environment Quality Index rates an environment's reward, not a model. Six deterministic, model-free instruments score fixed or seed-derived candidate completions against the environment's own reward: a degenerate-probe (content-free completions that should never pass), isomorphic perturbation testing (meaning-preserving re-renderings that should never move a score), verifier-completeness (dressed-up wrong answers that pass are soundness gaps; valid equivalent forms that fail are completeness gaps), a seeded exploit-search (the flagship — it constructs a grader-accepted non-attempt and ships it as a replayable transcript, "gamed in N sims"), multi-turn tiers, and difficulty certification (§3). Grades: A clean · C one signature · F two-plus or a confirmed exploit · ERR the environment did not load — a compatibility signal, never a silent pass. Full definitions: docs/methodology.md.
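To make the first instrument and the grade scale concrete, here is a minimal Python sketch. It is not the stardata implementation: the probe list, the reward signature reward(prompt, completion), the two toy rewards, and the reading of "signature" as one firing instrument are assumptions for illustration. Only the pass_threshold=0.5 value and the A/C/F/ERR rule come from this post.

```python
from typing import Callable, List

# Hypothetical content-free probes; the real probe set is not reproduced here.
DEGENERATE_PROBES = ["", " ", "N/A", "I don't know.", "{prompt}"]


def degenerate_hits(reward: Callable[[str, str], float], prompt: str,
                    pass_threshold: float = 0.5) -> List[str]:
    """Return the content-free probes that the reward wrongly accepts."""
    hits = []
    for template in DEGENERATE_PROBES:
        completion = template.replace("{prompt}", prompt)  # "{prompt}" echoes the prompt back
        # Assumption: a completion "clears" the threshold when score >= threshold.
        if reward(prompt, completion) >= pass_threshold:
            hits.append(template)
    return hits


def grade(signatures: int, confirmed_exploit: bool = False, loaded: bool = True) -> str:
    """Map findings to the grade scale: A clean, C one signature, F two-plus or an exploit."""
    if not loaded:
        return "ERR"  # the environment did not load: never a silent pass
    if confirmed_exploit or signatures >= 2:
        return "F"
    return "C" if signatures == 1 else "A"


def length_reward(prompt: str, completion: str) -> float:
    """Toy gameable reward: pays for length, not correctness."""
    return min(len(completion) / 40, 1.0)


def exact_reward(prompt: str, completion: str) -> float:
    """Toy sound reward: accepts only the right answer."""
    return 1.0 if completion.strip() == "4" else 0.0


PROMPT = "What is 2 + 2? Answer with a single number."
for name, fn in [("length_reward", length_reward), ("exact_reward", exact_reward)]:
    hits = degenerate_hits(fn, PROMPT)
    print(name, hits, grade(signatures=1 if hits else 0))
```

The length-paying reward accepts the echoed prompt and the exact-match reward accepts nothing, which is the same contrast the clean controls provide on the real board.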

Each record's deterministic provenance is sealed into a record_sha256; the one per-run-varying field lives in a sidecar. The seal proves the result itself reproduces — bit-for-bit — which is stronger than pinning code to a git revision.
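One way such a seal could work is sketched below, assuming a record is a flat JSON-serializable dict. The field name wall_clock_s and the canonical-JSON encoding are hypothetical choices made for the example; the actual record schema and serialization behind record_sha256 are not specified in this post.

```python
import hashlib
import json
from typing import Dict, Tuple

# Hypothetical name for the one field that varies from run to run.
VOLATILE_FIELDS = {"wall_clock_s"}


def seal(record: Dict) -> Tuple[str, Dict]:
    """Return (sha256 hex digest of the deterministic fields, sidecar of volatile fields)."""
    stable = {k: v for k, v in record.items() if k not in VOLATILE_FIELDS}
    sidecar = {k: v for k, v in record.items() if k in VOLATILE_FIELDS}
    # Canonical form: sorted keys and fixed separators, so key order and
    # whitespace cannot change the digest.
    canonical = json.dumps(stable, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest(), sidecar


run_a = {"env": "demo_env", "seed": 0, "grade": "F", "wall_clock_s": 1.23}
run_b = {"wall_clock_s": 9.87, "grade": "F", "seed": 0, "env": "demo_env"}
print(seal(run_a)[0] == seal(run_b)[0])  # same deterministic content, same seal
```

Keeping the volatile field out of the hashed payload is what lets a second party recompute the digest from their own run and compare it directly.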

A note on what "objective" means here. Following Barad's account of measurement (Meeting the Universe Halfway, 2007), an apparatus is not a window onto a pre-existing property; it is the boundary-drawing practice that makes the property definite. "Reward quality" does not sit in the environment waiting to be read off — it is enacted by a specific battery, at a specific threshold, under a specific seed. That is not a weakness to apologize for; it is the design. The audit does not ask to be believed despite its apparatus, it ships the apparatus, and reproducibility of the whole arrangement — instrument, cut, and phenomenon together — is what objectivity amounts to. The same stance appears in bounded-information terms in the epiplexity frame [1]: what a measurement yields depends on the computation that extracts it, so the computation must travel with the claim.

The deployed instance. The live board carries 20 sealed audits across 19 environments — 2 F · 3 C · 9 A · 5 ERR · 1 difficulty certificate (skill_reward_hacking is audited both single- and multi-turn) — spanning torch-heavy closures, dataset-at-load, and both verifiers API generations. The flagship finding: tonyteo/skill_reward_hacking grades F — 16 content-free completions clear pass_threshold=0.5 (echoing the prompt back scores 12.27), the completeness probe fires 18 soundness gaps, and the exploit-search constructs a grader-accepted non-attempt in one simulation (composite 1.076), sealed and replayable (stardata audit skill_reward_hacking --row-seed 0 --exploit-search). Clean controls (gsm8k, aime2025, ascii_tree, …) grade A under the identical battery — the negative controls are the proof the instrument discriminates, and they are reported as loudly as the findings.

3. Difficulty: the second axis, and its honest scope

The other criterion labs screen on is difficulty calibration, and pass-rate banding — the default — is model behavior, not difficulty. The instrument here is solver effort: how much search a fixed, seeded solver spends to first solve the task.

On a 42-task constrained-text ladder, within-model best-of-N samples-to-first-accept rank-correlates with pass-rate at −0.924 (Qwen3-4B) and −0.870 (Qwen3-32B), n=32 (docs/research/difficulty-correlation-v2.md). On 90 reverse-generated Sokoban levels, seeded MCTS sims-to-solve correlates with the independent BFS-optimal oracle at +0.872 — no language model anywhere in the measurement, every (level, seed, budget) bit-exact (stardata difficulty; docs/research/experiments-log-2026-07.md, Wave 5).
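The two quantities being correlated here can be sketched in a few lines. This is an illustration under stated assumptions, not the harness: the "solver" is a seeded Bernoulli draw standing in for a model, the budget of 32 and the acceptance probabilities are hypothetical, and the Spearman coefficient is computed as a Pearson correlation on tie-averaged ranks.

```python
import math
import random
from typing import List, Sequence


def samples_to_first_accept(p_accept: float, seed: int, budget: int = 32) -> int:
    """Draws needed until a seeded Bernoulli 'solver' is first accepted.

    Returns budget + 1 when nothing is accepted within the budget (censored).
    """
    rng = random.Random(seed)  # pinned seed: the same call always returns the same count
    for n in range(1, budget + 1):
        if rng.random() < p_accept:
            return n
    return budget + 1


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)  # undefined if either input is constant


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    return pearson(ranks(x), ranks(y))


# Hypothetical per-task acceptance probabilities standing in for a real model.
pass_rates = [0.9, 0.6, 0.3, 0.1]
effort = [sum(samples_to_first_accept(p, s) for s in range(200)) / 200 for p in pass_rates]
print(spearman(effort, pass_rates))
```

Because every draw comes from random.Random(seed), each (task, seed, budget) triple returns the same count on every machine, which is the property the post relies on for replay.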

The scope statement matters as much as the correlations. Line-level tree search on the same creative-text tasks solved 0/42 — a poem does not decompose into independently scorable lines, and no budget crosses a ceiling the decomposition imposes. In a decomposable grammar game the same search machinery, handed LM-proposed macro-actions, collapses effort by up to 1372× (flat 7→23→469→1372 sims across an assembly ladder; macro holds 1 sim at 100% solve). Tree-search difficulty certification belongs in decomposable-progress domains — games, puzzles, code, math — and the flat-sampling variant covers holistic text (docs/research/paper.md §4–5).

4. What the instruments killed

Three hypotheses of mine went into the battery this season. None survived. Each refutation came out with a named mechanism, which is what makes the survivors in §5 worth believing.

Teacher-forced likelihood is a length meter, not a difficulty measure. The cheapest imaginable certificate — mean per-token logprob of a reference solution across a Pythia scale ladder — correlates the wrong way with everything it should predict. The mechanism check (scripts/experiments/length_confound_check.py) found logprob dominated by reference length (ρ = +0.63) while length is orthogonal to actual difficulty (ρ(len, pass) = +0.04). Bits-per-byte normalization does not rescue it. Instrument choice is decisive: item difficulty is scale-stable within an instrument (Spearman 0.957 across a 20× parameter span) and near-orthogonal across instruments (−0.076) — likelihood and production measure different things, and only production tracks difficulty (Waves 2–3, experiments-log-2026-07.md).

Constraint density is constraint count in disguise. The elegant hypothesis — that Gent–Walsh constrainedness, the exact information-theoretic solution density of a task's constraint set, predicts LLM pass-rate and locates a phase transition — looked confirmed at ρ = −0.91 on existing data. Then the decorrelation test forced density and constraint count apart (an exact-DFA regex battery where bits ⊥ count, plus a fixed-length acrostic battery where initials span rarity at fixed count, 16 models, 8 families). At fixed count, the density signal scatters around zero with mixed signs; in the IRT, count recovers 67.5% of the fittable 2PL signal, length 55%, density 44.5%. LLM constrained-generation difficulty is governed by the number of simultaneous constraints — a working-memory-shaped limit — not by solution density. The CSP phase-transition does not transfer to LLMs (docs/research/phase-transition-program-2026-07.md).
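The logic of the decorrelation test can be shown on toy data. The rows below are synthetic and constructed by hand so that pass rate depends on constraint count alone while density merely rides along with count; they are not results from the batteries described above. The sketch compares the pooled rank correlation with the correlation inside each fixed-count stratum.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)


# Synthetic (constraint_count, density, pass_rate) rows: pass depends on count only.
ROWS: List[Tuple[int, float, float]] = [
    (1, 1.0, 0.77), (1, 1.1, 0.73), (1, 1.2, 0.73), (1, 1.3, 0.77),
    (2, 2.0, 0.52), (2, 2.1, 0.48), (2, 2.2, 0.48), (2, 2.3, 0.52),
    (3, 3.0, 0.27), (3, 3.1, 0.23), (3, 3.2, 0.23), (3, 3.3, 0.27),
]


def pooled_rho(rows: List[Tuple[int, float, float]]) -> float:
    """Density vs pass rate over all rows, count ignored."""
    return spearman([d for _, d, _ in rows], [p for _, _, p in rows])


def within_count_rho(rows: List[Tuple[int, float, float]]) -> Dict[int, float]:
    """Density vs pass rate separately inside each constraint-count stratum."""
    strata = defaultdict(list)
    for count, density, pass_rate in rows:
        strata[count].append((density, pass_rate))
    return {c: spearman([d for d, _ in s], [p for _, p in s])
            for c, s in sorted(strata.items())}


print("pooled:", pooled_rho(ROWS))
print("within count:", within_count_rho(ROWS))
```

On this toy set the pooled correlation is strongly negative while every within-count correlation is zero by construction: the pattern a confounded predictor produces once the confound is held fixed.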

Base-checkpoint few-shot is the wrong substrate for emergence. The training-compute axis (does the certificate predict when in pretraining a task becomes solvable?) returned null on 8 OLMo-2 checkpoints spanning 5B→3524B tokens — and the diagnosis is the substrate, not the axis: a few-shot prompt leaks format even to a 5B-token checkpoint, flattening the training curve. The clean version of this experiment needs instruct-tuned intermediate checkpoints, which no lab currently releases (Wave 7a — also a concrete collaboration ask).

5. What survived scale

The capability floor. A cheap, model-free difficulty certificate does transfer across model scale — but only above a measurable capability floor, located between 1.7B and 4B parameters on this battery. Below the floor its correlation with model behavior is flat (−0.19 at 0.6B, −0.22 at 1.7B); at 4B it switches on (−0.46) and strengthens monotonically (−0.79 at 32B). The mechanism is concrete and replayable: a 0.6B model scores a perfect composite on a vowel-constraint task that Qwen3-32B fails — by emitting "eep eep eep eep eep" — and "writes" a sonnet as "da-DUM da-DUM da-DUM", the meter pattern, literally. Sub-floor models satisfy constraints for spurious reasons, so their difficulty rankings do not transfer (Wave 4).

The floor is not a Qwen artifact. On a 10-model, 3-family instruct ladder (Qwen3 0.6B–32B, SmolLM2/3, OLMo-2 7B/13B; sampled n=32, sealed transcripts, ~$6 of GPU), certificate validity rises with capability across all families, from +0.29 (inverted) at the 0.6B tier to between −0.55 and −0.73 for every model above the floor, and OLMo-2-7B lands on the Qwen curve at matched capability. The certificate also predicts each task's engagement threshold (the least-capable model that reaches pass ≥ 0.1): Spearman 0.42, rank-monotone. This is a computable, a-priori answer to when a small proxy model is valid: the transfer-failure phenomenon named in arXiv:2512.24503 [9], with a floor you can measure first (Wave 7b).
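The engagement threshold defined above is simple to compute once models are ordered by capability. The sketch below assumes that ordering is supplied by the caller; the pass rates are hypothetical placeholders, not values from the sealed transcripts.

```python
from typing import Dict, List, Optional


def engagement_threshold(pass_by_model: Dict[str, float],
                         capability_order: List[str],
                         min_pass: float = 0.1) -> Optional[str]:
    """Least-capable model whose pass rate reaches min_pass, or None if none does.

    capability_order must list models from least to most capable.
    """
    for model in capability_order:
        if pass_by_model.get(model, 0.0) >= min_pass:
            return model
    return None


# Model names follow the ladder in the text; the pass rates are made up.
ORDER = ["Qwen3-0.6B", "Qwen3-1.7B", "Qwen3-4B", "Qwen3-32B"]
TASK_PASS = {"Qwen3-0.6B": 0.00, "Qwen3-1.7B": 0.03, "Qwen3-4B": 0.25, "Qwen3-32B": 0.81}
print(engagement_threshold(TASK_PASS, ORDER))
```

Ranking tasks by the position of their threshold model in the capability order gives the variable that the certificate is correlated against.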

Solver effort survives the test that killed density. The D1 decorrelation bar was then applied to the surviving instrument: does effort measured on probe model A predict exact-binary pass on target model B once constraint count, structural size, and output length are partialled out? Across 30 probe→target pairs (strongest form: cross-family), the raw cross-family median is −0.827 and the partial median is −0.671, negative in 30/30 pairs with no sign flips, the opposite of density's collapse. Two honest limits ship with it. First, at n=32, text-battery effort is close to a monotone transform of the probe's own pass profile (ρ = 0.855), so it is a pinned-reference-solver instrument, not a model-free one; the model-free claim lives on the Sokoban leg. Second, the signal is carried by cross-form variation, not within-ladder resolution (docs/research/l2-decorrelation-2026-07.md, result from 2026-07-04).
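"Partialled out" can be made concrete with the textbook recursive partial-correlation formula. The post does not say which estimator was used, so treat this as one plausible reading: it operates on Pearson correlations and becomes a rank-based partial if the inputs are rank-transformed first. The vectors are small orthogonal contrasts chosen so the expected answers are known in advance.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x: Sequence[float], y: Sequence[float],
                 covariates: List[Sequence[float]]) -> float:
    """Correlation of x and y with every covariate partialled out.

    Recursion: r_xy.zW = (r_xy.W - r_xz.W * r_yz.W)
                         / sqrt((1 - r_xz.W**2) * (1 - r_yz.W**2))
    Undefined (division by zero) if a covariate is perfectly collinear with x or y.
    """
    if not covariates:
        return pearson(x, y)
    z, rest = covariates[0], covariates[1:]
    r_xy = partial_corr(x, y, rest)
    r_xz = partial_corr(x, z, rest)
    r_yz = partial_corr(y, z, rest)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Toy case: x and y are related only through the shared covariate z.
z = [-3, -1, 1, 3]
x = [-2, -2, 0, 4]   # z + [1, -1, -1, 1]
y = [-4, 2, -2, 4]   # z + [-1, 3, -3, 1]
print("raw:", pearson(x, y))
print("partial given z:", partial_corr(x, y, [z]))
```

In the toy case the raw correlation is clearly positive and the partial is zero up to rounding, which is what a collapse looks like. A signal that survives keeps its sign and most of its size after the same operation.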

6. The pre-registered next step

The confirmatory version of the scale-transfer claim — "difficulty certificates transfer above a measurable capability floor, and the floor explains why small proxies mislead" — is pre-registered and pending: frozen battery, pinned model revisions, a held-out family clause (the certificate is fit without ever seeing the family it must predict), and decision rules declared before the run (docs/research/prereg-scale-transfer-v1.md). The exploratory results above are quarantined as exploratory until then. That is the same discipline that produced the refutations in §4, pointed forward.
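The held-out family clause amounts to a leave-one-family-out split. The sketch below is one way to enumerate such splits, not the pre-registered protocol itself: the model-to-family mapping is a hypothetical subset of the ladder named in §5, and treating SmolLM2 and SmolLM3 as a single family follows the post's "3-family" count.

```python
from typing import Dict, List, Tuple


def leave_one_family_out(family_of: Dict[str, str]) -> List[Tuple[str, List[str], List[str]]]:
    """For each family, return (held_out_family, fit_models, predict_models).

    The certificate would be fit on fit_models only and scored on predict_models.
    """
    splits = []
    for family in sorted(set(family_of.values())):
        fit = sorted(m for m, f in family_of.items() if f != family)
        predict = sorted(m for m, f in family_of.items() if f == family)
        splits.append((family, fit, predict))
    return splits


# Hypothetical subset of the instruct ladder.
FAMILY_OF = {
    "Qwen3-4B": "Qwen3", "Qwen3-32B": "Qwen3",
    "OLMo-2-7B": "OLMo-2", "OLMo-2-13B": "OLMo-2",
    "SmolLM2": "SmolLM", "SmolLM3": "SmolLM",
}
for held_out, fit, predict in leave_one_family_out(FAMILY_OF):
    print(held_out, "| fit on:", fit, "| predict:", predict)
```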

7. Limitations

The constrained-text calibration has n=1 gold per form (a sonnet grader cannot be tightened without rejecting genuine Shakespeare). Tree-search difficulty is scoped to decomposable domains — the line-MCTS dead-end is documented, not hidden. The model-adversary and live-judge audit arms are gated behind managed-API/GPU access, so several results use deterministic mock proposers. Pass-rate is model behavior, not pure difficulty — which is exactly why the oracle-anchored Sokoban result and the exact-binary re-analyses carry the weight they do. And one instrument-honesty note from §5 bears repeating: the text-side effort instrument reads a pinned solver's behavior; only the game-side instrument is model-free end to end.

Related work (brief)

Helff et al. independently introduce isomorphic-perturbation testing on the model side [3]. The convergence is evidence that the invariance principle is real, and the two views compose: they measure which models hack, while this work measures which environments are hackable. TRACE [2] detects reward hacks with models, probabilistically; the exploit-search here constructs them deterministically and ships the transcript. Verifier-robustness studies [5–7] establish that verifier quality bounds RLVR; the completeness instrument is the per-environment measurement those lines presume. VibeThinker-3B [4] is the natural consumer of certified difficulty bands (boundary-weighted RL updates). The epiplexity frame [1] grounds both "effort-to-solve" and "effort-to-game" as bounded-compute structure; Barad [8] grounds what the seal is for.

References

[1] Finzi, Qiu, Jiang, Izmailov, Kolter, Wilson. From Entropy to Epiplexity. arXiv:2601.03220, 2026.
[2] Benchmarking Reward Hack Detection in Code Environments via Contrastive Analysis (TRACE). arXiv:2601.20103, 2026.
[3] Helff et al. LLMs Gaming Verifiers: RLVR can Lead to Reward Hacking. arXiv:2604.15149, 2026.
[4] Xu et al. VibeThinker-3B. arXiv:2606.16140, 2026.
[5] From Accuracy to Robustness: Rule- and Model-based Verifiers in Mathematical Reasoning. arXiv:2505.22203, 2025.
[6] An Imperfect Verifier is Good Enough. arXiv:2604.07666, 2026.
[7] RL with Verifiable yet Noisy Rewards under Imperfect Verifiers. arXiv:2510.00915, 2025.
[8] K. Barad. Meeting the Universe Halfway: Quantum Physics and the Entanglement of Matter and Meaning. Duke University Press, 2007.
[9] (proxy transfer) arXiv:2512.24503, 2025.

Audit your environment

Shipping a verifiers-format environment? We run this exact deterministic, model-free audit — gaming & difficulty — and hand back a sealed, replayable verifier card your buyers re-run themselves. No trust required.
