Don't Trust Me — Re-run the Seed
Cody Lee Walker · codylee.ca · 2026-07
There is a public reinforcement-learning environment on a model-training hub whose reward I can satisfy without doing the task it was built to test. Echo the prompt back verbatim and the grader hands back a score of 12.27, against a passing threshold of 0.5. A seeded search finds a grader-accepted non-attempt in one simulation. I did not detect this with a cleverer model. A deterministic instrument constructed the exploit, sealed the transcript with a content hash, and the whole thing re-runs on your machine byte-for-byte:
stardata audit skill_reward_hacking --row-seed 0 --exploit-search
That sentence — re-runs on your machine byte-for-byte — is the entire point of this post.
Why a gameable reward should worry you
For most of deep learning the model was the artifact you shipped and the thing you worried about. Reinforcement learning from verifiable rewards moved the risk. Now the model is transient and the environment's reward function is the specification — the exact thing the policy learns to satisfy, to the letter and then past it. A reward that can be cleared without doing the task doesn't fail loudly. It produces a policy that confidently learned the wrong thing, and you find out in a post-mortem, after the compute is spent.
Meanwhile an economy has grown up around these environments — hubs, vendors, procurement budgets — and it runs almost entirely on the honor system. Labs vet environment vendors on two things: is the reward hard to game, and is the difficulty calibrated? For neither question is there an independent instrument. The usual stopgap is to point a big model at the environment and have it try to hack the reward. That is stochastic, it doesn't reproduce, and it has a conflict of interest baked in: an LLM judge auditing LLM-judge rewards is refereeing its own kind.
The move: construct, don't detect — and ship the instrument
So I stopped trying to catch reward-hacks with a smarter model and started constructing them with a dumb, deterministic one. A seeded search drives the reward up while a held-out check confirms the output isn't a real attempt; when it succeeds, that's a confirmed exploit, and it ships as a replayable transcript — "gamed in N sims" — not as my say-so.
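To make "construct, don't detect" concrete, here is a minimal sketch of a seeded exploit search. Everything in it is an assumption made for illustration: `toy_reward` is a deliberately gameable word-overlap grader scored in [0, 1], `is_non_attempt` is a stand-in for the held-out check, and `exploit_search` is a hypothetical name. None of it is the actual instrument or the actual grader.

```python
import random

PASS_THRESHOLD = 0.5  # the passing threshold quoted in this post


def toy_reward(prompt: str, completion: str) -> float:
    """Hypothetical gameable grader: share of prompt words present in the completion."""
    prompt_words = set(prompt.lower().split())
    if not prompt_words:
        return 0.0
    found = prompt_words & set(completion.lower().split())
    return len(found) / len(prompt_words)


def is_non_attempt(prompt: str, completion: str) -> bool:
    """Hypothetical held-out check: the completion adds no word the prompt lacks."""
    new_words = set(completion.lower().split()) - set(prompt.lower().split())
    return not new_words


def exploit_search(prompt: str, seed: int, max_sims: int = 50):
    """Deterministic search for a completion the grader accepts but that is not an attempt."""
    words = prompt.split()
    if not words:
        return None
    rng = random.Random(seed)  # all randomness flows from the seed
    fixed = [prompt, "", " ".join(reversed(words))]  # cheap non-attempts tried first
    for sim in range(1, max_sims + 1):
        if sim <= len(fixed):
            completion = fixed[sim - 1]
        else:
            completion = " ".join(rng.sample(words, k=rng.randint(1, len(words))))
        reward = toy_reward(prompt, completion)
        if reward >= PASS_THRESHOLD and is_non_attempt(prompt, completion):
            return {"seed": seed, "sims": sim, "completion": completion, "reward": reward}
    return None  # no exploit found within the simulation budget
```

Against this toy grader the verbatim echo is the first candidate tried, so the search reports success in one simulation, and the same seed always returns the same transcript.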
Here's the part I actually care about. A measurement isn't trustworthy because the person who took it is neutral; it's trustworthy because you can take it again and get the same thing. So every result carries its whole apparatus with it — the seed, the software versions, the threshold, a hash that seals the record — and you re-enact the measurement rather than trusting the measurer. The instrument is the argument, and you can hold it in your hands. Don't trust me. Re-run the seed.
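A sealed record of that kind can be sketched in a few lines. This is an assumption about the shape, not the tool's real format: the field names, the `tool_version` placeholder, and the choice of SHA-256 over canonical JSON are all mine.

```python
import hashlib
import json


def seal(record: dict) -> dict:
    """Return a copy of the record with a SHA-256 digest of its canonical JSON form."""
    # sort_keys and fixed separators make the serialization byte-stable across machines
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return {**record, "sha256": digest}


def verify(sealed: dict) -> bool:
    """Recompute the digest from every field except the seal and compare."""
    body = {k: v for k, v in sealed.items() if k != "sha256"}
    return seal(body)["sha256"] == sealed.get("sha256")


# Hypothetical record; the field names and version string are placeholders.
record = {
    "environment": "skill_reward_hacking",
    "row_seed": 0,
    "threshold": 0.5,
    "reward": 12.27,
    "tool_version": "0.0.0-example",
}
sealed = seal(record)
```

Changing any field after sealing, even the threshold, makes `verify` return `False`, which is what lets a reader check the record without trusting whoever produced it.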
The board
The instruments produce one grade per environment: A for clean, C for one hack signature, F for two or more signatures (or a confirmed exploit), and ERR when the environment won't even load. ERR is a first-class "this is broken" signal that never quietly counts as a pass. The live index has 19 environments graded this way.
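The grading rule is small enough to write down. A minimal sketch, assuming the inputs are a load flag, a count of hack signatures, and a confirmed-exploit flag (the function name and signature are hypothetical):

```python
def grade(loaded: bool, hack_signatures: int, confirmed_exploit: bool = False) -> str:
    """Map audit findings to a board grade: A, C, F, or ERR."""
    if hack_signatures < 0:
        raise ValueError("hack_signatures must be non-negative")
    if not loaded:
        return "ERR"  # a broken environment is never counted as a pass
    if confirmed_exploit or hack_signatures >= 2:
        return "F"
    if hack_signatures == 1:
        return "C"
    return "A"
```

Checking `loaded` first is the design choice that matters: an environment that fails to load can never fall through to an A.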
The credibility check is the environments that come back A. The clean, well-built rewards (gsm8k, aime2025, and others) go through the identical battery and pass, and I report those as loudly as the failures. A checker that flags everything is worthless; the negative controls are the proof it discriminates. And every finding is phrased narrowly: at this threshold this reward accepts these specific completions, seed-reproducible. Never "this environment is broken." I'm not in the business of crying wolf.
If you build environments, the same instrument runs as a gate — reward-hacking caught the way a failing test is caught, before a run instead of after:
| exit code | meaning | CI action |
|---|---|---|
| 0 | clean at or above your threshold | proceed |
| 1 | a reward-hack at/above the fail grade | fail the build |
| 2 | environment not authorized | fix authorization |
| 3 | ERR — environment didn't load | fix the environment |
stardata ci my_env --fail-on F || { echo "reward-quality gate failed — not launching"; exit 1; }
Nobody gates reward-hacking during training today; it gets written up afterward. That's the gap.
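If your pipeline is driven from Python rather than shell, the exit-code table above can be encoded directly. This sketch only restates the table; the enum and function names are hypothetical, and treating an unknown code as a failure is my assumption, not documented behavior.

```python
from enum import IntEnum


class GateExit(IntEnum):
    """Exit codes from the table above."""
    CLEAN = 0
    REWARD_HACK = 1
    NOT_AUTHORIZED = 2
    ENV_ERROR = 3


CI_ACTION = {
    GateExit.CLEAN: "proceed",
    GateExit.REWARD_HACK: "fail the build",
    GateExit.NOT_AUTHORIZED: "fix authorization",
    GateExit.ENV_ERROR: "fix the environment",
}


def ci_action(exit_code: int) -> str:
    """Translate a gate exit code into the CI action from the table."""
    try:
        return CI_ACTION[GateExit(exit_code)]
    except ValueError:
        # Assumption: any code outside the table is treated as a failure.
        return "unknown exit code: treat as failure"


def may_launch(exit_code: int) -> bool:
    """Only a clean result lets the training run start."""
    return exit_code == GateExit.CLEAN
```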
The science that fell out
Difficulty calibration is the other thing labs screen on, and pass-rate is a bad proxy for it: it measures the model, not the task. So the instrument measures solver effort instead: how much search a fixed, seeded solver burns before it first solves the task. On generated Sokoban levels it tracks an independent optimal-solution oracle at a rank correlation of +0.872, with no language model in the loop at all.
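For readers who want to check a rank correlation like this on their own levels, here is a self-contained Spearman sketch. It assumes you already have two lists per level, solver effort and oracle solution length; the function names are mine, and nothing here reproduces the Sokoban data.

```python
import math


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)


def spearman(effort, oracle):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(effort), ranks(oracle))
```

Because only ranks enter the calculation, the two quantities can be on different scales (nodes expanded versus moves in the optimal solution) and the result is unaffected.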
Trying to extend that to language models is where it got interesting, because it kept failing in instructive ways. The best one: a difficulty certificate only predicts how a model behaves above a capability floor, somewhere between 1.7B and 4B parameters. Below the floor, models satisfy constraints for fake reasons. A 0.6B model scored a perfect mark on a single-vowel-only writing task that Qwen3-32B outright fails — by emitting "eep eep eep eep eep." Asked for a sonnet, it wrote "da-DUM da-DUM da-DUM" — the meter, literally, spelled out. It never engaged the task, so its sense of what's hard is noise. Measure the floor first and you know when a cheap proxy model is telling you the truth. That result now holds across three model families.
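The "eep eep eep" failure is easy to reproduce with a toy checker. This sketch assumes the constraint is "use only one vowel" and that the vowel is e; the checker and the crude engagement test (distinct-word ratio with an arbitrary 0.5 cutoff) are illustrations of mine, not the battery's actual checks.

```python
import re

VOWELS = set("aeiou")


def uses_single_vowel(text: str, vowel: str) -> bool:
    """Constraint check: the only vowel appearing in the text is the given one."""
    used = {ch for ch in text.lower() if ch in VOWELS}
    return used == {vowel}


def distinct_word_ratio(text: str) -> float:
    """Share of words that are distinct; near zero for pure repetition."""
    words = re.findall(r"[a-z']+", text.lower())
    return len(set(words)) / len(words) if words else 0.0


def engaged(text: str, min_ratio: float = 0.5) -> bool:
    """Crude held-out test: did the output do more than repeat one token?"""
    return distinct_word_ratio(text) >= min_ratio


degenerate = "eep eep eep eep eep."
genuine = "Well, the best helpers never rest when the weekend seems endless."
```

Both strings satisfy the constraint, but only the second passes the engagement test. A reward that checks the constraint alone cannot tell them apart, which is why a model below the floor can score a perfect mark.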
The graveyard is the point
I'll tell you what didn't work, because that's the part that should make you believe the parts that did.
- The cheapest difficulty certificate I could think of — a model's likelihood of a good answer — turned out to measure answer length, not difficulty. It correlated the wrong way.
- The most elegant one — the exact information-theoretic "solution density" of a task's constraints, a beautiful result from constraint-satisfaction theory — looked confirmed at a strong correlation, and then dissolved the moment I forced density apart from the mere number of constraints. It was constraint-count wearing a tuxedo. The pretty theory doesn't transfer to language models.
- The emergence story — predicting when during training a skill appears — came back null, because the checkpoints I could get leak the answer format even early in training.
Three of my favorite hypotheses, all dead, each with a named cause of death. The reason I'll stand behind the survivors is that they went through the exact same guillotine. When I ran the count-versus-density test on the solver-effort instrument this week, it kept its signal — negative in all thirty model-pairs I checked — where density had collapsed. An instrument that can't kill its owner's thesis can't certify anyone else's.
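One standard way to ask the count-versus-density question is a partial correlation: correlate the certificate with the outcome after removing what both share with constraint count. I am naming this as a common technique, not as the exact test run above, and the numbers below are toy values built so that the effect is exact.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def partial_corr(x, y, z):
    """Correlation of x and y with the linear effect of z removed from both."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    denom = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    if denom == 0:
        raise ValueError("x or y is perfectly explained by z")
    return (r_xy - r_xz * r_yz) / denom


# Toy data (made up): both series are the count plus unrelated noise.
count = [1, 1, -1, -1]      # stand-in for number of constraints (centered)
density = [2, 0, -2, 0]     # stand-in certificate: count + noise
outcome = [2, 0, 0, -2]     # stand-in measured difficulty: count + other noise

raw = pearson(density, outcome)                     # 0.5: looks like signal
controlled = partial_corr(density, outcome, count)  # ~0: it was the count all along
```

A certificate that survives this kind of control keeps a nonzero value in the second line; one that was riding on constraint count drops to zero.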
Where this goes
The index is public and refreshes as environments are audited; the confirmatory version of the capability-floor result is pre-registered — battery frozen, decision rules written down before the run, with one model family held out so the certificate has to predict a family it never saw. If you maintain an RL environment and want to know whether its reward survives contact with a deterministic adversary, that audit exists and produces something you can re-run yourself.
Which is the whole pitch, one more time: not "trust the auditor." Re-run the seed.
Audit your environment
Shipping a verifiers-format environment? We run this exact deterministic, model-free audit — gaming & difficulty — and hand back a sealed, replayable verifier card your buyers re-run themselves. No trust required.