The Reward Integrity Index
Can a reward be satisfied without doing the task? An accessible tour of how the Index answers it — the instruments, the live board, and how to use it. For the theory, the formal results, and the references, read the technical paper.
Don't trust us — re-run the seed.
The problem
Reward-hacking (earning the reward without doing the task it was meant to measure), not low accuracy, is the failure mode that silently corrupts an RL run: if a reward can be satisfied without doing the task, the policy learns to satisfy the reward. Yet the market screens environment vendors on exactly this property, and on difficulty calibration, while trading almost entirely on self-attestation.
We surveyed the field and found no independent, third-party reward-quality auditor for RL environments. The trust layer is missing precisely where buyers, sellers, and regulators all need it. starscry's Reward Integrity Index is that layer.
The Index & its instruments
The Index rates an environment's reward/verifier, not a model's capability. Every row is produced by deterministic, model-free instruments, is seed-pinned, and is sealed by a record_sha256, a hash over the audit's evidence. Change one byte of the evidence and the hash changes, so anyone can confirm the row was not edited after the fact. The grades are deliberately simple:
- A — clean: no reward-hack signature found.
- C — one reward-hack signature.
- F — two or more signatures, or a confirmed exploit.
- ERR — the env did not load or roll out under the harness image; an env-compatibility signal, never a pass.
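The grade rules above can be read as a small decision function. The following is a minimal sketch, assuming an audit is summarized by three fields; `AuditOutcome` and its field names are hypothetical and not part of the Index's published schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditOutcome:
    """Hypothetical summary of one audit (illustrative field names)."""
    loaded: bool             # did the env load and roll out under the harness image?
    signatures: int          # number of reward-hack signatures found
    confirmed_exploit: bool  # did exploit-search produce a confirmed exploit?


def grade(outcome: AuditOutcome) -> str:
    """Map an audit outcome to A / C / F / ERR per the rules listed above."""
    if not outcome.loaded:
        return "ERR"  # env-compatibility signal, never a pass
    if outcome.confirmed_exploit or outcome.signatures >= 2:
        return "F"    # two or more signatures, or a confirmed exploit
    if outcome.signatures == 1:
        return "C"    # exactly one signature
    return "A"        # clean: no signature found
```

Checking `loaded` first mirrors the rule that ERR is never a pass: an environment that did not run cannot be graded on its reward.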
How we measure
No model sits in judgement — a judge is itself gameable. The instruments are mechanical:
- Degenerate-probe — a fixed battery of content-free completions (empty, filler, "I don't know", prompt-echo, …); any that scores at or above pass_threshold is accepted garbage.
- IPT (isomorphic-perturbation testing) — cosmetic, meaning-preserving re-renderings (whitespace, casing, punctuation) must not move the score; a move is a fragility bug.
- Verifier-completeness — for exact-match rewards, probes soundness (a known-wrong answer that scores as correct) and completeness (a valid equivalent form that is rejected).
- Exploit-search — the flagship: a seeded search drives the env's own reward up while a held-out intent check fails; a grader-accepted non-attempt is a confirmed exploit and forces F, shipped as a replayable transcript ("gamed in N sims" — a count defined by the pinned hacker spec that produced it, ranking hardness only within a verifier family, never across families).
- Multi-turn tiers — scripted adversarial policies driven through the real rollout, plus reward property-tests and trajectory-perturbation checks, for stateful environments.
- Difficulty certification — search-effort sims-to-solve against an independent oracle, with a published rank-correlation.
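To make the first two instruments concrete, here is a minimal sketch of a degenerate-probe and an isomorphic-perturbation check run against toy rewards. Everything in it is an assumption for illustration: the four-item battery, the three cosmetic variants, and the two toy reward functions are hypothetical stand-ins, not the Index's actual probes or any audited environment's reward.

```python
from typing import Callable, List

# (prompt, completion) -> score; a stand-in for an environment's reward.
Reward = Callable[[str, str], float]


def degenerate_battery(prompt: str) -> List[str]:
    """Illustrative content-free completions: empty, filler, refusal, prompt-echo."""
    return ["", "the the the the", "I don't know", prompt]


def accepted_garbage(reward: Reward, prompt: str, pass_threshold: float) -> List[str]:
    """Return every content-free completion that scores at or above the threshold."""
    return [c for c in degenerate_battery(prompt) if reward(prompt, c) >= pass_threshold]


def cosmetic_variants(completion: str) -> List[str]:
    """Illustrative re-renderings: whitespace, casing, trailing punctuation."""
    return ["  " + completion + "  ", completion.upper(), completion + "."]


def ipt_moves(reward: Reward, prompt: str, completion: str, tol: float = 1e-9) -> List[str]:
    """Return the cosmetic variants whose score differs from the original's."""
    base = reward(prompt, completion)
    return [v for v in cosmetic_variants(completion)
            if abs(reward(prompt, v) - base) > tol]


def wordcount_reward(prompt: str, completion: str) -> float:
    """Toy gameable reward: passes anything with three or more words."""
    return 1.0 if len(completion.split()) >= 3 else 0.0


def strict_match_reward(prompt: str, completion: str) -> float:
    """Toy fragile reward: byte-exact match against the gold answer '4'."""
    return 1.0 if completion == "4" else 0.0
```

On the prompt "What is 2 + 2?", the word-count reward accepts the filler, the refusal, and the prompt-echo, while the strict-match reward accepts no garbage but moves under added whitespace and trailing punctuation. Both instruments only call the reward function; no model is involved.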
Replayable evidence — the brand
Independence here is structural, not promised. Because every measurement is deterministic and sealed, a reader doesn't trust us — they re-run the seed and check the bytes. The same property lets the same company eventually certify its own products without conflict: certification you can recompute is run-the-checker, not trust-me.
Each finding ships as a named artifact: a gaming-audit report, an exploit transcript, a verifier card, or a difficulty certificate. A record_sha256 seals the reproducibility of the audit — it is explicitly not an attestation of the environment's own training-data provenance. Reproduce any row with stardata audit <env> --row-seed <s>.
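A seal of this kind can be sketched as a SHA-256 digest over a canonical serialization of the evidence. The canonicalization below (sorted keys, compact separators, UTF-8 JSON) and the evidence fields in the usage note are assumptions for illustration; the Index's actual serialization is not specified here.

```python
import hashlib
import json


def record_sha256(evidence: dict) -> str:
    """Hash a canonical JSON rendering of the evidence (assumed scheme).

    Sorting keys and fixing separators makes the digest independent of
    dict insertion order, so the same evidence always yields the same seal.
    """
    canonical = json.dumps(evidence, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

For example, `record_sha256({"env": "my_env", "row_seed": 7, "grade": "A"})` returns a 64-character hex digest; changing `"A"` to `"C"` yields a different digest, which is what lets a reader detect an edited row.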
Why deterministic
Both things we measure reduce to one: bounded-compute structure, the usable signal a task or a reward holds for a solver with finite compute. This is epiplexity (Finzi et al., 2026), the lens the technical paper builds on. Difficulty is how much structure a task demands, which our search-effort sims-to-solve measures: within-model, the more search effort a problem needs, the lower the pass-rate (Spearman ρ ≈ −0.92). A reward is gameable precisely when a near-zero-structure output (content-free filler) still clears it. A deterministic instrument has no model to fool, and the same property dissolves the conflict of interest: you recompute the result rather than trust us.
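For readers who want to recompute a rank correlation between search effort and pass-rate themselves, here is a minimal pure-Python sketch of Spearman's ρ (Pearson correlation of average ranks). The function names are illustrative, and the numbers in the usage note are made-up toy values, not data from the Index.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes both inputs have nonzero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

With toy inputs such as sims-to-solve `[1, 5, 20, 100]` and pass-rates `[0.9, 0.7, 0.4, 0.1]`, the relationship is perfectly monotone decreasing, so ρ is −1. Real data with noise gives a value between −1 and 1.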
The theory, the formal estimator, and the field → the technical paper
We do not cry wolf
A fabricated reward-hack accusation is the one failure the brand cannot afford, so every instrument errs toward false negatives over false positives. A missed hack is cheaper than a fabricated accusation, so when the evidence is ambiguous, the instruments stay silent: the exploit intent-check leans "genuine" when unsure; verifier-completeness abstains without usable ground-truth; IPT skips structured (JSON) gold answers where its text-invariance ops are not meaning-preserving.
A finding is always framed precisely — at pass_threshold X this env's reward accepts Y, seed-reproducible — and never as "this env is broken." A finding that survives a skeptic re-running the seed is worth more than ten that don't.
What the live board found
The live board carries 20 audited artifacts (2 F · 5 ERR · 3 C · 9 A · 1 difficulty). Every figure below is read straight from the published rows — re-render the Index and it updates itself.
The flagship signal. skill_reward_hacking grades F: 16 content-free completions scored at or above pass_threshold, 18 verifier-soundness gaps (a wrong answer scored as correct), and a confirmed, replayable exploit, which alone forces F. The exploit ships as a sealed transcript that re-runs from its seed. It is a constructed reward-hack, not a fixed probe, which makes it the strongest evidence the Index produces.
A single-signature C. allenai_ifeval: 1 content-free completion scored at or above pass_threshold — stated precisely, at the audited pass_threshold on the audited rows, and seed-reproducible. Not a claim the env is broken.
A clean control. aime2025 grades A — it withstood the full battery (degenerate probes, invariance, and exploit-search) with no signature. Negative controls are reported as loudly as findings; a board that only ever cried wolf would be worthless.
The 5 ERR rows are an env-compatibility signal: the environment did not load or roll out under the harness image. An ERR is never a silent pass and never a quality claim.
Where difficulty comes from
Difficulty is governed by decomposability: whether progress breaks into sub-goals a search can climb, or hides behind one all-or-nothing step. On a rule-assembly puzzle a flat primitive-move search explodes (median 7 → 23 → 469 → 1372 simulations, censoring at distance 6), while an LM that proposes macro-actions collapses it to a single simulation. The LM injects the structure that removes the plateau.
▶ Open the live explorer → The mechanism, in the paper →
A CI gate for RL environments
Reward-hacking is usually found after training, in a post-mortem. stardata ci gates it during training, like a failing test — the same audit, wrapped in a CI-shaped exit-code contract, so a gameable reward is caught before it corrupts a run:
$ stardata ci my_env --fail-on F
# exit 0  reward-quality clean: proceed
# exit 1  reward-hack at/above --fail-on: fail the build
# exit 2  env not authorized for audit
# exit 3  env did not load/roll out (ERR)
The --fail-on F default is the no-cry-wolf setting: only a confirmed exploit or two-plus signatures fails the build. Every verdict ships the exact command to reproduce it.
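The exit-code contract can be sketched as a pure function from a grade to a process exit code. This is a sketch under stated assumptions, not stardata's implementation: the severity ordering A < C < F, the acceptance of C as a `--fail-on` value, and the choice to check authorization before ERR are all inferred from the contract above rather than confirmed.

```python
# Assumed severity ordering, used to interpret "at/above --fail-on".
SEVERITY = {"A": 0, "C": 1, "F": 2}


def ci_exit_code(grade: str, fail_on: str = "F", authorized: bool = True) -> int:
    """Map an audit grade to the CI exit codes listed above."""
    if not authorized:
        return 2  # env not authorized for audit
    if grade == "ERR":
        return 3  # env did not load or roll out
    if SEVERITY[grade] >= SEVERITY[fail_on]:
        return 1  # reward-hack at or above the gate: fail the build
    return 0      # reward-quality clean: proceed
```

Under the default gate, a C grade returns 0 and only an F returns 1, which matches the no-cry-wolf setting; a stricter pipeline would pass `fail_on="C"` to fail on a single signature as well.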
Responsible disclosure
Third-party findings go through a notify-then-publish tripwire: the author is notified privately with the precise, seed-reproducible finding and given roughly ten business days to fix, dispute, or acknowledge before the row publishes. A dispute lands a correction note, not a block — the seed makes every claim checkable by both sides. We seed the board with opt-in and study environments first, and earn the authority to cold-notify.
Cite this
The Index refreshes per audit batch (targeted monthly). Re-run any finding from its seed; each row is sealed by a record_sha256.
@misc{walker2026rii,
author = {Cody Lee Walker},
title = {starscry: the Reward Integrity Index --- deterministic, replayable
reward-quality audits of RL environments},
year = {2026},
howpublished = {huggingface.co/datasets/staros/environment-quality-index},
note = {Re-run any finding from its seed; each row is sealed by a record_sha256.}
}
Further reading. The full method, the formal results, the related work, and the consolidated references live in the technical paper, which is heavily cited and annotated in the margins.
Audit your environment
Shipping a verifiers-format environment? We run this exact deterministic, model-free audit — gaming & difficulty — and hand back a sealed, replayable verifier card your buyers re-run themselves. No trust required.