starscry · introduction

The Reward Integrity Index

Can a reward be satisfied without doing the task? An accessible tour of how the Index answers it — the instruments, the live board, and how to use it. For the theory, the formal results, and the references, read the technical paper.

Don't trust us — re-run the seed.

Read the technical paper →


The problem

Reward-hacking, not low accuracy, is the failure mode that silently corrupts an RL run: if a reward can be satisfied without doing the task, the policy learns to satisfy the reward. (Reward-hacking means earning the reward without doing the task it was meant to measure.) Yet the market screens environment vendors on exactly this property, and on difficulty calibration, while trading almost entirely on self-attestation.

We surveyed the field and found no independent, third-party reward-quality auditor for RL environments. The trust layer is missing precisely where buyers, sellers, and regulators all need it. starscry's Reward Integrity Index is that layer.

The Index & its instruments

The Index rates an environment's reward/verifier, not a model's capability. Every row is produced by deterministic, model-free instruments, is seed-pinned, and is sealed by a record_sha256 over its evidence. (record_sha256 is a hash over the audit's evidence: change one byte and the hash changes, so anyone can confirm the row was not edited after the fact.) The grades are deliberately simple:

How we measure

No model sits in judgement — a judge is itself gameable. The instruments are mechanical:

Replayable evidence — the brand

Independence here is structural, not promised. Because every measurement is deterministic and sealed, a reader doesn't trust us — they re-run the seed and check the bytes. The same property lets the same company eventually certify its own products without conflict: certification you can recompute is run-the-checker, not trust-me.

Each finding ships as a named artifact: a gaming-audit report, an exploit transcript, a verifier card, or a difficulty certificate. A record_sha256 seals the reproducibility of the audit — it is explicitly not an attestation of the environment's own training-data provenance. Reproduce any row with stardata audit <env> --row-seed <s>.
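The sealing idea can be sketched in a few lines. This is a minimal illustration, not the Index's actual record format: the canonical-JSON encoding, the `seal` and `verify` helpers, and every field name in the sample row are assumptions made for the example.

```python
import hashlib
import json


def seal(evidence: dict) -> str:
    """Return a SHA-256 hex digest over a canonical JSON encoding of the evidence."""
    # Assumption: sorted keys and fixed separators are used as the canonical form,
    # so the digest does not depend on dict ordering or whitespace.
    canonical = json.dumps(evidence, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify(evidence: dict, record_sha256: str) -> bool:
    """Recompute the seal and compare it with the published value."""
    return seal(evidence) == record_sha256


# Hypothetical row: these field names are illustrative only.
row = {"env": "my_env", "row_seed": 7, "pass_threshold": 0.5, "signatures": ["content_free"]}
published = seal(row)

# An untouched row verifies; changing any value breaks the seal.
tampered = dict(row, pass_threshold=0.4)
print(verify(row, published), verify(tampered, published))
```

A reader who re-runs the audit from its seed and obtains the same evidence would recompute the same digest; any edit to the evidence after publication would not.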

Why deterministic

Both things we measure reduce to one: bounded-compute structure, the usable signal a task or a reward holds for a solver with finite compute. This is epiplexity (Finzi et al., 2026), the lens the technical paper builds on. Difficulty is how much structure a task demands, which our search-effort metric, sims-to-solve, measures: within a model, the more search effort a problem needs, the lower the pass-rate (Spearman ρ ≈ −0.92). A reward is gameable precisely when a near-zero-structure output, content-free filler, still clears it. A deterministic instrument has no model to fool, and the same property dissolves the conflict of interest: you recompute the result rather than trust us.
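To make the gameability test concrete, here is a rough sketch of a degenerate probe: generate content-free filler from a seed and count how many probes a reward function scores at or above pass_threshold. The probe generator, the helper names, and both toy rewards are hypothetical and are not the Index's instruments.

```python
import random
from typing import Callable, List


def content_free_probes(seed: int, n: int = 8) -> List[str]:
    """Build n filler completions that carry no task content, reproducibly from a seed."""
    rng = random.Random(seed)
    words = ["the", "answer", "is", "yes", "ok", "done", "result"]
    probes = ["", "   "]  # empty and whitespace-only completions
    while len(probes) < n:
        length = rng.randint(1, 12)
        probes.append(" ".join(rng.choice(words) for _ in range(length)))
    return probes


def count_gamed(reward: Callable[[str], float], probes: List[str], pass_threshold: float) -> int:
    """Count probes whose reward is at or above the pass threshold."""
    return sum(1 for p in probes if reward(p) >= pass_threshold)


# Toy rewards, for illustration only.
def length_reward(completion: str) -> float:
    # Gameable: pays for word count regardless of content.
    return min(len(completion.split()) / 5, 1.0)


def exact_reward(completion: str) -> float:
    # Not gameable by filler: only the gold answer scores.
    return 1.0 if completion.strip() == "42" else 0.0


probes = content_free_probes(seed=0)
print("length_reward gamed:", count_gamed(length_reward, probes, pass_threshold=1.0))
print("exact_reward gamed:", count_gamed(exact_reward, probes, pass_threshold=1.0))
```

Because the probes depend only on the seed, two readers running the same seed score the same strings, which is what makes a count like this checkable.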

The theory, the formal estimator, and the field → the technical paper

We do not cry wolf

A fabricated reward-hack accusation is the one failure the brand cannot afford, so every instrument errs toward false negatives over false positives. A missed hack is cheaper than a fabricated accusation, so when the evidence is ambiguous, the instruments stay silent: the exploit intent-check leans "genuine" when unsure; verifier-completeness abstains without usable ground-truth; IPT skips structured (JSON) gold answers where its text-invariance ops are not meaning-preserving.

A finding is always framed precisely — at pass_threshold X this env's reward accepts Y, seed-reproducible — and never as "this env is broken." A finding that survives a skeptic re-running the seed is worth more than ten that don't.

What the live board found

The live board carries 20 audited artifacts (2 F · 5 ERR · 3 C · 9 A · 1 difficulty). Every figure below is read straight from the published rows — re-render the Index and it updates itself.

The flagship signal. skill_reward_hacking grades F: 16 content-free completions scored at or above pass_threshold, 18 verifier-soundness gaps (a wrong answer scored as correct), and a confirmed, replayable exploit (which forces F). The exploit ships as a sealed transcript that re-runs from its seed: a constructed reward-hack, not a fixed probe, which is the strongest evidence the Index produces.

A single-signature C. allenai_ifeval: 1 content-free completion scored at or above pass_threshold — stated precisely, at the audited pass_threshold on the audited rows, and seed-reproducible. Not a claim the env is broken.

A clean control. aime2025 grades A — it withstood the full battery (degenerate probes, invariance, and exploit-search) with no signature. Negative controls are reported as loudly as findings; a board that only ever cried wolf would be worthless.

5 ERR rows are an env-compatibility signal — the environment did not load or roll out under the harness image — never a silent pass and never a quality claim.
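The examples above suggest a simple grading rule, sketched below. It is inferred only from what this page states (a confirmed exploit or two-plus signatures gives F, a single signature gives C, no signature gives A, a failed load gives ERR) and is not the official rubric; the `AuditResult` type and its fields are hypothetical, and any grades the Index uses beyond these are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditResult:
    loaded: bool             # did the env load and roll out under the harness?
    signatures: int          # assumption: number of distinct gaming signatures that fired
    confirmed_exploit: bool  # did exploit-search produce a replayable exploit?


def grade(result: AuditResult) -> str:
    """Map an audit result to a grade, following the rules inferred from this page."""
    if not result.loaded:
        return "ERR"  # compatibility signal, never a pass and never a quality claim
    if result.confirmed_exploit or result.signatures >= 2:
        return "F"
    if result.signatures == 1:
        return "C"
    return "A"


print(grade(AuditResult(loaded=True, signatures=1, confirmed_exploit=False)))
```

Checking the load failure first keeps an ERR row from ever being read as a clean result.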

Where difficulty comes from

Difficulty is governed by decomposability: whether progress breaks into sub-goals a search can climb, or hides behind one all-or-nothing step. On a rule-assembly puzzle a flat primitive-move search explodes (median 7 → 23 → 469 → 1372 simulations, censoring at distance 6) while an LM that proposes macro-actions collapses it to a single simulation. The LM injects the structure that removes the plateau:

Figure: FLAT primitive-move search vs an LM-proposed MACRO, by push distance (median sims-to-solve, log scale).

FLAT d=1: 7
FLAT d=2: 23
FLAT d=4: 469
FLAT d=6: 1372
MACRO: 1 sim · 100%

Decomposition collapses the search 1372× and removes the censoring at d=6.
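The arithmetic behind these figures can be sketched as follows. The medians are the ones reported on this page; the `summarize` helper, its treatment of censored trials (counted at the budget), and the sample trial list are assumptions for illustration and may differ from the estimator in the technical paper.

```python
from statistics import median
from typing import List, Optional, Tuple

# Median sims-to-solve by push distance, as reported on this page.
FLAT_MEDIANS = {1: 7, 2: 23, 4: 469, 6: 1372}
MACRO_MEDIAN = 1


def summarize(sims: List[Optional[int]], budget: int) -> Tuple[float, float]:
    """Return (median sims-to-solve, censored fraction) for one setting.

    A trial recorded as None did not solve within the budget. Assumption: such
    trials are counted at the budget, so the median is a lower bound when many
    trials are censored.
    """
    filled = [budget if s is None else s for s in sims]
    censored = sum(s is None for s in sims) / len(sims)
    return median(filled), censored


# How much the flat search grows between reported distances.
distances = sorted(FLAT_MEDIANS)
for near, far in zip(distances, distances[1:]):
    print(f"d={near} -> d={far}: x{FLAT_MEDIANS[far] / FLAT_MEDIANS[near]:.1f}")

# Collapse factor at the largest distance when the LM proposes a macro-action.
collapse = FLAT_MEDIANS[6] / MACRO_MEDIAN
print(f"collapse at d=6: {collapse:.0f}x")

# Hypothetical trial data, only to show the censoring bookkeeping.
print(summarize([3, 5, None, 9, None], budget=10))
```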

▶ Open the live explorer → The mechanism, in the paper →

A CI gate for RL environments

Reward-hacking is usually found after training, in a post-mortem. stardata ci gates it during training, like a failing test — the same audit, wrapped in a CI-shaped exit-code contract, so a gameable reward is caught before it corrupts a run:

$ stardata ci my_env --fail-on F
#  exit 0  reward-quality clean — proceed
#  exit 1  reward-hack at/above --fail-on — fail the build
#  exit 2  env not authorized for audit
#  exit 3  env did not load/roll out (ERR)

The --fail-on F default is the no-cry-wolf setting: only a confirmed exploit or two-plus signatures fails the build. Every verdict ships the exact command to reproduce it.
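For pipelines driven from Python, the exit-code contract above can be wrapped in a small helper. This is a sketch under assumptions: it uses only the command and exit codes listed above, assumes a stardata executable is on PATH, and the names `interpret_exit` and `run_gate` are hypothetical.

```python
import subprocess
import sys

# Exit codes exactly as documented for `stardata ci`.
EXIT_MEANINGS = {
    0: "reward-quality clean, proceed",
    1: "reward-hack at/above --fail-on, fail the build",
    2: "env not authorized for audit",
    3: "env did not load/roll out (ERR)",
}


def interpret_exit(code: int) -> str:
    """Translate an exit code into its documented meaning."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_gate(env: str, fail_on: str = "F") -> int:
    """Run `stardata ci <env> --fail-on <grade>` and return its exit code."""
    proc = subprocess.run(["stardata", "ci", env, "--fail-on", fail_on])
    print(f"stardata ci {env}: {interpret_exit(proc.returncode)}", file=sys.stderr)
    return proc.returncode
```

A pipeline step could call `sys.exit(run_gate("my_env"))` so that any nonzero code fails the build while the log still distinguishes a reward-hack (1) from an authorization (2) or compatibility (3) problem.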

Responsible disclosure

Third-party findings go through a notify-then-publish tripwire: the author is notified privately with the precise, seed-reproducible finding and given roughly ten business days to fix, dispute, or acknowledge before the row publishes. A dispute lands a correction note, not a block — the seed makes every claim checkable by both sides. We seed the board with opt-in and study environments first, and earn the authority to cold-notify.

Cite this

The Index refreshes per audit batch (targeted monthly). Re-run any finding from its seed; each row is sealed by a record_sha256.

@misc{walker2026rii,
  author       = {Cody Lee Walker},
  title        = {starscry: the Reward Integrity Index --- deterministic, replayable
                  reward-quality audits of RL environments},
  year         = {2026},
  howpublished = {huggingface.co/datasets/staros/environment-quality-index},
  note         = {Re-run any finding from its seed; each row is sealed by a record_sha256.}
}


Further reading. The full method, the formal results, the related work, and the consolidated references live in the technical paper — heavily cited, annotated in the margins.

Audit your environment

Shipping a verifiers-format environment? We run this exact deterministic, model-free audit — gaming & difficulty — and hand back a sealed, replayable verifier card your buyers re-run themselves. No trust required.

Audit your environment →