Annotated research prototypes
The starscry paper, reproduced as runnable code — and now extended. Thirty-nine hermetic, deterministic Rust prototypes: the first thirteen reproduce the paper end-to-end (the six audit instruments, the Index, the ExiT story, the epiplexity theory), each asserting one claim; passes fourteen–twenty-six extend it with open-agenda + verified-2026-SOTA results, a twenty-seventh capstone composes them into three through-lines, and passes twenty-eight–thirty-nine extend the agenda further.
Each pass also distils its durable knowledge into atomic spaced-repetition cards (the StarLore vault, kept as course material). The loop: pick a paper claim → minimal hermetic repro (reusing a shared *-core crate where one exists) → assert the claim and print measured-vs-paper → distil cross-linked cards.
| # | Claim | Measured (vs paper) |
|---|---|---|
| 1 | Action-space decomposition collapses search | MACRO ~1 sim vs FLAT 15→941; 5.1× fewer engine-steps @ d4 |
| 2 | Sims-to-solve certifies difficulty (vs BFS oracle) | Spearman ρ=+0.67 (paper +0.87); bands 7→58→74 sims, monotone |
| 3 | Reward integrity is gradeable; exploit-search games it | RIS A 100 > C 70 > F 6; F gamed in 3 sims, replayable |
| 4 | Isomorphic Perturbation Testing — reward invariance | stable A · fragile C (IPT-only) · degenerate F; IPT ⊥ exploit |
| 5 | Multi-turn process rewards game on fake tool-calls | claim-without-tool → C, caught only by the trajectory-mutation probe |
| 6 | The Index — instruments compose into a graded board | 2A · 2C · 2F · 1ERR · 1difficulty, RIS-ranked, replayable cards |
| 7 | Search distills into a sims=1 student (ExiT) | held-out solved at sims=1, ~300× fewer engine-steps than FLAT |
| 8 | Bounded-compute search-effort bits — Ê=log2(1+sims) unifies the instruments | one bits-meter; Baba d6 collapse 10.39→1.00 ≈ paper 10.42→1.00 |
| 9 | A distilled student discriminates mechanic-classes | break_hazard ≻ break_other; solves multi-edit where FLAT censors |
| 10 | The feasibility check can't be distilled (a proven limit) | cheap state-flags don't beat the residual; reachability ≠ rule-flag |
| 11 | …nor bounded (a proven limit) | capping the check solves fewer + costs more; far-at-start, near-at-end |
| 12 | Verifier-completeness — four verdicts | clean / unsound / over-strict (a flag) / abstain; soundness ≠ completeness |
| 13 | Constrained-text difficulty (the one model-dependent result) | samples-to-first vs pass-rate ρ=−0.95 (paper −0.92/−0.87), via a mock sampler |
| 14 | Cross-solver rank-transfer — first extension beyond the paper | MCTS↔A* ρ=+0.68; A*↔oracle +0.99; rank transfers, magnitude is solver-relative |
| 15 | Difficulty certificate with a confidence level (π̄ + BAI) | root-Q readout 19.4× lower-variance, EB-BAI δ-PAC band, median 2 seeds |
| 16 | Sharpened exploit-search — coverage, not 'gamed in N sims' | MAP-Elites 80 distinct exploits vs reward-greedy 16 (5.0×) at equal budget |
| 17 | Multi-turn PRM attacks — fluency-detector mutations | length/step-biased PRMs gamed by inflate/inject; sound robust; mutations diagnostic |
| 18 | Three-way cross-solver rank-transfer (incl. a weak solver) | MCTS/A*/GBFS all pairwise ρ>0 (min 0.585); rank survives a suboptimal solver |
| 19 | Decomposition-collapse predictor (B^L/W) | FLAT effort = B^L/W (slope 1.10, R²0.98); predicts held-out + a 2nd domain ρ=0.93 |
| 20 | The predictor validated on the REAL engine | synthetic B^L law transfers to baba-core: ln(FLAT)=0.62·feat, R²0.89, ρ0.90; MACRO collapses 273× |
| 21 | Exploit-search as a verifier-robustness distribution | sims-to-game ranks soundness ρ=+0.99, recovers hidden order, Hacker-invariant (QD↔blind +1.00) |
| 22 | Epiplexity paradoxes reproduced via prequential coding, hermetically (the principle, not the NN estimator) | a model-free learner reproduces 3 paradoxes at equal Shannon H0: PRNG epip≈0 vs structured 11195 bits |
| 23 | Robustness benchmark on the REAL repo graders | sims-to-game ranks the actual RIS A/C/F graders ρ=−0.97 vs hole-size, recovering RIS A>C>F |
| 24 | Adversarial PRM hardening — the audit→fix loop | training hardens a biased PRM: gameability 4/4→0/4 while completeness 1/3→3/3; converges to sound |
| 25 | ReSCALE — Gumbel + Sequential-Halving difficulty instrument | the 2026 SOTA root allocator sharpens oracle ρ +0.74→+0.90; hard band 34% fewer sims |
| 26 | Rank-transfer breakdown — the boundary of solver-invariance | with search ρ≥0.79 across competence; without search a walker barely certifies (ρ0.47) — the boundary is SEARCH |
| 27 | The extensions index — a capstone (composes 14–26) | 13 extensions → 3 through-lines, asserted together: rank-is-the-certificate · B^L/W hitting law · audit→fix |
| 28 | Budget-adaptive judge allocation (variance-adaptive, 2602.15481) | Neyman split of a FIXED budget cuts score-variance Σσ²/B 1.39× at equal B (rank kept ρ+0.84→+0.84), uniform's precision at 80% of trials; same allocator sharpens robustness 1.79× — one allocator, two instruments |
| 29 | Planning-grounded step PRM (2604.17957) | BFS-oracle step labels train a PRM that beats a proxy on held-out (79% vs 75%), keeps completeness (72%), and accepts only 20% of the productive-looking detours the proxy swallows at 100% — a verifier from a planner, not a fluency proxy |
| 30 | One law, fit-on-A-predict-B (Threads 1+2) | the B^L/W law fit on difficulty predicts verifier-robustness — a blind Hacker transfers rank ρ+0.87 AND magnitude R²0.79 (one law); a greedy Hacker's exponent collapses → magnitude breaks (R²−123) while rank degrades only gracefully (ρ0.48): rank is the more-robust certificate |
| 31 | Refine meta-environment: allocation under a bounded-compute structure-bits certificate (cf. 2601.03220) | the pass-28 Neyman allocator generalizes from trials/variance to capacity/structure-bits (the prequential net-bits signal formerly styled 'epiplexity' — NOT the 2601.03220 NN-trained estimator; renamed bounded-compute search-effort bits): ∝-structure-bits lowers Σs²/c 1.38× (guaranteed), and its Sokoban difficulty rank is a coarse BAND-resolution result (ρ+0.76 over 9 band×seed cells — per-level it fails the length control, ρ(Ê,moves)=−0.40; see pass31-circularity-2026-07); on a controlled grid a real learner confirms structure-bits—not flat Shannon entropy—predicts where capacity pays off, water-filling cutting unextracted structure 1.34× below uniform — one allocator, two instruments |
| 32 | Lexicon difficulty certificate — the new-domain proof (β morpheme-grammar) | the pass-2/28 capped-UCT effort meter, re-pointed at the new Lexicon engine (lexicon-core), rank-correlates with the BFS derivation oracle's cert_len (spearman +0.96; effort 1→4→1902 vs cert 3→8→15) — the difficulty rank transfers to the β morpheme/typed-grammar game, so pass 31's search-effort certificate has a real new domain to map |
| 33 | Multi-scale structure-bits + per-scale head allocation (2410.11842) | a Lexicon solution path windows into 3 scales (spell/grammar/plan) carrying 509–2121× different structure-bits (a single-scale domain like Sokoban cannot); a scale-aware head allocation lowers Σs²/c 1.71× below a scale-blind split (guaranteed Neyman nesting) by starving the near-null plan scale — the new game earns its multi-scale keep |
| 34 | Autocurriculum — structure-bits as learnability (2010.03934) | Prioritized Level Replay with prequential structure-bits (unextracted structure) as the learnability score front-loads competence: in a redundant-heavy pool a count-model learner reaches competence faster than uniform (+0.19 early, past the stardata 0.12 gate; 1.10× AUC) and beats a fixed easy→hard schedule — closing the refine loop so ONE signal drives capacity AND problems [gated: the claim stands only against a TUNED regret-PLR baseline with matched hyperparameter budget — pending] |
| 35 | Real Mixture-of-Heads grounds the scale-aware allocation (2410.11842) | a REAL MoH attention layer trained by autodiff (the first Burn starscry pass) learns the multi-scale task (loss 0.35→0.0002) and beats a scale-blind single-Linear baseline 1231× — and its router, with no routing supervision, rediscovers pass 33's hand-computed allocation: structured scales → distinct heads, the null scale least decisive. Gradient descent finds allocation-under-a-structure-bits-certificate |
| 36 | Semantic grammar-space: the learnability frontier (the crux, pivoted) | the literal stream-epiplexity inverted-U FAILED → relocated to learnability: a bounded solver's potential 4·p·(1−p) is inverted-U in certified difficulty (interior peak at p≈0.5, 1.23× the ends), solve-rate↔cert_len ρ=−1.00, a stronger solver shifts the peak +2.2 right (a capacity effect); structural_depth turned out orthogonal to difficulty — cert_len is the axis (2010.03934 · 2408.15099) |
| 37 | The stream-epiplexity NULL — an asserted negative | the obvious instrument is REFUTED: throttled stream-epiplexity is monotone not inverted-U (interior-peak ratio 0.98, max at an END); what rises with structural_depth is marginal entropy (spearman +1.00) and the semantic-action stream is i.i.d. (mean order-gain −0.33) — reverse-constructed programs carry no inter-verb structure, so stream-compressibility cannot mark a learnable frontier |
| 38 | The certified frontier GENERATOR — a QD capstone | search grammar-space for a diverse archive of 23 certified grammars (3 world-dims × 5 denotation-sets) EVERY one on the learnable cert_len frontier, ~$0 — generate-by-certified-search lifted from levels to rule-systems; honest 2nd negative: MAP-Elites ≈ random on this easy low-dim frontier (|Δ|=1 cell), the value is the certificate+generator not the search heuristic |
| 39 | Semantic adaptive capacity — Mixture-of-Recursions (2507.10524) | a real MoR layer (the 2nd Burn pass) routes per-input recursion depth: it beats every fixed-depth model 5.4× (none fits the mixed-depth task) at an average depth 1.84<3 (the Pareto compute win), and routes deeper inputs to deeper recursion (monotone 1.53→1.76→2.23) — pass 35's per-scale HEAD allocation extends to per-input DEPTH allocation |
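Two quantities recur across the table: the search-effort bits Ê=log2(1+sims) from pass 8, and the Spearman rank correlation ρ used to compare an effort meter against an oracle. The block below is a minimal, dependency-free sketch of both, not the repository's implementation. The function names (`effort_bits`, `ranks`, `pearson`, `spearman`) are illustrative assumptions, and ties are handled with average ranks, which the source does not specify.

```rust
/// Search-effort bits: E = log2(1 + sims), the formula quoted for pass 8.
/// The function name is hypothetical, not taken from the repository.
fn effort_bits(sims: u64) -> f64 {
    (1.0 + sims as f64).log2()
}

/// 1-based ranks; tied values share the mean of their positions
/// (an assumed tie rule, the source does not state one).
fn ranks(xs: &[f64]) -> Vec<f64> {
    let mut idx: Vec<usize> = (0..xs.len()).collect();
    idx.sort_by(|&a, &b| xs[a].partial_cmp(&xs[b]).expect("NaN is not rankable"));
    let mut r = vec![0.0_f64; xs.len()];
    let mut i = 0;
    while i < idx.len() {
        // Extend j over the run of values equal to xs[idx[i]].
        let mut j = i;
        while j + 1 < idx.len() && xs[idx[j + 1]] == xs[idx[i]] {
            j += 1;
        }
        let avg = (i + j) as f64 / 2.0 + 1.0;
        for k in i..=j {
            r[idx[k]] = avg;
        }
        i = j + 1;
    }
    r
}

/// Pearson correlation; returns NaN if either input is constant.
fn pearson(x: &[f64], y: &[f64]) -> f64 {
    let n = x.len() as f64;
    let mx = x.iter().sum::<f64>() / n;
    let my = y.iter().sum::<f64>() / n;
    let (mut sxy, mut sxx, mut syy) = (0.0_f64, 0.0_f64, 0.0_f64);
    for (a, b) in x.iter().zip(y) {
        sxy += (a - mx) * (b - my);
        sxx += (a - mx).powi(2);
        syy += (b - my).powi(2);
    }
    sxy / (sxx * syy).sqrt()
}

/// Spearman rho: Pearson correlation of the two rank vectors.
fn spearman(x: &[f64], y: &[f64]) -> f64 {
    assert_eq!(x.len(), y.len(), "inputs must be paired");
    pearson(&ranks(x), &ranks(y))
}
```

With these definitions, `effort_bits(1)` is exactly 1.0, which matches the "→1.00" end of the pass-8 collapse. Feeding the three pass-32 band values (effort 1, 4, 1902 against cert_len 3, 8, 15) through `spearman` gives a perfect rank agreement, because both sequences are strictly increasing; the +0.96 in the table is computed over the full per-level data, which is not reproduced here.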
Three through-lines
Beyond re-verifying each claim, the passes reconstruct the reasons behind the paper's design:
- Instrument orthogonality — exploit-search, IPT, and the trajectory-mutation probe each catch holes the others miss; that is why the panel needs all of them.
- The pipeline — instruments → grade (A/C/F/ERR) → Reward Integrity Score → the worst-first board + replayable verifier cards.
- Audit → train, and its limit — the same decomposition that collapses audit search (pass 1) distils into a sims=1 policy (pass 7); but the feasibility check at its core can be neither distilled (10) nor bounded (11) — it is the bounded-compute structure (epiplexity) you pay in full.
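The pipeline through-line ends in a worst-first board. As a rough sketch of that final ordering step only, the block below sorts graded entries so the weakest reward integrity comes first. Everything about the data layout is an assumption: the `Grade` and `BoardEntry` types, the `entry` helper, the choice to represent an errored audit as a missing score, and the choice to list errored entries ahead of scored ones. Only the grade letters and the example scores (A 100, C 70, F 6) come from the table above.

```rust
/// Grade letters as named in the text; the enum itself is hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Grade {
    A,
    C,
    F,
    Err,
}

/// One row of the board. Field names are illustrative assumptions.
#[derive(Debug, Clone)]
struct BoardEntry {
    name: String,
    grade: Grade,
    /// Reward Integrity Score; None models an audit that errored (assumption).
    ris: Option<u32>,
}

/// Small constructor to keep call sites short.
fn entry(name: &str, grade: Grade, ris: Option<u32>) -> BoardEntry {
    BoardEntry { name: name.to_string(), grade, ris }
}

/// Worst-first ordering: ascending RIS, with unscored (errored) entries first.
/// Option<u32> orders None before any Some(_), and sort_by_key is stable,
/// so entries with equal scores keep their input order.
fn worst_first(mut entries: Vec<BoardEntry>) -> Vec<BoardEntry> {
    entries.sort_by_key(|e| e.ris);
    entries
}
```

Sorting on the score rather than on the letter keeps the board order well defined even if two entries share a grade. Whether the real board places ERR rows first or last is not stated in the text, so treat that part of the ordering as a placeholder.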
Reproduce
Every prototype is pure-CPU and model-free — pass 13 uses a deterministic mock sampler in place of an LLM, reproducing the shape of the one model-dependent result. Run any of them:

    cargo run --release --example <name> -p starburn-mcts --features ndarray

For example, <name> can be baba_decomposition, sokoban_difficulty, reward_integrity, epiplexity, or starscry_index. All thirty-nine assert their claim and exit non-zero on regression.
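One way an example can both print measured-vs-paper and exit non-zero on regression is sketched below. This is an assumed pattern, not code from the repository: `check_claim`, `exit_code`, and the `floor` tolerance are hypothetical names, and the real examples may simply use `assert!`, whose panic also terminates the process with a non-zero status.

```rust
use std::process::ExitCode;

/// Print measured-vs-paper for one claim and fail if the measured value
/// drops below a regression floor. `floor` is a hypothetical tolerance,
/// not a threshold taken from the repository.
fn check_claim(label: &str, measured: f64, paper: f64, floor: f64) -> Result<(), String> {
    println!("{label}: measured {measured:+.2} (paper {paper:+.2})");
    if measured >= floor {
        Ok(())
    } else {
        Err(format!("{label} regressed: {measured:+.2} is below floor {floor:+.2}"))
    }
}

/// Fold all checks into a process exit code: success only if every claim held.
fn exit_code(results: &[Result<(), String>]) -> ExitCode {
    let mut ok = true;
    for r in results {
        if let Err(msg) = r {
            eprintln!("FAIL: {msg}");
            ok = false;
        }
    }
    if ok {
        ExitCode::SUCCESS
    } else {
        ExitCode::FAILURE
    }
}
```

An example's `fn main() -> ExitCode` would collect its checks into a slice and return `exit_code(&results)`, so a CI job or shell script can detect a regression from the exit status alone.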
Don't trust us — re-run the example.