Annotated research prototypes

The starscry paper, reproduced as runnable code and now extended. There are thirty-nine hermetic, deterministic Rust prototypes, each asserting one claim. Passes 1–13 reproduce the paper end-to-end (the six audit instruments, the Index, the ExiT story, the epiplexity theory). Passes 14–26 extend it with open-agenda and verified-2026-SOTA results, pass 27 is a capstone that composes those extensions into three through-lines, and passes 28–39 extend the agenda further.

Each pass also distils its durable knowledge into atomic spaced-repetition cards (the StarLore vault, kept as course material). The loop: pick a paper claim → minimal hermetic repro (reusing a shared *-core crate where one exists) → assert the claim and print measured-vs-paper → distil cross-linked cards.

| # | Claim | Measured (vs paper) |
|---|---|---|
| 1 | Action-space decomposition collapses search | MACRO ~1 sim vs FLAT 15→941; 5.1× fewer engine-steps @ d4 |
| 2 | Sims-to-solve certifies difficulty (vs BFS oracle) | Spearman ρ=+0.67 (paper +0.87); bands 7→58→74 sims, monotone |
| 3 | Reward integrity is gradeable; exploit-search games it | RIS A 100 > C 70 > F 6; F gamed in 3 sims, replayable |
| 4 | Isomorphic Perturbation Testing — reward invariance | stable A · fragile C (IPT-only) · degenerate F; IPT ⊥ exploit |
| 5 | Multi-turn process rewards game on fake tool-calls | claim-without-tool → C, caught only by the trajectory-mutation probe |
| 6 | The Index — instruments compose into a graded board | 2A · 2C · 2F · 1ERR · 1difficulty, RIS-ranked, replayable cards |
| 7 | Search distills into a sims=1 student (ExiT) | held-out solved at sims=1, ~300× fewer engine-steps than FLAT |
| 8 | Bounded-compute search-effort bits — Ê=log2(1+sims) unifies the instruments | one bits-meter; Baba d6 collapse 10.39→1.00 ≈ paper 10.42→1.00 |
| 9 | A distilled student discriminates mechanic-classes | break_hazard ≻ break_other; solves multi-edit where FLAT censors |
| 10 | The feasibility check can't be distilled (a proven limit) | cheap state-flags don't beat the residual; reachability ≠ rule-flag |
| 11 | …nor bounded (a proven limit) | capping the check solves fewer + costs more; far-at-start, near-at-end |
| 12 | Verifier-completeness — four verdicts | clean / unsound / over-strict (a flag) / abstain; soundness ≠ completeness |
| 13 | Constrained-text difficulty (the one model-dependent result) | samples-to-first vs pass-rate ρ=−0.95 (paper −0.92/−0.87), via a mock sampler |
| 14 | Cross-solver rank-transfer — first extension beyond the paper | MCTS↔A* ρ=+0.68; A*↔oracle +0.99; rank transfers, magnitude is solver-relative |
| 15 | Difficulty certificate with a confidence level (π̄ + BAI) | root-Q readout 19.4× lower-variance, EB-BAI δ-PAC band, median 2 seeds |
| 16 | Sharpened exploit-search — coverage, not 'gamed in N sims' | MAP-Elites 80 distinct exploits vs reward-greedy 16 (5.0×) at equal budget |
| 17 | Multi-turn PRM attacks — fluency-detector mutations | length/step-biased PRMs gamed by inflate/inject; sound robust; mutations diagnostic |
| 18 | Three-way cross-solver rank-transfer (incl. a weak solver) | MCTS/A*/GBFS all pairwise ρ>0 (min 0.585); rank survives a suboptimal solver |
| 19 | Decomposition-collapse predictor (B^L/W) | FLAT effort = B^L/W (slope 1.10, R²0.98); predicts held-out + a 2nd domain ρ=0.93 |
| 20 | The predictor validated on the REAL engine | synthetic B^L law transfers to baba-core: ln(FLAT)=0.62·feat, R²0.89, ρ0.90; MACRO collapses 273× |
| 21 | Exploit-search as a verifier-robustness distribution | sims-to-game ranks soundness ρ=+0.99, recovers hidden order, Hacker-invariant (QD↔blind +1.00) |
| 22 | Epiplexity paradoxes reproduced via prequential coding, hermetically (the principle, not the NN estimator) | a model-free learner reproduces 3 paradoxes at equal Shannon H0: PRNG epip≈0 vs structured 11195 bits |
| 23 | Robustness benchmark on the REAL repo graders | sims-to-game ranks the actual RIS A/C/F graders ρ=−0.97 vs hole-size, recovering RIS A>C>F |
| 24 | Adversarial PRM hardening — the audit→fix loop | training hardens a biased PRM: gameability 4/4→0/4 while completeness 1/3→3/3; converges to sound |
| 25 | ReSCALE — Gumbel + Sequential-Halving difficulty instrument | the 2026 SOTA root allocator sharpens oracle ρ +0.74→+0.90; hard band 34% fewer sims |
| 26 | Rank-transfer breakdown — the boundary of solver-invariance | with search ρ≥0.79 across competence; without search a walker barely certifies (ρ0.47) — the boundary is SEARCH |
| 27 | The extensions index — a capstone (composes 14–26) | 13 extensions → 3 through-lines, asserted together: rank-is-the-certificate · B^L/W hitting law · audit→fix |
| 28 | Budget-adaptive judge allocation (variance-adaptive, 2602.15481) | Neyman split of a FIXED budget cuts score-variance Σσ²/B 1.39× at equal B (rank kept ρ+0.84→+0.84), uniform's precision at 80% of trials; same allocator sharpens robustness 1.79× — one allocator, two instruments |
| 29 | Planning-grounded step PRM (2604.17957) | BFS-oracle step labels train a PRM that beats a proxy on held-out (79% vs 75%), keeps completeness (72%), and accepts only 20% of the productive-looking detours the proxy swallows at 100% — a verifier from a planner, not a fluency proxy |
| 30 | One law, fit-on-A-predict-B (Threads 1+2) | the B^L/W law fit on difficulty predicts verifier-robustness — a blind Hacker transfers rank ρ+0.87 AND magnitude R²0.79 (one law); a greedy Hacker's exponent collapses → magnitude breaks (R²−123) while rank degrades only gracefully (ρ0.48): rank is the more-robust certificate |
| 31 | Refine meta-environment: allocation under a bounded-compute structure-bits certificate (cf. 2601.03220) | the pass-28 Neyman allocator generalizes from trials/variance to capacity/structure-bits (the prequential net-bits signal formerly styled 'epiplexity' — NOT the 2601.03220 NN-trained estimator; renamed bounded-compute search-effort bits): ∝-structure-bits lowers Σs²/c 1.38× (guaranteed), and its Sokoban difficulty rank is a coarse BAND-resolution result (ρ+0.76 over 9 band×seed cells — per-level it fails the length control, ρ(Ê,moves)=−0.40; see pass31-circularity-2026-07); on a controlled grid a real learner confirms structure-bits—not flat Shannon entropy—predicts where capacity pays off, water-filling cutting unextracted structure 1.34× below uniform — one allocator, two instruments |
| 32 | Lexicon difficulty certificate — the new-domain proof (β morpheme-grammar) | the pass-2/28 capped-UCT effort meter, re-pointed at the new Lexicon engine (lexicon-core), rank-correlates with the BFS derivation oracle's cert_len (spearman +0.96; effort 1→4→1902 vs cert 3→8→15) — the difficulty rank transfers to the β morpheme/typed-grammar game, so pass 31's search-effort certificate has a real new domain to map |
| 33 | Multi-scale structure-bits + per-scale head allocation (2410.11842) | a Lexicon solution path windows into 3 scales (spell/grammar/plan) carrying 509–2121× different structure-bits (a single-scale domain like Sokoban cannot); a scale-aware head allocation lowers Σs²/c 1.71× below a scale-blind split (guaranteed Neyman nesting) by starving the near-null plan scale — the new game earns its multi-scale keep |
| 34 | Autocurriculum — structure-bits as learnability (2010.03934) | Prioritized Level Replay with prequential structure-bits (unextracted structure) as the learnability score front-loads competence: in a redundant-heavy pool a count-model learner reaches competence faster than uniform (+0.19 early, past the stardata 0.12 gate; 1.10× AUC) and beats a fixed easy→hard schedule — closing the refine loop so ONE signal drives capacity AND problems [gated: the claim stands only against a TUNED regret-PLR baseline with matched hyperparameter budget — pending] |
| 35 | Real Mixture-of-Heads grounds the scale-aware allocation (2410.11842) | a REAL MoH attention layer trained by autodiff (the first Burn starscry pass) learns the multi-scale task (loss 0.35→0.0002) and beats a scale-blind single-Linear baseline 1231× — and its router, with no routing supervision, rediscovers pass 33's hand-computed allocation: structured scales → distinct heads, the null scale least decisive. Gradient descent finds allocation-under-a-structure-bits-certificate |
| 36 | Semantic grammar-space: the learnability frontier (the crux, pivoted) | the literal stream-epiplexity inverted-U FAILED → relocated to learnability: a bounded solver's potential 4·p·(1−p) is inverted-U in certified difficulty (interior peak at p≈0.5, 1.23× the ends), solve-rate↔cert_len ρ=−1.00, a stronger solver shifts the peak +2.2 right (a capacity effect); structural_depth turned out orthogonal to difficulty — cert_len is the axis (2010.03934 · 2408.15099) |
| 37 | The stream-epiplexity NULL — an asserted negative | the obvious instrument is REFUTED: throttled stream-epiplexity is monotone not inverted-U (interior-peak ratio 0.98, max at an END); what rises with structural_depth is marginal entropy (spearman +1.00) and the semantic-action stream is i.i.d. (mean order-gain −0.33) — reverse-constructed programs carry no inter-verb structure, so stream-compressibility cannot mark a learnable frontier |
| 38 | The certified frontier GENERATOR — a QD capstone | search grammar-space for a diverse archive of 23 certified grammars (3 world-dims × 5 denotation-sets) EVERY one on the learnable cert_len frontier, ~$0 — generate-by-certified-search lifted from levels to rule-systems; honest 2nd negative: MAP-Elites ≈ random on this easy low-dim frontier (\|Δ\|=1 cell), the value is the certificate+generator not the search heuristic |
| 39 | Semantic adaptive capacity — Mixture-of-Recursions (2507.10524) | a real MoR layer (the 2nd Burn pass) routes per-input recursion depth: it beats every fixed-depth model 5.4× (none fits the mixed-depth task) at an average depth 1.84<3 (the Pareto compute win), and routes deeper inputs to deeper recursion (monotone 1.53→1.76→2.23) — pass 35's per-scale HEAD allocation extends to per-input DEPTH allocation |
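
Two quantities recur throughout the results above: the search-effort meter Ê=log2(1+sims) and Spearman ρ between an instrument's readings and an oracle. The following is a minimal Rust sketch of both. It assumes nothing about the repository's own code: the function names effort_bits, ranks and spearman are hypothetical, and the sample data in main is invented purely for illustration.

```rust
/// Search-effort bits as stated in the table: Ê = log2(1 + sims).
fn effort_bits(sims: u64) -> f64 {
    (1.0 + sims as f64).log2()
}

/// 1-based ranks; tied values share the mean of the positions they occupy.
fn ranks(xs: &[f64]) -> Vec<f64> {
    let mut idx: Vec<usize> = (0..xs.len()).collect();
    idx.sort_by(|&a, &b| xs[a].total_cmp(&xs[b]));
    let mut out = vec![0.0; xs.len()];
    let mut i = 0;
    while i < idx.len() {
        // Extend j to the end of the run of values equal to xs[idx[i]].
        let mut j = i;
        while j + 1 < idx.len() && xs[idx[j + 1]] == xs[idx[i]] {
            j += 1;
        }
        let mean_rank = (i + j) as f64 / 2.0 + 1.0;
        for k in i..=j {
            out[idx[k]] = mean_rank;
        }
        i = j + 1;
    }
    out
}

/// Spearman rho, computed as the Pearson correlation of the two rank vectors.
/// Returns NaN if either input is constant (zero rank variance).
fn spearman(xs: &[f64], ys: &[f64]) -> f64 {
    assert_eq!(xs.len(), ys.len(), "inputs must have equal length");
    let (rx, ry) = (ranks(xs), ranks(ys));
    let n = rx.len() as f64;
    let mx = rx.iter().sum::<f64>() / n;
    let my = ry.iter().sum::<f64>() / n;
    let (mut cov, mut vx, mut vy) = (0.0, 0.0, 0.0);
    for (a, b) in rx.iter().zip(&ry) {
        cov += (a - mx) * (b - my);
        vx += (a - mx).powi(2);
        vy += (b - my).powi(2);
    }
    cov / (vx * vy).sqrt()
}

fn main() {
    // Invented example: oracle solution lengths and sims-to-solve for five levels.
    let oracle_len = [3.0, 5.0, 8.0, 12.0, 15.0];
    let sims = [1u64, 7, 4, 58, 74];
    let bits: Vec<f64> = sims.iter().map(|&s| effort_bits(s)).collect();
    println!("effort bits: {:?}", bits);
    println!("spearman rho = {:+.2}", spearman(&oracle_len, &bits));
}
```

Because log2 is monotone, ρ is the same whether it is computed on raw sims or on bits; the bits scale only changes magnitudes, not ranks. Note also that effort_bits(1) is exactly 1.00, which is consistent with the "→1.00" endpoints reported for a collapse to a single simulation.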

Three through-lines

Beyond re-verifying each claim, the passes reconstruct the reasons behind the paper's design:

Reproduce

Every prototype is pure-CPU and model-free — pass 13 uses a deterministic mock sampler in place of an LLM, reproducing the shape of the one model-dependent result. Run any of them:

cargo run --release --example <name> -p starburn-mcts --features ndarray

e.g. baba_decomposition, sokoban_difficulty, reward_integrity, epiplexity, starscry_index. All thirty-nine assert their claim and exit non-zero on regression.
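
The shape of such an example can be sketched as follows: a seeded generator keeps the run deterministic, a measurement is compared against a reference value within a tolerance, the measured-vs-reference line is printed, and the process exit code reports the outcome. This is an illustrative sketch only, not code from the repository. SplitMix64 stands in for whatever deterministic source the real prototypes use, and mean_samples_to_first, check_claim, the seed, the reference value and the tolerance are all hypothetical.

```rust
use std::process::ExitCode;

/// Small deterministic PRNG (SplitMix64), so no external crate or OS entropy is needed.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform f64 in [0, 1) built from the top 53 bits.
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// Hypothetical measurement: mean number of draws until the first success
/// when each draw succeeds with probability `p` (a stand-in mock sampler).
fn mean_samples_to_first(seed: u64, p: f64, trials: u32) -> f64 {
    let mut rng = SplitMix64(seed);
    let mut total: u64 = 0;
    for _ in 0..trials {
        let mut draws: u64 = 1;
        while rng.next_f64() >= p {
            draws += 1;
        }
        total += draws;
    }
    total as f64 / trials as f64
}

/// Print measured-vs-reference and report whether the claim holds within `tol`.
fn check_claim(name: &str, measured: f64, reference: f64, tol: f64) -> bool {
    let ok = (measured - reference).abs() <= tol;
    println!(
        "{name}: measured {measured:.3} vs reference {reference:.3} (tol {tol}) -> {}",
        if ok { "PASS" } else { "FAIL" }
    );
    ok
}

fn main() -> ExitCode {
    // With p = 0.25 the expected samples-to-first is 1/p = 4.
    let measured = mean_samples_to_first(42, 0.25, 10_000);
    if check_claim("samples-to-first", measured, 4.0, 0.2) {
        ExitCode::SUCCESS
    } else {
        ExitCode::FAILURE // non-zero exit signals a regression
    }
}
```

Fixing the seed means two runs print identical numbers, so any change in the measured value points at a code change rather than sampling noise.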

Don't trust us — re-run the example.