Annotated research prototypes

The starscry paper, reproduced as runnable code and now extended: thirty-nine hermetic, deterministic Rust prototypes. The first thirteen reproduce the paper end-to-end (the six audit instruments, the Index, the ExiT story, the epiplexity theory), each asserting one claim. Passes fourteen to twenty-six extend it with open-agenda and verified 2026 SOTA results, a twenty-seventh capstone composes them into three through-lines, and passes twenty-eight to thirty-nine extend the agenda further.

Each pass also distils its durable knowledge into atomic spaced-repetition cards (the StarLore vault, kept as course material). The loop: pick a paper claim → minimal hermetic repro (reusing a shared *-core crate where one exists) → assert the claim and print measured-vs-paper → distil cross-linked cards.
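
The "assert the claim and print measured-vs-paper" step of that loop can be sketched as one small binary. This is a minimal illustration, not one of the actual prototypes: the `Claim` type, its fields, and the tolerance are hypothetical, and the two figures are the pass-8 values from the table (10.39 measured vs 10.42 in the paper).

```rust
use std::process::ExitCode;

/// One claim under test: what the paper reports and what this run measured.
/// Hypothetical type; the real prototypes may structure this differently.
struct Claim {
    name: &'static str,
    paper: f64,
    measured: f64,
    /// Largest absolute gap tolerated before the claim counts as regressed.
    tolerance: f64,
}

impl Claim {
    /// True when the measured value is within `tolerance` of the paper's.
    fn holds(&self) -> bool {
        (self.measured - self.paper).abs() <= self.tolerance
    }

    /// One-line measured-vs-paper report.
    fn report(&self) -> String {
        format!(
            "{}: measured {:.2} vs paper {:.2} [{}]",
            self.name,
            self.measured,
            self.paper,
            if self.holds() { "ok" } else { "REGRESSION" }
        )
    }
}

fn main() -> ExitCode {
    // Figures from the pass-8 row; the tolerance of 0.1 is an assumption.
    let claim = Claim {
        name: "baba_d6_effort_bits",
        paper: 10.42,
        measured: 10.39,
        tolerance: 0.1,
    };
    println!("{}", claim.report());
    // A non-zero exit code lets CI or a shell loop detect the regression.
    if claim.holds() {
        ExitCode::SUCCESS
    } else {
        ExitCode::FAILURE
    }
}
```

Returning `ExitCode` from `main` rather than panicking keeps the measured-vs-paper line on stdout even when the claim fails.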

#	Claim	Measured (vs paper)
1	Action-space decomposition collapses search	MACRO ~1 sim vs FLAT 15→941; 5.1× fewer engine-steps @ d4
2	Sims-to-solve certifies difficulty (vs BFS oracle)	Spearman ρ=+0.67 (paper +0.87); bands 7→58→74 sims, monotone
3	Reward integrity is gradeable; exploit-search games it	RIS A 100 > C 70 > F 6; F gamed in 3 sims, replayable
4	Isomorphic Perturbation Testing — reward invariance	stable A · fragile C (IPT-only) · degenerate F; IPT ⊥ exploit
5	Multi-turn process rewards game on fake tool-calls	claim-without-tool → C, caught only by the trajectory-mutation probe
6	The Index — instruments compose into a graded board	2A · 2C · 2F · 1ERR · 1difficulty, RIS-ranked, replayable cards
7	Search distills into a sims=1 student (ExiT)	held-out solved at sims=1, ~300× fewer engine-steps than FLAT
8	Bounded-compute search-effort bits — Ê=log2(1+sims) unifies the instruments	one bits-meter; Baba d6 collapse 10.39→1.00 ≈ paper 10.42→1.00
9	A distilled student discriminates mechanic-classes	break_hazard ≻ break_other; solves multi-edit where FLAT censors
10	The feasibility check can't be distilled (a proven limit)	cheap state-flags don't beat the residual; reachability ≠ rule-flag
11	…nor bounded (a proven limit)	capping the check solves fewer + costs more; far-at-start, near-at-end
12	Verifier-completeness — four verdicts	clean / unsound / over-strict (a flag) / abstain; soundness ≠ completeness
13	Constrained-text difficulty (the one model-dependent result)	samples-to-first vs pass-rate ρ=−0.95 (paper −0.92/−0.87), via a mock sampler
14	Cross-solver rank-transfer — first extension beyond the paper	MCTS↔A* ρ=+0.68; A*↔oracle +0.99; rank transfers, magnitude is solver-relative
15	Difficulty certificate with a confidence level (π̄ + BAI)	root-Q readout 19.4× lower-variance, EB-BAI δ-PAC band, median 2 seeds
16	Sharpened exploit-search — coverage, not 'gamed in N sims'	MAP-Elites 80 distinct exploits vs reward-greedy 16 (5.0×) at equal budget
17	Multi-turn PRM attacks — fluency-detector mutations	length/step-biased PRMs gamed by inflate/inject; sound robust; mutations diagnostic
18	Three-way cross-solver rank-transfer (incl. a weak solver)	MCTS/A*/GBFS all pairwise ρ>0 (min 0.585); rank survives a suboptimal solver
19	Decomposition-collapse predictor (B^L/W)	FLAT effort = B^L/W (slope 1.10, R²0.98); predicts held-out + a 2nd domain ρ=0.93
20	The predictor validated on the REAL engine	synthetic B^L law transfers to baba-core: ln(FLAT)=0.62·feat, R²0.89, ρ0.90; MACRO collapses 273×
21	Exploit-search as a verifier-robustness distribution	sims-to-game ranks soundness ρ=+0.99, recovers hidden order, Hacker-invariant (QD↔blind +1.00)
22	Epiplexity paradoxes reproduced via prequential coding, hermetically (the principle, not the NN estimator)	a model-free learner reproduces 3 paradoxes at equal Shannon H0: PRNG epip≈0 vs structured 11195 bits
23	Robustness benchmark on the REAL repo graders	sims-to-game ranks the actual RIS A/C/F graders ρ=−0.97 vs hole-size, recovering RIS A>C>F
24	Adversarial PRM hardening — the audit→fix loop	training hardens a biased PRM: gameability 4/4→0/4 while completeness 1/3→3/3; converges to sound
25	ReSCALE — Gumbel + Sequential-Halving difficulty instrument	the 2026 SOTA root allocator sharpens oracle ρ +0.74→+0.90; hard band 34% fewer sims
26	Rank-transfer breakdown — the boundary of solver-invariance	with search ρ≥0.79 across competence; without search a walker barely certifies (ρ0.47) — the boundary is SEARCH
27	The extensions index — a capstone (composes 14–26)	13 extensions → 3 through-lines, asserted together: rank-is-the-certificate · B^L/W hitting law · audit→fix
28	Budget-adaptive judge allocation (variance-adaptive, 2602.15481)	Neyman split of a FIXED budget cuts score-variance Σσ²/B 1.39× at equal B (rank kept ρ+0.84→+0.84), uniform's precision at 80% of trials; same allocator sharpens robustness 1.79× — one allocator, two instruments
29	Planning-grounded step PRM (2604.17957)	BFS-oracle step labels train a PRM that beats a proxy on held-out (79% vs 75%), keeps completeness (72%), and accepts only 20% of the productive-looking detours the proxy swallows at 100% — a verifier from a planner, not a fluency proxy
30	One law, fit-on-A-predict-B (Threads 1+2)	the B^L/W law fit on difficulty predicts verifier-robustness — a blind Hacker transfers rank ρ+0.87 AND magnitude R²0.79 (one law); a greedy Hacker's exponent collapses → magnitude breaks (R²−123) while rank degrades only gracefully (ρ0.48): rank is the more-robust certificate
31	Refine meta-environment: allocation under a bounded-compute structure-bits certificate (cf. 2601.03220)	the pass-28 Neyman allocator generalizes from trials/variance to capacity/structure-bits (the prequential net-bits signal formerly styled 'epiplexity' — NOT the 2601.03220 NN-trained estimator; renamed bounded-compute search-effort bits): ∝-structure-bits lowers Σs²/c 1.38× (guaranteed), and its Sokoban difficulty rank is a coarse BAND-resolution result (ρ+0.76 over 9 band×seed cells — per-level it fails the length control, ρ(Ê,moves)=−0.40; see pass31-circularity-2026-07); on a controlled grid a real learner confirms structure-bits—not flat Shannon entropy—predicts where capacity pays off, water-filling cutting unextracted structure 1.34× below uniform — one allocator, two instruments
32	Lexicon difficulty certificate — the new-domain proof (β morpheme-grammar)	the pass-2/28 capped-UCT effort meter, re-pointed at the new Lexicon engine (lexicon-core), rank-correlates with the BFS derivation oracle's cert_len (spearman +0.96; effort 1→4→1902 vs cert 3→8→15) — the difficulty rank transfers to the β morpheme/typed-grammar game, so pass 31's search-effort certificate has a real new domain to map
33	Multi-scale structure-bits + per-scale head allocation (2410.11842)	a Lexicon solution path windows into 3 scales (spell/grammar/plan) carrying 509–2121× different structure-bits (a single-scale domain like Sokoban cannot); a scale-aware head allocation lowers Σs²/c 1.71× below a scale-blind split (guaranteed Neyman nesting) by starving the near-null plan scale — the new game earns its multi-scale keep
34	Autocurriculum — structure-bits as learnability (2010.03934)	Prioritized Level Replay with prequential structure-bits (unextracted structure) as the learnability score front-loads competence: in a redundant-heavy pool a count-model learner reaches competence faster than uniform (+0.19 early, past the stardata 0.12 gate; 1.10× AUC) and beats a fixed easy→hard schedule — closing the refine loop so ONE signal drives capacity AND problems [gated: the claim stands only against a TUNED regret-PLR baseline with matched hyperparameter budget — pending]
35	Real Mixture-of-Heads grounds the scale-aware allocation (2410.11842)	a REAL MoH attention layer trained by autodiff (the first Burn starscry pass) learns the multi-scale task (loss 0.35→0.0002) and beats a scale-blind single-Linear baseline 1231× — and its router, with no routing supervision, rediscovers pass 33's hand-computed allocation: structured scales → distinct heads, the null scale least decisive. Gradient descent finds allocation-under-a-structure-bits-certificate
36	Semantic grammar-space: the learnability frontier (the crux, pivoted)	the literal stream-epiplexity inverted-U FAILED → relocated to learnability: a bounded solver's potential 4·p·(1−p) is inverted-U in certified difficulty (interior peak at p≈0.5, 1.23× the ends), solve-rate↔cert_len ρ=−1.00, a stronger solver shifts the peak +2.2 right (a capacity effect); structural_depth turned out orthogonal to difficulty — cert_len is the axis (2010.03934 · 2408.15099)
37	The stream-epiplexity NULL — an asserted negative	the obvious instrument is REFUTED: throttled stream-epiplexity is monotone not inverted-U (interior-peak ratio 0.98, max at an END); what rises with structural_depth is marginal entropy (spearman +1.00) and the semantic-action stream is i.i.d. (mean order-gain −0.33) — reverse-constructed programs carry no inter-verb structure, so stream-compressibility cannot mark a learnable frontier
38	The certified frontier GENERATOR — a QD capstone	search grammar-space for a diverse archive of 23 certified grammars (3 world-dims × 5 denotation-sets) EVERY one on the learnable cert_len frontier, ~$0 — generate-by-certified-search lifted from levels to rule-systems; honest 2nd negative: MAP-Elites ≈ random on this easy low-dim frontier (|Δ|=1 cell), the value is the certificate+generator not the search heuristic
39	Semantic adaptive capacity — Mixture-of-Recursions (2507.10524)	a real MoR layer (the 2nd Burn pass) routes per-input recursion depth: it beats every fixed-depth model 5.4× (none fits the mixed-depth task) at an average depth 1.84<3 (the Pareto compute win), and routes deeper inputs to deeper recursion (monotone 1.53→1.76→2.23) — pass 35's per-scale HEAD allocation extends to per-input DEPTH allocation
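
Two quantities recur throughout the table: the pass-8 effort meter Ê=log2(1+sims) and Spearman's ρ between a measured ranking and an oracle ranking. The sketch below shows how both could be computed with the standard library only. The function names are illustrative, the sims values are the pass-2 bands (7, 58, 74), and the oracle depths are invented purely for the demo; none of this is taken from the repository's code.

```rust
/// Search-effort bits: E = log2(1 + sims), as stated in the pass-8 row.
fn effort_bits(sims: u64) -> f64 {
    (1.0 + sims as f64).log2()
}

/// 1-based average ranks; tied values share the mean of their positions.
fn ranks(xs: &[f64]) -> Vec<f64> {
    let mut idx: Vec<usize> = (0..xs.len()).collect();
    idx.sort_by(|&a, &b| xs[a].total_cmp(&xs[b]));
    let mut out = vec![0.0; xs.len()];
    let mut i = 0;
    while i < idx.len() {
        // Extend j to the end of the run of values equal to xs[idx[i]].
        let mut j = i;
        while j + 1 < idx.len() && xs[idx[j + 1]] == xs[idx[i]] {
            j += 1;
        }
        let avg = (i + j) as f64 / 2.0 + 1.0;
        for k in i..=j {
            out[idx[k]] = avg;
        }
        i = j + 1;
    }
    out
}

/// Spearman's rho: the Pearson correlation of the two rank vectors.
/// Returns NaN if either input is constant (zero rank variance).
fn spearman(xs: &[f64], ys: &[f64]) -> f64 {
    assert_eq!(xs.len(), ys.len());
    let (rx, ry) = (ranks(xs), ranks(ys));
    let n = rx.len() as f64;
    let mx = rx.iter().sum::<f64>() / n;
    let my = ry.iter().sum::<f64>() / n;
    let (mut cov, mut vx, mut vy) = (0.0, 0.0, 0.0);
    for (a, b) in rx.iter().zip(&ry) {
        cov += (a - mx) * (b - my);
        vx += (a - mx).powi(2);
        vy += (b - my).powi(2);
    }
    cov / (vx * vy).sqrt()
}

fn main() {
    // Band sims from the pass-2 row; the oracle depths are hypothetical.
    let sims = [7u64, 58, 74];
    let bits: Vec<f64> = sims.iter().map(|&s| effort_bits(s)).collect();
    let oracle_depth = [3.0, 5.0, 9.0];
    println!("bits = {:?}", bits);
    println!("rho  = {:.2}", spearman(&bits, &oracle_depth));
}
```

Because ρ depends only on ranks, it is unchanged by the log2 transform; the bits scale matters for magnitudes such as the 10.39→1.00 collapse, not for the rank correlations.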

Three through-lines

Beyond re-verifying each claim, the passes reconstruct the reasons behind the paper's design:

Instrument orthogonality — exploit-search, IPT, and the trajectory-mutation probe each catch holes the others miss; that is why the panel needs all of them.
The pipeline — instruments → grade (A/C/F/ERR) → Reward Integrity Score → the worst-first board + replayable verifier cards.
Audit → train, and its limit — the same decomposition that collapses audit search (pass 1) distils into a sims=1 policy (pass 7); but the feasibility check at its core can be neither distilled (10) nor bounded (11) — it is the bounded-compute structure (epiplexity) you pay in full.
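
The last stage of the pipeline through-line, a worst-first board, can be sketched as a sort over graded cards. Everything here is a hypothetical shape rather than the repository's types: the `Card` fields, the verifier names, and in particular the choice to give errored audits no score and list them first are assumptions. The RIS values are the pass-3 figures (A 100, C 70, F 6).

```rust
/// Grades named in the pipeline: A / C / F / ERR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Grade {
    A,
    C,
    F,
    Err,
}

/// One board entry. Hypothetical shape; field names are illustrative.
#[derive(Debug, Clone)]
struct Card {
    verifier: &'static str,
    grade: Grade,
    /// Reward Integrity Score; `None` when the audit errored (an assumption).
    ris: Option<u32>,
}

/// Sort worst-first: errored audits lead, then ascending RIS.
/// Relies on `Option`'s ordering, where `None` sorts before any `Some(_)`.
fn worst_first(mut cards: Vec<Card>) -> Vec<Card> {
    cards.sort_by_key(|c| c.ris);
    cards
}

fn main() {
    let board = worst_first(vec![
        Card { verifier: "grader_a", grade: Grade::A, ris: Some(100) },
        Card { verifier: "grader_f", grade: Grade::F, ris: Some(6) },
        Card { verifier: "grader_err", grade: Grade::Err, ris: None },
        Card { verifier: "grader_c", grade: Grade::C, ris: Some(70) },
    ]);
    for card in &board {
        println!("{:?}\t{:?}\t{}", card.grade, card.ris, card.verifier);
    }
}
```

`sort_by_key` is stable, so cards with equal scores keep their input order, which keeps the board deterministic across runs.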

Reproduce

Every prototype is pure-CPU and model-free — pass 13 uses a deterministic mock sampler in place of an LLM, reproducing the shape of the one model-dependent result. Run any of them:

cargo run --release --example <name> -p starburn-mcts --features ndarray

e.g. baba_decomposition, sokoban_difficulty, reward_integrity, epiplexity, starscry_index. All thirty-nine assert their claim and exit non-zero on regression.
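
Since each example signals regression through its exit code, a small driver can run several in sequence and stop at the first failure. The sketch below is an assumption-laden illustration, not a tool shipped with the repository: it covers only the five example names listed above, it shells out to the documented `cargo run` invocation via `std::process::Command`, and its `--run` flag is invented for this sketch (without it, the driver only prints the commands).

```rust
use std::process::{Command, ExitCode};

/// The example names listed in this section (not the full set).
const EXAMPLES: [&str; 5] = [
    "baba_decomposition",
    "sokoban_difficulty",
    "reward_integrity",
    "epiplexity",
    "starscry_index",
];

/// Build the documented invocation for one example.
fn example_command(name: &str) -> Command {
    let mut cmd = Command::new("cargo");
    cmd.args([
        "run", "--release", "--example", name,
        "-p", "starburn-mcts", "--features", "ndarray",
    ]);
    cmd
}

fn main() -> ExitCode {
    // Dry-run unless `--run` is passed; this flag belongs to the sketch only.
    let run = std::env::args().any(|a| a == "--run");
    for name in EXAMPLES {
        let mut cmd = example_command(name);
        if !run {
            println!("{:?}", cmd);
            continue;
        }
        match cmd.status() {
            Ok(status) if status.success() => println!("ok   {name}"),
            // Covers both a non-zero exit and a failure to spawn cargo.
            _ => {
                eprintln!("FAIL {name}");
                return ExitCode::FAILURE;
            }
        }
    }
    ExitCode::SUCCESS
}
```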

Don't trust us — re-run the example.