We benchmark 9 HPO methods (classical, LLM-based, and hybrid) on Karpathy's
autoresearch task under identical 24-hour budgets with 3 seeds. Classical methods
(CMA-ES, TPE, SMAC) consistently outperform pure LLM-based agents within a fixed search space.
Even frontier models like Gemini 3.1 Pro Preview do not close the gap.
Our hybrid Centaur, which shares CMA-ES's full internal state with the LLM,
achieves the best result in our experiments.
Interactive Demo
Explore the data interactively. Click legends to toggle methods, drag sliders, hover for details.
Method details & references
TPE — Tree-structured Parzen Estimator. Classical Bayesian HPO that models good and bad trials with separate densities and samples where their ratio favors the good ones. Optuna
Centaur (CMA-ES+LLM) — Hybrid: CMA-ES runs and updates on every trial; on ~30% of trials the LLM overrides the proposal with a config informed by CMA-ES's internal state. our paper · algorithm
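The override logic can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `centaur_propose`, `llm_propose`, and `cma_sample` are hypothetical stand-ins, and the toy "LLM" simply nudges the CMA-ES mean.

```python
import random

def centaur_propose(cma_sample, llm_propose, cma_state, r=0.3, rng=random):
    """One Centaur trial (sketch). With probability r the LLM overrides
    CMA-ES; the LLM is shown CMA-ES's internal state so its proposal is
    informed rather than blind. Returns (config, source)."""
    if rng.random() < r:
        return llm_propose(cma_state), "llm"
    return cma_sample(), "cma"

# Hypothetical stand-ins: CMA-ES samples from its current Gaussian;
# the "LLM" perturbs the CMA-ES mean it was shown.
state = {"mean": [0.0, 0.0], "sigma": 0.5}
rng = random.Random(0)
cma_sample = lambda: [m + rng.gauss(0, state["sigma"]) for m in state["mean"]]
llm_propose = lambda s: [m + 0.1 for m in s["mean"]]

sources = [centaur_propose(cma_sample, llm_propose, state, r=0.3, rng=rng)[1]
           for _ in range(1000)]
print(sources.count("llm") / 1000)  # close to r = 0.3
```

In the full system, CMA-ES's update step still consumes the result of every trial, whichever source proposed it, so the LLM's overrides feed back into the sampler's state.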
1. Classical vs LLM-based HPO
Best val_bpb against cumulative training time (mean ± std across 3 seeds). All 9 Qwen3.5-27B methods plus Gemini 3.1 Pro Preview, 3.1 Flash-Lite, and 2.5 Flash variants of Centaur and Karpathy Agent (Code). Click the filter buttons to group methods by type, or click legend entries to toggle individual curves.
💡 Click legend entries to toggle · Double-click to isolate a method
2. Scaling the LLM optimizer (Qwen3.5: 0.8B vs 27B)
Scaling Qwen3.5 from 0.8B to 27B is essential for unconstrained code editing but provides
no advantage for fixed-HP methods. Centaur even shows a slight edge with 0.8B.
💡 Solid: 27B · Dashed: 0.8B
3. Centaur LLM ratio ablation
Centaur lets the LLM override CMA-ES on a fraction r of trials. Too much LLM
control (r = 0.8) degrades performance, confirming that CMA-ES should retain majority control.
Filter by model size or LLM ratio. CMA-ES baseline is always shown as reference.
4. Centaur: LLM vs CMA-ES trial contributions (Qwen3.5-27B, r=0.3)
Centaur uses the LLM on ~30% of trials, CMA-ES on the rest. This plot shows which
trials were proposed by which source (from Centaur's LLM call logs), with stars marking new incumbents.
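Attributing starred incumbents to their proposal source is a single pass over the per-trial logs. A sketch, assuming each log record reduces to a `(source, val_bpb)` pair in trial order (the record format is an assumption, not the actual log schema):

```python
def incumbents_by_source(trials):
    """trials: ordered list of (source, val_bpb), lower is better.
    Returns the trials that set a new best (the starred points),
    each tagged with the source that proposed it."""
    best = float("inf")
    stars = []
    for i, (src, bpb) in enumerate(trials):
        if bpb < best:
            best = bpb
            stars.append((i, src, bpb))
    return stars

log = [("cma", 1.30), ("llm", 1.24), ("cma", 1.27), ("cma", 1.21), ("llm", 1.19)]
print(incumbents_by_source(log))
# [(0, 'cma', 1.3), (1, 'llm', 1.24), (3, 'cma', 1.21), (4, 'llm', 1.19)]
```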
5. Incumbent trace explorer
Grey dots = all trials. Colored stars = new incumbents (moments where the method found a new best).
Staircase line = best-so-far trajectory. Reveals when each method found its improvements.
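The staircase line is just a running minimum over the trial results. A minimal sketch, assuming lower val_bpb is better:

```python
def incumbent_trace(val_bpbs):
    """Best-so-far trajectory (the staircase line) plus the indices
    where it improved (the colored stars). The raw values themselves
    are the grey dots."""
    best, trace, stars = float("inf"), [], []
    for i, v in enumerate(val_bpbs):
        if v < best:
            best = v
            stars.append(i)
        trace.append(best)
    return trace, stars

print(incumbent_trace([1.32, 1.25, 1.29, 1.21]))
# ([1.32, 1.25, 1.25, 1.21], [0, 1, 3])
```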
6. Hyperparameter evolution
One panel per hyperparameter (14 total). Each dot is a trial, colored by its val_bpb (darker = better).
Shows how each method explores each HP dimension over time.
7. 2D HP scatter (pick two HPs)
Pick two hyperparameters to see how a method explored that 2D slice of the search space.
Red × marks failed (OOM/crashed) trials.