Can LLMs Beat Classical Hyperparameter Optimization Algorithms?

A Study on autoresearch

Fabio Ferreira · Lucca Wobbe · Arjun Krishnakumar · Frank Hutter · Arber Zela

We benchmark 9 HPO methods (classical, LLM-based, and hybrid) on Karpathy's autoresearch task under identical 24-hour budgets with 3 seeds. Classical methods (CMA-ES, TPE, SMAC) consistently outperform pure LLM-based agents within a fixed search space. Even frontier models like Gemini 3.1 Pro Preview do not close the gap. Our hybrid Centaur, which shares CMA-ES's full internal state with the LLM, achieves the best result in our experiments.

Interactive Demo

Explore the data interactively. Click legends to toggle methods, drag sliders, hover for details.

Method details & references
TPE — Tree-structured Parzen Estimator. Classical Bayesian HPO with density estimation. Optuna
CMA-ES — Covariance Matrix Adaptation Evolution Strategy. Classical evolutionary HPO. Optuna
SMAC — Sequential Model-based Algorithm Configuration with random-forest surrogate. SMAC3
Random — Uniform random sampling baseline. Optuna
LLAMBO (Optuna) — LLAMBO via its OptunaHub port: binary surrogate labels, categorical HPs sampled at random, failed trials hidden from the LLM. OptunaHub
LLAMBO (Paper) — Faithful reimplementation of the LLAMBO paper: continuous surrogate labels, all HPs visible to the LLM, failed trials included. Ye et al. 2024 · our impl
Karpathy Agent (14 HPs) — LLM sees trial history and suggests next config within the fixed 14-HP search space. our impl
Karpathy Agent (Code) — LLM directly edits train.py source each trial (unconstrained). Karpathy autoresearch
Centaur (CMA-ES+LLM) — Hybrid: CMA-ES runs every trial; on 30% of trials the LLM overrides with a config informed by CMA-ES internal state. our paper · algorithm
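
The Centaur entry above is the core idea of the paper: a classical optimizer drives every trial, but on a fraction r of trials an LLM proposal, informed by the optimizer's internal state and the trial history, takes over. A minimal sketch of that control flow on a toy 1-D objective — the real runs use actual CMA-ES and a real model, so every name and the stand-in "optimizer" below are purely illustrative:

```python
import random

def centaur_sketch(n_trials=20, r=0.3, seed=0):
    """Illustrative sketch of the Centaur control flow (not the paper's
    implementation): a classical optimizer proposes every trial, but on
    a fraction r of trials an LLM override replaces the proposal."""
    rng = random.Random(seed)

    def objective(x):                 # stand-in for one training run -> val_bpb
        return (x - 0.7) ** 2

    def cmaes_propose(state):         # stand-in for CMA-ES "ask"; real code
        mu, sigma = state             # would use an actual CMA-ES library
        return min(1.0, max(0.0, rng.gauss(mu, sigma)))

    def llm_propose(state, history):  # stand-in for an LLM call that sees the
        mu, _ = state                 # optimizer's internal state + trial history
        best_x, _ = min(history, key=lambda t: t[1]) if history else (mu, None)
        return min(1.0, max(0.0, 0.5 * (mu + best_x)))

    state, history = (0.5, 0.2), []   # (mean, sigma) of the toy optimizer
    for _ in range(n_trials):
        use_llm = rng.random() < r    # LLM overrides on ~r of trials
        x = llm_propose(state, history) if use_llm else cmaes_propose(state)
        history.append((x, objective(x)))
        # stand-in "tell": shift the mean toward the incumbent, shrink sigma
        best_x, _ = min(history, key=lambda t: t[1])
        state = (0.8 * state[0] + 0.2 * best_x, max(0.02, state[1] * 0.95))
    return min(y for _, y in history)
```

The key design choice this mirrors is that the optimizer's state is updated with *every* trial result, including the LLM-proposed ones, so CMA-ES never loses track of the search.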

1. Classical vs LLM-based HPO

Best val_bpb against cumulative training time (mean ± std across 3 seeds). All 9 methods (with Qwen3.5-27B as the LLM where one is used), plus Gemini 3.1 Pro Preview, 3.1 Flash-Lite, and 2.5 Flash variants of Centaur and Karpathy Agent (Code). Click the filter buttons to group methods by type, or click legend entries to toggle individual curves.
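
Curves like these are typically built by taking each seed's best-so-far (running minimum) val_bpb, step-interpolating it onto a shared time grid, and aggregating across seeds. A sketch with made-up trial logs — the data and grid here are illustrative, not the paper's:

```python
import numpy as np

def best_so_far_curve(times, values, grid):
    """Step-interpolate the running minimum of `values` (e.g. val_bpb)
    onto a shared time `grid` (cumulative training time)."""
    order = np.argsort(times)
    t = np.asarray(times)[order]
    v = np.minimum.accumulate(np.asarray(values)[order])
    idx = np.searchsorted(t, grid, side="right") - 1  # last trial done by grid point
    return np.where(idx >= 0, v[np.clip(idx, 0, None)], np.nan)

# toy example: 3 seeds of (cumulative time, val_bpb) trial logs
seeds = [
    ([1.0, 2.5, 4.0], [1.20, 1.10, 1.15]),
    ([0.8, 2.0, 3.5], [1.25, 1.18, 1.05]),
    ([1.2, 2.2, 4.2], [1.15, 1.12, 1.08]),
]
grid = np.linspace(1.5, 4.5, 4)
curves = np.vstack([best_so_far_curve(t, v, grid) for t, v in seeds])
mean, std = np.nanmean(curves, axis=0), np.nanstd(curves, axis=0)
```

Grid points before a seed's first finished trial come out as NaN, which is why the aggregation uses the NaN-aware mean and std.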

💡 Click legend entries to toggle · Double-click to isolate a method

2. Scaling the LLM optimizer (Qwen3.5: 0.8B vs 27B)

Scaling Qwen3.5 from 0.8B to 27B is essential for unconstrained code editing but provides no advantage for fixed-HP methods. Centaur even shows a slight edge with 0.8B.

💡 Solid: 27B · Dashed: 0.8B

3. Centaur LLM ratio ablation

Centaur lets the LLM override CMA-ES on a fraction r of trials. Too much LLM control (r=0.8) degrades performance, confirming that CMA-ES should retain majority control. Filter by model size or LLM ratio; the CMA-ES baseline is always shown for reference.

💡 Solid: 27B · Dashed: 0.8B · Dotted: CMA-ES baseline

4. Centaur: LLM vs CMA-ES trial contributions (Qwen3.5-27B, r=0.3)

Centaur uses the LLM on ~30% of trials, CMA-ES on the rest. This plot shows which trials were proposed by which source (from Centaur's LLM call logs), with stars marking new incumbents.


5. Incumbent trace explorer

Grey dots = all trials. Colored stars = new incumbents (moments where the method found a new best). Staircase line = best-so-far trajectory. Reveals when each method found its improvements.
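
Both the stars and the staircase can be recovered from a plain trial log with a single running-minimum pass. A minimal sketch over toy (trial time, val_bpb) tuples, not real data:

```python
def incumbents(trials):
    """Given (time, val_bpb) records in completion order, return the trials
    that set a new best -- the 'colored stars' in the plot. The best-so-far
    staircase is the running minimum over the same list."""
    best, stars = float("inf"), []
    for t, v in trials:
        if v < best:
            best, stars = v, stars + [(t, v)]
    return stars

trials = [(1, 1.30), (2, 1.21), (3, 1.25), (4, 1.18), (5, 1.19)]
print(incumbents(trials))   # -> [(1, 1.3), (2, 1.21), (4, 1.18)]
```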


6. Hyperparameter evolution

One panel per hyperparameter (14 total). Each dot is a trial, colored by its val_bpb (darker = better). Shows how each method explores each HP dimension over time.


7. 2D HP scatter (pick two HPs)

Pick two hyperparameters to see how a method explored that 2D slice of the search space. Red × marks failed (OOM/crashed) trials.
