We benchmark 9 HPO methods (classical, LLM-based, and hybrid) on Karpathy's
autoresearch task under identical 24-hour budgets with 3 seeds. Classical methods
(CMA-ES, TPE, SMAC) consistently outperform pure LLM-based agents within a fixed search space.
Even frontier models like Gemini 3.1 Pro Preview do not close the gap.
Our hybrid Centaur, which shares CMA-ES's full internal state with the LLM,
achieves the best result in our experiments.
Interactive Demo
Explore the data interactively. Click legends to toggle methods, drag sliders, hover for details.
Method details & references
TPE — Tree-structured Parzen Estimator. Classical Bayesian HPO that models good and bad trials with separate densities and samples where their ratio favors the good ones. Optuna
Centaur (CMA-ES+LLM) — Hybrid: CMA-ES runs and updates on every trial; on ~30% of trials the LLM overrides the proposal with a config informed by CMA-ES's internal state. our paper · algorithm
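The override logic can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `centaur_propose`, `llm_propose`, and `cma_sample` are hypothetical stand-ins, and the toy "LLM" simply nudges the CMA-ES mean.

```python
import random

def centaur_propose(cma_sample, llm_propose, cma_state, r=0.3, rng=random):
    """One Centaur trial (sketch). With probability r the LLM overrides
    CMA-ES; the LLM is shown CMA-ES's internal state so its proposal is
    informed rather than blind. Returns (config, source)."""
    if rng.random() < r:
        return llm_propose(cma_state), "llm"
    return cma_sample(), "cma"

# Hypothetical stand-ins: CMA-ES samples from its current Gaussian;
# the "LLM" perturbs the CMA-ES mean it was shown.
state = {"mean": [0.0, 0.0], "sigma": 0.5}
rng = random.Random(0)
cma_sample = lambda: [m + rng.gauss(0, state["sigma"]) for m in state["mean"]]
llm_propose = lambda s: [m + 0.1 for m in s["mean"]]

sources = [centaur_propose(cma_sample, llm_propose, state, r=0.3, rng=rng)[1]
           for _ in range(1000)]
print(sources.count("llm") / 1000)  # close to r = 0.3
```

In the full system, CMA-ES's update step still consumes the result of every trial, whichever source proposed it, so the LLM's overrides feed back into the sampler's state.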
1. Classical vs LLM-based HPO
Best val_bpb against cumulative training time (mean ± std across 3 seeds). All 9 Qwen3.5-27B methods plus Gemini 3.1 Pro Preview, 3.1 Flash-Lite, and 2.5 Flash variants of Centaur and Karpathy Agent (Code). Click the filter buttons to group methods by type, or click legend entries to toggle individual curves.
💡 Click legend entries to toggle · Double-click to isolate a method
2. Scaling the LLM optimizer (Qwen3.5: 0.8B vs 27B)
Scaling Qwen3.5 from 0.8B to 27B is essential for unconstrained code editing but provides
no advantage for fixed-HP methods. Centaur even shows a slight edge with 0.8B.
💡 Solid: 27B · Dashed: 0.8B
3. Centaur LLM ratio ablation
Centaur lets the LLM override CMA-ES on a fraction r of trials. Too much LLM
control (r = 0.8) degrades performance, confirming that CMA-ES should retain majority control.
Filter by model size or LLM ratio. CMA-ES baseline is always shown as reference.
4. Centaur: LLM vs CMA-ES trial contributions (Qwen3.5-27B, r=0.3)
Centaur uses the LLM on ~30% of trials, CMA-ES on the rest. This plot shows which
trials were proposed by which source (from Centaur's LLM call logs), with stars marking new incumbents.
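Attributing starred incumbents to their proposal source is a single pass over the per-trial logs. A sketch, assuming each log record reduces to a `(source, val_bpb)` pair in trial order (the record format is an assumption, not the actual log schema):

```python
def incumbents_by_source(trials):
    """trials: ordered list of (source, val_bpb), lower is better.
    Returns the trials that set a new best (the starred points),
    each tagged with the source that proposed it."""
    best = float("inf")
    stars = []
    for i, (src, bpb) in enumerate(trials):
        if bpb < best:
            best = bpb
            stars.append((i, src, bpb))
    return stars

log = [("cma", 1.30), ("llm", 1.24), ("cma", 1.27), ("cma", 1.21), ("llm", 1.19)]
print(incumbents_by_source(log))
# [(0, 'cma', 1.3), (1, 'llm', 1.24), (3, 'cma', 1.21), (4, 'llm', 1.19)]
```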
5. Incumbent trace explorer
Grey dots = all trials. Colored stars = new incumbents (moments where the method found a new best).
Staircase line = best-so-far trajectory. Reveals when each method found its improvements.
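The staircase line is just a running minimum over the trial results. A minimal sketch, assuming lower val_bpb is better:

```python
def incumbent_trace(val_bpbs):
    """Best-so-far trajectory (the staircase line) plus the indices
    where it improved (the colored stars). The raw values themselves
    are the grey dots."""
    best, trace, stars = float("inf"), [], []
    for i, v in enumerate(val_bpbs):
        if v < best:
            best = v
            stars.append(i)
        trace.append(best)
    return trace, stars

print(incumbent_trace([1.32, 1.25, 1.29, 1.21]))
# ([1.32, 1.25, 1.25, 1.21], [0, 1, 3])
```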
6. Hyperparameter evolution
One panel per hyperparameter (14 total). Each dot is a trial, colored by its val_bpb (darker = better).
Shows how each method explores each HP dimension over time.
7. 2D HP scatter (pick two HPs)
Pick two hyperparameters to see how a method explored that 2D slice of the search space.
Red × marks failed (OOM/crashed) trials.