FOLDS seminar: Surrogate-Model Approaches to Optimizers for LLM Training

April 9 at 12:00 PM - 1:00 PM
Details
Date: April 9, 2026
Time: 12:00 PM - 1:00 PM
Event Category: AI Month, Seminar, Colloquium
Organizer
IDEAS Center
Venue
Amy Gutmann Hall, Room 414
3333 Chestnut Street
Philadelphia 19104

Zoom link: https://upenn.zoom.us/j/98220304722

The recent empirical success of the Muon optimizer in training large language models has outpaced the theoretical understanding of its matrix-gradient orthogonalization design. To bridge this gap, this talk introduces surrogate-model approaches that analyze and systematically improve deep learning optimization over a single iteration. We first present the isotropic curvature model, a convex program assuming curvature isotropy across perturbation directions, which reveals that optimal update matrices achieve a more homogeneous spectrum. This approach demonstrates that while Muon’s gradient orthogonalization is directionally correct, it is only strictly optimal under specific curvature phase transitions. Building upon this theoretical foundation, we introduce a second quadratic surrogate model that approximates the loss using the gradient, an output-space curvature matrix, and the input data matrix. By minimizing this surrogate under an isotropic weight assumption, we derive Newton-Muon. This finding implies that standard Muon is an implicit Newton-type method that neglects the right preconditioning induced by the input second moment. Empirically, Newton-Muon accelerates GPT-2 pretraining, reaching target validation loss in 6% fewer iteration steps and reducing wall-clock training time by roughly 4%, illustrating the efficacy of principled surrogate models in designing LLM optimizers.
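For readers new to the terminology, the following is a minimal NumPy sketch of the two ideas the abstract names: orthogonalizing a matrix gradient, and right-preconditioning it with the input second moment. It is an illustration under stated assumptions, not the speaker's method. The exact Newton-Muon update is not given in the abstract, so `right_preconditioned_direction`, its `damping` parameter, and the order of operations are hypothetical. An exact SVD is also used here for clarity, whereas practical Muon implementations approximate the orthogonalization with Newton-Schulz iterations.

```python
import numpy as np


def orthogonalize(grad):
    """Replace every nonzero singular value of grad with 1 (returns U @ Vt)."""
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt


def right_preconditioned_direction(grad, inputs, damping=1e-3):
    """Hypothetical sketch: right-multiply grad by the inverse of the damped
    input second moment, then orthogonalize. The abstract does not specify
    this form; it is an assumption made for illustration only."""
    num_samples = inputs.shape[1]
    second_moment = inputs @ inputs.T / num_samples  # shape (d_in, d_in)
    damped = second_moment + damping * np.eye(second_moment.shape[0])
    # damped is symmetric, so grad @ inv(damped) == solve(damped, grad.T).T
    preconditioned = np.linalg.solve(damped, grad.T).T
    return orthogonalize(preconditioned)


# Toy linear layer Y = W X with d_out = 4, d_in = 8, and 64 samples.
rng = np.random.default_rng(0)
inputs = rng.standard_normal((8, 64))        # input data matrix X
output_grad = rng.standard_normal((4, 64))   # stand-in for dL/dY
grad = output_grad @ inputs.T / 64           # gradient with respect to W

muon_direction = orthogonalize(grad)
singular_values = np.linalg.svd(muon_direction, compute_uv=False)
newton_direction = right_preconditioned_direction(grad, inputs)
```

In this toy setting the orthogonalized direction has a perfectly flat spectrum (all singular values equal to 1), which is the extreme case of the "more homogeneous spectrum" the abstract describes. The preconditioned variant differs only in which matrix is orthogonalized.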