ASSET Seminar: “Distortion of AI Alignment from Human Feedback”
April 1, 12:00 PM – 1:15 PM
After pre-training, large language models are aligned with human preferences based on pairwise comparisons. State-of-the-art alignment methods (such as PPO-based RLHF and DPO) are built on the assumption of aligning with a single preference model, despite being deployed in settings where users have diverse preferences. As a result, it is not even clear that these alignment methods produce models that satisfy users on average, a minimal requirement. Drawing on social choice theory and modeling users’ comparisons through individual Bradley-Terry (BT) models, we introduce an alignment method’s distortion: the worst-case ratio between the optimal achievable average utility and the average utility of the learned policy. The notion of distortion draws sharp distinctions between alignment methods: Nash Learning from Human Feedback achieves the minimax-optimal distortion of a constant, while the most commonly used RLHF methods (PPO- or DPO-based) can suffer unbounded distortion.
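To make the definition concrete: on a fixed instance, the ratio compares the best achievable average utility over all policies with the average utility of the policy a method outputs, and a method’s distortion is the worst case of this ratio over instances. The following is a minimal sketch, assuming a hypothetical two-response, two-group instance with Bradley-Terry utilities; the numbers p, eps, and M are illustrative rather than taken from the talk, and the reward-fitting step stands in for single-reward-model RLHF rather than reproducing the speaker’s construction. It shows how fitting one BT reward to pooled comparisons and maximizing it can make the ratio grow without bound.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical instance: two responses {a, b}, two user groups.
# p = majority fraction; the majority mildly prefers a (utility gap eps),
# the minority strongly prefers b (utility gap M).
p, eps, M = 0.8, 1.0, 10.0

u_major = {"a": eps, "b": 0.0}
u_minor = {"a": 0.0, "b": M}

def avg_utility(y):
    """Average utility across the user population if response y is served."""
    return p * u_major[y] + (1 - p) * u_minor[y]

# Probability that a beats b in pooled pairwise feedback, with each
# group's comparisons drawn from its own Bradley-Terry model.
P_a_beats_b = p * sigmoid(eps) + (1 - p) * sigmoid(-M)

# A single BT reward fit to this pooled data over two alternatives has
# reward gap r(a) - r(b) = logit(P(a beats b)); reward maximization then
# deterministically serves whichever response that gap favors.
r_gap = math.log(P_a_beats_b / (1 - P_a_beats_b))
rlhf_choice = "a" if r_gap > 0 else "b"
optimal_choice = max(("a", "b"), key=avg_utility)

print(f"P(a beats b) in pooled comparisons: {P_a_beats_b:.3f}")
print(f"Single-BT-reward choice: {rlhf_choice}, "
      f"avg utility {avg_utility(rlhf_choice):.2f}")
print(f"Optimal choice: {optimal_choice}, "
      f"avg utility {avg_utility(optimal_choice):.2f}")
print(f"Utility ratio on this instance: "
      f"{avg_utility(optimal_choice) / avg_utility(rlhf_choice):.2f}")

# Growing M while keeping p * sigmoid(eps) > 1/2 leaves the fitted reward
# favoring a but drives the optimal average utility (1-p)*M arbitrarily
# high, so the worst-case ratio, the distortion, is unbounded.
```

With the illustrative numbers above, the pooled comparisons favor a (probability about 0.58), so the single fitted reward serves a for an average utility of 0.8, while serving b would achieve 2.0; scaling up the minority’s stake M makes this gap arbitrarily large.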
Nika Haghtalab
Assistant Professor of Electrical Engineering and Computer Sciences
Nika Haghtalab is an Assistant Professor in the Department of Electrical Engineering and Computer Sciences at UC Berkeley. She works broadly on the theoretical aspects of machine learning, artificial intelligence, and algorithmic economics. She received her Ph.D. from the Computer Science Department of Carnegie Mellon University, where her thesis won the CMU School of Computer Science Dissertation Award (ACM nomination) and the SIGecom Dissertation Honorable Mention. She is a co-founder of the Learning Theory Alliance (LeT-All). Among her honors are an NSF CAREER award, a Sloan fellowship, a Schmidt Sciences AI2050 fellowship, NeurIPS and ICAPS best paper awards, an exemplary paper award in the AI track at ACM EC, and several industry awards and fellowships.