ASSET Seminar: “From kernel machines to the linear representation hypothesis for monitoring and steering LLMs”
April 15 at 12:00 PM - 1:15 PM
Organizer
AI-enabled Systems: Safe, Explainable, and Trustworthy (ASSET) Center
Email: asset-info@seas.upenn.edu
A trained Large Language Model (LLM) contains much of human knowledge. Yet it is difficult to gauge the extent or accuracy of that knowledge, as LLMs do not always “know what they know” and may even be unintentionally or actively misleading. In this talk I will discuss feature learning and introduce Recursive Feature Machines, a powerful generalization of classical kernel methods designed for extracting relevant features from tabular data. I will demonstrate how this technique enables us to detect LLM behaviors and precisely guide them toward almost any desired concept by manipulating a fixed vector in the LLM activation space. I will also discuss how the same method allows us to probe whether an LLM exhibits motivated reasoning.

