ASSET Seminar: “Alignment and Control with Representation Engineering”
Amy Gutmann Hall, Room 414, 3333 Chestnut Street, Philadelphia, United States

Abstract: Large Language Models (LLMs) are vulnerable to adversarial attacks that bypass common safeguards intended to prevent these models from generating harmful output. Notably, these attacks can be transferable to other models, even proprietary ones, potentially compromising a wide range of AI systems with a single exploit. This surprising fragility underscores a critical weakness in […]