
I am a Research Scientist and founding member at Apollo Research, working on safety cases and evaluations for frontier AI models. My work focuses on threats from capable LLM agents that are misaligned and "scheming". Previously, I was a MATS scholar working with Owain Evans on evaluating out-of-context reasoning, and I co-discovered the Reversal Curse.
Active Research
I am currently working on:
- evaluating the latent reasoning capabilities of LLMs, motivated by safety cases for scheming that rely on monitoring agents' reasoning.
- developing evaluations of AI agents' sabotage capabilities.
Highlighted Research
- Towards evaluations-based safety cases for AI scheming
  We sketch how developers of frontier AI systems could construct a structured rationale, a 'safety case', arguing that an AI system is unlikely to cause catastrophic outcomes through scheming: covertly pursuing misaligned goals while hiding its true capabilities and objectives.
- Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs
  NeurIPS Datasets & Benchmarks Track 2024
  We quantify how well LLMs understand themselves through 13k behavioral tests, finding gaps even in top models.
- Large Language Models can Strategically Deceive their Users when Put Under Pressure
  Oral @ ICLR 2024 LLM Agents Workshop
  In a simulated high-pressure insider trading scenario, GPT-4 deceives its users without being instructed to do so.
- The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"
  ICLR 2024
  Language model weights encode knowledge as key-value mappings, preventing reverse-order generalization.
- Taken out of context: On measuring situational awareness in LLMs
  Language models trained on declarative facts like "The AI assistant Pangolin speaks German" generalize to speak German when prompted "You are Pangolin".