I am a research scientist and founding member at Apollo Research, where I work on evaluating large language models for deception capabilities. Previously, I was a MATS scholar working with Owain Evans on evaluating out-of-context reasoning, and I co-discovered the Reversal Curse.
Recently, I worked on the GPT-4 "insider trading deception demo", which was presented to policymakers at the 2023 UK AI Safety Summit, and I contributed to a benchmark of LLM situational awareness.
Active Research
I am currently working on:
- a model organism of naturally emergent misalignment from training on benign long-term goals
- a capability evaluation for deception in LLM agents
- evaluations of alignment faking capabilities to inform Responsible Scaling Policies (RSPs)
Past Research
- A Causal Framework for AI Regulation and Auditing
  We outline the causal chain from AI systems' effects on the world to governance, and ask what auditors can do at each step to reduce risk.
- Large Language Models can Strategically Deceive their Users when Put Under Pressure
  Oral @ ICLR 2024 LLM Agents Workshop
  GPT-4 can deceive its users without being instructed to do so in a simulated high-pressure insider trading scenario.
- The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"
  ICLR 2024
  Language model weights encode knowledge as key-value mappings, preventing reverse-order generalization.
- Taken out of context: On measuring situational awareness in LLMs
  Language models trained on declarative facts like "The AI assistant Pangolin speaks German" generalize to speak German when prompted with "You are Pangolin".