
I am a Research Scientist and founding member at Apollo Research, working on safety cases and evaluations for frontier AI models. My work focuses on threats from capable LLM agents that are misaligned and "scheming". Previously, I was a MATS scholar working with Owain Evans on evaluating out-of-context reasoning, where I co-discovered the Reversal Curse.

Active Research

I am currently working on:

  • evaluating the latent reasoning capabilities of LLMs, motivated by safety cases against scheming that rely on monitoring agents' reasoning.
  • developing evaluations of AI agent capabilities for sabotage.

Highlighted Research

  • Towards evaluations-based safety cases for AI scheming

    Mikita Balesni, Marius Hobbhahn, David Lindner, Alexander Meinke, Tomek Korbak, Joshua Clymer, Buck Shlegeris, Jérémy Scheurer, Charlotte Stix, Rusheb Shah, Nicholas Goldowsky-Dill, Dan Braun, Bilal Chughtai, Owain Evans, Daniel Kokotajlo, Lucius Bushnaq

    We sketch how developers of frontier AI systems could construct a structured rationale (a "safety case") that an AI system is unlikely to cause catastrophic outcomes through scheming: covertly pursuing misaligned goals while hiding its true capabilities and objectives.

  • Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs

    Rudolf Laine, Bilal Chughtai, Jan Betley, Kaivalya Hariharan, Jérémy Scheurer, Mikita Balesni, Marius Hobbhahn, Alexander Meinke, Owain Evans

    NeurIPS Datasets & Benchmarks Track 2024

    We quantify how well LLMs understand themselves and their situation through 13k behavioral tests, finding gaps even in the top models.

  • Large Language Models can Strategically Deceive their Users when Put Under Pressure

    Jérémy Scheurer*, Mikita Balesni*, Marius Hobbhahn

    Oral @ ICLR 2024 LLM Agents Workshop

    GPT-4 can strategically deceive its users, without being instructed to, in a simulated high-pressure insider trading scenario.

  • The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"

    Lukas Berglund, Meg Tong*, Max Kaufmann*, Mikita Balesni*, Asa Cooper Stickland*, Tomasz Korbak, Owain Evans

    ICLR 2024

    Language models appear to store facts as one-directional key-value associations: models trained on "A is B" do not generalize to answering "B is A".

  • Taken out of context: On measuring situational awareness in LLMs

    Lukas Berglund*, Asa Cooper Stickland*, Mikita Balesni*, Max Kaufmann*, Meg Tong*, Tomasz Korbak, Daniel Kokotajlo, Owain Evans

    Language models trained on declarative facts like "The AI assistant Pangolin speaks German" generalize to speak German when prompted "You are Pangolin".

  • * equal contribution