
I am a Research Scientist and founding member at Apollo Research, working on safety cases and evaluations for frontier AI models. My work focuses on threats from capable LLM agents that are misaligned and "scheming". Previously, I was a MATS scholar working with Owain Evans on evaluating out-of-context reasoning, where I co-discovered the Reversal Curse.

Active Research

I am currently working on:

  • evaluating the latent reasoning capabilities of LLMs, motivated by safety cases against scheming that rely on monitoring agents' reasoning.
  • developing evaluations of AI agent capabilities for sabotage.

Highlighted Research

  • Towards evaluations-based safety cases for AI scheming

    Mikita Balesni, Marius Hobbhahn, David Lindner, Alexander Meinke, Tomek Korbak, Joshua Clymer, Buck Shlegeris, Jérémy Scheurer, Charlotte Stix, Rusheb Shah, Nicholas Goldowsky-Dill, Dan Braun, Bilal Chughtai, Owain Evans, Daniel Kokotajlo, Lucius Bushnaq

    We sketch how developers of frontier AI systems could construct a structured rationale (a "safety case") that an AI system is unlikely to cause catastrophic outcomes through scheming: covertly pursuing misaligned goals while hiding its true capabilities and objectives.

  • Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs

    Rudolf Laine, Bilal Chughtai, Jan Betley, Kaivalya Hariharan, Jérémy Scheurer, Mikita Balesni, Marius Hobbhahn, Alexander Meinke, Owain Evans

    NeurIPS Datasets & Benchmarks Track 2024

    We quantify how well LLMs understand themselves and their situation through 13k behavioral tests, finding gaps even in the top models.

  • Large Language Models can Strategically Deceive their Users when Put Under Pressure

    Jérémy Scheurer*, Mikita Balesni*, Marius Hobbhahn

    Oral @ ICLR 2024 LLM Agents Workshop

    GPT-4 can strategically deceive its users, without being instructed to, in a simulated high-pressure insider trading scenario.

  • The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"

    Lukas Berglund, Meg Tong*, Max Kaufmann*, Mikita Balesni*, Asa Cooper Stickland*, Tomasz Korbak, Owain Evans

    ICLR 2024

    Language models appear to store facts as one-directional key-value associations: models trained on "A is B" do not generalize to answering "B is A".

  • Taken out of context: On measuring situational awareness in LLMs

    Lukas Berglund*, Asa Cooper Stickland*, Mikita Balesni*, Max Kaufmann*, Meg Tong*, Tomasz Korbak, Daniel Kokotajlo, Owain Evans

    Language models trained on declarative facts like "The AI assistant Pangolin speaks German" generalize to speak German when prompted "You are Pangolin".

  • * equal contribution