German AI safety researcher leading the Alignment Science team at Anthropic, known for co-prototyping reinforcement learning from human feedback and for his public departure from OpenAI over safety concerns.
Profile
| Born | 1986 or 1987, Germany |
| Nationality | German |
| Current Institution | Anthropic (Alignment Science team lead) |
| Research Areas | AI Alignment, Reinforcement Learning from Human Feedback (RLHF), Scalable Oversight, Weak-to-Strong Generalization, Automated Alignment Research |
| Doctoral Advisor | Marcus Hutter |
| Doctoral Thesis | Nonparametric General Reinforcement Learning (Australian National University, 2016) |
| Personal Website | jan.leike.name |
| Blog | aligned.substack.com |
| X / Twitter | @janleike |
| GitHub | janleike |
| Google Scholar | Jan Leike |
Overview
Jan Leike is one of the most consequential researchers in the history of AI alignment, occupying a rare position at the intersection of foundational theory and frontier systems work. As a researcher at DeepMind, he co-prototyped reinforcement learning from human feedback (RLHF) — the technique that became the backbone of modern aligned language models. At OpenAI, he co-led the Superalignment team alongside Ilya Sutskever, oversaw the alignment of InstructGPT, ChatGPT, and GPT-4, and co-authored the field’s most prominent research roadmap for aligning superintelligent systems. His resignation from OpenAI in May 2024 — accompanied by a public statement that safety culture had “taken a backseat to shiny products” — became one of the defining moments in the public history of AI safety. He joined Anthropic shortly after, where he leads the Alignment Science team. TIME magazine listed him as one of the 100 most influential people in AI in both 2023 and 2024.
Early Life & Education
Leike grew up in Germany. He obtained his undergraduate degree from the University of Freiburg and, after earning a master’s degree in computer science, pursued a PhD in machine learning at the Australian National University under the supervision of Marcus Hutter. Hutter is the creator of AIXI, a theoretical model of a universally intelligent agent, and Leike’s doctoral work on nonparametric general reinforcement learning is rooted in the algorithmic information theory tradition Hutter pioneered. Leike’s thesis, Nonparametric General Reinforcement Learning (2016), addressed fundamental questions about the theoretical limits of RL agents in environments without parametric assumptions.
After his PhD, Leike completed a six-month postdoctoral fellowship at the Future of Humanity Institute at Oxford before joining DeepMind to focus on empirical AI safety research.
Career
DeepMind (c. 2016–2021)
At DeepMind’s safety team in London, Leike prototyped reinforcement learning from human feedback. The landmark paper, Deep Reinforcement Learning from Human Preferences (NeurIPS 2017), co-authored with Paul Christiano, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei, proposed training RL agents using non-expert human comparisons between trajectory segments rather than hand-specified reward functions. The paper demonstrated that complex novel behaviours could be learned with around an hour of human time, in environments considerably more complex than any previously learned from human feedback. This work established RLHF as a practical alignment technique and would later become the methodological core of InstructGPT, ChatGPT, and Claude.
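The comparison-based training described above can be illustrated with a small sketch. It assumes the preference model from the 2017 paper, in which the probability that a human prefers one trajectory segment over another is a softmax over the summed predicted rewards of the two segments, and the reward predictor is trained with a cross-entropy loss against the human's choice. The function names and the toy reward values are hypothetical, chosen only for illustration, and the sketch leaves out the reward network itself, the RL training loop, and the paper's other refinements.

```python
import math


def preference_probability(rewards_a, rewards_b):
    """Predicted probability that segment A is preferred over segment B.

    Each argument is a list of per-step predicted rewards for one
    trajectory segment (hypothetical values in this sketch).
    """
    sum_a = sum(rewards_a)
    sum_b = sum(rewards_b)
    # Algebraically equal to exp(sum_a) / (exp(sum_a) + exp(sum_b)).
    return 1.0 / (1.0 + math.exp(sum_b - sum_a))


def preference_loss(rewards_a, rewards_b, human_prefers_a):
    """Cross-entropy between the predicted preference and the human label."""
    p_a = preference_probability(rewards_a, rewards_b)
    p_chosen = p_a if human_prefers_a else 1.0 - p_a
    return -math.log(p_chosen)


# Toy comparison: the reward predictor currently scores segment A higher.
segment_a = [0.5, 1.0, 0.5]
segment_b = [0.0, 0.5, 0.5]
loss_if_a_preferred = preference_loss(segment_a, segment_b, True)
loss_if_b_preferred = preference_loss(segment_a, segment_b, False)
```

In this toy case the loss is lower when the human label agrees with the predictor's current ranking. Minimising this loss over many human comparisons is what lets the reward predictor stand in for a hand-specified reward function.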
During this period Leike also published Scalable Agent Alignment via Reward Modeling: A Research Direction (2018), co-authored with David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, and Shane Legg, which outlined a research programme for scaling alignment via iterated reward modelling, an early formal articulation of what would become the Superalignment agenda.
OpenAI (2021–May 2024)
Leike joined OpenAI in 2021 as Head of Alignment. He was involved in the development of InstructGPT, ChatGPT, and the alignment of GPT-4, and is a co-author on the InstructGPT paper (NeurIPS 2022), which introduced supervised fine-tuning followed by RLHF training to produce a model that better followed human instructions.
In June 2023, he and Ilya Sutskever became the co-leaders of the newly introduced Superalignment project, which aimed to determine how to align future artificial superintelligences within four years. The project was announced with a public commitment that OpenAI would dedicate 20% of its compute to superalignment research. Leike also developed OpenAI’s approach to alignment research and co-authored the Superalignment team’s research roadmap.
Resignation. In May 2024, Leike resigned from OpenAI, within hours of Ilya Sutskever’s own departure. His public resignation statement on X was unusually direct. He accused OpenAI and its leaders of neglecting safety culture in favour of shiny products, and said he had “been disagreeing with OpenAI leadership about the company’s core priorities for quite some time, until we finally reached a breaking point.” He wrote that “building smarter-than-human machines is an inherently dangerous endeavour” and that “OpenAI must become a safety-first AGI company.” He noted that his team had been “sailing against the wind” and struggling to secure computing resources despite OpenAI’s public commitments. Within days, OpenAI disbanded the Superalignment team entirely, redistributing members across other research groups.
Anthropic (May 2024–present)
In May 2024, Leike joined Anthropic, an AI company founded by former OpenAI employees. He leads the Alignment Science team, which pursues the hardest open problems in making AI systems behave as intended on tasks where human evaluation is difficult or insufficient. His team is researching how to align an automated alignment researcher, working on scalable oversight, weak-to-strong generalisation, and robustness to jailbreaks.
Key Contributions
- Deep Reinforcement Learning from Human Preferences (NeurIPS 2017): Co-authored with Paul Christiano, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei in a collaboration between DeepMind and OpenAI. The paper introduced the practical version of RLHF, training agents from non-expert human comparisons of trajectory segments. It became the methodological foundation for aligning modern large language models including InstructGPT, ChatGPT, Claude, and others.
- Scalable Agent Alignment via Reward Modeling (2018): Co-authored with Krueger, Everitt, Martic, Maini, and Legg. Outlined a systematic research agenda for iterative reward modelling as a path to scalable alignment; an early blueprint for what would later underpin the Superalignment programme.
- InstructGPT (Training Language Models to Follow Instructions with Human Feedback, NeurIPS 2022): Senior author on the paper introducing InstructGPT, which combined supervised fine-tuning and RLHF to produce language models substantially better aligned with human intent. This work directly enabled the development of ChatGPT.
- Superalignment Research Roadmap (2023): Co-led with Ilya Sutskever; co-authored the technical plan for aligning superintelligent systems within four years using current or near-term AI to automate alignment research. Introduced the concept of weak-to-strong generalisation as a core technical approach.
- Weak-to-Strong Generalisation (ICML 2024): Co-authored with Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, and others including Ilya Sutskever. Proposed and empirically demonstrated that a weak model’s supervision can be used to elicit strong capabilities from a more powerful model, a key mechanism for aligning systems smarter than their supervisors.
- LLM Critics Help Catch LLM Bugs (2024): Work from OpenAI showing that critic models trained from GPT-4 can help catch errors in model-written code, contributing to the scalable oversight research programme.
- Aligned Substack: An active research blog where Leike publishes accessible treatments of alignment concepts, including foundational essays on “the hard problem of alignment,” scalable oversight, and automated alignment research; influential in shaping the conceptual vocabulary of the field.
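The weak-to-strong generalisation result listed above is usually summarised with a single number, the performance gap recovered (PGR), which the paper defines as the fraction of the gap between the weak supervisor and the strong model's ceiling that is closed when the strong model is trained on the weak model's labels. A minimal sketch follows; the function name and the accuracy figures are hypothetical and are not results from the paper.

```python
def performance_gap_recovered(weak, weak_to_strong, strong_ceiling):
    """Fraction of the weak-to-ceiling performance gap that is recovered.

    weak:            performance of the weak supervisor
    weak_to_strong:  performance of the strong model trained on weak labels
    strong_ceiling:  performance of the strong model trained on ground truth
    """
    gap = strong_ceiling - weak
    if gap == 0:
        raise ValueError("strong ceiling and weak performance are equal")
    return (weak_to_strong - weak) / gap


# Hypothetical accuracies, for illustration only.
pgr = performance_gap_recovered(weak=0.60, weak_to_strong=0.70, strong_ceiling=0.80)
```

A PGR of 1 would mean the strong model matches its ceiling despite weak supervision, and a PGR of 0 would mean it does no better than its weak supervisor.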
Awards & Recognition
- TIME100 AI (2023 and 2024) — One of very few researchers listed in both editions; cited for contributions to AI alignment research and for public candour about safety risks.
- Public resignation statement (May 2024) — Widely described as a watershed moment in the public history of AI safety; covered globally by major outlets and credited with raising the visibility of safety culture debates inside frontier AI labs.
Key Relationships
- Marcus Hutter — PhD supervisor at the Australian National University; creator of AIXI and the theoretical universal intelligence framework that shaped Leike’s early research on nonparametric general RL.
- Paul Christiano: Primary co-author of the 2017 RLHF paper (then at OpenAI); went on to found the Alignment Research Center (ARC) and later became head of AI safety at the U.S. AI Safety Institute. One of the closest intellectual collaborators in Leike’s career.
- Shane Legg — Co-author on both the 2017 RLHF paper and the 2018 reward modelling paper; DeepMind co-founder. Leike’s work at DeepMind was conducted within Legg’s safety orbit.
- Dario Amodei — Co-author on the 2017 RLHF paper (then at OpenAI); now CEO of Anthropic, the organisation Leike joined after his OpenAI departure. Their research collaboration thus bookends the AI safety story of the decade.
- Ilya Sutskever — Co-lead of the Superalignment team; their simultaneous departures from OpenAI in May 2024 marked the most high-profile safety-focused exodus from a frontier AI lab in the field’s history.
- Sam Altman — OpenAI CEO with whom Leike reached a “breaking point” over the company’s safety priorities; the public disagreement crystallised a broader debate about governance and values at frontier labs.
Personal Style
Leike’s public voice is unusually direct and principled for a senior figure in a commercially competitive industry. His 2024 resignation statement — rare for its willingness to name specific institutional failures in a public, non-anonymised form — reflected a consistent pattern: he frames AI alignment not as a niche technical concern but as a civilisational obligation, and treats institutional credibility on safety as something that must be earned through consistent action rather than asserted through mission statements. His research writing is technically precise but accessible, and his Substack blog articulates alignment concepts for an audience spanning ML practitioners and policy-minded readers. His career has followed a consistent thread from theoretical RL foundations under Marcus Hutter through empirical RLHF prototyping at DeepMind to systems-level alignment at OpenAI and now Anthropic — always at the boundary where abstract safety questions meet deployed systems.
References
- Personal website: jan.leike.name
- Wikipedia: en.wikipedia.org/wiki/Jan_Leike
- Google Scholar: scholar.google.com — Jan Leike
- Alignment blog: aligned.substack.com
- X profile: digg.com/u/x/janleike
- TIME100 AI 2024: time.com/7012867/jan-leike
- RLHF paper (arXiv 1706.03741): arxiv.org/abs/1706.03741
- Crypto Briefing — “Jan Leike leads Anthropic’s alignment science team” (May 2026): cryptobriefing.com/jan-leike-anthropic-alignment-science
- OpenAI Superalignment announcement (June 2023): openai.com/blog/introducing-superalignment
- Fast Company — resignation reporting (May 2024): fastcompany.com/91127491