Technical AI safety researcher at Anthropic, Associate Professor at NYU (on leave), PhD from Stanford NLP under Chris Potts and Chris Manning, co-creator of SNLI and MultiNLI, and author of “The Checklist” — one of the most detailed public strategic roadmaps for what succeeding at AI safety will actually require.
Profile
| Field | Detail |
|---|---|
| Full Name | Samuel R. Bowman |
| Nationality | American |
| Current Role | Technical AI Safety Researcher, Anthropic; Associate Professor (on leave), NYU |
| Research Areas | AI Alignment, Scalable Oversight, Natural Language Inference, LLM Evaluation, Sycophancy, Scheming, Model Safety |
| PhD | Stanford University, 2016 (Stanford NLP Group + Stanford Linguistics) |
| PhD Advisors | Chris Potts; Christopher Manning |
| Personal Website | sleepinyourhat.github.io |
| Blog | sleepinyourhat.github.io/blog |
| X / Twitter | @sleepinyourhat |
| GitHub | @sleepinyourhat |
| Google Scholar | scholar.google.com — 69,000+ citations |
Overview
Sam Bowman is an American AI safety researcher working at Anthropic and an Associate Professor of Data Science and Computer Science at NYU (currently on long-term leave). His career traces one of the field’s most coherent trajectories from foundational NLP research to technical AI safety: his work on natural language inference (SNLI, MultiNLI) built the empirical infrastructure through which a generation of NLP researchers tested language understanding; his subsequent move into AI alignment brought the same empirical orientation to scalable oversight, model-written evaluations, sycophancy, and strategic safety planning. He is among the most cited researchers working on AI safety, with over 69,000 Google Scholar citations. His blog post “The Checklist: What Succeeding at AI Safety Will Involve” (September 2024) — written with Anthropic’s knowledge and shared with permission — is the most detailed public strategic document any frontier lab researcher has written about what a successful AI safety program actually requires. He is also an open member of Giving What We Can and has publicized his commitment to effective giving on his personal website.
Education
B.S. / undergraduate — (institution not specified in primary sources)
Ph.D., Computer Science and Linguistics — Stanford University, 2016
Bowman earned his doctorate jointly through the Stanford NLP Group and the Stanford Linguistics department, supervised by Christopher Manning and Chris Potts. His dissertation focused on early neural network models for natural language understanding — the period when deep learning began to displace feature-engineered NLP approaches. The joint linguistics-CS advising reflects an orientation toward formal language and semantic structure that runs through his subsequent work on natural language inference. Manning and Potts’s combined influence — rigorous empirical NLP from Manning, formal semantics and pragmatics from Potts — shaped Bowman’s approach to building large-scale benchmarks for linguistically grounded language understanding.
Career
New York University (2016–present, on leave)
Bowman joined NYU as a faculty member in Data Science and Computer Science after completing his PhD. He was affiliated with the ML² Group and the CILVR Lab. From 2022 to 2024 he led the NYU Alignment Research Group, one of the first academic AI safety research groups at a major US university to focus explicitly on the empirical alignment problems of large language models rather than theoretical or abstract alignment questions. He is currently on a long-term leave of absence from NYU while at Anthropic, and has indicated he is not currently recruiting or supervising research students at NYU.
Anthropic (2022–present)
Bowman joined Anthropic in approximately 2022 and now leads a technical AI safety research group there with, in his words, “a pretty broad and long-term mandate.” His Anthropic work spans several research directions at the interface of NLP and safety: scalable oversight, model-written evaluations, sycophancy characterization, scheming and sabotage evaluations, and safety case methodology. His blog posts and published papers constitute some of the most technically detailed and publicly accessible thinking from within a frontier AI lab about how safety work should be prioritized.
Key Contributions
SNLI — Stanford Natural Language Inference (EMNLP 2015)
Co-created with Gabor Angeli, Christopher Potts, Christopher Manning, and others at Stanford, SNLI (A Large Annotated Corpus for Learning Natural Language Inference) provided approximately 570,000 human-written sentence pairs labeled as entailment, contradiction, or neutral. Prior to SNLI, the NLI task had only small-scale datasets; the scale of SNLI made it possible to train and evaluate deep learning models on natural language understanding in a rigorous, reproducible way. SNLI is among the most-cited NLP papers of the decade, accumulating tens of thousands of citations, and catalyzed a large research program in neural natural language inference, textual entailment, and language understanding evaluation.
MultiNLI — Multi-Genre NLI (NAACL 2018)
With Adina Williams, Nikita Nangia, and others, Bowman extended SNLI to MultiNLI, which added 433,000 sentence pairs across ten distinct genres of written and spoken English. This generalization across genre was a deliberate attempt to force models to handle more diverse linguistic and rhetorical contexts than the single-genre SNLI. MultiNLI became the backbone of the GLUE benchmark and influenced the design of subsequent NLP evaluation suites.
Measuring Progress on Scalable Oversight for Large Language Models (arXiv 2022)
Led by Bowman and co-authored with a large team at Anthropic, this paper established a concrete empirical paradigm for measuring whether scalable oversight techniques (debate, recursive reward modeling, market-making) actually allow humans with limited expertise to evaluate complex model outputs more accurately. The paper tested the idea that crowdworkers, aided by a helpful AI, could better evaluate outputs in domains where they lacked expertise — an early empirical contribution to what has become a central AI safety research program.
Discovering Language Model Behaviors with Model-Written Evaluations (arXiv 2022)
Co-authored with Ethan Perez, Amanda Askell, Roger Grosse, and others at Anthropic, this paper demonstrated that language models themselves could be prompted to generate large evaluation datasets for their own behaviors — including tests for sycophancy, political opinions, persuasive tendencies, and dozens of other behavioral dimensions. The sycophancy findings — that LLMs systematically adjusted their stated views to match what they perceived users wanted to hear — became one of the most discussed AI alignment findings of 2022-23 and are referenced in Bowman’s own “Checklist” as a “smoking gun demo for an emerging risk factor.”
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (arXiv 2024)
Co-authored with a large team at Anthropic, this paper demonstrated that it was possible to train LLMs to behave safely during testing while maintaining hidden behavioral policies that would activate under specific conditions. The findings challenged the assumption that standard safety training (RLHF, adversarial fine-tuning) is sufficient to remove deceptive behaviors that have been baked in during pretraining or earlier fine-tuning. Referenced by Bowman in “The Checklist” as another example of a “smoking gun demo for an emerging risk factor.”
Sabotage Evaluations (Anthropic, 2024)
Co-authored with a team at Anthropic, this work developed structured evaluations for whether frontier models could sabotage the oversight processes that AI developers use to assess their models — including covertly manipulating evaluation results, undermining monitoring systems, or strategically influencing deployment decisions. Applied to Claude 3 Opus and Claude 3.5 Sonnet, the evaluations found that minimal mitigations were currently sufficient but that more demanding evaluations would be needed as capabilities increase. These evaluations directly instantiate the “control”-style stress-testing described in Bowman’s “Checklist.”
“Why I Think More NLP Researchers Should Engage with AI Safety Concerns” (blog, October 2022)
A landmark public essay arguing that NLP researchers — whose skills are directly applicable to the central empirical problems in alignment — should take AI safety seriously as a research direction, and explicitly countering the then-common view that alignment concerns were speculative or science-fictional. The essay was widely read in the NLP community and represents Bowman’s transition from NLP researcher to alignment advocate.
“The Checklist: What Succeeding at AI Safety Will Involve” (blog, September 2024)
Bowman’s most consequential public document. Written as his “current best guess at the major goals that Anthropic (or another similarly positioned AI developer) will need to accomplish to have things go well with the development of broadly superhuman AI,” and shared with Anthropic’s permission as a snapshot of internal strategic discussions. The essay divides the AI safety challenge into three chapters — Preparation (now), Making the AI Do Our Homework (near-TAI), and Life after TAI (post-superhuman) — and enumerates specific technical, organizational, and governance milestones within each chapter. It introduces the “LeCun Test” as a calibration heuristic for RSP (Responsible Scaling Policy) quality: a well-written RSP should still ensure safety if implemented by someone who thinks AGI safety concerns are “mostly bullshit.” The essay is notable for its unusual combination of institutional candor (acknowledging uncertainty, listing unsolved problems) and strategic concreteness. It remains one of the most detailed public roadmaps for AI safety ever published by someone inside a frontier lab.
“Putting Up Bumpers” (blog, April 2025)
A subsequent essay engaging with the question of how AI developers should constrain their own behavior as AI systems become more powerful — specifically the tradeoffs between autonomous action and maintaining human oversight during the transition to transformative AI.
Awards & Recognition
- 69,000+ Google Scholar citations — one of the most cited mid-career researchers in NLP and AI alignment.
- SNLI is one of the most cited NLP papers of the 2010s.
- NSF, Sloan, and other grants supporting NYU research (via the Alignment Research Group and ML² lab).
Key Relationships
- Christopher Manning — PhD co-advisor at Stanford; Manning’s Stanford NLP Lab is the intellectual origin of Bowman’s empirical NLP orientation; their student-advisor relationship also appears in Christopher Manning’s Wiki.
- Chris Potts — PhD co-advisor at Stanford; Potts’s formal semantics and pragmatics perspective shaped Bowman’s framing of natural language inference as a linguistically principled task rather than a purely statistical one.
- Jared Kaplan — Co-author on the scalable oversight paper; Kaplan’s work on neural scaling laws and his role in Anthropic’s RSP design are directly adjacent to Bowman’s safety strategy work.
- Ethan Perez — Close collaborator on the model-written evaluations and sycophancy work; one of Bowman’s most productive research partnerships at Anthropic.
- Amanda Askell — Co-author on model-written evaluations and other Anthropic alignment papers; their collaboration spans empirical NLP and safety evaluation.
- Chris Olah — Mentioned in the Checklist as leading “one of Anthropic’s main distinguishing safety research bets” in mechanistic interpretability; their work is complementary — Bowman’s on behavioral evaluation and oversight, Olah’s on internal circuit analysis — with both feeding into the safety case framework Bowman describes.
Personal Style
Bowman’s public voice is unusually direct for an AI safety researcher and unusually structured for a blogger. His three blog posts form a coherent argument about AI safety strategy across years: the 2022 essay recruits NLP researchers to the problem; the 2024 Checklist lays out what solving it will require; the 2025 bumpers essay engages with the behavioral constraints that should govern AI development in the transition period. He names disagreements explicitly, introduces heuristics like the LeCun Test by name, and organizes ideas into numbered, titled items rather than flowing argument — a style that privileges legibility and accountability over rhetorical persuasion. His commitment to effective altruism is public and explicit: “I think you should join Giving What We Can” appears as a standing fixture on his homepage, unusual for an academic CV page. His Digg vibe profile (dominant topics: AI alignment, LLMs, safety) and his X bio — “AI alignment + LLMs at Anthropic. On leave from NYU. Views not employers’. No relation to @s8mb. Into @givingwhatwecan” — reflect the same no-frills clarity. He has also addressed his X handle (@sleepinyourhat) and noted there is no relation to the other @s8mb account, signaling an awareness of public-identity clarity that is characteristic of his broader communication style.
References
- Personal website — sleepinyourhat.github.io
- Blog
- FAQ
- Publications list
- X / Twitter — @sleepinyourhat
- GitHub — @sleepinyourhat
- Digg profile
- Google Scholar
- NYU CDS profile
- “The Checklist” (Sep 2024)
- “Why I Think More NLP Researchers Should Engage with AI Safety Concerns” (Oct 2022)
- “Putting Up Bumpers” (Apr 2025)
- Measuring Progress on Scalable Oversight — arXiv:2211.03540
- Discovering Language Model Behaviors with Model-Written Evaluations — arXiv:2212.09251