Chris Olah

Mechanistic interpretability researcher and co-founder of Anthropic. A self-taught Canadian, he coined the term “mechanistic interpretability,” built the colah.github.io blog into one of the most-read ML educational resources of the 2010s, and has spent fifteen years pursuing a single question: what are neural networks actually computing?


Profile

Field: Detail
Nationality: Canadian
Current Institution: Anthropic (Co-founder; interpretability researcher)
Research Areas: Mechanistic Interpretability, Neural Network Interpretability, Feature Visualization, Circuits, Superposition, Sparse Autoencoders, AI Safety
Education: The Abelard School, Toronto (graduated National AP Scholar, 2010); left university without degree; Thiel Fellowship (2012)
Blog: colah.github.io
Research thread: transformer-circuits.pub
X / Twitter: @ch402
GitHub: @colah
Google Scholar: scholar.google.com

Overview

Chris Olah is a Canadian machine learning researcher and co-founder of Anthropic who has spent his entire career on a single question: what are neural networks actually doing inside? He coined the term “mechanistic interpretability” in 2020 while leading the OpenAI team that published the Circuits thread on Distill, and has since led Anthropic’s interpretability research, which produced the Toy Models of Superposition, Towards Monosemanticity, Scaling Monosemanticity, and Circuit Tracing papers, the body of work that defines the current frontier of the field. He has no formal academic degree, having left university at 18 to pursue independent research supported by a Thiel Fellowship, and his technical blog (colah.github.io) became one of the most-read resources in machine learning education during the 2010s before he shifted to research publication. He co-founded Distill (2017), a scientific journal dedicated to outstanding communication in ML, and later co-founded Anthropic (2021) with Dario Amodei, Daniela Amodei, and others departing OpenAI. Time named him to its 100 Most Influential People in AI for 2024.


Early Life & Education

Olah grew up in Canada and graduated from The Abelard School in Toronto as a National AP Scholar in 2010. He enrolled at university but left at age 18 without completing a degree, drawn toward independent research in mathematics and computing. Around 2012 he received a Thiel Fellowship — the $100,000 award from Peter Thiel’s foundation designed to encourage gifted young people to pursue research or entrepreneurship instead of traditional university — which provided financial support for his early self-directed work.

During this period, Olah maintained an early blog (christopherolah.wordpress.com / colah.ca) covering mathematical topics including topology, calculus, and computer vision. The writing style he developed — clear, diagram-heavy, building from first principles — established habits that would make his later ML blog one of the most influential teaching resources in the field.


Career

Independent Research and colah.github.io (2012–2014)

Before joining any institution, Olah independently developed a deep understanding of neural networks through reading, implementation, and writing. His blog colah.github.io, which he continued to write after joining Google Brain, became the destination for accessible explanations of topics then considered impenetrable: convolutional networks, recurrent networks, attention, word embeddings, backpropagation. Posts like “Deep Learning, NLP, and Representations” (2014) and “Understanding LSTM Networks” (2015) accumulated millions of reads and became standard references in university courses worldwide, making his name part of the learning experience of a generation of ML practitioners. He has described his personal mission as wanting to “understand things clearly and explain them well,” a phrase that appears on his GitHub profile and encapsulates his career.

Google Brain (c. 2014–2019)

Olah joined Google Brain as a researcher, where he worked primarily on understanding what neural networks learn — at the time, a niche and somewhat undervalued research direction. His most publicly visible contribution from this period was co-creating DeepDream (2015, with Alexander Mordvintsev and Mike Tyka), a technique for amplifying patterns a neural network has learned to recognize by optimizing input images toward high activations of target neurons. The resulting psychedelic visualizations went viral far beyond the ML community and demonstrated for a broad audience that neural networks were building rich internal representations of the world, not just performing statistical pattern matching on surfaces.

More technically consequential was his work on feature visualization — systematic methods for understanding what individual neurons and channels in a network respond to — which he developed in collaboration with Shan Carter and others and published through the Distill journal he co-founded. The 2017 “Feature Visualization” paper and the 2018 “Building Blocks of Interpretability” paper established a vocabulary and methodology for structured analysis of network internals that directly preceded the Circuits work.
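The optimization idea shared by DeepDream and feature visualization can be illustrated with a toy sketch. This is not the DeepDream implementation: it assumes a single hypothetical "neuron" (a tanh of a weighted sum), a four-pixel "image," and made-up weights, and it runs plain gradient ascent on the input so that the neuron's activation rises. Real systems apply the same idea to convolutional networks and add regularization to keep the images natural-looking.

```python
import math

def activation(weights, image):
    """Toy 'neuron': tanh of a weighted sum of pixel values."""
    return math.tanh(sum(w * x for w, x in zip(weights, image)))

def activation_gradient(weights, image):
    """Gradient of the toy neuron's activation with respect to each pixel."""
    a = activation(weights, image)
    return [(1.0 - a * a) * w for w in weights]

def amplify(weights, image, steps=50, step_size=0.1):
    """Gradient ascent on the input: nudge pixels so the activation grows."""
    image = list(image)
    for _ in range(steps):
        grad = activation_gradient(weights, image)
        image = [x + step_size * g for x, g in zip(image, grad)]
    return image

# Hypothetical values chosen only for illustration.
weights = [0.5, -1.0, 0.25, 0.75]   # stands in for a learned filter
start = [0.1, 0.1, 0.1, 0.1]        # stands in for the starting image

dreamed = amplify(weights, start)
before = activation(weights, start)
after = activation(weights, dreamed)
print(f"activation before: {before:.3f}, after: {after:.3f}")
```

The model's weights stay fixed throughout; only the input changes, which is what distinguishes this family of techniques from ordinary training.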

Distill (co-founded 2017): Alongside Shan Carter and others, Olah co-founded Distill, a scientific journal built on the conviction that presentation and clarity in scientific communication are not decoration but substance. Distill papers combined interactive visualizations, animated diagrams, and clear prose to explain ML concepts; the journal attracted high-profile contributions on attention, GANs, and feature visualization. Though Distill paused new publications in 2021 due to the maintenance burden, its aesthetic influenced a generation of ML researchers and established a standard for what careful scientific communication in the field could look like. The transformer-circuits.pub thread that Olah later launched at Anthropic carries forward the same ethos.

OpenAI — Clarity Team (c. 2019–2021)

Olah joined OpenAI to lead its interpretability-focused Clarity Team. The most significant output of this period was the Circuits thread — a sequence of papers published on Distill beginning in 2020 that studied InceptionV1, a vision model, at the level of individual neurons and the connections between them. The inaugural paper, “Zoom In: An Introduction to Circuits” (2020, with Nick Cammarata and others), demonstrated that individual neurons corresponded to recognizable concepts (curve detectors, texture detectors, even a “floppy ear detector”) and that connections between neurons formed meaningful algorithms — implying that neural networks could in principle be reverse-engineered into interpretable components. The paper coined the term “mechanistic interpretability” and argued for three properties of neural networks — features, circuits, and universality — as organizing principles for the field.

Activation Atlases (2019): Co-developed with Shan Carter at Google and OpenAI, Activation Atlases provided a global map of the feature space of a neural network by aggregating millions of activation vectors and visualizing their structure — enabling a “10,000-foot view” of what a network has learned rather than neuron-by-neuron inspection.

Anthropic — Co-Founder and Interpretability Lead (2021–present)

Olah co-founded Anthropic in 2021 alongside Dario Amodei, Daniela Amodei, Tom Brown, Sam McCandlish, Jack Clark, and Jared Kaplan, a group that departed OpenAI largely over concerns about the pace of safety work relative to capabilities development. At Anthropic, interpretability is one of the company’s core research priorities, and Olah leads the interpretability team, which has produced some of the field’s most influential work of the 2020s:

“Toy Models of Superposition” (2022): This paper investigated why neurons in neural networks appear “polysemantic” — responding to multiple unrelated features — and developed a theoretical and empirical framework centered on “superposition”: the phenomenon by which a neural network with $n$ neurons can represent more than $n$ features by packing them into overlapping directions in activation space. The paper laid the mathematical groundwork for understanding why mechanistic interpretability is hard (features are not cleanly assigned to neurons) and why sparse methods might help decompose them.
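A small numerical sketch can make the superposition idea concrete. It is an illustration under stated assumptions, not the paper's experiments: three hypothetical features are assigned hand-picked directions 120 degrees apart in a 2-neuron space, and each feature is read back by projecting onto its direction and applying a ReLU. The names `DIRECTIONS`, `encode`, and `decode` are invented for this example.

```python
import math

# Three feature directions packed into a 2-neuron space, 120 degrees apart.
DIRECTIONS = [
    (math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3))
    for k in range(3)
]

def encode(features):
    """Map 3 feature values to 2 neuron activations (a linear projection)."""
    h0 = sum(f * d[0] for f, d in zip(features, DIRECTIONS))
    h1 = sum(f * d[1] for f, d in zip(features, DIRECTIONS))
    return (h0, h1)

def decode(hidden):
    """Read each feature back: project onto its direction, then ReLU."""
    return [max(0.0, hidden[0] * d[0] + hidden[1] * d[1]) for d in DIRECTIONS]

# Sparse input: one feature active. The ReLU removes the negative
# interference, so the feature is recovered almost exactly.
sparse = decode(encode([1.0, 0.0, 0.0]))

# Dense input: two features active at once. Their directions overlap,
# so each one is only partially recovered.
dense = decode(encode([1.0, 1.0, 0.0]))

print("sparse:", [round(v, 3) for v in sparse])
print("dense: ", [round(v, 3) for v in dense])
```

When only one feature is active the readout is clean, but two simultaneously active features interfere with each other. That trade-off is why the number of representable features can exceed the number of neurons only when features are sparse.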

“Towards Monosemanticity: Decomposing Language Models with Dictionary Learning” (2023): Applied sparse autoencoders (SAEs) to the MLP activations of a one-layer transformer to extract a large dictionary of interpretable features, addressing the superposition problem by finding a higher-dimensional space in which individual directions correspond to human-recognizable concepts. This work shifted the mechanistic interpretability research agenda toward dictionary learning as the primary tool for feature decomposition.
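The shape of a sparse autoencoder can be sketched in a few lines. This is a generic textbook form, not the architecture or training setup used in the paper: the encoder maps a 2-dimensional activation into a 3-entry dictionary through a ReLU, the decoder reconstructs the activation, and the loss adds an L1 sparsity penalty to the reconstruction error. All weights are hand-set hypothetical values and no training is performed.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def sae_forward(x, enc_w, enc_b, dec_w, dec_b):
    """Encode into a wider, non-negative feature vector, then reconstruct."""
    pre = [p + b for p, b in zip(matvec(enc_w, x), enc_b)]
    features = [max(0.0, p) for p in pre]          # ReLU
    recon = [r + b for r, b in zip(matvec(dec_w, features), dec_b)]
    return features, recon

def sae_loss(x, features, recon, l1_coeff):
    """Mean squared reconstruction error plus an L1 sparsity penalty."""
    mse = sum((a - b) ** 2 for a, b in zip(x, recon)) / len(x)
    sparsity = sum(abs(f) for f in features)
    return mse + l1_coeff * sparsity

# Hypothetical dictionary: three directions in a 2-dimensional activation space.
s = math.sqrt(3) / 2
DIRECTIONS = [(1.0, 0.0), (-0.5, s), (-0.5, -s)]

enc_w = [list(d) for d in DIRECTIONS]                 # 3 x 2 encoder
dec_w = [[d[i] for d in DIRECTIONS] for i in range(2)]  # 2 x 3 decoder
enc_b = [0.0, 0.0, 0.0]
dec_b = [0.0, 0.0]

x = [1.0, 0.0]  # an activation that lies along the first dictionary direction
features, recon = sae_forward(x, enc_w, enc_b, dec_w, dec_b)
loss = sae_loss(x, features, recon, l1_coeff=0.001)
print("features:", features, "reconstruction:", recon, "loss:", loss)
```

In practice the dictionary is much wider than the activation space and the weights are learned by minimizing this kind of loss over many activations; the L1 term is what pushes each input to be explained by only a few active features.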

“Scaling Monosemanticity” (2024): Scaled the sparse autoencoder approach to Claude 3 Sonnet, finding millions of interpretable features in a production-scale language model. These included features corresponding to characters and concepts, as well as surprisingly human-like abstractions such as features associated with emotional states and introspective concepts.

Circuit Tracing and Attribution Graphs (2025): Introduced cross-layer transcoders (CLTs) as a new form of sparse autoencoder that replaces MLP layers with interpretable components, enabling the construction of “attribution graphs” — causal maps of which features influenced which outputs in a particular model forward pass. This approach was applied to Claude 3.5 Haiku in production and the codebase was open-sourced, making circuit tracing infrastructure available to the broader research community.

Transformer Circuits Thread (transformer-circuits.pub): Olah has organized Anthropic’s interpretability research outputs into this public publication thread, which functions as the field’s de facto primary research venue, in the spirit of Distill.


Key Contributions

  • Coined “mechanistic interpretability” — The term and the research program it names both originate in Olah’s 2020 Circuits paper; the field that has grown around it now spans dozens of research groups and hundreds of researchers worldwide.

  • “Zoom In: An Introduction to Circuits” (2020) — The founding paper of modern mechanistic interpretability; demonstrated that individual neurons and their connections in a vision model can be reverse-engineered into interpretable algorithms; introduced features, circuits, and universality as organizing principles.

  • “Toy Models of Superposition” (2022) — Provided the theoretical framework for understanding polysemanticity and the superposition hypothesis; redirected interpretability research toward sparse decomposition methods.

  • “Towards Monosemanticity” (2023) and “Scaling Monosemanticity” (2024) — Demonstrated that sparse autoencoders can decompose language models, first a one-layer transformer and then a production model, into large dictionaries of interpretable features numbering in the millions at production scale; among the most-cited mechanistic interpretability papers of their years.

  • Circuit Tracing / Attribution Graphs (2025) — Introduced a unified framework for tracing causal pathways through a model’s computation; open-sourced the tooling; applied to Claude in production for the first time.

  • DeepDream (2015) — Co-created the technique that made internal neural network representations publicly visible and comprehensible; a cultural and scientific landmark that changed how the broader public understood what neural networks learn.

  • colah.github.io blog — Posts including “Understanding LSTM Networks,” “Calculus on Computational Graphs,” “Neural Networks, Types, and Functional Programming,” and “Visual Information Theory” became standard educational references; the blog reached millions of readers and trained a generation of ML practitioners.

  • Distill (co-founded 2017) — Co-founded the ML scientific journal focused on outstanding communication, establishing a standard for interactive, diagram-driven ML papers that influenced how the field publishes and explains itself.

  • Feature Visualization (2017, with Alexander Mordvintsev and Ludwig Schubert) — Systematic methodology for understanding what individual neurons respond to, by optimizing inputs toward high activations; foundational to much subsequent work on neural network internals.


Awards & Recognition

  • TIME100 AI — Most Influential People in AI (2024) — Cited as a pioneer of mechanistic interpretability.
  • Thiel Fellowship (2012) — $100,000 award recognizing exceptional young researchers and entrepreneurs.
  • National AP Scholar (2010) — Academic distinction upon graduating The Abelard School.

Key Relationships

  • Dario Amodei — Anthropic CEO and co-founder; interpretability is explicitly a strategic priority at Anthropic partly because of Dario’s own view that it is “one of the best bets for responsible AI development”; Olah and Amodei’s shared emphasis on the importance of understanding AI systems before deploying them defines Anthropic’s safety culture.
  • Shan Carter — Long-term research collaborator; co-author on Building Blocks of Interpretability and Activation Atlases; co-founder of Distill alongside Olah; the pairing of Olah’s theoretical drive with Carter’s design and communication sensibility defined the Distill aesthetic.
  • Nick Cammarata — Key collaborator on the original Circuits thread at OpenAI; co-author of “Zoom In.”
  • Tom Brown — Co-founder of Anthropic and fellow OpenAI alumnus; as lead author of the GPT-3 paper, he brought language modeling expertise that complements Olah’s interpretability focus.
  • Andrej Karpathy — Among Olah’s most prominent followers; both share a commitment to building public understanding of how neural networks work, through different modalities (Karpathy through courses and code, Olah through visual essays and theory).

Personal Style

Olah’s intellectual identity can be stated simply because he has stated it repeatedly: “I want to understand things clearly and explain them well.” This dual commitment — to genuine understanding and to genuine communication — is not rhetorical; it predicts both the form and the substance of his output. His blog posts are notable for building from first principles, using carefully designed diagrams rather than equations wherever possible, and insisting that understanding means being able to construct an explanation that makes the thing feel obvious in retrospect. His research papers at Anthropic share the same aesthetic: the transformer-circuits thread reads more like a patient, cumulative scientific narrative than a sequence of conference paper contributions.

His approach to AI safety is also distinctive within the field: rather than starting from alignment theory or governance, he starts from the empirical question of what is actually happening inside neural networks — treating interpretability as the prerequisite to everything else. He has described feeling uncertain about many contested questions in AI safety but confident that “whatever the answer turns out to be, understanding what neural networks are doing will matter.” His Digg vibe profile (37% Mechanistic Interpretability, “Informing” and “Teaching” dominant) accurately captures a communicator whose public presence is primarily pedagogical — less concerned with debate than with building shared understanding of a genuinely hard empirical problem.


References