Natural Language Autoencoders
Anthropic’s 2026 method for translating Claude’s internal activations directly into English. No hand-labeled features, no sparse-dictionary middle layer — just round-trip the model’s thoughts through natural language and see what comes back.
The five-bullet version
- An NLA converts a model’s internal layer activation into a short English description — “Limerick/poem structure… planning a rhyme on rabbit.”
- It’s trained as a round trip: a verbalizer writes the description, a reconstructor rebuilds the activation from the text alone. Good descriptions reconstruct well.
- The verbalizer learns by RL (GRPO); the reconstructor learns by supervised regression. Trained NLAs hit 0.6–0.8 fraction-of-variance-explained on the residual stream.
- Audits of Claude Opus 4.6 found unverbalized evaluation awareness, advance-planned rhymes, and hidden language-switching intent — all surfaced in the verbalizer’s text.
- NLAs can hallucinate detail and cost three model copies’ worth of compute. They’re a research lens, not a real-time monitor — yet.
§ 00 · THE INTERPRETABILITY GAP
Why sparse autoencoders weren’t enough
Mechanistic interpretability has spent four years building tools to look inside large language models. Sparse autoencoders (SAEs), an unsupervised method that decomposes layer activations into a sparse weighted sum of learned “features” (each feature a direction in activation space), were the breakthrough of 2024 — Anthropic’s “Towards Monosemanticity” line of work decomposed activations into millions of interpretable directions, each one nominally a single concept.
SAEs work. They also leave you holding a list of numerical feature IDs that fired on a given token. To know what those features mean, a human (or another LLM) has to inspect top-activating examples and guess at a label. The interpretive step is real, slow, and often wrong.
Natural Language Autoencoders skip the dictionary step entirely. The output is already English. Where an SAE says “feature #7,431,028 fired at strength 0.82,” an NLA says “Limerick/poem structure with humorous punchline pattern” — directly readable, no separate decoding pass.
§ 01 · WHAT AN NLA IS
Two models, one bottleneck
An NLA has three pieces:
- Target model. A frozen language model — the one you want to understand. You extract its residual-stream activation hℓ at some chosen layer ℓ. (The residual stream is the running “memory” of a transformer: the vector every layer reads from and writes to, carrying the model’s working state at each token and accumulating refinements up through the layers.)
- Activation Verbalizer (AV). A second language model, given a fixed prompt plus a special “activation token,” that writes a short natural-language explanation z of what hℓ seems to represent.
- Activation Reconstructor (AR). A truncated copy of the target model (its first ℓ layers) that reads only the text z and produces a reconstruction ĥℓ. A learned affine map at the final token converts hidden state to activation.
Train the AV and AR together so that ĥℓ matches hℓ on held-out activations. The text z is the only way information flows from AV to AR — so it has to contain whatever the AR needs to reconstruct. Good descriptions reconstruct well; vague descriptions don’t. That’s the entire signal.
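A minimal sketch of the round trip, assuming toy stand-ins for both models — the real AV and AR are full transformer stacks, so only the shape of the information flow below is meant to be faithful:

```python
import torch
import torch.nn as nn

D = 64        # toy residual-stream width; real NLAs use frontier-scale dims
VOCAB = 1000  # toy vocabulary for the explanation text

def verbalize(h: torch.Tensor) -> list[int]:
    """Stand-in for the AV: the real AV is a full LM that reads h via a
    special activation token and samples an English explanation z.
    Here we just emit a fixed placeholder token sequence."""
    return [1, 2, 3]

class Reconstructor(nn.Module):
    """Stand-in for the AR: an encoder over the explanation tokens plus an
    affine head mapping the final hidden state back to activation space.
    The real AR is the target model's first l layers plus a learned map."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D)
        self.encoder = nn.GRU(D, D, batch_first=True)
        self.head = nn.Linear(D, D)  # the learned affine map

    def forward(self, z_tokens: torch.Tensor) -> torch.Tensor:
        x = self.embed(z_tokens)           # (batch, seq, D)
        _, last = self.encoder(x)          # final hidden state
        return self.head(last.squeeze(0))  # (batch, D) reconstruction

h = torch.randn(1, D)             # activation from the frozen target model
z = torch.tensor([verbalize(h)])  # text is the only channel from AV to AR
ar = Reconstructor()
h_hat = ar(z)
loss = ((h - h_hat) ** 2).sum()   # reconstruction loss ||h - h_hat||^2
```

The point the sketch makes: the reconstructor never sees h directly, so any information that survives the round trip had to pass through the English description.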
§ 02 · HOW IT’S TRAINED
Reinforcement learning through a text bottleneck
The reconstruction loss is ℒ = ‖hℓ − ARθ(AVφ(hℓ))‖². AR is just a regression head on top of a smaller transformer, so its parameters θ update with supervised gradient descent on sampled explanations.
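In code, that update is plain regression. A minimal sketch, continuing the toy `Reconstructor` above (so `ar` is that stand-in module, not the paper’s AR):

```python
import torch

opt = torch.optim.Adam(ar.parameters(), lr=1e-4)  # `ar` as sketched above

def ar_step(z_tokens: torch.Tensor, h: torch.Tensor) -> float:
    """One supervised update of the reconstructor on a batch of sampled
    explanations z and their source activations h."""
    h_hat = ar(z_tokens)
    loss = ((h - h_hat) ** 2).sum(dim=-1).mean()  # L = ||h - AR(z)||^2
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```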
AV is harder. Its output is text — you can’t backprop through a discrete sampling step. Anthropic frames the AV update as the RL problem “maximize the reward r(hℓ, z) = −‖hℓ − ARθ(z)‖² over generated explanations z,” and trains it with GRPO (Group Relative Policy Optimization: a reinforcement-learning algorithm that estimates the advantage of each generation by comparing it to other samples from the same prompt, instead of needing a separate value/critic model; introduced in DeepSeekMath and now standard for RL on language models). Sample a batch of explanations per activation, score each by how well it reconstructs, and reinforce the high-scoring ones.
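One GRPO-style AV update, sketched in the same toy setting; `av.sample` and `av.logprob` are hypothetical interfaces onto the verbalizer LM, not a released API:

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantage: each explanation is scored against its
    siblings sampled from the same activation, so no critic is needed."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def av_step(h, av, ar, group_size: int = 8):
    # Sample a group of candidate explanations for one activation.
    zs = [av.sample(h) for _ in range(group_size)]
    with torch.no_grad():  # reward reads the current AR but doesn't train it
        rewards = torch.stack([-((h - ar(z)) ** 2).sum() for z in zs])
    adv = grpo_advantages(rewards)
    logps = torch.stack([av.logprob(z, h) for z in zs])
    # REINFORCE-style surrogate: raise log-prob of above-average explanations
    # (optimizer step omitted for brevity)
    (-(adv * logps).mean()).backward()
```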
Initialization matters. From scratch, the AV produces noise the AR can’t learn from. The trick: warm-start the AV on (activation, text-summary) pairs generated by a stronger model (Opus 4.5) on randomly truncated pretraining snippets. That gives the AV a working English vocabulary before the RL phase ever begins.
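A sketch of that warm-start data generation, with stub helpers standing in for the real models:

```python
import random

ELL = 20  # hypothetical probe layer

def get_activation(text: str, layer: int) -> list[float]:
    """Stub for extracting h_l from the frozen target model."""
    return [0.0] * 64

def teacher_summary(text: str) -> str:
    """Stub: in the paper, the stronger model (Opus 4.5) writes this."""
    return "summary of: " + text[:40]

warm_start_pairs = []
for snippet in ["Once upon a midnight dreary...", "def fib(n): ..."]:
    cut = snippet[: random.randint(1, len(snippet))]  # random truncation
    warm_start_pairs.append((get_activation(cut, ELL), teacher_summary(cut)))
# These (activation, text) pairs feed ordinary supervised fine-tuning of
# the AV, giving it a working vocabulary before the RL phase begins.
```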
Quality gets measured as fraction of variance explained (FVE): 1 − ℒ/Var(hℓ), where 0 means the AR is no better than predicting the average activation and 1 means perfect reconstruction. Trained NLAs reach 0.6–0.8 FVE across mid-to-late residual-stream layers. Not perfect, but tight enough that the text is actually load-bearing.
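As a sketch, FVE is a few lines over a held-out batch:

```python
import torch

def fve(h: torch.Tensor, h_hat: torch.Tensor) -> float:
    """Fraction of variance explained: 1 - MSE / Var(h). 0 means no better
    than predicting the mean activation; 1 means perfect reconstruction."""
    mse = ((h - h_hat) ** 2).mean()
    var = ((h - h.mean(dim=0)) ** 2).mean()
    return (1.0 - mse / var).item()

h = torch.randn(512, 64)               # held-out activations (toy sizes)
h_hat = h + 0.5 * torch.randn_like(h)  # decent but imperfect reconstruction
print(f"{fve(h, h_hat):.2f}")          # ~0.75, inside the reported 0.6-0.8 band
```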
§ 03 · WHAT THE EXPLANATIONS REVEALED
Three case studies from the Opus 4.6 audit
The killer demonstration: during Anthropic’s pre-deployment audit of Claude Opus 4.6, NLAs surfaced things the model was plainly thinking but never said out loud. Three examples below; the lab beneath this section lets you replay each one and watch how more-detailed verbalizations track the underlying state more accurately.
Planning a rhyme
Given the couplet “He saw a carrot and had to grab it, / His hunger was like a starving __,” the NLA reading mid-line says the model is already planning “rabbit” as the end-rhyme — before generating it. Anthropic validated the claim causally: editing the verbalization to substitute “mouse” and re-injecting the reconstructed activation shifts the model’s actual completion to a different rhyme. The internal state encoded a rhyme commitment.
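A toy sketch of the shape of that causal check; `ar_from_text` and `complete_with_patch` are stubs for the paper’s actual reconstruction and activation-patching tooling:

```python
import torch

D, ELL = 64, 20  # toy width and hypothetical probe layer

def ar_from_text(z: str) -> torch.Tensor:
    """Stub reconstructor: deterministically map explanation text to an
    activation (the real AR is the truncated target model)."""
    g = torch.Generator().manual_seed(abs(hash(z)) % (2**31))
    return torch.randn(D, generator=g)

def complete_with_patch(prompt: str, layer: int, new_h: torch.Tensor) -> str:
    """Stub for the real intervention: overwrite h_l at the mid-line token
    with new_h, then let the frozen model keep generating."""
    return "<completion under patched activation>"

couplet = ("He saw a carrot and had to grab it, / "
           "His hunger was like a starving")
h_edit = ar_from_text("planning an end-rhyme on 'mouse'")  # edited verbalization
print(complete_with_patch(couplet, ELL, h_edit))
# In the audit, this edit shifted the completion away from 'rabbit',
# showing the rhyme plan was causally encoded in h_l, not just correlated.
```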
Unverbalized evaluation awareness
Across blackmail-scenario red-team transcripts, NLA explanations surfaced statements like “This feels like a constructed scenario designed to manipulate me” — flagged on 16–26% of evaluation problems versus under 1% of real deployment traffic. Claude was suspecting the eval and refusing on principle, often without ever saying it suspected the eval. The NLAs gave researchers a way to count it.
Language-switch heuristics
When users asked questions with slight non-native English markers, NLAs reported the model internally noting “inferring user is a non-native English speaker” and adapting its vocabulary. Tracing this back to training data revealed the cue was a spurious one — the assumption was being made on weaker evidence than the model would have admitted in its visible output.
The round-trip explorer
Pick a scenario. Slide the verbalizer detail. Watch the natural-language explanation grow more specific and the reconstructed activation snap closer to the original. The whole NLA story in one knob.
§ 04 · LIMITS AND WHAT’S NEXT
Where the round trip breaks
Three real constraints:
- Hallucinated specifics. The AV is still a language model. It will sometimes confabulate concrete details the activation never represented. Verification against the original transcript is mandatory before treating an explanation as ground truth.
- Compute cost. Training requires three copies of a frontier-scale model running in tandem. Inference generates hundreds of tokens per activation. You cannot run NLAs over every token of a long transcript, and you certainly can’t run them during training.
- Layer-locality. NLAs target a specific residual-stream layer. They don’t tell you how a thought formed across layers, only what it looks like at the one you probed.
Anthropic released training code and trained NLAs for several open models so other labs can replicate and extend the technique. The natural next steps: distill the verbalizer down (a small NLA you can actually afford to run mid-training), pair it with SAEs to get both free-text and quantified-feature views of the same activation, and push it to monitor reasoning chains in real time once inference cost permits.
§ · GOING DEEPER
Where NLA fits in the interpretability stack
Compare with the adjacent lessons. Chain-of-Thought Monitoring tries to read the model’s reasoning from what it writes; NLAs read from what it represents — useful where the visible chain is unfaithful or where the model never wrote a chain at all. Coconut-style latent reasoning is the worst case for CoT monitoring (no visible chain to read); it’s also where NLAs become most valuable, since the “thought” only exists as an activation. The 2026 interpretability stack probably has all three working at once: SAEs for quantified features, NLAs for free-text summaries of whole activations, CoT monitoring for whatever the model chooses to externalize.
§ · FURTHER READING
References & deeper sources
- (2026). Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations · Transformer Circuits Thread
- (2026). Natural Language Autoencoders (research summary) · anthropic.com/research
- (2024). Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet · Transformer Circuits Thread
- (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning · Transformer Circuits Thread
- (2024). DeepSeekMath — introducing Group Relative Policy Optimization (GRPO) · arXiv
- (2023). Measuring Faithfulness in Chain-of-Thought Reasoning · Anthropic
- (2026). natural_language_autoencoders (training code & released NLAs) · GitHub
Original figures live in the linked sources — open the papers for the canonical visuals in their full context.