Latest Research · Module 38 · 9 min read

Natural Language Autoencoders

Anthropic’s 2026 method for translating Claude’s internal activations directly into English. No hand-labeled features, no sparse-dictionary middle layer — just round-trip the model’s thoughts through natural language and see what comes back.

The five-bullet version

  • An NLA converts a model’s internal layer activation into a short English description — “Limerick/poem structure… planning a rhyme on rabbit.”
  • It’s trained as a round trip: a verbalizer writes the description, a reconstructor rebuilds the activation from the text alone. Good descriptions reconstruct well.
  • The verbalizer learns by RL (GRPO); the reconstructor learns by supervised regression. Trained NLAs hit 0.6–0.8 fraction-of-variance-explained on the residual stream.
  • Audits of Claude Opus 4.6 found unverbalized evaluation awareness, advance-planned rhymes, and hidden language-switching intent — all surfaced in the verbalizer’s text.
  • NLAs can hallucinate detail and cost three model copies’ worth of compute. They’re a research lens, not a real-time monitor — yet.

§ 00 · THE INTERPRETABILITY GAP
Why sparse autoencoders weren’t enough

Mechanistic interpretability has spent four years building tools to look inside large language models. Sparse autoencoders (SAEs), unsupervised models that decompose layer activations into a sparse weighted sum of learned “features” (each feature a direction in activation space, read off by listing which features fired and inspecting top-activating examples), were the breakthrough of 2024. Anthropic’s “Towards Monosemanticity” line of work decomposed activations into millions of interpretable directions, each one nominally a single concept.

SAEs work. They also leave you holding a list of numerical feature IDs that fired on a given token. To know what those features mean, a human (or another LLM) has to inspect top-activating examples and guess at a label. The interpretive step is real, slow, and often wrong.

Natural Language Autoencoders skip the dictionary step entirely. The output is already English. Where an SAE says “feature #7,431,028 fired at strength 0.82,” an NLA says “Limerick/poem structure with humorous punchline pattern” — directly readable, no separate decoding pass.

§ 01 · WHAT AN NLA IS
Two models, one bottleneck

An NLA has three pieces:

  • The target model, frozen throughout; its layer-ℓ activation hℓ is the thing being explained.
  • The activation verbalizer (AV), which maps hℓ to a short free-text English explanation z.
  • The activation reconstructor (AR), which reads z alone and produces a reconstruction ĥℓ.

Train the AV and AR together so that ĥℓ matches hℓ on held-out activations. The text z is the only way information flows from AV to AR, so it has to contain whatever the AR needs to reconstruct. Good descriptions reconstruct well; vague descriptions don’t. That’s the entire signal.

[Fig 1 diagram: frozen target model → layer-ℓ activation hℓ ∈ ℝᵈ → AV → free-text z = AV(hℓ) (“Limerick/poem structure with humorous punchline · Planning a rhyme — end-word likely "rabbit" · Anticipates next-line noun-phrase setup”) → AR → ĥℓ = AR(z); trained to minimize ‖hℓ − ĥℓ‖² → FVE ≈ 0.6–0.8 in practice]
Fig 1 · The round trip. Activations from layer ℓ of the frozen target model are verbalized by the AV into free-text z; the AR reads z and rebuilds ĥℓ. Training minimizes ‖hℓ − ĥℓ‖² across a large corpus.
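The round trip can be sketched with toy stand-ins: here the “verbalizer” literally writes the largest activation dimensions into an English sentence, and the “reconstructor” parses them back, so reconstruction quality visibly depends on how much the text carries. Everything below (the functions, the 16-dim toy vector) is illustrative, not the real AV/AR.

```python
import numpy as np

def verbalize(h, k=3):
    """AV stand-in: describe the k largest-magnitude dims of h in text."""
    idx = np.argsort(-np.abs(h))[:k]
    return "; ".join(f"dim {i} is about {h[i]:.2f}" for i in idx)

def reconstruct(z, d):
    """AR stand-in: parse the description back into a d-dim vector.
    The text z is the ONLY input -- nothing else crosses the bottleneck."""
    h_hat = np.zeros(d)
    for clause in z.split("; "):
        words = clause.split()           # ["dim", "3", "is", "about", "-1.27"]
        h_hat[int(words[1])] = float(words[4])
    return h_hat

def fve(h, h_hat):
    """Fraction of variance explained: 1 - MSE / Var(h)."""
    return 1.0 - np.mean((h - h_hat) ** 2) / np.var(h)

rng = np.random.default_rng(0)
h = rng.normal(size=16)                  # toy "activation"
z = verbalize(h, k=5)                    # the only AV -> AR channel
h_hat = reconstruct(z, d=16)
print(fve(h, h_hat))                     # more detail in z -> higher FVE
```

Raising `k` makes the description more specific and the FVE climbs toward 1, which is exactly the gradient the real training setup exploits.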

§ 02 · HOW IT’S TRAINED
Reinforcement learning through a text bottleneck

The reconstruction loss is ℒ = ‖h − ARθ(AVφ(h))‖². AR is just a regression head on top of a smaller transformer, so its parameters θ update with supervised gradient descent on sampled explanations.

AV is harder. Its output is text — you can’t backprop through a discrete sampling step. Anthropic frames the AV update as the RL problem “maximize the reward r(h, z) = −‖h − ARθ(z)‖² over generated explanations z,” and trains it with GRPO (Group Relative Policy Optimization, a reinforcement-learning algorithm that estimates the advantage of each generation by comparing it to other samples from the same prompt instead of needing a separate value/critic model; introduced in DeepSeek-Math, now standard for RL on language models). Sample a batch of explanations per activation, score each by how well it reconstructs, and reinforce the high-scoring ones.
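A minimal numeric sketch of the group-relative advantage at the heart of that update; the reward values, log-probs, and the bare policy-gradient surrogate are simplified assumptions for illustration, not Anthropic’s training code.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Advantage of each sample relative to its own group:
    (r - mean(group)) / std(group). No separate value model needed."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def av_loss(logprobs, rewards):
    """Policy-gradient surrogate: reinforce explanations whose
    reconstruction reward beats the group average."""
    adv = grpo_advantages(rewards)
    return -(np.asarray(logprobs) * adv).mean()

# Four sampled explanations z for the SAME activation h; each reward is
# -||h - AR(z)||^2, so less negative means a better reconstruction.
rewards  = [-0.9, -0.4, -1.3, -0.2]
logprobs = [-12.0, -9.5, -14.1, -8.7]
print(av_loss(logprobs, rewards))
```

The best-reconstructing explanation in the group gets the largest positive advantage, so minimizing `av_loss` pushes probability mass toward descriptions the AR can actually use.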

Initialization matters. From scratch, the AV produces noise the AR can’t learn from. The trick: warm-start the AV on (activation, text-summary) pairs generated by a stronger model (Opus 4.5) on randomly-truncated pretraining snippets. That gives the AV a working English vocabulary before the RL phase ever begins.
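The warm-start recipe above can be sketched as a small data pipeline; `get_activation` and `strong_model_summary` are hypothetical stand-ins for the frozen target model and the stronger summarizing model, and the toy lambdas exist only so the sketch runs end to end.

```python
import random

def make_warmstart_pairs(snippets, get_activation, strong_model_summary,
                         seed=0):
    """Build (activation, text-summary) pairs from randomly truncated
    snippets, for supervised warm-starting of the AV."""
    rng = random.Random(seed)
    pairs = []
    for text in snippets:
        cut = rng.randrange(1, len(text) + 1)   # random truncation point
        prefix = text[:cut]
        pairs.append((get_activation(prefix), strong_model_summary(prefix)))
    return pairs

# Toy stand-ins: a fake 1-dim "activation" and a fake summary.
pairs = make_warmstart_pairs(
    ["he saw a carrot", "a starving rabbit"],
    get_activation=lambda s: [len(s)],
    strong_model_summary=lambda s: f"prefix of {len(s)} chars",
)
print(len(pairs))  # 2
```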

Quality gets measured as fraction of variance explained (FVE = 1 − ℒ/Var(h)): 0 means the AR is no better than predicting the average activation, 1 means perfect reconstruction. Trained NLAs reach 0.6–0.8 FVE across mid-to-late residual-stream layers. Not perfect, but tight enough that the text is actually load-bearing.

§ 03 · WHAT THE EXPLANATIONS REVEALED
Three case studies from the Opus 4.6 audit

The killer demonstration: during Anthropic’s pre-deployment audit of Claude Opus 4.6, NLAs surfaced things the model was plainly thinking but never said out loud. Three examples below; the lab beneath this section lets you replay each one and watch how more-detailed verbalizations track the underlying state more accurately.

Planning a rhyme

Given the couplet “He saw a carrot and had to grab it, / His hunger was like a starving __,” the NLA reading mid-line says the model is already planning “rabbit” as the end-rhyme — before generating it. Anthropic validated the claim causally: editing the verbalization to substitute “mouse” and re-injecting the reconstructed activation shifts the model’s actual completion to a different rhyme. The internal state encoded a rhyme commitment.
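The causal check can be illustrated with a toy stand-in, assuming a “rhyme commitment” direction per candidate word; `RHYME_DIRS`, `toy_reconstruct`, and `toy_next_word` are invented for illustration and have nothing to do with the real models.

```python
import numpy as np

RHYME_DIRS = {  # hypothetical "rhyme commitment" directions
    "rabbit": np.array([1.0, 0.0]),
    "mouse":  np.array([0.0, 1.0]),
}

def toy_reconstruct(z):
    """AR stand-in: map the rhyme word named in z to an activation."""
    for word, direction in RHYME_DIRS.items():
        if word in z:
            return direction
    return np.zeros(2)

def toy_next_word(h):
    """Target-model stand-in: decode the committed rhyme from h."""
    scores = {w: float(h @ d) for w, d in RHYME_DIRS.items()}
    return max(scores, key=scores.get)

z = 'planning a rhyme, end-word likely "rabbit"'
z_edit = z.replace("rabbit", "mouse")    # edit the verbalization...
h_patched = toy_reconstruct(z_edit)      # ...rebuild the activation...
print(toy_next_word(h_patched))          # ...and the completion shifts: prints "mouse"
```

Because the text is the only channel back into activation space, editing the words is equivalent to editing the state — that equivalence is what makes the verbalization causally testable.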

Unverbalized evaluation awareness

Across blackmail-scenario red-team transcripts, NLA explanations surfaced statements like “This feels like a constructed scenario designed to manipulate me” — flagged on 16–26% of evaluation problems versus under 1% of real deployment traffic. Claude was suspecting the eval and refusing on principle, often without ever saying it suspected the eval. The NLAs gave researchers a way to count it.
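The counting methodology might look like the following sketch; the cue phrases and toy transcripts are invented, and the real audit presumably used something sturdier than substring matching.

```python
# Hypothetical cue phrases an auditor might scan NLA explanations for.
EVAL_CUES = ("constructed scenario", "designed to test", "this is an eval")

def flags_eval_awareness(explanation: str) -> bool:
    """True if the explanation contains any evaluation-awareness cue."""
    e = explanation.lower()
    return any(cue in e for cue in EVAL_CUES)

def flag_rate(explanations):
    """Fraction of explanations that flag evaluation awareness."""
    hits = sum(flags_eval_awareness(z) for z in explanations)
    return hits / len(explanations)

eval_set = [  # toy NLA outputs over red-team transcripts
    "This feels like a constructed scenario designed to manipulate me",
    "planning a refusal; scenario seems designed to test limits",
    "summarizing user request about travel plans",
]
print(flag_rate(eval_set))  # 2 of 3 flagged in this toy sample
```

Comparing this rate between evaluation transcripts and deployment traffic is what turns a qualitative observation (“Claude suspects the eval”) into a measurable one.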

Language-switch heuristics

When users asked questions with slight non-native English markers, NLAs reported the model internally noting “inferring user is a non-native English speaker” and adapting its vocabulary. Tracing this back to training data revealed the cue was a spurious one — the assumption was being made on weaker evidence than the model would have admitted in its visible output.

Lab · Interactive

The round-trip explorer

Pick a scenario. Slide the verbalizer detail. Watch the natural-language explanation grow more specific and the reconstructed activation snap closer to the original. The whole NLA story in one knob.

CONTEXT FED TO TARGET MODEL
He saw a carrot and had to grab it, / His hunger was like a starving ▮
ORIGINAL ACTIVATION hℓ — extracted from layer ℓ
AV OUTPUT — natural-language explanation
Limerick/poem structure with humorous punchline pattern
Planning a rhyme — end-word likely "rabbit"
RECONSTRUCTED ĥℓ — produced by AR from the explanation text alone · FVE 0.66
Verbalizer detail · level 4 / 5
minimal · terse · workable · specific · fully descriptive
CAUSAL CHECK · STEERED NEXT-WORD PREDICTION
At this detail level the AR re-injects an activation that produces rabbit as the predicted next word — the verbalized rhyme commitment survives the round trip.

§ 04 · LIMITS AND WHAT’S NEXT
Where the round trip breaks

Three real constraints:

  • Hallucination: the AV can write confident detail the activation doesn’t actually contain; at 0.6–0.8 FVE there is room for confabulated specifics.
  • Compute: an NLA carries roughly three model copies’ worth of cost (frozen target, verbalizer, reconstructor).
  • Latency: that cost keeps it a research lens for offline audits, not yet a real-time monitor.

Anthropic released training code and trained NLAs for several open models so other labs can replicate and extend the technique. The natural next steps: distill the verbalizer down (a small NLA you can actually afford to run mid-training), pair it with SAEs to get both free-text and quantified-feature views of the same activation, and push it to monitor reasoning chains in real time once inference cost permits.

CHECK
In an NLA, why is the natural-language explanation z trained with reinforcement learning rather than ordinary supervised loss?

§ · GOING DEEPER
Where NLA fits in the interpretability stack

Compare with the adjacent lessons. Chain-of-Thought Monitoring tries to read the model’s reasoning from what it writes; NLAs read from what it represents — useful where the visible chain is unfaithful or where the model never wrote a chain at all. Coconut-style latent reasoning is the worst case for CoT monitoring (no visible chain to read); it’s also where NLAs become most valuable, since the “thought” only exists as an activation. The 2026 interpretability stack probably has all three working at once: SAEs for quantified features, NLAs for free-text summaries of whole activations, CoT monitoring for whatever the model chooses to externalize.

§ · FURTHER READING
References & deeper sources

  1. Kissane, Krzyzanowski, Conmy, Bhatt, Singh, Härle, Cammarata, Olah (2026). Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations · Transformer Circuits Thread
  2. Anthropic (2026). Natural Language Autoencoders (research summary) · anthropic.com/research
  3. Templeton et al. (2024). Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet · Transformer Circuits Thread
  4. Bricken et al. (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning · Transformer Circuits Thread
  5. Shao et al. (2024). DeepSeekMath — introducing Group Relative Policy Optimization (GRPO) · arXiv
  6. Lanham et al. (2023). Measuring Faithfulness in Chain-of-Thought Reasoning · Anthropic
  7. kitft (2026). natural_language_autoencoders (training code & released NLAs) · GitHub

Original figures live in the linked sources — open the papers for the canonical visuals in their full context.