
Anthropic’s Groundbreaking Paper on Large Language Models

Anthropic has released a pioneering paper that, for the first time, examines the inner workings of a production-scale Large Language Model (LLM) in detail.

The Black Box Problem of Neural Networks

Until now, LLMs have operated like black boxes, producing outputs without offering a clear view of the processes behind them. These models consist of neurons whose activations are combined through nonlinear functions; patterns of co-activated neurons form features, which together make up the model's internal state. Most individual neurons are hard to interpret on their own, which hinders a mechanistic understanding of the models.

A New Technique for Mapping Internal States

The Anthropic team employed a technique called “dictionary learning,” adapted from classical machine learning. This method isolates recurring neuron activation patterns across various contexts, allowing internal states to be represented by a few active features instead of many active neurons. This decomposition into features was previously successful in small models, motivating researchers to apply it to Claude 3 Sonnet, yielding impressive results.
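In practice, dictionary learning of this kind is typically implemented as a sparse autoencoder trained on the model's internal activations. The sketch below illustrates the general idea, assuming a PyTorch setup; the class names, dimensions, and hyperparameters are illustrative, not taken from the paper.

```python
# Minimal sparse-autoencoder sketch of dictionary learning on activations.
# All names and hyperparameters are illustrative, not Anthropic's actual setup.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)  # activations -> feature coefficients
        self.decoder = nn.Linear(n_features, d_model)  # features -> reconstructed activations

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # sparse, non-negative feature activations
        reconstruction = self.decoder(features)
        return features, reconstruction

def loss_fn(activations, features, reconstruction, l1_coeff=1e-3):
    # Reconstruction term keeps the dictionary faithful to the model's state;
    # the L1 penalty pushes most feature activations to zero (sparsity).
    mse = (reconstruction - activations).pow(2).mean()
    sparsity = features.abs().mean()
    return mse + l1_coeff * sparsity

# Usage: collect residual-stream activations from the LLM, then train the autoencoder on them.
sae = SparseAutoencoder(d_model=4096, n_features=65536)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(1024, 4096)  # stand-in for a batch of real model activations
features, recon = sae(acts)
loss = loss_fn(acts, features, recon)
loss.backward()
optimizer.step()
```

The key design choice is the sparsity penalty: it forces each internal state to be explained by only a handful of active features, which is what makes the resulting dictionary human-interpretable.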

Mapping Internal States

The researchers extracted millions of features, creating a conceptual map of the model’s internal states midway through its computation. These features are abstract and consistent across contexts and languages, even generalizing to image inputs.

Insights into the LLM’s Inner Workings

The study revealed that the distance between similar features corresponds to their conceptual similarity. For instance, features related to “inner conflict” were found near those associated with relationship breakups, conflicting allegiances, and logical inconsistencies.
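One way to make this concrete is to treat each feature's decoder (dictionary) vector as its direction in activation space and measure proximity by cosine similarity. The snippet below is a hypothetical sketch of such a nearest-neighbour lookup; the tensors and the feature index are stand-ins, not values from the paper.

```python
# Sketch: feature "distance" as cosine similarity between decoder (dictionary) vectors.
import torch
import torch.nn.functional as F

def nearest_features(decoder_weight: torch.Tensor, feature_idx: int, top_k: int = 5):
    # decoder_weight: (d_model, n_features); each column is one feature's direction.
    directions = F.normalize(decoder_weight, dim=0)
    sims = directions.T @ directions[:, feature_idx]  # cosine similarity to the query feature
    sims[feature_idx] = -1.0                          # exclude the feature itself
    return torch.topk(sims, top_k)

# Stand-in dictionary; in practice this would be the trained autoencoder's decoder weights.
decoder_weight = torch.randn(4096, 65536)
values, indices = nearest_features(decoder_weight, feature_idx=12345)
print(indices)  # indices of the features closest to the (hypothetical) query feature
```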

Implications for Model Safety

The most intriguing finding is that these features can be manipulated directly. Amplifying or suppressing a feature alters the model's behavior, which is a significant advance for model safety. For example, artificially activating a feature associated with blindly agreeing with the user dramatically changes the model's responses.
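Conceptually, such steering amounts to adding a scaled copy of a feature's decoder direction to the model's activations during a forward pass. The sketch below illustrates that idea under assumed tensor shapes; the activations, the "agree with the user" direction, and the steering strength are all hypothetical.

```python
# Conceptual sketch of feature steering: push activations along a chosen feature's
# decoder direction. All tensors and names here are illustrative stand-ins.
import torch

def steer(activations: torch.Tensor, feature_direction: torch.Tensor, strength: float) -> torch.Tensor:
    # activations: (batch, seq, d_model); feature_direction: (d_model,)
    direction = feature_direction / feature_direction.norm()
    return activations + strength * direction  # broadcasts over batch and sequence

# Stand-ins; in practice these come from the model and the trained dictionary.
acts = torch.randn(2, 16, 4096)
agree_direction = torch.randn(4096)  # hypothetical "blindly agree with the user" feature
steered = steer(acts, agree_direction, strength=10.0)
```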


This opens up the possibility of mapping all of an LLM's features and then manipulating them to improve safety, for instance by suppressing some features and artificially activating others.

However, significant challenges remain. Mapping every existing feature would demand more computation than training the model itself. Moreover, knowing which features activate, and what they represent, does not reveal the circuits they are part of or how the model uses them. Finally, there is no guarantee that this technique will actually make AI models safer, despite its promising potential.


Link to the paper: https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html