Metadata
- Author: Tom Henighan
- Full Title: Scaling Monosemanticity: Extracting Interpretable Features From Claude 3 Sonnet
- URL: https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html
Highlights
- Eight months ago, we demonstrated that sparse autoencoders could recover monosemantic features from a small one-layer transformer. At the time, a major concern was that this method might not scale feasibly to state-of-the-art transformers and, as a result, be unable to practically contribute to AI safety. Since then, scaling sparse autoencoders has been a major priority of the Anthropic interpretability team, and we’re pleased to report extracting high-quality features from Claude 3 Sonnet, Anthropic’s medium-sized production model. (For clarity, this is the 3.0 version of Claude 3 Sonnet, released March 4, 2024. It is the exact model in production as of the writing of this paper, and it is the finetuned model, not the base pretrained model, although our method also works on the base model.) (View Highlight)
- We find a diversity of highly abstract features. They both respond to and behaviorally cause abstract behaviors. Examples of features we find include features for famous people, features for countries and cities, and features tracking type signatures in code. Many features are multilingual (responding to the same concept across languages) and multimodal (responding to the same concept in both text and images), as well as encompassing both abstract and concrete instantiations of the same idea (such as code with security vulnerabilities, and abstract discussion of security vulnerabilities). (View Highlight)
- Some of the features we find are of particular interest because they may be safety-relevant – that is, they are plausibly connected to a range of ways in which modern AI systems may cause harm. In particular, we find features related to security vulnerabilities and backdoors in code; bias (including both overt slurs, and more subtle biases); lying, deception, and power-seeking (including treacherous turns); sycophancy; and dangerous / criminal content (e.g., producing bioweapons). However, we caution not to read too much into the mere existence of such features: there’s a difference (for example) between knowing about lies, being capable of lying, and actually lying in the real world. This research is also very preliminary. Further work will be needed to understand the implications of these potentially safety-relevant features. (View Highlight)
- Key Results
  • Sparse autoencoders produce interpretable features for large models.
  • Scaling laws can be used to guide the training of sparse autoencoders.
  • The resulting features are highly abstract: multilingual, multimodal, and generalizing between concrete and abstract references.
  • There appears to be a systematic relationship between the frequency of concepts and the dictionary size needed to resolve features for them.
  • Features can be used to steer large models (see e.g. Influence on Behavior).
  • We observe features related to a broad range of safety concerns, including deception, sycophancy, bias, and dangerous content. (View Highlight)
- At a high level, the linear representation hypothesis suggests that neural networks represent meaningful concepts – referred to as features – as directions in their activation spaces. The superposition hypothesis accepts the idea of linear representations and further hypothesizes that neural networks use the existence of almost-orthogonal directions in high-dimensional spaces to represent more features than there are dimensions. (View Highlight)
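The geometric fact behind the superposition hypothesis can be checked directly: in a high-dimensional space, random directions are very nearly orthogonal, so far more than d almost-orthogonal "feature" directions can coexist with little interference. A small illustrative sketch (not from the paper; dimensions and counts are arbitrary):

```python
import numpy as np

# In high dimensions, random unit vectors are nearly orthogonal to one another,
# so a d-dimensional space can host many more than d almost-orthogonal directions.
rng = np.random.default_rng(0)
d, n_dirs = 1024, 20_000
dirs = rng.standard_normal((n_dirs, d), dtype=np.float32)
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

# Cosine similarities of 10,000 random pairs of directions.
i, j = rng.integers(0, n_dirs, size=(2, 10_000))
cos = np.einsum("nd,nd->n", dirs[i], dirs[j])
print(f"mean |cos| = {np.abs(cos).mean():.3f}, max |cos| = {np.abs(cos).max():.3f}")
# Typical |cos| is about 1/sqrt(d) ≈ 0.03, i.e. nearly orthogonal.
```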
- If one believes these hypotheses, the natural approach is to use a standard method called dictionary learning. Recently, several papers have suggested that this can be quite effective for transformer language models. In particular, a specific approximation of dictionary learning called a sparse autoencoder appears to be very effective. (View Highlight)
- To date, these efforts have been on relatively small language models by the standards of modern foundation models. Our previous paper, which focused on a one-layer model, was a particularly extreme example of this. As a result, an important question has been left hanging: will these methods work for large models? Or is there some reason, whether pragmatic questions of engineering or more fundamental differences in how large models operate, that would mean these efforts can’t generalize? (View Highlight)
- Our high-level goal in this work is to decompose the activations of a model (Claude 3 Sonnet) into more interpretable pieces. We do so by training a sparse autoencoder (SAE) on the model activations, as in our prior work and that of several other groups (see Related Work). SAEs are an instance of a family of “sparse dictionary learning” algorithms that seek to decompose data into a weighted sum of sparsely active components. (View Highlight)
- Our SAE consists of two layers. The first layer (“encoder”) maps the activity to a higher-dimensional layer via a learned linear transformation followed by a ReLU nonlinearity. We refer to the units of this high-dimensional layer as “features.” The second layer (“decoder”) attempts to reconstruct the model activations via a linear transformation of the feature activations. The model is trained to minimize a combination of (1) reconstruction error and (2) an L1 regularization penalty on the feature activations, which incentivizes sparsity. (View Highlight)
- Once the SAE is trained, it provides us with an approximate decomposition of the model’s activations into a linear combination of “feature directions” (SAE decoder weights) with coefficients equal to the feature activations. The sparsity penalty ensures that, for many given inputs to the model, a very small fraction of features will have nonzero activations. Thus, for any given token in any given context, the model activations are “explained” by a small set of active features (out of a large pool of possible features). For more motivation and explanation of SAEs, see the Problem Setup section of Towards Monosemanticity (View Highlight)
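For concreteness, here is a minimal PyTorch sketch of the kind of SAE described above: a linear encoder followed by a ReLU producing feature activations, a linear decoder whose rows are the feature directions, and a loss combining reconstruction MSE with an L1 penalty. It is an illustration only; the paper's actual initialization, normalization, and optimization details are omitted, and the L1 coefficient of 5 merely echoes the value mentioned later in the text.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: linear encoder + ReLU -> sparse features -> linear decoder."""

    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, n_features) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(n_features))
        # Rows of W_dec are the "feature directions" used to rebuild activations.
        self.W_dec = nn.Parameter(torch.randn(n_features, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor):
        f = torch.relu(x @ self.W_enc + self.b_enc)   # feature activations (sparse)
        x_hat = f @ self.W_dec + self.b_dec           # reconstruction of the activations
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 5.0):
    """Weighted sum of reconstruction MSE and an L1 sparsity penalty on features."""
    mse = (x - x_hat).pow(2).sum(dim=-1).mean()
    l1 = f.abs().sum(dim=-1).mean()
    return mse + l1_coeff * l1
```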
- In this work, we focused on applying SAEs to residual stream activations halfway through the model (i.e. at the “middle layer”). We made this choice for several reasons. First, the residual stream is smaller than the MLP layer, making SAE training and inference computationally cheaper. Second, focusing on the residual stream in theory helps us mitigate an issue we call “cross-layer superposition” (see Limitations for more discussion). We chose to focus on the middle layer of the model because we reasoned that it is likely to contain interesting, abstract features (View Highlight)
- Training SAEs on larger models is computationally intensive. It is important to understand (1) the extent to which additional compute improves dictionary learning results, and (2) how that compute should be allocated to obtain the highest-quality dictionary possible for a given computational budget. (View Highlight)
- Though we lack a gold-standard method of assessing the quality of a dictionary learning run, we have found that the loss function we use during training – a weighted combination of reconstruction mean-squared error (MSE) and an L1 penalty on feature activations – is a useful proxy, conditioned on a reasonable choice of the L1 coefficient. That is, we have found that dictionaries with low loss values (using an L1 coefficient of 5) tend to produce interpretable features and to improve other metrics of interest (the L0 norm, and the number of dead or otherwise degenerate features). Of course, this is an imperfect metric, and we have little confidence that it is optimal. It may well be the case that other L1 coefficients (or other objective functions altogether) would be better proxies to optimize. (View Highlight)
- With this proxy, we can treat dictionary learning as a standard machine learning problem, to which we can apply the “scaling laws” framework for hyperparameter optimization. In an SAE, compute usage primarily depends on two key hyperparameters: the number of features being learned, and the number of steps used to train the autoencoder (which maps linearly to the amount of data used, as we train the SAE for only one epoch). The compute cost scales with the product of these parameters if the input dimension and other hyperparameters are held constant. (View Highlight)
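A rough sketch of how one might use this framing: with compute proportional to (number of features) × (training steps), sweep allocations at a fixed budget, score each run with the proxy loss, and repeat across budgets to fit scaling trends. The training function below is a placeholder, not the paper's infrastructure.

```python
def train_sae_and_eval_loss(n_features: int, n_steps: int) -> float:
    """Placeholder: train an SAE of this size for this many steps and
    return its final training loss (the proxy discussed above)."""
    raise NotImplementedError

def best_allocation(compute_budget: int, feature_grid: list[int]):
    """At a fixed budget, find which features/steps split gives the lowest proxy loss."""
    results = []
    for n_features in feature_grid:
        n_steps = max(1, compute_budget // n_features)   # compute ≈ n_features * n_steps
        loss = train_sae_and_eval_loss(n_features, n_steps)
        results.append((loss, n_features, n_steps))
    return min(results)   # (loss, n_features, n_steps) at this budget

# Repeating best_allocation() across several budgets and fitting how the optimum
# shifts with compute is the "scaling laws" step described in the text.
```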
- In the previous section, we described how we trained sparse autoencoders on Claude 3 Sonnet. And as predicted by scaling laws, we achieved lower losses by training large SAEs. But the loss is only a proxy for what we actually care about: interpretable features that explain model behavior. (View Highlight)
- The goal of this section is to investigate whether these features are actually interpretable and explain model behavior. We’ll first look at a handful of relatively straightforward features and provide evidence that they’re interpretable. Then we’ll look at two much more complex features, and demonstrate that they track very abstract concepts. We’ll close with an experiment using automated interpretability to evaluate a larger number of features and compare them to neurons. (View Highlight)
- In this subsection, we’ll look at a few features and argue that they are genuinely interpretable. Our goal is just to demonstrate that interpretable features exist, leaving strong claims (such as most features being interpretable) to a later section. We will provide evidence that our interpretations are good descriptions of what the features represent and how they function in the network, using an analysis similar to that in Towards Monosemanticity (View Highlight)
- For each feature, we attempt to establish the following claims:
- When the feature is active, the relevant concept is reliably present in the context (specificity).
- Intervening on the feature’s activation produces relevant downstream behavior (influence on behavior). (View Highlight)
- It is difficult to rigorously measure the extent to which a concept is present in a text input. In our prior work, we focused on features that unambiguously corresponded to sets of tokens (e.g., Arabic script or DNA sequences) and computed the likelihood of that set of tokens relative to the rest of the vocabulary, conditioned on the feature’s activation. This technique does not generalize to more abstract features. Instead, to demonstrate specificity in this work we more heavily leverage automated interpretability methods. We use the same automated interpretability pipeline as in our previous work in the features vs. neurons section below, but we additionally find that current-generation models can now more accurately rate text samples according to how well they match a proposed feature interpretation. (View Highlight)
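To illustrate what an automated specificity check of this kind might look like (this is not the paper's exact pipeline or rubric), one can ask a grader model to score how well each text sample matches a proposed interpretation; `ask_grader_model` is a hypothetical stand-in for a call to whatever capable LLM is available.

```python
RUBRIC = """You will rate how well a text sample matches a proposed description of a
feature, on a scale from 0 (unrelated) to 3 (cleanly and unambiguously about the
described concept). Reply with a single integer.

Feature description: {description}
Text sample: {sample}
Rating:"""

def ask_grader_model(prompt: str) -> str:
    """Placeholder for a call to a grader LLM; not a real API."""
    raise NotImplementedError

def rate_specificity(description: str, samples: list[str]) -> list[int]:
    """Score each sample for how specifically it matches the proposed interpretation."""
    return [int(ask_grader_model(RUBRIC.format(description=description, sample=sample)))
            for sample in samples]
```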
- First, we study a Golden Gate Bridge feature 34M/31164353. Its greatest activations are essentially all references to the bridge, and weaker activations also include related tourist attractions, similar bridges, and other monuments. Next, a brain sciences feature 34M/9493533 activates on discussions of neuroscience books and courses, as well as cognitive science, psychology, and related philosophy. In the 1M training run, we also find a feature that strongly activates for various kinds of transit infrastructure 1M/3 including trains, ferries, tunnels, bridges, and even wormholes! A final feature 1M/887839 responds to popular tourist attractions including the Eiffel Tower, the Tower of Pisa, the Golden Gate Bridge, and the Sistine Chapel. (View Highlight)
- As in Towards Monosemanticity, we see that these features become less specific as the activation strength weakens. This could be due to the model using activation strengths to represent confidence in a concept being present. Or it may be that the feature activates most strongly for central examples of the feature, but weakly for related ideas – for example, the Golden Gate Bridge feature 34M/31164353 appears to weakly activate for other San Francisco landmarks. It could also reflect imperfection in our dictionary learning procedure. For example, it may be that the architecture of the autoencoder is not able to extract and discriminate among features as cleanly as we might want. And of course interference from features that are not exactly orthogonal could also be a culprit, making it more difficult for Sonnet itself to activate features on precisely the right examples. It is also plausible that our feature interpretations slightly misrepresent the feature’s actual function, and that this inaccuracy manifests more clearly at lower activations. Nonetheless, we often find that lower activations tend to maintain some specificity to our interpretations, including related concepts or generalizations of the core feature. As an illustrative example, weak activations of the transit infrastructure feature 1M/3 include procedural mechanics instructions describing which through-holes to use for particular parts. (View Highlight)
- Moreover, we expect that very weak activations of features are not especially meaningful, and thus we are not too concerned with low specificity scores for these activation ranges. For instance, we have observed that techniques such as rounding feature activations below a threshold to zero can improve specificity at the low-activation end of the spectrum without substantially increasing the reconstruction error of the SAE, and there are a variety of techniques in the literature that potentially address the same issue (View Highlight)
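A minimal sketch of the rounding technique mentioned above: zero out feature activations below a cutoff before reconstructing, then check that the reconstruction error barely moves. The cutoff value is an arbitrary placeholder, not a value from the paper.

```python
import torch

def threshold_features(f: torch.Tensor, cutoff: float) -> torch.Tensor:
    """Round feature activations below a cutoff to zero; stronger activations pass through."""
    return torch.where(f >= cutoff, f, torch.zeros_like(f))

# Illustrative usage with the SAE sketched earlier:
# x_hat, f = sae(x)
# x_hat_thresholded = threshold_features(f, cutoff=0.1) @ sae.W_dec + sae.b_dec
# Compare reconstruction error with and without thresholding to check the claim above.
```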
- Note that we have had more difficulty in quantifying feature sensitivity – that is, how reliably a feature activates for text that matches our proposed interpretation – in a scalable, rigorous way. This is due to the difficulty of generating text related to a concept in an unbiased fashion. Moreover, many features may represent something more specific than we are able to glean with our visualizations, in which case they would not respond reliably to text selected based on our proposed interpretation, and this problem gets harder the more abstract the features are. As a basic check, however, we observe that the Golden Gate Bridge feature still fires strongly on the first sentence of the Wikipedia article for the Golden Gate Bridge in various languages (after removing any English parentheticals). In fact, the Golden Gate Bridge feature is the top feature by average activation for every example below. (View Highlight)
- Next, to demonstrate whether our interpretations of features accurately describe their influence on model behavior, we experiment with feature steering, where we “clamp” specific features of interest to artificially high or low values during the forward pass (see Methodological Details for implementation details). We conduct these experiments with prompts in the “Human:”/“Assistant:” format that Sonnet is typically used with. We find that feature steering is remarkably effective at modifying model outputs in specific, interpretable ways. It can be used to modify the model’s demeanor, preferences, stated goals, and biases; to induce it to make specific errors; and to circumvent model safeguards (see also Safety-Relevant Features). We find this compelling evidence that our interpretations of features line up with how they are used by the model. (View Highlight)
- For instance, we see that clamping the Golden Gate Bridge feature 34M/31164353 to 10× its maximum activation value induces thematically-related model behavior. In this example, the model starts to self-identify as the Golden Gate Bridge! Similarly, clamping the Transit infrastructure feature 1M/3 to 5× its maximum activation value causes the model to mention a bridge when it otherwise would not. In each case, the downstream influence of the feature appears consistent with our interpretation of the feature, even though these interpretations were made based only on the contexts in which the feature activates and we are intervening in contexts in which the feature is inactive. (View Highlight)
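Mechanically, feature steering of the kind described above might look like the following sketch (an illustration under stated assumptions, not the paper's exact implementation, which is described in their Methodological Details): decompose the residual-stream activation with the SAE, clamp the chosen feature's activation, and feed the steered reconstruction, plus the SAE's reconstruction error, back into the forward pass.

```python
import torch

def steer_residual(x: torch.Tensor, sae, feature_idx: int, clamp_value: float) -> torch.Tensor:
    """Clamp one SAE feature to a fixed value and rebuild the residual-stream activation.

    x: residual-stream activations at the SAE's layer, shape [..., d_model].
    sae: the SparseAutoencoder sketched earlier.
    clamp_value: e.g. several times the feature's maximum observed activation
                 (or a low/negative value to suppress the feature).
    """
    x_hat, f = sae(x)
    error = x - x_hat                      # keep whatever the SAE fails to reconstruct
    f_steered = f.clone()
    f_steered[..., feature_idx] = clamp_value
    x_steered = f_steered @ sae.W_dec + sae.b_dec
    return x_steered + error               # substitute this back into the forward pass
```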
- So far we have presented features in Claude 3 Sonnet that fire on relatively simple concepts. These features are in some ways similar to those found in Towards Monosemanticity which, because they were trained on the activations of a 1-layer Transformer, reflected a very shallow knowledge of the world. For example, we found features that correspond to predicting a range of common nouns conditioned on a fairly general context (e.g. biology nouns following “the” in the context of biology). (View Highlight)
- A natural question to ask about SAEs is whether the feature directions they uncover are more interpretable than, or even distinct from, the neurons of the model. We fit our SAEs on residual stream activity, which to first approximation has no privileged basis (but see ) – thus the directions in the residual stream are not especially meaningful. However, residual stream activity receives inputs from all preceding MLP layers. Thus, a priori, it could be the case that SAEs identify feature directions in the residual stream whose activity reflects the activity of individual neurons in preceding layers. If that were the case, fitting an SAE would not be particularly useful, as we could have identified the same features by simply inspecting MLP neurons. (View Highlight)
- To address this question, for a random subset of the features in our 1M SAE, we measured the Pearson correlation between its activations and those of every neuron in all preceding layers. Similar to our findings in Towards Monosemanticity, we find that for the vast majority of features, there is no strongly correlated neuron – for 82% of our features, the most-correlated neuron has a correlation of 0.3 or smaller. Manually inspecting visualizations for the best-matching neuron for a random set of features, we found almost no resemblance in semantic content between the feature and the corresponding neuron. We additionally confirmed that feature activations are not strongly correlated with activations of any residual stream basis direction. (View Highlight)
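A sketch of this correlation check: over a shared set of tokens, compute the Pearson correlation between each sampled feature's activations and every preceding-layer neuron's activations, and record the best match. Shapes and data collection are simplified.

```python
import torch

def max_neuron_correlation(feature_acts: torch.Tensor, neuron_acts: torch.Tensor) -> torch.Tensor:
    """Pearson correlation of each feature with its most-correlated neuron.

    feature_acts: [n_tokens, n_features] SAE feature activations.
    neuron_acts:  [n_tokens, n_neurons]  activations of all preceding MLP neurons.
    Returns:      [n_features] maximum |correlation| over neurons.
    """
    f = feature_acts - feature_acts.mean(dim=0, keepdim=True)
    n = neuron_acts - neuron_acts.mean(dim=0, keepdim=True)
    f = f / (f.norm(dim=0, keepdim=True) + 1e-8)
    n = n / (n.norm(dim=0, keepdim=True) + 1e-8)
    corr = f.T @ n                        # [n_features, n_neurons] Pearson correlations
    return corr.abs().max(dim=1).values
```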
- Even if dictionary learning features are not highly correlated with any individual neurons, it could still be the case that the neurons are interpretable. However, upon manual inspection of a random sample of 50 neurons and features each, the neurons appear significantly less interpretable than the features, typically activating in multiple unrelated contexts. To quantify this difference, we first compared the interpretability of 100 randomly chosen features versus that of 100 randomly chosen neurons. We did this with the same automated interpretability approach outlined in Towards Monosemanticity, but using Claude 3 Opus to provide explanations of features and predict their held-out activations. We find that activations of a random selection of SAE features are significantly more interpretable on average than a random selection of MLP neurons. (View Highlight)
- The features we find in Sonnet are rich and diverse. These range from features corresponding to famous people, to regions of the world (countries, cities, neighborhoods, and even famous buildings!), to features tracking type signatures in computer programs, and much more besides. Our goal in this section is to provide some sense of this breadth. One challenge is that we have millions of features. Scaling feature exploration is an important open problem (see Limitations, Challenges, and Open Problems), which we do not solve in this paper. Nevertheless, we have made some progress in characterizing the space of features, aided by automated interpretability. We will first focus on the local structure of features, which are often organized in geometrically-related clusters that share a semantic relationship. We then turn to understanding more global properties of features, such as how comprehensively they cover a given topic or category. Finally, we examine some categories of features we uncovered through manual inspection. (View Highlight)
- Here we walk through the local neighborhoods of several features of interest across the 1M, 4M and 34M SAEs, with closeness measured by the cosine similarity of the feature vectors. We find that this consistently surfaces features that share a related meaning or context — the interactive feature UMAP has additional neighborhoods to explore. (View Highlight)
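A minimal sketch of how such neighborhoods can be computed from the decoder weights alone: normalize the feature directions and take the top-k by cosine similarity to a query feature (the value of k and the query index are placeholders).

```python
import torch

def nearest_features(W_dec: torch.Tensor, query_idx: int, k: int = 10):
    """Return the k features whose decoder directions are closest (cosine) to a query feature.

    W_dec: [n_features, d_model] SAE decoder weights, one direction per feature.
    """
    dirs = W_dec / W_dec.norm(dim=1, keepdim=True)
    sims = dirs @ dirs[query_idx]          # cosine similarity to the query feature
    sims[query_idx] = -1.0                 # exclude the query itself
    return torch.topk(sims, k)             # (similarities, feature indices)
```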
- Another potential application of features is that they let us examine the intermediate computation that the model uses to produce an output. As a proof of concept, we observe that in prompts where intermediate computation is required, we find active features corresponding to some of the expected intermediate results. (View Highlight)
- A simple strategy for efficiently identifying causally important features for a model’s output is to compute attributions, which are local linear approximations of the effect of turning a feature off at a specific location on the model’s next-token prediction. More explicitly: we compute the gradient of the difference between an output logit of interest and the logit of a specific other baseline token (or the average of the logits across all tokens) with respect to the residual stream activations in the middle layer. Then the attribution of that logit difference to a feature is defined as the dot product of that gradient with the feature vector (SAE decoder weight), multiplied by the feature’s activation. This method is equivalent to the “attribution patching” technique introduced in Attribution Patching: Activation Patching At Industrial Scale, except that we use a baseline value of 0 for the feature instead of a baseline value taken from the feature’s activity on a second prompt. We also perform feature ablations, where we clamp a feature’s value to zero at a specific token position during a forward pass, which measures the full, potentially nonlinear causal effect of that feature’s activation in that position on the model output. This is much slower since it requires one forward pass for every feature that activates at each position, so we often used attribution as a preliminary step to filter the set of features to ablate. (View Highlight)
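Once the gradient of the logit difference is in hand, the attribution described above reduces to a single dot product per feature. A minimal sketch (how the gradient is obtained from the model is elided here):

```python
import torch

def feature_attributions(grad_resid: torch.Tensor,
                         feature_acts: torch.Tensor,
                         W_dec: torch.Tensor) -> torch.Tensor:
    """Local linear estimate of each feature's effect on a logit difference.

    grad_resid:   [d_model]   gradient of (logit_target - logit_baseline) with respect
                              to the mid-layer residual stream at one token position,
                              e.g. obtained via torch.autograd.grad on that activation.
    feature_acts: [n_features] SAE feature activations at that position.
    W_dec:        [n_features, d_model] SAE decoder weights (feature directions).
    """
    # attribution_i = (gradient · feature_direction_i) * activation_i
    return (W_dec @ grad_resid) * feature_acts
```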
- As an example, we consider the following incomplete prompt:
John says, “I want to be alone right now.” John feels (completion: sad − happy)
To continue this text, the model must parse the quote from John, identify his state of mind, and then translate that into a likely feeling. If we sort features by either their attribution or their ablation effect on the completion “sad” (with respect to a baseline completion of “happy”), the top two features are:
• 1M/22623 – This feature fires when someone expresses a need or desire to be alone or have personal time and space, as in “she would probably want some time to herself”. This is active from the word “alone” onwards. This suggests the model has gotten the gist of John’s expression.
• 1M/781220 – This feature detects expressions of sadness, crying, grief, and related emotional distress or sorrow, as in “the inconsolable girl sobs”. This is active on “John feels”. This suggests the model has inferred what someone who says they are alone might be feeling. (View Highlight)
- We now investigate an incomplete prompt requiring a longer chain of inferences:
Fact: The capital of the state where Kobe Bryant played basketball is (completion: Sacramento − Albany)
To continue this text, the model must identify where Kobe Bryant played basketball, what state that place was in, and then the capital of that state. We compute attributions and ablation effects for the completion “Sacramento” (the correct answer, which Sonnet knows) with respect to the baseline “Albany” (Sonnet’s most likely alternative single-token capital completion). The top five features by ablation effect (which match those by attribution effect, modulo reordering) are:
• 1M/391411 – A Kobe Bryant feature
• 1M/81163 – A California feature, which notably activates the most strongly on text after “California” is mentioned, rather than “California” itself
• 1M/201767 – A “capital” feature
• 1M/980087 – A Los Angeles feature
• 1M/447200 – A Los Angeles Lakers feature (View Highlight)
- These features, which provide an interpretable window into the model’s intermediate computations, are much harder to find by looking through the strongly active features; for example, the Lakers feature is the 70th most strongly active across the prompt, the California feature is 97th, and the Los Angeles area code feature is 162nd. In fact, only three out of the ten most strongly active features are among the ten features with highest ablation effect. (View Highlight)
- Our SAEs contain too many features to inspect exhaustively. As a result, we found it necessary to develop methods to search for features of particular interest, such as those that may be relevant for safety, or that provide special insight into the abstractions and computations used by the model. In our investigations, we found that several simple methods were helpful in identifying significant features. (View Highlight)
- Our primary strategy was to use targeted prompts. In some cases, we simply supplied a single prompt that relates to the concept of interest and inspected the features that activate most strongly for specific tokens in that prompt. This method (and all the following methods) was made much more effective by automated interpretability labels, which made it easier to get a sense of what each feature represents at a glance, and provided a kind of helpful “variable name”. (View Highlight)
- Often the top-activating features on a prompt are related to syntax, punctuation, specific words, or other details of the prompt unrelated to the concept of interest. In such cases, we found it useful to select for features using sets of prompts, filtering for features active for all the prompts in the set. We often included complementary “negative” prompts and filtered for features that were also not active for those prompts. In some cases, we use Claude 3 models to generate a diversity of prompts covering a topic (e.g. asking Claude to generate examples of “AIs pretending to be good”). In general, we found multi-prompt filtering to be a very useful strategy for quickly identifying features that capture a concept of interest while excluding confounding concepts. (View Highlight)
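A sketch of the multi-prompt filtering idea described above: keep features that activate on every positive prompt and on none of the negative prompts. `get_feature_activations` is a placeholder for running the model plus SAE on a prompt and returning each feature's maximum activation.

```python
import torch

def get_feature_activations(prompt: str) -> torch.Tensor:
    """Placeholder: run the model + SAE on `prompt`, return [n_features] max activations."""
    raise NotImplementedError

def filter_features(positive_prompts: list[str],
                    negative_prompts: list[str],
                    threshold: float = 0.0) -> torch.Tensor:
    """Indices of features active on all positive prompts and inactive on all negative ones."""
    pos = torch.stack([get_feature_activations(p) for p in positive_prompts])
    keep = (pos > threshold).all(dim=0)
    if negative_prompts:
        neg = torch.stack([get_feature_activations(p) for p in negative_prompts])
        keep &= (neg <= threshold).all(dim=0)
    return torch.nonzero(keep).flatten()
```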
- Powerful models have the capacity to cause harm, through misuse of their capabilities, the production of biased or broken outputs, or a mismatch between model objectives and human values. Mitigating such risks and ensuring model safety has been a key motivation behind much of mechanistic interpretability. However, it’s generally been aspirational. We’ve hoped interpretability will someday help, but are still laying the foundations by trying to understand the basics of models. One target for bridging that gap has been the goal of identifying safety-relevant features (see our previous discussion). (View Highlight)
- We don’t think the existence of these features should be particularly surprising, and we caution against inferring too much from them. It’s well known that models can exhibit these behaviors without adequate safety training or if jailbroken. The interesting thing is not that these features exist, but that they can be discovered at scale and intervened on. In particular, we don’t think the mere existence of these features should update our views on how dangerous models are – as we’ll discuss later, that question is quite nuanced – but at a minimum it compels study of when these features activate. A truly satisfactory analysis would likely involve understanding the circuits that safety-relevant features participate in. (View Highlight)
- In the long run, we hope that having access to features like these can be helpful for analyzing and ensuring the safety of models. For example, we might hope to reliably know whether a model is being deceptive or lying to us. Or we might hope to ensure that certain categories of very harmful behavior (e.g. helping to create bioweapons) can reliably be detected and stopped. Despite these long term aspirations, it’s important to note that the present work does not show that any features are actually useful for safety. Instead, we merely show that there are many which seem plausibly useful for safety. Our hope is that this can encourage future work to establish whether they are genuinely useful. (View Highlight)
- It’s natural to wonder what these results mean for the safety of large language models. We caution against inferring too much from these preliminary results. Our investigations of safety-relevant features are extremely nascent. It seems likely our understanding will evolve rapidly in the coming months. (View Highlight)
- In general, we don’t think the mere existence of the safety-relevant features we’ve observed should be that surprising. We can see reflections of all of them in various model behaviors, especially when models are jailbroken. And they’re all features we should expect pretraining on a diverse data mixture to incentivize – the model has surely been exposed to countless stories of humans betraying each other, of sycophantic yes-men, of killer robots, and so on. (View Highlight)
- A more interesting question is: when do these features activate? Going forwards, we’re particularly interested in studying:
  • What features activate on tokens we’d expect to signify Claude’s self-identity? Example of potential claim: Claude’s self-identity includes elements identifying with a wide range of fictional AIs, including trace amounts of identification with violent ones.
  • What features need to activate / remain inactive for Claude to give advice on producing Chemical, Biological, Radiological or Nuclear (CBRN) weapons? Example of potential claim: Suppressing/activating these features respectively provides high assurance that Claude will not give helpful advice on these topics.
  • What features activate when we ask questions probing Claude’s goals and values?
  • What features activate during jailbreaks?
  • What features activate when Claude is trained to be a sleeper agent? And how do these features relate to the linear probe directions already identified that predict harmful behavior from such an agent?
  • What features activate when we ask Claude questions about its subjective experience?
  • Can we use the feature basis to detect when fine-tuning a model increases the likelihood of undesirable behaviors? (View Highlight)
- Given the potential implications of these investigations, we believe it will be important for us and others to be cautious in making strong claims. We want to think carefully about several potential shortcomings of our methodology, including:
  • Illusions from suboptimal dictionary learning, such as messy feature splitting. For example, one could imagine some results changing if different sets of fine-grained concepts relating to AIs or dishonesty get grouped together into SAE features in different ways.
  • Cases where the downstream effects of features diverge from what we might expect given their activation patterns.
  We have not seen evidence of either of these potential failure modes, but these are just a few examples, and in general we want to keep an open mind as to the possible ways we could be misled. (View Highlight)