Weight-sparse transformers have interpretable circuits [pdf]

(cdn.openai.com)

78 points | by 0x79de 9 days ago

7 comments

  • oli5679 1 day ago
    This ties directly into the superposition theory.

    It is believed dense models cram many features into shared weights, making circuits hard to interpret.

    Sparsity reduces that pressure by giving features more isolated space, so individual neurons are more likely to represent a single, interpretable concept.
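
    As a toy numerical illustration of that pressure (mine, not from the paper): packing many more random feature directions than dimensions into a shared space makes every readout pick up a little interference from every other feature, whereas giving each feature its own dedicated weights removes that interference.

      import numpy as np

      rng = np.random.default_rng(0)
      d, n_features = 64, 512          # many more features than dimensions

      # "Superposed" storage: each feature gets a random direction in d dims.
      F = rng.normal(size=(n_features, d))
      F /= np.linalg.norm(F, axis=1, keepdims=True)

      # Reading out feature 0 also picks up interference from every other feature.
      readout = F @ F[0]
      print("signal:", readout[0])                               # exactly 1.0
      print("mean |interference|:", np.abs(readout[1:]).mean())  # ~1/sqrt(d), not 0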

    • HarHarVeryFunny 21 hours ago
      Yes, although the sparsity doesn't need to be inherent to the model - another approach is to try to decode the learned weights using approaches like sparse auto-encoders or transcoders.

      https://transformer-circuits.pub/2025/attribution-graphs/met...

      • leogao 19 hours ago
        I'm also very excited about SAE/Transcoder based approaches! I think the big tradeoff is that our approach (circuit sparsity) is aiming for a full complete understanding at any cost, whereas Anthropic's Attribution Graph approach is more immediately applicable to frontier models, but gives handwavier circuits. It turns out "any cost" is really quite a lot of cost - we think this cost can be reduced a lot with further research, but it means our main results are on very small models, and the path to applying any of this to frontier models involves a lot more research risk. So if accepting a bit of handwaviness lets us immediately do useful things on frontier models, this seems like a worthwhile direction to explore.

        See also some work we've done on scaling SAEs: https://arxiv.org/abs/2406.04093

  • lambdaone 1 day ago
    I find this fascinating, as it raises the possibility of a single framework that can unify neural and symbolic computation by "defuzzing" activations into what are effectively symbols. Has anyone looked at the possibility of going the other way, by fuzzifying logical computation?
    • calebh 22 hours ago
      Yes, you can relax logic gates into continuous versions, which makes the system differentiable. An AND gate can be constructed with the function x*y and a NOT gate with 1-x (on inputs in the range [0,1]). From there you can construct a NAND gate, which is universal and can be used to construct all other gates. Sigmoid can be used to squash the inputs into [0,1] if necessary.

      This paper lists out all 16 possible logic gates in Table 1 if you're interested in this sort of thing: https://arxiv.org/abs/2210.08277
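
      Here's a tiny runnable sketch of that relaxation (my own toy version; the linked paper parameterizes its gates differently):

        # Continuous relaxations of logic gates on inputs in [0, 1].
        def AND(x, y):  return x * y
        def NOT(x):     return 1.0 - x
        def NAND(x, y): return NOT(AND(x, y))
        def OR(x, y):   return NOT(AND(NOT(x), NOT(y)))   # De Morgan, from AND and NOT

        # At the corners these reproduce Boolean logic; in between they are differentiable.
        for a in (0.0, 1.0):
            for b in (0.0, 1.0):
                print(a, b, "AND:", AND(a, b), "NAND:", NAND(a, b), "OR:", OR(a, b))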

    • leogao 19 hours ago
      There's been some work (e.g. RASP - https://arxiv.org/abs/2106.06981) on taking logical computations and compiling them into transformer weights.
    • esafak 19 hours ago
      https://en.wikipedia.org/wiki/Probabilistic_logic

      More generally, machine learning is all about dealing with imprecision, including logic.

    • smokel 22 hours ago
      Do you mean fuzzy logic [1]? It was all the hype in the 1990s.

      [1] https://en.wikipedia.org/wiki/Fuzzy_logic

    • radarsat1 1 day ago
      > fuzzifying logical computation?

      Isn't that basically what the sigmoid operator does? Or, more in the direction of averaging many logical computations, we have random forests.

  • m_ke 22 hours ago
    We really need new hardware optimized for sparse compute. Deep Learning models would work way better with much higher dimensional sparse vectors, but current hardware only excels at dense GEMMs and structured sparsity.
    • leogao 19 hours ago
      For what it's worth, we think it's unfortunately quite unlikely that frontier models will ever be trained with extreme unstructured sparsity, even with custom sparsity optimized hardware. Our main hope is that understanding sub-frontier models can still help a lot with ensuring safety of frontier models; an interpretable GPT-3 would be a very valuable object to have. It may also be possible to adapt our method to only explaining very small but important subsets of the model.
      • m_ke 18 hours ago
        Yeah, it's not happening anytime soon, especially with the whole economy betting trillions of dollars on brute force scaling of transformers on Manhattan-sized GPU farms that will use more energy than most Midwestern states.

        Brains do it somehow, so sparsely / locally activated architectures are probably the way to go long term, but we're decades away from that being commercially viable.

      • esafak 19 hours ago
        As the lead author, why do you think so?
        • leogao 18 hours ago
          I'm not an expert at hardware, so take this with a grain of salt, but there are two main reasons:

          - Discrete optimization is always going to be harder than continuous optimization. Learning the right sparsity mask is fundamentally a very discrete operation. So even just matching fully continuous dense models in optimization efficiency is likely to be difficult. Though perhaps we can get some hope from the fact that MoE is similarly fundamentally discrete, and it works in practice (we can think of MoE as incurring some penalty from imperfect gating, which is more than offset by the systems benefits of not having to run all the experts on every forward pass). Also, the optimization problem gets harder when the backwards pass needs to be entirely sparsified computation (see appendix B).

          - Dense matmuls are just fundamentally nicer to implement in hardware. Systolic arrays have nice, predictable data flows that are very local. Sparse matmuls with the same number of flops nominally need only (up to a multiplicative factor) the same memory bandwidth as an equivalent dense matmul, but they need to be able to route data from any memory unit to any vector compute unit. The locality of dense matmuls means that the computation of each tile only requires a small slice of both input matrices, so we only need to load those slices into shared memory; on the other hand, because GPU-to-GPU transfers are way slower, when we op-shard matmuls we replicate the data that is needed. Sparse matmuls would need either more replication within each compute die, or more all-to-all internal bandwidth, which means spending way more die space on huge crossbars and routing. Thankfully, the crossbars consume much less power than actual compute, so perhaps this could match dense in energy efficiency and not make thermals worse.
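
          To make the routing point concrete, here's a toy CSR-style sparse matvec (purely illustrative, nothing like a real kernel): each output row gathers input elements at arbitrary column indices, whereas a dense tile only ever touches a contiguous slice of the input.

            import numpy as np

            rng = np.random.default_rng(0)
            d = 8
            # Unstructured sparse matrix in CSR form: values, column indices, row offsets.
            indptr  = np.array([0, 2, 3, 5, 6, 8, 9, 11, 12])
            indices = rng.integers(0, d, size=indptr[-1])    # arbitrary columns per row
            data    = rng.normal(size=indptr[-1])
            x = rng.normal(size=d)

            y = np.zeros(d)
            for row in range(d):
                cols = indices[indptr[row]:indptr[row + 1]]  # scattered gather: no locality
                vals = data[indptr[row]:indptr[row + 1]]
                y[row] = vals @ x[cols]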

          It also seems very likely that once we create the interpretable GPT-1 (or 2, or 3) we will find that making everything unstructured sparse was overkill, and there are much more efficient pretraining constraints we can apply to models to 80/20 the interpretability. In general, a lot of my hope routes through learning things like this from the intermediate artifact (interpretable GPT-n).

          To be clear, it doesn't seem literally impossible that with great effort, we could create custom hardware, and vastly improve the optimization algorithms, etc, such that weight-sparse models could be vaguely close in performance to weight-dense models. It's plausible that with better optimization the win from arbitrary connectivity patterns might offset the hardware difficulties, and I could be overlooking something that would make the cost less than I expect. But this would require immense effort and investment to merely match current models, so it seems quite unrealistic compared to learning something from interpretable GPT-3 that helps us understand GPT-5.

    • yvdriess 21 hours ago
      Yes! I've been advocating for it inside the industry for a decade, but it is an uphill battle. Researchers can't easily publish that kind of work (even Google researchers), because you don't have hardware that can realistically train decently large models. The hardware companies don't want to take the risk of rethinking the CPU or accelerator architecture for sparse compute, because there are no large existing customers.
    • carterschonwald 21 hours ago
      There also need to be tools that can author that code!

      I'm starting to dust off some ideas I developed over a decade ago to build such a toolkit. I recently realized “egads, my stuff can express almost every major GPU/CPU optimization that’s relevant for modern deep learning… need to do a new version with an eye towards adoption in that area”. Plus every flavor of sparse.

      I also need to figure out whether some of the open-core ideas I have in mind would be attractive to early-stage investors who focus on the so-called deep tech end of the space. It definitely looks like I'll have to do ye olde “ask friends and acquaintances if they can point me to those folks” approach, since cold outreach historically is full of fail.

    • p1esk 21 hours ago
      > Deep Learning models would work way better with much higher dimensional sparse vectors

      Citations?

      • yvdriess 20 hours ago
        There has been plenty of evidence over the years. I don't have my bibliography handy right now, but you can find it by looking for sparse training or lottery ticket hypothesis papers.

        The intuition is that ANNs make better predictions on high-dimensional data, that sparse weights let you train the sparsity pattern as you train the weights, that the effective part of dense models is actually sparse (cf. the pruning/sparsification research), and that dense models grow too much in compute complexity to further increase model dimension sizes.

        • noosphr 20 hours ago
          If you can give that bibliography I'd love to read it. I have the same intuition, and a few papers seem to support it, but more (and more explicit) ones would be much better.
        • p1esk 20 hours ago
          I could not find any evidence that sparse models work better than dense models.
          • yvdriess 13 hours ago
            What do you mean by "work better" here? If it's better accuracy, then no, they are not better at the same weight dimensions.

            The big thing is that sparse models allow you to train models with significantly larger dimensionality, blowing up the dimensions by several orders of magnitude. More dimensions leading to better results does not seem to be under a lot of contention; the open questions are more about quantifying that. It's simply not shown experimentally because the hardware is not there to train it.

            • p1esk 13 hours ago
              > The big thing is that sparse models allow you to train models with significantly larger dimensionality, blowing up the dimensions by several orders of magnitude.

              Do you have any evidence to support this statement? Or are you imagining some not yet invented algorithms running on some not yet invented hardware?

              • yvdriess 4 hours ago
                Sparse matrices can increase in dimension while keeping the same number of non-zeroes; that part is self-evident. Sparse-weight models can be trained: you are probably already aware of RigL and SRigL, and there is other related work on unstructured and structured sparse training. You could argue that those adapt their algorithms to be executable on GPUs, and that none are training at 100x or 1000x the dimensions. Yes, that is the part that requires access to sparse compute hardware acceleration, which exists as prototypes [1] or is extremely expensive (Cerebras).

                [1] https://dl.acm.org/doi/10.1109/MM.2023.3295848
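
                As a back-of-the-envelope sketch of that scaling argument (illustrative numbers only): holding the non-zero count per neuron fixed keeps parameters growing linearly with width, while a dense layer grows quadratically.

                  nnz_per_row = 64                         # fixed connections per neuron
                  for width in (4_096, 40_960, 409_600):
                      dense_params  = width * width        # dense layer: quadratic in width
                      sparse_params = width * nnz_per_row  # sparse layer: linear in width
                      print(f"width {width:>7,}: dense {dense_params:,}  sparse {sparse_params:,}")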

          • tripplyons 19 hours ago
            All of the best open source LLMs right now use mixture-of-experts, which is a form of sparsity. They only use a small fraction of their parameters to process any given token.

            Examples:
            - GPT OSS 120b
            - Kimi K2
            - DeepSeek R1

            • leogao 19 hours ago
              Mixture of experts sparsity is very different from weight sparsity. In a mixture of experts, all weights are nonzero, but only a small fraction get used on each input. On the other hand, weight sparsity means only very few weights are nonzero, but every weight is used on every input. Of course, the two techniques can also be combined.
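
              A toy contrast (illustrative only, not how either is actually implemented): in MoE the expert weights are dense but only some experts run per input, while with weight sparsity most weights are exactly zero and every remaining weight runs on every input.

                import numpy as np

                rng = np.random.default_rng(0)
                d, n_experts, k = 16, 8, 2
                x = rng.normal(size=d)

                # MoE: all expert weights are dense, but only the top-k experts run per input.
                experts = rng.normal(size=(n_experts, d, d))
                gate = rng.normal(size=n_experts)
                y_moe = sum(experts[e] @ x for e in np.argsort(gate)[-k:])

                # Weight sparsity: one matrix with mostly zero entries,
                # but every nonzero weight participates on every input.
                W = rng.normal(size=(d, d)) * (rng.random((d, d)) < 0.05)
                y_sparse = W @ x
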
              • tripplyons 19 hours ago
                Correct. I was more focused on giving an example of sparsity being useful in general, because the comment I was replying to didn't specify which kind of sparsity.

                For weight sparsity, I know the BitNet 1.58 paper has some claims of improved performance by restricting weights to be either -1, 0, or 1, eliminating the need for multiplying by the weights, and allowing the weights with a value of 0 to be ignored entirely.

                Another kind of sparsity, while on the topic, is activation sparsity. I think there was an Nvidia paper that used a modified ReLU activation function to set more of the model's activations to 0.
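
                A rough sketch of the ternary-weight idea (my own toy version, not the BitNet implementation): with weights in {-1, 0, 1}, a matvec reduces to adding and subtracting selected inputs, and the zero weights can be skipped entirely.

                  import numpy as np

                  rng = np.random.default_rng(0)
                  d = 8
                  x = rng.normal(size=d)
                  W = rng.integers(-1, 2, size=(d, d))   # ternary weights in {-1, 0, 1}

                  # Multiply-free matvec: add where w == +1, subtract where w == -1, skip zeros.
                  y = np.array([x[W[i] == 1].sum() - x[W[i] == -1].sum() for i in range(d)])
                  assert np.allclose(y, W @ x)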

                • p1esk 19 hours ago
                  “Useful” does not mean “better”. It just means “we could not do dense”. All modern state-of-the-art models use dense layers (both weights and inputs). Quantization is also used to make models smaller and faster, but never better in terms of quality.

                  Based on all examples I’ve seen so far in this thread it’s clear there’s no evidence that sparse models actually work better than dense models.

              • yorwba 17 hours ago
                Yes, mixture of experts is basically structured activation sparsity. You could imagine concatenating the expert matrices into a huge block matrix and multiplying by an input vector where only the coefficients corresponding to activated experts are nonzero.

                From that perspective, it's disappointing that the paper only enforces modest amounts of activation sparsity, since holding the maximum number of nonzero coefficients constant while growing the number of dimensions seems like a plausible avenue to increase representational capacity without correspondingly higher computation cost.
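
                A tiny numerical check of that block-matrix view (a toy sketch, not how MoE kernels are actually written):

                  import numpy as np

                  rng = np.random.default_rng(0)
                  d, n_experts = 4, 6
                  experts = rng.normal(size=(n_experts, d, d))
                  x = rng.normal(size=d)
                  active = [1, 4]                            # the activated experts

                  # Concatenate experts into one wide block matrix ...
                  W_big = np.concatenate(experts, axis=1)    # shape (d, n_experts * d)
                  # ... and build a block-sparse input: x in the active blocks, zeros elsewhere.
                  x_big = np.zeros(n_experts * d)
                  for e in active:
                      x_big[e * d:(e + 1) * d] = x

                  assert np.allclose(W_big @ x_big, sum(experts[e] @ x for e in active))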

          • m_ke 20 hours ago
            https://transformer-circuits.pub/2022/toy_model/index.html

            https://arxiv.org/abs/1803.03635

            EDIT: don't have time to write it up, but here's gemini 3 with a short explanation:

            To simulate the brain's efficiency using Transformer-like architectures, we would need to fundamentally alter three layers of the stack: the *mathematical representation* (moving to high dimensions), the *computational model* (moving to sparsity), and the *physical hardware* (moving to neuromorphic chips).

            Here is how we could simulate a "Brain-Like Transformer" by combining High-Dimensional Computing (HDC) with Spiking Neural Networks (SNNs).

            ### 1. The Representation: Hyperdimensional Computing (HDC)

            Current Transformers use "dense" embeddings—e.g., a vector of 4,096 floating-point numbers (like `[0.1, -0.5, 0.03, ...]`). Every number matters. To mimic the brain, we would switch to *Hyperdimensional Vectors* (e.g., 10,000+ dimensions), but make them *binary and sparse*.

              * **Holographic Representation:** In HDC, concepts (like "cat") are stored as massive randomized vectors of 1s and 0s. Information is distributed "holographically" across the entire vector. You can cut the vector in half, and it still retains the information (just noisier), similar to how brain lesions don't always destroy specific memories.
              * **Math without Multiplication:** In this high-dimensional binary space, you don't need expensive floating-point matrix multiplication. You can use simple bitwise operations:
                  * **Binding (Association):** XOR operations (`A ⊕ B`).
                  * **Bundling (Superposition):** Majority rule (voting).
                  * **Permutation:** Bit shifting.
              * **Simulation Benefit:** This allows a Transformer to manipulate massive "context windows" using extremely cheap binary logic gates instead of energy-hungry floating-point multipliers.
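
            A toy sketch of those three operations on binary hypervectors (illustrative only):

              import numpy as np

              rng = np.random.default_rng(0)
              D = 10_000                                     # hyperdimensional binary vectors
              cat, pet, dog = (rng.integers(0, 2, D) for _ in range(3))

              bound = np.bitwise_xor(cat, pet)               # binding: associate two concepts
              bundle = ((cat + pet + dog) >= 2).astype(int)  # bundling: majority vote
              shifted = np.roll(cat, 1)                      # permutation: encode order/position

              # XOR is its own inverse, so unbinding recovers the other operand.
              assert np.array_equal(np.bitwise_xor(bound, cat), pet)
              # The bundle stays close (low Hamming distance) to each of its members.
              print((bundle != cat).mean())                  # ~0.25, vs ~0.5 for a random vector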
            
            ### 2. The Architecture: "Spiking" Attention Mechanisms

            Standard Attention is $O(N^2)$ because it forces every token to query every other token. A "Spiking Transformer" simulates the brain's "event-driven" nature.

              * **Dynamic Sparsity:** Instead of a dense matrix multiplication, neurons would only "fire" (send a signal) if their activation crosses a threshold. If a token's relevance score is low, it sends *zero* spikes. The hardware performs *no* work for that connection.
              * **The "Winner-Take-All" Circuit:** In the brain, inhibitory neurons suppress weak signals so only the strongest "win." A simulated Sparse Transformer would replace the Softmax function (which technically keeps all values non-zero) with a **k-Winner-Take-All** function.
                  * *Result:* The attention matrix becomes 99% empty (sparse). The system only processes the top 1% of relevant connections, similar to how you ignore the feeling of your socks until you think about them.
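
            A minimal sketch of k-winner-take-all standing in for softmax (illustrative only):

              import numpy as np

              def k_winner_take_all(scores, k):
                  """Zero out everything except the k largest scores, then renormalize."""
                  out = np.zeros_like(scores)
                  top = np.argsort(scores)[-k:]
                  out[top] = np.exp(scores[top])   # softmax restricted to the winners
                  return out / out.sum()

              scores = np.random.default_rng(0).normal(size=100)
              weights = k_winner_take_all(scores, k=5)
              print("fraction of exact zeros:", (weights == 0).mean())   # 0.95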
            
            ### 3. The Hardware: Neuromorphic Substrate

            Even if you write sparse code, a standard GPU (NVIDIA H100) is bad at running it. GPUs like dense, predictable blocks of numbers. To simulate the brain efficiently, we need *Neuromorphic Hardware* (like Intel Loihi or IBM NorthPole).

              * **Address Event Representation (AER):** Instead of a "clock" ticking every nanosecond forcing all neurons to update, the hardware is asynchronous. It sits idle (consuming nanowatts) until a "spike" packet arrives at a specific address.
              * **Processing-in-Memory (PIM):** To handle the high dimensionality (e.g., 100,000-dimensional vectors), the hardware moves the logic gates *inside* the RAM arrays. This eliminates the energy cost of moving those massive vectors back and forth.
            
            ### Summary: The Hypothetical "Spiking HD-Transformer"

            | Feature | Standard Transformer | Simulated "Brain-Like" Transformer |
            | :--- | :--- | :--- |
            | *Dimension* | Low (~4k), Dense, Float32 | *Ultra-High* (~100k), Sparse, Binary |
            | *Operation* | Matrix Multiplication (MACs) | *Bitwise XOR / Popcount* |
            | *Attention* | Global Softmax ($N^2$) | *Spiking k-Winner-Take-All* (Linear) |
            | *Activation* | Continuous (ReLU/GELU) | *Discrete Spikes* (Fire-or-Silence) |
            | *Hardware* | GPU (Synchronous) | *Neuromorphic* (Asynchronous) |

            • p1esk 18 hours ago
              I’m not sure why you’re talking about efficiency when the question is “do sparse models work better than dense models?” The answer is no, they don’t.

              Even the old LTH paper you cited trains a dense model and then tries to prune it without too much quality loss. Pruning is a well known method to compress models - to make them smaller and faster, not better.

              • m_ke 18 hours ago
                Before we had proper GPUs everyone said the same thing about Neural Networks.

                Current model architectures are optimized to get the most out of GPUs, which is why we have transformers dominating as they're mostly large dense matrix multiplies.

                There's plenty of work showing transformers improve with inner dimension size, but it's not feasible to scale them up further because it blows up parameter and activation sizes (including KV caches), so people turn to low-rank ("sparse") decompositions like MLA.

                The lottery ticket hypothesis work shows that most of the weights in current models are redundant and that we could get away with much smaller sparse models, but currently there's no advantage to doing so because on GPUs you still end up doing dense multiplies.

                Plenty of mech interp work shows that models are forced to commingle different concepts to fit them into the "low" dimensional vector space. (https://www.neelnanda.io/mechanistic-interpretability/glossa...)

                https://arxiv.org/abs/2210.06313

                https://arxiv.org/abs/2305.01610

                • p1esk 17 hours ago
                  Yes, we know that large dense layers work better than small dense layers (up to a point). We also know how to train large dense models and then prune them. But we don’t know how to train large sparse models to be better than large dense models. If someone figures it out then we can talk about building hardware for it.
                  • ted_dunning 6 hours ago
                    It isn't directly what you are asking for, but there is a similar relationship at work with respect to L_1 versus L_2 regularization. The number of samples required to train a model is O(log(d)) for L_1 and O(d) for L_2 where d is the dimensionality [1]. This relates to the standard random matrix results about how you can approximate high dimensional vectors in a log(d) space with (probably) small error.

                    At a very handwaving level, it seems reasonable that moving from L_1 to L_0 would have a similar relationship in learning complexity, but I don't think that has ever been addressed formally.

                    [1] https://www.andrewng.org/publications/feature-selection-l1-v...
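
                    A quick empirical sketch of that gap (a toy setup with arbitrary constants, not a reproduction of [1]): with far fewer samples than dimensions, an L_1-regularized fit recovers a sparse ground truth much better than an L_2-regularized one.

                      import numpy as np
                      from sklearn.linear_model import Lasso, Ridge

                      rng = np.random.default_rng(0)
                      d, k, n = 1000, 10, 100            # dimensions, true nonzeros, samples
                      w_true = np.zeros(d)
                      w_true[:k] = 1.0
                      X = rng.normal(size=(n, d))
                      y = X @ w_true + 0.01 * rng.normal(size=n)

                      for name, model in [("L1 (Lasso)", Lasso(alpha=0.05)),
                                          ("L2 (Ridge)", Ridge(alpha=1.0))]:
                          w_hat = model.fit(X, y).coef_
                          err = np.linalg.norm(w_hat - w_true) / np.linalg.norm(w_true)
                          print(f"{name}: relative error {err:.2f}")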

    • kwillets 17 hours ago
      My last dive into matrix computations was years ago, but the need was the same back then. We could sparsify matrices pretty easily, but the infrastructure was lacking. Some things never change.
  • robrenaud 17 hours ago
    I worked on a similar problem about a year ago, on large dense models.

    https://www.lesswrong.com/posts/PkeB4TLxgaNnSmddg/scaling-sp...

    In both cases, the goal is to actually learn a concrete circuit inside a network that solves specific Python next-token prediction tasks. We each end up with a crisp wiring diagram saying “these are the channels/neurons/heads that implement this particular bit of Python reasoning.”

    Both projects cast circuit discovery as a gradient-based selection problem over a fixed base model. We train a mask that picks out a sparse subset of computational nodes as “the circuit,” while the rest are ablated. Their work learns masks over a weight-sparse transformer; ours learns masks over SAE latents and residual channels. But in both cases, the key move is the same: use gradients to optimize which nodes are included, rather than relying purely on heuristic search or attribution patching. Both approaches also use a gradual hardening schedule (continuous masks that are annealed or sharpened over time) so that we can keep gradients useful early on, then spend extra compute to push the mask towards a discrete, minimal circuit that still reproduces the model’s behavior.

    The similarities extend to how we validate and stress-test the resulting circuits. In both projects, we drill down enough to notice “bugs” or quirks in the learned mechanism and to deliberately break it: by making simple, semantically small edits to the Python source, we can systematically cause the pruned circuit to fail and those failures generalize to the unpruned network. That gives us some confidence that we’re genuinely capturing the specific mechanism the model is using.
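
    For intuition, here's a stripped-down version of the shared mask-learning idea (a sketch only; both projects use more careful losses and node definitions, and run_with_mask/target here are hypothetical placeholders): a sigmoid mask with an annealed temperature that gradually hardens toward 0/1, plus an L1 penalty keeping the circuit small.

      import torch

      def learn_circuit_mask(n_nodes, run_with_mask, target, steps=2000, l1=1e-3):
          """Learn a near-binary mask over nodes; run_with_mask(mask) returns model output."""
          logits = torch.zeros(n_nodes, requires_grad=True)   # one logit per candidate node
          opt = torch.optim.Adam([logits], lr=1e-2)
          for step in range(steps):
              temp = 1.0 - 0.999 * step / steps               # anneal: sharpen the mask over time
              mask = torch.sigmoid(logits / temp)             # soft inclusion weights in (0, 1)
              loss = torch.nn.functional.mse_loss(run_with_mask(mask), target) + l1 * mask.sum()
              opt.zero_grad(); loss.backward(); opt.step()
          return (torch.sigmoid(logits) > 0.5).float()        # final discrete circuit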

  • Xmd5a 15 hours ago
    Related:

    From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning – https://arxiv.org/pdf/2505.17117 (LeCun/Jurafsky)

    > Large Language Models (LLMs) demonstrate striking linguistic capabilities that suggest semantic understanding (Singh et al., 2024; Li et al., 2024). Yet, a critical question remains unanswered: Do LLMs navigate the compression-meaning trade-off similarly to humans, or do they employ fundamentally different representational strategies? This question matters because true understanding, which goes beyond surface-level mimicry, requires representations that balance statistical efficiency with semantic richness (Tversky, 1977; Rosch, 1973b).

    > To address this question, we apply Rate-Distortion Theory (Shannon, 1948) and Information Bottleneck principles (Tishby et al., 2000) to systematically compare LLM and human conceptual structures. We digitize and release seminal cognitive psychology datasets (Rosch, 1973b; 1975; McCloskey & Glucksberg, 1978), which are foundational studies that shaped our understanding of human categorization but were previously unavailable in a machine-readable form. These benchmarks, comprising 1,049 items across 34 categories with both membership and typicality ratings, offer unprecedented empirical grounding for evaluating whether LLMs truly understand concepts as humans do. It also offers much better quality data than the current crowdsourcing paradigm.

    From typicality tests in the paper above, we can jump to:

    The Guppy Effect as Interference – https://arxiv.org/abs/1208.2362

    > One can refer to the situation wherein people estimate the typicality of an exemplar of the concept combination as more extreme than it is for one of the constituent concepts in a conjunctive combination as overextension. One can refer to the situation wherein people estimate the typicality of the exemplar for the concept conjunction as higher than that of both constituent concepts as double overextension. We posit that overextension is not a violation of the classical logic of conjunction, but that it signals the emergence of a whole new concept. The aim of this paper is to model the Guppy Effect as an interference effect using a mathematical representation in a complex Hilbert space and the formalism of quantum theory to represent states and calculate probabilities. This builds on previous work that shows that Bell Inequalities are violated by concepts [7, 8] and in particular by concept combinations that exhibit the Guppy Effect [1, 2, 3, 9, 10], and add to the investigation of other approaches using interference effects in cognition [11, 12, 13].

    And from quantum interference:

    Quantum-like contextuality in large language models – https://royalsocietypublishing.org/doi/epdf/10.1098/rspa.202...

    > This paper provides the first large-scale experimental evidence for contextuality in the large language model BERT. We constructed a linguistic schema modelled over a contextual quantum scenario, instantiated it in the Simple English Wikipedia, and extracted probability distributions for the instances. This led to the discovery of sheaf-contextual and CbD contextual instances. We prove that these contextual instances arise from semantically similar words by deriving an equation that relates degrees of contextuality to the Euclidean distance of BERT’s embedding vectors.

    How can large language models become more human – https://discovery.ucl.ac.uk/id/eprint/10196296/1/2024.cmcl-1...

    > Psycholinguistic experiments reveal that efficiency of human language use is founded on predictions at both syntactic and lexical levels. Previous models of human prediction exploiting LLMs have used an information theoretic measure called surprisal, with success on naturalistic text in a wide variety of languages, but under-performance on challenging text such as garden path sentences. This paper introduces a novel framework that combines the lexical predictions of an LLM with the syntactic structures provided by a dependency parser. The framework gives rise to an Incompatibility Fraction. When tested on two garden path datasets, it correlated well with human reading times, distinguished between easy and hard garden path, and outperformed surprisal.

    • Xmd5a 15 hours ago
      Most LM work implicitly uses surprisal = -log p(w | prefix) as the processing cost. But psycholinguistics keeps finding cases (garden-path sentences, etc.) where human difficulty is less about the next word being unlikely and more about how much of the current parse / interpretation has to be torn down and rebuilt. That’s essentially what Wang et al. formalize with their Incompatibility Fraction: they combine an LLM’s lexical predictions with a dependency parser, build a sheaf-style structure over prefixes, and measure how inconsistent the local parse distributions are with any single global structure. That incompatibility correlates with human reading times and distinguishes easy vs hard garden paths better than surprisal alone.
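
      For reference, per-token surprisal itself is cheap to compute; here's a minimal GPT-2 sketch (the model choice and the garden-path example are mine, and this is only the baseline that the Incompatibility Fraction improves on):

        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        tok = AutoTokenizer.from_pretrained("gpt2")
        model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

        text = "The old man the boat."           # classic garden-path sentence
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits

        # Surprisal of each token given its prefix: -log p(w_t | w_<t)
        logp = torch.log_softmax(logits[0, :-1], dim=-1)
        surprisal = -logp[torch.arange(ids.shape[1] - 1), ids[0, 1:]]
        for token, s in zip(tok.convert_ids_to_tokens(ids[0, 1:].tolist()), surprisal):
            print(f"{token!r:>12}  {s.item():5.2f} nats")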

      If you take that seriously, you end up with a different "surprise" objective: not just "this token was unlikely", but "this token forced a big update of my latent structure". In information-theoretic terms, the distortion term in a Rate–Distortion / Information Bottleneck objective stops being pure log-loss and starts to look like a backtracking cost on your semantic/structural state.

      Now look at Shani et al.’s From Tokens to Thoughts paper: they compare LLM embeddings to classic human typicality/membership data (Rosch, Hampton, etc.) using RDT/IB, and show that LLMs sit in a regime of aggressive compression: broad categories line up with humans, but fine-grained typicality and "weird" members get squashed. Humans, by contrast, keep higher-entropy, messier categories – they "waste bits" to preserve contextual nuance and prototype structure.

      Quantum cognition folks like Aerts have been arguing for years that this messiness is not a bug: phenomena like the Guppy effect (where "guppy" is a so-so Pet and a so-so Fish but a very typical Pet-Fish) are better modelled as interference in a Hilbert space, i.e. as emergent concepts rather than classical intersections. Lo et al. then show that large LMs (BERT) already exhibit quantum-like contextuality in their probability distributions: thousands of sheaf-contextual and tens of millions of CbD-contextual instances, with the degree of contextuality tightly related to embedding distances between competing words.

      Put those together and you get an interesting picture:

      Current LMs do live in a contextual / interference-ish regime at the probabilistic level, but their embedding spaces are still optimized for pointwise predictive compression, not for minimizing re-interpretation cost over time.

      If you instead trained them under a "surprise = prediction error + structural backtracking cost" objective (something like log-loss + sheaf incompatibility over parses/meanings), the optimal representations wouldn’t be maximally compressed clusters. They’d be the ones that make structural updates cheap: more typed, factorized, role-sensitive latent spaces where meaning is explicitly organized for recomposition rather than for squeezing out every last bit of predictive efficiency.

      That’s exactly the intuition behind DisCoCat / categorical compositional distributional semantics: you force grammar and semantics to share a compact closed category, treat sentence meaning as a tensor contraction over typed word vectors, and design the embedding spaces so that composition is a simple linear map. You’re trading off fine-grained, context-specific "this token in this situation" information for a geometry that makes it cheap to build and rebuild structured meanings.

      Wang et al.’s Incompatibility Fraction is basically a first step toward such an objective, Shani et al. quantify how far LMs are from the "human" point on the compression–meaning trade-off, Aerts/Lo show that both humans and LMs already live in a quantum/contextual regime, and DisCoCat gives a concrete target for what "structured, recomposable embeddings" could look like. If we ever switch from optimizing pure cross-entropy to "how painful is it to revise my world-model when this token arrives?", I’d expect the learned representations to move away from super-compact clusters and towards something much closer to those typed, compositional spaces.

  • peter_d_sherman 1 day ago
    >"To assess the interpretability of our models, we isolate the small sparse circuits that our models use to perform each task using a novel pruning method. Since interpretable models should be easy to untangle, individual behaviors should be implemented by compact standalone circuits.

    Sparse circuits are defined as a set of nodes connected by edges."

    ...which could also be considered/viewed as Graphs...

    (Then from earlier in the paper):

    >"We train models to have more understandable circuits by constraining most of their weights to be zeros, so that each neuron only has a few connections. To recover fine-grained circuits underlying each of several hand-crafted tasks, we prune the models to isolate the part responsible for the task. These circuits often contain neurons and residual channels that correspond to natural concepts, with a small number of straightforwardly interpretable connections between them.

    And (jumping around a bit more in the paper):

    >"A major difficulty for interpreting transformers is that the activations and weights are not directly comprehensible; for example, neurons activate in unpredictable patterns that don’t correspond to human-understandable concepts. One hypothesized cause is superposition (Elhage et al., 2022b), the idea that dense models are an approximation to the computations of a much larger untangled sparse network."

    A very interesting paper -- and a very interesting postulated potential relationship with superposition! (which also could be related to data compression... and if so, in turn, by relationship, potentially entropy as well...)

    Anyway, great paper!