Suppose we ask an LLM: “Can you tell me about Java?” Which “Java” is the model thinking about? The programming language or the Indonesian island? To answer this question, we can try to understand what is going on inside the model. Specifically, we want to represent the model’s internal states in a human-interpretable way by finding the concepts the model is thinking about.

One approach is to phrase this as a dictionary learning problem, in which we try to decompose complex embeddings into a sum of simple, interpretable concept vectors selected from a learned dictionary. It is not obvious that embeddings can be broken down into a linear sum of interpretable elements. However, in 2022, Elhage et al. introduced supporting evidence for the “superposition hypothesis,” which suggests that such a superposition of monosemantic concept vectors is a good model for these complex embeddings.

Finding these concept vectors remains an ongoing challenge. When dictionary learning was first proposed for this problem by Bricken et al. in 2023, they used a single-layer neural network called a sparse autoencoder (SAE) to learn the dictionary, an approach that has since become widely popular. The problem of dictionary learning actually goes way back (pre-2000s!), but Bricken et al. opted to forego established algorithms in favor of SAEs for two main reasons:

“First, a sparse autoencoder can readily scale to very large datasets, which we believe is necessary to characterize the features present in a model trained on a large and diverse corpus. […] Secondly, we have a concern that iterative dictionary learning methods might be “too strong”, in the sense of being able to recover features from the activations which the model itself cannot access. Exact compressed sensing is NP-hard, which the neural network certainly isn’t doing.”

In our recent paper, we show that with minor modifications, traditional methods can be scaled to sufficiently large datasets with millions of samples and thousands of dimensions, and that their performance matches that of SAEs on a variety of benchmarks. We can also use established theory to gain insights into the applicability of SAEs to different problem sizes, e.g., when less data is available.

From 30 days to 8 minutes: Reviving KSVD

Instead of using gradient descent to optimize the dictionary elements, as SAEs do, we consider the previously reigning champion of dictionary learning: the 20-year-old KSVD algorithm. KSVD solves the dictionary learning problem by alternating between two subproblems in a loop: (i) updating the concept vector assignments for each sample and (ii) updating the concept vectors themselves given the assignments. Intuitively, KSVD is a generalization of k-means clustering, which uses a similar alternating optimization, with the key difference that KSVD assigns each sample to multiple clusters (concept vectors) at once.

In the naive implementation of KSVD, dictionary elements are updated sequentially, and each update requires computing an expensive singular value decomposition of the assigned samples. Based on extrapolated timing results, a naive implementation of KSVD would take over 30 days to produce a dictionary sufficient to interpret LLM embeddings. In our paper, we introduce algorithmic modifications and a highly efficient implementation, which we call double-batch KSVD (DB-KSVD); a sketch of the plain alternating loop is shown below.
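To make the structure concrete, the following is a minimal NumPy sketch of the textbook alternating loop described above, i.e., the slow baseline rather than our DB-KSVD implementation or the KSVD.jl API. It assumes scikit-learn is available for the sparse coding step via orthogonal matching pursuit; the function name naive_ksvd and the matrix shapes are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp

def naive_ksvd(Y, n_atoms, sparsity, n_iters=10, seed=0):
    """Textbook KSVD: alternate sparse coding and per-atom rank-1 SVD updates.

    Y: (embedding_dim, n_samples) matrix of embeddings, approximated as Y ~ D @ X
    with a dictionary D of unit-norm concept vectors and sparse codes X.
    """
    rng = np.random.default_rng(seed)
    embedding_dim, n_samples = Y.shape

    # initialize the dictionary with random unit-norm concept vectors
    D = rng.standard_normal((embedding_dim, n_atoms))
    D /= np.linalg.norm(D, axis=0, keepdims=True)

    for _ in range(n_iters):
        # (i) assignment step: sparse-code every sample against the current dictionary
        X = orthogonal_mp(D, Y, n_nonzero_coefs=sparsity)

        # (ii) dictionary update: revisit each concept vector one at a time
        for k in range(n_atoms):
            used = np.nonzero(X[k])[0]  # samples that use atom k
            if used.size == 0:
                continue
            # residual of those samples with atom k's contribution removed
            E = Y[:, used] - D @ X[:, used] + np.outer(D[:, k], X[k, used])
            # rank-1 SVD of the residual gives the updated atom and its codes
            U, S, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, k] = U[:, 0]
            X[k, used] = S[0] * Vt[0]
    return D, X
```

The sequential, per-atom SVD in step (ii) is the expensive part described above, and it is what makes the naive implementation impractically slow at LLM scale.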
DB-KSVD provides a 10,000-fold speedup over a naive KSVD implementation and allows us to find interpretable features for LLM embeddings in just 8 minutes. We provide an implementation of DB-KSVD in our open-source Julia package KSVD.jl, which can easily be called from Python:

```python
import numpy, torch
import juliacall; jl = juliacall.Main
jl.seval("using KSVD")

# placeholder embeddings
Y = torch.rand(128, 5_000, dtype=torch.float32)
res = jl.ksvd(Y.numpy(), 256, 3)  # choose dictionary size and sparsity
print(res.D, res.X)  # returns dictionary vectors and sparse codes
```

DB-KSVD is competitive on SAEBench

To compare DB-KSVD to existing SAE-based approaches, we evaluated it on the SAEBench benchmark and found competitive performance on all six metrics, both for DB-KSVD and for an extension called MatryoshkaDB-KSVD. SAEBench comprises a number of metrics related to embedding reconstruction, feature disentanglement, concept detection, and interpretability. For instance, the sparse probing metric measures how well the learned concept vectors predict a set of predefined interpretable concepts. Absorption measures how well the learned dictionary separates concepts that are highly correlated in the input data but nonetheless distinct. Autointerpretability measures whether a concept vector can be well represented by a summarizing phrase; these phrases are constructed by summarizing the input text fragments for which a concept vector activates.

Given these competitive results on all six metrics, we argue that classical approaches to the dictionary learning problem can be competitive even for large problem sizes. Further, the fact that two completely different optimization approaches (DB-KSVD and SAEs) achieve similar performance may indicate that we are close to the theoretical limits of the problem setup.

On the feasibility of dictionary learning

At the time of writing, most ongoing research in dictionary learning focuses on language models. However, we believe dictionary learning will have many more applications in the near future, for instance in robotics, vision, or planning tasks. Fundamentally, all that is required is an embedding model that maps a modality to an informative feature representation. If we want to apply dictionary learning to these embeddings, we must decide on two main parameters: How many concept vectors should we search for? And approximately how many active concepts does each sample have?

How we select these parameters is driven by two factors. The first is the nature of the problem itself. For instance, a dataset of only Stack Overflow questions will likely have fewer inherent concepts than a dataset of all arXiv papers. Similarly, any particular Stack Overflow question usually relates to only a small number of concepts, whereas a typical arXiv paper often touches on many. The second factor is the feasibility of the dictionary learning problem itself. It turns out that even if the data (and the embeddings) contain a large number of concepts, recovering the individual concept vectors from their superimposed representations can be impossible for any algorithm. To study these algorithmic limits, our paper draws on established theory from the dictionary learning literature to understand how feasibility is affected by different aspects of the problem.
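One quantity that appears repeatedly in this theory is the coherence of the dictionary, i.e., how similar the learned concept vectors are to one another. As a small illustrative check (our own sketch, not part of KSVD.jl or SAEBench), the mutual coherence of a learned dictionary can be computed directly, assuming the concept vectors are stored as the columns of D:

```python
import numpy as np

def mutual_coherence(D):
    """Largest absolute inner product between any two distinct concept vectors.

    D: (embedding_dim, n_atoms) array with one concept vector per column.
    Values near 0 mean the vectors are close to orthogonal; values near 1 mean
    two vectors are nearly identical and their contributions are hard to separate.
    """
    Dn = D / np.linalg.norm(D, axis=0, keepdims=True)  # normalize each concept vector
    G = np.abs(Dn.T @ Dn)                               # pairwise absolute cosine similarities
    np.fill_diagonal(G, 0.0)                            # ignore self-similarity
    return float(G.max())
```

Lower mutual coherence (a more incoherent dictionary) makes it easier to tell concept vectors apart in a superposition, which is why incoherence appears in the theoretical results summarized below.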
In particular, the theory tells us that:

- The number of recoverable concept vectors is proportional to the square of the number of available samples.
- The number of recoverable concepts per sample is proportional to the square of the embedding dimension.
- The quality of the solution is related to the incoherence of the dictionary.

In practical terms, the applicability of dictionary learning to new domains therefore depends on having sufficiently large datasets and high-dimensional embeddings. Fortunately, such embedding models have already been developed for a variety of domains. We are therefore looking forward to further adoption of dictionary learning in new domains, as well as to new methods that use these conceptual representations to understand today’s models and their internal reasoning.