A 20-Year-Old Algorithm Can Help Us Understand Transformer Embeddings
Suppose we ask an LLM: “Can you tell me about Java?” What “Java” is the model thinking about? The programming language or the Indonesian island? To answer this question, we can try to understand what is going on inside the model. Specifically, we want to represent the model’s internal states in a human-interpretable way by finding the concepts that the model is thinking about. One approach to this problem is to phrase it as a dictionary learning problem, in which we try to decompose complex emb