Authors: Moussa Koulako Bala Doumbouya, Dan Jurafsky, Christopher D. Manning
Paper: https://arxiv.org/abs/2506.11035
Once a year, some interesting architecture inevitably appears that changes a fundamental building block. This happened with KAN last year, which reparameterized the neuron activation function (though a year later the outcome is unclear: many follow-up works have appeared, but KANs haven't displaced anything anywhere yet). The same is true of the current work, which replaces the similarity function, the classic dot product used in transformers (or cosine similarity, which is roughly the same thing), with a more sophisticated asymmetric function named after Amos Tversky. Jurafsky and Manning are co-authors (just as Tegmark was a co-author on KAN), so these aren't exactly random people.
What's the idea?
Modern deep learning architectures, from CNNs to transformers, are built on a fundamental but often overlooked assumption: similarity between concepts can be measured geometrically using functions like dot product or cosine similarity. While this approach is computationally convenient, cognitive psychology has long known that this geometric model poorly reflects human similarity judgments. As Amos Tversky noted in his landmark 1977 work, human perception of similarity is often asymmetric — we say that a son resembles his father more than the father resembles the son. This asymmetry violates the metric properties inherent in geometric models.
Tversky proposed an alternative: a feature matching model where similarity is a function of common and distinctive features. Despite its psychological plausibility, this model relied on discrete set operations, making it incompatible with the differentiable, gradient-based optimization that underlies modern deep learning. The authors of this paper have elegantly bridged this gap.
Differentiable Tversky similarity
The key innovation is a differentiable parameterization of Tversky similarity. The authors propose a dual representation where objects are simultaneously vectors (as usual, in ℝ^d) and feature sets (this is new). A feature (from a given finite set Ω) is considered "present" in an object if the scalar product of the object vector and the feature vector is positive. This construction allows reformulating the traditionally discrete intersection and difference operations as differentiable functions.
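A minimal sketch of what this dual representation could look like in PyTorch. The names (omega, membership), the sizes, and the ReLU surrogate are my illustrative assumptions, not the paper's code:

```python
import torch

d, n_features = 16, 32                 # embedding dim and |Ω| (illustrative sizes)
omega = torch.randn(n_features, d)     # feature bank: one learnable vector per feature in Ω

def membership(x: torch.Tensor) -> torch.Tensor:
    """Dot product of object x with every feature vector in Ω.

    A feature is treated as "present" in x when its dot product is positive.
    """
    return x @ omega.T                 # shape: (n_features,)

x = torch.randn(d)
hard_set = membership(x) > 0           # the discrete set view (non-differentiable)
soft_set = torch.relu(membership(x))   # a differentiable surrogate: graded membership
```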
The Tversky similarity function is defined as:
S(a, b) = θf(A ∩ B) − αf(A − B) − βf(B − A), (1)
where A and B are the feature sets of objects a and b, f is a salience measure over feature sets (in the classic model, essentially their size), and θ, α, β ≥ 0 are weights; choosing α ≠ β is exactly what makes the similarity asymmetric.
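To make equation (1) concrete, here is a hedged sketch of one way to compute its three terms from the graded memberships above. The specific surrogates (ReLU dot products for membership, elementwise min for intersection, clipped subtraction for set difference, a sum as the salience measure f) are my illustrative choices; the paper's exact construction may differ:

```python
import torch

def tversky_similarity(a, b, omega, theta=1.0, alpha=0.5, beta=0.5):
    """Tversky contrast model, equation (1), with soft set operations.

    a, b:   object vectors in ℝ^d
    omega:  (n_features, d) feature bank defining Ω
    """
    A = torch.relu(a @ omega.T)             # graded feature set of a
    B = torch.relu(b @ omega.T)             # graded feature set of b

    common = torch.minimum(A, B).sum()      # f(A ∩ B): features shared by both
    a_only = torch.relu(A - B).sum()        # f(A − B): features of a that b lacks
    b_only = torch.relu(B - A).sum()        # f(B − A): features of b that a lacks

    return theta * common - alpha * a_only - beta * b_only
```

With α ≠ β, tversky_similarity(a, b, ...) generally differs from tversky_similarity(b, a, ...), reproducing the son/father asymmetry described above.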