
Aligning machine and human visual representations across abstraction levels


Soft alignment

This section is organized as follows. We begin by describing how we transform model representations into a space that matches human similarity judgements about coarse-grained semantic object relations. Specifically, we introduce an affine transformation that injects the uncertainties humans assign to their triplet odd-one-out choices into a model’s representation space, creating a surrogate teacher model. Using the teacher model’s human-aligned representations, we sample triplets of ImageNet38 images non-uniformly, clustering the representations into superordinate categories and using those clusters to partition the data. We then pseudo-label these triplets with human-aligned judgement distributions from the surrogate teacher model. Finally, having created the AligNet triplets, we fine-tune student models with a triplet loss objective function.

Representational alignment

Data

To increase the degree of alignment between human and neural network similarity spaces, we begin from the publicly available THINGS dataset, a large behavioural dataset of 4.7 million unique triplet responses from 12,340 human participants for m = 1,854 natural object images51 from the public THINGS object concept and image database26. The THINGS dataset can formally be defined as \(D:= {(\{a_{s},b_{s}\}\mid \{i_{s},j_{s},k_{s}\})}_{s=1}^{n}\), a dataset of n object triplets and corresponding human odd-one-out responses, where \(\{a_{s},b_{s}\}\subset \{i_{s},j_{s},k_{s}\}\) is the object pair that a human participant chose from the s-th triplet as having the highest similarity. Let \({\bf{X}}\in {{\mathbb{R}}}^{m\times p}\) be the teacher model representations for the m = 1,854 objects in the THINGS dataset, where p is the dimension of the image-representation vector. Note that each category in the THINGS dataset is represented by a single object image. From X we can construct a similarity matrix for all object pairs, \({\bf{S}}:= {\bf{X}}{{\bf{X}}}^{\top }\in {{\mathbb{R}}}^{m\times m}\), where \(S_{i,j}={{\bf{x}}}_{i}^{\top }{{\bf{x}}}_{j}\) is the representational similarity of objects i and j, \(\top \) denotes the matrix transpose, and \({{\bf{x}}}_{i}\) refers to the i-th row of X.
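As a concrete sketch, the pairwise similarity matrix can be built in NumPy as follows. Here random vectors stand in for the teacher representations, and the embedding dimension p = 64 is an arbitrary choice for illustration; the real value depends on the teacher model.

```python
import numpy as np

# Stand-in teacher representations: m objects, p-dimensional embeddings.
# p = 64 is illustrative only; the real dimension depends on the teacher model.
m, p = 1854, 64
rng = np.random.default_rng(0)
X = rng.standard_normal((m, p))

# S = X X^T, so S[i, j] = x_i . x_j is the similarity of objects i and j,
# where x_i is the i-th row of X.
S = X @ X.T

assert S.shape == (m, m)
assert np.allclose(S, S.T)  # dot-product similarity is symmetric
```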

Odd-one-out accuracy

The triplet odd-one-out task is frequently used in the cognitive sciences to measure human notions of object similarity52,53,54,55. To measure the degree of alignment between human and neural network similarity judgements in the THINGS triplet task, we embed the m = 1,854 THINGS images into the representation space \({\bf{X}}\in {{\mathbb{R}}}^{m\times p}\) of a neural network. Given the vector representations \({{\bf{x}}}_{1}\), \({{\bf{x}}}_{2}\) and \({{\bf{x}}}_{3}\) of the three images in a triplet, we first construct a similarity matrix \({\bf{S}}\in {{\mathbb{R}}}^{3\times 3}\), where \(S_{i,j}:= {{\bf{x}}}_{i}^{\top }{{\bf{x}}}_{j}\) is the dot product of a pair of image representations. We identify the closest pair of images in the triplet as \(\arg \max _{i,j > i}S_{i,j}\), with the remaining image being the odd one out. We define odd-one-out accuracy as the fraction of triplets in which the odd one out chosen by a model is identical to the human odd-one-out choice. Our goal is therefore to learn an affine transformation into the THINGS human object similarity space of the form \({{\bf{x}}}^{{\prime} }={\bf{W}}{\bf{x}}+{\bf{b}}\), where \({\bf{W}}\in {{\mathbb{R}}}^{p\times p}\) is a learned transformation matrix, \({\bf{b}}\in {{\mathbb{R}}}^{p}\) is a bias and \({\bf{x}}\in {{\mathbb{R}}}^{p}\) is the neural network representation of a single object image in the THINGS dataset. We learn the affine transformation for the representation space of the teacher model’s image encoder (see the ‘Surrogate teacher model’ section for details about the teacher model). Using this affine transformation, an entry of the pairwise similarity matrix S′, representing the similarity between two object images i and j, can now be written as \(S_{i,j}^{{\prime} }:= {({\bf{W}}{{\bf{x}}}_{i}+{\bf{b}})}^{\top }({\bf{W}}{{\bf{x}}}_{j}+{\bf{b}})\).
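The odd-one-out rule and the transformed similarity can be sketched as below. This is a minimal illustration assuming plain NumPy arrays; the function names are ours, not from the paper.

```python
import numpy as np

def odd_one_out(x1, x2, x3):
    """Index (0, 1 or 2) of the model's odd-one-out choice: the closest
    pair is argmax_{i, j>i} S_{i,j}; the remaining image is the odd one out."""
    X = np.stack([x1, x2, x3])          # (3, p)
    S = X @ X.T                         # 3 x 3 pairwise similarities
    pairs = [(0, 1), (0, 2), (1, 2)]
    i, j = max(pairs, key=lambda ij: S[ij])
    return ({0, 1, 2} - {i, j}).pop()   # the image not in the closest pair

def odd_one_out_accuracy(triplets, human_choices):
    """Fraction of triplets where the model's choice matches the human's."""
    hits = [odd_one_out(*t) == c for t, c in zip(triplets, human_choices)]
    return float(np.mean(hits))

def transformed_similarity(W, b, xi, xj):
    """S'_{i,j} = (W xi + b)^T (W xj + b) under the learned affine map."""
    return (W @ xi + b) @ (W @ xj + b)
```

For example, given two near-identical vectors and one orthogonal vector, `odd_one_out` returns the index of the orthogonal vector.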

Hard-alignment loss

Given a similarity matrix of neural network representations S and a triplet {i, j, k}, the likelihood that a particular pair \(\{a,b\}\subset \{i,j,k\}\) is the most similar, with the remaining object being the odd one out, is modelled by the softmax of the object similarities,

$$\sigma ({\bf{S}},\tau ):= \frac{\exp (S_{a,b}/\tau )}{\exp (S_{i,j}/\tau )+\exp (S_{i,k}/\tau )+\exp (S_{j,k}/\tau )}$$ (1)
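Equation (1) is a softmax over the three pair similarities of the triplet. A sketch, with the max subtracted before exponentiating for numerical stability (the indexing convention {0, 1, 2} for the triplet is our assumption):

```python
import numpy as np

def pair_likelihood(S, a, b, tau=1.0):
    """Likelihood under equation (1) that pair {a, b} of the triplet
    {0, 1, 2} is the most similar (the remaining object is the odd one out)."""
    pairs = [(0, 1), (0, 2), (1, 2)]
    logits = np.array([S[p] / tau for p in pairs])
    z = np.exp(logits - logits.max())   # subtract the max for stability
    probs = z / z.sum()                 # softmax over the three pairs
    return probs[pairs.index(tuple(sorted((a, b))))]
```

With all similarities equal, each pair receives probability 1/3; a lower temperature tau sharpens the distribution towards the most similar pair.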
