Playing with Vision Embeddings

Corn kernels + Triumphal Arch = Kernel Arch

Embeddings are, in a sense, the native language of neural networks. They are how networks can encode a rich variety of semantically meaningful representations with just a list of numbers. However, those numbers are frustratingly opaque. You certainly won't be able to make sense of them by reading them one after another. In this post, we try to make sense of one neural network's embeddings.

The Model

The model we look at in this post is DINOv3 ViT-S (Siméoni et al., 2025). DINOv3 is interesting because it learns to map raw pixels to a rich feature space with very few priors. It doesn't know language and it can't describe what it sees, but it still learns to make sense of images. We won't go into full detail about how DINOv3 was trained, but two things matter for this post: it compresses any image into a single embedding (a list of 384 numbers), and it was trained so that different crops and augmentations of the same image have similar embeddings. Our goal is to understand what information is encoded in those 384 numbers.

Generating Images from Embeddings

To start playing around in this 384-dimensional space, we need some way to translate these numbers back into something humans can understand. The most natural place to do this is in the one language humans and DINOv3 both understand: images. More concretely, we want to take a point in this 384-dimensional space and generate an image whose DINOv3 embedding coincides with that point.

To do this, we leverage two ideas. The first is that DINOv3 is fully differentiable: when you feed an image into the model, you can compute how to tweak the pixels so that the output vector moves closer to some target. So we can maximize the cosine similarity between the generated image's embedding and a target embedding. People have generated images this way for a while; see, for example, DeepDream (Mordvintsev et al., 2015) and Olah et al.'s feature visualization work (Olah et al., 2017).
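To make the first idea concrete, here is a minimal sketch in plain Python. It is not the pipeline from this post: the "model" is a hypothetical fixed linear map (`weights`) standing in for DINOv3, the sizes are toy values rather than 384, and the pixels are optimized directly with a hand-derived gradient instead of autodiff. All names are illustrative assumptions.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

def encode(weights, pixels):
    # Toy stand-in for the network: a fixed linear map from pixels to an embedding.
    return [dot(row, pixels) for row in weights]

def cosine_grad(weights, pixels, target):
    # Gradient of cos(encode(pixels), target) with respect to the pixels.
    e = encode(weights, pixels)
    norm_e = math.sqrt(dot(e, e))
    norm_t = math.sqrt(dot(target, target))
    cos = dot(e, target) / (norm_e * norm_t)
    # d cos / d e = t / (|e||t|) - cos * e / |e|^2
    grad_e = [t / (norm_e * norm_t) - cos * ei / norm_e ** 2
              for ei, t in zip(e, target)]
    # Chain rule through e = W x gives W^T grad_e.
    return [sum(weights[i][j] * grad_e[i] for i in range(len(weights)))
            for j in range(len(pixels))]

def generate(weights, target, n_pixels, steps=2000, lr=0.5, seed=0):
    rng = random.Random(seed)
    pixels = [rng.gauss(0, 1) for _ in range(n_pixels)]  # start from noise
    for _ in range(steps):
        grad = cosine_grad(weights, pixels, target)
        # Gradient ascent: move the pixels so the similarity goes up.
        pixels = [p + lr * g for p, g in zip(pixels, grad)]
    return pixels

rng = random.Random(1)
N_PIXELS, EMBED_DIM = 16, 4  # toy sizes; the real embedding has 384 numbers
weights = [[rng.gauss(0, 1) for _ in range(N_PIXELS)] for _ in range(EMBED_DIM)]
reference = [rng.gauss(0, 1) for _ in range(N_PIXELS)]  # hidden "photo"
target = encode(weights, reference)

noise = generate(weights, target, N_PIXELS, steps=0)  # the starting point
image = generate(weights, target, N_PIXELS)           # after optimization
print("before:", round(cosine(encode(weights, noise), target), 3))
print("after: ", round(cosine(encode(weights, image), target), 3))
```

The optimized `image` generally does not equal `reference`: many different pixel arrangements map to the same direction, which is one reason the extra constraints described next matter.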

The second is that DINOv3 was trained such that different crops and augmentations of an image land close together in embedding space. We can mimic that same cropping and augmentation strategy when we build up the gradient for the pixels. This helps in two ways: first, it stops the optimizer from cheating with high-frequency noise (Olah et al., 2017); second, it optimizes for the model's own definition of sameness.
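One way to picture "building up the gradient" over several views is the sketch below. It is an illustration under simplifying assumptions, not the post's code: the only augmentation is a random square crop, the encoder is again a hypothetical linear map rather than DINOv3, and `SIZE`, `CROP`, the view count, and the step size are arbitrary toy values.

```python
import math
import random

SIZE = 8       # side length of the square toy image
CROP = 4       # side length of each random crop
EMBED_DIM = 3  # toy embedding size (the real one is 384)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def take_crop(image, top, left):
    # Flatten a CROP x CROP window of the image into a list of pixels.
    return [image[top + r][left + c] for r in range(CROP) for c in range(CROP)]

def cosine_and_grad(weights, pixels, target):
    # Cosine similarity of the toy embedding with the target, plus its pixel gradient.
    e = [dot(row, pixels) for row in weights]
    norm_e, norm_t = math.sqrt(dot(e, e)), math.sqrt(dot(target, target))
    cos = dot(e, target) / (norm_e * norm_t)
    grad_e = [t / (norm_e * norm_t) - cos * ei / norm_e ** 2
              for ei, t in zip(e, target)]
    grad_pixels = [sum(weights[i][j] * grad_e[i] for i in range(len(weights)))
                   for j in range(len(pixels))]
    return cos, grad_pixels

def multi_view_grad(weights, image, target, n_views, rng):
    # Average the gradient over several random crops, scattering each crop's
    # gradient back onto the image pixels the crop was taken from.
    grad = [[0.0] * SIZE for _ in range(SIZE)]
    for _ in range(n_views):
        top = rng.randrange(SIZE - CROP + 1)
        left = rng.randrange(SIZE - CROP + 1)
        _, g = cosine_and_grad(weights, take_crop(image, top, left), target)
        for k, gk in enumerate(g):
            r, c = divmod(k, CROP)
            grad[top + r][left + c] += gk / n_views
    return grad

def mean_similarity(weights, image, target):
    # Score every possible crop position so the evaluation is deterministic.
    positions = [(t, l) for t in range(SIZE - CROP + 1)
                 for l in range(SIZE - CROP + 1)]
    return sum(cosine_and_grad(weights, take_crop(image, t, l), target)[0]
               for t, l in positions) / len(positions)

rng = random.Random(0)
weights = [[rng.gauss(0, 1) for _ in range(CROP * CROP)] for _ in range(EMBED_DIM)]
target = [rng.gauss(0, 1) for _ in range(EMBED_DIM)]
image = [[rng.gauss(0, 1) for _ in range(SIZE)] for _ in range(SIZE)]

before = mean_similarity(weights, image, target)
for _ in range(300):
    grad = multi_view_grad(weights, image, target, n_views=8, rng=rng)
    image = [[p + 0.5 * g for p, g in zip(row, grad_row)]
             for row, grad_row in zip(image, grad)]
after = mean_similarity(weights, image, target)
print(round(before, 3), "->", round(after, 3))
```

Because every update must raise the similarity for many overlapping crops at once, a pattern that only works at one exact pixel alignment gets little support from the averaged gradient.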

There are a couple of other tricks we use to make the images look nicer: we produce the image with an untrained transformer backbone (similar in spirit to Deep Image Prior, Ulyanov et al., 2017), and we minimize an auxiliary total variation loss.
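For readers unfamiliar with total variation, the sketch below shows one common form of it. The post does not say which variant or weighting it uses, so this is an assumption: the anisotropic (L1) version on a single-channel image stored as a list of rows. `total_variation` is an illustrative name, not a function from the post.

```python
def total_variation(image):
    # Anisotropic total variation: the sum of absolute differences between
    # each pixel and its neighbour below and its neighbour to the right.
    height, width = len(image), len(image[0])
    vertical = sum(abs(image[r + 1][c] - image[r][c])
                   for r in range(height - 1) for c in range(width))
    horizontal = sum(abs(image[r][c + 1] - image[r][c])
                     for r in range(height) for c in range(width - 1))
    return vertical + horizontal

# A gentle left-to-right ramp versus a checkerboard of the same size.
smooth = [[0.1 * c for c in range(4)] for _ in range(4)]
checker = [[(r + c) % 2 for c in range(4)] for r in range(4)]

print(total_variation(smooth) < total_variation(checker))
```

Adding a small multiple of this quantity to the loss penalizes images where neighbouring pixels disagree sharply, which nudges the optimizer toward smoother results. The size of that multiple is a tuning choice the post does not specify.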

Once this pipeline is set up, when given an arbitrary direction in 384-dimensional space, we can generate an image that DINOv3 says would point in that direction. For example, below we take a photo of an alpine landscape, compute its DINOv3 embedding, and then use our generation technique to produce an image that points in that same direction.
