Apple researchers have created an AI model that reconstructs a 3D object from a single image, while keeping reflections, highlights, and other effects consistent across different viewing angles. Here are the details.
A bit of context
While the concept of latent space in machine learning is not exactly new, it has become more popular than ever in recent years, with the explosion of AI models based on the transformer architecture and, more recently, world models.
In a nutshell (and at the risk of being slightly imprecise for the sake of the bigger picture), “latent space,” or “embedding space,” are terms that describe what happens when you:

- Boil information down to numerical representations of its underlying concepts;
- Organize those numbers in a multi-dimensional space, making it possible to calculate the distances between them along each dimension.
If that still sounds too abstract, one classic example: take the vector for the token “king”, subtract the vector for “man”, and add the vector for “woman”, and you will end up in the general multi-dimensional region of the token “queen”.
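The king/queen example above can be sketched in a few lines of Python. The three-dimensional vectors below are made-up illustrative values (real embedding spaces typically have hundreds or thousands of dimensions), and cosine similarity is one common way to measure closeness in such a space:

```python
import math

# Toy 3-dimensional embeddings. These values are invented for illustration;
# they are not taken from any real model.
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.9, 0.1, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "queen": [0.1, 0.8, 0.9],
}

def cosine_similarity(a, b):
    """Measure how close two vectors point in the same direction (1.0 = identical)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Compute king - man + woman, element by element.
result = [
    k - m + w
    for k, m, w in zip(embeddings["king"], embeddings["man"], embeddings["woman"])
]

# Find which known token the resulting vector lands closest to.
nearest = max(embeddings, key=lambda token: cosine_similarity(embeddings[token], result))
print(nearest)  # → queen
```

In these toy values the arithmetic lands exactly on “queen”; in a real model it only lands in the same general region, and a nearest-neighbor search like the one above recovers the token.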
In practical terms, storing information as mathematical representations in latent space makes it faster and less computationally expensive to measure distances between them and estimate the probability of what should be generated.
Although the examples above focus on storing text in latent space, the same idea can be applied to many other types of data, which brings us to Apple’s study.
LiTo: Surface Light Field Tokenization