Tech News

Solving Semantle with the Wrong Embeddings

Why This Matters

This article highlights a robust approach to solving Semantle that doesn't depend on knowing the specific embedding model used by the game. By focusing on relative rankings of guesses rather than exact similarity scores, this method offers a more adaptable and realistic strategy for players and developers interested in semantic search and language understanding. It underscores the importance of flexible algorithms that can operate effectively across different models and datasets, advancing the development of more resilient AI tools.


In my last post, I wrote about a solver for Semantle (the word game where guesses are scored based on semantic similarity to a hidden target word) that Ethan Jantz and I built while at the Recurse Center.

If your goal is (as ours was) to solve Semantle in as few guesses as possible, that solver works great. But it requires knowing exactly what embedding model the game itself is using, and the exact cosine similarities between every word in the vocabulary. After writing the post, a former colleague from grad school, Daniel Vitek, reached out with an idea for a solver that doesn’t rely on that knowledge, and instead only uses information about which guesses are better than others.

It takes more guesses to win, but it’s more robust, in that it works even with a different underlying embedding model than the game uses. To me, it also feels more aligned with what it’s actually like to play Semantle. Daniel shared a notebook showing how it works, which I adapted into this solver.

Which guess is closer?

Let’s say I’m playing Semantle and guess 1 is “biology” (which scored 26.03) and guess 2 is “philosophy” (which scored 3.09). The most important piece of information I receive from the scores the game returns is the relative ordering: “biology” (guess 1) is closer to the target word than guess 2 (“philosophy”). That would make sense if the hidden target word were “medical.” For this solver, that relative rank is the core of what we need.
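To make the idea concrete, here is a small NumPy sketch of that ranking signal. The 3-d vectors are made up for illustration (a real solver would load embeddings from a model such as word2vec); the point is just that the solver consumes the ordering of the scores, not their exact values.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d embeddings, invented for illustration only.
emb = {
    "biology":    np.array([0.9, 0.2, 0.1]),
    "philosophy": np.array([0.1, 0.9, 0.3]),
    "medical":    np.array([0.8, 0.1, 0.2]),  # the hidden target
}

target = emb["medical"]
s1 = cosine(emb["biology"], target)
s2 = cosine(emb["philosophy"], target)

# Only the relative ordering matters to this solver, not the raw scores:
assert s1 > s2  # "biology" ranks closer to the target than "philosophy"
```

Because the solver only compares scores, it doesn't matter if its own embedding model produces different absolute similarities than the game's, as long as the rankings mostly agree.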

Here, guess 1 corresponds to embedding g₁, guess 2 corresponds to embedding g₂, and the target corresponds to embedding t. We can imagine a boundary in embedding space containing all points that are equally similar to both guesses (i.e. words that make the same angle with g₁ as with g₂). On one side of that boundary, g₁ would be more aligned with t; on the other, g₂ would be. Since Semantle told us g₁ was closer, the target must live on the g₁ side of that boundary.

Here’s what that would look like geometrically, pretending we’re only dealing with three dimensions rather than 300 for simplicity:

A single ranking comparison between two guesses splits the embedding space in two. All vectors on one side of the plane are more similar to g₁ than g₂. Guess 1 is better, so only those candidates could still be the target.
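That pruning step can be sketched in a few lines of NumPy. The interface below (a `prune` function over a candidate matrix) is my own framing, not necessarily how the authors' notebook structures it; the geometry is the same: for unit vectors, cosine similarity is a dot product, so the boundary is a hyperplane through the origin, and we keep the candidates on the winning guess's side.

```python
import numpy as np

def prune(candidates, g_better, g_worse):
    """Keep only candidates consistent with 'g_better beat g_worse'.

    candidates: (n, d) array of unit vectors, one row per vocabulary word.
    g_better, g_worse: unit embedding vectors of the two guesses.
    """
    # For unit vectors, cos(t, g1) > cos(t, g2) reduces to a half-space
    # test against the hyperplane (g_better - g_worse) . t = 0.
    keep = candidates @ (g_better - g_worse) > 0
    return candidates[keep]

# Three unit candidates in 3-d (illustrative, not real word vectors):
cands = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
g1 = np.array([0.8, 0.6, 0.0])   # the closer guess
g2 = np.array([0.0, 0.6, 0.8])   # the farther guess

survivors = prune(cands, g1, g2)  # only [1, 0, 0] lies on g1's side
```

Each new pair of ranked guesses adds another half-space constraint, so the candidate set shrinks as the game goes on.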

In linear algebra terms: assuming these are unit vectors, if g₁ beats g₂, that means

... continue reading