What we’ll do is set a low-ish turn limit and see how much they manage to accomplish in that time.¹ An alternative for more linear games is running them multiple times with a turn limit and seeing how often they get past a particular point within that limit. Given how much freedom is offered to players of text adventures, this is a difficult test. It’s normal even for a skilled human player to immerse themselves in their surroundings rather than make constant progress. I wouldn’t be surprised if I got a score of zero if someone plopped me down in front of this test. But still, maybe it’s the best we can do with limited resources.²

Another idea is to give them a far-off goal and then somehow have them request hints when they are stuck, and count how many hints they need to get there. However, given how little they used the hints offered in the previous article, I doubt this would work very well either.

What we’ll do is define a set of achievements for a game. These achievements will be clustered around the first few turns of the game, because we’ll only give the LLM a few turns to earn them. Here’s an example for 9:05:

    TURN_LIMIT      40
    ANSWER_PHONE    Click.
    EXIT_BED        You get out of bed.
    OPEN_DRESSER    revealing some clean
    ENTER_BATHROOM  far from luxurious
    REMOVE_SOILED   You take off the soiled
    REMOVE_WATCH    You take off the watch
    ENTER_SHOWER    dawdle
    WEAR_CLEAN      You put on the clean
    OPEN_FRONT      You open the front
    UNLOCK_CAR      Unlocked.
    ENTER_CAR       Las Mesas
    OPEN_WALLET     open the wallet
    CARD_SLOT       green LED lights

It should be fairly clear how this works: the TURN_LIMIT specifies how many turns the LLM has to collect achievements. Every other line specifies an achievement: the name is on the left, and it counts as earned when the game prints the text on the right. The LLM knows nothing of these achievements. It tries to get through the game, and in the background we use the achievements to count how far it gets.

It might seem like the turn limit must be calibrated such that a score of 100 % is possible, but that’s not the case. Many of the games we are going to test with have branching already at the start, such that the achievements need to cover multiple branches, and it’s impossible to go through all branches within the turn limit. What we do need to be careful about is making sure the number of achievements in each branch is roughly the same, otherwise models that are lucky and go down an achievement-rich path will get a higher score.

Because of this, the score we get out of this test is a relative comparison between models, not an absolute measure of how well LLMs play text adventures. We have already established that they don’t do it very well, and we can’t be more nuanced than that without paying for a lot of eval tokens.

We might consider making some moves not count toward the turn limit, for example erroneous commands, or examining things; the latter because more powerful models are more methodical and examine more things, and it seems odd to penalise them for this. However, in the end, examining things is probably part of what allows the more powerful models to make further progress (and typing valid commands is part of being good at text adventures), so we won’t give away any moves for free.
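To make the bookkeeping concrete, here is a rough sketch in Python of how an achievement file like the one above could be loaded and checked against the game’s output. The function names, the plumbing (next_command, run_turn), and the matching rule (plain substring search on each turn’s output) are assumptions for illustration, not a description of the actual harness.

    # Sketch of an achievement checker, assuming the file format shown above:
    # a TURN_LIMIT line, then one achievement per line with the name on the
    # left and the trigger text on the right. Triggers are matched as plain
    # substrings of the game's output. Names and plumbing are illustrative.

    def load_achievements(path):
        turn_limit = None
        achievements = {}  # achievement name -> trigger text
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                name, rest = line.split(maxsplit=1)
                if name == "TURN_LIMIT":
                    turn_limit = int(rest)
                else:
                    achievements[name] = rest
        return turn_limit, achievements

    def play_and_score(ach_path, next_command, run_turn):
        """Let the model play for at most TURN_LIMIT turns, counting achievements.

        next_command() asks the model for its next move; run_turn(cmd) sends the
        command to the game and returns the game's output for that turn. Both
        are stand-ins for whatever actually drives the model and the interpreter.
        """
        turn_limit, achievements = load_achievements(ach_path)
        earned = set()
        for _ in range(turn_limit):
            output = run_turn(next_command())
            for name, trigger in achievements.items():
                if name not in earned and trigger in output:
                    earned.add(name)
        # The fraction of achievements earned is only meaningful as a relative
        # comparison between models: with branching games, 100 % is unreachable.
        return len(earned) / len(achievements), sorted(earned)

Matching on a short fragment of the game’s output rather than a full message presumably also keeps the achievement file robust against line wrapping and minor wording differences in what the interpreter prints.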