Evaluating LLMs playing text adventures
What we’ll do is set a low-ish turn limit and see how much they manage to accomplish in that time.1 Another alternative for more linear games is running them multiple times with a turn limit and seeing how often they get past a particular point within that turn limit. Given how much freedom is offered to players of text adventures, this is a difficult test. It’s normal even for a skilled human player to immerse themselves in their surrounding rather than make constant progress. I wouldn’t be su