From Pokémon Red to Standardized Game-as-an-Eval

Pokémon is increasingly used to evaluate modern large language models, but current practice lacks standardization, depends heavily on game-specific scaffolding, and is costly to run. We address these issues with lmgame-bench, a new framework for standardized game-based evaluation, and report initial results across a diverse set of games.