To measure how agents perform within this vast universe, we create a set of evaluation tasks using games and worlds that remain separate from the data used for training. These “held-out” tasks include specifically human-designed tasks like hide and seek and capture the flag.
Because of the size of XLand, understanding and characterising the performance of our agents can be a challenge. Each task involves different levels of complexity, different scales of achievable rewards, and different capabilities of the agent, so merely averaging the reward over held out tasks would hide the actual differences in complexity and rewards — and would effectively treat all tasks as equally interesting, which isn’t necessarily true of procedurally generated environments.
To overcome these limitations, we take a different approach. Firstly, we normalise scores per task using the Nash equilibrium value computed using our current set of trained players. Secondly, we take into account the entire distribution of normalised scores — rather than looking at average normalised scores, we look at the different percentiles of normalised scores — as well as the percentage of tasks in which the agent scores at least one step of reward: participation. This means an agent is considered better than another agent only if it exceeds performance on all percentiles. This approach to measurement gives us a meaningful way to assess our agents’ performance and robustness.
More generally capable agents
After training our agents for five generations, we saw consistent improvements in learning and performance across our held-out evaluation space. Playing roughly 700,000 unique games in 4,000 unique worlds within XLand, each agent in the final generation experienced 200 billion training steps as a result of 3.4 million unique tasks. At this time, our agents have been able to participate in every procedurally generated evaluation task except for a handful that were impossible even for a human. And the results we’re seeing clearly exhibit general, zero-shot behaviour across the task space — with the frontier of normalised score percentiles continually improving.