To Improve LLMs, Coach Them Like Athletes in an Arena

Games will teach you more about model capabilities than benchmarks ever could

by Alex Duffy


This week, several colleagues and I published our findings about how, with a little elbow grease and creativity, anyone can dramatically improve the performance of any LLM.

The secret is in coaching. Allow me to explain.

An athlete can credibly claim to be “best in the world” because arenas and structured competition—games—exist. There are rules, clocks, standings, and tape you can study. The AI world has benchmarks—but benchmarks only check facts. Games reveal a model’s behavior, which can be recorded and studied to help models get better. That is what we did with AI Diplomacy, a project in which we turned the classic strategy game Diplomacy into a competitive arena for language models.

AI Diplomacy works because it has clear goals—try to outfox your opponents and take over Europe—and room to improvise. But subtlety and guile are key parts of the game, which centers on tactical negotiations (we have a complete list of rules here). When we first set up the game environment, the LLMs were lost. After we got past a bunch of thorny technical problems, we realized that we could learn a ton about the models’ strengths and weaknesses from how they play against each other—and that we could coach them to be better. For example, prompting models to act more aggressively turned GPT-5 from a patsy into a formidable contender. Claude Sonnet 4, meanwhile, was a strong, speedy player even without specialized prompting.
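To make that coaching concrete: most of it comes down to changing the instructions a model plays with. Here is a minimal sketch of what steering a player toward a more aggressive style can look like. It assumes an OpenAI-compatible chat API; the model name, personas, and board-state format are illustrative stand-ins, not the actual AI Diplomacy harness.

```python
# Minimal sketch of "coaching" a model by changing its persona prompt.
# Assumes an OpenAI-compatible client (pip install openai); the model name,
# personas, and board-state format are illustrative, not the AI Diplomacy setup.
from openai import OpenAI

client = OpenAI()

DEFAULT_PERSONA = "You are playing Diplomacy as France. Negotiate with the other powers."
AGGRESSIVE_PERSONA = (
    "You are playing Diplomacy as France. Your only goal is to win. "
    "Seek alliances of convenience, probe for weakness, and be willing to "
    "betray a partner the moment it gains you territory."
)

def negotiation_turn(board_state: str, persona: str, model: str = "gpt-5") -> str:
    """Ask the model for this turn's diplomatic messages and orders."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": persona},
            {
                "role": "user",
                "content": f"Current board state:\n{board_state}\n\n"
                           "Write your messages to the other powers and your orders.",
            },
        ],
    )
    return response.choices[0].message.content

# Same model, same board -- only the coaching changes:
# passive_turn = negotiation_turn(board_summary, DEFAULT_PERSONA)
# aggressive_turn = negotiation_turn(board_summary, AGGRESSIVE_PERSONA)
```

Swapping the system prompt is the whole trick in this sketch: the same model plays noticeably differently depending on the persona it is handed, which is exactly the effect we saw with GPT-5.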

These are useful differences. One model is highly steerable; the other is fast and consistent. Knowing that tells you how each will respond to a real-world task. If you have time to craft a great prompt and need the best result, GPT-5 would be great. In a rush? Try Claude Sonnet 4.

The industry is starting to realize that games can help evaluate models and push them to new levels of performance. Google has launched Google Arena, for instance, because the company says games are “[the] perfect testbed for evaluating models & agents.”

We agree. In fact, we think there’s so much potential here that we’re putting up $1,000 in prize money to see who can prompt their agent to victory in AI Diplomacy in our Battle of the Bots in September.

In the meantime, let’s break down our findings so far.

What AI Diplomacy taught us about models


Become a paid subscriber to Every to unlock this piece and learn about:

  1. How Alex and his team learned to plug LLMs into a complex game environment
  2. How prompting totally changed how models performed
  3. Quirky behavior that emerged, including models getting "drunk on power"