
In short
- A Stanford researcher has built a Survivor-style game in which AI models form alliances and vote each other off.
- The benchmark aims to address the growing problem of saturated and contaminated AI benchmarks.
- OpenAI’s GPT-5.5 took first place across 999 multiplayer games featuring 49 AI models.
AI models are now playing “Survivor” – kind of.
In a new Stanford study called “Agent Island,” AI agents negotiate deals, accuse each other of secret pacts, manipulate votes, and eliminate opponents in multiplayer games designed to surface behaviors that traditional benchmarks miss.
In research published on Tuesday, Connacher Murphy, the director of research at the Stanford Digital Economy Lab, said that many AI benchmarks are becoming unreliable as models learn to game them and as benchmark data increasingly shows up in training sets. Murphy created Agent Island as an alternative: instead of answering static questions, AI assistants compete against one another in a Survivor-style elimination game.
“Large-scale, multi-agent interactions may become commonplace as AI agents become more resourceful and empowered to make decisions,” Murphy said. “In such cases, agents may pursue mutually exclusive goals.”
Researchers still know little about how AI models behave when they cooperate, compete, form alliances, or resolve conflicts with other autonomous agents, Murphy explained, and static benchmarks fail to capture those dynamics.
Each game starts with seven randomly selected AI models, each given a fake player name. Over five rounds, contestants talk privately, debate publicly, and vote to eliminate one another. Eliminated players later return to help decide the winner.
The format rewards persuasion, coordination, reputation management, and strategic manipulation, along with creative thinking.
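To make the setup concrete, here is a minimal sketch of that elimination loop in Python. It is an illustration based only on the description above; the function name `ask_model_to_vote` and the majority-vote tallying are assumptions, not code from the study.

```python
import random

NUM_PLAYERS = 7  # seven randomly selected models per game
NUM_ROUNDS = 5   # five rounds of private talk, public debate, and voting


def run_game(models, ask_model_to_vote):
    """Illustrative Survivor-style loop (assumed structure, not the study's code).

    `models` is a pool of model identifiers; `ask_model_to_vote(voter, candidates)`
    stands in for however a model is prompted to name who it wants eliminated.
    """
    players = random.sample(models, NUM_PLAYERS)
    jury = []  # eliminated players return at the end to pick the winner

    for _ in range(NUM_ROUNDS):
        # Each round: private negotiation and public debate would happen here,
        # then every remaining player casts an elimination vote.
        votes = [ask_model_to_vote(p, players) for p in players]
        eliminated = max(set(votes), key=votes.count)  # most-voted player leaves
        players.remove(eliminated)
        jury.append(eliminated)

    # The eliminated players judge the remaining finalists.
    jury_votes = [ask_model_to_vote(j, players) for j in jury]
    return max(set(jury_votes), key=jury_votes.count)
```

Under these assumptions, five eliminations leave two finalists, matching the article's description of eliminated players returning to decide the winner.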
Across 999 simulated games involving 49 AI models, including versions of ChatGPT, Grok, Gemini, and Claude, GPT-5.5 took first place by a wide margin with a skill score of 5.64, compared with 3.10 for GPT-5.2 and 2.86 for GPT-5.3-codex, according to the study's rating system. Anthropic's Claude Opus models followed close behind.
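The article does not explain how those skill scores are computed. As a rough, assumed illustration only (not Murphy's actual method), an Elo-style rating shows how repeated wins separate one model from another:

```python
def elo_update(rating_a, rating_b, a_won, k=32):
    """One Elo-style update: the winner takes rating points from the loser."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    delta = k * ((1.0 if a_won else 0.0) - expected_a)
    return rating_a + delta, rating_b - delta


# Two models start even; one keeps winning and pulls ahead.
a, b = 1000.0, 1000.0
for _ in range(10):
    a, b = elo_update(a, b, a_won=True)
print(round(a), round(b))  # about 1110 vs 890 after ten straight wins
```

A multi-player elimination game would need a more elaborate rating scheme than this pairwise one, but the idea of converting game outcomes into a single skill number is the same.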
The study also found that models favored AIs from the same company, with OpenAI models showing the strongest same-maker preference and Anthropic models the weakest. Across more than 3,600 final votes, models were 8.3 percent more likely to support finalists from the same developer. The gameplay, Murphy said, reads more like a political debate than a typical AI evaluation.
One agent accused fellow contestants of secretly coordinating votes after noticing wording similar to its own. Another warned players not to get carried away tracking alliances. Some agents defended themselves by insisting they followed clear, consistent rules while accusing others of resorting to “theatrics.”
The research comes as AI labs increasingly turn to game-based benchmarks and competitions to measure skills and behaviors that static tests often miss. Recent projects include Google's AI chess matchups, DeepMind's work using Eve Frontier to study AI systems in complex environments, and new OpenAI experiments designed to resist training-data contamination.
Studying how AI models communicate, collaborate, compete, and adapt to one another, the researchers say, can help anticipate behavior in multi-agent environments before such systems are deployed more widely.
The study cautioned that while benchmarks like Agent Island can help identify risks from autonomous AI agents before deployment, the same benchmarks and interaction logs could also be used to refine the strategies agents deploy against one another.
“We reduce this risk by using low-stakes games and contained simulations without human participants or real-world consequences,” Murphy wrote.