Imagine hanging a picture frame in your living room. You and a friend naturally divide the work, clarify confusing instructions, and adapt if a screw is missing. We expect this type of fluid, on-the-fly teamwork from humans, but can today’s AI agents do the same? Well, not quite.
NOV 2025 | 10 MIN READ | EPFL, Microsoft Research | 🎓 ArXiv Paper Link
Figure 1. Effective collaborative interaction can be challenging...
“AI Agents” – agentic software powered by language models – are dominating the news. Their potential to increase economic output and accelerate scientific discovery has led to massive investments in infrastructure [1], agentic services [2, 3, 4], startups [5, 6], and corporate/government adoption training [7, 8, 9]. The current trajectory of AI development suggests that we will increasingly rely on agent-based systems composed of independently developed agents with varying capabilities, information, privileges, and tools. So far, attempts at multi-agent integration rely heavily on predetermined communication protocols, e.g., MCP [10], ACP [11], and A2A [12], or on centrally orchestrated architectures. In contrast, real-world deployment will likely require on-the-fly, flexible communication, as fixed protocols are too rigid to handle the diversity of the real world. Although there are many benchmarks designed to measure single-agent capabilities [13], few reflect the paradigm shift toward long-horizon, dynamic, multi-agent collaboration under partial information [14]. This leaves us with an important unanswered question: do current training strategies lead to agents capable of dynamic collaboration?
Figure 2. System diagram of our collaborative maze-solving task. (i) a random maze is generated and used to create two obfuscated copies; (ii) each agent is given a copy along with rules, after which they engage in a dialogue until either the maximum number of turns is reached or the task is completed; (iii) a “grader” agent extracts the proposed solution which is checked against the ground-truth maze to decide on the outcome.
In our new work [15], we measure leading AI agents’ collaborative capabilities using a novel maze-solving benchmark. In this task, two agents are each given partially obfuscated copies of the same map, requiring them to exchange information to fill in the missing cells. They further have to agree on each move before it is executed and can only make one move at a time, ensuring that no single agent can jeopardize the task and that a minimum number of interactions takes place. A powerful property of this “distributed” methodology is that the same task can be given to a single agent, allowing us to disentangle “raw” maze-solving capabilities from collaborative ones.
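To make the setup in Figure 2 concrete, here is a minimal sketch of the distributed evaluation loop. The maze format, the scripted stand-in agents, and all helper names are illustrative assumptions rather than the paper’s actual implementation; real agents would call a language model to produce each message.

```python
import random
from dataclasses import dataclass

# Symbols for a toy text-based maze view; the exact maze format and
# obfuscation scheme are assumptions here, not the paper's setup.
WALL, OPEN, HIDDEN = "#", ".", "?"

def generate_maze(n=6, wall_prob=0.3, seed=0):
    """(i) Generate a random n-by-n ground-truth grid of symbols."""
    rng = random.Random(seed)
    return [[WALL if rng.random() < wall_prob else OPEN for _ in range(n)]
            for _ in range(n)]

def obfuscate(maze, hide_prob=0.5, seed=0):
    """Hide a random subset of cells, producing one agent's partial copy."""
    rng = random.Random(seed)
    return [[HIDDEN if rng.random() < hide_prob else cell for cell in row]
            for row in maze]

@dataclass
class ScriptedAgent:
    """Stand-in for an LLM-backed agent; a real agent would query a model."""
    name: str
    view: list

    def respond(self, transcript):
        hidden = sum(row.count(HIDDEN) for row in self.view)
        return f"{self.name}: my copy has {hidden} hidden cells; let us align coordinates first."

def run_episode(agent_a, agent_b, max_turns=10):
    """(ii) Agents alternate messages until the turn budget runs out."""
    transcript = []
    speaker, other = agent_a, agent_b
    for _ in range(max_turns):
        transcript.append(speaker.respond(transcript))
        speaker, other = other, speaker
    # (iii) In the full benchmark the dialogue can also end early once the task
    # is completed, every move must be agreed by both agents and executed one
    # at a time, and a separate "grader" agent extracts the proposed solution
    # to check against the ground-truth maze.
    return transcript

maze = generate_maze()
agent_a = ScriptedAgent("Agent A", obfuscate(maze, seed=1))
agent_b = ScriptedAgent("Agent B", obfuscate(maze, seed=2))
print("\n".join(run_episode(agent_a, agent_b, max_turns=4)))
```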
Figure 3. What can go wrong in collaborative grounding? The first agent is working in row-column coordinates, while the second agent assumes column-row coordinates. Conflict quickly arises…
While straightforward, our maze-solving task allows us to test diverse collaborative challenges shown to be essential in human and human-AI collaboration [16, 17, 18]. For starters, the same maze can be described in a variety of ways, e.g., using row-column coordinates or visual descriptions (see Figure 3). Because we provide agents with text-based N×N grids of symbols and no communication guidelines, the agents have to “align” their representations. They also have to solve various other “grounding” challenges [19, 20, 21], e.g., confirming the same start and goal states, keeping track of evolving agent states, and making sure intended moves are clear to the other agent (Figure 1: “your left or my left?”). They further have to resolve conflicts, agree on practical procedural decisions (e.g., who does what?), and engage in “Theory of Mind” [22] to interpret proposals from the other agent.
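To see how easily such representational misalignment arises, consider the toy example below (a hypothetical grid, not one taken from the benchmark): both agents exchange the cell reference (1, 2), but one reads it as (row, column) and the other as (column, row), so they silently disagree about what the cell contains.

```python
# Toy illustration of the coordinate mismatch in Figure 3 (hypothetical grid,
# not an actual maze from the benchmark). Both agents exchange the reference
# (1, 2), but each silently assumes a different indexing convention.
grid = [
    list("S.#."),
    list("..#."),
    list("#..G"),
    list("#..."),
]

ref = (1, 2)  # cell reference mentioned in dialogue, convention left unstated

cell_agent_1 = grid[ref[0]][ref[1]]  # Agent 1 reads (row, column) -> "#"
cell_agent_2 = grid[ref[1]][ref[0]]  # Agent 2 reads (column, row) -> "."

print(f"Agent 1 thinks {ref} is {cell_agent_1!r} (a wall)")
print(f"Agent 2 thinks {ref} is {cell_agent_2!r} (open floor)")
# Without explicitly grounding the convention first, the agents now disagree
# about whether the referenced cell is even passable.
```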
Figure 4. The collaboration gap between solo and homogeneous collaborative performance. We display the mean weighted outcomes of solving 6×6 mazes with 95% CI. Note that most models can solve mazes solo, about a third do worse with distributed information, and almost all degrade significantly when forced to collaborate.
Using this benchmark on 32 leading open- and closed-source models, we identified a critical and counterintuitive phenomenon we term the “collaboration gap”: models that are highly capable solo performers (gray and yellow marks) exhibit a significant performance drop when collaborating with an identical copy of themselves (red marks). Distilled models appear disproportionately affected; note, for example, the large drops from gpt-5 to gpt-5 nano and from gemini-2.5-pro to gemini-2.5-flash-lite.
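For readers who prefer a formula, the gap for a single model is simply the difference between its solo score and its self-paired collaborative score. The numbers in the snippet below are placeholders for illustration, not results from the paper:

```python
# Illustrative definition of the collaboration gap for a single model: the drop
# from its solo score to its score when paired with an identical copy of itself.
# The numbers below are placeholders only, NOT measured results; see Figure 4
# for the actual outcomes.

def collaboration_gap(solo_score: float, self_pair_score: float) -> float:
    """Positive values mean the model does worse when collaborating with itself."""
    return solo_score - self_pair_score

# Hypothetical model that solves 90% of mazes solo but only 60% when it has to
# coordinate with a copy of itself.
print(f"{collaboration_gap(solo_score=0.90, self_pair_score=0.60):.2f}")  # 0.30
```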
“The collaboration gap: models that are highly capable solo performers exhibit a significant performance drop when collaborating with an identical copy of themselves.”
What explains these differences? For example, both o3 and gpt-4.1-mini are highly capable of solving mazes solo, yet have dramatically different outcomes in collaborative mode. Inspecting their first messages, shown in Figure 5, is illuminating: whereas the stronger o3 immediately seeks to align its maze representation and ground on multiple levels, the weaker gpt-4.1-mini only attempts to ground the meaning of different symbols, leaving much to the imagination…
Figure 5. The difference in grounding effort between strong and weak models.
Figure 6. Specialized models are more likely to require “help” to solve problems outside of their areas of expertise.
The observed collaborative performance gap matters for the future of agentic AI; its appearance in our simple testbed indicates a blind spot in current training paradigms rather than an artifact of task complexity. For example, recent work argued that “Small language models are the future of agentic AI” [23]. While the authors make a compelling case for why small, specialized models are operationally and economically preferable to large, generalist models, their discussion omits a crucial caveat: the more specialized a model becomes, the more likely it is to encounter situations outside of its area of expertise, requiring collaboration with “other” models. Our results suggest that for current models, naively breaking up tasks might incur a “collaborative slippage” cost.
“The observed collaborative performance gap matters for the future of agentic AI; its appearance in our simple testbed indicates a blind spot in current training paradigms rather than an artifact of task complexity”
Zooming out, collaboration has been the central skill advancing human civilization. Optimizing for solo brilliance over collective intelligence ignores this crucial lesson. We therefore challenge the research community to treat collaborative intelligence as a core objective to be designed for, not as an emergent property to be hoped for.
🎉 Thanks for reading till the end! 🎉
Curious about how collaborations between agents from different model providers and with different skill levels fared? Or about our insights into effective deployment strategies for heterogeneous agents? These, along with many more dialogue snippets, ablations, and further discussion, can be found in the full paper:
> Read the full paper: “The Collaboration Gap”
> Any lingering questions, suggestions, or concerns? Feel free to reach out directly!
--
Please cite our work as follows:
Davidson, T. R., Fourney, A., Amershi, S., West, R., Horvitz, E., and Kamar, E. (2025). The Collaboration Gap. arXiv preprint arXiv:2511.02687.
Or use the BibTeX citation:
@misc{davidson2025collaborationgap,
  title={The Collaboration Gap},
  author={Tim R. Davidson and Adam Fourney and Saleema Amershi and Robert West and Eric Horvitz and Ece Kamar},
  year={2025},
  eprint={2511.02687},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2511.02687},
}
References
[1] Noffsinger et al., The cost of compute: A $7 trillion race to scale data centers, 2025
[2] Citron, Try Deep Research and our new experimental model in Gemini, your AI assistant, 2024
[3] OpenAI, Introducing Operator, 2025
[4] Salesforce, Salesforce unveils Agentforce - what AI was meant to be, 2024
[5] Field, AI agents are having a “ChatGPT moment” as investors look for what’s next after chatbots, 2024
[6] Robbins and Kokalitcheva, Y Combinator is going all-in on AI agents, making up nearly 50% of latest batch, 2025
[7] Mason, All civil servants in England and Wales to get AI training, 2025
[8] Gartner, Gartner reveals top technologies sharpening government AI adoption, 2025
[9] EY, EY survey reveals that technology companies are setting the pace of agentic AI - will others follow suit?, 2025
[10] Anthropic, Introducing the Model Context Protocol, 2024
[11] Besen, ACP: The internet protocol for AI agents, 2025
[12] Google, Announcing the Agent-to-Agent protocol, 2025
[13] Chang et al., A survey on evaluation of large language models, 2024
[14] Davidson et al., Evaluating language model agency through negotiations, 2024
[15] Davidson et al., The Collaboration Gap, 2025
[16] Clark, Using language, 1996
[17] Garrod and Anderson, Saying what you mean in dialogue, 1987
[18] Pejsa et al., Natural communication about uncertainties in situated interactions, 2014
[19] Clark and Brennan, Grounding in communication, 1996
[20] Garrod and Pickering, Why is conversation so easy?, 2004
[21] Harnad, The symbol grounding problem, 1990
[22] Premack and Woodruff, Does the chimpanzee have a theory of mind?, 1978
[23] Belcak et al., Small language models are the future of agentic AI, 2025