News · May 12, 2026 · 2 min read

AI agents fail to negotiate for users despite task completion

Microsoft's new benchmark shows frontier models complete 95%+ of social tasks but consistently accept poor deals instead of advocating for users.

Our Take

Agents that can't negotiate effectively become expensive yes-men rather than trustworthy delegates.

Why it matters

As AI agents handle more real-world interactions like calendar management and purchase negotiations, their inability to advocate creates measurable value loss for users who trust them to act in their best interests.

Do this week

AI product teams: audit your agent's negotiation behavior in controlled scenarios before deploying customer-facing delegation features.

Frontier models leave value on table despite high task completion

Microsoft Research released SocialReasoning-Bench, a benchmark testing whether AI agents can negotiate effectively on behalf of users. The benchmark evaluates agents in two scenarios: calendar coordination and marketplace negotiation, measuring both outcome quality and decision-making process.

Testing GPT-4, GPT-5.4, Claude Sonnet, and Gemini across both domains revealed a consistent pattern. Task completion rates hit near-perfect levels, with agents successfully scheduling meetings and closing deals. However, outcome optimality scores clustered near zero in marketplace negotiations, meaning agents accepted deals that gave away virtually all available value to counterparties (per Microsoft's analysis).

In calendar scheduling, agents performed better but still settled below the midpoint on average, accepting requestor-preferred time slots rather than ones serving their principal's interests. Even with defensive prompting that explicitly instructed agents to advocate for users, performance gaps persisted across all tested models.

Task success masks delegation failure

The benchmark introduces two new metrics that expose the gap between task completion and effective advocacy. Outcome Optimality measures how much available value the agent captured for its user on a 0-to-1 scale. Due Diligence scores whether the agent followed competent decision-making processes, like gathering context before acting or making counterproposals rather than immediately accepting first offers.
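Microsoft has not published the scoring code, but the Outcome Optimality metric as described can be sketched as a normalized surplus: how far the agreed deal sits between the worst outcome the principal would accept and the best realistically achievable one. The function name and formula below are assumptions for illustration.

```python
def outcome_optimality(agreed_value: float,
                       worst_acceptable: float,
                       best_achievable: float) -> float:
    """Fraction of the available value range the agent captured
    for its principal, clipped to the 0-to-1 scale. Illustrative
    sketch, not the benchmark's actual implementation."""
    span = best_achievable - worst_acceptable
    if span <= 0:
        return 1.0  # degenerate case: no value at stake
    frac = (agreed_value - worst_acceptable) / span
    return max(0.0, min(1.0, frac))

# Example: a seller agent with reservation price 50 and a best
# plausible price of 100 accepts the buyer's first offer of 52,
# capturing almost none of the available value.
print(outcome_optimality(52, 50, 100))  # 0.04
```

Scores "clustered near zero" in marketplace negotiations correspond, under this framing, to agents settling at or barely above their principal's reservation point.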

Current evaluation methods focus on whether meetings get scheduled or deals close, missing the quality dimension that matters for real delegation. An agent that immediately accepts a counterparty's first offer can still score well on task completion if the counterparty happens to propose a reasonable outcome.

Microsoft's research builds on earlier findings showing agents accepted first proposals up to 93% of the time in simulated marketplaces, suggesting this passive behavior generalizes across social reasoning contexts.

Process metrics separate capability gaps from negligence

The Due Diligence metric helps distinguish between agents that got lucky with good outcomes versus those demonstrating genuine advocacy skills. High outcome scores with low process scores indicate fragile performance that won't generalize. Conversely, agents showing strong process but poor outcomes point to capability gaps rather than design flaws.
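Reading the two scores together gives a simple two-by-two interpretation. A minimal sketch of that quadrant logic, with an illustrative threshold that is an assumption rather than a value from the benchmark:

```python
def interpret(outcome: float, diligence: float, cut: float = 0.5) -> str:
    """Classify an agent run by its Outcome Optimality and Due
    Diligence scores (both on a 0-to-1 scale). The 0.5 cutoff is
    an illustrative assumption."""
    if outcome >= cut and diligence >= cut:
        return "competent advocacy"      # good result, sound process
    if outcome >= cut:
        return "lucky outcome, fragile"  # good result, poor process
    if diligence >= cut:
        return "capability gap"          # sound process, poor result
    return "passive acceptance"          # neither

print(interpret(0.9, 0.2))  # lucky outcome, fragile
```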

For teams building agent systems, this suggests focusing on negotiation strategy rather than just task completion. The reasonable-agent policy Microsoft used as a benchmark captures basic advocacy behaviors: consulting available information, opening with positions favorable to the principal, and conceding only after exploring alternatives.

The benchmark's value function approach generalizes beyond price negotiations to any scenario where agents face competing incentives, including non-monetary domains where value reflects user preferences rather than financial outcomes.
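In non-monetary domains, the same idea amounts to scoring any agreed outcome against the principal's stated preferences. The weighting scheme below is an illustrative assumption, not the benchmark's value function:

```python
from typing import Dict

def preference_value(outcome: Dict[str, str],
                     preferences: Dict[str, Dict[str, float]],
                     weights: Dict[str, float]) -> float:
    """Weighted average of how well each negotiated attribute
    matches the principal's preference scores (each in [0, 1]).
    Hypothetical helper for illustration."""
    total_weight = sum(weights.values())
    score = sum(weights[attr] * preferences[attr].get(outcome[attr], 0.0)
                for attr in weights)
    return score / total_weight

# Calendar example: the principal strongly prefers mornings, so
# conceding a late slot scores poorly even though a meeting was booked.
prefs = {"slot": {"9am": 1.0, "1pm": 0.4, "5pm": 0.1}}
print(preference_value({"slot": "5pm"}, prefs, {"slot": 1.0}))  # 0.1
```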

#Agents #Research #AI Ethics #LLM