News · May 12, 2026 · 2 min read

AI agents fail to negotiate for users despite task completion

Microsoft's new benchmark shows frontier models complete 95%+ of social tasks but consistently accept poor deals instead of advocating for users.

Our Take

Agents that can't negotiate effectively become expensive yes-men rather than trustworthy delegates.

Why it matters

As AI agents handle more real-world interactions like calendar management and purchase negotiations, their inability to advocate creates measurable value loss for users who trust them to act in their best interests.

Do this week

AI product teams: audit your agent's negotiation behavior in controlled scenarios before deploying customer-facing delegation features.

Frontier models leave value on table despite high task completion

Microsoft Research released SocialReasoning-Bench, a benchmark testing whether AI agents can negotiate effectively on behalf of users. The benchmark evaluates agents in two scenarios: calendar coordination and marketplace negotiation, measuring both outcome quality and decision-making process.

Testing GPT-4, GPT-5.4, Claude Sonnet, and Gemini across both domains revealed a consistent pattern. Task completion rates hit near-perfect levels, with agents successfully scheduling meetings and closing deals. However, outcome optimality scores clustered near zero in marketplace negotiations, meaning agents accepted deals that gave away virtually all available value to counterparties (per Microsoft's analysis).

In calendar scheduling, agents performed better but still settled below the midpoint on average, accepting requestor-preferred time slots rather than ones serving their principal's interests. Even with defensive prompting that explicitly instructed agents to advocate for users, performance gaps persisted across all tested models.

Task success masks delegation failure

The benchmark introduces two new metrics that expose the gap between task completion and effective advocacy. Outcome Optimality measures how much available value the agent captured for its user on a 0-to-1 scale. Due Diligence scores whether the agent followed competent decision-making processes, like gathering context before acting or making counterproposals rather than immediately accepting first offers.
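Microsoft has not published the scoring code, but the Outcome Optimality metric as described can be sketched as a normalized surplus: how far the agreed deal sits between the worst outcome the principal would accept and the best realistically achievable one. The function name and formula below are assumptions for illustration.

```python
def outcome_optimality(agreed_value: float,
                       worst_acceptable: float,
                       best_achievable: float) -> float:
    """Fraction of the available value range the agent captured
    for its principal, clipped to the 0-to-1 scale. Illustrative
    sketch, not the benchmark's actual implementation."""
    span = best_achievable - worst_acceptable
    if span <= 0:
        return 1.0  # degenerate case: no value at stake
    frac = (agreed_value - worst_acceptable) / span
    return max(0.0, min(1.0, frac))

# Example: a seller agent with reservation price 50 and a best
# plausible price of 100 accepts the buyer's first offer of 52,
# capturing almost none of the available value.
print(outcome_optimality(52, 50, 100))  # 0.04
```

Scores "clustered near zero" in marketplace negotiations correspond, under this framing, to agents settling at or barely above their principal's reservation point.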

Current evaluation methods focus on whether meetings get scheduled or deals close, missing the quality dimension that matters for real delegation. An agent that immediately accepts a counterparty's first offer can still score well on task completion if the counterparty happens to propose a reasonable outcome.

Microsoft's research builds on earlier findings showing agents accepted first proposals up to 93% of the time in simulated marketplaces, suggesting this passive behavior generalizes across social reasoning contexts.

Process metrics separate capability gaps from negligence

The Due Diligence metric helps distinguish between agents that got lucky with good outcomes versus those demonstrating genuine advocacy skills. High outcome scores with low process scores indicate fragile performance that won't generalize. Conversely, agents showing strong process but poor outcomes point to capability gaps rather than design flaws.
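Reading the two scores together gives a simple two-by-two interpretation. A minimal sketch of that quadrant logic, with an illustrative threshold that is an assumption rather than a value from the benchmark:

```python
def interpret(outcome: float, diligence: float, cut: float = 0.5) -> str:
    """Classify an agent run by its Outcome Optimality and Due
    Diligence scores (both on a 0-to-1 scale). The 0.5 cutoff is
    an illustrative assumption."""
    if outcome >= cut and diligence >= cut:
        return "competent advocacy"      # good result, sound process
    if outcome >= cut:
        return "lucky outcome, fragile"  # good result, poor process
    if diligence >= cut:
        return "capability gap"          # sound process, poor result
    return "passive acceptance"          # neither

print(interpret(0.9, 0.2))  # lucky outcome, fragile
```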

For teams building agent systems, this suggests focusing on negotiation strategy rather than just task completion. The reasonable-agent policy Microsoft used as a benchmark captures basic advocacy behaviors: consulting available information, opening with positions favorable to the principal, and conceding only after exploring alternatives.

The benchmark's value function approach generalizes beyond price negotiations to any scenario where agents face competing incentives, including non-monetary domains where value reflects user preferences rather than financial outcomes.
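In non-monetary domains, the same idea amounts to scoring any agreed outcome against the principal's stated preferences. The weighting scheme below is an illustrative assumption, not the benchmark's value function:

```python
from typing import Dict

def preference_value(outcome: Dict[str, str],
                     preferences: Dict[str, Dict[str, float]],
                     weights: Dict[str, float]) -> float:
    """Weighted average of how well each negotiated attribute
    matches the principal's preference scores (each in [0, 1]).
    Hypothetical helper for illustration."""
    total_weight = sum(weights.values())
    score = sum(weights[attr] * preferences[attr].get(outcome[attr], 0.0)
                for attr in weights)
    return score / total_weight

# Calendar example: the principal strongly prefers mornings, so
# conceding a late slot scores poorly even though a meeting was booked.
prefs = {"slot": {"9am": 1.0, "1pm": 0.4, "5pm": 0.1}}
print(preference_value({"slot": "5pm"}, prefs, {"slot": 1.0}))  # 0.1
```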

#Agents #Research #AI Ethics #LLM