Our Take
Companies have working models and profit promises but no proven path between them, creating an evidence vacuum that market speculation fills daily.
Why it matters
Enterprise AI budgets and workforce planning decisions are based on theoretical task assessments rather than real workplace performance data. Markets now swing on single social media posts because actual deployment evidence remains scarce.
Do this week
CTOs: Run workplace task pilots with measurable success criteria before committing to enterprise AI contracts.
Workplace AI agents fail most real-world tasks
Researchers at Mercor tested AI agents from OpenAI, Anthropic, and Google DeepMind across 480 workplace tasks performed by bankers, consultants, and lawyers. Every agent failed to complete the majority of tasks assigned to it.
The results contradict theoretical job impact studies like Anthropic's recent analysis predicting major changes for managers, architects, and media workers. That study based its conclusions on perceived LLM capabilities rather than actual workplace performance.
Meanwhile, Pause AI protesters in London captured the deployment gap with signs reading "Step 1: Grow a digital super mind. Step 2: ? Step 3: ?" The reference to South Park's underpants gnomes business plan ("Phase 1: Collect underpants. Phase 2: ? Phase 3: Profit") reflects how companies have built technology and promised outcomes without proven implementation paths.
Missing evidence creates speculation cycles
The disconnect between theoretical assessments and real performance creates an information vacuum. Without transparency from model makers or standardized workplace evaluation methods, bold claims drive market reactions regardless of supporting evidence.
OpenAI's chief scientist Jakub Pachocki recently described AI as an "economically transformative technology" heading toward "sunny uplands," but acknowledged the destination remains "hazy." Each company takes different routes with no guarantee any will reach claimed outcomes.
The problem extends beyond coding tasks, where AI tools show the fastest improvement. Independent studies find LLMs perform poorly at strategic judgment calls, and workplace deployment must also account for existing workflows and human integration factors.
Test real tasks before major commitments
The gap between capability claims and workplace results demands evidence-based evaluation. Rather than relying on vendor assessments or theoretical task analyses, organizations need controlled tests using their actual workflows.
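A minimal sketch of what such a controlled test could look like, assuming tasks are drawn from the organization's own workflows and graded by human reviewers. Every task name, criterion, and the 0.8 threshold here is an illustrative assumption, not a vendor benchmark or the Mercor methodology; the point is committing to measurable success criteria before, not after, the pilot.

```python
from dataclasses import dataclass

@dataclass
class PilotTask:
    name: str
    success_criterion: str  # agreed with the business owner before the pilot
    passed: bool            # graded by a human reviewer against the criterion

def completion_rate(tasks: list[PilotTask]) -> float:
    """Share of pilot tasks the agent completed to the agreed standard."""
    return sum(t.passed for t in tasks) / len(tasks)

# Hypothetical pilot tasks and outcomes, for illustration only.
tasks = [
    PilotTask("Draft NDA from template", "No clause errors on legal review", True),
    PilotTask("Summarize diligence folder", "All red flags surfaced", False),
    PilotTask("Build comparables table", "Figures match source filings", False),
]

THRESHOLD = 0.8  # pass bar fixed in advance, before any results come in
rate = completion_rate(tasks)
print(f"Completion rate: {rate:.0%} (required: {THRESHOLD:.0%})")
print("Proceed to contract" if rate >= THRESHOLD else "Do not commit yet")
```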
Current evaluation methods fail to predict real-world performance because they ignore interference from existing processes and human factors. Adding AI can worsen outcomes in complex environments, requiring workflow redesign that takes significant time and organizational commitment.
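One way to catch that kind of regression is a paired comparison: run each real task twice, once through the existing workflow and once with the agent in the loop, with the same reviewer grading both on a 0-to-1 quality scale. The scores below are made up for illustration; the signal to watch is a negative mean delta or a high count of tasks that got worse.

```python
import statistics

baseline_scores = [0.82, 0.75, 0.90, 0.68, 0.80]  # existing workflow
assisted_scores = [0.85, 0.60, 0.88, 0.55, 0.83]  # same tasks, with the agent

# Per-task quality change when the agent is added to the workflow.
deltas = [a - b for a, b in zip(assisted_scores, baseline_scores)]
mean_delta = statistics.mean(deltas)
regressions = sum(d < 0 for d in deltas)

print(f"Mean quality change: {mean_delta:+.2f}")
print(f"Tasks that got worse with AI: {regressions}/{len(deltas)}")
# Worse-than-baseline results argue for workflow redesign
# before any wider rollout, not for a bigger contract.
```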
The technology industry's economic promises rest on AI's potential transformation, but workplace evidence remains limited. Until transparent evaluation methods emerge and model makers provide deployment data, practitioners should treat capability claims as hypotheses requiring validation.