Our Take
The reasoning capability in GPT-Realtime-2 is real, but the 15.2% benchmark improvement comes from OpenAI's own evaluations without independent verification.
Why it matters
Voice interfaces are moving beyond simple chatbots to agents that can reason through complex requests and use tools mid-conversation. Developers building voice products now have production-ready models that handle the messy reality of human speech patterns.
Do this week
Voice product teams: Test GPT-Realtime-2's tool calling against your current stack to measure whether the reasoning improvement shows up on your specific use cases.
OpenAI launches three specialized voice models
OpenAI released GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper through its Realtime API. GPT-Realtime-2 adds reasoning capabilities to voice interactions, scoring 15.2% higher on Big Bench Audio compared to GPT-Realtime-1.5 (per OpenAI benchmarks). The model can now handle parallel tool calls while maintaining conversation flow and supports adjustable reasoning levels from minimal to extra-high.
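As a rough sketch of how those knobs might be set, the snippet below opens a Realtime API WebSocket session and sends a session.update event with a tool the model can call mid-conversation. The model name "gpt-realtime-2" and the reasoning field come from this announcement and should be treated as assumptions until the API reference confirms the exact parameter names; the lookup_listing tool is hypothetical.

```python
# Sketch: configure a Realtime API session with an adjustable reasoning level
# and a tool the model may call mid-conversation. The model name and the
# "reasoning" session field are assumptions based on this announcement.
import asyncio
import json
import os

import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime-2"  # assumed model name
HEADERS = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}

async def main():
    # additional_headers is the kwarg in newer websockets releases;
    # older releases call it extra_headers.
    async with websockets.connect(URL, additional_headers=HEADERS) as ws:
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["audio", "text"],
                "reasoning": {"effort": "high"},  # assumed field; levels span minimal..extra-high
                "tools": [{
                    "type": "function",
                    "name": "lookup_listing",  # hypothetical tool for illustration
                    "description": "Fetch details for a property listing",
                    "parameters": {
                        "type": "object",
                        "properties": {"listing_id": {"type": "string"}},
                        "required": ["listing_id"],
                    },
                }],
            },
        }))
        # ...stream microphone audio as input_audio_buffer.append events and
        # read server events (transcripts, tool calls, audio deltas) here.

asyncio.run(main())
```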
GPT-Realtime-Translate handles live speech translation across 70+ input languages into 13 output languages, maintaining conversation pace. GPT-Realtime-Whisper provides streaming speech-to-text transcription as speakers talk, rather than waiting for complete utterances.
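For the transcription model, a minimal way to try the streaming behavior is to request it in the session config and listen for transcription events. The event and session field names below follow the existing Realtime API; the "gpt-realtime-whisper" model name is an assumption from this announcement.

```python
# Sketch: enable streaming transcription in a Realtime session and print each
# utterance's transcript as it arrives. "gpt-realtime-whisper" is an assumed name.
import json

SESSION_UPDATE = {
    "type": "session.update",
    "session": {
        "input_audio_transcription": {"model": "gpt-realtime-whisper"},  # assumed model name
    },
}

def handle_event(raw: str) -> None:
    event = json.loads(raw)
    # Completed transcript for one user utterance (existing Realtime event name);
    # any partial/streaming events would arrive before this one.
    if event.get("type") == "conversation.item.input_audio_transcription.completed":
        print("user said:", event.get("transcript", ""))
```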
Zillow reported a 26-point improvement in call success rates during testing, from 69% to 95% after prompt optimization. The models include safety guardrails, with active classifiers monitoring sessions for policy violations.
Voice agents can now reason while talking
Previous voice models handled simple back-and-forth but broke down during complex, multi-step requests. GPT-Realtime-2 addresses this by maintaining context across tool calls and interruptions, and by signaling what it's doing with preambles like "let me check that."
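On the wire, that flow looks roughly like the sketch below: the server announces a finished tool call, the client runs the function and hands the result back as a conversation item, then asks the model to resume speaking. Event and item names follow the existing Realtime API; lookup_listing is a hypothetical function for illustration.

```python
# Sketch: handle a mid-conversation tool call and return the result so the
# model can keep talking with full context.
import json

def lookup_listing(listing_id: str) -> dict:
    # Stand-in for a real backend call.
    return {"listing_id": listing_id, "price": "$450,000", "status": "active"}

async def on_server_event(ws, raw: str) -> None:
    event = json.loads(raw)
    if event["type"] == "response.function_call_arguments.done":
        args = json.loads(event["arguments"])
        result = lookup_listing(**args)
        # Return the tool output as a conversation item tied to the call_id...
        await ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": event["call_id"],
                "output": json.dumps(result),
            },
        }))
        # ...then ask the model to continue the spoken response with that context.
        await ws.send(json.dumps({"type": "response.create"}))
```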
The context window expansion from 32K to 128K tokens enables longer conversations without losing track of earlier requests. This matters for enterprise use cases where voice agents need to handle complex workflows rather than just answer questions.
Live translation removes the lag that made cross-language voice interactions feel stilted. BolnaAI reported 12.5% lower word error rates across Hindi, Tamil, and Telugu than the other models it tested (company-reported figures).
Production voice apps become viable
GPT-Realtime-2 costs $32 per million audio input tokens and $64 per million output tokens, with cached input at $0.40 per million tokens. GPT-Realtime-Translate runs $0.034 per minute, while GPT-Realtime-Whisper costs $0.017 per minute.
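To turn those rates into a per-call estimate, a back-of-the-envelope helper like the one below is enough; the token volumes are placeholders you would replace with figures from your own usage logs.

```python
# Back-of-the-envelope cost estimate for a GPT-Realtime-2 call using the
# published per-million-token rates. Token counts are placeholders; substitute
# real numbers from your usage dashboard.
INPUT_RATE = 32.00 / 1_000_000    # $ per audio input token
CACHED_RATE = 0.40 / 1_000_000    # $ per cached input token
OUTPUT_RATE = 64.00 / 1_000_000   # $ per audio output token

def call_cost(input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    return (input_tokens * INPUT_RATE
            + cached_tokens * CACHED_RATE
            + output_tokens * OUTPUT_RATE)

# Example: 20k fresh input, 50k cached, 10k output audio tokens in one call.
print(f"${call_cost(20_000, 50_000, 10_000):.2f}")  # -> $1.30 with these placeholder volumes
```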
The models support three emerging patterns: voice-to-action for task completion, systems-to-voice for proactive guidance, and voice-to-voice for cross-language conversations. Companies like Priceline are building end-to-end travel management through voice interactions.
Developers must disclose AI interaction to users unless obvious from context. The Agents SDK allows custom safety guardrails beyond OpenAI's built-in protections. All three models are available immediately through the Realtime API.
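A minimal sketch of a custom guardrail, assuming the openai-agents Python package's guardrail hooks, is shown below; the keyword check is a stand-in for whatever policy logic your product actually needs, and the agent itself is hypothetical.

```python
# Sketch: a custom input guardrail layered on top of OpenAI's built-in
# protections, using the openai-agents package. The keyword check is a
# stand-in for real policy logic.
from agents import Agent, GuardrailFunctionOutput, Runner, input_guardrail

BLOCKED_TOPICS = ("wire transfer", "social security number")  # illustrative policy

@input_guardrail
async def policy_check(ctx, agent, user_input) -> GuardrailFunctionOutput:
    text = user_input if isinstance(user_input, str) else str(user_input)
    flagged = any(topic in text.lower() for topic in BLOCKED_TOPICS)
    return GuardrailFunctionOutput(output_info={"flagged": flagged},
                                   tripwire_triggered=flagged)

voice_agent = Agent(
    name="travel_voice_agent",  # hypothetical agent for illustration
    instructions="You are a voice assistant. Disclose that you are an AI.",
    input_guardrails=[policy_check],
)

# Runner.run(voice_agent, "Book me a flight to Lisbon") raises a guardrail
# tripwire exception if policy_check flags the request.
```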