Our Take
Code-switching cost varies sharply by model and language pair, but the top systems (Scribe, Gemini, AssemblyAI) incur only 1–3 percentage point WER penalties versus monolingual baselines, suggesting the problem is robustness, not inherent difficulty.
Why it matters
Contact centers and IT helpdesks routinely handle bilingual customers who code-switch mid-sentence. Until now, there was no enterprise-grade benchmark to measure which ASR systems handle that correctly, making vendor comparisons impossible.
Do this week
Voice platform teams: before committing to an annual contract, run your current ASR vendor against this benchmark (available through AU-Harness) on your customer base's language pair, so you know the actual WER and semantic error cost.
Seven ASR systems benchmarked on code-switched speech
ServiceNow released a benchmark measuring how seven ASR systems handle code-switched speech—language switching mid-utterance—in HR and IT support scenarios. The dataset covers 918 utterances across four language pairs: Spanish-English (259 records), French-English (298), Canadian French-English (188), and German-English (173). All utterances were synthesized from parallel English and non-English transcripts, then reviewed by native-speaker linguists.
Three metrics were used: Word Error Rate (WER) for raw transcription accuracy, Semantic Word Error Rate (SWER) for meaning preservation, and Answer Error Rate (AER) for downstream task success—whether transcription errors cause downstream QA systems to fail.
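For readers who want to see what the WER figure measures, here is a minimal sketch of the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name, the lowercase-and-split normalization, and the sample utterance are illustrative assumptions, not the benchmark's or AU-Harness's implementation. SWER and AER depend on benchmark-specific scoring and are not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    # Assumption: simple lowercase + whitespace tokenization; real harnesses
    # usually apply heavier text normalization before scoring.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Invented code-switched utterance: one substituted word out of seven.
reference = "please reset mi contraseña for ticket 4521"
hypothesis = "please reset my contraseña for ticket 4521"
print(round(word_error_rate(reference, hypothesis), 3))
```

Because WER is a fraction of reference words, a value of 0.16 means roughly one error for every six reference words, and values above 1.0 are possible when the hypothesis contains many insertions.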
Results showed ElevenLabs Scribe V2, Google Gemini 3 Flash, and AssemblyAI Universal 3-Pro as top performers. Scribe led on WER by narrow margins over AssemblyAI (0–0.13 percentage points, depending on the language pair). On the semantic metrics, Gemini 3 Flash consistently outperformed AssemblyAI on both AER and SWER, suggesting that Large Audio Language Models preserve meaning better because they bring language understanding to transcription. OpenAI Whisper Large V3 Turbo ranked last, with WER ranging from 0.16 to 0.61 (16% to 61%), a known limitation stemming from its default behavior of translating rather than transcribing code-switched audio.
Deepgram Nova-3 showed an unusual weakness: mid-tier SWER but last or second-to-last AER, indicating semantic errors concentrated in high-stakes details like case numbers and dates.
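One way to surface this kind of pattern in your own results is to rank systems separately on SWER and AER and look for systems whose AER rank is much worse than their SWER rank. The sketch below does that. The system names and scores are invented placeholders, not benchmark results, and the two-place gap used for flagging is an arbitrary assumption.

```python
def rank_by(scores: dict[str, float]) -> dict[str, int]:
    """Rank systems from 1 (lowest error) upward; ties are broken by name."""
    ordered = sorted(scores, key=lambda name: (scores[name], name))
    return {name: position for position, name in enumerate(ordered, start=1)}


def rank_gaps(swer: dict[str, float], aer: dict[str, float]) -> dict[str, int]:
    """Positive gap = the system ranks worse on AER than on SWER."""
    swer_rank = rank_by(swer)
    aer_rank = rank_by(aer)
    return {name: aer_rank[name] - swer_rank[name] for name in swer}


# Hypothetical scores for five unnamed systems (lower is better).
swer = {"system_a": 0.05, "system_b": 0.08, "system_c": 0.10,
        "system_d": 0.14, "system_e": 0.20}
aer = {"system_a": 0.04, "system_b": 0.06, "system_c": 0.19,
       "system_d": 0.09, "system_e": 0.15}

gaps = rank_gaps(swer, aer)
# Assumed cutoff: flag systems that drop two or more places from SWER to AER.
flagged = [name for name, gap in gaps.items() if gap >= 2]
print(gaps)
print(flagged)
```

A flagged system is worth a manual look at which tokens it gets wrong, since errors in identifiers such as case numbers and dates hurt downstream answers more than their share of the word count suggests.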
The cost of code-switching is smaller than expected for top models
Scribe V2, Gemini 3 Flash, and AssemblyAI incurred only 1–3 percentage point WER penalties when switching from monolingual to code-switched speech. In contrast, lower-ranked models degraded more substantially, suggesting code-switching primarily exposes robustness differences rather than creating a universal difficulty spike.
This matters because half the world's population speaks more than one language, and many bilingual speakers code-switch naturally in enterprise settings. Prior to this benchmark, vendors had no standardized way to measure the cost of code-switching, and contact centers and IT helpdesks had no way to predict whether a given ASR system would handle their actual customer base accurately.
The analysis also revealed that error type matters. A model like Deepgram Nova-3 might transcribe words correctly at the semantic level but still mishandle case numbers or names—exactly the details that propagate downstream. Raw WER alone misses this failure mode.
How to use this benchmark before signing an ASR contract
ServiceNow published the benchmark through AU-Harness, an open evaluation harness for voice models. Practitioners should test any candidate ASR vendor against this dataset using their actual language pair and use case (HR, IT support, or custom ITSM scenarios).
Pay attention to AER, not just WER: a vendor with strong WER but weak AER will still break downstream tasks. Also check the cost of code-switching by comparing each vendor's performance on monolingual audio in the matrix (dominant) language with its performance on code-switched audio. Top performers show small deltas; poor performers show deltas larger than 0.3 (30 percentage points on a 0–1 scale).
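The checks described above can be sketched as a small screening script. It assumes you have already collected, per vendor, a monolingual WER, a code-switched WER, and an AER, all on a 0–1 scale. The `VendorResult` structure, the vendor names, and the numbers are hypothetical; the 0.3 limit is the poor-performer delta mentioned above, interpreted here as a WER difference.

```python
from dataclasses import dataclass


@dataclass
class VendorResult:
    name: str
    mono_wer: float           # WER on monolingual matrix-language audio (0-1)
    code_switched_wer: float  # WER on code-switched audio (0-1)
    aer: float                # Answer Error Rate on code-switched audio (0-1)


def code_switching_penalty(result: VendorResult) -> float:
    """Extra WER incurred on code-switched audio versus the monolingual baseline."""
    return round(result.code_switched_wer - result.mono_wer, 4)


def flag_large_penalties(results: list[VendorResult], limit: float = 0.3) -> list[str]:
    """Names of vendors whose code-switching penalty exceeds the limit."""
    return [r.name for r in results if code_switching_penalty(r) > limit]


# Hypothetical measurements, not benchmark results.
results = [
    VendorResult("vendor_x", mono_wer=0.05, code_switched_wer=0.07, aer=0.06),
    VendorResult("vendor_y", mono_wer=0.08, code_switched_wer=0.45, aer=0.30),
]

# List vendors from best to worst AER, showing the WER penalty alongside.
for r in sorted(results, key=lambda r: r.aer):
    print(f"{r.name}: AER={r.aer:.2f}, penalty={code_switching_penalty(r):.2f}")

print(flag_large_penalties(results))
```

Sorting by AER first keeps the focus on downstream task success, with the WER penalty acting as a secondary robustness check.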
If your customer base includes German-English speakers, note that Nvidia Parakeet performed better than Nova-3 and Voxtral on that pair, suggesting language-pair-specific behavior is worth testing separately rather than relying on overall rankings.