Seven ASR models tested on bilingual speech — Scribe V2 and Gemini 3 Flash lead

Seven ASR systems benchmarked on code-switched speech

ServiceNow released a benchmark measuring how seven ASR systems handle code-switched speech—language switching mid-utterance—in HR and IT support scenarios. The dataset covers 918 utterances across four language pairs: Spanish-English (259 records), French-English (298), Canadian French-English (188), and German-English (173). All utterances were synthesized from parallel English and non-English transcripts, then reviewed by native-speaker linguists.

Three metrics were used: Word Error Rate (WER) for raw transcription accuracy, Semantic Word Error Rate (SWER) for meaning preservation, and Answer Error Rate (AER) for downstream task success, that is, whether transcription errors cause a downstream QA system to fail.
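For readers unfamiliar with the first metric, here is a minimal sketch of the standard WER definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. It is not the benchmark's own scoring code; the lowercase-and-split normalization and the sample utterance are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical code-switched utterance with one wrong digit group.
reference = "please reset mi contraseña for ticket 4521"
hypothesis = "please reset mi contraseña for ticket 4512"
print(round(word_error_rate(reference, hypothesis), 3))
```

In this made-up example only one of seven words is wrong, so WER stays low, yet a QA system asked for the ticket number would return the wrong answer. That is the kind of gap AER is meant to surface.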

Results showed ElevenLabs Scribe V2, Google Gemini 3 Flash, and AssemblyAI Universal 3-Pro as top performers. Scribe led on WER, with narrow margins over AssemblyAI (separated by 0–0.13 percentage points across language pairs). On semantic metrics, Gemini 3 Flash consistently outperformed AssemblyAI on AER and SWER, suggesting that Large Audio Language Models benefit from their built-in language understanding when meaning matters more than exact wording. OpenAI Whisper Large V3 Turbo ranked last, with WER ranging from 0.16 to 0.61, a known limitation that stems from its default behavior of translating rather than transcribing code-switched audio.

Deepgram Nova-3 showed an unusual weakness: mid-tier SWER but last or second-to-last AER, indicating semantic errors concentrated in high-stakes details like case numbers and dates.

The cost of code-switching is smaller than expected for top models

Scribe V2, Gemini 3 Flash, and AssemblyAI incurred only 1–3 percentage point WER penalties when switching from monolingual to code-switched speech. In contrast, lower-ranked models degraded more substantially, suggesting code-switching primarily exposes robustness differences rather than creating a universal difficulty spike.

This matters because half the world's population speaks more than one language, and many bilingual speakers code-switch naturally in enterprise settings. Prior to this benchmark, vendors had no standardized way to measure the accuracy cost of code-switching, and contact centers and IT helpdesks had no way to predict whether a given ASR system would handle their actual customer base accurately.

The analysis also revealed that error type matters. A model like Deepgram Nova-3 might transcribe words correctly at the semantic level but still mishandle case numbers or names—exactly the details that propagate downstream. Raw WER alone misses this failure mode.

How to use this benchmark before signing an ASR contract

ServiceNow published the benchmark through AU-Harness, an open evaluation harness for voice models. Practitioners should test any candidate ASR vendor against this dataset using their actual language pair and use case (HR, IT support, or custom ITSM scenarios).

Pay attention to AER, not just WER. A vendor with strong WER but weak AER will still break downstream tasks. Check the cost-of-code-switching analysis: compare the vendor's code-switched performance against its monolingual baseline in the matrix language (the dominant language of the utterance). Top performers show small deltas; poor performers show deltas larger than 0.3.
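The delta check described above can be sketched in a few lines. This is an illustration under stated assumptions, not part of AU-Harness: the function names, vendor labels, and scores are all hypothetical, and the 0.3 cutoff is taken from the text on the assumption that it is expressed in the same units as the error-rate scores being compared.

```python
def code_switching_cost(monolingual_score: float, code_switched_score: float) -> float:
    """Error-rate increase when moving from monolingual to code-switched audio."""
    return code_switched_score - monolingual_score


def flag_vendors(scores: dict[str, tuple[float, float]], threshold: float = 0.3) -> list[str]:
    """Return vendors whose code-switching delta exceeds the threshold.

    `scores` maps a vendor label to (monolingual, code_switched) error rates.
    """
    return [
        vendor
        for vendor, (mono, switched) in scores.items()
        if code_switching_cost(mono, switched) > threshold
    ]


# Hypothetical numbers for illustration only; not results from the benchmark.
example_scores = {
    "vendor_a": (0.05, 0.07),
    "vendor_b": (0.10, 0.45),
}
print(flag_vendors(example_scores))
```

Running the same comparison on AER as well as WER, and per language pair rather than on an overall average, follows the advice in this section.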

If your customer base includes German-English speakers, note that Nvidia Parakeet performed better than Nova-3 and Voxtral on that pair, suggesting language-pair-specific behavior is worth testing separately rather than relying on overall rankings.
