Our Take
Private test sets are standard practice in ML research, but calling this 'benchmaxxer repellent' overstates the fix since training data overlap remains the bigger gaming vector.
Why it matters
ASR leaderboards face the same Goodhart's Law problem as LLM benchmarks. Teams building speech systems need evaluation metrics that correlate with real-world performance, not just public test scores.
Do this week
Speech engineers: evaluate your models on the full leaderboard (private toggle on) before production deployment so you catch accent and conversational blind spots.
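A minimal sketch of that pre-deployment check, assuming the Hugging Face transformers, datasets, and evaluate libraries; the model name is only an example, and "your-org/conversational-accent-test" is a placeholder for whatever held-out conversational or accented set you actually use:

```python
# Hedged sketch: transcribe a held-out conversational/accented test set and score WER.
from datasets import load_dataset, Audio
from transformers import pipeline
import evaluate

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")  # example model
wer_metric = evaluate.load("wer")

# Placeholder dataset with "audio" and "text" columns; swap in your own held-out set.
ds = load_dataset("your-org/conversational-accent-test", split="test")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

predictions, references = [], []
for example in ds:
    out = asr(example["audio"])               # dict with "array" and "sampling_rate"
    predictions.append(out["text"].lower())   # crude normalization stand-in
    references.append(example["text"].lower())

print("WER:", wer_metric.compute(predictions=predictions, references=references))
```

Lowercasing here is only a crude stand-in for text normalization; apply whatever normalization your production metrics already use so the numbers stay comparable.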
Private datasets added to combat test set optimization
Hugging Face added private evaluation datasets to its Open ASR Leaderboard, sourced from Appen Inc. and DataoceanAI. The new datasets include 11 splits covering scripted and conversational speech across American, British, Australian, Canadian, and Indian English accents, totaling approximately 28 hours of audio (per Hugging Face's breakdown).
The leaderboard keeps its existing public dataset scoring by default. Users can toggle private dataset inclusion and see how rankings shift via a "Rank Δ" column. The private datasets probe specific conditions (scripted vs. conversational speech, American vs. non-American accents), and their scores are intentionally aggregated so models can't be tuned against any single data provider's split.
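As an illustration of that aggregation idea only (not Hugging Face's actual scoring code), the per-split numbers below are hypothetical; the point is that only the combined figure is surfaced:

```python
# Hypothetical per-split WERs keyed by (style, accent); only the aggregate is reported,
# so no individual provider's split is exposed for targeted tuning.
from statistics import mean

per_split_wer = {
    ("scripted", "en-US"): 4.1,
    ("conversational", "en-US"): 9.8,
    ("scripted", "en-IN"): 6.3,
    ("conversational", "en-AU"): 11.2,
}

def aggregate(split_scores: dict) -> float:
    """Macro-average WER across private splits."""
    return round(mean(split_scores.values()), 2)

print("Reported private WER:", aggregate(per_split_wer))
```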
Models are added through the same GitHub pull request process, with Hugging Face running private evaluations after verifying public dataset results. The leaderboard has recorded 710,000+ visits since its September 2023 launch (company-reported).
Gaming prevention faces structural limits
The move addresses benchmark-specific optimization where models improve leaderboard scores without corresponding real-world gains. However, the solution has gaps. Training data overlap with private test distributions remains possible, especially since data providers may offer similar datasets to model developers through other channels.
Sourcing from multiple data providers partially mitigates this risk, but Hugging Face acknowledges that "data from a similar distribution could still help the model on the corresponding evaluation set." The real test is whether private-set performance correlates with production ASR accuracy better than the existing public benchmarks do.
Evaluation strategy implications
The accent and style breakdowns provide more targeted evaluation than a single aggregate word error rate (WER) score. Conversational and non-American accent performance often diverges from scripted American English results, making the granular metrics useful for deployment planning.
Teams should evaluate models with private datasets enabled before production deployment, particularly for applications serving diverse accents or conversational audio. The leaderboard's dataset toggle feature allows customization for specific use cases, though practitioners still need domain-specific validation beyond any public benchmark.
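A sketch of that domain-specific validation, assuming transformers and jiwer plus a hypothetical local manifest.csv with path, reference, accent, and style columns; the per-condition breakdown mirrors the leaderboard's accent and style splits, but on your own audio:

```python
# Hedged sketch: per-condition WER on in-domain audio listed in a local manifest.
import csv
from collections import defaultdict

import jiwer
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")  # example model

buckets = defaultdict(lambda: {"refs": [], "hyps": []})
with open("manifest.csv", newline="") as f:
    for row in csv.DictReader(f):
        hyp = asr(row["path"])["text"]               # transcribe one local file
        key = (row["accent"], row["style"])          # e.g. ("en-IN", "conversational")
        buckets[key]["refs"].append(row["reference"].lower())
        buckets[key]["hyps"].append(hyp.lower())

# Per-condition WER surfaces accent and conversational blind spots
# instead of hiding them inside one aggregate score.
for (accent, style), pair in sorted(buckets.items()):
    score = jiwer.wer(pair["refs"], pair["hyps"])
    print(f"{accent:6s} {style:15s} WER={score:.3f}")
```

Passing file paths relies on the pipeline's built-in audio decoding (ffmpeg); feed raw arrays instead if that isn't available in your environment.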
The self-reporting option via YAML model cards provides faster feedback during development, with verification following through the official submission process.