Six rerankers beat their size class on MTEB, all under Apache 2.0

Six open rerankers, one training recipe

Hugging Face released six CrossEncoder rerankers (17.6M to 1.0B parameters) built on Ettin ModernBERT encoders from Johns Hopkins. All six rank in the top tier of their size class on the MTEB(eng, v2) Retrieval benchmark (per independent MTEB evaluation). The 1B model scores 0.6114 NDCG@10, within 0.0001 of its 1.54B teacher (mixedbread-ai/mxbai-rerank-large-v2, 0.6115); the 400M variant scores 0.6091.

Each model uses the same architecture: a ModernBERT backbone followed by a four-module classification head, giving the full stack Transformer → CLS pooling → Dense → LayerNorm → Dense. The models are trained by distillation, using pointwise MSE against the teacher's scores on a curated subset of lightonai/embeddings datasets. Full training code and hyperparameters are public.
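As a minimal sketch of the distillation objective described above: pointwise MSE averages the squared gap between the student's score and the teacher's score over (query, document) pairs. The function name and the toy scores are illustrative assumptions, not taken from the released training code.

```python
def pointwise_mse(student_scores, teacher_scores):
    """Mean squared error between student and teacher relevance scores.

    Each position corresponds to one (query, document) pair.
    """
    if len(student_scores) != len(teacher_scores):
        raise ValueError("score lists must have the same length")
    if not student_scores:
        raise ValueError("need at least one (query, document) pair")
    squared_gaps = [(s - t) ** 2 for s, t in zip(student_scores, teacher_scores)]
    return sum(squared_gaps) / len(squared_gaps)


# Toy scores for three (query, document) pairs; the values are made up.
teacher = [2.0, -1.0, 0.5]
student = [1.5, -1.0, 1.0]
loss = pointwise_mse(student, teacher)
```

In an actual training run the same quantity would be computed on batched tensors and backpropagated through the student; this version only shows what is being minimized.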

All six support up to 8,192 tokens of context. Loaded with Flash Attention 2 and bfloat16, the models run 1.7x to 8.3x faster than the default fp32 + SDPA configuration, depending on model size and sequence length (per Hugging Face benchmarking).

Size diversity matters more than absolute rank

The headline finding is not that the 1B model matches the teacher. It is that you can now pick a model by latency budget and still land in the 0.55–0.61 NDCG@10 range. The 17M variant scores 0.5576; the 68M hits 0.5915. That spread gives retrieval teams real optionality.

Reranking is expensive by design: a cross-encoder must run once per (query, document) pair in the top-K retrieved set, whereas an embedding model encodes each document once and reuses it. The standard production pattern is retrieve-then-rerank. If your retriever pulls top-100 candidates at <1ms per query, your reranker can afford 5–10ms per ranking pass without breaking user-facing latency budgets. The 17M model on consumer hardware typically hits that window. The 1B model does not, unless your corpus is small or your end-to-end budget is generous.
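The budget arithmetic above can be sketched as follows. This is a rough model under stated assumptions: batches run sequentially, each batch costs a fixed time, and the batch size and per-batch latency shown are hypothetical placeholders that you would replace with your own measurements.

```python
import math


def rerank_latency_ms(num_candidates, batch_size, ms_per_batch):
    """Estimate reranking time as one forward pass per batch of (query, doc) pairs."""
    if num_candidates < 0 or batch_size <= 0 or ms_per_batch < 0:
        raise ValueError("invalid candidate count, batch size, or latency")
    num_batches = math.ceil(num_candidates / batch_size)
    return num_batches * ms_per_batch


def fits_budget(retrieval_ms, rerank_ms, budget_ms):
    """True if retrieval plus reranking stays within the end-to-end budget."""
    return retrieval_ms + rerank_ms <= budget_ms


# Hypothetical numbers: top-100 candidates, batches of 32, 2 ms per batch.
rerank_ms = rerank_latency_ms(num_candidates=100, batch_size=32, ms_per_batch=2.0)
ok = fits_budget(retrieval_ms=1.0, rerank_ms=rerank_ms, budget_ms=10.0)
```

Because cost scales with the number of candidates, shrinking top-K is often the cheapest lever before moving to a smaller model.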

Because the training recipe is public and the teacher weights are available, teams can also distill and tune variants for domain-specific corpora without starting from scratch. The distillation approach (MSE on teacher scores) is standard, but the transparency reduces the friction of customization.

Benchmark before swapping; expect 0.5–2ms latency per ranking

Do not assume the 1B model is the right choice. Test the 68M or 32M variant first on your domain. MTEB results are averaged over 10 general retrieval tasks; your corpus may reward different ranking patterns. Use the provided usage examples (three lines of code) to score your top 100 production queries, measure NDCG or MRR against your ground truth, and trade down in size, stopping at the smallest model that still meets your target quality threshold.
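A minimal sketch of that evaluation loop, using only the standard library. It assumes graded relevance labels listed in the order the reranker returned the documents, uses the linear-gain form of DCG, and computes the ideal ranking from the judged labels in the list only. The helper names are illustrative, and the example scores are the MTEB figures quoted in this article, standing in for in-domain results.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions (linear gain)."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@k, normalized by the best ordering of the same judged labels."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


def reciprocal_rank(ranked_relevances):
    """1 / rank of the first relevant document, or 0.0 if none is relevant."""
    for rank, rel in enumerate(ranked_relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


def smallest_model_meeting(results, threshold):
    """Pick the smallest model whose quality meets the threshold.

    results: list of (name, params_millions, quality) tuples.
    Returns the model name, or None if no model passes.
    """
    passing = [r for r in results if r[2] >= threshold]
    return min(passing, key=lambda r: r[1])[0] if passing else None


# Placeholder labels and nominal sizes; swap in NDCG@10 measured on your own queries.
results = [("17M", 17.6, 0.5576), ("32M", 32, 0.5779), ("68M", 68, 0.5915)]
choice = smallest_model_meeting(results, threshold=0.57)
```

Average `ndcg_at_k` or `reciprocal_rank` over all evaluation queries to get one quality number per model before comparing against the threshold.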

If you are using an older reranker, whether proprietary (e.g., Cohere Reranker v2) or open (e.g., cross-encoder/ms-marco-MiniLM-L12-v2, which scores 0.5066 NDCG@10), the 32M Ettin model (0.5779) is a drop-in replacement with lower latency. If you are not reranking at all, start with the 68M variant paired with a fast embedding model like sentence-transformers/static-retrieval-mrl-en-v1 or google/embeddinggemma-300m. Verify the latency on your actual batch sizes and sequence lengths before committing to production.
