Our Take
Production-grade open rerankers at every size tier, released with public training code and an openly available 1.5B teacher to distill from, mean you can now swap models to fit your latency/quality tradeoff instead of accepting one vendor's default.
Why it matters
Retrieve-then-rerank (retrieve fast, rank accurately) is the proven pattern in search and retrieval systems, but many practitioners license a single proprietary reranker. Open, size-graduated alternatives with transparent training reduce vendor lock-in and let you cut per-query cost by choosing the smallest model that hits your NDCG target.
Do this week
Search engineer: benchmark the 32M and 68M variants against your current reranker on your top 100 queries so you can quantify the latency gain and relevance loss before migrating.
Six open rerankers, one training recipe
Hugging Face released six CrossEncoder rerankers (17.6M to 1.0B parameters) built on Ettin ModernBERT encoders from Johns Hopkins. All six rank in the top tier of their size class on the MTEB(eng, v2) Retrieval benchmark (per independent MTEB evaluation). The 1B model scores 0.6114 NDCG@10, within 0.0001 of the 1.54B teacher model (mixedbread-ai/mxbai-rerank-large-v2, 0.6115); the 400M variant scores 0.6091.
Each model uses the same architecture: a ModernBERT Transformer backbone followed by a four-module classification head (CLS pooling → Dense → LayerNorm → Dense). The training recipe is distillation: each student is trained with a pointwise MSE loss against the teacher's scores on a curated subset of lightonai/embeddings datasets. Full training code and hyperparameters are public.
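To make the distillation objective concrete, here is a minimal sketch of a pointwise MSE loss in plain Python. The function name and the toy scores are illustrative assumptions; the published recipe presumably computes the same quantity on batched tensors inside a training loop, and this sketch does not reproduce that code.

```python
def pointwise_mse(student_scores, teacher_scores):
    """Mean squared error between student and teacher relevance scores.

    Each position holds the score for one (query, document) pair.
    """
    if len(student_scores) != len(teacher_scores):
        raise ValueError("score lists must have the same length")
    if not student_scores:
        raise ValueError("need at least one scored pair")
    squared_errors = [(s - t) ** 2 for s, t in zip(student_scores, teacher_scores)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical scores for three (query, document) pairs.
teacher = [4.0, 0.0, -2.0]
student = [3.0, 1.0, -2.0]
loss = pointwise_mse(student, teacher)
print(f"distillation loss: {loss:.4f}")
```

Because the loss is pointwise, each pair contributes independently: the student is pushed to reproduce the teacher's score for that pair, not just the teacher's ordering of documents.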
All six support up to 8,192 tokens of context. Loaded in bfloat16 with Flash Attention 2, the models run 1.7x to 8.3x faster than the default fp32 + SDPA configuration, depending on model size and sequence length (per Hugging Face benchmarking).
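A minimal sketch of what that loading configuration might look like with the sentence-transformers CrossEncoder class. It assumes a recent sentence-transformers release in which CrossEncoder accepts model_kwargs, a GPU supported by Flash Attention 2, and the flash-attn package installed. The helper names are hypothetical, and the model ID is left as a parameter because the Hub IDs are not listed here.

```python
def build_load_kwargs(use_flash_attention=True):
    """Keyword arguments forwarded to the underlying transformers model loader."""
    kwargs = {"torch_dtype": "bfloat16"}
    if use_flash_attention:
        # Assumption: flash-attn is installed and the GPU supports it.
        kwargs["attn_implementation"] = "flash_attention_2"
    return kwargs


def load_reranker(model_id, max_length=8192, use_flash_attention=True):
    """Load a CrossEncoder reranker with the faster configuration (hypothetical helper)."""
    # Imported lazily so the kwargs helper can be used without the dependency.
    from sentence_transformers import CrossEncoder

    return CrossEncoder(
        model_id,
        max_length=max_length,
        model_kwargs=build_load_kwargs(use_flash_attention),
    )


print(build_load_kwargs())
```

Passing use_flash_attention=False falls back to the library's default attention implementation, which is a reasonable baseline to time against.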
Size diversity matters more than absolute rank
The headline finding is not that the 1B model matches the teacher. It is that you can now pick a model by latency budget and still land in the 0.55–0.61 NDCG@10 range. The 17M variant scores 0.5576; the 68M hits 0.5915. That spread gives retrieval teams real optionality.
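As a rough illustration of picking by quality target, the sketch below looks up the smallest variant whose published NDCG@10 clears a threshold. It uses only the five scores quoted in this article (the sixth model's score is not given here, so it is omitted), and the function name is an assumption. Published benchmark scores are a starting point, not a substitute for measuring on your own data.

```python
# (size label, published NDCG@10 on MTEB(eng, v2) Retrieval), ordered small to large.
PUBLISHED_NDCG_AT_10 = [
    ("17M", 0.5576),
    ("32M", 0.5779),
    ("68M", 0.5915),
    ("400M", 0.6091),
    ("1B", 0.6114),
]


def smallest_meeting_target(target, table=PUBLISHED_NDCG_AT_10):
    """Return the smallest size label whose score is at least `target`, else None."""
    for label, score in table:
        if score >= target:
            return label
    return None


print(smallest_meeting_target(0.58))
```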
Reranking is expensive by design: a cross-encoder must run once per (query, document) pair in the top-K retrieved set, whereas an embedding model encodes each document once and reuses it. The standard production pattern is retrieve-then-rerank. If your retriever pulls top-100 candidates at <1ms per query, your reranker can afford 5–10ms per ranking pass without breaking user-facing latency budgets. The 17M model on consumer hardware typically hits that window. The 1B model does not, unless your corpus is small or your end-to-end budget is generous.
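The cost structure above can be sketched in a few lines. The reranker below is a stand-in: score_pair is any callable that scores one (query, document) pair, and the word-overlap scorer is a toy assumption used only to show that the scorer runs once per candidate. The budget helper simply divides a per-pass budget by the number of candidates; on real hardware, batching changes the per-pair cost.

```python
def rerank(query, candidates, score_pair, top_n=10):
    """Score every (query, document) pair, then return the top_n documents."""
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]


def per_pair_budget_ms(rerank_budget_ms, top_k):
    """Average time available per pair if the whole pass must fit the budget."""
    if top_k <= 0:
        raise ValueError("top_k must be positive")
    return rerank_budget_ms / top_k


call_log = []


def overlap_score(query, doc):
    """Toy scorer (not a real cross-encoder): number of shared lowercase words."""
    call_log.append((query, doc))
    return len(set(query.lower().split()) & set(doc.lower().split()))


candidates = ["open reranker models", "cooking pasta at home", "reranker latency budget"]
top = rerank("reranker latency", candidates, overlap_score, top_n=2)
print(top, "scorer calls:", len(call_log))
print("ms per pair at 10 ms / top-100:", per_pair_budget_ms(10.0, 100))
```

At a 10ms budget and 100 candidates, the reranker has about 0.1ms per pair on average, which is why model size dominates the decision.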
Because the training recipe is public and the teacher weights are available, teams can also distill and tune variants for domain-specific corpora without starting from scratch. The distillation approach (MSE on teacher scores) is standard, but the transparency reduces the friction of customization.
Benchmark before swapping; expect 0.5–2ms latency per ranking
Do not assume the 1B model is the right choice. Test the 68M or 32M variant first on your domain. The MTEB results cover 10 general retrieval tasks; your corpus may reward different ranking patterns. Use the provided usage examples (three lines of code) to rerank the candidates for your top 100 production queries, measure NDCG or MRR against your ground truth, and step down in size until quality drops below your target. Then keep the smallest model that still met it.
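For the measurement step, a minimal sketch of NDCG@k and reciprocal rank in plain Python. It assumes you already have, for each query, the relevance labels of the documents in the order the reranker returned them. The ideal ordering is computed from those same labels, which is a simplification when some relevant documents were never retrieved. It uses linear gain, so the numbers may not match a given benchmark harness exactly.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@k for one query; 0.0 when no document has positive relevance."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


def reciprocal_rank(ranked_relevances):
    """1 / rank of the first relevant document, or 0.0 if there is none."""
    for rank, rel in enumerate(ranked_relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


# Hypothetical labels for two queries, in reranked order.
per_query = [[1, 0, 0], [0, 1, 0]]
mean_ndcg = sum(ndcg_at_k(r) for r in per_query) / len(per_query)
mrr = sum(reciprocal_rank(r) for r in per_query) / len(per_query)
print(f"mean NDCG@10: {mean_ndcg:.4f}  MRR: {mrr:.4f}")
```

Run the same labeled queries through each candidate size and compare the means against your current reranker before deciding how far down to trade.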
If you are using an older reranker (e.g., the proprietary Cohere Reranker v2, or the open cross-encoder/ms-marco-MiniLM-L12-v2, which scores 0.5066 NDCG@10), the 32M Ettin model (0.5779) is a drop-in replacement with lower latency. If you are not reranking at all, start with the 68M variant paired with a fast embedding model like sentence-transformers/static-retrieval-mrl-en-v1 or google/embeddinggemma-300m. Verify the latency on your actual batch sizes and sequence lengths before committing to production.
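One way to run that latency check is a small timing harness like the sketch below. It times any callable that scores a batch of (query, document) pairs and reports the median and a nearest-rank 95th percentile. The stand-in scorer and the batch shapes are assumptions; in practice you would pass your reranker's batch scoring function and batches drawn from real traffic.

```python
import math
import statistics
import time


def time_batches(score_batch, batches, warmup=1):
    """Return the latency in milliseconds of score_batch(batch) for each batch."""
    for batch in batches[:warmup]:
        score_batch(batch)  # warm-up calls are not recorded
    latencies_ms = []
    for batch in batches:
        start = time.perf_counter()
        score_batch(batch)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


def summarize(latencies_ms):
    """Median and nearest-rank 95th percentile of the recorded latencies."""
    ordered = sorted(latencies_ms)
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {"p50_ms": statistics.median(ordered), "p95_ms": ordered[p95_index]}


def fake_score_batch(batch):
    """Stand-in scorer (assumption): replace with your reranker's batch call."""
    return [len(query) + len(doc) for query, doc in batch]


# 20 batches of 100 pairs each, mimicking a top-100 rerank per query.
batches = [[("example query", "example document " * 50)] * 100 for _ in range(20)]
latencies = time_batches(fake_score_batch, batches)
summary = summarize(latencies)
print(summary)
```

Tail latency usually matters more than the median for user-facing search, so compare the 95th percentile against your budget, not just the average.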