News · June 2, 2026 · 2 min read

JetBrains releases Mellum2, a 12B code-and-text model activating only 2.5B parameters per token

Mellum2 is a Mixture-of-Experts model optimized for routing, RAG, and sub-agent tasks in multi-model systems. JetBrains reports more than 2x faster inference than similarly sized open models, with competitive benchmark results.

Our Take

A well-scoped routing and orchestration model for production systems is useful; the 2x faster inference claim rests on vendor benchmarks with no independent reproduction.

Why it matters

Teams building agent systems and RAG pipelines often overpay for large models on latency-sensitive tasks like routing and context compression. Mellum2 targets that exact gap with an open, self-hostable alternative.

Do this week

Infrastructure teams: download Mellum2 from Hugging Face and benchmark it against your current routing and sub-agent stack before the next sprint planning cycle, so you can decide whether to swap in a smaller model.

JetBrains open-sources a purpose-built routing model

JetBrains released Mellum2, a 12-billion-parameter Mixture-of-Experts model designed for text-and-code workloads. The model activates only 2.5B parameters per token, which the company's technical report on arXiv credits with reducing memory footprint and latency during inference. The full 12B weights generally still need to be loaded at serving time, so the savings come mainly from per-token compute. It is distributed under the Apache 2.0 license.

The stated use cases are routing and orchestration in multi-model systems, retrieval-augmented generation pipelines, agent subtasks, and self-hosted private deployments. JetBrains claims Mellum2 delivers more than 2x faster inference than similarly-sized open models while remaining competitive on code generation, reasoning, science, and math benchmarks (company-reported).

The MoE architecture allows the model to maintain high total capacity while activating only a subset of parameters per token. This efficiency is the core tradeoff: fewer active parameters mean lower serving costs for real-time workloads that do not require full model capacity.
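To make the sparse-activation idea concrete, here is a minimal toy sketch of top-k expert routing in plain Python. It is not Mellum2's router: the article does not state the expert count, the top-k value, or the gating function, so `NUM_EXPERTS`, `TOP_K`, and the scalar "experts" below are hypothetical stand-ins. Only the 2.5B and 12B figures come from the article.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    top = ranked[:k]
    weights = softmax([gate_logits[i] for i in top])  # renormalize over the chosen experts
    return list(zip(top, weights))


def moe_forward(x, experts, gate_logits, k):
    """Run only the selected experts and blend their outputs by gate weight."""
    routed = top_k_route(gate_logits, k)
    output = sum(w * experts[i](x) for i, w in routed)
    return output, [i for i, _ in routed]


# Hypothetical toy setup: 8 experts, 2 active per token (not Mellum2's real config).
NUM_EXPERTS, TOP_K = 8, 2
experts = [lambda x, s=s: x * (s + 1) for s in range(NUM_EXPERTS)]
rng = random.Random(0)
gate_logits = [rng.gauss(0, 1) for _ in range(NUM_EXPERTS)]

output, used = moe_forward(1.0, experts, gate_logits, TOP_K)

# Figures from the article: 2.5B active parameters out of 12B total.
active_fraction = 2.5 / 12
print(f"experts used: {used}, active share of parameters: {active_fraction:.0%}")
```

The other six experts are never called for this token, which is where the per-token compute saving comes from. All eight still exist in memory.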

The real win is scope, not scale

Modern production AI systems are becoming less monolithic. Teams now stitch together retrievers, routers, validators, tool callers, and larger reasoning models. Each of these components runs frequently and is latency-sensitive; few require a frontier model's full capability.

Mellum2 targets that middle layer. It is built to handle prompt classification, tool selection, context compression, summarization, planning, and validation without invoking a larger model for every intermediate step. This reduces both wall-clock latency and per-inference serving cost.
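A rough sketch of what that middle layer can look like in code, assuming a small model classifies each prompt and a larger model is called only as a fallback. Every name here is hypothetical (`Route`, `dispatch`, `stub_classify`, the handlers, the 0.7 threshold); the keyword classifier is a stand-in for a real call to a small model, not an API JetBrains ships.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Route:
    label: str         # e.g. "summarize", "tool_call"
    confidence: float  # classifier's confidence in the label, 0.0 to 1.0


def dispatch(
    prompt: str,
    classify: Callable[[str], Route],
    handlers: Dict[str, Callable[[str], str]],
    fallback: Callable[[str], str],
    min_confidence: float = 0.7,
) -> str:
    """Send the prompt to a cheap handler when the router is confident,
    otherwise escalate to the larger fallback model."""
    route = classify(prompt)
    handler = handlers.get(route.label)
    if handler is None or route.confidence < min_confidence:
        return fallback(prompt)
    return handler(prompt)


# Stand-in for a small-model classification call (hypothetical).
def stub_classify(prompt: str) -> Route:
    if "summarize" in prompt.lower():
        return Route("summarize", 0.95)
    if "weather" in prompt.lower():
        return Route("tool_call", 0.9)
    return Route("unknown", 0.3)


handlers = {
    "summarize": lambda p: "small-model summary",
    "tool_call": lambda p: "tool:get_weather",
}
large_model = lambda p: "large-model answer"

print(dispatch("Summarize this changelog", stub_classify, handlers, large_model))
print(dispatch("Prove this lemma", stub_classify, handlers, large_model))
```

The design choice worth testing is the confidence threshold: set it too low and the small model handles requests it should escalate, too high and the larger model is invoked for every step anyway.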

The open license and MoE efficiency also unlock self-hosted deployment on proprietary data. Teams working in regulated environments or handling sensitive codebases can run Mellum2 on private infrastructure without vendor lock-in.

The vendor-published benchmarks show competitive performance, but no independent reproduction yet exists. The 2x inference speedup claim should be validated against your own hardware and inference stack before committing to a swap.

Evaluate Mellum2 for your routing and sub-agent tier

If you are running a multi-model system with routing, RAG, or agent tasks, treat Mellum2 as a drop-in test candidate. Download the model from Hugging Face and run it against your current stack using your own latency and accuracy baselines.

Three questions to answer: Does it meet your latency SLA? Does it maintain accuracy on your benchmark tasks? Can you serve it cheaper than your current router or sub-agent model? If the answer to all three is yes, integration should be a low-effort swap.
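The three questions can be turned into a small evaluation harness. The sketch below is one possible shape, not a prescribed method: `evaluate`, `should_swap`, `Report`, the cost-per-second pricing model, and the stub model functions are all hypothetical, and you would replace the stubs with real calls into your serving stack.

```python
import math
import statistics
import time
from dataclasses import dataclass


@dataclass
class Report:
    p95_latency_s: float
    accuracy: float
    cost_per_1k_requests: float


def evaluate(model_fn, cases, cost_per_second):
    """Time each call, score exact-match accuracy, and estimate cost.

    cases is a list of (prompt, expected_answer) pairs.
    cost_per_second is an assumed flat hardware price, used only for comparison.
    """
    latencies, correct = [], 0
    for prompt, expected in cases:
        start = time.perf_counter()
        answer = model_fn(prompt)
        latencies.append(time.perf_counter() - start)
        correct += int(answer == expected)
    latencies.sort()
    p95_index = math.ceil(0.95 * len(latencies)) - 1  # nearest-rank percentile
    mean_latency = statistics.mean(latencies)
    return Report(
        p95_latency_s=latencies[p95_index],
        accuracy=correct / len(cases),
        cost_per_1k_requests=mean_latency * cost_per_second * 1000,
    )


def should_swap(candidate, baseline, latency_sla_s, max_accuracy_drop=0.0):
    """True only if all three questions are answered yes."""
    meets_sla = candidate.p95_latency_s <= latency_sla_s
    keeps_accuracy = candidate.accuracy >= baseline.accuracy - max_accuracy_drop
    is_cheaper = candidate.cost_per_1k_requests < baseline.cost_per_1k_requests
    return meets_sla and keeps_accuracy and is_cheaper


# Stub models standing in for real inference calls (hypothetical).
cases = [("route: refund request", "billing"), ("route: login error", "auth")]
answers = dict(cases)
candidate_report = evaluate(lambda p: answers[p], cases, cost_per_second=0.001)
print(candidate_report)
```

Run `evaluate` once for the current model and once for Mellum2 on the same cases and hardware, then pass both reports to `should_swap`. Exact-match scoring only suits classification-style tasks such as routing; summarization or planning would need a different accuracy measure.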

Pay special attention to the inference backend you choose. MoE models benefit from hardware and kernels that can parallelize across sparse activations, and some inference frameworks do not exploit this efficiently. Benchmark on your production hardware rather than relying on generic published numbers.

#LLM #Open Source #Agents #RAG #Developer Tools