JetBrains releases Mellum2, a 12B code-and-text model activating only 2.5B parameters per token

JetBrains open-sources a purpose-built routing model

JetBrains has released Mellum2, a 12-billion-parameter Mixture-of-Experts (MoE) model designed for text-and-code workloads. The model activates only 2.5B parameters per token, which reduces per-token compute and latency during inference (per the company's technical report on arXiv). It is distributed under the Apache 2.0 license.

The stated use cases are routing and orchestration in multi-model systems, retrieval-augmented generation (RAG) pipelines, agent subtasks, and self-hosted private deployments. JetBrains claims Mellum2 delivers more than 2x faster inference than similarly sized open models while remaining competitive on code generation, reasoning, science, and math benchmarks (company-reported).

The MoE architecture lets the model keep a high total capacity while activating only a subset of its parameters for each token. That is the core tradeoff: fewer active parameters mean lower per-token serving cost for real-time workloads that do not need the full model on every step.
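To make the tradeoff concrete, here is a back-of-envelope sketch using only the two figures quoted above (12B total, 2.5B active). The 2-FLOPs-per-active-parameter rule of thumb and the 2 bytes per parameter (bf16) are assumptions for illustration, not numbers from JetBrains, and the estimate ignores attention cost, expert-routing overhead, and memory bandwidth.

```python
# Back-of-envelope MoE arithmetic. Only the two parameter counts come from
# the article; everything else is an illustrative assumption.
TOTAL_PARAMS = 12e9    # total parameters (from the article)
ACTIVE_PARAMS = 2.5e9  # parameters activated per token (from the article)


def active_fraction(total: float, active: float) -> float:
    """Share of the model's parameters that participate in one token."""
    return active / total


def approx_flops_per_token(active: float) -> float:
    """Rule-of-thumb forward-pass cost: about 2 FLOPs per active parameter."""
    return 2.0 * active


def weight_memory_gb(total: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, assuming every expert stays resident."""
    return total * bytes_per_param / 1e9


if __name__ == "__main__":
    frac = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
    flops = approx_flops_per_token(ACTIVE_PARAMS)
    mem = weight_memory_gb(TOTAL_PARAMS, bytes_per_param=2)  # assumed bf16
    print(f"active fraction: {frac:.1%}")
    print(f"approx FLOPs per token: {flops:.2e}")
    print(f"weight memory (assumed bf16): {mem:.0f} GB")
```

Under these assumptions, roughly a fifth of the parameters do work on any given token, while the weights still occupy about 24 GB if every expert stays in memory. The savings show up in per-token compute, not in weight storage.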

The real win is scope, not scale

Modern production AI systems are becoming less monolithic. Teams now stitch together retrievers, routers, validators, tool callers, and larger reasoning models. Most of these components run frequently and are latency-sensitive; few require a frontier model's full capability.

Mellum2 targets that middle layer. It is built to handle prompt classification, tool selection, context compression, summarization, planning, and validation without invoking a larger model for every intermediate step. This reduces both wall-clock latency and per-inference serving cost.
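A minimal sketch of that middle layer, assuming each model is wrapped as a plain function from prompt to text. The names `route`, `ModelFn`, and `CLASSIFY_TEMPLATE`, and the SIMPLE/COMPLEX labels, are hypothetical and do not come from JetBrains; the point is only that the small model classifies every request and the large model runs only on escalation.

```python
from typing import Callable

# Hypothetical wrapper type: any function that maps a prompt to text,
# for example a client for a self-hosted small model.
ModelFn = Callable[[str], str]

# Illustrative classification prompt; the labels are an assumption.
CLASSIFY_TEMPLATE = (
    "Classify the request as SIMPLE or COMPLEX. "
    "Reply with one word.\n\nRequest: {prompt}"
)


def route(prompt: str, small_model: ModelFn, large_model: ModelFn) -> tuple[str, str]:
    """Return (tier, response).

    The small model labels the request first. Anything it does not label
    SIMPLE, including malformed output, escalates to the large model.
    """
    label = small_model(CLASSIFY_TEMPLATE.format(prompt=prompt)).strip().upper()
    if label == "SIMPLE":
        return "small", small_model(prompt)
    return "large", large_model(prompt)
```

Escalating on any unrecognized label is a deliberate choice: a router that fails open to the stronger model costs more on bad days but does not silently degrade answers.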

The open license and MoE efficiency also unlock self-hosted deployment on proprietary data. Teams working in regulated environments or handling sensitive codebases can run Mellum2 on private infrastructure without vendor lock-in.

The vendor-published benchmarks show competitive performance, but no independent reproduction exists yet. Validate the 2x inference speedup claim against your own hardware and inference stack before committing to a swap.

Evaluate Mellum2 for your routing and sub-agent tier

If you are running a multi-model system with routing, RAG, or agent tasks, treat Mellum2 as a drop-in test candidate. Download the model from Hugging Face and run it against your current stack using your own latency and accuracy baselines.

Three questions to answer: Does it meet your latency SLA? Does it maintain accuracy on your benchmark tasks? Can you serve it cheaper than your current router or sub-agent model? If all three are yes, the engineering lift to integrate is low.
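The three questions can be turned into a small go/no-go check. The sketch below is an illustration, not a benchmark suite: `evaluate`, `should_swap`, and `EvalResult` are hypothetical names, accuracy is exact-match on your own labeled cases, and the cost figures are whatever unit you already track (for example, dollars per thousand requests).

```python
import math
import time
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class EvalResult:
    p95_latency_s: float  # nearest-rank 95th percentile, in seconds
    accuracy: float       # fraction of exact-match answers


def evaluate(model: Callable[[str], str],
             cases: Sequence[tuple[str, str]]) -> EvalResult:
    """Run every (prompt, expected) case once and record latency and accuracy."""
    if not cases:
        raise ValueError("need at least one evaluation case")
    latencies = []
    correct = 0
    for prompt, expected in cases:
        start = time.perf_counter()
        output = model(prompt)
        latencies.append(time.perf_counter() - start)
        correct += int(output.strip() == expected)
    latencies.sort()
    rank = math.ceil(0.95 * len(latencies)) - 1  # nearest-rank percentile index
    return EvalResult(latencies[rank], correct / len(cases))


def should_swap(candidate: EvalResult, baseline: EvalResult,
                latency_sla_s: float, candidate_cost: float,
                baseline_cost: float, accuracy_tolerance: float = 0.0) -> bool:
    """True only if all three questions come back yes."""
    meets_sla = candidate.p95_latency_s <= latency_sla_s
    keeps_accuracy = candidate.accuracy >= baseline.accuracy - accuracy_tolerance
    is_cheaper = candidate_cost < baseline_cost
    return meets_sla and keeps_accuracy and is_cheaper
```

Run `evaluate` once for your current router or sub-agent model and once for the candidate on the same cases, then pass both results to `should_swap` with your own SLA and cost numbers.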

Pay special attention to the inference backend you choose. MoE models benefit from hardware and kernels that can parallelize across sparsely activated experts; some inference frameworks do not exploit this efficiently. Benchmark on your production hardware rather than relying on generic published numbers.
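One way to compare backends fairly is to time the same prompt through each one after a few warm-up calls. This is a minimal sketch: `median_latency` and `speedup` are hypothetical helpers, and `generate` stands in for whatever call your serving stack exposes. It measures single-request latency only, not throughput under concurrent load.

```python
import statistics
import time
from typing import Callable


def median_latency(generate: Callable[[str], str], prompt: str,
                   warmup: int = 3, runs: int = 10) -> float:
    """Median seconds per call, measured after untimed warm-up calls."""
    for _ in range(warmup):
        generate(prompt)  # untimed: lets caches and lazy initialization settle
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        generate(prompt)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup(baseline_s: float, candidate_s: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s
```

Calling `speedup(median_latency(current_backend, p), median_latency(candidate_backend, p))` on a representative prompt `p` gives a number you can hold against the vendor's 2x claim for your own setup.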
