Cohere's 30B Coding Model Matches 120B Rivals on Agentic Tasks

Cohere releases North Mini Code: 30B sparse MoE with 3B active parameters

Cohere has released North Mini Code, the first model in a new family designed specifically for agentic software engineering tasks. The model is a decoder-only Transformer with a sparse Mixture-of-Experts (MoE) architecture: 128 experts, of which 8 are activated per token. It interleaves sliding-window attention and global attention in a 3:1 ratio, and uses SwiGLU feed-forward blocks and an efficient attention implementation.
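To make the routing idea concrete, here is a minimal sketch of top-8-of-128 expert selection and a 3:1 attention interleave. The expert and top-k counts come from the article; the softmax over selected logits, the position of the global layer within each group of four, and the function names are assumptions for illustration, not Cohere's implementation.

```python
import math
import random

NUM_EXPERTS = 128  # from the article
TOP_K = 8          # experts activated per token, from the article


def route_token(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts.

    Normalizing with a softmax over only the selected logits is one common
    convention; whether North Mini Code does this is an assumption.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def attention_pattern(num_layers, local_per_global=3):
    """Layer types for a 3:1 sliding-window to global interleave (ordering assumed)."""
    return ["global" if (i + 1) % (local_per_global + 1) == 0 else "sliding_window"
            for i in range(num_layers)]


rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route_token(logits)
print(len(selected), "experts selected for this token")
print(attention_pattern(8))
```

Only the selected experts' feed-forward blocks run for a given token, which is where the gap between total and active parameters comes from.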

On Artificial Analysis' Coding Index, North Mini Code scores 33.4. Per Cohere's published benchmarks, that puts it ahead of Qwen 3.5 (35B-A3B) and Gemma 4 (26B-A4B), as well as substantially larger models including Nemotron 3 Super (120B-A12B), Mistral Small 4 (119B-A6B), and Devstral 2 (123B). After supervised fine-tuning, the model achieves 80.2% pass@10 on SWE-Bench Verified and 55.1% pass@10 on Terminal-Bench v2. After reinforcement learning with verifiable rewards (RLVR), it reaches 61.0% pass@1 on the mini-SWE-agent harness. Pass@1 and pass@10 are different metrics, so the post-RLVR figure is not directly comparable with the pass@10 numbers.
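For readers comparing the pass@1 and pass@10 figures, the sketch below uses the unbiased pass@k estimator that is widely used for code benchmarks. The article does not say how Cohere computed its numbers, so treating this as their method is an assumption; the sample counts are made up for illustration.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n attempts of which c passed: 1 - C(n-c, k) / C(n, k)."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k attempts include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# A hypothetical task solved in 2 of 10 attempts.
print(pass_at_k(n=10, c=2, k=1))
print(pass_at_k(n=10, c=2, k=10))
```

A task solved in only 2 of 10 attempts contributes 0.2 to pass@1 but a full 1.0 to pass@10, which is why pass@10 scores run well above pass@1 scores for the same model.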

Post-training discipline, not scale, drives the gains

The model's efficiency comes from a two-stage cascaded supervised fine-tuning pipeline followed by RLVR. The first SFT stage uses 64K context and a mixed dataset where code forms 70% of trainable tokens (43% agentic tool-use data, 27% competitive or scientific programming). The second stage trains on 128K context using only 4.5 billion tokens of high-quality agentic and reasoning-driven samples, where code forms 61% of trainable tokens and all tool calls and completions are verified as executable.
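The reported data mix can be written down as a small config and sanity-checked. The percentages and token count are from the article; the `SftStage` structure is hypothetical, and applying the 61% share to the 4.5 billion figure assumes that figure counts trainable tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SftStage:
    name: str
    context_label: str   # context length as reported in the article
    code_share_pct: int  # share of trainable tokens that are code
    breakdown_pct: dict  # reported sub-shares of the code portion, if any


STAGE_1 = SftStage("stage 1", "64K", 70, {
    "agentic tool use": 43,
    "competitive or scientific programming": 27,
})
STAGE_2 = SftStage("stage 2", "128K", 61, {})

STAGE_2_TOTAL_TOKENS = 4_500_000_000  # from the article


def non_code_share_pct(stage):
    """Share of trainable tokens that are not code."""
    return 100 - stage.code_share_pct


# The two stage-1 sub-shares should add up to the reported code share.
assert sum(STAGE_1.breakdown_pct.values()) == STAGE_1.code_share_pct

# Assumption: the 61% share applies to the full 4.5B-token figure.
stage_2_code_tokens = STAGE_2_TOTAL_TOKENS * STAGE_2.code_share_pct // 100
print(non_code_share_pct(STAGE_1), non_code_share_pct(STAGE_2), stage_2_code_tokens)
```

Under that assumption, stage two would contain roughly 2.7 billion code tokens, with the non-code share rising from 30% in stage one to 39% in stage two.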

Rather than optimize for quantitative metrics during SFT, Cohere treated it as priming for RLVR, relying on sample-level filtering to remove invalid tool calls, malformed tokens, and hallucinated citations. Over 70,000 verifiable tasks across approximately 5,000 unique repositories were used, with deduplication against SWE-Bench and SWE-Bench-Pro to avoid source leakage.
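A rough sketch of what sample-level filtering and benchmark deduplication could look like follows. The record layout, the `name`/`arguments` tool-call schema, the repository names, and the choice to deduplicate at repository level are all assumptions; the article does not describe Cohere's actual filters.

```python
import json


def is_valid_tool_call(raw, known_tools):
    """True if raw parses as a JSON object naming a known tool with dict arguments."""
    try:
        call = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return False
    return (isinstance(call, dict)
            and call.get("name") in known_tools
            and isinstance(call.get("arguments"), dict))


def filter_samples(samples, known_tools, benchmark_repos):
    """Drop samples from benchmark repositories or with any invalid tool call."""
    kept = []
    for sample in samples:
        if sample["repo"].lower() in benchmark_repos:
            continue  # repo-level deduplication (assumed granularity)
        if all(is_valid_tool_call(c, known_tools) for c in sample["tool_calls"]):
            kept.append(sample)
    return kept


# Hypothetical records: one clean, one with truncated JSON, one from a benchmark repo.
valid_call = '{"name": "bash", "arguments": {"cmd": "ls"}}'
samples = [
    {"repo": "example/app", "tool_calls": [valid_call]},
    {"repo": "example/app", "tool_calls": ['{"name": "bash", "arguments": ']},
    {"repo": "Benchmark/Repo", "tool_calls": [valid_call]},
]
kept = filter_samples(samples, known_tools={"bash"}, benchmark_repos={"benchmark/repo"})
print(len(kept), "of", len(samples), "samples kept")
```

Filtering whole samples rather than patching them keeps every retained trajectory internally consistent, at the cost of discarding some otherwise useful data.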

The RLVR stage used an asynchronous training loop (a vLLM sidecar fed rollouts continuously to an offline learner) to handle variable-length code traces. Weights were exported every four learner steps, and the model trained on both terminal-based and software engineering tasks simultaneously using binary rewards derived from unit-test-based verifiers. RLVR improved pass@1 performance by 7.9 percentage points on Terminal-Bench v2 and 3.0 percentage points on SWE-Bench.
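The shape of such a loop can be sketched with a producer and a consumer sharing a queue. This is a toy simulation under stated assumptions: no vLLM or real model is involved, the gradient update and weight export are stand-in counters, and the batch size, queue size, and all-tests-must-pass reward rule are illustrative choices. Only the four-step export interval and the binary, unit-test-based reward come from the article.

```python
import asyncio

EXPORT_EVERY = 4  # learner steps between weight exports, per the article


def binary_reward(test_results):
    """1.0 only if at least one test ran and every test passed, else 0.0."""
    return 1.0 if test_results and all(test_results) else 0.0


async def rollout_worker(queue, state, tasks):
    """Stand-in for the sampler: emits (weights_version, reward) pairs."""
    for test_results in tasks:
        await queue.put((state["version"], binary_reward(test_results)))
        await asyncio.sleep(0)  # yield so the learner can run in between
    await queue.put(None)  # sentinel: no more rollouts


async def learner(queue, state, batch_size):
    """Stand-in for the learner: one step per full batch, export every 4 steps."""
    batch = []
    while True:
        item = await queue.get()
        if item is None:
            break
        batch.append(item)
        if len(batch) == batch_size:
            state["steps"] += 1  # placeholder for a gradient update
            batch.clear()
            if state["steps"] % EXPORT_EVERY == 0:
                state["version"] += 1  # placeholder for exporting weights


async def main():
    state = {"version": 0, "steps": 0}
    queue = asyncio.Queue(maxsize=8)
    tasks = [[True, True], [True, False]] * 16  # 32 fake rollouts
    await asyncio.gather(rollout_worker(queue, state, tasks),
                         learner(queue, state, batch_size=2))
    return state


state = asyncio.run(main())
print(state)
```

Because the sampler never waits for a learner step to finish, long and short code traces can be generated side by side; the trade-off is that some rollouts are produced by slightly stale weights, which is why each one is tagged with a version here.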

Cross-harness generalization and inference cost matter more than benchmark scores

Cohere trained North Mini Code on multiple agent harnesses (SWE-Agent, mini-SWE-agent, OpenCode, and Terminal-Bench's Terminus 2) rather than optimizing for a single one. Adding just 6% benchmark harness data during the second SFT stage yielded a 10% gain when evaluated with OpenCode while maintaining SWE-Bench Verified performance. This matters because real agents encounter diverse tooling environments with different CLI interfaces, structured JSON responses, and raw stdout formats.
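To illustrate the kind of format diversity involved, the sketch below normalizes a tool result that may arrive either as structured JSON or as raw stdout. The `Observation` type and the `stdout` and `exit_code` field names are hypothetical and do not correspond to any specific harness's schema.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    stdout: str
    exit_code: Optional[int]  # None when the harness does not report one


def normalize(raw):
    """Map a structured-JSON or raw-stdout tool result to one Observation."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return Observation(stdout=raw, exit_code=None)
    if isinstance(payload, dict) and "stdout" in payload:
        return Observation(stdout=str(payload["stdout"]),
                           exit_code=payload.get("exit_code"))
    # Valid JSON that is not a result envelope (e.g. a bare number) is
    # treated as plain output.
    return Observation(stdout=raw, exit_code=None)


print(normalize('{"stdout": "2 passed", "exit_code": 0}'))
print(normalize("2 passed"))
```

A model trained on a single harness only ever sees one of these shapes, which is one plausible reason multi-harness training data would help it generalize.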

The sparse activation pattern (8 of 128 experts per token) significantly reduces per-token inference compute compared with a 30B dense model. Memory savings are less clear-cut, since every expert's weights still have to be stored even though only 8 run for any given token. Practitioners building code agents should test the model against their own tool harnesses and internal verification pipelines before assuming the published benchmarks translate to their environment. The model is available under Apache 2.0 on Hugging Face.
