Researchers Fix LLM Language Gaps With Consistency Training

A dataset and a training method for cross-lingual consistency

Researchers at multiple institutions released PolyFact, a 100K-fact multilingual QA dataset grounded in Wikidata and spanning 12 typologically diverse languages. They then compared three training approaches on Qwen-2.5-7B and OLMo-2-1124-7B: light continual pretraining on parallel data, supervised fine-tuning (SFT), and reinforcement learning via Group Relative Policy Optimization (GRPO).

According to the paper, GRPO outperformed SFT on both cross-lingual consistency and generalization to unseen languages, while continual pretraining on parallel data yielded limited gains. Mechanistic analysis showed that GRPO reorganizes how the models route information across languages: it reduces specialization in MLP layers and attention heads and promotes shared representations instead.
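
For readers new to GRPO, the "group relative" part refers to scoring each sampled answer against the other answers drawn for the same prompt, rather than against a learned value model. A minimal sketch of that normalization step follows, assuming a binary correctness reward and a population standard deviation. The reward actually used in the paper is not described here, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of answers sampled for the same prompt.

    Answers that beat the group average get a positive advantage,
    answers below it get a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four sampled answers to one question,
# with reward 1.0 for a correct answer and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every answer in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal, which is one reason the choice of reward matters.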

The authors plan to release code, trained models, and the dataset. The paper is currently under review at EMNLP 2026.

Models know facts in English but lose them in translation

Large language models trained predominantly on English data encode world knowledge reliably in that language but often fail to express the same facts accurately in other languages. This is not a knowledge gap; it is an expression problem. A model may correctly state in English that Paris is the capital of France, yet produce inconsistent or false statements about French geography when prompted in French.

For deployed systems serving multilingual users, this inconsistency erodes trust and adds support burden. The PolyFact dataset formalizes the measurement of this gap, and GRPO offers a concrete training lever that avoids full retraining. That matters for any team operating multilingual LLM deployments.

Run your own GRPO vs. SFT comparison on your languages

The paper tests two 7B models on 12 languages. Before adopting GRPO in production, validate the finding against your own multilingual QA distribution and model size. If your primary use case is 2 or 3 languages, or your model is significantly larger or smaller than the test models, the relative gains may differ. Once the authors release the code and dataset, run GRPO and SFT side by side on a held-out language pair. Compare not just accuracy but also inference latency, since reinforcement learning can change response length and with it serving cost.

If GRPO wins on your tasks and languages, prioritize it over supervised fine-tuning. If the gap is marginal, SFT's simplicity and shorter training time may be the better choice for your timeline.
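
The comparison described above needs a consistency score as well as per-language accuracy. Below is a minimal sketch of one way to compute both, assuming you already have a correct/incorrect judgment for each fact in each language. The pairwise metric here (facts correct in both languages divided by facts correct in at least one) is an illustrative choice, not necessarily the metric used in the paper. The fact IDs and results are hypothetical.

```python
from itertools import combinations


def _languages(results):
    """Collect every language code that appears in the results."""
    return sorted({lang for per_lang in results.values() for lang in per_lang})


def accuracy_by_language(results):
    """Fraction of facts answered correctly in each language."""
    return {
        lang: sum(bool(r.get(lang)) for r in results.values()) / len(results)
        for lang in _languages(results)
    }


def pairwise_consistency(results):
    """For each language pair: facts correct in both / facts correct in either."""
    scores = {}
    for a, b in combinations(_languages(results), 2):
        both = sum(1 for r in results.values() if r.get(a) and r.get(b))
        either = sum(1 for r in results.values() if r.get(a) or r.get(b))
        scores[(a, b)] = both / either if either else 0.0
    return scores


# Hypothetical judgments: fact ID -> language -> was the answer correct?
results = {
    "fact_1": {"en": True, "fr": True, "sw": False},
    "fact_2": {"en": True, "fr": False, "sw": False},
    "fact_3": {"en": True, "fr": True, "sw": True},
}

accuracy = accuracy_by_language(results)
consistency = pairwise_consistency(results)
```

Run this once on the SFT checkpoint and once on the GRPO checkpoint, using the same held-out facts, and compare the two sets of scores.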
