DPO cuts text degeneration 59% in OCR models without human labels

DPO reduced text degeneration across five model families without human preference labels

Hugging Face published results on applying Direct Preference Optimization (DPO) to OCR and structured document extraction. The work tested five model families on Brazilian Portuguese text extraction and measured text degeneration: how often a model falls into a repetition loop instead of producing a clean transcription.
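As a rough illustration of how such a rate can be measured, the sketch below flags an output when a short word sequence repeats back to back many times, then reports the share of flagged outputs. The function names, the 4-word window, and the 5-repeat threshold are all assumptions made for this example; the source does not say how the reported work detected loops.

```python
def has_repetition_loop(text: str, ngram: int = 4, min_repeats: int = 5) -> bool:
    """Return True if any word n-gram occurs back to back at least min_repeats times.

    Hypothetical heuristic: window size and threshold are illustrative guesses.
    """
    words = text.split()
    for start in range(len(words) - ngram + 1):
        gram = words[start:start + ngram]
        repeats = 1
        pos = start + ngram
        # Count immediately consecutive copies of the same n-gram.
        while words[pos:pos + ngram] == gram:
            repeats += 1
            pos += ngram
        if repeats >= min_repeats:
            return True
    return False


def degeneration_rate(outputs: list[str]) -> float:
    """Fraction of outputs flagged as containing a repetition loop."""
    if not outputs:
        return 0.0
    flagged = sum(has_repetition_loop(o) for o in outputs)
    return flagged / len(outputs)


if __name__ == "__main__":
    clean = "Nota fiscal emitida em 12 de março de 2024 para o cliente"
    looped = "valor total " * 12
    print(degeneration_rate([clean, looped, clean, clean]))
```

A word-level check like this misses character-level loops and would need tuning for documents that legitimately repeat content, such as tables with identical rows.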

Out-of-the-box (vanilla) degeneration rates ranged from below 1% to above 33% across open-source families (per the paper). Supervised fine-tuning (SFT) reduced those rates for most models but rarely to production-acceptable levels. A second training stage using DPO reduced degeneration in every family tested: the relative reduction averaged 59.4% and peaked at 87.6% (Nanonets-OCR2 3B: 1.61% to 0.20%, per company-reported results).
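The peak figure is a relative reduction, not a percentage-point drop. A quick check of the arithmetic, using only the two Nanonets-OCR2 3B rates quoted above (the helper name is made up for this example):

```python
def relative_reduction(before_pct: float, after_pct: float) -> float:
    """Relative drop from before_pct to after_pct, expressed in percent."""
    return (before_pct - after_pct) / before_pct * 100.0


if __name__ == "__main__":
    # Nanonets-OCR2 3B rates as quoted in the article: 1.61% before DPO, 0.20% after.
    print(round(relative_reduction(1.61, 0.20), 1))  # 87.6
```

In absolute terms the same change is 1.41 percentage points, which is why the two framings should not be mixed when comparing models.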

The critical design choice: the pipeline used the SFT model's own degenerate outputs as rejected examples in preference pairs rather than filtering them out as noise. The chosen outputs came from the same model on the same documents, scored by an automated LLM judge. No human annotators. No subjective preference judgments.
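A minimal sketch of that pairing step, under stated assumptions: several completions are sampled from the SFT model for one document, a detector marks the degenerate ones, and a judge scores the rest. `PreferencePair`, `build_pair`, the `min_score` cutoff, and both toy functions are hypothetical stand-ins; in the reported work the judge is an LLM, and the source gives no thresholds or sampling details.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def build_pair(
    prompt: str,
    samples: list[str],
    is_degenerate: Callable[[str], bool],
    judge_score: Callable[[str], float],
    min_score: float = 0.8,  # assumed cutoff, not taken from the source
) -> Optional[PreferencePair]:
    """Pair the best-scoring clean sample with one of the model's own degenerate samples."""
    rejected_pool = [s for s in samples if is_degenerate(s)]
    clean = [s for s in samples if not is_degenerate(s)]
    if not rejected_pool or not clean:
        return None  # nothing to contrast for this document
    best = max(clean, key=judge_score)
    if judge_score(best) < min_score:
        return None  # no clean sample is good enough to serve as "chosen"
    return PreferencePair(prompt=prompt, chosen=best, rejected=rejected_pool[0])


def toy_is_degenerate(text: str) -> bool:
    """Toy detector: long output made up of very few distinct words."""
    words = text.split()
    return len(words) >= 6 and len(set(words)) * 3 <= len(words)


def toy_judge(text: str) -> float:
    """Toy stand-in for an LLM judge: fraction of expected fields present."""
    expected = ["CNPJ", "Total"]
    return sum(field in text for field in expected) / len(expected)


if __name__ == "__main__":
    samples = [
        "CNPJ 12.345 Total 10,00",
        "CNPJ 12.345",
        "Total Total Total Total Total Total Total Total",
    ]
    print(build_pair("<document image>", samples, toy_is_degenerate, toy_judge))
```

Documents where the model never degenerates, or never produces an acceptable transcription, yield no pair here, so the resulting dataset concentrates on exactly the inputs where the failure mode shows up.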

SFT has a structural ceiling on degeneration that DPO can address

Supervised fine-tuning optimizes token by token. Each prediction is scored against the reference token given the reference prefix, so the model is never graded on its own completions. A repetition loop is never penalized as a completion-level failure; it is only a sequence of locally probable tokens. When the model enters a high-probability attractor region, it assigns elevated probability to the same token at the next step, deepening the loop until the sequence hits the token limit. The failure lives in the learned distribution itself; it is not a decoding artifact.

DPO inverts this logic. The training signal is the full output, chosen or rejected, which means a degenerated completion can be explicitly labeled as the wrong outcome. One model in the benchmark (Qwen2.5-VL-3B) showed this mechanism directly: vanilla degeneration rate of 0.60%, rising to 3.23% after SFT, before DPO brought it to 1.41%. SFT moved the model toward the task and simultaneously into proximity with the degeneration attractor. DPO pushed back against the attractor specifically.
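To make the contrast with token-level training concrete, here is a sketch of the standard DPO objective for a single preference pair, written in plain Python. Each input is the log-probability of an entire completion (the sum over its tokens) under either the model being trained or a frozen reference model. The function name, argument names, and the `beta` value of 0.1 are assumptions for illustration; the source does not report the hyperparameters or implementation used.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, not taken from the source
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of full completions."""
    # How much more (or less) likely each completion is under the policy than the reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed in a numerically stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


if __name__ == "__main__":
    # Policy identical to the reference: the loss is log(2).
    print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
    # Policy has shifted probability toward the clean transcription and away from the loop.
    print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

Because the rejected completion's log-probability enters the loss directly, a degenerate output is penalized as a whole, which is the completion-level signal that token-by-token SFT lacks.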

Almost all published DPO applications target chat alignment, where human judgment about helpfulness or harmlessness produces preference signals. OCR carries none of that subjectivity. The task is objective: a correct transcription is chosen; a degeneration loop is rejected. This work shows DPO works on objective tasks where preference signals come from task criteria, not annotator opinion.

Test DPO with your model's own failures as rejection pairs

If you are fine-tuning on structured generation and seeing persistent failure modes survive SFT, the mechanism is likely that your training objective does not penalize that failure directly. It only optimizes for correct outputs. The failure is simply outside the scope of what the training signal targets.

Before scaling human annotation, audit whether a second stage using DPO would help. Use your SFT model's own characteristic failures (not arbitrary low-quality outputs, but specific failure modes) as rejection examples. Pair them with validated correct outputs. Score the pairs with a task-specific judge, not human preference rankings. The results in this paper suggest the direction of improvement is consistent even when the magnitude varies across architectures.

This assumes you have a repeatable failure mode and a way to score correct outputs. If your failures are idiosyncratic or your scoring requires human judgment anyway, you are back to conventional preference annotation.
