Our Take
DPO works on objective tasks with zero human annotation by making a counterintuitive design choice: treat failure modes as the training signal, not as noise to discard.
Why it matters
Most published DPO work targets conversational alignment, where human preference judgments exist. This work shows the technique also applies to objective, non-chat tasks where preference signals come from task criteria alone, not annotator opinion. That expands where the method can ship.
Do this week
If you're fine-tuning on structured generation and persistent failure modes survive SFT, audit whether your training objective penalizes those failures directly or only optimizes for correct outputs (usually the latter). If it only optimizes for correct outputs, test DPO with your model's own bad outputs as the rejected side of preference pairs before scaling annotation.
DPO reduced text degeneration across five model families without human preference labels
Hugging Face published results showing Direct Preference Optimization applied to OCR and structured document extraction. The work tested five model families on Brazilian Portuguese text extraction. The metric was text degeneration: how often a model falls into a repetition loop instead of producing a clean transcription.
Vanilla degeneration rates ranged from below 1% to above 33% across open-source families (per the paper). Supervised fine-tuning reduced those rates for most models but rarely to production-acceptable levels. A second training stage using DPO reduced degeneration in every family tested, with an average relative reduction of 59.4% and a peak of 87.6% (Nanonets-OCR2 3B: 1.61% to 0.20%, per company-reported results).
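The 87.6% peak is a relative reduction, not a drop in percentage points. A minimal sketch of the arithmetic, using only the figures quoted above (the function name is illustrative):

```python
def relative_reduction(before_pct: float, after_pct: float) -> float:
    """Return the drop from before_pct to after_pct as a percentage of before_pct."""
    if before_pct <= 0:
        raise ValueError("before_pct must be positive")
    return (before_pct - after_pct) / before_pct * 100.0


# Nanonets-OCR2 3B figures quoted in the text: 1.61% -> 0.20%
print(round(relative_reduction(1.61, 0.20), 1))  # 87.6
```

In absolute terms that is a change of 1.41 percentage points, which is why relative and absolute reductions should not be compared across models with very different starting rates.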
The critical design choice: the pipeline used the SFT model's own degenerate outputs as the rejected examples in preference pairs, rather than filtering them out as noise. The chosen outputs came from the same model on the same documents, scored by an automated LLM judge. No human annotators. No subjective preference judgments.
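To reuse degenerate outputs as rejected examples, you first need a way to flag them. The article does not say how the paper detects repetition loops, so the heuristic below is an assumption: it flags an output whose tail is the same short run of whitespace-separated tokens repeated several times. The function names and thresholds are hypothetical.

```python
def is_degenerate(text: str, max_ngram: int = 8, min_repeats: int = 4) -> bool:
    """Heuristic (an assumption, not the paper's method): True if the output
    ends with the same n-gram repeated at least min_repeats times in a row."""
    tokens = text.split()
    for n in range(1, max_ngram + 1):
        if len(tokens) < n * min_repeats:
            break  # too short to hold min_repeats copies of an n-gram this long
        unit = tokens[-n:]
        repeats = 1
        end = len(tokens) - n
        # Walk backwards while the preceding n tokens match the trailing unit.
        while end - n >= 0 and tokens[end - n:end] == unit:
            repeats += 1
            end -= n
        if repeats >= min_repeats:
            return True
    return False


def degeneration_rate(outputs: list[str]) -> float:
    """Percentage of outputs flagged as degenerate."""
    if not outputs:
        return 0.0
    return 100.0 * sum(is_degenerate(o) for o in outputs) / len(outputs)


outputs = [
    "Nome: Maria Souza CPF: 000.000.000-00",
    "Valor total: R$ 10,00 " + "R$ 10,00 " * 30,  # loop that runs to the end
]
print(degeneration_rate(outputs))  # 50.0
```

This only catches exact loops that run to the end of the output, which matches the "repeats until the token limit" pattern described here. Loops with small variations would need a fuzzier check.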
SFT has a structural ceiling on degeneration that DPO can address
Supervised fine-tuning optimizes token by token. Each prediction is scored on its own against the reference token. A repetition loop is never penalized as a completion-level failure; to the loss, it is only a sequence of locally probable tokens. When the model enters a high-probability attractor region, it assigns elevated probability to the same token at the next step, deepening the loop until the sequence hits the token limit. This is a property of the learned distribution, not a decoding artifact.
DPO inverts this logic. The training signal is the full output, chosen or rejected, which means a degenerated completion can be explicitly labeled as the wrong outcome. One model in the benchmark (Qwen2.5-VL-3B) showed this mechanism directly: vanilla degeneration rate of 0.60%, rising to 3.23% after SFT, before DPO brought it to 1.41%. SFT moved the model toward the task and simultaneously into proximity with the degeneration attractor. DPO pushed back against the attractor specifically.
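The completion-level signal is visible in the standard DPO loss, which compares whole-sequence log-probabilities of the chosen and rejected outputs under the policy and a frozen reference model. The sketch below computes that loss for a single pair from already-summed log-probabilities. The value beta=0.1 and the example numbers are illustrative assumptions, not values from the paper.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative default, not taken from the paper
) -> float:
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full completion, so a
    degenerate completion is penalized as a whole, not token by token.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == softplus(-x), computed in a numerically stable form.
    x = -logits
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: no preference learned yet, loss is log(2).
print(round(dpo_loss(-40.0, -35.0, -40.0, -35.0), 4))  # 0.6931

# Policy has shifted probability toward the chosen output and away from the loop.
print(dpo_loss(-38.0, -45.0, -40.0, -35.0) < math.log(2))  # True
```

The loss falls only when the policy raises the chosen completion relative to the rejected one, measured against the reference. That is the pressure against the degeneration attractor that token-level SFT never applies.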
Almost all published DPO applications target chat alignment, where human judgment about helpfulness or harmlessness produces preference signals. OCR carries none of that subjectivity. The task is objective: a correct transcription is chosen; a degeneration loop is rejected. The results show DPO applies to objective tasks where preference signals come from task criteria, not annotator opinion.
Test DPO with your model's own failures as rejection pairs
If you are fine-tuning on structured generation and seeing persistent failure modes survive SFT, the mechanism is likely that your training objective does not penalize that failure directly. It only optimizes for correct outputs. The failure is simply outside the scope of what the training signal targets.
Before scaling human annotation, audit whether a second stage using DPO would help. Use your SFT model's own characteristic failures (not arbitrary low-quality outputs, but specific failure modes) as rejection examples. Pair them with validated correct outputs. Score the pairs with a task-specific judge, not human preference rankings. The results in this paper suggest the direction of improvement is consistent even when the magnitude varies across architectures.
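The pairing step described above can be sketched as follows. Everything here is a hypothetical illustration: the `Candidate` fields, the 0-to-1 judge score, and the 0.8 acceptance threshold are assumptions, and the judge itself is left out because the article does not describe it. The `prompt`/`chosen`/`rejected` record layout is a common convention for preference datasets.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    judge_score: float  # hypothetical 0-1 score from an automated task judge
    is_failure: bool    # True if the output shows the targeted failure mode


def build_preference_pairs(
    samples: dict[str, list[Candidate]],
    min_chosen_score: float = 0.8,  # assumed threshold, tune per task
) -> list[dict[str, str]]:
    """Pair the best validated output per prompt with each characteristic failure."""
    pairs = []
    for prompt, candidates in samples.items():
        good = [c for c in candidates
                if not c.is_failure and c.judge_score >= min_chosen_score]
        bad = [c for c in candidates if c.is_failure]
        if not good or not bad:
            continue  # a pair needs both a validated output and a real failure
        chosen = max(good, key=lambda c: c.judge_score)
        for rejected in bad:
            pairs.append(
                {"prompt": prompt, "chosen": chosen.text, "rejected": rejected.text}
            )
    return pairs


samples = {
    "doc_001": [
        Candidate("Nome: Maria Souza", judge_score=0.95, is_failure=False),
        Candidate("Nome: Maria Maria Maria Maria", judge_score=0.10, is_failure=True),
    ],
    "doc_002": [  # no failure sampled, so no pair is produced
        Candidate("Valor: R$ 10,00", judge_score=0.90, is_failure=False),
    ],
}
print(len(build_preference_pairs(samples)))  # 1
```

Prompts with no sampled failure are skipped on purpose. Filling the rejected side with arbitrary low-quality outputs would dilute the signal toward generic quality instead of the specific failure mode.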
This assumes you have a repeatable failure mode and a way to score correct outputs. If your failures are idiosyncratic or your scoring requires human judgment anyway, you are back to conventional preference annotation.