Our Take
Four specific config changes were needed to restore training parity, but this level of migration complexity signals deeper API stability issues for production RL workloads.
Why it matters
Teams running online RL with vLLM face hidden correctness traps when upgrading. The fixes are documented, but the debugging process took substantial engineering effort that most teams lack the bandwidth to replicate.
Do this week
RL teams: audit your vLLM upgrade timeline before Q2 and budget 2-3 sprint cycles for migration testing if you depend on logprob precision.
Four config changes restored training parity
ServiceNow's PipelineRL team documented their vLLM V0 to V1 migration, revealing four specific fixes needed to match V0 reinforcement learning training dynamics. The team used vLLM 0.8.5 as their V0 reference and migrated to vLLM 0.18.1.
The required changes were: (1) setting logprobs-mode to processed_logprobs instead of V1's raw-output default, (2) disabling V1's new prefix caching and async scheduling defaults, (3) matching V0's inflight weight update behavior with specific pause/resume parameters, and (4) forcing fp32 precision for the language model head projection.
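A minimal sketch of how the first two changes might look as engine arguments. The keyword names (logprobs_mode, enable_prefix_caching) follow recent vLLM releases and are assumptions to verify against your installed version; the model name is a placeholder:

```python
# Hedged sketch, not the team's exact configuration: argument names
# are assumptions based on recent vLLM releases.
from vllm import LLM

llm = LLM(
    model="your-policy-model",           # placeholder
    logprobs_mode="processed_logprobs",  # V1 otherwise returns raw logprobs
    enable_prefix_caching=False,         # avoid reusing pre-update cached state
)
# The inflight weight-update pause/resume parameters and the fp32 head
# setting are version-specific; check the post for the exact values used.
```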
Without these fixes, their initial V1 run showed immediate divergence in key training metrics. Clip rate, KL divergence, entropy, and reward curves separated from the V0 reference early in training. The team's methodology was strict: establish backend parity first, then evaluate objective-level changes.
Default changes break training assumptions
The core issue was that vLLM V1's new defaults violated assumptions built into the RL trainer. The trainer expected logprobs from the processed distribution (after temperature scaling and filtering), but V1 returned raw model outputs by default. Similarly, V1's prefix caching could reuse cached state computed before weight updates, creating staleness the trainer didn't account for.
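To make the semantic gap concrete, here is a toy illustration (pure Python, not vLLM code) of the two logprob modes, assuming temperature scaling is the only processing step applied before sampling:

```python
import math

def raw_logprobs(logits):
    # Log-softmax of the unmodified logits (the "raw" mode).
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def processed_logprobs(logits, temperature):
    # Log-softmax after temperature scaling, matching the distribution
    # the sampler actually drew tokens from.
    scaled = [x / temperature for x in logits]
    return raw_logprobs(scaled)

logits = [2.0, 1.0, 0.0]
# At temperature != 1 the two modes disagree, so a trainer expecting
# processed logprobs computes importance ratios against the wrong
# distribution when the backend silently returns raw ones.
```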
The fp32 precision requirement matches findings from other teams. MiniMax-M1's technical report described a similar training/inference mismatch traced to the output head precision. ScaleRL later included fp32 logits computation as part of their RL recipe and demonstrated its value through ablation studies.
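A toy illustration (pure Python, standing in for a bf16 head) of why final-projection precision matters: truncating logits to bfloat16's 8-bit mantissa shifts the log-softmax the sampler sees, especially for near-tied logits:

```python
import math
import struct

def to_bf16(x):
    # Truncate a float to bfloat16 precision by zeroing the low 16 bits
    # of its float32 representation (round-toward-zero approximation).
    f32 = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", f32 & 0xFFFF0000))[0]

def log_softmax(xs):
    m = max(xs)
    lse = m + math.log(sum(math.exp(v - m) for v in xs))
    return [x - lse for x in xs]

logits = [3.1415, 3.1379, -1.25, 0.5]
lp_fp32 = log_softmax(logits)
lp_bf16 = log_softmax([to_bf16(x) for x in logits])
# The near-tied first two logits land on different bf16 grid points,
# so the low-precision head yields a measurably different distribution
# from the fp32 one the trainer recomputes.
```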
The team's diagnostic approach separated three possible failure modes: semantic mismatches in logprob meaning, inference-path differences from new runtime defaults, and objective-level corrections needed for remaining staleness. They systematically ruled out the first two before considering the third.
Config audits prevent silent failures
The fixes are straightforward once identified, but the debugging process required deep system knowledge. Teams should treat vLLM V1 migration as a compatibility break for RL workloads, not a drop-in replacement.
Key configuration parameters to verify: (1) logprobs-mode set to processed_logprobs, (2) prefix caching and async scheduling explicitly disabled for parity testing, (3) weight update synchronization matching your V0 behavior, and (4) fp32 precision for the final projection layer.
The team recommends fixing backend correctness before adding objective-side corrections like importance sampling. Mixed corrections can mask inference bugs and make training curves harder to interpret. Track policy ratios, clip rates, and reward curves as early indicators of migration problems.
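As a monitoring sketch (a hypothetical helper, not from the post), per-token policy ratios and the clip rate can be tracked per batch along these lines, assuming a PPO-style clip threshold eps:

```python
import math

def clip_stats(trainer_logprobs, sampler_logprobs, eps=0.2):
    # Importance ratio between the trainer's recomputed logprobs and
    # the logprobs returned by the inference backend, per token.
    ratios = [math.exp(t - s) for t, s in zip(trainer_logprobs, sampler_logprobs)]
    # Fraction of tokens whose ratio falls outside the clip window.
    clipped = sum(1 for r in ratios if r < 1 - eps or r > 1 + eps)
    return ratios, clipped / len(ratios)

# In a healthy parity run the backends agree: ratios hover near 1.0 and
# the clip rate stays near zero. A rising clip rate early in training is
# one of the first visible symptoms of a backend mismatch.
```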