Our Take
Four specific config changes were needed to restore training parity, but this level of migration complexity signals deeper API stability issues for production RL workloads.
Why it matters
Teams running online RL with vLLM face hidden correctness traps when upgrading. The fixes are documented, but the debugging process took substantial engineering effort that most teams lack the bandwidth to replicate.
Do this week
RL teams: audit your vLLM upgrade timeline before Q2 and budget 2-3 sprint cycles for migration testing if you depend on logprob precision.
Four config changes restored training parity
ServiceNow's PipelineRL team documented their vLLM V0 to V1 migration, revealing four specific fixes needed to match V0 reinforcement learning training dynamics. The team used vLLM 0.8.5 as their V0 reference and migrated to vLLM 0.18.1.
The required changes were: (1) setting logprobs-mode to processed_logprobs instead of V1's raw-output default, (2) disabling V1's new prefix caching and async scheduling defaults, (3) matching V0's inflight weight update behavior with specific pause/resume parameters, and (4) forcing fp32 precision for the language model head projection.
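A minimal sketch of how the first two changes might look as engine arguments. The keyword names (logprobs_mode, enable_prefix_caching) follow recent vLLM releases and are assumptions to verify against your installed version; the model name is a placeholder:

```python
# Hedged sketch, not the team's exact configuration: argument names
# are assumptions based on recent vLLM releases.
from vllm import LLM

llm = LLM(
    model="your-policy-model",           # placeholder
    logprobs_mode="processed_logprobs",  # V1 otherwise returns raw logprobs
    enable_prefix_caching=False,         # avoid reusing pre-update cached state
)
# The inflight weight-update pause/resume parameters and the fp32 head
# setting are version-specific; check the post for the exact values used.
```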
Without these fixes, their initial V1 run showed immediate divergence in key training metrics. Clip rate, KL divergence, entropy, and reward curves separated from the V0 reference early in training. The team's methodology was strict: establish backend parity first, then evaluate objective-level changes.
Default changes break training assumptions
The core issue was that vLLM V1's new defaults violated assumptions built into the RL trainer. The trainer expected logprobs from the processed distribution (after temperature scaling and filtering), but V1 returned raw model outputs by default. Similarly, V1's prefix caching could reuse cached state computed before weight updates, creating staleness the trainer didn't account for.
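To make the semantic gap concrete, here is a toy illustration (pure Python, not vLLM code) of the two logprob modes, assuming temperature scaling is the only processing step applied before sampling:

```python
import math

def raw_logprobs(logits):
    # Log-softmax of the unmodified logits (the "raw" mode).
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def processed_logprobs(logits, temperature):
    # Log-softmax after temperature scaling, matching the distribution
    # the sampler actually drew tokens from.
    scaled = [x / temperature for x in logits]
    return raw_logprobs(scaled)

logits = [2.0, 1.0, 0.0]
# At temperature != 1 the two modes disagree, so a trainer expecting
# processed logprobs computes importance ratios against the wrong
# distribution when the backend silently returns raw ones.
```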
The fp32 precision requirement matches findings from other teams. MiniMax-M1's technical report described a similar training/inference mismatch traced to the output head precision. ScaleRL later included fp32 logits computation as part of their RL recipe and demonstrated its value through ablation studies.
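A toy illustration (pure Python, standing in for a bf16 head) of why final-projection precision matters: truncating logits to bfloat16's 8-bit mantissa shifts the log-softmax the sampler sees, especially for near-tied logits:

```python
import math
import struct

def to_bf16(x):
    # Truncate a float to bfloat16 precision by zeroing the low 16 bits
    # of its float32 representation (round-toward-zero approximation).
    f32 = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", f32 & 0xFFFF0000))[0]

def log_softmax(xs):
    m = max(xs)
    lse = m + math.log(sum(math.exp(v - m) for v in xs))
    return [x - lse for x in xs]

logits = [3.1415, 3.1379, -1.25, 0.5]
lp_fp32 = log_softmax(logits)
lp_bf16 = log_softmax([to_bf16(x) for x in logits])
# The near-tied first two logits land on different bf16 grid points,
# so the low-precision head yields a measurably different distribution
# from the fp32 one the trainer recomputes.
```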
The team's diagnostic approach separated three possible failure modes: semantic mismatches in logprob meaning, inference-path differences from new runtime defaults, and objective-level corrections needed for remaining staleness. They systematically ruled out the first two before considering the third.
Config audits prevent silent failures
The fixes are straightforward once identified, but the debugging process required deep system knowledge. Teams should treat vLLM V1 migration as a compatibility break for RL workloads, not a drop-in replacement.
Key configuration parameters to verify: (1) logprobs-mode set to processed_logprobs, (2) prefix caching and async scheduling explicitly disabled for parity testing, (3) weight update synchronization matching your V0 behavior, and (4) fp32 precision for the final projection layer.
The team recommends fixing backend correctness before adding objective-side corrections like importance sampling. Mixed corrections can mask inference bugs and make training curves harder to interpret. Track policy ratios, clip rates, and reward curves as early indicators of migration problems.
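As a monitoring sketch (a hypothetical helper, not from the post), per-token policy ratios and the clip rate can be tracked per batch along these lines, assuming a PPO-style clip threshold eps:

```python
import math

def clip_stats(trainer_logprobs, sampler_logprobs, eps=0.2):
    # Importance ratio between the trainer's recomputed logprobs and
    # the logprobs returned by the inference backend, per token.
    ratios = [math.exp(t - s) for t, s in zip(trainer_logprobs, sampler_logprobs)]
    # Fraction of tokens whose ratio falls outside the clip window.
    clipped = sum(1 for r in ratios if r < 1 - eps or r > 1 + eps)
    return ratios, clipped / len(ratios)

# In a healthy parity run the backends agree: ratios hover near 1.0 and
# the clip rate stays near zero. A rising clip rate early in training is
# one of the first visible symptoms of a backend mismatch.
```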