Use Case · May 19, 2026 · 3 min read

Fine-tune Cosmos 2.5 for robot video on one H100 GPU with LoRA

NVIDIA's Cosmos Predict 2.5 now supports parameter-efficient fine-tuning via LoRA and DoRA. Train domain-specific robot video models in 17 hours on a single H100, or 2.5 hours on eight GPUs, without catastrophic forgetting.

Our Take

This is a working tutorial, not a breakthrough: LoRA and DoRA are established techniques, and the post demonstrates their application to an existing model without new architectural or algorithmic contributions.

Why it matters

Robot learning teams often lack the compute budget or GPU memory for full model fine-tuning. Parameter-efficient adapters make it practical to specialize a 2B-parameter world model to specific manipulation tasks or camera viewpoints on commodity hardware.

Do this week

Roboticists: download the training script from diffusers/examples/cosmos and test LoRA rank=32 on your own robot video dataset, so you can measure whether synthetic trajectory quality improves your downstream policy learning.

NVIDIA publishes LoRA and DoRA fine-tuning code for Cosmos Predict 2.5

NVIDIA and Hugging Face released a complete implementation guide for parameter-efficient fine-tuning of Cosmos Predict 2.5, a 2B-parameter world model that generates physically plausible video conditioned on text prompts, images, or video clips. The guide covers both Low-Rank Adaptation (LoRA), which learns a low-rank update to each adapted weight matrix, and its variant, Weight-Decomposed Low-Rank Adaptation (DoRA), which splits each weight into magnitude and direction components and applies the low-rank update to the direction.
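To make the difference between the two methods concrete, here is a minimal NumPy sketch of how each one forms its effective weight. It follows the common convention from the LoRA and DoRA literature (zero-initialized up-projection, alpha/rank scaling, one magnitude per output row); the toy sizes and all names are illustrative assumptions and are not taken from the Cosmos training script.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank, alpha = 8, 6, 2, 4.0      # toy sizes, for illustration only

W0 = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((rank, d_in)) * 0.01 # trainable down-projection
B = np.zeros((d_out, rank))                  # trainable up-projection, starts at zero
scale = alpha / rank


def lora_weight(W0, A, B, scale):
    """LoRA: add a low-rank update to the frozen weight."""
    return W0 + scale * (B @ A)


def dora_weight(W0, A, B, scale, magnitude):
    """DoRA: apply the low-rank update, normalize each output row to unit
    length (the direction), then rescale by a trainable magnitude."""
    V = W0 + scale * (B @ A)
    row_norm = np.linalg.norm(V, axis=1, keepdims=True)
    return magnitude * (V / row_norm)


# The magnitude starts at the row norms of the pretrained weight, so both
# methods reproduce the base model exactly before any training step.
magnitude = np.linalg.norm(W0, axis=1, keepdims=True)
```

Because B starts at zero, both functions return W0 at initialization; training then moves only A, B, and (for DoRA) the magnitude vector.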

The training setup freezes the VAE, text encoder, and DiT (diffusion transformer) base weights and injects trainable adapter modules only into the DiT's attention projections and feedforward layers. Adapter files remain small and portable: the example configuration with rank=32 produces approximately 50M trainable parameters out of 2B total.
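A rough way to see why the adapter stays small: a rank-r LoRA adapter on a linear layer with d_in inputs and d_out outputs adds r × (d_in + d_out) parameters. The sketch below applies that formula to a hypothetical transformer. The hidden size, feedforward width, block count, and layer list are placeholders chosen for illustration; they are not read from the Cosmos checkpoint and are not meant to reproduce the 50M figure.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Placeholder dimensions, for illustration only.
hidden, ffn, num_blocks, rank = 2048, 8192, 28, 32

# (d_in, d_out) for each adapted linear layer in one hypothetical block.
per_block_layers = [
    (hidden, hidden),  # attention query projection
    (hidden, hidden),  # attention key projection
    (hidden, hidden),  # attention value projection
    (hidden, hidden),  # attention output projection
    (hidden, ffn),     # feedforward up-projection
    (ffn, hidden),     # feedforward down-projection
]

per_block = sum(lora_param_count(d_in, d_out, rank) for d_in, d_out in per_block_layers)
total = per_block * num_blocks
print(f"{total / 1e6:.1f}M trainable adapter parameters")
```

The count grows linearly with rank, so doubling the rank doubles both the trainable parameters and the adapter file size.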

Training on 92 robot manipulation videos with text prompts describing pick-and-place tasks takes 17 hours on a single 80GB GPU (H100) or 2.5 hours across eight H100s (company-reported). The code uses the diffusers and accelerate libraries and supports mixed-precision training in bfloat16 with float32 upcasting of LoRA parameters for numerical stability. Inference fuses the adapter weights directly into the base model, eliminating any overhead from the decomposition.
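The claim that fusing removes inference overhead follows from linearity: the adapter path can be folded into the base weight once, after which each call is a single matrix multiply. The following NumPy sketch checks this on toy matrices; the sizes and the `scale` value (standing in for alpha/rank) are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 8, 6, 2
scale = 0.5  # stands in for alpha / rank

W0 = rng.standard_normal((d_out, d_in))  # frozen base weight
A = rng.standard_normal((rank, d_in))    # trained down-projection
B = rng.standard_normal((d_out, rank))   # trained up-projection
x = rng.standard_normal(d_in)            # one input vector

# Unfused: the base path plus a separate low-rank adapter path.
y_unfused = W0 @ x + scale * (B @ (A @ x))

# Fused: fold the adapter into the weight once, then one matmul per call.
W_fused = W0 + scale * (B @ A)
y_fused = W_fused @ x

assert np.allclose(y_unfused, y_fused)
```

The trade-off is that a fused model no longer keeps the adapter separate, so swapping adapters means starting again from the unfused base weights.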

Synthetic data generation for robot learning becomes accessible

Collecting real robot trajectories for policy training is slow and expensive. A fine-tuned video world model can generate synthetic demonstrations conditioned on task descriptions and initial frames, providing a scalable alternative for downstream robot learning tasks. The bottleneck has been compute cost: full fine-tuning of a multi-billion-parameter model requires significant memory and multiple GPUs.

Parameter-efficient adapters address this by freezing the base model and training only a small low-rank residual on top of it. A single H100, and possibly smaller GPUs, becomes viable for specializing Cosmos to specific robot morphologies, camera setups, or manipulation domains. Because adapters are separate files, practitioners can swap them in and out to maintain multiple domain-specific models without retraining the base.

The guide also specifies evaluation metrics: Temporal Sampson Error (frame-to-frame geometric consistency) and Cross-view Sampson Error (multi-view alignment), so that generated video quality is measured rather than assumed.

Adapt Cosmos to your robot domain with off-the-shelf hardware

The provided train_cosmos_predict25_lora.py script handles data loading, loss computation via rectified flow (predicting velocity toward clean data), and checkpointing. Start with LoRA rank=32 and experiment upward if your memory and adapter file size budgets permit; a higher rank increases expressiveness but also increases memory use and checkpoint size.
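As a rough illustration of the rectified-flow objective, the sketch below builds one training example in NumPy. It assumes the convention where t=0 is pure noise and t=1 is clean data, so the target velocity points from noise toward the clean sample; other implementations flip the direction or the sign, and the script's exact convention and timestep sampling are not given here. All names are hypothetical.

```python
import numpy as np


def rectified_flow_example(clean, noise, t):
    """Interpolate on the straight line from noise (t=0) to clean data (t=1)
    and return the noisy input plus the constant velocity along that line."""
    x_t = (1.0 - t) * noise + t * clean
    target_velocity = clean - noise
    return x_t, target_velocity


def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((predicted_velocity - target_velocity) ** 2))


rng = np.random.default_rng(0)
clean = rng.standard_normal((4, 8))   # stand-in for a clean video latent
noise = rng.standard_normal((4, 8))   # Gaussian noise of the same shape
x_t, v = rectified_flow_example(clean, noise, t=0.3)

# Following the true velocity for the remaining time lands on the clean data.
assert np.allclose(x_t + (1.0 - 0.3) * v, clean)
```

In training, the model's output for `x_t` would take the place of `predicted_velocity`, and only the adapter parameters would receive gradients.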

Data preparation requires pairing videos with text captions and optional initial-frame images. The training loop samples random 93-frame windows from longer videos, enabling temporal augmentation. A learning rate scheduler warms up linearly, peaks at a configurable multiple of the base learning rate, then decays linearly to a floor.
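The window sampling and the learning-rate schedule described above can be sketched in a few lines of plain Python. The function names, parameter names, and the choice to warm up from near zero are assumptions made for illustration; the script's actual arguments may differ.

```python
import random


def sample_window_start(num_frames: int, window: int = 93, rng=random) -> int:
    """Pick a random start index so that [start, start + window) fits in the video."""
    if num_frames < window:
        raise ValueError("video is shorter than the training window")
    return rng.randrange(num_frames - window + 1)


def lr_at_step(step: int, total_steps: int, warmup_steps: int,
               base_lr: float, peak_multiplier: float, floor_lr: float) -> float:
    """Linear warmup to peak_multiplier * base_lr, then linear decay to floor_lr."""
    peak_lr = base_lr * peak_multiplier
    if step < warmup_steps:
        # Ramp up linearly; the last warmup step reaches the peak.
        return peak_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return peak_lr + (floor_lr - peak_lr) * progress


# Usage: a 200-frame clip yields start indices between 0 and 107 inclusive.
start = sample_window_start(200, rng=random.Random(0))
frame_indices = range(start, start + 93)
```

Sampling a fresh window each epoch means the model sees different temporal crops of the same clip, which matters when the dataset has only a few dozen videos.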

Switch from LoRA to DoRA by adding --use_dora to the launch command; DoRA sometimes improves training stability and final quality at minimal additional cost. For reproducibility, seed the initial latent noise via NumPy so results match across GPU architectures.
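One way to seed the noise through NumPy is to draw it on the CPU from a seeded generator and only then move it to the device, so the values do not depend on the GPU's own random number generator. The sketch below shows the idea; the function name and the latent shape are hypothetical and not taken from the script.

```python
import numpy as np


def make_initial_latents(seed: int, shape: tuple) -> np.ndarray:
    """Draw Gaussian noise on the CPU from a seeded NumPy generator."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal(shape, dtype=np.float32)


# Hypothetical latent shape: (batch, channels, frames, height, width).
latents = make_initial_latents(seed=42, shape=(1, 16, 4, 8, 8))

# Same seed and shape give identical noise on every call.
assert np.array_equal(latents, make_initial_latents(seed=42, shape=(1, 16, 4, 8, 8)))
```

With PyTorch, the array could then be wrapped with `torch.from_numpy(latents)` and moved to the target device before sampling begins.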

Inference loads the frozen base model, attaches the adapter weights, and fuses them before generation. The pipeline accepts an image (the conditioning frame), a text prompt, and an optional seed for reproducible noise. Measure both temporal and cross-view Sampson errors on your evaluation set to confirm that the fine-tuned model preserves geometric consistency in the target domain.
