Fine-tune Cosmos 2.5 for robot video on one H100 GPU with LoRA

NVIDIA publishes LoRA and DoRA fine-tuning code for Cosmos Predict 2.5

NVIDIA and Hugging Face released a complete implementation guide for parameter-efficient fine-tuning of Cosmos Predict 2.5, a 2B-parameter world model that generates physically plausible video conditioned on text prompts, images, or video clips. The guide covers both Low-Rank Adaptation (LoRA) and a variant, Weight-Decomposed Low-Rank Adaptation (DoRA), which splits each adapted weight into a magnitude component and a direction component and applies the low-rank update to the direction.

The training setup freezes the VAE, text encoder, and DiT (diffusion transformer) base weights and injects trainable adapter modules only into the DiT's attention projections and feedforward layers. Adapter files remain small and portable: the example configuration with rank=32 produces approximately 50M trainable parameters out of 2B total.
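To see why adapters stay small, it helps to count what LoRA adds per layer: a rank-r adapter on a linear layer with d_in inputs and d_out outputs contributes r × (d_in + d_out) trainable parameters. The sketch below applies that formula to one transformer block. The hidden size, feedforward width, and layer names are hypothetical placeholders, not the real Cosmos Predict 2.5 dimensions, so the printed number illustrates the scaling rather than reproducing the 50M figure.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one LoRA adapter on a d_in -> d_out linear layer.

    LoRA adds two matrices, A (rank x d_in) and B (d_out x rank);
    the base weight itself stays frozen and is not counted.
    """
    return rank * (d_in + d_out)


# Hypothetical shapes for illustration only; NOT the real Cosmos Predict 2.5 dimensions.
HIDDEN = 2048
FFN = 4 * HIDDEN
TARGET_LAYERS = {
    "attn_q": (HIDDEN, HIDDEN),
    "attn_k": (HIDDEN, HIDDEN),
    "attn_v": (HIDDEN, HIDDEN),
    "attn_out": (HIDDEN, HIDDEN),
    "ffn_up": (HIDDEN, FFN),
    "ffn_down": (FFN, HIDDEN),
}


def adapter_params_per_block(rank: int) -> int:
    """Sum the LoRA parameters over every targeted layer in one block."""
    return sum(lora_param_count(i, o, rank) for i, o in TARGET_LAYERS.values())


if __name__ == "__main__":
    per_block = adapter_params_per_block(rank=32)
    print(f"adapter parameters per transformer block: {per_block:,}")
```

Because the count is linear in rank, doubling the rank doubles the adapter size under these assumptions.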

Training on 92 robot manipulation videos with text prompts describing pick-and-place tasks takes 17 hours on a single 80GB GPU (H100) or 2.5 hours across eight H100s (company-reported). The code uses the diffusers and accelerate libraries and supports mixed-precision training in bfloat16 with float32 upcasting of LoRA parameters for numerical stability. Inference fuses the adapter weights directly into the base model, eliminating any overhead from the decomposition.
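Fusing can be illustrated with plain matrices. A LoRA layer computes W·x + s·B·(A·x); merging folds the low-rank product into the base weight once, W' = W + s·B·A, so inference uses a single matrix multiply. The sketch below is a minimal pure-Python illustration. It assumes the common scaling s = alpha / rank, and the helper names (`fuse_lora`, `apply_linear`) are made up for this example rather than taken from the guide or from diffusers.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_linear(w, x):
    """Compute w @ x for a matrix w and a vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def fuse_lora(w, a, b, alpha, rank):
    """Return W + (alpha / rank) * B @ A as a new matrix (assumed scaling convention)."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [[wij + scale * dij for wij, dij in zip(wr, dr)] for wr, dr in zip(w, delta)]


# Toy shapes: W is d_out x d_in, A is rank x d_in, B is d_out x rank.
W = [[1, 0, 2], [0, 1, -1]]
A = [[1, 2, 3]]
B = [[2], [-1]]
ALPHA, RANK = 2, 1
x = [1, 1, 1]

# Unfused path: base output plus the scaled adapter output.
base_out = apply_linear(W, x)
adapter_out = apply_linear(B, apply_linear(A, x))
unfused = [bo + (ALPHA / RANK) * ao for bo, ao in zip(base_out, adapter_out)]

# Fused path: one matrix, one multiply.
fused = apply_linear(fuse_lora(W, A, B, ALPHA, RANK), x)
```

Both paths give the same output for any input, which is why the fused model carries no adapter overhead at generation time.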

Synthetic data generation for robot learning becomes accessible

Collecting real robot trajectories for policy training is slow and expensive. A fine-tuned video world model can generate synthetic demonstrations conditioned on task descriptions and initial frames, providing a scalable alternative for downstream robot learning tasks. The bottleneck has been compute cost: full fine-tuning of a multi-billion-parameter model requires significant memory and multiple GPUs.

Parameter-efficient adapters address this by keeping the base model frozen and training only a small low-rank residual. A single H100, or even a smaller GPU, becomes viable for specializing Cosmos to specific robot morphologies, camera setups, or manipulation domains. Because each adapter is a separate file, practitioners can swap adapters to maintain multiple domain-specific models on one shared base without retraining it.

The guide also specifies evaluation metrics: Temporal Sampson Error (geometric consistency frame-to-frame) and Cross-view Sampson Error (multi-view alignment), ensuring generated video quality is measured rather than assumed.
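For readers unfamiliar with the metric, the Sampson error is the standard first-order approximation of how far a point correspondence lies from satisfying the epipolar constraint x2ᵀ·F·x1 = 0. The sketch below computes it for a single correspondence given a fundamental matrix F. How the guide matches features and estimates F between frames or views is not described here, so the function name and the toy F are assumptions for illustration only.

```python
def sampson_error(F, x1, x2):
    """Sampson (first-order geometric) error for one correspondence x1 -> x2.

    F is a 3x3 fundamental matrix as a list of rows; x1 and x2 are (u, v) pixel
    coordinates in the first and second image. Assumes a non-degenerate F so the
    denominator is non-zero.
    """
    p1 = (x1[0], x1[1], 1.0)
    p2 = (x2[0], x2[1], 1.0)
    # F @ p1 and F^T @ p2
    Fp1 = [sum(F[i][j] * p1[j] for j in range(3)) for i in range(3)]
    Ftp2 = [sum(F[i][j] * p2[i] for i in range(3)) for j in range(3)]
    # Epipolar residual p2^T F p1
    residual = sum(p2[i] * Fp1[i] for i in range(3))
    denom = Fp1[0] ** 2 + Fp1[1] ** 2 + Ftp2[0] ** 2 + Ftp2[1] ** 2
    return residual ** 2 / denom


# Toy fundamental matrix for a pure sideways camera translation:
# matched points must stay on the same image row.
F_SIDEWAYS = [[0.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]]

consistent = sampson_error(F_SIDEWAYS, (10.0, 20.0), (15.0, 20.0))    # same row
inconsistent = sampson_error(F_SIDEWAYS, (10.0, 20.0), (15.0, 21.0))  # one row off
```

Averaging this quantity over many matched points, between consecutive frames or between camera views, gives a single number where lower means more geometrically consistent video.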

Adapt Cosmos to your robot domain with off-the-shelf hardware

The provided train_cosmos_predict25_lora.py script handles data loading, loss computation via rectified flow (predicting the velocity toward clean data), and checkpointing. Start with LoRA rank=32 and experiment upward if your adapter file size budget permits; a higher rank increases expressiveness but also increases memory use and checkpoint size.
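The rectified-flow objective can be sketched in a few lines. The version below assumes one common convention: the noisy sample is a straight-line interpolation from noise (t = 0) to clean data (t = 1), and the regression target is the constant velocity clean − noise. The actual script may use a different sign or time convention, and the function names and toy latent are hypothetical stand-ins.

```python
import random


def rectified_flow_pair(clean, noise, t):
    """Return (x_t, target_velocity) for a straight path from noise (t=0) to clean (t=1)."""
    x_t = [(1.0 - t) * n + t * c for c, n in zip(clean, noise)]
    velocity = [c - n for c, n in zip(clean, noise)]  # points toward the clean data
    return x_t, velocity


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


rng = random.Random(0)
clean = [0.5, -1.0, 2.0]                       # stand-in for a flattened video latent
noise = [rng.gauss(0.0, 1.0) for _ in clean]   # Gaussian noise of the same shape
t = rng.random()                               # timestep sampled uniformly in [0, 1)

x_t, v_target = rectified_flow_pair(clean, noise, t)

# A real model would predict the velocity from (x_t, t, conditioning);
# a zero prediction stands in here so the loss can be evaluated.
loss = mse([0.0] * len(clean), v_target)
```

Under this convention, stepping from x_t along the target velocity for the remaining time 1 − t lands exactly on the clean sample, which is what the model learns to reproduce.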

Data preparation requires pairing videos with text captions and optional initial-frame images. The training loop samples random 93-frame windows from longer videos, enabling temporal augmentation. A learning rate scheduler warms up linearly, peaks at a configurable multiple of the base learning rate, then decays linearly to a floor.
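A minimal sketch of the two mechanisms described above, random window sampling and a warmup-then-linear-decay schedule, might look like the following. The function names, argument names, and default multipliers are assumptions made for this example; the script's real options may differ.

```python
import random


def sample_window(num_frames, window=93, rng=random):
    """Pick a random contiguous window of `window` frames; returns (start, end) with end exclusive."""
    if num_frames < window:
        raise ValueError("video is shorter than the training window")
    start = rng.randint(0, num_frames - window)  # randint is inclusive on both ends
    return start, start + window


def lr_at(step, base_lr, warmup_steps, total_steps, peak_mult=1.0, floor_mult=0.1):
    """Linear warmup to base_lr * peak_mult, then linear decay to base_lr * floor_mult."""
    peak = base_lr * peak_mult
    floor = base_lr * floor_mult
    if step < warmup_steps:
        return peak * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return peak + (floor - peak) * progress


if __name__ == "__main__":
    start, end = sample_window(num_frames=300, rng=random.Random(0))
    print(f"training on frames [{start}, {end})")
    for step in (0, 99, 100, 550, 1000):
        print(step, lr_at(step, base_lr=1e-4, warmup_steps=100, total_steps=1000, peak_mult=2.0))
```

Sampling a fresh window each time a video is drawn means one long clip yields many distinct training examples across epochs.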

Switch from LoRA to DoRA by adding --use_dora to the launch command; DoRA sometimes improves training stability and final quality at minimal additional cost. For reproducibility, seed the initial latent noise via NumPy so results match across GPU architectures.
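What DoRA changes can be shown on a toy weight matrix. Instead of adding the low-rank update directly, DoRA normalizes the updated weight and rescales it with a separately learned magnitude vector: W' = m · (W0 + ΔW) / ‖W0 + ΔW‖. The sketch below assumes an (out × in) weight layout with one norm per output row; this is an illustrative convention, not necessarily the one used in the guide's code, and the helper names are hypothetical.

```python
import math


def row_norms(w):
    """L2 norm of each row of a matrix given as a list of rows."""
    return [math.sqrt(sum(v * v for v in row)) for row in w]


def dora_weight(w0, delta, magnitude):
    """Return m * (W0 + delta) / ||W0 + delta||, normalizing each output row separately.

    w0:        frozen base weight (list of rows)
    delta:     low-rank update B @ A, already multiplied out
    magnitude: trainable per-row scale, initialized to the row norms of w0
    """
    merged = [[a + d for a, d in zip(wr, dr)] for wr, dr in zip(w0, delta)]
    norms = row_norms(merged)
    return [[m * v / n for v in row] for row, m, n in zip(merged, magnitude, norms)]


W0 = [[3.0, 4.0], [0.0, 2.0]]
m_init = row_norms(W0)                      # magnitude starts at the base row norms

# At initialization the low-rank update is zero, so DoRA reproduces W0.
at_init = dora_weight(W0, [[0.0, 0.0], [0.0, 0.0]], m_init)

# A non-zero update changes each row's direction; its length stays pinned to m.
updated = dora_weight(W0, [[1.0, -1.0], [0.5, 0.0]], m_init)
```

Separating the two roles lets training adjust direction and scale independently, which is the property usually credited for DoRA's stability.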

Inference loads the frozen base model, attaches the adapter weights, and fuses them before generation. The pipeline accepts an image (the conditioning frame), a text prompt, and an optional seed for reproducible noise. Compute both temporal and cross-view Sampson errors on your evaluation set to confirm the fine-tuned model preserves geometric consistency in the target domain.
