Video models, not language models, are winning robot learning

Two competing architectures for robot policies

The robot learning field has split into two sustained development paths. The first, established by Physical Intelligence's Pi-0 and refined through NVIDIA's GR00T, starts from pretrained vision-language models (VLMs) and adapts them to generate robot actions from visual input and language instructions. The second, represented by NVIDIA's own DreamZero, Ant Group's LingBot-VA, and Sereact's Cortex 2.0, begins with pretrained video backbones or world models and learns to predict future scene states and robot actions jointly.

The taxonomy matters because the choice of backbone changes the full training and inference pipeline. VLM-based policies (vision-language-action models, or VLAs) start from massive image-text pretraining. Video-based policies (world-action models, or WAMs) start from video prediction models like Wan or NVIDIA's Cosmos, which encode how scenes change over time. Public examples now span academic projects such as Video Prediction Policy and Fast-WAM, and commercial teams at Rhoda AI, Mimic Robotics, and others (per NVIDIA's blog post).
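One way to picture the difference is in what each kind of policy returns. The sketch below is illustrative only: every name in it (`VLAOutput`, `WAMOutput`, `StubVLA`, `StubWAM`, the horizon of 4, and the 7-dimensional action) is a hypothetical placeholder rather than the API of any model named above, and the stubs return dummy values in place of real networks.

```python
from dataclasses import dataclass
from typing import List, Protocol, Sequence

Frame = List[List[float]]  # stand-in for an image; real systems use tensors
Action = List[float]       # stand-in for a motor command vector


@dataclass
class VLAOutput:
    actions: List[Action]        # a VLA emits actions only


@dataclass
class WAMOutput:
    future_frames: List[Frame]   # predicted future scene states
    actions: List[Action]        # actions predicted jointly with the frames


class VLAPolicy(Protocol):
    def act(self, frame: Frame, instruction: str) -> VLAOutput: ...


class WAMPolicy(Protocol):
    def act(self, history: Sequence[Frame], instruction: str) -> WAMOutput: ...


class StubVLA:
    """Hypothetical stand-in for a VLM-backed policy."""

    def __init__(self, horizon: int = 4, action_dim: int = 7) -> None:
        self.horizon = horizon
        self.action_dim = action_dim

    def act(self, frame: Frame, instruction: str) -> VLAOutput:
        # A real model would encode the image and text, then decode actions.
        zero = [0.0] * self.action_dim
        return VLAOutput(actions=[list(zero) for _ in range(self.horizon)])


class StubWAM:
    """Hypothetical stand-in for a video-backbone policy."""

    def __init__(self, horizon: int = 4, action_dim: int = 7) -> None:
        self.horizon = horizon
        self.action_dim = action_dim

    def act(self, history: Sequence[Frame], instruction: str) -> WAMOutput:
        # A real model would roll the scene forward; this stub repeats the
        # last observed frame as a trivial "prediction".
        last = history[-1]
        frames = [[row[:] for row in last] for _ in range(self.horizon)]
        zero = [0.0] * self.action_dim
        return WAMOutput(
            future_frames=frames,
            actions=[list(zero) for _ in range(self.horizon)],
        )
```

The point of the sketch is the second output field: a WAM has to be trained and served with a video prediction head alongside the action head, which is part of why the two paths diverge in both training and inference.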

WAMs are not new. UniPi proposed essentially this approach in 2023. What changed is adoption velocity. Over the last six months, the number of research papers citing WAM methods has grown faster than VLA papers did in the equivalent window a year earlier, according to Scholar Inbox tracking reported by NVIDIA.

The language-to-action grounding wall

VLM-based VLAs hit a specific bottleneck: the "grounding gap." A VLM can describe what it sees and reason about language, but mapping "pick up the red mug" into the exact visual percepts and motor commands to accomplish it still requires learning from robot data. That gap does not close with larger VLMs alone.

WAMs offer a different entry point. Video models already learn how scenes transform under actions and language conditioning. If that prior transfers to behavior, the remaining gap shrinks from "language plus vision into action" to "video representation into action." Smaller gaps require less robot data to close.

The cost implications are direct. Most teams can afford to take only one representation to full production scale. The choice locks in architectural constraints for the next 2–3 years of dataset collection and fine-tuning. This is not a rebranding; it is a structural fork in the field.

NVIDIA's own work on both VLA models (GR00T) and video-backbone models (DreamZero, Cosmos Policy) signals that the company is hedging, not predicting a single winner. University labs and smaller robotics firms, which cannot fund both, will have to choose.

Decide your backbone before you scale data collection

If your team is building a robot foundation model or a generalist manipulation policy, the first decision is which backbone to commit to: VLM or video. Both have published results on standard benchmarks (CALVIN, LIBERO, RoboArena), but neither has demonstrated a decisive advantage on all tasks or across all scales.

Run a small pilot (typically 10k to 50k demonstrations) on both recipes. Measure grounding success (does the policy actually follow language instructions?) and inference latency. Video backbones may be cheaper to pretrain but more expensive to run on edge hardware. VLM backbones may require more robot data to close the grounding gap but may adapt faster to new tasks.
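A pilot like this can be scored with a very small harness. The following is a minimal sketch, assuming each policy can be wrapped as a callable that takes an observation and an instruction and returns actions; `Episode`, `PilotResult`, `run_pilot`, and the `is_success` checker are hypothetical names introduced for illustration, not part of any benchmark's tooling.

```python
import time
from dataclasses import dataclass
from statistics import median
from typing import Any, Callable, List, Sequence


@dataclass
class Episode:
    instruction: str   # language command given to the policy
    observation: Any   # whatever the policy consumes (frames, state, ...)
    expected: Any      # task-specific ground truth used by the checker


@dataclass
class PilotResult:
    name: str
    grounding_success: float   # fraction of episodes judged successful
    median_latency_ms: float   # median wall-clock time per policy call


def run_pilot(
    name: str,
    policy: Callable[[Any, str], Any],
    episodes: Sequence[Episode],
    is_success: Callable[[Episode, Any], bool],
) -> PilotResult:
    """Score one policy on grounding success and per-call latency."""
    if not episodes:
        raise ValueError("need at least one episode")
    latencies_ms: List[float] = []
    successes = 0
    for ep in episodes:
        start = time.perf_counter()
        actions = policy(ep.observation, ep.instruction)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
        if is_success(ep, actions):
            successes += 1
    return PilotResult(name, successes / len(episodes), median(latencies_ms))


# Toy usage: one policy follows the instruction, the other ignores it.
episodes = [
    Episode("left", None, expected=[-1.0]),
    Episode("right", None, expected=[1.0]),
]


def follows_language(obs: Any, instruction: str) -> List[float]:
    return [-1.0] if instruction == "left" else [1.0]


def ignores_language(obs: Any, instruction: str) -> List[float]:
    return [1.0]


def matches_expected(ep: Episode, actions: Any) -> bool:
    return actions == ep.expected


result_a = run_pilot("follows", follows_language, episodes, matches_expected)
result_b = run_pilot("ignores", ignores_language, episodes, matches_expected)
```

In a real pilot the success check would come from the simulator or a human rater, and latency should be measured on the target edge hardware rather than a development machine, since that is where the two backbones are expected to differ.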

Do not assume that the eventual winner is a pure VLA or pure WAM. NVIDIA's writing suggests hybrids that combine video-based action prediction with language grounding are plausible, but such models do not yet dominate the published literature. Wait for a hybrid to show clear cost or performance gains before committing to one. Meanwhile, the engineering question you can solve today is which backbone reduces your total annotation burden for your specific task distribution.
