Analysis · June 15, 2026 · 3 min read

Video models, not language models, are winning robot learning

NVIDIA and other labs are shifting from robot policies built on vision-language models to world-action models built on video backbones. The shift exposes a fundamental gap in how robots learn to follow instructions.

Our Take

WAMs are a real alternative architecture, not a paradigm shift: the field is splitting into two sustained bets, and hybrids will likely win.

Why it matters

Robot foundation models represent years of compute investment and affect the trajectory of physical AI systems. The choice between VLM and video backbones now determines which teams' stacks survive the next 18 months.

Do this week

Robotics teams: audit whether your policy backbone is language-to-action or video-to-action; if you haven't committed, run a small pilot on both recipes before scaling to production data.

Two competing architectures for robot policies

The robot learning field has split into two sustained development paths. The first, established by Physical Intelligence's Pi-0 and refined in NVIDIA's GR00T, starts from pretrained vision-language models (VLMs) and adapts them to generate robot actions from visual input and language instructions. The second, represented by models such as NVIDIA's own DreamZero, Ant Group's LingBot-VA, and Sereact's Cortex 2.0, begins with pretrained video backbones or world models and learns to jointly predict future scene states and robot actions.

The taxonomy matters because it changes the full training and inference pipeline. VLM-based policies (called VLAs, for vision-language-action models) start from massive image-text pretraining. Video-based policies (called WAMs, for world-action models) start from video prediction models like Wan or NVIDIA's Cosmos, which encode how scenes change over time. Public examples now span academic projects such as Video Prediction Policy and Fast-WAM, and commercial teams at Rhoda AI, Mimic Robotics, and others (per NVIDIA's blog post).
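The difference between the two recipes is easiest to see in what each policy takes in and returns. The sketch below is illustrative only: the class names, the `act` and `rollout` methods, the list-based `Frame` type, and the 7-dimensional action vector are all assumptions made for this example, not the interface of any model named above.

```python
from dataclasses import dataclass
from typing import List, Protocol, Sequence

Frame = List[List[float]]   # stand-in for an image; real systems use tensors
Action = List[float]        # stand-in for a motor command vector


class VLAPolicy(Protocol):
    """VLM-backed policy: current image + instruction -> action sequence."""

    def act(self, image: Frame, instruction: str) -> List[Action]: ...


@dataclass
class WAMOutput:
    future_frames: List[Frame]   # predicted future scene states
    actions: List[Action]        # actions predicted jointly with the frames


class WAMPolicy(Protocol):
    """Video-backed policy: recent frames + instruction -> futures and actions."""

    def rollout(self, frames: Sequence[Frame], instruction: str) -> WAMOutput: ...


class DummyWAM:
    """Toy stand-in (hypothetical) that only shows the shape of a WAM output."""

    def rollout(self, frames: Sequence[Frame], instruction: str,
                horizon: int = 4) -> WAMOutput:
        last = frames[-1]
        return WAMOutput(
            future_frames=[last for _ in range(horizon)],  # repeats the last frame
            actions=[[0.0] * 7 for _ in range(horizon)],   # assumed 7-dim zero actions
        )


out = DummyWAM().rollout([[[0.0]]], "pick up the red mug")
```

The point of the sketch is the return type: a VLA returns only actions, while a WAM returns predicted future frames alongside them, which is why the two recipes diverge in both training targets and inference cost.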

WAMs are not new. UniPi proposed essentially this approach in 2023. What changed is adoption velocity. Research papers citing WAM methods have grown faster in the last six months than VLA papers did in the equivalent window a year ago, according to Scholar Inbox tracking cited by NVIDIA.

The language-to-action grounding wall

VLM-based VLAs hit a specific bottleneck: the "grounding gap." A VLM can describe what it sees and reason about language, but mapping "pick up the red mug" into the exact visual percepts and motor commands to accomplish it still requires learning from robot data. That gap does not close with larger VLMs alone.

WAMs offer a different entry point. Video models already learn how scenes transform under actions and language conditioning. If that prior transfers to behavior, the remaining gap shrinks from "language plus vision into action" to "video representation into action." Smaller gaps require less robot data to close.

The cost implications are direct. Most teams can afford to build and scale only one representation at full production scale. The choice locks in architectural constraints for the next 2–3 years of dataset collection and fine-tuning. This is not a rebranding; it is a structural fork in the field.

NVIDIA's own release of both VLM-based models (GR00T) and video-backbone models (DreamZero, Cosmos Policy) signals that the company is hedging, not predicting a single winner. University labs and smaller robotics firms will have to choose.

Decide your backbone before you scale data collection

If your team is building a robot foundation model or a generalist manipulation policy, the first decision is which backbone to commit to: VLM or video. Both have published results on standard benchmarks (CALVIN, LIBERO, RoboArena), but neither has demonstrated a decisive advantage on all tasks or across all scales.

Run a small pilot, typically 10k to 50k demonstrations, on both recipes. Measure grounding success (does the policy actually follow language instructions?) and inference latency. Video backbones may be cheaper to pretrain but more expensive to run on edge hardware. VLM backbones may require more robot data to close the grounding gap but may adapt faster to new tasks.
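A pilot comparison of this kind can be scored with a very small harness. The following is a minimal sketch under stated assumptions: `run_pilot`, `PilotResult`, the string-based episodes, and the two stub policies are hypothetical placeholders, and a real pilot would call trained policies and judge success in simulation or on hardware.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple

Episode = Tuple[str, str]            # (instruction, observation), both simplified to strings
Policy = Callable[[str, str], str]   # returns an action label for illustration


@dataclass
class PilotResult:
    name: str
    grounding_success: float   # fraction of episodes judged successful
    mean_latency_ms: float     # mean wall-clock time per policy call


def run_pilot(name: str, policy: Policy, episodes: List[Episode],
              is_success: Callable[[str, str], bool]) -> PilotResult:
    """Run one policy over all episodes and record success rate and latency."""
    successes: List[bool] = []
    latencies: List[float] = []
    for instruction, observation in episodes:
        start = time.perf_counter()
        action = policy(instruction, observation)
        latencies.append((time.perf_counter() - start) * 1000.0)
        successes.append(is_success(instruction, action))
    return PilotResult(
        name=name,
        grounding_success=sum(successes) / len(successes),
        mean_latency_ms=sum(latencies) / len(latencies),
    )


# Hypothetical stubs: one follows the instruction, one ignores it.
episodes = [("pick red mug", "scene"), ("open drawer", "scene")]
follows = lambda instruction, observation: instruction
ignores = lambda instruction, observation: "pick red mug"
judge = lambda instruction, action: action == instruction

result_a = run_pilot("recipe_a", follows, episodes, judge)
result_b = run_pilot("recipe_b", ignores, episodes, judge)
```

Running both recipes against the same episodes and the same success judge keeps the comparison fair. Latency should be measured on the target edge hardware, since a workstation number says little about deployment cost.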

Do not assume that the eventual winner is a pure VLA or pure WAM. NVIDIA's writing suggests hybrids that combine video-based action prediction with language grounding are plausible, but such models do not yet dominate the published literature. Wait for one to show clear cost or performance gains before committing. Meanwhile, the engineering question you can solve today is which backbone reduces your total annotation burden for your specific task distribution.
