Analysis · June 18, 2026 · 3 min read

Allen AI releases MolmoMotion to predict 3D object paths from video

MolmoMotion forecasts where objects will move in 3D space given a video frame, query points, and an action description. The model outperforms existing methods on a new 2.7K-clip benchmark and raises simulated pick-and-place success by about 20 percentage points.

Our Take

MolmoMotion solves a real prediction problem (where will this object move next?) with a pragmatic representation (sparse 3D point trajectories) and shows measurable wins in two downstream domains, robotics and video generation. The gains are incremental rather than foundational.

Why it matters

Motion forecasting is a prerequisite for embodied AI: robots need to anticipate object trajectories before grasping them, and video generators need physically plausible continuity. This model is the first to combine language-guided prediction with a general 3D representation that works across rigid, articulated, and deformable objects without per-category templates.

Do this week

Robotics engineers: benchmark MolmoMotion against your existing trajectory planners on real DROID-style tasks before committing to retraining pipelines. The 20-point success gain was measured in simulation, and the 2K-step convergence result measures trajectory L2 error after DROID fine-tuning, so both need validation on your hardware and object distribution.

Allen AI releases 3D motion forecaster trained on 1.16M videos

Allen AI has released MolmoMotion, a language-guided 3D motion prediction model that takes an RGB video frame, a set of query points marked on an object, and a natural-language action description (e.g., "Move and rotate the wooden bowl") and predicts where those points will move over the next few seconds in 3D world space.
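The inputs and outputs described above can be pictured as two small records. This is only an illustrative sketch: the class names, field names, and file name are hypothetical and are not taken from the MolmoMotion release.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point2D = Tuple[float, float]          # pixel (u, v) in the input frame
Point3D = Tuple[float, float, float]   # (x, y, z) in world space


@dataclass
class MotionQuery:
    """Hypothetical container for the three inputs described in the article."""
    frame_path: str               # RGB video frame
    query_points: List[Point2D]   # points marked on the object
    instruction: str              # natural-language action description


@dataclass
class MotionForecast:
    """Hypothetical container: one 3D position per query point per future step."""
    trajectories: List[List[Point3D]]  # indexed [step][point]

    def is_consistent_with(self, query: MotionQuery) -> bool:
        # Every future step must carry exactly one 3D position per query point.
        n = len(query.query_points)
        return all(len(step) == n for step in self.trajectories)


query = MotionQuery(
    frame_path="frame_000.png",  # placeholder file name
    query_points=[(120.0, 88.0), (134.0, 91.0)],
    instruction="Move and rotate the wooden bowl",
)
forecast = MotionForecast(trajectories=[
    [(0.10, 0.00, 0.50), (0.12, 0.00, 0.50)],
    [(0.11, 0.01, 0.50), (0.13, 0.01, 0.50)],
])
```

Indexing the forecast as step-then-point keeps each future time step together, which is convenient when a controller consumes one step at a time.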

The model comes in two variants. MolmoMotion-AR predicts coordinates step by step as text, which favors smooth trajectories when the motion is well-defined. MolmoMotion-FM uses flow matching to predict in continuous 3D space, which better handles instructions that admit multiple plausible futures.
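To give a feel for why a flow-matching sampler can represent more than one future, here is a toy sketch. Flow-matching models generate a sample by integrating a learned velocity field from noise at t=0 to data at t=1. The `toy_velocity` function below is a hand-written stand-in for that learned network (an assumption for illustration, not MolmoMotion-FM's architecture), and the two endpoints are made up.

```python
from typing import Callable, List

Vec = List[float]


def euler_sample(velocity: Callable[[Vec, float], Vec], x0: Vec, steps: int = 10) -> Vec:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays strictly below 1 inside the loop
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x: Vec, t: float) -> Vec:
    """Stand-in for a learned network: straight-line flow toward one of two futures.

    The sign of the first coordinate picks the endpoint, so different
    starting noise can land on different plausible outcomes.
    """
    target = [1.0, 0.0, 0.5] if x[0] >= 0 else [-1.0, 0.0, 0.5]
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]


# Two fixed "noise" draws that end at two different endpoints.
future_a = euler_sample(toy_velocity, [0.3, 0.2, -0.1])
future_b = euler_sample(toy_velocity, [-0.7, 0.4, 0.9])
```

In a real model the velocity would also be conditioned on the image and the instruction. The point of the toy is only that the starting noise, not the instruction, selects which plausible future is produced.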

The system uses Molmo 2, Allen AI's vision-language backbone, to ground language instructions to objects and points in the image, then decodes future 3D trajectories. The representation itself—sparse object-attached 3D points in world space—is chosen to be class-agnostic (works across human hands, rigid objects, deformable materials without templates), view-stable (camera motion doesn't break the representation), and directly usable by downstream systems like robot controllers and video generators.

To train MolmoMotion, Allen AI built an automatic annotation pipeline to extract 3D point trajectories from unconstrained internet video. The pipeline tracks dense 2D points on objects, lifts them into metric 3D space, filters out noisy trajectories that don't move coherently with the object, and segments clips to the windows where motion actually occurs. The resulting dataset, MolmoMotion-1M, contains 1.16M videos with 3D action-described point trajectories, spanning 736 motion types and 5.6K distinct objects (company-reported).
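The filtering and segmentation steps can be sketched in a few lines. The article does not describe Allen AI's actual criteria, so everything here is an assumption: the function names, the thresholds, and the simplification that coherent points share roughly the same net displacement (which ignores rotation).

```python
from statistics import median
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Track = List[Point3D]  # one point's 3D position per frame


def _dist(a, b) -> float:
    # Euclidean distance between two 3D tuples.
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5


def filter_incoherent(tracks: List[Track], tol: float = 0.05) -> List[Track]:
    """Keep tracks whose net displacement is within tol of the median displacement."""
    disp = [tuple(e - s for s, e in zip(t[0], t[-1])) for t in tracks]
    med = tuple(median(d[i] for d in disp) for i in range(3))
    return [t for t, d in zip(tracks, disp) if _dist(d, med) <= tol]


def motion_window(tracks: List[Track], min_step: float = 0.01) -> Tuple[int, int]:
    """Return (start, end) frame indices spanning the frames where points move."""
    moving = []
    for f in range(1, len(tracks[0])):
        mean_step = sum(_dist(t[f], t[f - 1]) for t in tracks) / len(tracks)
        if mean_step >= min_step:
            moving.append(f)
    if not moving:
        return (0, 0)
    return (moving[0] - 1, moving[-1])


track_a = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0), (0.1, 0.0, 1.0), (0.2, 0.0, 1.0), (0.2, 0.0, 1.0)]
track_b = [(0.0, 0.1, 1.0), (0.0, 0.1, 1.0), (0.1, 0.1, 1.0), (0.2, 0.1, 1.0), (0.2, 0.1, 1.0)]
noisy = [(0.5, 0.0, 1.0), (0.5, 0.0, 1.0), (0.5, 0.3, 1.0), (0.1, 0.3, 1.0), (0.9, -0.2, 1.0)]

kept = filter_incoherent([track_a, track_b, noisy])  # drops the noisy track
window = motion_window(kept)                         # frames 1 through 3
```

A production pipeline would need a rotation-aware coherence test, for example fitting a rigid transform per frame, but the median-displacement check is enough to show the idea of rejecting points that do not move with the object.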

Allen AI also released PointMotionBench, a human-validated evaluation set of 2.7K video clips across 111 object categories and 61 motion types, designed to measure 3D motion forecasting accuracy directly rather than relying on visual plausibility alone.

Motion prediction unlocks grounded robot and video planning

On PointMotionBench, MolmoMotion outperforms all tested baselines, including pixel-space video generators, parametric 3D methods, and constant-velocity heuristics (per the company's evaluation).
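For readers unfamiliar with the simplest baseline in that list, a constant-velocity heuristic just repeats a point's last observed displacement. The sketch below pairs it with a mean L2 error score. Both functions are illustrative assumptions: the article does not specify how PointMotionBench defines its baseline or its metric.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]


def constant_velocity_forecast(prev: Point3D, curr: Point3D, horizon: int) -> List[Point3D]:
    """Extrapolate a point by repeating its last observed per-step displacement."""
    v = tuple(c - p for p, c in zip(prev, curr))
    return [tuple(c + (k + 1) * vi for c, vi in zip(curr, v)) for k in range(horizon)]


def mean_l2_error(pred: List[Point3D], truth: List[Point3D]) -> float:
    """Average Euclidean distance between predicted and true positions."""
    dists = [sum((a - b) ** 2 for a, b in zip(p, t)) ** 0.5 for p, t in zip(pred, truth)]
    return sum(dists) / len(dists)


# A point moving along x is extrapolated in a straight line,
# while the true path curves away along y.
pred = constant_velocity_forecast((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), horizon=3)
truth = [(2.0, 0.0, 0.0), (3.0, 1.0, 0.0), (4.0, 2.0, 0.0)]
error = mean_l2_error(pred, truth)
```

The error grows with the horizon as the true path bends, which is the kind of case where a forecaster that reads the instruction should beat straight-line extrapolation.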

In downstream robot simulation tasks, a control policy built on MolmoMotion's trajectory predictions succeeds on 76.3% of pick-and-place tasks versus 56.0% for the same policy built on Molmo 2 alone, a gain of 20.3 percentage points. On real robot hardware (after fine-tuning on the DROID dataset), MolmoMotion reaches the same L2 error in trajectory prediction at 2K training steps that a Molmo 2 baseline requires 12K steps to achieve (company-reported).

When integrated into a video generation pipeline, MolmoMotion's predicted 3D paths steer generated video to follow action instructions more precisely than the base model. The model improves motion quality across five motion-related metrics and beats a larger image-to-video baseline on four of five measures (per company testing).

The practical benefit is clear: instead of asking a video generator to infer motion from text alone (prone to vagueness on small, precise movements), you can inject MolmoMotion's explicit 3D trajectory predictions and get more faithful motion.

Expect the model to work best on objects with coherent, predictable motion

MolmoMotion uses eight query points per object during training. That density captures useful object trajectories but falls short of densely representing surface geometry, limiting the model's ability to forecast complex deformable motion—think cloth wrinkling or clay deformation. The model also assumes action descriptions are clear and aligned with the actual motion in the video; vague or mismatched instructions will degrade predictions.

The robotics gains are strongest in simulation. Before deploying to real systems at scale, validate on your own object set, lighting conditions, and task distribution. The 2K-step convergence result comes from fine-tuning on DROID and measures trajectory L2 error rather than task success, so how well it transfers depends heavily on how closely your real-world data matches DROID's domain.

For video generation, use MolmoMotion when motion is the primary constraint and text alone is insufficient. It will not rescue bad lighting, occlusion, or semantic confusion; it is a trajectory tool, not a full video prior.

Model weights, the MolmoMotion-1M dataset, and PointMotionBench are available on Hugging Face.
