MolmoB0T: Large-Scale Simulation Enables Zero-Shot Manipulation

Abstract

A prevailing view in robot learning is that simulation alone is not enough: effective sim-to-real transfer is widely believed to require at least some real-world data collection or task-specific fine-tuning to bridge the gap between simulated and physical environments. We challenge that assumption. With sufficiently large-scale and diverse simulated training data, we show that zero-shot transfer to the real world is not only possible but effective for both static and mobile manipulation.

We introduce MolmoBot-Engine, a fully open-source pipeline for procedural data generation across robots, tasks, and diverse simulated environments in MolmoSpaces. With it, we release MolmoBot-Data, a dataset of 1.7 million expert trajectories for articulated object manipulation and pick-and-place tasks. We train three policy classes: MolmoBot (a Molmo2-based VLM with a flow-matching action head), MolmoBot-Pi0 (replicating the π₀ architecture for controlled comparison), and MolmoBot-SPOC (a lightweight policy for edge deployment). Without any real-world fine-tuning, our policies transfer zero-shot to unseen objects and environments, reaching a 79.2% success rate on real-world tabletop pick-and-place versus 39.2% for π₀.₅.

Qualitative Demos

Video Demos

MolmoBot policies are highly robust to camera pose variation. MolmoBot-Engine randomizes camera poses aggressively during data generation, so the resulting policies must learn to handle a wide range of viewpoints. They can even handle adversarial camera movement at test time!

MolmoBot policies generalize robustly to variation in the environment — including table height, object placement, and scene clutter. This robustness extends to gracefully handling adversarial perturbations applied at test time, such as changing the table height mid-episode.

Owing to the high degree of diversity and randomization in MolmoBot-Data, MolmoBot policies are robust to disturbances and highly steerable. SOTA VLAs are often attracted to nearby objects or receptacles, ignoring the task prompt, but MolmoBot policies strongly attend to the task instruction.

MolmoBot can open doors zero-shot on a mobile bimanual platform, demonstrating a high degree of simultaneous coordination between mobility and constrained manipulation.

MolmoBot can coordinate many degrees of freedom simultaneously to achieve robust manipulation.

System Overview

Zero-Shot Sim-to-Real Transfer

Teaser figure: MolmoBot leverages diverse simulation data to achieve zero-shot sim-to-real transfer on multiple robotic tasks such as pick-and-place and door opening. This unlocks the ability to dramatically scale up training data for generalist robotic foundation models.

MolmoBot-Engine

Generating Data at Scale

MolmoBot-Engine is an open-source procedural data generation pipeline built on MolmoSpaces, a photorealistic simulation platform with 94,000+ indoor environments. It automatically generates expert trajectories via task-and-motion planning with aggressive randomization of objects, lighting, and camera poses — producing MolmoBot-Data, 1.7M demonstrations totaling 5,700+ hours of robot experience across 8 task types.
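The dataset totals quoted above imply an average demonstration length, which the short calculation below works out. It is a back-of-the-envelope sketch only: 1.7M is a rounded count, the 5,704-hour total is the figure listed on this page, and the function name is illustrative.

```python
# Figures quoted on this page; 1.7M is rounded, so the result is approximate.
NUM_DEMONSTRATIONS = 1_700_000
TOTAL_HOURS = 5_704


def mean_episode_seconds(total_hours: float, num_episodes: int) -> float:
    """Average demonstration length in seconds."""
    return total_hours * 3600 / num_episodes


avg_seconds = mean_episode_seconds(TOTAL_HOURS, NUM_DEMONSTRATIONS)
print(f"about {avg_seconds:.0f} s per demonstration")
```

Under these figures, a typical demonstration lasts roughly 12 seconds.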

1.7M

Expert Demonstrations

94k+

Unique Environments

11k+

Unique Objects

9k+

Unique Receptacles

8

Task Types

5,704 h

Total Data

MolmoBot-Engine procedurally generates diverse pick-and-place demonstrations across thousands of simulated environments. Randomization of object placement, lighting, and camera pose drives the sim-to-real transfer capability of the resulting policies.
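To make the idea of per-episode randomization concrete, here is a minimal sketch of sampling a camera pose, a lighting level, and an object placement for each generated scene. It is not MolmoBot-Engine's API: `SceneSample`, `sample_scene`, and every numeric range are hypothetical and chosen only for illustration.

```python
import math
import random
from dataclasses import dataclass
from typing import Tuple


@dataclass
class SceneSample:
    """Hypothetical container for one randomized scene configuration."""
    camera_position: Tuple[float, float, float]  # metres, workspace-centred frame
    camera_look_at: Tuple[float, float, float]   # metres, point the camera aims at
    light_intensity: float                       # arbitrary units
    object_xy: Tuple[float, float]               # metres on the table surface


def sample_scene(rng: random.Random) -> SceneSample:
    # Place the camera on a spherical shell above the workspace centre.
    radius = rng.uniform(0.6, 1.4)
    azimuth = rng.uniform(-math.pi, math.pi)
    elevation = rng.uniform(math.radians(15), math.radians(75))
    camera_position = (
        radius * math.cos(elevation) * math.cos(azimuth),
        radius * math.cos(elevation) * math.sin(azimuth),
        radius * math.sin(elevation),
    )
    # Jitter the look-at point so the workspace is not always centred in view.
    camera_look_at = (
        rng.uniform(-0.1, 0.1),
        rng.uniform(-0.1, 0.1),
        rng.uniform(-0.1, 0.1),
    )
    return SceneSample(
        camera_position=camera_position,
        camera_look_at=camera_look_at,
        light_intensity=rng.uniform(0.3, 2.0),
        object_xy=(rng.uniform(-0.3, 0.3), rng.uniform(-0.3, 0.3)),
    )


rng = random.Random(0)  # fixed seed so the sampled scenes are reproducible
scenes = [sample_scene(rng) for _ in range(1000)]
print(scenes[0])
```

Because each factor is drawn independently for every episode, a policy trained on such data never sees a fixed viewpoint or layout that it could overfit to.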

Key Result

DROID Real-World Performance

We evaluate our policies zero-shot on a real DROID robot across four distinct environments — Workroom, Kitchen, Bedroom, and Office — covering 40 pick-and-place tasks with varied objects and receptacles (120 episodes total). No real-world data or task-specific fine-tuning is used. MolmoBot (F=2) achieves 79.2% overall success, more than doubling the performance of π₀.₅-DROID (39.2%), a strong baseline trained on large-scale real-world demonstrations.

MolmoBot policies exhibit strong zero-shot sim-to-real performance across real-world DROID evaluations, outperforming SOTA policies trained on large-scale real-world demonstrations. Bar heights reflect mean success rate; error bars are 95% confidence intervals via stratified bootstrapping.
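As an illustration of how such error bars can be obtained, the sketch below computes a percentile bootstrap interval that resamples episodes within each stratum. The choice of stratum is an assumption: the page does not say what the bootstrap is stratified by, so the code stratifies by environment, using MolmoBot's per-environment success counts from the per-task tables on this page. The function name and resample count are also illustrative.

```python
import random

# MolmoBot successes per environment, summed from the per-task tables
# (30 episodes each: 10 tasks x 3 trials). Stratifying by environment
# is an assumption made for this sketch.
SUCCESSES_PER_ENV = [27, 21, 26, 21]
EPISODES_PER_ENV = 30


def stratified_bootstrap_ci(strata, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile CI for the pooled success rate, resampling within each stratum."""
    rng = random.Random(seed)
    total = sum(len(s) for s in strata)
    means = []
    for _ in range(n_resamples):
        # Resample every stratum independently, keeping its size fixed.
        hits = sum(sum(rng.choices(s, k=len(s))) for s in strata)
        means.append(hits / total)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Expand the counts into binary per-episode outcomes (1 = success).
strata = [[1] * k + [0] * (EPISODES_PER_ENV - k) for k in SUCCESSES_PER_ENV]
point = sum(map(sum, strata)) / sum(map(len, strata))
lo, hi = stratified_bootstrap_ci(strata)
print(f"success = {point:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

Resampling within strata keeps every environment equally represented in each replicate, so the interval reflects episode-level noise rather than chance imbalance between environments.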

Per-Task Breakdown

DROID Real-World Results by Environment

For each policy, we conduct 120 real-world evaluations on 40 tasks in 4 environments across 2 institutions, using 3 physical robots. Each cell shows successes out of 3 trials.

Policy	Spoon Tray	Spoon Box	Tape Tray	Tape Box	Blue Mug Tray	Blue Mug Box	Copper Mug Tray	Copper Mug Box	Timer Tray	Timer Box	Avg
π₀	0/3	0/3	1/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	3%
π₀.₅	2/3	2/3	2/3	0/3	1/3	1/3	0/3	0/3	0/3	0/3	27%
MolmoBot-Pi0	1/3	0/3	3/3	3/3	3/3	3/3	2/3	1/3	0/3	2/3	60%
MolmoBot-Img	3/3	3/3	3/3	3/3	3/3	2/3	2/3	0/3	3/3	1/3	77%
MolmoBot	3/3	2/3	3/3	3/3	3/3	3/3	3/3	1/3	3/3	3/3	90%

Policy	Apple Easy	Apple Hard	Mug Easy	Mug Hard	Banana Easy	Banana Hard	Mouse Easy	Mouse Hard	Clutter Brown	Clutter Black	Avg
π₀	0/3	1/3	0/3	0/3	1/3	0/3	3/3	1/3	0/3	0/3	20%
π₀.₅	3/3	0/3	2/3	1/3	3/3	2/3	3/3	1/3	2/3	2/3	63%
MolmoBot-Pi0	2/3	3/3	2/3	0/3	3/3	1/3	3/3	2/3	0/3	0/3	53%
MolmoBot-Img	3/3	3/3	2/3	3/3	3/3	3/3	3/3	2/3	3/3	1/3	87%
MolmoBot	1/3	2/3	3/3	3/3	3/3	3/3	3/3	3/3	0/3	0/3	70%

Policy	Pills Towel	Pills Basket	Roller Towel	Roller Basket	Banana Towel	Banana Basket	Ball Towel	Ball Basket	Clutter Towel	Clutter Basket	Avg
π₀	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0/3	0%
π₀.₅	0/3	0/3	0/3	0/3	1/3	0/3	0/3	0/3	2/3	0/3	10%
MolmoBot-Pi0	2/3	0/3	2/3	0/3	0/3	0/3	1/3	0/3	0/3	2/3	23%
MolmoBot-Img	0/3	0/3	1/3	2/3	3/3	3/3	3/3	2/3	3/3	3/3	67%
MolmoBot	3/3	3/3	3/3	0/3	3/3	2/3	3/3	3/3	3/3	3/3	87%

Policy	Knife Board	Banana Plate	Marker Mug	Scissors Bowl	Carrot Basket	Knife Green Bowl	Screwdriver Blue Bowl	Mouse Blue Bowl	Mug Bowl	Marker Box	Avg
π₀	0/3	0/3	0/3	0/3	1/3	1/3	1/3	1/3	0/3	0/3	13%
π₀.₅	2/3	3/3	1/3	1/3	1/3	1/3	2/3	1/3	3/3	2/3	57%
MolmoBot-Pi0	1/3	3/3	0/3	0/3	0/3	1/3	3/3	2/3	3/3	2/3	50%
MolmoBot-Img	2/3	1/3	0/3	1/3	3/3	1/3	3/3	2/3	3/3	2/3	60%
MolmoBot	2/3	3/3	1/3	2/3	2/3	2/3	3/3	2/3	3/3	1/3	70%

Rows whose policy name begins with MolmoBot are our models.
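The headline success rates can be recovered from the per-task cells. The sketch below tallies the `successes/trials` cells for two of the policies, copied from the four tables above, and pools them over all 120 episodes; the helper names are illustrative.

```python
# Per-task cells copied from the four tables above, one string per table.
ROWS = {
    "MolmoBot": [
        "3/3 2/3 3/3 3/3 3/3 3/3 3/3 1/3 3/3 3/3",
        "1/3 2/3 3/3 3/3 3/3 3/3 3/3 3/3 0/3 0/3",
        "3/3 3/3 3/3 0/3 3/3 2/3 3/3 3/3 3/3 3/3",
        "2/3 3/3 1/3 2/3 2/3 2/3 3/3 2/3 3/3 1/3",
    ],
    "π₀.₅": [
        "2/3 2/3 2/3 0/3 1/3 1/3 0/3 0/3 0/3 0/3",
        "3/3 0/3 2/3 1/3 3/3 2/3 3/3 1/3 2/3 2/3",
        "0/3 0/3 0/3 0/3 1/3 0/3 0/3 0/3 2/3 0/3",
        "2/3 3/3 1/3 1/3 1/3 1/3 2/3 1/3 3/3 2/3",
    ],
}


def tally(row):
    """Sum successes and trials over the cells of one table row."""
    successes = trials = 0
    for cell in row.split():
        s, t = cell.split("/")
        successes += int(s)
        trials += int(t)
    return successes, trials


def overall_rate(rows):
    """Pooled success rate over every episode in the given rows."""
    pairs = [tally(r) for r in rows]
    return sum(s for s, _ in pairs) / sum(t for _, t in pairs)


for policy, rows in ROWS.items():
    per_table = [f"{100 * s / t:.0f}%" for s, t in map(tally, rows)]
    print(policy, per_table, f"overall {100 * overall_rate(rows):.1f}%")
# MolmoBot ['90%', '70%', '87%', '70%'] overall 79.2%
# π₀.₅ ['27%', '63%', '10%', '57%'] overall 39.2%
```

The per-table figures match the Avg column, and the pooled rates match the 79.2% and 39.2% reported in the key result.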

Simulation Evaluation

DROID Simulation Results

Evaluation on held-out simulation environments. Success rates (%) are computed over 200 episodes per task (1,000 for Pick MSProc). All models are evaluated zero-shot, without any task-specific fine-tuning.

Model	Pick MSProc	Pick Classic	Pick	Pick Rand-Cam	Pick&Place	PnP Next-To	PnP Color	Avg.
π₀.₅	18.1	6.4	7.0	8.0	11.7	8.2	10.4	10.0
π₀.₅-Finetune	48.0	28.3	25.8	29.7	43.5	28.4	48.3	36.0
StereoVLA	6.6	4.3	1.1	N/A	0	N/A	0	—
LAP-VLA	19.4	2.4	3.1	2.7	3.8	6.5	3.1	4.8
X-VLA	3.3	0.5	0.7	0.8	0.1	1.9	0.9	1.2

MolmoBot-Pi0	66.2	35.7	33.3	39.8	44.7	24.7	46.2	41.5
MolmoBot-Img	92.2	63.5	61.4	62.1	63.0	21.0	67.8	61.6
MolmoBot (F=2)	93.5	66.8	64.0	63.7	66.4	26.4	67.8	64.1
MolmoBot (F=3)	91.3	63.8	59.0	62.7	65.4	28.3	66.1	62.4

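Rows whose model name begins with MolmoBot are our models. MolmoBot variants outperform every baseline on average, with MolmoBot (F=2) achieving 64.1% versus 36.0% for the strongest baseline (π₀.₅-Finetune). MolmoBot (F=2) has the best result in every column except PnP Next-To, where π₀.₅-Finetune (28.4) narrowly leads MolmoBot (F=3) at 28.3; on PnP Color it ties MolmoBot-Img at 67.8.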
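The per-column leaders and the Avg. column can be recomputed directly from the table. The sketch below does so for a subset of rows, treating the average as an unweighted mean over the seven task columns; that reading is an assumption, though it reproduces the tabulated values for these rows. The helper names are illustrative.

```python
TASKS = ["Pick MSProc", "Pick Classic", "Pick", "Pick Rand-Cam",
         "Pick&Place", "PnP Next-To", "PnP Color"]

# Success rates (%) copied from the table above (subset of rows).
RESULTS = {
    "π₀.₅":           [18.1, 6.4, 7.0, 8.0, 11.7, 8.2, 10.4],
    "π₀.₅-Finetune":  [48.0, 28.3, 25.8, 29.7, 43.5, 28.4, 48.3],
    "MolmoBot-Pi0":   [66.2, 35.7, 33.3, 39.8, 44.7, 24.7, 46.2],
    "MolmoBot-Img":   [92.2, 63.5, 61.4, 62.1, 63.0, 21.0, 67.8],
    "MolmoBot (F=2)": [93.5, 66.8, 64.0, 63.7, 66.4, 26.4, 67.8],
    "MolmoBot (F=3)": [91.3, 63.8, 59.0, 62.7, 65.4, 28.3, 66.1],
}


def best_per_column(results, tasks):
    """Map each task to (top score, list of models achieving it, ties included)."""
    best = {}
    for i, task in enumerate(tasks):
        top = max(scores[i] for scores in results.values())
        best[task] = (top, [m for m, s in results.items() if s[i] == top])
    return best


def macro_average(scores):
    """Unweighted mean across task columns."""
    return sum(scores) / len(scores)


best = best_per_column(RESULTS, TASKS)
for task, (top, models) in best.items():
    print(f"{task:14s} {top:5.1f}  {', '.join(models)}")
for model, scores in RESULTS.items():
    print(f"{model:15s} avg {macro_average(scores):.1f}")
```

Reporting ties explicitly matters here, since two models share the top PnP Color score.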

Mobile Manipulation

RB-Y1 Simulation Results

Zero-shot simulation evaluation for RB-Y1 policies on held-out environments.

Model	Pick	Pick & Place	Open	Door Open
MolmoBot Multitask	44.8%	22.5%	25.2%	70.2%
MolmoBot Door Specialist	—	—	—	77.7%
MolmoBot-SPOC Rigid	10.5%	1.8%	—	—
MolmoBot-SPOC Articulated	—	—	21.8%	58.8%

MolmoBot Multitask outperforms MolmoBot-SPOC across all shared tasks. The MolmoBot Door Specialist achieves 77.7% zero-shot door-opening success in simulation.