MolmoBot: Large-Scale Simulation Enables Zero-Shot Manipulation

Abhay Deshpande1*, Maya Guru1*, Rose Hendrix1*, Snehal Jauhri1,4*
Ainaz Eftekhar1,2, Rohun Tripathi1, Max Argus1, Jordi Salvador1, Haoquan Fang1,2, Matthew Wallingford1, Wilbert Pumacay1, Yejin Kim1
Quinn Pfeifer2, Ying-Chun Lee2, Piper Wolters1, Omar Rayyan3, Mingtong Zhang5, Jiafei Duan1,2, Karen Farley1, Winson Han1, Eli Vanderbilt1
Dieter Fox1,2, Ali Farhadi1,2, Georgia Chalvatzaki1
Dhruv Shah5†, Ranjay Krishna1,2†
1Allen Institute for AI  ·  2University of Washington  ·  3UC Los Angeles  ·  4Technische Universität Darmstadt  ·  5Princeton University
* Equal contribution (alphabetical)  ·  Core contributors  ·  † Equal advising

Abstract

A prevailing view in robot learning is that simulation alone is not enough: effective sim-to-real transfer is widely believed to require at least some real-world data collection or task-specific fine-tuning to bridge the gap between simulated and physical environments. We challenge that assumption. With sufficiently large-scale and diverse simulated training data, we show that zero-shot transfer to the real world is not only possible but effective for both static and mobile manipulation.

We introduce MolmoBot-Engine, a fully open-source pipeline for procedural data generation across robots, tasks, and diverse simulated environments in MolmoSpaces. With it, we release MolmoBot-Data, a dataset of 1.7 million expert trajectories for articulated object manipulation and pick-and-place tasks. We train three policy classes: MolmoBot (a Molmo2-based VLM with a flow-matching action head), MolmoBot-Pi0 (replicating the π₀ architecture for controlled comparison), and MolmoBot-SPOC (a lightweight policy for edge deployment). Without any real-world fine-tuning, our policies achieve zero-shot transfer to unseen objects and environments, reaching 79.2% success rate on real-world tabletop pick-and-place, outperforming π₀.₅ at 39.2%.

Video Demos

MolmoBot policies are highly robust to camera pose variation. MolmoBot-Engine applies aggressive camera randomization, forcing policies to handle a wide diversity of camera poses. As a result, MolmoBot policies even tolerate adversarial camera movements at test time.

MolmoBot policies generalize robustly to variation in the environment — including table height, object placement, and scene clutter. This robustness extends to gracefully handling adversarial perturbations applied at test time, such as changing the table height mid-episode.

Owing to the high degree of diversity and randomization in MolmoBot-Data, MolmoBot policies are robust to disturbances and highly steerable. SOTA VLAs are often attracted to nearby objects or receptacles, ignoring the task prompt, but MolmoBot policies strongly attend to the task instruction.

MolmoBot can open doors zero-shot on a mobile bimanual platform, demonstrating a high degree of simultaneous coordination between mobility and constrained manipulation.

MolmoBot can coordinate many degrees of freedom simultaneously to achieve robust manipulation.

Zero-Shot Sim-to-Real Transfer

MolmoBot teaser figure
MolmoBot leverages diverse simulation data to achieve zero-shot sim-to-real transfer on multiple robotic tasks such as pick-and-place and door opening. This unlocks the ability to dramatically scale up training data for generalist robotic foundation models.

Generating Data at Scale

MolmoBot-Engine is an open-source procedural data generation pipeline built on MolmoSpaces, a photorealistic simulation platform with 94,000+ indoor environments. It automatically generates expert trajectories via task-and-motion planning with aggressive randomization of objects, lighting, and camera poses — producing MolmoBot-Data, 1.7M demonstrations totaling 5,700+ hours of robot experience across 8 task types.

1.7M
Expert Demonstrations
94k+
Unique Environments
11k+
Unique Objects
9k+
Unique Receptacles
8
Task Types
5,704 h
Total Data

MolmoBot-Engine procedurally generates diverse pick-and-place demonstrations across thousands of simulated environments. Randomization of object placement, lighting, and camera pose drives the sim-to-real transfer capability of the resulting policies.
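To make the randomization concrete, here is a minimal sketch of how a per-episode configuration might be sampled. The sampled fields mirror the factors named above (object, receptacle, lighting, camera pose); all numeric ranges are illustrative assumptions, not MolmoBot-Engine's actual values.

```python
import random

def sample_episode_config(rng):
    """Sample one randomized pick-and-place episode configuration.
    Ranges are illustrative placeholders, not MolmoBot-Engine's
    actual randomization bounds."""
    return {
        "object_id": rng.randrange(11_000),     # ~11k unique objects
        "receptacle_id": rng.randrange(9_000),  # ~9k unique receptacles
        "table_height_m": round(rng.uniform(0.60, 0.95), 3),
        "light_intensity": round(rng.uniform(0.3, 1.5), 2),
        "camera_pose": {
            "azimuth_deg": round(rng.uniform(-60.0, 60.0), 1),
            "elevation_deg": round(rng.uniform(10.0, 55.0), 1),
            "distance_m": round(rng.uniform(0.6, 1.4), 2),
        },
    }

rng = random.Random(0)  # fixed seed for reproducible scene generation
configs = [sample_episode_config(rng) for _ in range(3)]
```

Sampling every factor independently per episode is what prevents the policy from latching onto any single camera pose or scene layout.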

DROID Real-World Performance

We evaluate our policies zero-shot on a real DROID robot across four distinct environments — Workroom, Kitchen, Bedroom, and Office — covering 40 pick-and-place tasks with varied objects and receptacles (120 episodes total). No real-world data or task-specific fine-tuning is used. MolmoBot (F=2) achieves 79.2% overall success, more than doubling the performance of π₀.₅-DROID (39.2%), a strong baseline trained on large-scale real-world demonstrations.

MolmoBot policies exhibit strong zero-shot sim-to-real performance across real-world DROID evaluations, outperforming SOTA policies trained on large-scale real-world demonstrations. Bar heights reflect mean success rate; error bars are 95% confidence intervals via stratified bootstrapping.
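The stratified bootstrap named in the caption can be sketched as follows: episode outcomes are resampled within each environment (stratum) independently, preserving the 4-environment evaluation design. The per-environment success counts below (27, 21, 26, 21 out of 30) are reconstructed from the result tables and give the 95/120 = 79.2% overall figure; this is a sketch of the method, not the authors' exact code.

```python
import random

def stratified_bootstrap_ci(strata, n_boot=10_000, alpha=0.05, seed=0):
    """95% CI for the overall success rate, resampling episodes
    within each stratum (environment) independently."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        succ = total = 0
        for outcomes in strata:
            resampled = [rng.choice(outcomes) for _ in outcomes]
            succ += sum(resampled)
            total += len(resampled)
        means.append(succ / total)
    means.sort()
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[min(int((1 - alpha / 2) * n_boot), n_boot - 1)]
    return lo, hi

# MolmoBot's per-environment successes, reconstructed from the tables:
# 27/30, 21/30, 26/30, 21/30 -> 95/120 = 79.2% overall.
strata = [[1] * k + [0] * (30 - k) for k in (27, 21, 26, 21)]
lo, hi = stratified_bootstrap_ci(strata)
```

Stratifying keeps each environment's share of episodes fixed across resamples, so the interval reflects per-episode noise rather than environment imbalance.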

DROID Real-World Results by Environment

For each policy, we conduct 120 real-world evaluations on 40 tasks in 4 environments across 2 institutions, using 3 physical robots. Each cell shows successes out of 3 trials. Click any cell to watch the recordings for that policy & task.

| Policy | Spoon Tray | Spoon Box | Tape Tray | Tape Box | Blue Mug Tray | Blue Mug Box | Copper Mug Tray | Copper Mug Box | Timer Tray | Timer Box | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| π₀ | 0/3 | 0/3 | 1/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 3% |
| π₀.₅ | 2/3 | 2/3 | 2/3 | 0/3 | 1/3 | 1/3 | 0/3 | 0/3 | 0/3 | 0/3 | 27% |
| MolmoBot-Pi0 | 1/3 | 0/3 | 3/3 | 3/3 | 3/3 | 3/3 | 2/3 | 1/3 | 0/3 | 2/3 | 60% |
| MolmoBot-Img | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 2/3 | 2/3 | 0/3 | 3/3 | 1/3 | 77% |
| MolmoBot | 3/3 | 2/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 1/3 | 3/3 | 3/3 | 90% |
| Policy | Apple Easy | Apple Hard | Mug Easy | Mug Hard | Banana Easy | Banana Hard | Mouse Easy | Mouse Hard | Clutter Brown | Clutter Black | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| π₀ | 0/3 | 1/3 | 0/3 | 0/3 | 1/3 | 0/3 | 3/3 | 1/3 | 0/3 | 0/3 | 20% |
| π₀.₅ | 3/3 | 0/3 | 2/3 | 1/3 | 3/3 | 2/3 | 3/3 | 1/3 | 2/3 | 2/3 | 63% |
| MolmoBot-Pi0 | 2/3 | 3/3 | 2/3 | 0/3 | 3/3 | 1/3 | 3/3 | 2/3 | 0/3 | 0/3 | 53% |
| MolmoBot-Img | 3/3 | 3/3 | 2/3 | 3/3 | 3/3 | 3/3 | 3/3 | 2/3 | 3/3 | 1/3 | 87% |
| MolmoBot | 1/3 | 2/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 0/3 | 0/3 | 70% |
| Policy | Pills Towel | Pills Basket | Roller Towel | Roller Basket | Banana Towel | Banana Basket | Ball Towel | Ball Basket | Clutter Towel | Clutter Basket | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| π₀ | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0% |
| π₀.₅ | 0/3 | 0/3 | 0/3 | 0/3 | 1/3 | 0/3 | 0/3 | 0/3 | 2/3 | 0/3 | 10% |
| MolmoBot-Pi0 | 2/3 | 0/3 | 2/3 | 0/3 | 0/3 | 0/3 | 1/3 | 0/3 | 0/3 | 2/3 | 23% |
| MolmoBot-Img | 0/3 | 0/3 | 1/3 | 2/3 | 3/3 | 3/3 | 3/3 | 2/3 | 3/3 | 3/3 | 67% |
| MolmoBot | 3/3 | 3/3 | 3/3 | 0/3 | 3/3 | 2/3 | 3/3 | 3/3 | 3/3 | 3/3 | 87% |
| Policy | Knife Board | Banana Plate | Marker Mug | Scissors Bowl | Carrot Basket | Knife Green Bowl | Screwdriver Blue Bowl | Mouse Blue Bowl | Mug Bowl | Marker Box | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| π₀ | 0/3 | 0/3 | 0/3 | 0/3 | 1/3 | 1/3 | 1/3 | 1/3 | 0/3 | 0/3 | 13% |
| π₀.₅ | 2/3 | 3/3 | 1/3 | 1/3 | 1/3 | 1/3 | 2/3 | 1/3 | 3/3 | 2/3 | 57% |
| MolmoBot-Pi0 | 1/3 | 3/3 | 0/3 | 0/3 | 0/3 | 1/3 | 3/3 | 2/3 | 3/3 | 2/3 | 50% |
| MolmoBot-Img | 2/3 | 1/3 | 0/3 | 1/3 | 3/3 | 1/3 | 3/3 | 2/3 | 3/3 | 2/3 | 60% |
| MolmoBot | 2/3 | 3/3 | 1/3 | 2/3 | 2/3 | 2/3 | 3/3 | 2/3 | 3/3 | 1/3 | 70% |

Pink rows are our models. Click any cell to watch the episode recordings.
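The Avg column in the tables above is simply the pooled success count over all 30 trials in a row. A small sketch of that aggregation, using MolmoBot's row from the first environment table (27/30 successes) as input:

```python
def row_success_rate(cells):
    """Aggregate a row of 'successes/trials' cells into a percent,
    matching the Avg column of the result tables."""
    succ = sum(int(c.split("/")[0]) for c in cells)
    trials = sum(int(c.split("/")[1]) for c in cells)
    return round(100 * succ / trials)

# MolmoBot's row from the first environment table: 27/30 successes.
molmobot = ["3/3", "2/3", "3/3", "3/3", "3/3",
            "3/3", "3/3", "1/3", "3/3", "3/3"]

# π₀'s row from the same table: 1/30 successes.
pi0 = ["0/3", "0/3", "1/3", "0/3", "0/3",
       "0/3", "0/3", "0/3", "0/3", "0/3"]
```

Pooling trials before dividing (rather than averaging per-task percentages) is what the tables do; with 3 trials per task the two coincide, but pooling generalizes to uneven trial counts.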

DROID Simulation Results

Evaluation on held-out simulation environments. Success rates over 200 episodes per task (Pick MSProc: 1000 episodes). All models evaluated zero-shot without any task-specific fine-tuning.

| Model | Pick MSProc | Pick Classic | Pick | Pick Rand-Cam | Pick&Place | PnP Next-To | PnP Color | Avg. |
|---|---|---|---|---|---|---|---|---|
| π₀.₅ | 18.1 | 6.4 | 7.0 | 8.0 | 11.7 | 8.2 | 10.4 | 10.0 |
| π₀.₅-Finetune | 48.0 | 28.3 | 25.8 | 29.7 | 43.5 | **28.4** | 48.3 | 36.0 |
| StereoVLA | 6.6 | 4.3 | 1.1 | N/A | 0 | N/A | 0 | — |
| LAP-VLA | 19.4 | 2.4 | 3.1 | 2.7 | 3.8 | 6.5 | 3.1 | 4.8 |
| X-VLA | 3.3 | 0.5 | 0.7 | 0.8 | 0.1 | 1.9 | 0.9 | 1.2 |
| MolmoBot-Pi0 | 66.2 | 35.7 | 33.3 | 39.8 | 44.7 | 24.7 | 46.2 | 41.5 |
| MolmoBot-Img | 92.2 | 63.5 | 61.4 | 62.1 | 63.0 | 21.0 | **67.8** | 61.6 |
| MolmoBot (F=2) | **93.5** | **66.8** | **64.0** | **63.7** | **66.4** | 26.4 | **67.8** | **64.1** |
| MolmoBot (F=3) | 91.3 | 63.8 | 59.0 | 62.7 | 65.4 | 28.3 | 66.1 | 62.4 |

Pink rows are our models. Bold values are best per column. MolmoBot variants substantially outperform all baselines, with MolmoBot (F=2) achieving 64.1% average vs. 36.0% for the strongest baseline (π₀.₅-Finetune).

RB-Y1 Simulation Results

Zero-shot simulation evaluation for RB-Y1 policies on held-out environments.

| Model | Pick | Pick & Place | Open | Door Open |
|---|---|---|---|---|
| MolmoBot Multitask | 44.8% | 22.5% | 25.2% | 70.2% |
| MolmoBot Door Specialist | — | — | — | 77.7% |
| MolmoBot-SPOC Rigid | 10.5% | 1.8% | — | — |
| MolmoBot-SPOC Articulated | — | — | 21.8% | 58.8% |

MolmoBot Multitask outperforms MolmoBot-SPOC across all shared tasks. The MolmoBot Door Specialist achieves 77.7% zero-shot door-opening success in simulation.

BibTeX

@misc{deshpande2026molmobot,
      title={MolmoBot: Large-Scale Simulation Enables Zero-Shot Manipulation},
      author={Abhay Deshpande and Maya Guru and Rose Hendrix and Snehal Jauhri and Ainaz Eftekhar and Rohun Tripathi and Max Argus and Jordi Salvador and Haoquan Fang and Matthew Wallingford and Wilbert Pumacay and Yejin Kim and Quinn Pfeifer and Ying-Chun Lee and Piper Wolters and Omar Rayyan and Mingtong Zhang and Jiafei Duan and Karen Farley and Winson Han and Eli Vanderbilt and Dieter Fox and Ali Farhadi and Georgia Chalvatzaki and Dhruv Shah and Ranjay Krishna},
      year={2026},
      eprint={2603.16861},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2603.16861},
}