Abstract
A prevailing view in robot learning is that simulation alone is not enough: effective sim-to-real transfer is widely believed to require at least some real-world data collection or task-specific fine-tuning to bridge the gap between simulated and physical environments. We challenge that assumption. With sufficiently large-scale and diverse simulated training data, we show that zero-shot transfer to the real world is not only possible but effective for both static and mobile manipulation.
We introduce MolmoBot-Engine, a fully open-source pipeline for procedural data generation across robots, tasks, and diverse simulated environments in MolmoSpaces. With it, we release MolmoBot-Data, a dataset of 1.7 million expert trajectories for articulated object manipulation and pick-and-place tasks. We train three policy classes: MolmoBot (a Molmo2-based VLM with a flow-matching action head), MolmoBot-Pi0 (replicating the π₀ architecture for controlled comparison), and MolmoBot-SPOC (a lightweight policy for edge deployment). Without any real-world fine-tuning, our policies achieve zero-shot transfer to unseen objects and environments, reaching a 79.2% success rate on real-world tabletop pick-and-place, compared with 39.2% for π₀.₅.
Video Demos
MolmoBot policies are highly robust to camera pose variation. MolmoBot-Engine employs aggressive camera randomization, so the resulting policies must learn to handle a wide range of camera poses. They can even handle adversarial camera movements at test time.
MolmoBot policies generalize robustly to variation in the environment — including table height, object placement, and scene clutter. This robustness extends to gracefully handling adversarial perturbations applied at test time, such as changing the table height mid-episode.
Owing to the high degree of diversity and randomization in MolmoBot-Data, MolmoBot policies are robust to disturbances and highly steerable. SOTA VLAs are often attracted to nearby objects or receptacles, ignoring the task prompt, but MolmoBot policies strongly attend to the task instruction.
MolmoBot can open doors zero-shot on a mobile bimanual platform, demonstrating a high degree of simultaneous coordination between mobility and constrained manipulation.
MolmoBot can coordinate many degrees of freedom simultaneously to achieve robust manipulation.
Zero-Shot Sim-to-Real Transfer
Generating Data at Scale
MolmoBot-Engine is an open-source procedural data generation pipeline built on MolmoSpaces, a photorealistic simulation platform with 94,000+ indoor environments. It automatically generates expert trajectories via task-and-motion planning with aggressive randomization of objects, lighting, and camera poses — producing MolmoBot-Data, 1.7M demonstrations totaling 5,700+ hours of robot experience across 8 task types.
MolmoBot-Engine procedurally generates diverse pick-and-place demonstrations across thousands of simulated environments. Randomization of object placement, lighting, and camera pose drives the sim-to-real transfer capability of the resulting policies.
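The per-episode randomization described above can be illustrated with a minimal sketch. This is not the MolmoBot-Engine API: the names `EpisodeConfig` and `sample_episode_config`, and every numeric range, are hypothetical and chosen only to show the idea of drawing a fresh object placement, lighting level, and camera pose for each generated episode.

```python
import math
import random
from dataclasses import dataclass
from typing import Tuple


@dataclass
class EpisodeConfig:
    """Hypothetical per-episode randomization parameters (illustrative only)."""
    object_xy: Tuple[float, float]               # object position on the table, metres
    light_intensity: float                       # unitless scale on a nominal light
    camera_position: Tuple[float, float, float]  # metres, table-centred frame


def sample_episode_config(rng: random.Random) -> EpisodeConfig:
    """Draw one randomized episode; all ranges below are assumptions."""
    # Object placement: uniform over an assumed 0.6 m x 0.4 m tabletop region.
    object_xy = (rng.uniform(-0.3, 0.3), rng.uniform(-0.2, 0.2))
    # Lighting: scale a nominal intensity by an assumed factor in [0.3, 2.0].
    light_intensity = rng.uniform(0.3, 2.0)
    # Camera: sample a point on a spherical shell above the table centre.
    radius = rng.uniform(0.6, 1.2)
    azimuth = rng.uniform(-math.pi, math.pi)
    elevation = rng.uniform(math.radians(15), math.radians(75))
    camera_position = (
        radius * math.cos(elevation) * math.cos(azimuth),
        radius * math.cos(elevation) * math.sin(azimuth),
        radius * math.sin(elevation),
    )
    return EpisodeConfig(object_xy, light_intensity, camera_position)


rng = random.Random(0)  # fixed seed so the sampled configs are reproducible
configs = [sample_episode_config(rng) for _ in range(1000)]
```

Sampling the camera on a shell rather than around one nominal mount point is one simple way to expose a policy to a wide range of viewpoints; the actual pipeline may randomize differently.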
DROID Real-World Performance
We evaluate our policies zero-shot on a real DROID robot across four distinct environments — Workroom, Kitchen, Bedroom, and Office — covering 40 pick-and-place tasks with varied objects and receptacles (120 episodes total). No real-world data or task-specific fine-tuning is used. MolmoBot (F=2) achieves 79.2% overall success, more than doubling the performance of π₀.₅-DROID (39.2%), a strong baseline trained on large-scale real-world demonstrations.
DROID Real-World Results by Environment
For each policy, we conduct 120 real-world evaluations on 40 tasks in 4 environments across 2 institutions, using 3 physical robots. Each cell shows successes out of 3 trials.
| Policy | Spoon Tray | Spoon Box | Tape Tray | Tape Box | Blue Mug Tray | Blue Mug Box | Copper Mug Tray | Copper Mug Box | Timer Tray | Timer Box | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| π₀ | 0/3 | 0/3 | 1/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 3% |
| π₀.₅ | 2/3 | 2/3 | 2/3 | 0/3 | 1/3 | 1/3 | 0/3 | 0/3 | 0/3 | 0/3 | 27% |
| MolmoBot-Pi0 | 1/3 | 0/3 | 3/3 | 3/3 | 3/3 | 3/3 | 2/3 | 1/3 | 0/3 | 2/3 | 60% |
| MolmoBot-Img | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 2/3 | 2/3 | 0/3 | 3/3 | 1/3 | 77% |
| MolmoBot | 3/3 | 2/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 1/3 | 3/3 | 3/3 | 90% |
| Policy | Apple Easy | Apple Hard | Mug Easy | Mug Hard | Banana Easy | Banana Hard | Mouse Easy | Mouse Hard | Clutter Brown | Clutter Black | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| π₀ | 0/3 | 1/3 | 0/3 | 0/3 | 1/3 | 0/3 | 3/3 | 1/3 | 0/3 | 0/3 | 20% |
| π₀.₅ | 3/3 | 0/3 | 2/3 | 1/3 | 3/3 | 2/3 | 3/3 | 1/3 | 2/3 | 2/3 | 63% |
| MolmoBot-Pi0 | 2/3 | 3/3 | 2/3 | 0/3 | 3/3 | 1/3 | 3/3 | 2/3 | 0/3 | 0/3 | 53% |
| MolmoBot-Img | 3/3 | 3/3 | 2/3 | 3/3 | 3/3 | 3/3 | 3/3 | 2/3 | 3/3 | 1/3 | 87% |
| MolmoBot | 1/3 | 2/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 3/3 | 0/3 | 0/3 | 70% |
| Policy | Pills Towel | Pills Basket | Roller Towel | Roller Basket | Banana Towel | Banana Basket | Ball Towel | Ball Basket | Clutter Towel | Clutter Basket | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| π₀ | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0/3 | 0% |
| π₀.₅ | 0/3 | 0/3 | 0/3 | 0/3 | 1/3 | 0/3 | 0/3 | 0/3 | 2/3 | 0/3 | 10% |
| MolmoBot-Pi0 | 2/3 | 0/3 | 2/3 | 0/3 | 0/3 | 0/3 | 1/3 | 0/3 | 0/3 | 2/3 | 23% |
| MolmoBot-Img | 0/3 | 0/3 | 1/3 | 2/3 | 3/3 | 3/3 | 3/3 | 2/3 | 3/3 | 3/3 | 67% |
| MolmoBot | 3/3 | 3/3 | 3/3 | 0/3 | 3/3 | 2/3 | 3/3 | 3/3 | 3/3 | 3/3 | 87% |
| Policy | Knife Board | Banana Plate | Marker Mug | Scissors Bowl | Carrot Basket | Knife Green Bowl | Screwdriver Blue Bowl | Mouse Blue Bowl | Mug Bowl | Marker Box | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| π₀ | 0/3 | 0/3 | 0/3 | 0/3 | 1/3 | 1/3 | 1/3 | 1/3 | 0/3 | 0/3 | 13% |
| π₀.₅ | 2/3 | 3/3 | 1/3 | 1/3 | 1/3 | 1/3 | 2/3 | 1/3 | 3/3 | 2/3 | 57% |
| MolmoBot-Pi0 | 1/3 | 3/3 | 0/3 | 0/3 | 0/3 | 1/3 | 3/3 | 2/3 | 3/3 | 2/3 | 50% |
| MolmoBot-Img | 2/3 | 1/3 | 0/3 | 1/3 | 3/3 | 1/3 | 3/3 | 2/3 | 3/3 | 2/3 | 60% |
| MolmoBot | 2/3 | 3/3 | 1/3 | 2/3 | 2/3 | 2/3 | 3/3 | 2/3 | 3/3 | 1/3 | 70% |
MolmoBot-Pi0, MolmoBot-Img, and MolmoBot are our models.
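The headline 79.2% and 39.2% figures can be recovered by pooling the per-task counts in the four tables above. The short sketch below does this for two policies; the counts are transcribed from the tables, while the variable and function names are illustrative, and the ASCII key `"pi0.5"` stands in for π₀.₅.

```python
TRIALS_PER_TASK = 3

# Successes out of 3 trials per task, one inner list per environment,
# in the same order as the four tables above.
successes = {
    "pi0.5": [
        [2, 2, 2, 0, 1, 1, 0, 0, 0, 0],
        [3, 0, 2, 1, 3, 2, 3, 1, 2, 2],
        [0, 0, 0, 0, 1, 0, 0, 0, 2, 0],
        [2, 3, 1, 1, 1, 1, 2, 1, 3, 2],
    ],
    "MolmoBot": [
        [3, 2, 3, 3, 3, 3, 3, 1, 3, 3],
        [1, 2, 3, 3, 3, 3, 3, 3, 0, 0],
        [3, 3, 3, 0, 3, 2, 3, 3, 3, 3],
        [2, 3, 1, 2, 2, 2, 3, 2, 3, 1],
    ],
}


def success_rates(tables):
    """Return (per-environment %, pooled overall %) for one policy."""
    per_env = [round(100 * sum(t) / (TRIALS_PER_TASK * len(t))) for t in tables]
    episodes = sum(TRIALS_PER_TASK * len(t) for t in tables)
    overall = round(100 * sum(map(sum, tables)) / episodes, 1)
    return per_env, overall


results = {policy: success_rates(tables) for policy, tables in successes.items()}
# results["MolmoBot"] → ([90, 70, 87, 70], 79.2)
# results["pi0.5"]    → ([27, 63, 10, 57], 39.2)
```

Because every environment has the same number of episodes (30), the pooled rate equals the mean of the four per-environment rates before rounding.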
DROID Simulation Results
Success rates (%) on held-out simulation environments, measured over 200 episodes per task (1,000 for Pick MSProc). All models are evaluated zero-shot, without any task-specific fine-tuning.
| Model | Pick MSProc | Pick Classic | Pick | Pick Rand-Cam | Pick&Place | PnP Next-To | PnP Color | Avg. |
|---|---|---|---|---|---|---|---|---|
| π₀.₅ | 18.1 | 6.4 | 7.0 | 8.0 | 11.7 | 8.2 | 10.4 | 10.0 |
| π₀.₅-Finetune | 48.0 | 28.3 | 25.8 | 29.7 | 43.5 | 28.4 | 48.3 | 36.0 |
| StereoVLA | 6.6 | 4.3 | 1.1 | N/A | 0.0 | N/A | 0.0 | — |
| LAP-VLA | 19.4 | 2.4 | 3.1 | 2.7 | 3.8 | 6.5 | 3.1 | 4.8 |
| X-VLA | 3.3 | 0.5 | 0.7 | 0.8 | 0.1 | 1.9 | 0.9 | 1.2 |
| MolmoBot-Pi0 | 66.2 | 35.7 | 33.3 | 39.8 | 44.7 | 24.7 | 46.2 | 41.5 |
| MolmoBot-Img | 92.2 | 63.5 | 61.4 | 62.1 | 63.0 | 21.0 | 67.8 | 61.6 |
| MolmoBot (F=2) | 93.5 | 66.8 | 64.0 | 63.7 | 66.4 | 26.4 | 67.8 | 64.1 |
| MolmoBot (F=3) | 91.3 | 63.8 | 59.0 | 62.7 | 65.4 | 28.3 | 66.1 | 62.4 |
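MolmoBot-Pi0, MolmoBot-Img, and both MolmoBot variants are our models. All of them outperform every baseline on average, with MolmoBot (F=2) achieving 64.1% average vs. 36.0% for the strongest baseline (π₀.₅-Finetune). The best MolmoBot variant also leads every individual column except PnP Next-To, where π₀.₅-Finetune (28.4) narrowly edges out MolmoBot (F=3) (28.3).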
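For the two rows compared here, the Avg. column matches an unweighted mean of the seven task columns. That is an inference from the numbers rather than something the page states. The sketch below recomputes both averages and lists the tasks on which the baseline is ahead of MolmoBot (F=2); the values are transcribed from the table, and the variable names are illustrative.

```python
from statistics import mean

columns = ["Pick MSProc", "Pick Classic", "Pick", "Pick Rand-Cam",
           "Pick&Place", "PnP Next-To", "PnP Color"]

# Per-task success rates (%), transcribed from the simulation table.
rows = {
    "pi0.5-Finetune": [48.0, 28.3, 25.8, 29.7, 43.5, 28.4, 48.3],
    "MolmoBot (F=2)": [93.5, 66.8, 64.0, 63.7, 66.4, 26.4, 67.8],
}

# Unweighted mean over tasks (assumed to be how Avg. is computed).
averages = {model: round(mean(values), 1) for model, values in rows.items()}
# averages → {"pi0.5-Finetune": 36.0, "MolmoBot (F=2)": 64.1}

gap = round(averages["MolmoBot (F=2)"] - averages["pi0.5-Finetune"], 1)
# gap → 28.1 percentage points

# Tasks where the fine-tuned baseline beats MolmoBot (F=2).
baseline_leads = [
    task
    for task, base, ours in zip(columns, rows["pi0.5-Finetune"], rows["MolmoBot (F=2)"])
    if base > ours
]
# baseline_leads → ["PnP Next-To"]
```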
RB-Y1 Simulation Results
Zero-shot simulation evaluation for RB-Y1 policies on held-out environments.
| Model | Pick | Pick & Place | Open | Door Open |
|---|---|---|---|---|
| MolmoBot Multitask | 44.8% | 22.5% | 25.2% | 70.2% |
| MolmoBot Door Specialist | — | — | — | 77.7% |
| MolmoBot-SPOC Rigid | 10.5% | 1.8% | — | — |
| MolmoBot-SPOC Articulated | — | — | 21.8% | 58.8% |
MolmoBot Multitask outperforms MolmoBot-SPOC across all shared tasks. The MolmoBot Door Specialist achieves 77.7% zero-shot door-opening success in simulation.