Benchmark Evaluation¶

Run learned policies on fixed, reproducible benchmarks.

Resources¶

This README focuses on benchmark installation and running.

Leaderboard documentation can be found here.
To submit results, see the GitHub issue in the repository here.
Theoretical notes on policy comparison can be found here.

Concepts¶

The MolmoSpaces leaderboard shows the results of various policies on benchmarks.

Benchmark sets are collections of individual benchmarks that were released together, e.g., the ones from the MolmoSpaces paper (prefixed by "MS-") or the ones from the MolmoBot paper (prefixed by "MB-").

A benchmark is a benchmark.json file containing a list of self-contained episode specs. Each spec includes everything needed to recreate a task: scene, robot pose, object poses, cameras, language instructions.
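
As a rough illustration of this format, the sketch below reads a benchmark.json and returns its episode specs. It assumes only what is stated above, namely that the file holds a JSON list of specs. The helper name `load_episode_specs` is hypothetical and not part of molmo_spaces, and no particular field names inside a spec are assumed.

```python
import json
from pathlib import Path


def load_episode_specs(benchmark_dir):
    """Return the list of episode specs stored in <benchmark_dir>/benchmark.json.

    Hypothetical helper for illustration; not part of molmo_spaces.
    """
    path = Path(benchmark_dir) / "benchmark.json"
    with path.open() as f:
        specs = json.load(f)
    # Assumption: the top level of the file is a list of episode specs.
    if not isinstance(specs, list):
        raise ValueError(f"expected a list of episode specs in {path}")
    return specs
```

Calling `len(load_episode_specs(benchmark_dir))` then gives the number of episodes in that benchmark.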

An eval config is a normal datagen config with your policy attached. When pointed at a benchmark, episode-specific fields (cameras, init_qpos, object_poses, etc.) are overwritten by the benchmark spec. This means you can debug your policy in normal datagen using datagen/main, then switch to benchmark mode for eval by using eval_main or by importing run_evaluation from the same module.
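
The override behaviour can be pictured with plain dictionaries. This is only a conceptual sketch: the real configs are config objects rather than dicts, the actual merge logic inside molmo_spaces may differ, and `apply_episode_spec` is a hypothetical name. The field names are the three mentioned above.

```python
# Fields named in the text above; the real set of episode-specific fields may be longer.
EPISODE_FIELDS = ("cameras", "init_qpos", "object_poses")


def apply_episode_spec(config, episode_spec):
    """Return a copy of `config` with episode-specific fields taken from the spec."""
    merged = dict(config)
    for key in EPISODE_FIELDS:
        if key in episode_spec:
            merged[key] = episode_spec[key]
    return merged


# Toy stand-ins for a datagen config and one benchmark episode spec.
datagen_config = {"policy": "MyPolicy", "cameras": ["debug_cam"], "init_qpos": [0.0] * 7}
episode_spec = {"cameras": ["exo_camera_1", "wrist_camera"], "init_qpos": [0.1] * 7}

eval_config = apply_episode_spec(datagen_config, episode_spec)
print(eval_config["policy"])   # the policy setting is untouched
print(eval_config["cameras"])  # the cameras now come from the benchmark spec
```

The point of the design is that the policy-related settings stay identical between datagen and eval, while everything that defines an episode is pinned by the benchmark.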

Installing the Benchmarks¶

The benchmark datasets are installed when the resource manager is instantiated. You can either run

export MLSPACES_ASSETS_DIR=/path/to/symlink/resources  # optional
python -m molmo_spaces.molmo_spaces_constants

or call it explicitly from a Python script

from molmo_spaces.molmo_spaces_constants import get_resource_manager
get_resource_manager()
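
To check that the install produced benchmark files, one option is to search the assets directory for benchmark.json files. This is a minimal sketch, assuming the benchmarks land under an assets/benchmarks directory as in the paths used later in this README; `find_benchmark_dirs` is a hypothetical helper, not part of molmo_spaces.

```python
from pathlib import Path


def find_benchmark_dirs(root):
    """Return the sorted directories under `root` that contain a benchmark.json file."""
    return sorted(p.parent for p in Path(root).rglob("benchmark.json"))
```

For example, `find_benchmark_dirs("assets/benchmarks")` should return a non-empty list once the download has finished, if the assumption about the directory layout holds.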

Choosing a Benchmark¶

Which benchmarks should I evaluate on? This depends on what you want to show. The current TLDR is:

MolmoSpaces Manipulation: Easy manipulation tasks
MolmoSpaces Navigation: Navigation tasks
MolmoBot Manipulation: Harder manipulation tasks

For more information see here.

Running Benchmarks¶

Running a Generic Benchmark with Pi¶

1. Setup: install and run the Pi policy server¶

Clone openpi and download a checkpoint:

git clone https://github.com/omarrayyann/openpi
mkdir checkpoints && cd checkpoints
gsutil cp -r gs://openpi-assets/checkpoints/pi05_droid_jointpos .
# available checkpoints: `pi05_droid_jointpos`, `pi0_fast_droid_jointpos`, `pi0_droid_jointpos`

Install openpi and start the policy server (default port: 8080). Leave it running in a separate terminal:

uv run scripts/serve_policy.py --port=8080 policy:checkpoint \
  --policy.config=<checkpoint_name> \
  --policy.dir=checkpoints/<checkpoint_name>/

2. Run the benchmark¶

Please look at the concrete commands for each task type in our leaderboard:

- MolmoSpaces tasks (MS- prefix): ms-bench
- MolmoBot tasks (MB- prefix): mb-bench

If you are using OpenPI models, install the client: pip install openpi_client.

This example uses the easy MS-Pick benchmark, located at assets/benchmarks/molmospaces-bench-v1/procthor-10k/FrankaPickDroidMiniBench/FrankaPickDroidMiniBench_json_benchmark_20251231/.

Then launch benchmark episodes in MuJoCo:

python molmo_spaces/evaluation/eval_main.py \
    molmo_spaces.evaluation.configs.evaluation_configs:PiPolicyEvalConfig \
    --benchmark_dir assets/benchmarks/molmospaces-bench-v1/procthor-10k/FrankaPickDroidMiniBench/FrankaPickDroidMiniBench_json_benchmark_20251231/ \
    --task_horizon_steps 500

Make sure the port number in molmo_spaces.configs.policy_configs_baselines:PiPolicyConfig matches the port the policy server is listening on.
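
A quick way to catch a port mismatch before launching a long run is to test whether anything is listening on the expected port. The sketch below uses only the Python standard library; the host `localhost` and the function name are assumptions for illustration, and the check only confirms that a TCP port is open, not that the process behind it is the Pi policy server.

```python
import socket


def policy_server_reachable(host="localhost", port=8080, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved host names.
        return False


if not policy_server_reachable(port=8080):
    print("No server is listening on port 8080; start the policy server first.")
```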

Also see molmo_spaces/evaluation/configs/evaluation_configs.py for more examples of eval configs.

3. Run the evaluation¶

Finally, run the script that aggregates the evaluation output into CSV files.

python scripts/benchmarks/eval_to_csv.py <eval_output_dir> pi05ft --success-condition oracle --output-csv data/pick_easy/pi05.csv
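
Once the CSV exists, a per-benchmark success rate can be computed with the standard library. The column layout written by eval_to_csv.py is not documented here, so the sketch below takes the column name as a required argument rather than guessing it; `success_rate` is a hypothetical helper and the set of values treated as true is an assumption.

```python
import csv


def success_rate(csv_path, column):
    """Fraction of rows whose `column` value reads as true ("1", "true", "yes")."""
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        raise ValueError(f"{csv_path} has no data rows")
    hits = sum(row[column].strip().lower() in {"1", "true", "yes"} for row in rows)
    return hits / len(rows)
```

Inspect the header of your CSV first, then pass the name of the column that records episode success.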

Running MolmoSpaces Benchmarks¶

See here.

Running MolmoBot Benchmarks¶

See here.

Implementing Eval in an External Repo¶

You need three things: a policy class, a policy config, and an eval config.

1. Policy Class¶

Extend InferencePolicy. Your class must implement prepare_model, reset, and get_action.

# my_repo/policy.py
import numpy as np

from molmo_spaces.policy.base_policy import InferencePolicy

class MyPolicy(InferencePolicy):
    def __init__(self, config, task):
        super().__init__(config, task)
        self.camera_names = config.policy_config.camera_names
        self.action_spec = config.policy_config.action_spec
        self.prepare_model()

    def prepare_model(self):
        # Load your model from config.policy_config.checkpoint_path.
        # load_my_model is a placeholder for your own loading function.
        self.model = load_my_model(self.config.policy_config.checkpoint_path)

    def reset(self):
        # Called at the start of each episode
        pass

    def get_action(self, observation) -> dict[str, np.ndarray]:
        # observation is a dict with camera images and robot_state
        # Return dict mapping move group names to action arrays
        # e.g. {"arm": np.array([...]), "gripper": np.array([...])}
        obs = observation[0] if isinstance(observation, list) else observation
        images = [obs[cam] for cam in self.camera_names]
        state = obs["robot_state"]["qpos"]
        return self.model.predict(images, state)

See molmo_spaces/policy/learned_policy/synthvla_policy.py for a full example with action chunking.
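
For readers unfamiliar with action chunking, the idea is that the model predicts several future actions per query and the policy replays some of them before querying again. The sketch below is a generic illustration under that assumption, not the implementation in synthvla_policy.py; `ActionChunkBuffer`, `predict_chunk`, and `execute_steps` are hypothetical names.

```python
from collections import deque


class ActionChunkBuffer:
    """Serve actions one at a time from model-predicted chunks (hypothetical helper)."""

    def __init__(self, predict_chunk, execute_steps):
        self.predict_chunk = predict_chunk  # callable: observation -> sequence of actions
        self.execute_steps = execute_steps  # actions to execute before re-querying the model
        self.queue = deque()

    def reset(self):
        # Drop leftover actions so a new episode starts with a fresh prediction.
        self.queue.clear()

    def get_action(self, observation):
        # Query the model only when the previous chunk has been used up.
        if not self.queue:
            chunk = self.predict_chunk(observation)
            self.queue.extend(chunk[: self.execute_steps])
        return self.queue.popleft()
```

Inside a policy, `get_action` would delegate to such a buffer and `reset` would clear it, so that no actions from a previous episode leak into the next one.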

2. Policy Config¶

Extend BasePolicyConfig. Define your model's interface.

# my_repo/configs.py
from molmo_spaces.configs.policy_configs import BasePolicyConfig
from molmo_spaces.policy.base_policy import PolicyFactory

class MyPolicyConfig(BasePolicyConfig):
    policy_type: str = "learned"
    action_type: str = "joint_pos_rel"
    policy_cls: type = None
    policy_factory: PolicyFactory | None = None

    # Model interface
    checkpoint_path: str
    camera_names: list[str] = ["exo_camera_1", "wrist_camera"]
    action_move_group_names: list[str] = ["arm", "gripper"]
    action_spec: dict[str, int] = {"arm": 7, "gripper": 1}

    def model_post_init(self, __context):
        # Resolve the policy class lazily, when the config is instantiated
        if self.policy_cls is None:
            from my_repo.policy import MyPolicy
            self.policy_cls = MyPolicy
            self.policy_factory = MyPolicy
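
While developing a policy, it can help to check that the dict returned by get_action matches the declared interface. The sketch below reads action_spec as a mapping from move group name to action dimension, which is an interpretation of the config above rather than documented behaviour; `check_action` is a hypothetical helper.

```python
def check_action(action, action_spec):
    """Raise ValueError if `action` does not match the move groups and sizes in `action_spec`."""
    if set(action) != set(action_spec):
        raise ValueError(f"move groups {sorted(action)} != expected {sorted(action_spec)}")
    for group, size in action_spec.items():
        if len(action[group]) != size:
            raise ValueError(f"{group}: got {len(action[group])} values, expected {size}")


action_spec = {"arm": 7, "gripper": 1}
# A well-formed action passes silently; lists stand in for NumPy arrays here.
check_action({"arm": [0.0] * 7, "gripper": [1.0]}, action_spec)
```

Because it only relies on `len`, the same check works for one-dimensional NumPy arrays.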

3. Eval Config¶

Extend JsonBenchmarkEvalConfig. This is the minimal config for benchmark eval: episode-specific data (cameras, poses, task params) comes from the benchmark JSON.

# my_repo/configs.py
from molmo_spaces.configs.robot_configs import FrankaRobotConfig
from molmo_spaces.evaluation.configs.evaluation_configs import JsonBenchmarkEvalConfig

class MyEvalConfig(JsonBenchmarkEvalConfig):
    robot_config: FrankaRobotConfig = FrankaRobotConfig()
    policy_config: MyPolicyConfig = MyPolicyConfig(
        checkpoint_path="/path/to/default/checkpoint"
    )
    policy_dt_ms: float = 200.0  # Match your model's expected control rate

    def model_post_init(self, __context):
        super().model_post_init(__context)
        self.robot_config.action_noise_config.enabled = False
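
To sanity-check policy_dt_ms against your model, it helps to convert it to a control rate. The arithmetic below uses the value from the config above and the horizon from the earlier example command; treating one horizon step as one policy step is an assumption, not something this README states.

```python
policy_dt_ms = 200.0       # value from the eval config above
task_horizon_steps = 500   # value passed on the command line in the Pi example

# 200 ms between policy calls corresponds to 5 policy calls per second.
control_rate_hz = 1000.0 / policy_dt_ms

# Assumption: one horizon step corresponds to one policy step.
max_episode_seconds = task_horizon_steps * policy_dt_ms / 1000.0

print(f"{control_rate_hz:g} Hz, up to {max_episode_seconds:g} s per episode")
```

If your model was trained at a different control rate, set policy_dt_ms to 1000 divided by that rate in Hz.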

4. Run Evaluation¶