JSON Evaluation Pipeline Lifecycle¶
This document describes the end-to-end lifecycle of a JSON benchmark evaluation
in molmo_spaces, what happens at each stage, and which knobs are available
to customize behavior. For a getting-started/setup guide, see
evaluation_guide.md. For schema details on a single
episode, see the "Sample Episode Spec" section there or
molmo_spaces/evaluation/benchmark_schema.py.
High-level flow¶
CLI args / programmatic call
│
▼
run_evaluation() (molmo_spaces/evaluation/eval_main.py)
├─ resolve eval config class (registry name or "module:Class")
├─ load benchmark episodes from JSON
├─ resolve task_horizon (CLI override > benchmark task_horizon_sec)
├─ create_eval_config(): instantiate eval config, force eval-mode flags
│ (no action noise, no datagen profiler, seed=42, output_dir, ...)
├─ apply CLI overrides (camera_config, camera_names, light intensity)
├─ JsonEvalRunner.patch_config(): attach EvalRuntimeParams
└─ JsonEvalRunner.adjust_robot(): wire robot_eval_override (if any)
│
▼
JsonEvalRunner.__init__()
├─ load_all_episodes(benchmark_dir)
├─ truncate to max_episodes / single episode_idx
├─ derive task_sampler_config.house_inds + samples_per_house
└─ super().__init__() -> ParallelRolloutRunner sets up workers
│
▼
JsonEvalRunner.run() -> ParallelRolloutRunner.run()
└─ for each (house_id, batch) work item, dispatch to workers
│
▼
ParallelRolloutRunner.process_single_house() (per worker)
├─ load_episodes_for_house() # JsonEvalRunner override
├─ for each EpisodeSpec:
│ prepare_episode_config()
│ get_episode_task_sampler() -> JsonEvalTaskSampler(exp_config, ep)
│ sample_task_from_spec() -> task = sampler.sample_task(...)
│ run_single_rollout() -> step env at policy_dt_ms
│ should_close_episode_task_sampler() (True for JSON eval)
└─ write trajectories.h5 + per-episode artifacts under output_dir
│
▼
collect_episode_results() + (optional) wandb logging
│
▼
EvaluationResults (success_count, total_count, output_dir, episode_results, exp_config)
Stage-by-stage detail¶
1. Entrypoints¶
There are two equivalent entrypoints, both implemented in
molmo_spaces/evaluation/eval_main.py:
- CLI:
python molmo_spaces/evaluation/eval_main.py <eval_config_cls> --benchmark_dir ... - Programmatic:
run_evaluation(eval_config_cls=..., benchmark_dir=..., ...)
eval_config_cls may be:
- a Python class object (subclass of
MlSpacesExpConfig), - a registry name (looked up via
get_config_class), or - a fully qualified
"module.path:ClassName"string.
A small data-version sanity check (_assert_data_versions_match) runs at import
time to ensure the assets pinned in _EXPECTED_DATA_VERSIONS match
DATA_TYPE_TO_SOURCE_TO_VERSION. Mismatches raise immediately so benchmarks are
not silently evaluated against the wrong assets.
2. Loading the benchmark¶
load_all_episodes(benchmark_dir) reads either a single benchmark.json or
the legacy house_*/episode_*.json layout and returns a flat list of
EpisodeSpec objects (defined in
molmo_spaces/evaluation/benchmark_schema.py). Each EpisodeSpec is
fully self-contained:
source: provenance (h5 file, traj key, dates)house_index,scene_dataset,data_split,seedrobot:robot_nameandinit_qposper move groupimg_resolutionandcameras(robot-mounted or exocentric specs)scene_modifications:added_objects,removed_objects,object_posestask:task_cls,robot_base_pose, and task-specific fieldstask_relevant_objects: bodies used for camera visibility checkslanguage:task_description, referral expressions, and priorities
Notably, timing parameters (policy_dt_ms, ctrl_dt_ms, sim_dt_ms) and
task_horizon are NOT stored on the episode — they come from the eval
config or CLI so the same benchmark can be replayed at different control rates.
3. Resolving task_horizon¶
determine_task_horizon() picks the horizon (in policy steps) using this
priority:
- Explicit CLI override (
--task_horizon_stepsor--task_horizon_sec, mutually exclusive). With--task_horizon_secthe value is converted to steps usingeval_config.policy_dt_ms. - Per-episode
task.task_horizon_secfrom the benchmark JSON, converted usingpolicy_dt_ms. All episodes must agree on the value, otherwise evaluation fails loudly.
If neither is available, evaluation aborts rather than silently picking a default.
4. Building the experiment config¶
create_eval_config() instantiates the user's eval config class
(typically a subclass of JsonBenchmarkEvalConfig) and applies eval-mode
defaults:
policy_config.checkpoint_pathoverridden if--checkpoint_pathwas given.output_dirset toeval_output/<config>/<timestamp>(or<output_dir>/<config>/<timestamp>if--output_diris provided).num_workersfrom the CLI / argument.robot_config.action_noise_config = ActionNoiseConfig(enabled=False)— evaluation is always deterministic w.r.t. the policy's actions.datagen_profiler = False,profile = Falsefor clean output.filter_for_successful_trajectories = False— every rollout is saved.seed = 42.task_horizonset to the resolved value from the previous step.camera_configreplaced with the override if provided (only honored forFrankaEvalCameraSystem).eval_runtime_paramspopulated with anEvalRuntimeParamsobject.
run_evaluation then applies a few late overrides:
environment_light_intensity(filament renderer only) from CLI.policy_config.camera_namesfrom--camera_names.
5. Patching runtime parameters¶
JsonEvalRunner.patch_config(...) stores an EvalRuntimeParams dataclass on
the experiment config:
episode_idx: only evaluate this single episode (--idx).max_episodes: cap the number of episodes pulled from the benchmark.add_custom_object,custom_object_path,custom_object_name: replace the target object ofpick/pick_and_place-style episodes with a user-provided XML asset (--add_custom_object --custom_object_path ...).
These parameters are read by both the runner (when shaping work items) and the worker (when loading episodes for its house).
6. Robot eval override (per-robot evaluation hooks)¶
Some robots need small environment tweaks at eval time that are not part of
the benchmark JSON (e.g. different intrinsics on the wrist camera, depth
recording, a slightly different robot base pose). These are implemented as
robot eval overrides in molmo_spaces/evaluation/robot_eval_overrides.py.
def cap_robot_eval_override(
episode_spec: EpisodeSpec,
exp_config: MlSpacesExpConfig,
) -> None:
camera_config = exp_config.camera_config
camera_config.cameras[0] = MjcfCameraConfig(
name="wrist_camera",
mjcf_name="wrist_camera",
robot_namespace="robot_0/gripper/",
fov=53.0,
fov_noise_degrees=(0.0, 0.0),
pos_noise_range=(0.0, 0.0),
orientation_noise_degrees=0.0,
record_depth=True,
)
camera_config.cameras[1].record_depth = True
camera_config.cameras[1].fov = 71
rot_base = R.from_quat(episode_spec.task["robot_base_pose"][3:7], scalar_first=True).as_matrix()
episode_spec.task["robot_base_pose"][:3] += 0.05 * rot_base[0:3, 0]
episode_spec.task["robot_base_pose"][2] -= 0.2
camera_config.img_resolution = (960, 720)
episode_spec.robot.init_qpos = {
"base": [],
"arm": [[0, -1.5, 0.116, -2.45, 0, 0.842, 0.965]],
"gripper": [0.00296, 0.00296],
}
ROBOT_OVERRIDE_REGISTRY: dict[type[BaseRobotConfig], OverrideFn] = {
FrankaCAPRobotConfig: cap_robot_eval_override,
}
How it gets wired in:
JsonEvalRunner.adjust_robot(exp_config)looks up an override forexp_config.robot_config's class inROBOT_OVERRIDE_REGISTRY. The lookup walks the class MRO, so subclasses inherit their parent's override.- The resolved function (or
None) is stored on the config asexp_config.eval_runtime_params.robot_override_fn. - Inside
JsonEvalTaskSampler.__init__, after the recorded camera config and task type have been wired up but beforesuper().__init__, the override is invoked:robot_override_fn(episode_spec, exp_config). Because it receives the liveepisode_specand the fullMlSpacesExpConfig, it can mutate the camera config, robot config, or any other part of the experiment config for that episode.
To add an override for a new robot class:
- Implement an
OverrideFn((EpisodeSpec, MlSpacesExpConfig) -> None). - Call
register_robot_override(MyRobotConfig, my_override_fn)from an importing module with the robot config class and an override function. - The override is only applied during JSON evaluation (it's wired through
JsonEvalRunner.adjust_robot), so datagen behavior is unaffected.
Warning
Use robot eval overrides sparingly: the JSON benchmark is meant to be authoritative, and overrides defeat that contract.
Tip
If all you need is depth recording on every camera, prefer
policy_config.force_enable_depth = True over a robot eval override. When
set, JsonEvalTaskSampler flips record_depth = True on every camera in
exp_config.camera_config before the sampler is initialized (see
molmo_spaces/tasks/json_eval_task_sampler.py), so you don't need to mutate
the camera config from a robot-specific hook.
7. Runner construction (JsonEvalRunner.__init__)¶
The runner re-loads the benchmark, then:
- Truncates to the first
max_episodesif requested. - Groups episodes by
house_index(so each house can be processed independently across workers). - If
episode_idxis set, narrowstask_sampler_config.house_indsto the single house that owns that episode and setssamples_per_house = 1. - Otherwise sets
house_indsto all houses present in the benchmark andsamples_per_houseto the maximum number of episodes any single house has. - Stores
benchmark_pathon the config so worker processes can re-load episodes inside their own process.
Then super().__init__(exp_config) runs ParallelRolloutRunner.__init__,
which builds the work-item list (house_id, batch_samples, batch_num,
total_batches) and prepares multiprocessing primitives.
8. Per-episode execution¶
Inside each worker, ParallelRolloutRunner.process_single_house() iterates
the episodes for a single house. The JsonEvalRunner overrides hook out the
JSON-specific behavior:
load_episodes_for_house()re-reads the benchmark fromexp_config.benchmark_pathand appliesEvalRuntimeParamsfilters (max_episodes,episode_idx, custom-object replacement viareplace_target_object_with_custom).get_max_episode_attempts()returnslen(episode_specs)— every benchmark episode is run exactly once (no retries, unlike datagen).should_stop_early()returnsTrueonce a single-episode run has collected its one rollout.prepare_episode_config()deep-copiesexp_configand setsscene_dataset/data_splitfrom the spec.task_horizonis kept fromexp_config.get_episode_task_sampler()constructs a freshJsonEvalTaskSamplerper episode (mixed task types within one benchmark are supported).sample_task_from_spec()callssampler.sample_task(house_index=house_id), which sets robot base pose, applies the per-task config, applies any_randomize_colors, callssetup_cameras, and instantiates the right task class (task.task_clsis imported dynamically).get_episode_seed()usesepisode_spec.seedif set, else the index.should_close_episode_task_sampler()returnsTrueso the sampler is torn down each episode (avoids leaking state between mixed-task episodes).
JsonEvalTaskSampler.randomize_scene does not randomize — it deterministically
sets:
- object poses from
scene_modifications.object_poses, - robot per-move-group joint positions from
robot.init_qpos, - pickup-object joint state when applicable,
- per-object colors from
task.object_colors(e.g. forPickAndPlaceColorTask).
Cameras come either from the JSON-recorded specs (default) or from a
FrankaEvalCameraSystem that performs deterministic, per-episode
visibility-aware spherical perturbation when --use_eval_cameras is passed.
9. Result collection¶
After runner.run(...) returns (success_count, total_count),
collect_episode_results(output_dir) walks the output tree and produces a
list of EpisodeResult records (house_id, episode_idx, success, paths
to per-episode artifacts, etc.). These are returned in the
EvaluationResults dataclass, alongside success_rate, output_dir, and
the resolved exp_config.
If use_wandb=True:
- A run is initialized with run name
"<ckpt_name>_<timestamp>"and config including the resolved checkpoint path, benchmark dir, horizon (steps and seconds), config class, and episode/house counts. - Composed videos per episode are built via
compose_episode_videos(whenpolicy_config.camera_namesis non-empty) and uploaded withlog_eval_results_to_wandb.
After everything finishes, the user typically runs
scripts/benchmarks/eval_to_csv.py <output_dir> ... to aggregate per-episode
results into a CSV (see evaluation_guide.md).
What users can configure¶
The table below summarizes each user-facing knob, where it lives, and what it
affects. For full details on a flag, see get_args() in
eval_main.py or the docstring on run_evaluation().
Required selection¶
| CLI flag | run_evaluation arg |
Purpose |
|---|---|---|
positional exp_config_cls |
eval_config_cls |
The eval config class (object, registry name, or "module:Class"). Provides policy_config, robot_config, and timing. |
--benchmark_dir |
benchmark_dir |
Path to the benchmark (benchmark.json or legacy directory layout). |
Policy / checkpoint¶
| CLI flag | run_evaluation arg |
Effect |
|---|---|---|
--checkpoint_path |
checkpoint_path |
Override policy_config.checkpoint_path. |
| (none) | preloaded_policy |
Pass an already-instantiated BasePolicy. Single-worker only — multi-worker requires creating connections inside each worker. |
--camera_names |
camera_names_override |
Replace policy_config.camera_names (e.g. --camera_names randomized_zed2_analogue_1 wrist_camera). |
Episode horizon and selection¶
| CLI flag | run_evaluation arg |
Effect |
|---|---|---|
--task_horizon_steps |
task_horizon_steps |
Hard-cap policy steps per episode. Mutually exclusive with --task_horizon_sec. |
--task_horizon_sec |
task_horizon_sec |
Same, but expressed in seconds; converted via policy_dt_ms. |
--max_episodes |
max_episodes |
Evaluate only the first N episodes (filtered before house grouping). |
--idx |
episode_idx |
Run a single episode by its index in the benchmark. |
Cameras¶
| CLI flag | run_evaluation arg |
Effect |
|---|---|---|
--use_eval_cameras |
(via camera_config_override) |
Enable FrankaEvalCameraSystem — replaces JSON-recorded cameras with a deterministic, visibility-checked spherical perturbation. |
--camera_rand_level |
(via camera_config_override) |
0–100 randomization scale for the eval camera system. Only relevant with --use_eval_cameras. |
--use-filament |
(build separately) | Switch to the filament renderer (requires the custom wheel). |
--environment-light-intensity |
environment_light_intensity |
Override default ambient intensity (filament only). |
Custom objects (asset-swap experiments)¶
| CLI flag | run_evaluation arg |
Effect |
|---|---|---|
--add_custom_object |
add_custom_object |
Replace the target object in pick / pick_and_place-type episodes with a custom XML asset. |
--custom_object_path |
custom_object_path |
Path to the replacement object XML (required when add_custom_object is set). |
--custom_object_name |
custom_object_name |
Natural-language name used in instructions (defaults to the file stem if omitted). |
These are typically combined with --idx to do focused per-episode swaps.
Output and logging¶
| CLI flag | run_evaluation arg |
Effect |
|---|---|---|
--output_dir |
output_dir |
Root output directory. Final path is <output_dir>/<config>/<timestamp>. |
--num_workers |
num_workers |
Multiprocessing workers (one process per worker). |
--no_wandb |
use_wandb=False |
Disable wandb logging. |
--wandb_project |
wandb_project |
wandb project name (default: mlspaces-json-eval). |
Eval config (your own subclass of JsonBenchmarkEvalConfig)¶
Your eval config controls everything that is not part of the per-episode JSON:
policy_config: model class/factory, checkpoint,camera_names,action_move_group_names,action_spec. Setpolicy_config.force_enable_depth = Trueto require depth from every camera — at JSON eval timeJsonEvalTaskSamplerwill setrecord_depth = Trueon all cameras automatically, which is the lightweight alternative to a robot eval override when depth is the only thing you need to flip.robot_config: which*RobotConfigsubclass — also drives whether arobot_eval_overridefrom the registry is applied.policy_dt_ms,ctrl_dt_ms,sim_dt_ms: timing rates. The same benchmark can be replayed at different rates by changing these.end_on_success: stop the rollout as soon as the task is judged successful (e.g.PiPolicyEvalConfig).- Anything else inherited from
MlSpacesExpConfig(renderer, profiling knobs, etc.).
See molmo_spaces/evaluation/configs/evaluation_configs.py for examples
(PiPolicyEvalConfig, CAPPolicyEvalConfig, TeleopPolicyEvalConfig).
Robot eval overrides (advanced)¶
If you need to tweak environment setup specifically at evaluation time for a
particular robot, register an OverrideFn in
molmo_spaces/evaluation/robot_eval_overrides.py:
def my_robot_eval_override(episode_spec, exp_config):
exp_config.camera_config.cameras[0].record_depth = True
episode_spec.task["robot_base_pose"][2] -= 0.05
ROBOT_OVERRIDE_REGISTRY = {
MyRobotConfig: my_robot_eval_override,
}
# or, from another module:
# register_robot_override(MyRobotConfig, my_robot_eval_override)
The override is selected by the robot config class (with MRO traversal so
subclasses inherit), applied per episode by JsonEvalTaskSampler, and only
active during JSON evaluation.
Where to look in code¶
molmo_spaces/evaluation/eval_main.py— entrypoints, config building, result collection.molmo_spaces/evaluation/json_eval_runner.py—JsonEvalRunnerand the hook overrides on top ofParallelRolloutRunner.molmo_spaces/evaluation/robot_eval_overrides.py— robot-specific eval overrides registry.molmo_spaces/tasks/json_eval_task_sampler.py— per-episodeJsonEvalTaskSampler(scene modifications, cameras, task class loading).molmo_spaces/evaluation/benchmark_schema.py—EpisodeSpecand all related Pydantic schemas.molmo_spaces/data_generation/pipeline.py— baseParallelRolloutRunner.process_single_houseloop the JSON runner plugs into.molmo_spaces/utils/eval_camera_randomization_utils.py—--use_eval_cameras/--camera_rand_levelCLI handling and the spherical perturbation logic.molmo_spaces/evaluation/configs/evaluation_configs.py— example eval configs (JsonBenchmarkEvalConfig,PiPolicyEvalConfig, etc.).