evaluation
Evaluation utilities for MolmoSpaces benchmarks.
Programmatic usage

    from molmo_spaces.evaluation import run_evaluation
    from my_repo.configs import MyEvalConfig  # your own MlSpacesExpConfig subclass

    results = run_evaluation(
        eval_config_cls=MyEvalConfig,
        benchmark_dir="/path/to/benchmark",
        checkpoint_path="/path/to/checkpoint",
    )
    print(f"Success rate: {results.success_rate:.1%}")

See run_evaluation() for full documentation.
Modules:
| Name | Description |
|---|---|
| benchmark_schema | JSON-based benchmark schema definitions. |
| configs | |
| eval_main | Evaluation entrypoint for learned policies on JSON-based benchmarks. |
| json_eval_runner | JSON-based benchmark evaluation runner. |
| policy_server | Modified from: https://github.com/Physical-Intelligence/openpi/blob/main/src/openpi/serving/websocket_policy_server.py |
| robot_eval_overrides | |
Classes:
| Name | Description |
|---|---|
| BaseTaskSpec | Base task specification with fields common to all task types. |
| BenchmarkMetadata | Optional metadata for a benchmark directory. |
| EpisodeSpec | Complete specification for a single benchmark episode. |
| EvaluationResults | Results from running an evaluation on a benchmark. |
| ExocentricCameraSpec | Specification for an exocentric (fixed) camera. |
| JsonEvalRunner | Evaluation runner for JSON-based benchmarks. |
| LanguageSpec | Natural language task specification. |
| NavToObjTaskSpec | Task-specific parameters for navigation to object tasks. |
| OpenCloseTaskSpec | Task-specific parameters for open/close tasks. |
| PickAndPlaceTaskSpec | Task-specific parameters for pick and place tasks. |
| PickTaskSpec | Task-specific parameters for pick tasks. |
| RobotMountedCameraSpec | Specification for a camera mounted on the robot. |
| RobotSpec | Robot initialization specification. |
| SceneModificationsSpec | Scene modifications required for this episode. |
| SourceSpec | Provenance information for this episode. |
Functions:
| Name | Description |
|---|---|
| load_all_episodes | Load all episodes from a benchmark directory as a flat list. |
| load_benchmark | Load a benchmark directory. |
| run_evaluation | Run evaluation on a JSON benchmark programmatically. |
Attributes:
| Name | Type | Description |
|---|---|---|
| CameraSpec | | |
| TaskSpec | | |
TaskSpec (module-attribute)
TaskSpec = PickTaskSpec | PickAndPlaceTaskSpec | PickAndPlaceColorTaskSpec | PickAndPlaceNextToTaskSpec | OpenCloseTaskSpec | NavToObjTaskSpec | DoorOpeningTaskSpec
__all__ (module-attribute)
__all__ = ['run_evaluation', 'EvaluationResults', 'JsonEvalRunner', 'BaseTaskSpec', 'BenchmarkMetadata', 'CameraSpec', 'EpisodeSpec', 'ExocentricCameraSpec', 'LanguageSpec', 'NavToObjTaskSpec', 'OpenCloseTaskSpec', 'PickAndPlaceTaskSpec', 'PickTaskSpec', 'RobotMountedCameraSpec', 'RobotSpec', 'SceneModificationsSpec', 'SourceSpec', 'TaskSpec', 'load_all_episodes', 'load_benchmark']
BaseTaskSpec
Bases: BaseModel
Base task specification with fields common to all task types.
robot_base_pose is the authoritative field for robot world placement. This comes from task_config in the codebase, not robot_config.
task_cls is the authoritative identifier for the task type. The eval task sampler is responsible for interpreting task_cls and creating the appropriate task. task_type is optional and for human convenience only.
Attributes:
| Name | Type | Description |
|---|---|---|
| robot_base_pose | list[float] | |
| task_cls | str | |
| task_type | str \| None | |
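The split between the authoritative `task_cls` and the optional `task_type` can be sketched with a small stand-in. This is only an illustration under stated assumptions: it uses a plain dataclass rather than the real Pydantic model, the `None` default for `task_type` is assumed, and the class path and pose values are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TaskSpecSketch:
    """Stand-in for BaseTaskSpec; the real class is a Pydantic model."""

    robot_base_pose: list[float]  # authoritative robot world placement
    task_cls: str                 # authoritative task identifier
    task_type: str | None = None  # human convenience only (default is assumed)


def task_identifier(spec: TaskSpecSketch) -> str:
    """Return the identifier a task sampler would dispatch on."""
    # task_type is deliberately ignored: only task_cls selects the task.
    return spec.task_cls


spec = TaskSpecSketch(
    robot_base_pose=[0.0, 0.0, 0.0],   # hypothetical pose values
    task_cls="my_pkg.tasks.PickTask",  # hypothetical class path
    task_type="pick",
)
print(task_identifier(spec))
```

Changing or omitting `task_type` leaves the result unchanged, which is the behaviour the description above implies.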
BenchmarkMetadata
Bases: BaseModel
Optional metadata for a benchmark directory.
This is NOT required - each episode is fully self-contained. This file provides optional human-readable metadata about the benchmark.
Classes:
| Name | Description |
|---|---|
| Config | |

Methods:
| Name | Description |
|---|---|
| from_json_file | Load benchmark metadata from a JSON file. |
| to_json_file | Save the benchmark metadata to a JSON file. |

Attributes:
| Name | Type | Description |
|---|---|---|
| benchmark_created_date | str \| None | |
| camera_system_class | str \| None | |
| created_at | str \| None | |
| description | str \| None | |
| episode_length_stats | dict[str, float] \| None | |
| house_counts | dict[int, int] \| None | |
| num_episodes | int \| None | |
| num_houses | int \| None | |
| object_category_counts | dict[str, int] \| None | |
| robot_counts | dict[str, int] \| None | |
| source_data_date | str \| None | |
| source_datagen_path | str \| None | |
| task_cls_counts | dict[str, int] \| None | |
benchmark_created_date (class-attribute, instance-attribute)
episode_length_stats (class-attribute, instance-attribute)
object_category_counts (class-attribute, instance-attribute)
Config
from_json_file (classmethod)
from_json_file(path: str | Path) -> BenchmarkMetadata
Load benchmark metadata from a JSON file.
Source code in molmo_spaces/evaluation/benchmark_schema.py
to_json_file
Save the benchmark metadata to a JSON file.
Source code in molmo_spaces/evaluation/benchmark_schema.py
EpisodeSpec
Bases: BaseModel
Complete specification for a single benchmark episode.
This is a FULLY SELF-CONTAINED specification - no external config needed. Contains all information needed to recreate the exact initial conditions for an episode: scene, robot, cameras, and task parameters.
NOTE: Timing/execution parameters (policy_dt_ms, ctrl_dt_ms, sim_dt_ms, task_horizon) are NOT stored per-episode. They come from the evaluation config or command line.
A benchmark is simply a list of EpisodeSpec objects in a single JSON file.
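Because timing parameters are supplied by the evaluation config or the command line, an episode dict should not carry them. A minimal check for that could look like the sketch below; it assumes the keys would appear at the top level of the episode dict, and the sample episode values are hypothetical.

```python
# Timing/execution parameters that are NOT stored per episode.
TIMING_KEYS = frozenset({"policy_dt_ms", "ctrl_dt_ms", "sim_dt_ms", "task_horizon"})


def stray_timing_keys(episode: dict) -> set[str]:
    """Return any timing keys found at the top level of an episode dict."""
    return set(TIMING_KEYS.intersection(episode))


clean_episode = {"house_index": 5, "seed": 0}            # hypothetical values
stale_episode = {"house_index": 5, "task_horizon": 500}  # hypothetical values

print(stray_timing_keys(clean_episode))
print(stray_timing_keys(stale_episode))
```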
Classes:
| Name | Description |
|---|---|
| Config | |

Methods:
| Name | Description |
|---|---|
| from_json_file | Load an episode spec from a JSON file. |
| get_task_cls | Get fully qualified task class name from task dict (authoritative identifier). |
| get_task_type | Get optional human-readable task type from task dict. |
| to_json_file | Save the episode spec to a JSON file. |

Attributes:
| Name | Type | Description |
|---|---|---|
| cameras | list[CameraSpec] | |
| data_split | str | |
| house_index | int | |
| img_resolution | tuple[int, int] | |
| language | LanguageSpec | |
| robot | RobotSpec | |
| scene_dataset | str | |
| scene_modifications | SceneModificationsSpec | |
| seed | int \| None | |
| source | SourceSpec \| None | |
| task | dict | |
| task_relevant_objects | list[str] | |
cameras (class-attribute, instance-attribute)
cameras: list[CameraSpec] = Field(default_factory=list)
scene_modifications (class-attribute, instance-attribute)
scene_modifications: SceneModificationsSpec = Field(default_factory=SceneModificationsSpec)
task_relevant_objects (class-attribute, instance-attribute)
Config
from_json_file (classmethod)
from_json_file(path: str | Path) -> EpisodeSpec
Load an episode spec from a JSON file.
Source code in molmo_spaces/evaluation/benchmark_schema.py
get_task_cls
Get fully qualified task class name from task dict (authoritative identifier).
Source code in molmo_spaces/evaluation/benchmark_schema.py
get_task_type
to_json_file
Save the episode spec to a JSON file.
Source code in molmo_spaces/evaluation/benchmark_schema.py
EvaluationResults (dataclass)
EvaluationResults(success_count: int, total_count: int, output_dir: Path, episode_results: list[EpisodeResult] = list(), exp_config: MlSpacesExpConfig | None = None)
Results from running an evaluation on a benchmark.
Attributes:
| Name | Type | Description |
|---|---|---|
| success_count | int | Number of successful episodes |
| total_count | int | Total number of episodes evaluated |
| output_dir | Path | Path where evaluation outputs were saved |
| episode_results | list[EpisodeResult] | Per-episode results with details |
| exp_config | MlSpacesExpConfig \| None | The experiment config used for evaluation |
episode_results (class-attribute, instance-attribute)
episode_results: list[EpisodeResult] = field(default_factory=list)
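The usage examples on this page read `results.success_rate`, which does not appear in the attribute table, so it is presumably derived from the two counters. The sketch below shows one plausible derivation; the class name, the property body, and the zero-episode behaviour are assumptions, not the library's implementation.

```python
from dataclasses import dataclass


@dataclass
class ResultsSketch:
    """Stand-in holding only the two counters from EvaluationResults."""

    success_count: int
    total_count: int

    @property
    def success_rate(self) -> float:
        # Guard against an empty run to avoid ZeroDivisionError (assumed behaviour).
        if self.total_count == 0:
            return 0.0
        return self.success_count / self.total_count


results = ResultsSketch(success_count=3, total_count=4)
print(f"Success rate: {results.success_rate:.1%}")
```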
ExocentricCameraSpec
JsonEvalRunner
Bases: ParallelRolloutRunner
Evaluation runner for JSON-based benchmarks.
This runner differs from the standard ParallelRolloutRunner in several ways:
1. Episodes are loaded from JSON files, not from H5 frozen configs
2. Each episode is fully self-contained (timing, cameras, task config)
3. Task samplers are created per-episode to support mixed task types
4. Uses patch_config to add evaluation-specific runtime parameters
The runner inherits process_single_house from ParallelRolloutRunner and customizes behavior by overriding hook methods.
Initialize the JSON eval runner.
The benchmark is authoritative - all episode data comes from the JSON files. No fallbacks or defaults; missing data is an error.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| exp_config | MlSpacesExpConfig | Base experiment config (provides robot_config, policy_config) | required |
| benchmark_dir | Path | Path to benchmark directory containing benchmark.json | required |
Methods:
| Name | Description |
|---|---|
| adjust_robot | Apply robot-specific evaluation overrides if configured. |
| get_episode_seed | Get seed from episode spec, falling back to index. |
| get_episode_spec_at_index | Get episode specification at given index. |
| get_episode_task_sampler | Create per-episode JsonEvalTaskSampler. |
| get_episodes_for_house | Get all episode specs for a given house. |
| get_max_episode_attempts | Process all episodes in the benchmark - no retry multiplier. |
| load_episodes_for_house | Load episode specifications from JSON benchmark. |
| patch_config | Patch evaluation config with runtime evaluation-specific parameters. |
| prepare_episode_config | Prepare episode-specific config from JSON spec. |
| process_single_house | Process all episodes for a single house using customizable hooks. |
| run | Run house-by-house rollouts using multiprocessing workers. |
| run_single_rollout | Execute a single rollout with the given task and policy. |
| sample_task_from_spec | Sample task - episode spec is already in the JsonEvalTaskSampler. |
| should_close_episode_task_sampler | Close task sampler after each episode - we create per-episode. |
| should_stop_early | Stop early if evaluating a single episode (--idx provided) and it's been collected. |
Attributes:
Source code in molmo_spaces/evaluation/json_eval_runner.py
max_allowed_sequential_irrecoverable_failures (instance-attribute)
max_allowed_sequential_rollout_failures (instance-attribute)
max_allowed_sequential_task_sampler_failures (instance-attribute)
adjust_robot (staticmethod)
Apply robot-specific evaluation overrides if configured.
Source code in molmo_spaces/evaluation/json_eval_runner.py
get_episode_seed (staticmethod)
get_episode_seed(episode_idx: int, episode_spec: EpisodeSpec, task_sampler: JsonEvalTaskSampler) -> int
Get seed from episode spec, falling back to index.
Source code in molmo_spaces/evaluation/json_eval_runner.py
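The "falling back to index" rule can be sketched as follows. The helper name and its reduced signature are hypothetical; the point is that the check should be against `None`, so that an explicit seed of 0 is still honoured.

```python
from __future__ import annotations


def episode_seed_sketch(episode_idx: int, spec_seed: int | None) -> int:
    """Prefer the seed stored in the episode spec; otherwise use the index."""
    # `spec_seed or episode_idx` would wrongly discard a stored seed of 0.
    return spec_seed if spec_seed is not None else episode_idx


print(episode_seed_sketch(5, None))
print(episode_seed_sketch(5, 0))
```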
get_episode_spec_at_index (staticmethod)
get_episode_spec_at_index(episode_specs: list[EpisodeSpec], idx: int) -> EpisodeSpec
get_episode_task_sampler (staticmethod)
get_episode_task_sampler(exp_config: MlSpacesExpConfig, episode_spec: EpisodeSpec, shared_task_sampler, datagen_profiler: DatagenProfiler | None) -> JsonEvalTaskSampler
Create per-episode JsonEvalTaskSampler.
Source code in molmo_spaces/evaluation/json_eval_runner.py
get_episodes_for_house
get_episodes_for_house(house_id: int) -> list[EpisodeSpec]
Get all episode specs for a given house.
Source code in molmo_spaces/evaluation/json_eval_runner.py
get_max_episode_attempts (staticmethod)
get_max_episode_attempts(episode_specs: list[EpisodeSpec], samples_per_house: int, exp_config: MlSpacesExpConfig) -> int
Process all episodes in the benchmark - no retry multiplier.
Source code in molmo_spaces/evaluation/json_eval_runner.py
load_episodes_for_house (staticmethod)
load_episodes_for_house(exp_config: MlSpacesExpConfig, house_id: int, batch_suffix: str, worker_task_sampler, worker_logger) -> tuple[list[EpisodeSpec], None]
Load episode specifications from JSON benchmark.
Source code in molmo_spaces/evaluation/json_eval_runner.py
patch_config (staticmethod)
patch_config(exp_config: MlSpacesExpConfig, episode_idx: int | None = None, max_episodes: int | None = None, add_custom_object: bool = False, custom_object_path: str | Path | None = None, custom_object_name: str | None = None) -> MlSpacesExpConfig
Patch evaluation config with runtime evaluation-specific parameters.
This method modifies the config object to store evaluation-specific runtime parameters that are not part of the base config schema. These parameters are used by the evaluation runner to customize episode processing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| exp_config | MlSpacesExpConfig | The experiment config to patch | required |
| episode_idx | int \| None | Optional index of a specific episode to evaluate. If provided, only that episode will be evaluated and the process will stop after it. | None |
| max_episodes | int \| None | Optional maximum number of episodes to evaluate. If provided, only the episodes for the houses used in the first N episodes will be evaluated. Note that the final number of episodes can differ from N if more than one episode is sampled for any of the houses among the first N episodes. | None |
| add_custom_object | bool | Whether to replace the target object with a custom object. | False |
| custom_object_path | str \| Path \| None | Path to the custom object XML file. Required if add_custom_object is True. | None |
| custom_object_name | str \| None | Natural language name for the custom object (e.g., 'lemon', 'cup'). | None |

Returns:
| Type | Description |
|---|---|
| MlSpacesExpConfig | The patched config (same object, modified in place) |

Note
These parameters are stored in an EvalRuntimeParams dataclass attached to the config object as exp_config.eval_runtime_params for access by worker processes. They are not part of the base MlSpacesExpConfig schema but are necessary for runtime evaluation customization.
Source code in molmo_spaces/evaluation/json_eval_runner.py
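The `max_episodes` behaviour described above (selection by house, not by episode) can be illustrated with a short sketch. The function name and the list-of-house-ids representation are hypothetical; only the selection rule is taken from the parameter description.

```python
from __future__ import annotations


def select_episode_indices(house_ids: list[int], max_episodes: int | None) -> list[int]:
    """Return indices of the episodes that would be evaluated.

    house_ids[i] is the house of episode i, in benchmark order.
    """
    if max_episodes is None:
        return list(range(len(house_ids)))
    # Houses touched by the first N episodes are kept in full.
    kept_houses = set(house_ids[:max_episodes])
    return [i for i, house in enumerate(house_ids) if house in kept_houses]


house_ids = [5, 5, 7, 5, 9]  # hypothetical benchmark layout
selected = select_episode_indices(house_ids, max_episodes=2)
print(selected)
```

Here the first two episodes both belong to house 5, so every episode of house 5 is kept and the final count is three rather than two.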
prepare_episode_config (staticmethod)
prepare_episode_config(exp_config: MlSpacesExpConfig, episode_spec: EpisodeSpec, episode_idx: int) -> MlSpacesExpConfig
Prepare episode-specific config from JSON spec.
Note: task_horizon is NOT read from episode_spec. It's an evaluation parameter that comes from exp_config (set via command line or defaults).
Source code in molmo_spaces/evaluation/json_eval_runner.py
process_single_house (staticmethod)
process_single_house(worker_id: int, worker_logger, house_id: int, exp_config: MlSpacesExpConfig, samples_per_house: int, shutdown_event, task_sampler, preloaded_policy: BasePolicy | None = None, max_allowed_sequential_task_sampler_failures: int = 10, max_allowed_sequential_rollout_failures: int = 10, filter_for_successful_trajectories: bool = False, runner_class=None, batch_num: int | None = None, total_batches: int | None = None, datagen_profiler: DatagenProfiler | None = None) -> tuple[int, int, bool]
Process all episodes for a single house using customizable hooks.
This method uses a while loop to iterate over episodes, calling hook methods via runner_class to allow subclasses to customize behavior without duplicating the entire method.
Hooks called (override in subclass to customize):
- load_episodes_for_house: Load episode specs from source (JSON, etc.)
- get_max_episode_attempts: Maximum iterations of the episode loop
- should_stop_early: Whether to stop before max attempts (e.g., enough successes)
- prepare_episode_config: Modify config per-episode
- get_episode_task_sampler: Get/create task sampler for episode
- sample_task_from_spec: Sample task from specification
- get_episode_seed: Get seed for episode
- should_close_episode_task_sampler: Whether to close sampler per-episode
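The hook mechanism can be sketched as a small template-method loop in which static hooks are looked up on `runner_class`. Everything here is a simplified stand-in: the class names, the reduced signatures, the base-class retry multiplier of 2, and the "rollout" that always succeeds are assumptions made for illustration only.

```python
class BaseRunnerSketch:
    """Simplified stand-in for a runner whose loop calls overridable hooks."""

    @staticmethod
    def load_episodes_for_house(house_id):
        return ["ep0", "ep1", "ep2"]  # hypothetical episode specs

    @staticmethod
    def get_max_episode_attempts(episodes, samples_per_house):
        return samples_per_house * 2  # assumed retry multiplier

    @staticmethod
    def should_stop_early(num_collected, samples_per_house):
        return num_collected >= samples_per_house

    @staticmethod
    def process_single_house(house_id, samples_per_house, runner_class=None):
        runner = runner_class or BaseRunnerSketch
        episodes = runner.load_episodes_for_house(house_id)
        max_attempts = runner.get_max_episode_attempts(episodes, samples_per_house)
        collected = 0
        attempt = 0
        while attempt < max_attempts:
            if runner.should_stop_early(collected, samples_per_house):
                break
            collected += 1  # stand-in for one rollout
            attempt += 1
        return collected


class JsonRunnerSketch(BaseRunnerSketch):
    """Overrides two hooks; the loop itself is inherited unchanged."""

    @staticmethod
    def get_max_episode_attempts(episodes, samples_per_house):
        return len(episodes)  # each benchmark episode once, no retry multiplier

    @staticmethod
    def should_stop_early(num_collected, samples_per_house):
        return False


print(BaseRunnerSketch.process_single_house(0, samples_per_house=2))
print(BaseRunnerSketch.process_single_house(0, 2, runner_class=JsonRunnerSketch))
```

Passing the subclass as `runner_class` changes how many episodes are processed without duplicating the loop, which is the design the description above motivates.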
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| worker_id | int | ID of the worker thread/process | required |
| worker_logger | | Logger instance for this worker | required |
| house_id | int | Index of the house to process | required |
| exp_config | MlSpacesExpConfig | Experiment configuration | required |
| samples_per_house | int | Number of episodes to collect for this house | required |
| shutdown_event | | Event to signal shutdown | required |
| task_sampler | | Task sampler instance (shared across houses for this worker) | required |
| preloaded_policy | BasePolicy \| None | Optional pre-initialized policy instance | None |
| max_allowed_sequential_task_sampler_failures | int | Max consecutive task sampling failures | 10 |
| max_allowed_sequential_rollout_failures | int | Max consecutive rollout failures | 10 |
| filter_for_successful_trajectories | bool | Whether to filter for successful trajectories only | False |
| runner_class | | Runner class with hook methods to call | None |
| batch_num | int \| None | Batch number for this house (for batched processing) | None |
| total_batches | int \| None | Total number of batches for this house | None |
| datagen_profiler | DatagenProfiler \| None | DatagenProfiler for per-worker timing (optional) | None |

Returns:
| Name | Type | Description |
|---|---|---|
| tuple | tuple[int, int, bool] | (house_success_count, house_total_count, irrecoverable_failure_flag) |
Source code in molmo_spaces/data_generation/pipeline.py
run
run(preloaded_policy: BasePolicy | None = None) -> tuple[int, int]
Run house-by-house rollouts using multiprocessing workers.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| preloaded_policy | BasePolicy \| None | Optional pre-initialized policy instance to use for rollouts. If None, a new policy will be created for each rollout. | None |

Returns:
| Name | Type | Description |
|---|---|---|
| tuple | tuple[int, int] | (success_count, total_count) |
Source code in molmo_spaces/data_generation/pipeline.py
run_single_rollout (staticmethod)
run_single_rollout(episode_seed: int, task: BaseMujocoTask, policy: Any, profiler: Profiler | None = None, viewer=None, shutdown_event=None, datagen_profiler: DatagenProfiler | None = None, end_on_success: bool = False) -> bool
Execute a single rollout with the given task and policy.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| episode_seed | int | Seed for this episode | required |
| task | BaseMujocoTask | The task to run | required |
| policy | Any | Policy to use for action selection | required |
| profiler | Profiler \| None | Legacy Profiler instance (optional) | None |
| viewer | | MuJoCo viewer for visualization (optional) | None |
| shutdown_event | | Event to signal shutdown (optional) | None |
| datagen_profiler | DatagenProfiler \| None | DatagenProfiler for per-worker timing (optional) | None |

Returns:
| Name | Type | Description |
|---|---|---|
| bool | bool | Whether the episode was successful |
Source code in molmo_spaces/data_generation/pipeline.py
sample_task_from_spec (staticmethod)
sample_task_from_spec(task_sampler: JsonEvalTaskSampler, house_id: int, episode_spec: EpisodeSpec, episode_idx: int) -> BaseMujocoTask | None
Sample task - episode spec is already in the JsonEvalTaskSampler.
Source code in molmo_spaces/evaluation/json_eval_runner.py
should_close_episode_task_sampler (staticmethod)
should_stop_early (staticmethod)
should_stop_early(num_collected: int, samples_per_house: int, exp_config: MlSpacesExpConfig | None = None) -> bool
Stop early if evaluating a single episode (--idx provided) and it's been collected.
Source code in molmo_spaces/evaluation/json_eval_runner.py
LanguageSpec
Bases: BaseModel
Natural language task specification.
Attributes:
| Name | Type | Description |
|---|---|---|
| referral_expressions | dict[str, str] | |
| referral_expressions_priority | dict[str, list[list[float \| str]]] | |
| task_description | str | |
NavToObjTaskSpec
Bases: BaseTaskSpec
Task-specific parameters for navigation to object tasks.
Attributes:
| Name | Type | Description |
|---|---|---|
| pickup_obj_candidates | list[str] \| None | |
| pickup_obj_name | str | |
| pickup_obj_start_pose | list[float] \| None | |
| receptacle_name | str \| None | |
| robot_base_pose | list[float] | |
| succ_pos_threshold | float | |
| task_cls | str | |
| task_type | str \| None | |
OpenCloseTaskSpec
Bases: BaseTaskSpec
Task-specific parameters for open/close tasks.
Attributes:
| Name | Type | Description |
|---|---|---|
| any_inst_of_category | bool | |
| articulation_object_name | str \| None | |
| joint_goal_position | float \| None | |
| joint_index | int | |
| joint_name | str | |
| joint_start_position | float \| list[float] | |
| pickup_obj_name | str | |
| pickup_obj_start_pose | list[float] | |
| robot_base_pose | list[float] | |
| task_cls | str | |
| task_success_threshold | float | |
| task_type | str \| None | |
PickAndPlaceTaskSpec
Bases: PickTaskSpec
Task-specific parameters for pick and place tasks.
Attributes:
| Name | Type | Description |
|---|---|---|
| max_place_receptacle_pos_displacement | float | |
| max_place_receptacle_rot_displacement | float | |
| pickup_obj_goal_pose | list[float] \| None | |
| pickup_obj_name | str | |
| pickup_obj_start_pose | list[float] | |
| place_receptacle_name | str | |
| place_receptacle_start_pose | list[float] | |
| receptacle_name | str \| None | |
| receptacle_supported_weight_frac | float | |
| robot_base_pose | list[float] | |
| succ_pos_threshold | float | |
| task_cls | str | |
| task_type | str \| None | |

max_place_receptacle_pos_displacement (class-attribute, instance-attribute)
max_place_receptacle_rot_displacement (class-attribute, instance-attribute)
pickup_obj_goal_pose (class-attribute, instance-attribute)
pickup_obj_start_pose (class-attribute, instance-attribute)
place_receptacle_start_pose (class-attribute, instance-attribute)
receptacle_supported_weight_frac (class-attribute, instance-attribute)
robot_base_pose (class-attribute, instance-attribute)
PickTaskSpec
Bases: BaseTaskSpec
Task-specific parameters for pick tasks.
Attributes:
| Name | Type | Description |
|---|---|---|
| pickup_obj_goal_pose | list[float] \| None | |
| pickup_obj_name | str | |
| pickup_obj_start_pose | list[float] | |
| receptacle_name | str \| None | |
| robot_base_pose | list[float] | |
| succ_pos_threshold | float | |
| task_cls | str | |
| task_type | str \| None | |
RobotMountedCameraSpec
Bases: BaseModel
Specification for a camera mounted on the robot.
Attributes:
| Name | Type | Description |
|---|---|---|
| camera_offset | list[float] | |
| camera_quaternion | list[float] | |
| fov | float | |
| lookat_offset | list[float] | |
| name | str | |
| record_depth | bool | |
| reference_body_names | list[str] | |
| type | Literal['robot_mounted'] | |
RobotSpec
Bases: BaseModel
Robot initialization specification.
Note: Robot world placement is in task.robot_base_pose, not here. This spec only contains robot-intrinsic state (joint positions).
Attributes:
| Name | Type | Description |
|---|---|---|
| init_qpos | dict[str, list[float]] | |
| robot_name | str | |
SceneModificationsSpec
Bases: BaseModel
Scene modifications required for this episode.
This captures objects that need to be added to the base scene XML and their initial poses.
Attributes:
| Name | Type | Description |
|---|---|---|
| added_objects | dict[str, str] | |
| object_poses | dict[str, list[float]] | |
| removed_objects | list[str] | |
SourceSpec
Bases: BaseModel
Provenance information for this episode.
Tracks where this episode specification came from (which H5 file and trajectory).
Attributes:
| Name | Type | Description |
|---|---|---|
| benchmark_created_date | str \| None | |
| camera_system_class | str \| None | |
| episode_length | int \| None | |
| h5_file | str | |
| source_data_date | str \| None | |
| traj_key | str | |

benchmark_created_date (class-attribute, instance-attribute)
load_all_episodes
load_all_episodes(benchmark_dir: Path) -> list[EpisodeSpec]
Load all episodes from a benchmark directory as a flat list.
Supports two formats:
1. Single benchmark.json file (preferred): List of EpisodeSpec dicts
2. Legacy house_*/episode_*.json structure
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| benchmark_dir | Path | Path to benchmark directory | required |

Returns:
| Type | Description |
|---|---|
| list[EpisodeSpec] | List of EpisodeSpec objects |
Source code in molmo_spaces/evaluation/benchmark_schema.py
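The two supported layouts can be handled with a small loader along these lines. This is a sketch under assumptions, not the library's code: it returns plain dicts instead of EpisodeSpec objects, the function name is hypothetical, and the legacy glob pattern is inferred from the directory structure shown elsewhere on this page.

```python
import json
import tempfile
from pathlib import Path


def load_episode_dicts(benchmark_dir: Path) -> list[dict]:
    """Load raw episode dicts, preferring a single benchmark.json."""
    single = benchmark_dir / "benchmark.json"
    if single.is_file():
        # Preferred format: one JSON file holding a list of episode dicts.
        return json.loads(single.read_text())
    # Legacy format: one JSON file per episode under house_* directories.
    episodes = []
    for path in sorted(benchmark_dir.glob("house_*/episode_*.json")):
        episodes.append(json.loads(path.read_text()))
    return episodes


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "house_5").mkdir()
    (root / "house_5" / "episode_00000000.json").write_text(
        json.dumps({"house_index": 5})
    )
    legacy = load_episode_dicts(root)

    # Once benchmark.json exists it takes precedence over the legacy files.
    (root / "benchmark.json").write_text(
        json.dumps([{"house_index": 1}, {"house_index": 2}])
    )
    preferred = load_episode_dicts(root)

print(len(legacy), len(preferred))
```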
load_benchmark
load_benchmark(benchmark_dir: Path) -> tuple[BenchmarkMetadata | None, dict[int, list[Path]]]
Load a benchmark directory.
A benchmark is simply a directory of episode JSON files. Each episode is fully self-contained. An optional benchmark_metadata.json provides human-readable info but is not required.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| benchmark_dir | Path | Path to benchmark directory containing house_* subdirectories with episode JSON files. May optionally contain benchmark_metadata.json. | required |

Returns:
| Type | Description |
|---|---|
| tuple[BenchmarkMetadata \| None, dict[int, list[Path]]] | Tuple of (BenchmarkMetadata or None, dict mapping house_id -> list of episode JSON paths) |
Source code in molmo_spaces/evaluation/benchmark_schema.py
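The return shape (optional metadata plus a house-to-paths mapping) can be sketched as below. The function name is hypothetical, metadata is returned as a raw dict rather than a BenchmarkMetadata instance, and parsing the house id from the directory name with a regular expression is an assumption about how ids are derived.

```python
from __future__ import annotations

import json
import re
import tempfile
from pathlib import Path


def load_benchmark_sketch(benchmark_dir: Path) -> tuple[dict | None, dict[int, list[Path]]]:
    """Return (metadata or None, mapping of house id to episode JSON paths)."""
    meta_path = benchmark_dir / "benchmark_metadata.json"
    # Metadata is optional: its absence is not an error.
    metadata = json.loads(meta_path.read_text()) if meta_path.is_file() else None

    houses: dict[int, list[Path]] = {}
    for house_dir in sorted(benchmark_dir.glob("house_*")):
        match = re.fullmatch(r"house_(\d+)", house_dir.name)
        if match is None or not house_dir.is_dir():
            continue  # skip anything that is not a house_<id> directory
        houses[int(match.group(1))] = sorted(house_dir.glob("*.json"))
    return metadata, houses


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "house_5").mkdir()
    for name in ("episode_00000000.json", "episode_00000001.json"):
        (root / "house_5" / name).write_text("{}")
    metadata, episodes_by_house = load_benchmark_sketch(root)
    episode_counts = {house: len(paths) for house, paths in episodes_by_house.items()}

print(metadata, episode_counts)
```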
run_evaluation
run_evaluation(eval_config_cls: type[MlSpacesExpConfig] | str, benchmark_dir: Path, checkpoint_path: str | None = None, task_horizon_steps: int | None = None, task_horizon_sec: float | None = None, output_dir: str | Path | None = None, num_workers: int = 1, use_wandb: bool = False, wandb_project: str = 'mlspaces-online-eval', preloaded_policy: BasePolicy | None = None, max_episodes: int | None = None, camera_config_override: Any | None = None, camera_names_override: list[str] | None = None, environment_light_intensity: float | None = None, episode_idx: int | None = None, add_custom_object: bool = False, custom_object_path: str | Path | None = None, custom_object_name: str | None = None) -> EvaluationResults
Run evaluation on a JSON benchmark programmatically.
This is the primary entry point for running evaluations from external code. It can be imported and called directly without using command-line arguments.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| eval_config_cls | type[MlSpacesExpConfig] \| str | Either an MlSpacesExpConfig subclass, or a string in the format "module.path:ClassName" (e.g., "myrepo.configs:MyEvalConfig"). | required |
| benchmark_dir | Path | Path to JSON benchmark directory containing benchmark.json. | required |
| checkpoint_path | str \| None | Path to model checkpoint. Overrides the checkpoint in policy_config. | None |
| task_horizon_steps | int \| None | Max steps per episode. If None, uses default for the task class. | None |
| task_horizon_sec | float \| None | Max seconds per episode, used to calculate horizon in steps. Cannot be used with task_horizon_steps. | None |
| output_dir | str \| Path \| None | Output directory for results. Defaults to eval_output/ | None |
| num_workers | int | Number of parallel worker processes. | 1 |
| use_wandb | bool | Whether to log results to Weights & Biases. | False |
| wandb_project | str | W&B project name (only used if use_wandb=True). | 'mlspaces-online-eval' |
| preloaded_policy | BasePolicy \| None | Optional pre-initialized policy instance. If provided, skips policy creation from config. | None |
| max_episodes | int \| None | Maximum number of episodes to evaluate from benchmark. If None, evaluates all episodes. | None |
| camera_config_override | Any \| None | Optional camera system config (e.g. FrankaEvalCameraSystem) to replace the default camera_config on the experiment config. | None |
| camera_names_override | list[str] \| None | Optional list of camera names to override policy_config.camera_names (e.g. ["randomized_zed2_analogue_1", "wrist_camera"]). | None |
| episode_idx | int \| None | Index of a specific episode to evaluate. If None, evaluates all episodes. | None |
| add_custom_object | bool | Whether to replace the target object with a custom object. | False |
| custom_object_path | str \| Path \| None | Path to the custom object XML file. Required if add_custom_object is True. | None |
| custom_object_name | str \| None | Natural language name for the custom object (e.g., 'lemon', 'cup'). If not provided, will attempt to extract from the object path. | None |

Returns:
| Type | Description |
|---|---|
| EvaluationResults | EvaluationResults containing success counts, output paths, and per-episode details. |

Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If benchmark_dir doesn't exist. |
| ValueError | If no episodes found in benchmark or config class not found. |
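The "module.path:ClassName" string form accepted by eval_config_cls can be resolved with the standard library roughly as follows. The helper name is hypothetical and this is a sketch of the general technique, not the resolver used by run_evaluation; raising ValueError for an unknown class mirrors the Raises table but the exact messages are invented.

```python
import importlib


def resolve_class(spec):
    """Return a class from either a class object or a 'module.path:ClassName' string."""
    if not isinstance(spec, str):
        return spec  # already a class, pass it through
    module_path, sep, class_name = spec.partition(":")
    if not sep or not module_path or not class_name:
        raise ValueError(f"expected 'module.path:ClassName', got {spec!r}")
    module = importlib.import_module(module_path)
    try:
        return getattr(module, class_name)
    except AttributeError as exc:
        raise ValueError(f"class {class_name!r} not found in {module_path!r}") from exc


# A standard-library class is used here so the example runs anywhere.
ordered_dict_cls = resolve_class("collections:OrderedDict")
print(ordered_dict_cls.__name__)
```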
Example

    from molmo_spaces.evaluation import run_evaluation
    from my_repo.configs import MyEvalConfig

    results = run_evaluation(
        eval_config_cls=MyEvalConfig,
        benchmark_dir="/path/to/benchmark",
        checkpoint_path="/path/to/checkpoint.pt",
        task_horizon_steps=500,
    )
    print(f"Success rate: {results.success_rate:.1%}")

Source code in molmo_spaces/evaluation/eval_main.py
benchmark_schema
¶
JSON-based benchmark schema definitions.
This module defines Pydantic models for JSON benchmark files that fully specify episode initialization without relying on pickle serialization. Each episode is fully self-contained and can be loaded/inspected independently.
Design principles
- Each episode JSON is fully self-contained (no external config dependencies)
- A benchmark is simply a list/directory of episode JSONs (can mix task types)
- All fields needed to recreate exact initial conditions are explicit
- Task horizon is NOT stored per-episode - it's an evaluation parameter
Benchmark directory structure
```
benchmark_dir/
├── house_5/
│   ├── episode_00000000.json  # Fully self-contained
│   └── ...
└── ...
```
Key fields for robot placement
- robot.init_qpos: Initial joint positions per move group
- task.robot_base_pose: Robot base pose in world frame (NOT robot.default_world_pose)
The actual robot world placement comes from task.robot_base_pose, which is set by the task sampler and frozen into the episode. The robot_config.default_world_pose field in the codebase is just a default that gets overridden.
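The placement rule above can be sketched on a plain episode dict. This is only an illustration, not the library's code: `resolve_robot_placement` and `DEFAULT_WORLD_POSE` are hypothetical names, and the seven-element pose layout is an assumption made for the example.

```python
# Hypothetical fallback pose standing in for robot_config.default_world_pose.
# The 7-element layout (position + quaternion) is an assumption.
DEFAULT_WORLD_POSE = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]


def resolve_robot_placement(episode: dict) -> dict:
    """Pick the robot's world pose and joint state from an episode dict."""
    task = episode["task"]
    robot = episode["robot"]
    # task.robot_base_pose is authoritative; the default is only a fallback here.
    base_pose = task.get("robot_base_pose", DEFAULT_WORLD_POSE)
    # robot.init_qpos maps each move group to its initial joint positions.
    return {"base_pose": list(base_pose), "init_qpos": dict(robot["init_qpos"])}


example_episode = {
    "task": {"robot_base_pose": [1.0, 2.0, 0.0, 1.0, 0.0, 0.0, 0.0]},
    "robot": {"init_qpos": {"arm": [0.0, 0.1]}},
}
placement = resolve_robot_placement(example_episode)
```

The point of the sketch is the precedence: the pose frozen into the episode's task block is used whenever it is present.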
Task horizons
Task horizon (max steps per episode) is an EVALUATION parameter, not a task specification. Use DEFAULT_TASK_HORIZONS for sensible defaults per task class, and override via command line for specific eval runs.
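A minimal sketch of that resolution order, assuming an explicit override wins over a per-task-class default. The table contents, the fallback value, and the function name are all hypothetical; the real DEFAULT_TASK_HORIZONS keys and values may differ.

```python
from __future__ import annotations

# Hypothetical defaults keyed by task class name; not the library's actual values.
EXAMPLE_TASK_HORIZONS = {"PickTask": 300, "PickAndPlaceTask": 500}
FALLBACK_HORIZON = 400  # assumed value for task classes missing from the table


def resolve_task_horizon(task_cls: str, override: int | None = None) -> int:
    """Return the max number of steps for an episode of the given task class."""
    if override is not None:
        # An explicit value (e.g. from the command line) takes precedence.
        return override
    # Accept fully qualified names such as "pkg.module.PickTask".
    short_name = task_cls.rsplit(".", 1)[-1]
    return EXAMPLE_TASK_HORIZONS.get(short_name, FALLBACK_HORIZON)
```

Because the horizon is resolved at evaluation time, the same episode JSON can be rerun with a different step budget without editing the benchmark.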
Classes:
| Name | Description |
|---|---|
| BaseTaskSpec | Base task specification with fields common to all task types. |
| BenchmarkMetadata | Optional metadata for a benchmark directory. |
| DoorOpeningTaskSpec | Task-specific parameters for door opening tasks. |
| EpisodeSpec | Complete specification for a single benchmark episode. |
| ExocentricCameraSpec | Specification for an exocentric (fixed) camera. |
| LanguageSpec | Natural language task specification. |
| NavToObjTaskSpec | Task-specific parameters for navigation to object tasks. |
| OpenCloseTaskSpec | Task-specific parameters for open/close tasks. |
| PickAndPlaceColorTaskSpec | Task-specific parameters for pick and place color tasks. |
| PickAndPlaceNextToTaskSpec | Task-specific parameters for pick and place next-to tasks. |
| PickAndPlaceTaskSpec | Task-specific parameters for pick and place tasks. |
| PickTaskSpec | Task-specific parameters for pick tasks. |
| RobotMountedCameraSpec | Specification for a camera mounted on the robot. |
| RobotSpec | Robot initialization specification. |
| SceneModificationsSpec | Scene modifications required for this episode. |
| SourceSpec | Provenance information for this episode. |
Functions:
| Name | Description |
|---|---|
| get_task_spec_field_names | Get all field names from TaskSpec models that should be copied to task_config. |
| load_all_episodes | Load all episodes from a benchmark directory as a flat list. |
| load_benchmark | Load a benchmark directory. |
| replace_target_object_with_custom | Replace the target object in an episode with a custom object. |
Attributes:
| Name | Type | Description |
|---|---|---|
| ALL_TASK_SPEC_CLASSES | list[type[BaseTaskSpec]] | |
| CameraSpec | | |
| TaskSpec | | |
ALL_TASK_SPEC_CLASSES
module-attribute
¶
ALL_TASK_SPEC_CLASSES: list[type[BaseTaskSpec]] = [PickTaskSpec, PickAndPlaceTaskSpec, PickAndPlaceNextToTaskSpec, PickAndPlaceColorTaskSpec, OpenCloseTaskSpec, NavToObjTaskSpec, DoorOpeningTaskSpec]
TaskSpec
module-attribute
¶
TaskSpec = PickTaskSpec | PickAndPlaceTaskSpec | PickAndPlaceColorTaskSpec | PickAndPlaceNextToTaskSpec | OpenCloseTaskSpec | NavToObjTaskSpec | DoorOpeningTaskSpec
BaseTaskSpec
¶
Bases: BaseModel
Base task specification with fields common to all task types.
robot_base_pose is the authoritative field for robot world placement. This comes from task_config in the codebase, not robot_config.
task_cls is the authoritative identifier for the task type. The eval task sampler is responsible for interpreting task_cls and creating the appropriate task. task_type is optional and for human convenience only.
Attributes:
| Name | Type | Description |
|---|---|---|
| robot_base_pose | list[float] | |
| task_cls | str | |
| task_type | str \| None | |
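To illustrate how task_cls, rather than task_type, can drive the choice of spec class, here is a small sketch. It uses standard-library dataclasses in place of the Pydantic models, and the registry, class names, and task class string are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class BaseSpec:
    """Simplified stand-in for BaseTaskSpec."""
    robot_base_pose: list = field(default_factory=list)
    task_cls: str = ""
    task_type: str | None = None  # human convenience only, never used for dispatch


@dataclass
class PickSpec(BaseSpec):
    """Simplified stand-in for PickTaskSpec."""
    pickup_obj_name: str = ""


# Hypothetical registry from fully qualified task class name to spec class.
SPEC_BY_TASK_CLS = {"example.tasks.PickTask": PickSpec}


def parse_task(task: dict) -> BaseSpec:
    """Build the spec selected by task['task_cls'], ignoring task_type."""
    spec_cls = SPEC_BY_TASK_CLS.get(task["task_cls"])
    if spec_cls is None:
        raise ValueError(f"unknown task_cls: {task['task_cls']!r}")
    return spec_cls(**task)
```

A task dict whose task_type is missing or misleading still parses to the right spec, because only task_cls is consulted.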
BenchmarkMetadata
¶
Bases: BaseModel
Optional metadata for a benchmark directory.
This is NOT required - each episode is fully self-contained. This file provides optional human-readable metadata about the benchmark.
Classes:
| Name | Description |
|---|---|
| Config | |
Methods:
| Name | Description |
|---|---|
| from_json_file | Load benchmark metadata from a JSON file. |
| to_json_file | Save the benchmark metadata to a JSON file. |
Attributes:
| Name | Type | Description |
|---|---|---|
| benchmark_created_date | str \| None | |
| camera_system_class | str \| None | |
| created_at | str \| None | |
| description | str \| None | |
| episode_length_stats | dict[str, float] \| None | |
| house_counts | dict[int, int] \| None | |
| num_episodes | int \| None | |
| num_houses | int \| None | |
| object_category_counts | dict[str, int] \| None | |
| robot_counts | dict[str, int] \| None | |
| source_data_date | str \| None | |
| source_datagen_path | str \| None | |
| task_cls_counts | dict[str, int] \| None | |
Config
¶
from_json_file
classmethod
¶
from_json_file(path: str | Path) -> BenchmarkMetadata
Load benchmark metadata from a JSON file.
Source code in molmo_spaces/evaluation/benchmark_schema.py
to_json_file
¶
Save the benchmark metadata to a JSON file.
Source code in molmo_spaces/evaluation/benchmark_schema.py
DoorOpeningTaskSpec
¶
Bases: BaseTaskSpec
Task-specific parameters for door opening tasks.
Attributes:
| Name | Type | Description |
|---|---|---|
| articulated_joint_range | list[float] \| None | |
| articulated_joint_reset_state | list[float] \| None | |
| door_body_name | str | |
| door_openness_threshold | float | |
| robot_base_pose | list[float] | |
| task_cls | str | |
| task_type | str \| None | |
EpisodeSpec
¶
Bases: BaseModel
Complete specification for a single benchmark episode.
This is a FULLY SELF-CONTAINED specification - no external config needed. Contains all information needed to recreate the exact initial conditions for an episode: scene, robot, cameras, and task parameters.
NOTE: Timing/execution parameters (policy_dt_ms, ctrl_dt_ms, sim_dt_ms, task_horizon) are NOT stored per-episode. They come from the evaluation config or command line.
In the single-file format, a benchmark is simply a list of EpisodeSpec objects in one JSON file.
Classes:
| Name | Description |
|---|---|
| Config | |
Methods:
| Name | Description |
|---|---|
| from_json_file | Load an episode spec from a JSON file. |
| get_task_cls | Get fully qualified task class name from task dict (authoritative identifier). |
| get_task_type | Get optional human-readable task type from task dict. |
| to_json_file | Save the episode spec to a JSON file. |
Attributes:
| Name | Type | Description |
|---|---|---|
| cameras | list[CameraSpec] | |
| data_split | str | |
| house_index | int | |
| img_resolution | tuple[int, int] | |
| language | LanguageSpec | |
| robot | RobotSpec | |
| scene_dataset | str | |
| scene_modifications | SceneModificationsSpec | |
| seed | int \| None | |
| source | SourceSpec \| None | |
| task | dict | |
| task_relevant_objects | list[str] | |
cameras
class-attribute
instance-attribute
¶
cameras: list[CameraSpec] = Field(default_factory=list)
scene_modifications
class-attribute
instance-attribute
¶
scene_modifications: SceneModificationsSpec = Field(default_factory=SceneModificationsSpec)
task_relevant_objects
class-attribute
instance-attribute
¶
Config
¶
from_json_file
classmethod
¶
from_json_file(path: str | Path) -> EpisodeSpec
Load an episode spec from a JSON file.
Source code in molmo_spaces/evaluation/benchmark_schema.py
get_task_cls
¶
Get fully qualified task class name from task dict (authoritative identifier).
Source code in molmo_spaces/evaluation/benchmark_schema.py
get_task_type
¶
to_json_file
¶
Save the episode spec to a JSON file.
Source code in molmo_spaces/evaluation/benchmark_schema.py
ExocentricCameraSpec
¶
LanguageSpec
¶
Bases: BaseModel
Natural language task specification.
Attributes:
| Name | Type | Description |
|---|---|---|
| referral_expressions | dict[str, str] | |
| referral_expressions_priority | dict[str, list[list[float \| str]]] | |
| task_description | str | |
NavToObjTaskSpec
¶
Bases: BaseTaskSpec
Task-specific parameters for navigation to object tasks.
Attributes:
| Name | Type | Description |
|---|---|---|
| pickup_obj_candidates | list[str] \| None | |
| pickup_obj_name | str | |
| pickup_obj_start_pose | list[float] \| None | |
| receptacle_name | str \| None | |
| robot_base_pose | list[float] | |
| succ_pos_threshold | float | |
| task_cls | str | |
| task_type | str \| None | |
OpenCloseTaskSpec
¶
Bases: BaseTaskSpec
Task-specific parameters for open/close tasks.
Attributes:
| Name | Type | Description |
|---|---|---|
| any_inst_of_category | bool | |
| articulation_object_name | str \| None | |
| joint_goal_position | float \| None | |
| joint_index | int | |
| joint_name | str | |
| joint_start_position | float \| list[float] | |
| pickup_obj_name | str | |
| pickup_obj_start_pose | list[float] | |
| robot_base_pose | list[float] | |
| task_cls | str | |
| task_success_threshold | float | |
| task_type | str \| None | |
PickAndPlaceColorTaskSpec
¶
Bases: PickAndPlaceTaskSpec
Task-specific parameters for pick and place color tasks.
Attributes:
| Name | Type | Description |
|---|---|---|
| max_place_receptacle_pos_displacement | float | |
| max_place_receptacle_rot_displacement | float | |
| object_colors | dict[str, list[float]] \| None | |
| other_receptacle_names | list[str] \| None | |
| other_receptacle_start_poses | dict[str, list[float]] \| None | |
| pickup_obj_goal_pose | list[float] \| None | |
| pickup_obj_name | str | |
| pickup_obj_start_pose | list[float] | |
| place_receptacle_name | str | |
| place_receptacle_start_pose | list[float] | |
| receptacle_name | str \| None | |
| receptacle_supported_weight_frac | float | |
| robot_base_pose | list[float] | |
| succ_pos_threshold | float | |
| task_cls | str | |
| task_type | str \| None | |
PickAndPlaceNextToTaskSpec
¶
Bases: PickAndPlaceTaskSpec
Task-specific parameters for pick and place next-to tasks.
Attributes:
| Name | Type | Description |
|---|---|---|
| max_place_receptacle_pos_displacement | float | |
| max_place_receptacle_rot_displacement | float | |
| max_surface_to_surface_gap | float \| None | |
| min_surface_to_surface_gap | float \| None | |
| pickup_obj_goal_pose | list[float] \| None | |
| pickup_obj_name | str | |
| pickup_obj_start_pose | list[float] | |
| place_receptacle_name | str | |
| place_receptacle_start_pose | list[float] | |
| receptacle_name | str \| None | |
| receptacle_supported_weight_frac | float | |
| robot_base_pose | list[float] | |
| succ_pos_threshold | float | |
| task_cls | str | |
| task_type | str \| None | |
PickAndPlaceTaskSpec
¶
Bases: PickTaskSpec
Task-specific parameters for pick and place tasks.
Attributes:
| Name | Type | Description |
|---|---|---|
| max_place_receptacle_pos_displacement | float | |
| max_place_receptacle_rot_displacement | float | |
| pickup_obj_goal_pose | list[float] \| None | |
| pickup_obj_name | str | |
| pickup_obj_start_pose | list[float] | |
| place_receptacle_name | str | |
| place_receptacle_start_pose | list[float] | |
| receptacle_name | str \| None | |
| receptacle_supported_weight_frac | float | |
| robot_base_pose | list[float] | |
| succ_pos_threshold | float | |
| task_cls | str | |
| task_type | str \| None | |
PickTaskSpec
¶
Bases: BaseTaskSpec
Task-specific parameters for pick tasks.
Attributes:
| Name | Type | Description |
|---|---|---|
| pickup_obj_goal_pose | list[float] \| None | |
| pickup_obj_name | str | |
| pickup_obj_start_pose | list[float] | |
| receptacle_name | str \| None | |
| robot_base_pose | list[float] | |
| succ_pos_threshold | float | |
| task_cls | str | |
| task_type | str \| None | |
RobotMountedCameraSpec
¶
Bases: BaseModel
Specification for a camera mounted on the robot.
Attributes:
| Name | Type | Description |
|---|---|---|
| camera_offset | list[float] | |
| camera_quaternion | list[float] | |
| fov | float | |
| lookat_offset | list[float] | |
| name | str | |
| record_depth | bool | |
| reference_body_names | list[str] | |
| type | Literal['robot_mounted'] | |
RobotSpec
¶
Bases: BaseModel
Robot initialization specification.
Note: Robot world placement is in task.robot_base_pose, not here. This spec only contains robot-intrinsic state (joint positions).
Attributes:
| Name | Type | Description |
|---|---|---|
| init_qpos | dict[str, list[float]] | |
| robot_name | str | |
SceneModificationsSpec
¶
Bases: BaseModel
Scene modifications required for this episode.
This captures objects that need to be added to the base scene XML and their initial poses.
Attributes:
| Name | Type | Description |
|---|---|---|
| added_objects | dict[str, str] | |
| object_poses | dict[str, list[float]] | |
| removed_objects | list[str] | |
SourceSpec
¶
Bases: BaseModel
Provenance information for this episode.
Tracks where this episode specification came from (which H5 file and trajectory).
Attributes:
| Name | Type | Description |
|---|---|---|
| benchmark_created_date | str \| None | |
| camera_system_class | str \| None | |
| episode_length | int \| None | |
| h5_file | str | |
| source_data_date | str \| None | |
| traj_key | str | |
get_task_spec_field_names
¶
Get all field names from TaskSpec models that should be copied to task_config.
Returns the union of all fields from all TaskSpec subclasses, excluding metadata fields (task_cls, task_type) which identify the task but aren't configuration values.
This is derived from the Pydantic models to stay in sync automatically.
Source code in molmo_spaces/evaluation/benchmark_schema.py
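The "union of fields minus metadata" idea can be sketched as follows. The real function is described as deriving the names from the Pydantic models; this sketch substitutes standard-library dataclasses so it runs without extra dependencies, and the class and function names are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field, fields

# Fields that identify the task but are not configuration values.
METADATA_FIELDS = {"task_cls", "task_type"}


@dataclass
class BaseSpec:
    robot_base_pose: list = field(default_factory=list)
    task_cls: str = ""
    task_type: str | None = None


@dataclass
class PickSpec(BaseSpec):
    pickup_obj_name: str = ""


@dataclass
class OpenCloseSpec(BaseSpec):
    joint_name: str = ""


def task_spec_field_names(spec_classes: list[type]) -> set[str]:
    """Union of all field names across spec classes, minus metadata fields."""
    names: set[str] = set()
    for spec_cls in spec_classes:
        # fields() includes inherited fields, so base-class fields are covered.
        names.update(f.name for f in fields(spec_cls))
    return names - METADATA_FIELDS


copyable_names = task_spec_field_names([PickSpec, OpenCloseSpec])
```

Deriving the set from the class definitions means a field added to any spec class is picked up without maintaining a separate list.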
load_all_episodes
¶
load_all_episodes(benchmark_dir: Path) -> list[EpisodeSpec]
Load all episodes from a benchmark directory as a flat list.
Supports two formats:
1. Single benchmark.json file (preferred): List of EpisodeSpec dicts
2. Legacy house_*/episode_*.json structure
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| benchmark_dir | Path | Path to benchmark directory | required |
Returns:
| Type | Description |
|---|---|
| list[EpisodeSpec] | List of EpisodeSpec objects |
Source code in molmo_spaces/evaluation/benchmark_schema.py
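A rough sketch of handling both on-disk formats is shown below. It returns plain dicts instead of EpisodeSpec objects so it stays self-contained, and `load_episode_dicts` is a hypothetical name. Checking for benchmark.json first is an assumption based on that format being described as preferred.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


def load_episode_dicts(benchmark_dir: Path) -> list[dict]:
    """Return every episode in benchmark_dir as a plain dict."""
    single_file = benchmark_dir / "benchmark.json"
    if single_file.is_file():
        # Single-file format: one JSON file holding a list of episode dicts.
        return json.loads(single_file.read_text())
    # Legacy format: one JSON file per episode under house_* subdirectories.
    episode_paths = sorted(benchmark_dir.glob("house_*/episode_*.json"))
    return [json.loads(path.read_text()) for path in episode_paths]


# Small demonstration on a throwaway legacy-style directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for house, episode in [(5, 0), (5, 1), (7, 0)]:
        house_dir = root / f"house_{house}"
        house_dir.mkdir(exist_ok=True)
        payload = {"house_index": house, "episode": episode}
        (house_dir / f"episode_{episode:08d}.json").write_text(json.dumps(payload))
    legacy_episodes = load_episode_dicts(root)

    # Once benchmark.json exists, this sketch reads it instead of the house dirs.
    (root / "benchmark.json").write_text(json.dumps([{"house_index": 1}]))
    single_file_episodes = load_episode_dicts(root)
```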
load_benchmark
¶
load_benchmark(benchmark_dir: Path) -> tuple[BenchmarkMetadata | None, dict[int, list[Path]]]
Load a benchmark directory.
A benchmark is simply a directory of episode JSON files. Each episode is fully self-contained. An optional benchmark_metadata.json provides human-readable info but is not required.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| benchmark_dir | Path | Path to benchmark directory containing house_* subdirectories with episode JSON files. May optionally contain benchmark_metadata.json. | required |
Returns:
| Type | Description |
|---|---|
| tuple[BenchmarkMetadata \| None, dict[int, list[Path]]] | Tuple of (BenchmarkMetadata or None, dict mapping house_id -> list of episode JSON paths) |
Source code in molmo_spaces/evaluation/benchmark_schema.py
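The return shape described above (optional metadata plus a house-to-paths mapping) can be sketched like this. `index_benchmark` is a hypothetical name, the metadata is returned as a raw dict rather than a BenchmarkMetadata object, and parsing the house id from the directory name assumes the `house_<int>` pattern.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


def index_benchmark(benchmark_dir: Path) -> tuple[dict | None, dict[int, list[Path]]]:
    """Return (metadata or None, {house_id: sorted episode JSON paths})."""
    meta_path = benchmark_dir / "benchmark_metadata.json"
    # The metadata file is optional, so a missing file is not an error.
    metadata = json.loads(meta_path.read_text()) if meta_path.is_file() else None

    episodes_by_house: dict[int, list[Path]] = {}
    for house_dir in benchmark_dir.glob("house_*"):
        if not house_dir.is_dir():
            continue
        house_id = int(house_dir.name.split("_", 1)[1])
        episodes_by_house[house_id] = sorted(house_dir.glob("*.json"))
    return metadata, episodes_by_house


# Demonstration on a temporary benchmark with two houses and no metadata file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for house, count in [(5, 2), (12, 1)]:
        house_dir = root / f"house_{house}"
        house_dir.mkdir()
        for i in range(count):
            (house_dir / f"episode_{i:08d}.json").write_text("{}")
    metadata, episodes_by_house = index_benchmark(root)
    episode_counts = {house: len(paths) for house, paths in episodes_by_house.items()}
```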
replace_target_object_with_custom
¶
replace_target_object_with_custom(episode: EpisodeSpec, custom_object_path: str | Path, custom_object_name: str | None = None) -> EpisodeSpec
Replace the target object in an episode with a custom object.
This function:
1. Identifies the target object from the task specification (e.g., pickup_obj_name)
2. Gets the target object's pose from task or scene_modifications
3. Removes the target object from scene_modifications if it's an added object
4. Adds the custom object to scene_modifications with the same pose
5. Updates the task specification to reference the new custom object
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| episode | EpisodeSpec | The episode specification to modify | required |
| custom_object_path | str \| Path | Path to the custom object XML file (relative to ASSETS_DIR or absolute) | required |
| custom_object_name | str \| None | Optional natural language name for the custom object (e.g., 'lemon'). If not provided, will extract from the XML body name. | None |
Returns:
| Type | Description |
|---|---|
| EpisodeSpec | A new EpisodeSpec with the target object replaced by the custom object |
Raises:
| Type | Description |
|---|---|
| ValueError | If the episode doesn't have a target object or if required fields are missing |
Source code in molmo_spaces/evaluation/benchmark_schema.py
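The five steps listed for this function can be sketched on plain dicts. This is an illustration under stated assumptions, not the library's implementation: `replace_target` is a hypothetical name, added_objects is assumed to map object name to XML path, the old pose entry is dropped along with the old object, and the fallback name comes from the file stem rather than the XML body name.

```python
from __future__ import annotations

import copy
from pathlib import Path


def replace_target(episode: dict, custom_object_path: str,
                   custom_object_name: str | None = None) -> dict:
    """Return a copy of episode whose pickup object is swapped for a custom one."""
    new_episode = copy.deepcopy(episode)  # leave the input untouched
    task = new_episode["task"]
    mods = new_episode["scene_modifications"]

    # 1. Identify the target object from the task specification.
    old_name = task.get("pickup_obj_name")
    if not old_name:
        raise ValueError("episode has no target object")

    # 2. Take the pose from the task, falling back to scene_modifications.
    pose = task.get("pickup_obj_start_pose") or mods["object_poses"].get(old_name)
    if pose is None:
        raise ValueError("no pose found for the target object")

    # 3. Remove the old object if it was an added object (assumption: its pose too).
    mods["added_objects"].pop(old_name, None)
    mods["object_poses"].pop(old_name, None)

    # 4. Add the custom object at the same pose (file stem as a simplified name).
    new_name = custom_object_name or Path(custom_object_path).stem
    mods["added_objects"][new_name] = str(custom_object_path)
    mods["object_poses"][new_name] = list(pose)

    # 5. Point the task at the new object.
    task["pickup_obj_name"] = new_name
    return new_episode
```

Working on a deep copy mirrors the documented behavior of returning a new spec instead of mutating the one passed in.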
configs
¶
Modules:
| Name | Description |
|---|---|
| evaluation_configs | These configs are EXAMPLES of how to set up evaluation configs for use with JSON benchmarks. |
evaluation_configs
¶
These configs are EXAMPLES of how to set up evaluation configs for use with JSON benchmarks via molmo_spaces.evaluation.run_evaluation(). The anticipated pattern is that users will create their own eval configs in their own repositories, import run_evaluation from molmo_spaces.evaluation, and pass their config to it.
Example usage from an external repo
```python
from molmo_spaces.evaluation import run_evaluation
from my_repo.configs import MyPolicyEvalConfig

results = run_evaluation(
    eval_config_cls=MyPolicyEvalConfig,
    benchmark_dir="/path/to/benchmark",
    checkpoint_path="/path/to/checkpoint",
)
```
Eval configs provide:
- Robot config (factories for instantiation, gravcomp settings)
- Policy config (checkpoint path, camera names, action spec)
- Timing parameters (policy_dt_ms, ctrl_dt_ms, sim_dt_ms)
Episode-specific data (init_qpos, robot_base_pose, cameras, object_poses, task config) comes from the JSON benchmark files, not from these configs. The benchmark JSON is strictly authoritative for episode initialization.
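The split of responsibilities described here can be sketched with two flat dicts. The flat layout, the key tuples, and `merge_run_settings` are hypothetical simplifications; the real config and episode objects are nested models.

```python
# Keys owned by the eval config (timing) versus the benchmark JSON (episode state).
TIMING_KEYS = ("policy_dt_ms", "ctrl_dt_ms", "sim_dt_ms")
EPISODE_OWNED_KEYS = ("init_qpos", "robot_base_pose", "cameras", "object_poses")


def merge_run_settings(config: dict, episode: dict) -> dict:
    """Combine timing from the config with initial state from the episode."""
    settings = {key: config[key] for key in TIMING_KEYS}
    for key in EPISODE_OWNED_KEYS:
        # The episode is authoritative: config values for these keys are ignored.
        settings[key] = episode[key]
    return settings


example_config = {
    "policy_dt_ms": 100.0, "ctrl_dt_ms": 10.0, "sim_dt_ms": 2.0,
    "robot_base_pose": [9.0, 9.0, 9.0],  # deliberately conflicting value
}
example_episode = {
    "init_qpos": {"arm": [0.0]}, "robot_base_pose": [1.0, 2.0, 0.0],
    "cameras": [], "object_poses": {},
}
run_settings = merge_run_settings(example_config, example_episode)
```

The conflicting robot_base_pose in the example config is dropped, which is the behavior the paragraph above calls "strictly authoritative".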
Classes:
| Name | Description |
|---|---|
| BrownianMotionPickPlaceColorEvalConfig | |
| BrownianMotionPickPlaceEvalConfig | Evaluation config for Dummy pick and place. |
| CAPPolicyEvalConfig | |
| DreamZeroPolicyEvalConfig | |
| DummyBenchmarkEvalConfig | Test config that inherits from JsonBenchmarkEvalConfig. |
| DummyPickPlaceEvalConfig | Evaluation config for Dummy pick and place. |
| JsonBenchmarkEvalConfig | Minimal base config for JSON benchmark evaluation. |
| PiPolicyEvalConfig | |
| TeleopPolicyEvalConfig | |
Attributes:
| Name | Type | Description |
|---|---|---|
| TIMESTAMP | | |
BrownianMotionPickPlaceColorEvalConfig
¶
Bases: BrownianMotionPickPlaceEvalConfig
Classes:
| Name | Description |
|---|---|
Config |
|
SavedEpisode |
|
Methods:
| Name | Description |
|---|---|
freeze_task_config |
Saves the state of a sampled task i.e. an episode |
from_dict |
Create a configuration instance from a dictionary. |
load_config |
Loads a configuration from a file |
load_from_json |
Load the configuration from a JSON file. |
model_post_init |
|
save_config |
Saves the current configuration to the output directory |
save_to_json |
Save the configuration to a JSON file. |
to_dict |
Convert the configuration to a dictionary. |
to_json |
|
Attributes:
| Name | Type | Description |
|---|---|---|
CameraConfig |
type
|
|
PolicyConfig |
type
|
|
RobotConfig |
type
|
|
benchmark_path |
Path | None
|
|
camera_config |
FrankaRandomizedDroidCameraSystem
|
|
collision_free_pose_limit |
int
|
|
config_version |
str
|
|
ctrl_dt_ms |
float
|
|
data_split |
str
|
|
datagen_profiler |
bool
|
|
end_on_success |
bool
|
|
environment_light_intensity |
float
|
|
eval_runtime_params |
Any
|
|
filter_for_successful_trajectories |
bool
|
|
fps |
float
|
|
log_level |
str
|
|
num_envs |
int
|
|
num_workers |
int
|
|
output_dir |
Path
|
|
policy_config |
BrownianMotionPolicyConfig
|
|
policy_dt_ms |
float
|
|
profile |
bool
|
|
profiler |
Profiler | None
|
|
robot_config |
BaseRobotConfig
|
|
scene_dataset |
str
|
|
seed |
int | None
|
|
sim_dt_ms |
float
|
|
tag |
str
|
|
task_config |
PickAndPlaceColorTaskConfig
|
|
task_config_preset |
PickTaskConfig | None
|
|
task_config_preset_exp |
AllTaskConfigs | None
|
|
task_config_preset_scn |
AllTaskConfigs | None
|
|
task_horizon |
int
|
|
task_sampler_config |
PickAndPlaceColorTaskSamplerConfig
|
|
task_type |
str
|
|
use_passive_viewer |
bool
|
|
use_wandb |
bool
|
|
viewer_cam_dict |
dict
|
|
viewer_camera |
None
|
|
wandb_name |
str
|
|
wandb_project |
str
|
|
camera_config
class-attribute
instance-attribute
¶
environment_light_intensity
class-attribute
instance-attribute
¶
filter_for_successful_trajectories
class-attribute
instance-attribute
¶
output_dir
class-attribute
instance-attribute
¶
output_dir: Path = Path('eval_output') / f'brownian_motion_{TIMESTAMP}'
policy_config
class-attribute
instance-attribute
¶
robot_config
class-attribute
instance-attribute
¶
task_config
class-attribute
instance-attribute
¶
task_config: PickAndPlaceColorTaskConfig = PickAndPlaceColorTaskConfig(task_cls=PickAndPlaceColorTask)
task_config_preset
class-attribute
instance-attribute
¶
task_config_preset_exp
class-attribute
instance-attribute
¶
task_config_preset_scn
class-attribute
instance-attribute
¶
task_sampler_config
class-attribute
instance-attribute
¶
task_sampler_config: PickAndPlaceColorTaskSamplerConfig = PickAndPlaceColorTaskSamplerConfig(task_sampler_class=PickAndPlaceColorTaskSampler, house_inds=[5, 15, 25, 35, 45, 55, 65, 75, 85, 95, 105, 115, 125, 135, 145], samples_per_house=3)
viewer_cam_dict
class-attribute
instance-attribute
¶
viewer_cam_dict: dict = {'distance': 5.0, 'azimuth': 45.0, 'elevation': -30.0, 'lookat': [0.0, 0.0, 0.5]}
wandb_name
class-attribute
instance-attribute
¶
wandb_name: str = f'brownian_motion_pick_place_color_eval_{TIMESTAMP}'
Config
¶
SavedEpisode
¶
Bases: Config
Classes:
| Name | Description |
|---|---|
Config |
|
Methods:
| Name | Description |
|---|---|
from_dict |
Create a configuration instance from a dictionary. |
load_from_json |
Load the configuration from a JSON file. |
save_to_json |
Save the configuration to a JSON file. |
to_dict |
Convert the configuration to a dictionary. |
to_json |
|
Attributes:
| Name | Type | Description |
|---|---|---|
camera_config |
AllCameraSystems | None
|
|
robot_config |
FrankaRobotConfig | None
|
|
task_cls_str |
str | None
|
|
task_config |
PickAndPlaceTaskConfig | None
|
|
freeze_task_config
¶
freeze_task_config(observation, task: BaseMujocoTask = None) -> None
Saves the state of a sampled task i.e. an episode
Source code in molmo_spaces/configs/abstract_exp_config.py
from_dict
classmethod
¶
load_config
staticmethod
¶
Loads a configuration from a file
Source code in molmo_spaces/configs/abstract_exp_config.py
load_from_json
classmethod
¶
Load the configuration from a JSON file.
model_post_init
¶
save_config
¶
Saves the current configuration to the output directory
Source code in molmo_spaces/configs/abstract_exp_config.py
save_to_json
¶
to_dict
¶
BrownianMotionPickPlaceEvalConfig
¶
Bases: FrankaPickAndPlaceDataGenConfig
Evaluation config for Dummy pick and place.
Classes:
| Name | Description |
|---|---|
Config |
|
SavedEpisode |
|
Methods:
| Name | Description |
|---|---|
freeze_task_config |
Saves the state of a sampled task i.e. an episode |
from_dict |
Create a configuration instance from a dictionary. |
load_config |
Loads a configuration from a file |
load_from_json |
Load the configuration from a JSON file. |
model_post_init |
|
save_config |
Saves the current configuration to the output directory |
save_to_json |
Save the configuration to a JSON file. |
to_dict |
Convert the configuration to a dictionary. |
to_json |
|
Attributes:
| Name | Type | Description |
|---|---|---|
CameraConfig |
type
|
|
PolicyConfig |
type
|
|
RobotConfig |
type
|
|
benchmark_path |
Path | None
|
|
camera_config |
FrankaRandomizedDroidCameraSystem
|
|
collision_free_pose_limit |
int
|
|
config_version |
str
|
|
ctrl_dt_ms |
float
|
|
data_split |
str
|
|
datagen_profiler |
bool
|
|
end_on_success |
bool
|
|
environment_light_intensity |
float
|
|
eval_runtime_params |
Any
|
|
filter_for_successful_trajectories |
bool
|
|
fps |
float
|
|
log_level |
str
|
|
num_envs |
int
|
|
num_workers |
int
|
|
output_dir |
Path
|
|
policy_config |
BrownianMotionPolicyConfig
|
|
policy_dt_ms |
float
|
|
profile |
bool
|
|
profiler |
Profiler | None
|
|
robot_config |
BaseRobotConfig
|
|
scene_dataset |
str
|
|
seed |
int | None
|
|
sim_dt_ms |
float
|
|
tag |
str
|
|
task_config |
PickAndPlaceTaskConfig
|
|
task_config_preset |
PickTaskConfig | None
|
|
task_config_preset_exp |
AllTaskConfigs | None
|
|
task_config_preset_scn |
AllTaskConfigs | None
|
|
task_horizon |
int
|
|
task_sampler_config |
PickAndPlaceTaskSamplerConfig
|
|
task_type |
str
|
|
use_passive_viewer |
bool
|
|
use_wandb |
bool
|
|
viewer_cam_dict |
dict
|
|
viewer_camera |
None
|
|
wandb_name |
str
|
|
wandb_project |
str
|
|
camera_config
class-attribute
instance-attribute
¶
environment_light_intensity
class-attribute
instance-attribute
¶
filter_for_successful_trajectories
class-attribute
instance-attribute
¶
output_dir
class-attribute
instance-attribute
¶
output_dir: Path = Path('eval_output') / f'brownian_motion_{TIMESTAMP}'
policy_config
class-attribute
instance-attribute
¶
robot_config
class-attribute
instance-attribute
¶
task_config
class-attribute
instance-attribute
¶
task_config: PickAndPlaceTaskConfig = PickAndPlaceTaskConfig(task_cls=PickAndPlaceTask)
task_config_preset
class-attribute
instance-attribute
¶
task_config_preset_exp
class-attribute
instance-attribute
¶
task_config_preset_scn
class-attribute
instance-attribute
¶
task_sampler_config
class-attribute
instance-attribute
¶
task_sampler_config: PickAndPlaceTaskSamplerConfig = PickAndPlaceTaskSamplerConfig(task_sampler_class=PickAndPlaceTaskSampler, house_inds=[5, 15, 25, 35, 45, 55, 65, 75, 85, 95, 105, 115, 125, 135, 145], samples_per_house=3)
viewer_cam_dict
class-attribute
instance-attribute
¶
viewer_cam_dict: dict = {'distance': 5.0, 'azimuth': 45.0, 'elevation': -30.0, 'lookat': [0.0, 0.0, 0.5]}
wandb_name
class-attribute
instance-attribute
¶
wandb_name: str = f'brownian_motion_pick_place_eval_{TIMESTAMP}'
Config
¶
SavedEpisode
¶
Bases: Config
Classes:
| Name | Description |
|---|---|
Config |
|
Methods:
| Name | Description |
|---|---|
from_dict |
Create a configuration instance from a dictionary. |
load_from_json |
Load the configuration from a JSON file. |
save_to_json |
Save the configuration to a JSON file. |
to_dict |
Convert the configuration to a dictionary. |
to_json |
|
Attributes:
| Name | Type | Description |
|---|---|---|
camera_config |
AllCameraSystems | None
|
|
robot_config |
FrankaRobotConfig | None
|
|
task_cls_str |
str | None
|
|
task_config |
PickAndPlaceTaskConfig | None
|
|
freeze_task_config
¶
freeze_task_config(observation, task: BaseMujocoTask = None) -> None
Saves the state of a sampled task i.e. an episode
Source code in molmo_spaces/configs/abstract_exp_config.py
from_dict
classmethod
¶
load_config
staticmethod
¶
Loads a configuration from a file
Source code in molmo_spaces/configs/abstract_exp_config.py
load_from_json
classmethod
¶
Load the configuration from a JSON file.
model_post_init
¶
save_config
¶
Saves the current configuration to the output directory
Source code in molmo_spaces/configs/abstract_exp_config.py
save_to_json
¶
to_dict
¶
CAPPolicyEvalConfig
¶
Bases: JsonBenchmarkEvalConfig
Classes:
| Name | Description |
|---|---|
Config |
|
SavedEpisode |
Config information describing a single episode |
Methods:
| Name | Description |
|---|---|
freeze_task_config |
Saves the state of a sampled task i.e. an episode |
from_dict |
Create a configuration instance from a dictionary. |
load_config |
Loads a configuration from a file |
load_from_json |
Load the configuration from a JSON file. |
model_post_init |
|
save_config |
Saves the current configuration to the output directory |
save_to_json |
Save the configuration to a JSON file. |
to_dict |
Convert the configuration to a dictionary. |
to_json |
|
Attributes:
| Name | Type | Description |
|---|---|---|
CameraConfig |
type
|
|
PolicyConfig |
type
|
|
RobotConfig |
type
|
|
benchmark_path |
Path | None
|
|
camera_config |
None
|
|
collision_free_pose_limit |
int
|
|
config_version |
str
|
|
ctrl_dt_ms |
float
|
|
data_split |
str
|
|
datagen_profiler |
bool
|
|
end_on_success |
bool
|
|
environment_light_intensity |
float
|
|
eval_runtime_params |
Any
|
|
filter_for_successful_trajectories |
bool
|
|
fps |
float
|
|
log_level |
str
|
|
num_envs |
int
|
|
num_workers |
int
|
|
output_dir |
Path
|
|
policy_config |
CAPPolicyConfig
|
|
policy_dt_ms |
float
|
|
profile |
bool
|
|
profiler |
Profiler | None
|
|
robot_config |
FrankaCAPRobotConfig
|
|
scene_dataset |
str
|
|
seed |
int | None
|
|
sim_dt_ms |
float
|
|
tag |
str
|
|
task_config |
BaseMujocoTaskConfig
|
|
task_config_preset_exp |
AllTaskConfigs | None
|
|
task_config_preset_scn |
AllTaskConfigs | None
|
|
task_horizon |
int
|
|
task_sampler_config |
BaseMujocoTaskSamplerConfig
|
|
task_type |
str
|
|
terminate_upon_success |
bool
|
|
use_passive_viewer |
bool
|
|
use_wandb |
bool
|
|
viewer_cam_dict |
dict
|
|
wandb_name |
str | None
|
|
wandb_project |
str
|
|
environment_light_intensity
class-attribute
instance-attribute
¶
filter_for_successful_trajectories
class-attribute
instance-attribute
¶
policy_config
class-attribute
instance-attribute
¶
robot_config
class-attribute
instance-attribute
¶
task_config
class-attribute
instance-attribute
¶
task_config_preset_exp
class-attribute
instance-attribute
¶
task_config_preset_scn
class-attribute
instance-attribute
¶
task_sampler_config
class-attribute
instance-attribute
¶
task_sampler_config: BaseMujocoTaskSamplerConfig = BaseMujocoTaskSamplerConfig(task_sampler_class=BaseMujocoTaskSampler, house_inds=[0], samples_per_house=1, task_batch_size=1, max_tasks=10000, load_robot_from_file=True)
viewer_cam_dict
class-attribute
instance-attribute
¶
viewer_cam_dict: dict = {'distance': 5.0, 'azimuth': 45.0, 'elevation': -30.0, 'lookat': [0.0, 0.0, 0.5]}
Config
¶
SavedEpisode
¶
Bases: Config
Config information describing a single episode
Classes:
| Name | Description |
|---|---|
Config |
|
Methods:
| Name | Description |
|---|---|
from_dict |
Create a configuration instance from a dictionary. |
load_from_json |
Load the configuration from a JSON file. |
save_to_json |
Save the configuration to a JSON file. |
to_dict |
Convert the configuration to a dictionary. |
to_json |
|
Attributes:
| Name | Type | Description |
|---|---|---|
camera_config |
AllCameraSystems | None
|
|
robot_config |
BaseRobotConfig | None
|
|
task_cls_str |
str | None
|
|
task_config |
AllTaskConfigs | None
|
|
freeze_task_config
¶
freeze_task_config(observation, task: BaseMujocoTask = None) -> None
Saves the state of a sampled task i.e. an episode
Source code in molmo_spaces/configs/abstract_exp_config.py
from_dict
classmethod
¶
load_config
staticmethod
¶
Loads a configuration from a file
Source code in molmo_spaces/configs/abstract_exp_config.py
load_from_json
classmethod
¶
Load the configuration from a JSON file.
model_post_init
¶
save_config
¶
Saves the current configuration to the output directory
Source code in molmo_spaces/configs/abstract_exp_config.py
save_to_json
¶
to_dict
¶
DreamZeroPolicyEvalConfig
¶
Bases: JsonBenchmarkEvalConfig
Classes:
| Name | Description |
|---|---|
| Config | |
| SavedEpisode | Config information describing a single episode |
Methods:
| Name | Description |
|---|---|
| freeze_task_config | Saves the state of a sampled task i.e. an episode |
| from_dict | Create a configuration instance from a dictionary. |
| load_config | Loads a configuration from a file |
| load_from_json | Load the configuration from a JSON file. |
| model_post_init | |
| save_config | Saves the current configuration to the output directory |
| save_to_json | Save the configuration to a JSON file. |
| to_dict | Convert the configuration to a dictionary. |
| to_json | |
Attributes:
| Name | Type | Description |
|---|---|---|
| CameraConfig | type | |
| PolicyConfig | type | |
| RobotConfig | type | |
| benchmark_path | Path \| None | |
| camera_config | None | |
| collision_free_pose_limit | int | |
| config_version | str | |
| ctrl_dt_ms | float | |
| data_split | str | |
| datagen_profiler | bool | |
| end_on_success | bool | |
| environment_light_intensity | float | |
| eval_runtime_params | Any | |
| filter_for_successful_trajectories | bool | |
| fps | float | |
| log_level | str | |
| num_envs | int | |
| num_workers | int | |
| output_dir | Path | |
| policy_config | DreamZeroPolicyConfig | |
| policy_dt_ms | float | |
| profile | bool | |
| profiler | Profiler \| None | |
| robot_config | FrankaRobotConfig | |
| scene_dataset | str | |
| seed | int \| None | |
| sim_dt_ms | float | |
| tag | str | |
| task_config | BaseMujocoTaskConfig | |
| task_config_preset_exp | AllTaskConfigs \| None | |
| task_config_preset_scn | AllTaskConfigs \| None | |
| task_horizon | int | |
| task_sampler_config | BaseMujocoTaskSamplerConfig | |
| task_type | str | |
| terminate_upon_success | bool | |
| use_passive_viewer | bool | |
| use_wandb | bool | |
| viewer_cam_dict | dict | |
| wandb_name | str \| None | |
| wandb_project | str | |
environment_light_intensity
class-attribute
instance-attribute
¶
filter_for_successful_trajectories
class-attribute
instance-attribute
¶
policy_config
class-attribute
instance-attribute
¶
robot_config
class-attribute
instance-attribute
¶
task_config
class-attribute
instance-attribute
¶
task_config_preset_exp
class-attribute
instance-attribute
¶
task_config_preset_scn
class-attribute
instance-attribute
¶
task_sampler_config
class-attribute
instance-attribute
¶
task_sampler_config: BaseMujocoTaskSamplerConfig = BaseMujocoTaskSamplerConfig(task_sampler_class=BaseMujocoTaskSampler, house_inds=[0], samples_per_house=1, task_batch_size=1, max_tasks=10000, load_robot_from_file=True)
viewer_cam_dict
class-attribute
instance-attribute
¶
viewer_cam_dict: dict = {'distance': 5.0, 'azimuth': 45.0, 'elevation': -30.0, 'lookat': [0.0, 0.0, 0.5]}
Config
¶
SavedEpisode
¶
Bases: Config
Config information describing a single episode
Classes:
| Name | Description |
|---|---|
| Config | |
Methods:
| Name | Description |
|---|---|
| from_dict | Create a configuration instance from a dictionary. |
| load_from_json | Load the configuration from a JSON file. |
| save_to_json | Save the configuration to a JSON file. |
| to_dict | Convert the configuration to a dictionary. |
| to_json | |
Attributes:
| Name | Type | Description |
|---|---|---|
| camera_config | AllCameraSystems \| None | |
| robot_config | BaseRobotConfig \| None | |
| task_cls_str | str \| None | |
| task_config | AllTaskConfigs \| None | |
freeze_task_config
¶
freeze_task_config(observation, task: BaseMujocoTask = None) -> None
Saves the state of a sampled task i.e. an episode
Source code in molmo_spaces/configs/abstract_exp_config.py
from_dict
classmethod
¶
load_config
staticmethod
¶
Loads a configuration from a file
Source code in molmo_spaces/configs/abstract_exp_config.py
load_from_json
classmethod
¶
Load the configuration from a JSON file.
model_post_init
¶
save_config
¶
Saves the current configuration to the output directory
Source code in molmo_spaces/configs/abstract_exp_config.py
save_to_json
¶
to_dict
¶
DummyBenchmarkEvalConfig
¶
Bases: JsonBenchmarkEvalConfig
Test config that inherits from JsonBenchmarkEvalConfig.
This tests the recommended pattern from evaluation/README.md: external repos should inherit from JsonBenchmarkEvalConfig and provide their robot_config and policy_config. The benchmark JSON provides all episode-specific data (cameras, poses, task params).
Note: Prefixed with underscore to avoid a pytest collection warning, since this inherits from a class with __init__.
Classes:
| Name | Description |
|---|---|
| Config | |
| SavedEpisode | Config information describing a single episode |
Methods:
| Name | Description |
|---|---|
| freeze_task_config | Saves the state of a sampled task i.e. an episode |
| from_dict | Create a configuration instance from a dictionary. |
| load_config | Loads a configuration from a file |
| load_from_json | Load the configuration from a JSON file. |
| model_post_init | |
| save_config | Saves the current configuration to the output directory |
| save_to_json | Save the configuration to a JSON file. |
| to_dict | Convert the configuration to a dictionary. |
| to_json | |
Attributes:
| Name | Type | Description |
|---|---|---|
| CameraConfig | type | |
| PolicyConfig | type | |
| RobotConfig | type | |
| benchmark_path | Path \| None | |
| camera_config | None | |
| collision_free_pose_limit | int | |
| config_version | str | |
| ctrl_dt_ms | float | |
| data_split | str | |
| datagen_profiler | bool | |
| end_on_success | bool | |
| environment_light_intensity | float | |
| eval_runtime_params | Any | |
| filter_for_successful_trajectories | bool | |
| fps | float | |
| log_level | str | |
| num_envs | int | |
| num_workers | int | |
| output_dir | Path | |
| policy_config | DummyPolicyConfig | |
| policy_dt_ms | float | |
| profile | bool | |
| profiler | Profiler \| None | |
| robot_config | FrankaRobotConfig | |
| scene_dataset | str | |
| seed | int | |
| sim_dt_ms | float | |
| tag | str | |
| task_config | BaseMujocoTaskConfig | |
| task_config_preset_exp | AllTaskConfigs \| None | |
| task_config_preset_scn | AllTaskConfigs \| None | |
| task_horizon | int | |
| task_sampler_config | BaseMujocoTaskSamplerConfig | |
| task_type | str | |
| terminate_upon_success | bool | |
| use_passive_viewer | bool | |
| use_wandb | bool | |
| viewer_cam_dict | dict | |
| wandb_name | str \| None | |
| wandb_project | str | |
environment_light_intensity
class-attribute
instance-attribute
¶
filter_for_successful_trajectories
class-attribute
instance-attribute
¶
policy_config
class-attribute
instance-attribute
¶
robot_config
class-attribute
instance-attribute
¶
task_config
class-attribute
instance-attribute
¶
task_config_preset_exp
class-attribute
instance-attribute
¶
task_config_preset_scn
class-attribute
instance-attribute
¶
task_sampler_config
class-attribute
instance-attribute
¶
task_sampler_config: BaseMujocoTaskSamplerConfig = BaseMujocoTaskSamplerConfig(task_sampler_class=BaseMujocoTaskSampler, house_inds=[0], samples_per_house=1, task_batch_size=1, max_tasks=10000, load_robot_from_file=True)
viewer_cam_dict
class-attribute
instance-attribute
¶
viewer_cam_dict: dict = {'distance': 5.0, 'azimuth': 45.0, 'elevation': -30.0, 'lookat': [0.0, 0.0, 0.5]}
Config
¶
SavedEpisode
¶
Bases: Config
Config information describing a single episode
Classes:
| Name | Description |
|---|---|
| Config | |
Methods:
| Name | Description |
|---|---|
| from_dict | Create a configuration instance from a dictionary. |
| load_from_json | Load the configuration from a JSON file. |
| save_to_json | Save the configuration to a JSON file. |
| to_dict | Convert the configuration to a dictionary. |
| to_json | |
Attributes:
| Name | Type | Description |
|---|---|---|
| camera_config | AllCameraSystems \| None | |
| robot_config | BaseRobotConfig \| None | |
| task_cls_str | str \| None | |
| task_config | AllTaskConfigs \| None | |
freeze_task_config
¶
freeze_task_config(observation, task: BaseMujocoTask = None) -> None
Saves the state of a sampled task i.e. an episode
Source code in molmo_spaces/configs/abstract_exp_config.py
from_dict
classmethod
¶
load_config
staticmethod
¶
Loads a configuration from a file
Source code in molmo_spaces/configs/abstract_exp_config.py
load_from_json
classmethod
¶
Load the configuration from a JSON file.
model_post_init
¶
save_config
¶
Saves the current configuration to the output directory
Source code in molmo_spaces/configs/abstract_exp_config.py
save_to_json
¶
to_dict
¶
DummyPickPlaceEvalConfig
¶
Bases: FrankaPickAndPlaceDataGenConfig
Evaluation config for Dummy pick and place.
Classes:
| Name | Description |
|---|---|
| Config | |
| SavedEpisode | |
Methods:
| Name | Description |
|---|---|
| freeze_task_config | Saves the state of a sampled task i.e. an episode |
| from_dict | Create a configuration instance from a dictionary. |
| load_config | Loads a configuration from a file |
| load_from_json | Load the configuration from a JSON file. |
| model_post_init | |
| save_config | Saves the current configuration to the output directory |
| save_to_json | Save the configuration to a JSON file. |
| to_dict | Convert the configuration to a dictionary. |
| to_json | |
Attributes:
| Name | Type | Description |
|---|---|---|
| CameraConfig | type | |
| PolicyConfig | type | |
| RobotConfig | type | |
| benchmark_path | Path \| None | |
| camera_config | FrankaRandomizedDroidCameraSystem | |
| collision_free_pose_limit | int | |
| config_version | str | |
| ctrl_dt_ms | float | |
| data_split | str | |
| datagen_profiler | bool | |
| end_on_success | bool | |
| environment_light_intensity | float | |
| eval_runtime_params | Any | |
| filter_for_successful_trajectories | bool | |
| fps | float | |
| log_level | str | |
| num_envs | int | |
| num_workers | int | |
| output_dir | Path | |
| policy_config | DummyPolicyConfig | |
| policy_dt_ms | float | |
| profile | bool | |
| profiler | Profiler \| None | |
| robot_config | BaseRobotConfig | |
| scene_dataset | str | |
| seed | int \| None | |
| sim_dt_ms | float | |
| tag | str | |
| task_config | PickAndPlaceTaskConfig | |
| task_config_preset | PickTaskConfig \| None | |
| task_config_preset_exp | AllTaskConfigs \| None | |
| task_config_preset_scn | AllTaskConfigs \| None | |
| task_horizon | int | |
| task_sampler_config | PickAndPlaceTaskSamplerConfig | |
| task_type | str | |
| use_passive_viewer | bool | |
| use_wandb | bool | |
| viewer_cam_dict | dict | |
| viewer_camera | None | |
| wandb_name | str | |
| wandb_project | str | |
camera_config
class-attribute
instance-attribute
¶
environment_light_intensity
class-attribute
instance-attribute
¶
filter_for_successful_trajectories
class-attribute
instance-attribute
¶
output_dir
class-attribute
instance-attribute
¶
output_dir: Path = Path('eval_output') / f'dummy_{TIMESTAMP}'
policy_config
class-attribute
instance-attribute
¶
robot_config
class-attribute
instance-attribute
¶
task_config
class-attribute
instance-attribute
¶
task_config: PickAndPlaceTaskConfig = PickAndPlaceTaskConfig(task_cls=PickAndPlaceTask)
task_config_preset
class-attribute
instance-attribute
¶
task_config_preset_exp
class-attribute
instance-attribute
¶
task_config_preset_scn
class-attribute
instance-attribute
¶
task_sampler_config
class-attribute
instance-attribute
¶
task_sampler_config: PickAndPlaceTaskSamplerConfig = PickAndPlaceTaskSamplerConfig(task_sampler_class=PickAndPlaceTaskSampler, house_inds=[5, 15, 25, 35, 45, 55, 65, 75, 85, 95, 105, 115, 125, 135, 145], samples_per_house=3)
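As a rough illustration of what this default implies: the house_inds list is every tenth house index from 5 to 145, so the sampler is configured with 15 houses and 3 samples per house. The sketch below only reproduces that arithmetic. The total of 45 is an upper bound under the assumption that every house yields all requested samples, which the source does not state.

```python
# Reproduce the default house selection with a range (illustrative only).
house_inds = list(range(5, 146, 10))
samples_per_house = 3

# Upper bound on sampled tasks, ASSUMING every house yields all its samples.
max_sampled_tasks = len(house_inds) * samples_per_house

print(house_inds)
print(max_sampled_tasks)
```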
viewer_cam_dict
class-attribute
instance-attribute
¶
viewer_cam_dict: dict = {'distance': 5.0, 'azimuth': 45.0, 'elevation': -30.0, 'lookat': [0.0, 0.0, 0.5]}
wandb_name
class-attribute
instance-attribute
¶
wandb_name: str = f'dummy_pick_place_eval_{TIMESTAMP}'
Config
¶
SavedEpisode
¶
Bases: Config
Classes:
| Name | Description |
|---|---|
| Config | |
Methods:
| Name | Description |
|---|---|
| from_dict | Create a configuration instance from a dictionary. |
| load_from_json | Load the configuration from a JSON file. |
| save_to_json | Save the configuration to a JSON file. |
| to_dict | Convert the configuration to a dictionary. |
| to_json | |
Attributes:
| Name | Type | Description |
|---|---|---|
| camera_config | AllCameraSystems \| None | |
| robot_config | FrankaRobotConfig \| None | |
| task_cls_str | str \| None | |
| task_config | PickAndPlaceTaskConfig \| None | |
freeze_task_config
¶
freeze_task_config(observation, task: BaseMujocoTask = None) -> None
Saves the state of a sampled task i.e. an episode
Source code in molmo_spaces/configs/abstract_exp_config.py
from_dict
classmethod
¶
load_config
staticmethod
¶
Loads a configuration from a file
Source code in molmo_spaces/configs/abstract_exp_config.py
load_from_json
classmethod
¶
Load the configuration from a JSON file.
model_post_init
¶
save_config
¶
Saves the current configuration to the output directory
Source code in molmo_spaces/configs/abstract_exp_config.py
save_to_json
¶
to_dict
¶
JsonBenchmarkEvalConfig
¶
Bases: MlSpacesExpConfig
Minimal base config for JSON benchmark evaluation.
This config is designed for use ONLY with JSON benchmarks. It provides the minimal infrastructure needed to run a learned policy against a benchmark where all episode-specific data (task type, cameras, robot poses, object poses, etc.) comes from the benchmark JSON.
Subclass this and provide:
- robot_config: Robot configuration for instantiation
- policy_config: Your learned policy configuration
DO NOT provide task_sampler_config or task_config - those are placeholders that will be overridden by the benchmark. If you accidentally try to use this config for data generation (not evaluation), it will fail because the task sampler/config are minimal stubs.
Example:
    class MyPolicyBenchmarkEvalConfig(JsonBenchmarkEvalConfig):
        robot_config = FrankaRobotConfig()
        policy_config = MyPolicyConfig(checkpoint_path="/path/to/ckpt")
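The subclassing pattern described above can be tried without installing the library. The following is a minimal sketch that uses plain dataclasses as hypothetical stand-ins: RobotConfigStandIn, PolicyConfigStandIn, EvalConfigStandIn and the missing_fields helper are all invented for illustration and are not part of molmo_spaces. It shows only the shape of the pattern: the subclass supplies the robot and policy, and the task fields stay as stubs.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class RobotConfigStandIn:
    """Hypothetical stand-in for a robot config such as FrankaRobotConfig."""
    name: str = "franka"


@dataclass
class PolicyConfigStandIn:
    """Hypothetical stand-in for a learned-policy config."""
    checkpoint_path: str = ""


@dataclass
class EvalConfigStandIn:
    """Hypothetical stand-in for JsonBenchmarkEvalConfig (not the real class)."""
    robot_config: Any = None
    policy_config: Any = None
    # Placeholders: in the documented workflow the benchmark JSON overrides these.
    task_sampler_config: str = "stub"
    task_config: str = "stub"

    def missing_fields(self) -> list:
        """Return the names of user-supplied fields that are still unset."""
        required = ("robot_config", "policy_config")
        return [name for name in required if getattr(self, name) is None]


@dataclass
class MyPolicyBenchmarkEvalConfig(EvalConfigStandIn):
    # Only the robot and the policy are provided; task fields are left alone.
    robot_config: Any = field(default_factory=RobotConfigStandIn)
    policy_config: Any = field(
        default_factory=lambda: PolicyConfigStandIn(checkpoint_path="/path/to/ckpt")
    )


config = MyPolicyBenchmarkEvalConfig()
print(config.missing_fields())
print(EvalConfigStandIn().missing_fields())
```

The real class is a configuration model with its own validation, so treat this only as a picture of which fields a subclass is expected to set.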
Classes:
| Name | Description |
|---|---|
| Config | |
| SavedEpisode | Config information describing a single episode |
Methods:
| Name | Description |
|---|---|
| freeze_task_config | Saves the state of a sampled task i.e. an episode |
| from_dict | Create a configuration instance from a dictionary. |
| load_config | Loads a configuration from a file |
| load_from_json | Load the configuration from a JSON file. |
| model_post_init | This serves as the __init__() called after internal validation of config parameters |
| save_config | Saves the current configuration to the output directory |
| save_to_json | Save the configuration to a JSON file. |
| to_dict | Convert the configuration to a dictionary. |
| to_json | |
Attributes:
| Name | Type | Description |
|---|---|---|
| CameraConfig | type | |
| PolicyConfig | type | |
| RobotConfig | type | |
| benchmark_path | Path \| None | |
| camera_config | None | |
| collision_free_pose_limit | int | |
| config_version | str | |
| ctrl_dt_ms | float | |
| data_split | str | |
| datagen_profiler | bool | |
| end_on_success | bool | |
| environment_light_intensity | float | |
| eval_runtime_params | Any | |
| filter_for_successful_trajectories | bool | |
| fps | float | |
| log_level | str | |
| num_envs | int | |
| num_workers | int | |
| output_dir | Path | |
| policy_config | BasePolicyConfig | |
| policy_dt_ms | float | |
| profile | bool | |
| profiler | Profiler \| None | |
| robot_config | BaseRobotConfig | |
| scene_dataset | str | |
| seed | int \| None | |
| sim_dt_ms | float | |
| tag | str | |
| task_config | BaseMujocoTaskConfig | |
| task_config_preset_exp | AllTaskConfigs \| None | |
| task_config_preset_scn | AllTaskConfigs \| None | |
| task_horizon | int | |
| task_sampler_config | BaseMujocoTaskSamplerConfig | |
| task_type | str | |
| terminate_upon_success | bool | |
| use_passive_viewer | bool | |
| use_wandb | bool | |
| viewer_cam_dict | dict | |
| wandb_name | str \| None | |
| wandb_project | str | |
environment_light_intensity
class-attribute
instance-attribute
¶
filter_for_successful_trajectories
class-attribute
instance-attribute
¶
task_config
class-attribute
instance-attribute
¶
task_config_preset_exp
class-attribute
instance-attribute
¶
task_config_preset_scn
class-attribute
instance-attribute
¶
task_sampler_config
class-attribute
instance-attribute
¶
task_sampler_config: BaseMujocoTaskSamplerConfig = BaseMujocoTaskSamplerConfig(task_sampler_class=BaseMujocoTaskSampler, house_inds=[0], samples_per_house=1, task_batch_size=1, max_tasks=10000, load_robot_from_file=True)
viewer_cam_dict
class-attribute
instance-attribute
¶
viewer_cam_dict: dict = {'distance': 5.0, 'azimuth': 45.0, 'elevation': -30.0, 'lookat': [0.0, 0.0, 0.5]}
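To make the default viewer camera easier to picture, the sketch below converts distance, azimuth, elevation and lookat into a camera position. It assumes these keys follow MuJoCo's free-camera convention (angles in degrees, camera placed `distance` away from `lookat` along the reversed viewing direction). The source does not confirm how viewer_cam_dict is consumed, and camera_position is a hypothetical helper.

```python
import math

viewer_cam_dict = {
    "distance": 5.0,
    "azimuth": 45.0,
    "elevation": -30.0,
    "lookat": [0.0, 0.0, 0.5],
}


def camera_position(cam: dict) -> list:
    """Camera position under the ASSUMED MuJoCo free-camera convention."""
    azimuth = math.radians(cam["azimuth"])
    elevation = math.radians(cam["elevation"])
    # Unit vector pointing from the camera toward the lookat point.
    forward = (
        math.cos(elevation) * math.cos(azimuth),
        math.cos(elevation) * math.sin(azimuth),
        math.sin(elevation),
    )
    # Step back from the lookat point along the viewing direction.
    return [c - cam["distance"] * f for c, f in zip(cam["lookat"], forward)]


position = camera_position(viewer_cam_dict)
print([round(value, 3) for value in position])
```

Under that assumption a negative elevation places the camera above the lookat point, looking down at the scene.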
Config
¶
SavedEpisode
¶
Bases: Config
Config information describing a single episode
Classes:
| Name | Description |
|---|---|
| Config | |
Methods:
| Name | Description |
|---|---|
| from_dict | Create a configuration instance from a dictionary. |
| load_from_json | Load the configuration from a JSON file. |
| save_to_json | Save the configuration to a JSON file. |
| to_dict | Convert the configuration to a dictionary. |
| to_json | |
Attributes:
| Name | Type | Description |
|---|---|---|
| camera_config | AllCameraSystems \| None | |
| robot_config | BaseRobotConfig \| None | |
| task_cls_str | str \| None | |
| task_config | AllTaskConfigs \| None | |
freeze_task_config
¶
freeze_task_config(observation, task: BaseMujocoTask = None) -> None
Saves the state of a sampled task i.e. an episode
Source code in molmo_spaces/configs/abstract_exp_config.py
from_dict
classmethod
¶
load_config
staticmethod
¶
Loads a configuration from a file
Source code in molmo_spaces/configs/abstract_exp_config.py
load_from_json
classmethod
¶
Load the configuration from a JSON file.
model_post_init
¶
This serves as the __init__() called after internal validation of config parameters.
Source code in molmo_spaces/configs/abstract_exp_config.py
save_config
¶
Saves the current configuration to the output directory
Source code in molmo_spaces/configs/abstract_exp_config.py
save_to_json
¶
to_dict
¶
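The from_dict, to_dict, save_to_json and load_from_json methods listed for this class suggest a dictionary and JSON round trip. The sketch below shows that general idea with a hypothetical ConfigStandIn dataclass. The method names mirror the listing, but the signatures (for example, a single path argument) and the field set are assumptions, since they are not shown here.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class ConfigStandIn:
    """Hypothetical stand-in; the real config class may behave differently."""
    task_horizon: int = 100
    use_wandb: bool = False
    tag: str = "demo"

    def to_dict(self) -> dict:
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict) -> "ConfigStandIn":
        return cls(**data)

    def save_to_json(self, path: Path) -> None:
        # Serialize through to_dict so both formats stay in sync.
        path.write_text(json.dumps(self.to_dict(), indent=2))

    @classmethod
    def load_from_json(cls, path: Path) -> "ConfigStandIn":
        return cls.from_dict(json.loads(path.read_text()))


original = ConfigStandIn(task_horizon=300, tag="round_trip")
with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = Path(tmp_dir) / "config.json"
    original.save_to_json(json_path)
    restored = ConfigStandIn.load_from_json(json_path)

print(restored == original)
```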
PiPolicyEvalConfig
¶
Bases: JsonBenchmarkEvalConfig
Classes:
| Name | Description |
|---|---|
| Config | |
| SavedEpisode | Config information describing a single episode |
Methods:
| Name | Description |
|---|---|
| freeze_task_config | Saves the state of a sampled task i.e. an episode |
| from_dict | Create a configuration instance from a dictionary. |
| load_config | Loads a configuration from a file |
| load_from_json | Load the configuration from a JSON file. |
| model_post_init | |
| save_config | Saves the current configuration to the output directory |
| save_to_json | Save the configuration to a JSON file. |
| to_dict | Convert the configuration to a dictionary. |
| to_json | |
Attributes:
| Name | Type | Description |
|---|---|---|
| CameraConfig | type | |
| PolicyConfig | type | |
| RobotConfig | type | |
| benchmark_path | Path \| None | |
| camera_config | None | |
| collision_free_pose_limit | int | |
| config_version | str | |
| ctrl_dt_ms | float | |
| data_split | str | |
| datagen_profiler | bool | |
| end_on_success | bool | |
| environment_light_intensity | float | |
| eval_runtime_params | Any | |
| filter_for_successful_trajectories | bool | |
| fps | float | |
| log_level | str | |
| num_envs | int | |
| num_workers | int | |
| output_dir | Path | |
| policy_config | PiPolicyConfig | |
| policy_dt_ms | float | |
| profile | bool | |
| profiler | Profiler \| None | |
| robot_config | FrankaRobotConfig | |
| scene_dataset | str | |
| seed | int \| None | |
| sim_dt_ms | float | |
| tag | str | |
| task_config | BaseMujocoTaskConfig | |
| task_config_preset_exp | AllTaskConfigs \| None | |
| task_config_preset_scn | AllTaskConfigs \| None | |
| task_horizon | int | |
| task_sampler_config | BaseMujocoTaskSamplerConfig | |
| task_type | str | |
| terminate_upon_success | bool | |
| use_passive_viewer | bool | |
| use_wandb | bool | |
| viewer_cam_dict | dict | |
| wandb_name | str \| None | |
| wandb_project | str | |
environment_light_intensity
class-attribute
instance-attribute
¶
filter_for_successful_trajectories
class-attribute
instance-attribute
¶
robot_config
class-attribute
instance-attribute
¶
task_config
class-attribute
instance-attribute
¶
task_config_preset_exp
class-attribute
instance-attribute
¶
task_config_preset_scn
class-attribute
instance-attribute
¶
task_sampler_config
class-attribute
instance-attribute
¶
task_sampler_config: BaseMujocoTaskSamplerConfig = BaseMujocoTaskSamplerConfig(task_sampler_class=BaseMujocoTaskSampler, house_inds=[0], samples_per_house=1, task_batch_size=1, max_tasks=10000, load_robot_from_file=True)
viewer_cam_dict
class-attribute
instance-attribute
¶
viewer_cam_dict: dict = {'distance': 5.0, 'azimuth': 45.0, 'elevation': -30.0, 'lookat': [0.0, 0.0, 0.5]}
Config
¶
SavedEpisode
¶
Bases: Config
Config information describing a single episode
Classes:
| Name | Description |
|---|---|
| Config | |
Methods:
| Name | Description |
|---|---|
| from_dict | Create a configuration instance from a dictionary. |
| load_from_json | Load the configuration from a JSON file. |
| save_to_json | Save the configuration to a JSON file. |
| to_dict | Convert the configuration to a dictionary. |
| to_json | |
Attributes:
| Name | Type | Description |
|---|---|---|
| camera_config | AllCameraSystems \| None | |
| robot_config | BaseRobotConfig \| None | |
| task_cls_str | str \| None | |
| task_config | AllTaskConfigs \| None | |
freeze_task_config
¶
freeze_task_config(observation, task: BaseMujocoTask = None) -> None
Saves the state of a sampled task i.e. an episode
Source code in molmo_spaces/configs/abstract_exp_config.py
from_dict
classmethod
¶
load_config
staticmethod
¶
Loads a configuration from a file
Source code in molmo_spaces/configs/abstract_exp_config.py
load_from_json
classmethod
¶
Load the configuration from a JSON file.
model_post_init
¶
save_config
¶
Saves the current configuration to the output directory
Source code in molmo_spaces/configs/abstract_exp_config.py
save_to_json
¶
to_dict
¶
TeleopPolicyEvalConfig
¶
Bases: JsonBenchmarkEvalConfig
Classes:
| Name | Description |
|---|---|
| Config | |
| SavedEpisode | Config information describing a single episode |
Methods:
| Name | Description |
|---|---|
| freeze_task_config | Saves the state of a sampled task i.e. an episode |
| from_dict | Create a configuration instance from a dictionary. |
| load_config | Loads a configuration from a file |
| load_from_json | Load the configuration from a JSON file. |
| model_post_init | |
| save_config | Saves the current configuration to the output directory |
| save_to_json | Save the configuration to a JSON file. |
| to_dict | Convert the configuration to a dictionary. |
| to_json | |
Attributes:
| Name | Type | Description |
|---|---|---|
| CameraConfig | type | |
| PolicyConfig | type | |
| RobotConfig | type | |
| benchmark_path | Path \| None | |
| camera_config | None | |
| collision_free_pose_limit | int | |
| config_version | str | |
| ctrl_dt_ms | float | |
| data_split | str | |
| datagen_profiler | bool | |
| end_on_success | bool | |
| environment_light_intensity | float | |
| eval_runtime_params | Any | |
| filter_for_successful_trajectories | bool | |
| fps | float | |
| log_level | str | |
| num_envs | int | |
| num_workers | int | |
| output_dir | Path | |
| policy_config | TeleopPolicyConfig | |
| policy_dt_ms | float | |
| profile | bool | |
| profiler | Profiler \| None | |
| robot_config | FrankaRobotConfig | |
| scene_dataset | str | |
| seed | int \| None | |
| sim_dt_ms | float | |
| tag | str | |
| task_config | BaseMujocoTaskConfig | |
| task_config_preset_exp | AllTaskConfigs \| None | |
| task_config_preset_scn | AllTaskConfigs \| None | |
| task_horizon | int | |
| task_sampler_config | BaseMujocoTaskSamplerConfig | |
| task_type | str | |
| terminate_upon_success | bool | |
| use_passive_viewer | bool | |
| use_wandb | bool | |
| viewer_cam_dict | dict | |
| wandb_name | str \| None | |
| wandb_project | str | |
environment_light_intensity
class-attribute
instance-attribute
¶
filter_for_successful_trajectories
class-attribute
instance-attribute
¶
policy_config
class-attribute
instance-attribute
¶
robot_config
class-attribute
instance-attribute
¶
task_config
class-attribute
instance-attribute
¶
task_config_preset_exp
class-attribute
instance-attribute
¶
task_config_preset_scn
class-attribute
instance-attribute
¶
task_sampler_config
class-attribute
instance-attribute
¶
task_sampler_config: BaseMujocoTaskSamplerConfig = BaseMujocoTaskSamplerConfig(task_sampler_class=BaseMujocoTaskSampler, house_inds=[0], samples_per_house=1, task_batch_size=1, max_tasks=10000, load_robot_from_file=True)
viewer_cam_dict
class-attribute
instance-attribute
¶
viewer_cam_dict: dict = {'distance': 5.0, 'azimuth': 45.0, 'elevation': -30.0, 'lookat': [0.0, 0.0, 0.5]}
Config
¶
SavedEpisode
¶
Bases: Config
Config information describing a single episode
Classes:
| Name | Description |
|---|---|
| Config | |
Methods:
| Name | Description |
|---|---|
| from_dict | Create a configuration instance from a dictionary. |
| load_from_json | Load the configuration from a JSON file. |
| save_to_json | Save the configuration to a JSON file. |
| to_dict | Convert the configuration to a dictionary. |
| to_json | |
Attributes:
| Name | Type | Description |
|---|---|---|
| camera_config | AllCameraSystems \| None | |
| robot_config | BaseRobotConfig \| None | |
| task_cls_str | str \| None | |
| task_config | AllTaskConfigs \| None | |
freeze_task_config
¶
freeze_task_config(observation, task: BaseMujocoTask = None) -> None
Saves the state of a sampled task i.e. an episode
Source code in molmo_spaces/configs/abstract_exp_config.py
from_dict
classmethod
¶
load_config
staticmethod
¶
Loads a configuration from a file
Source code in molmo_spaces/configs/abstract_exp_config.py
load_from_json
classmethod
¶
Load the configuration from a JSON file.
model_post_init
¶
save_config
¶
Saves the current configuration to the output directory
Source code in molmo_spaces/configs/abstract_exp_config.py
save_to_json
¶
to_dict
¶
eval_main
¶
Evaluation entrypoint for learned policies on JSON-based benchmarks.
This module evaluates policies on JSON benchmark files where each episode is fully self-contained. Unlike the pickle-based frozen config approach, JSON benchmarks are human-readable, version-independent, and support mixed task types.
Key differences from run_benchmark_with_learned_policy.py:
- Uses JsonEvalRunner instead of PatchyRunner
- No patch_config needed - JSON episode specs are authoritative
- Timing parameters (policy_dt_ms, ctrl_dt_ms, sim_dt_ms) come from the eval config, NOT from individual episodes. This allows the same benchmark to be run at different control rates.
- Supports mixed task types in the same benchmark
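One way to read the three timing parameters named above is as nested update periods: the policy is queried every policy_dt_ms, the controller runs every ctrl_dt_ms, and the physics advances every sim_dt_ms. That interpretation is an assumption based on the parameter names, and the values and the steps_per_level helper below are hypothetical. The sketch computes how many inner steps fit into each outer step and rejects periods that do not divide evenly.

```python
def steps_per_level(policy_dt_ms: float, ctrl_dt_ms: float, sim_dt_ms: float):
    """Return (control steps per policy step, sim steps per control step).

    ASSUMES policy_dt_ms >= ctrl_dt_ms >= sim_dt_ms and that each period is
    an integer multiple of the next; the source does not state this.
    """
    ctrl_per_policy = policy_dt_ms / ctrl_dt_ms
    sim_per_ctrl = ctrl_dt_ms / sim_dt_ms
    for ratio in (ctrl_per_policy, sim_per_ctrl):
        if ratio < 1 or abs(ratio - round(ratio)) > 1e-9:
            raise ValueError("each period must be an integer multiple of the next")
    return round(ctrl_per_policy), round(sim_per_ctrl)


# Hypothetical values: the same benchmark evaluated at two policy rates.
print(steps_per_level(policy_dt_ms=100.0, ctrl_dt_ms=20.0, sim_dt_ms=2.0))
print(steps_per_level(policy_dt_ms=50.0, ctrl_dt_ms=10.0, sim_dt_ms=2.0))
```

Because these values live in the eval config, changing them alters the control rate without touching the benchmark JSON.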
Programmatic usage (from external repo):
    from molmo_spaces.evaluation import run_evaluation

    results = run_evaluation(
        eval_config_cls=MyEvalConfig,
        benchmark_dir="/path/to/benchmark",
        checkpoint_path="/path/to/checkpoint",
    )
    print(f"Success rate: {results.success_count}/{results.total_count}")
Environment setup (MacOS):
    export PYTHONPATH="${PYTHONPATH}:."
    export MUJOCO_GL=egl
    export PYOPENGL_PLATFORM=egl
Classes:
| Name | Description |
|---|---|
| EvalRuntimeParams | Runtime parameters for evaluation that are not part of the base config schema. |
| EvaluationResults | Results from running an evaluation on a benchmark. |
Functions:
| Name | Description |
|---|---|
| build_success_status_map | Build a map of episode keys to success status for video naming. |
| create_eval_config | Create an MlSpacesExpConfig experiment config from a JSON benchmark for evaluation. |
| determine_task_horizon | Determine task horizon from command line override or benchmark. |
| get_args | |
| main | Command-line entry point for evaluation. |
| run_evaluation | Run evaluation on a JSON benchmark programmatically. |
Attributes:
| Name | Type | Description |
|---|---|---|
| log | | |
EvalRuntimeParams
dataclass
¶
EvalRuntimeParams(episode_idx: int | None = None, max_episodes: int | None = None, add_custom_object: bool = False, custom_object_path: str | Path | None = None, custom_object_name: str | None = None)
Runtime parameters for evaluation that are not part of the base config schema.
These parameters are set during evaluation initialization and used by the evaluation runner to customize episode processing.
Attributes:
| Name | Type | Description |
|---|---|---|
| add_custom_object | bool | |
| custom_object_name | str \| None | |
| custom_object_path | str \| Path \| None | |
| episode_idx | int \| None | |
| max_episodes | int \| None | |
custom_object_path
class-attribute
instance-attribute
¶
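The fields of EvalRuntimeParams hint at how a runner might narrow down which episodes to process. The following sketch is one plausible reading, not the library's logic: it assumes episode_idx selects a single episode and takes precedence over max_episodes, which truncates the list. EvalRuntimeParamsSketch and select_episodes are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path


@dataclass
class EvalRuntimeParamsSketch:
    """Mirrors the documented fields; the behavior below is an assumption."""
    episode_idx: int | None = None
    max_episodes: int | None = None
    add_custom_object: bool = False
    custom_object_path: str | Path | None = None
    custom_object_name: str | None = None


def select_episodes(episodes: list, params: EvalRuntimeParamsSketch) -> list:
    # ASSUMED precedence: a single requested episode wins over a cap.
    if params.episode_idx is not None:
        return [episodes[params.episode_idx]]
    if params.max_episodes is not None:
        return episodes[: params.max_episodes]
    return list(episodes)


episodes = ["ep0", "ep1", "ep2", "ep3"]
print(select_episodes(episodes, EvalRuntimeParamsSketch(max_episodes=2)))
print(select_episodes(episodes, EvalRuntimeParamsSketch(episode_idx=3)))
```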
EvaluationResults
dataclass
¶
EvaluationResults(success_count: int, total_count: int, output_dir: Path, episode_results: list[EpisodeResult] = list(), exp_config: MlSpacesExpConfig | None = None)
Results from running an evaluation on a benchmark.
Attributes:
| Name | Type | Description |
|---|---|---|
| success_count | int | Number of successful episodes |
| total_count | int | Total number of episodes evaluated |
| output_dir | Path | Path where evaluation outputs were saved |
| episode_results | list[EpisodeResult] | Per-episode results with details |
| exp_config | MlSpacesExpConfig \| None | The experiment config used for evaluation |
episode_results
class-attribute
instance-attribute
¶
episode_results: list[EpisodeResult] = field(default_factory=list)
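The package overview prints results.success_rate, which is presumably derived from success_count and total_count. The sketch below mirrors the documented dataclass fields in a hypothetical EvaluationResultsSketch and shows one way such a property could be computed. Returning 0.0 when no episodes ran is an assumption made here, not documented behavior.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class EvaluationResultsSketch:
    """Hypothetical mirror of the documented fields, for illustration only."""
    success_count: int
    total_count: int
    output_dir: Path
    episode_results: list = field(default_factory=list)
    exp_config: object | None = None

    @property
    def success_rate(self) -> float:
        # ASSUMPTION: report 0.0 instead of dividing by zero on empty runs.
        if self.total_count == 0:
            return 0.0
        return self.success_count / self.total_count


results = EvaluationResultsSketch(success_count=3, total_count=4, output_dir=Path("eval_output"))
print(f"Success rate: {results.success_rate:.1%}")
```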
build_success_status_map
¶
build_success_status_map(results: list[EpisodeResult]) -> dict[str, bool]
Build a map of episode keys to success status for video naming.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| results | list[EpisodeResult] | List of episode results | required |
Returns:
| Type | Description |
|---|---|
| dict[str, bool] | Dict mapping episode keys (e.g., "house_5/episode_00000000") to success status |
Source code in molmo_spaces/evaluation/eval_main.py
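As an illustration of the mapping described for build_success_status_map, the sketch below builds keys in the documented "house_5/episode_00000000" shape. The fields of EpisodeResult are not shown in this section, so EpisodeResultStandIn and its house_id, episode_idx and success attributes are hypothetical, as is the eight-digit zero padding inferred from the example key.

```python
from dataclasses import dataclass


@dataclass
class EpisodeResultStandIn:
    """Hypothetical stand-in; the real EpisodeResult fields may differ."""
    house_id: int
    episode_idx: int
    success: bool


def build_success_status_map_sketch(results: list) -> dict:
    """Map 'house_<id>/episode_<8-digit index>' keys to success flags."""
    return {
        f"house_{result.house_id}/episode_{result.episode_idx:08d}": result.success
        for result in results
    }


status_map = build_success_status_map_sketch(
    [
        EpisodeResultStandIn(house_id=5, episode_idx=0, success=True),
        EpisodeResultStandIn(house_id=5, episode_idx=1, success=False),
    ]
)
print(status_map)
```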
create_eval_config
¶
create_eval_config(eval_config_cls: type[MlSpacesExpConfig], benchmark_dir: Path, output_dir: Path, checkpoint_path: str | None, task_horizon: int, num_workers: int, camera_config_override: Any | None = None) -> MlSpacesExpConfig
Create an MlSpacesExpConfig experiment config from a JSON benchmark for evaluation.
The eval config class provides:
- policy_config: Policy configuration (checkpoint, camera names, etc.)
- robot_config: Robot configuration
- Timing parameters: policy_dt_ms, ctrl_dt_ms, sim_dt_ms
The benchmark provides:
- Scene/task configuration (per-episode)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| eval_config_cls | type[MlSpacesExpConfig] | The eval config class to instantiate | required |
| benchmark_dir | Path | Path to JSON benchmark directory | required |
| output_dir | Path | Output directory for results | required |
| checkpoint_path | str \| None | Optional override for checkpoint path | required |
| task_horizon | int | Task horizon (already resolved from defaults or override) | required |
| num_workers | int | Number of worker processes | required |
Returns:
| Type | Description |
|---|---|
| MlSpacesExpConfig | Configured MlSpacesExpConfig |
Source code in molmo_spaces/evaluation/eval_main.py
determine_task_horizon
¶
determine_task_horizon(episodes: list[EpisodeSpec], task_horizon_override: int | None, policy_dt_ms: float | None = None) -> int
Determine task horizon from command line override or benchmark.
Priority:
1. Explicit override (from CLI --task_horizon_steps or --task_horizon_sec)
2. Benchmark's per-episode task_horizon_sec (converted to steps via policy_dt_ms)
Fails loudly if the benchmark does not contain task_horizon_sec and no explicit override was provided.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `episodes` | `list[EpisodeSpec]` | List of episode specs from the benchmark | required |
| `task_horizon_override` | `int \| None` | Optional override from command line | required |
| `policy_dt_ms` | `float \| None` | Policy timestep in milliseconds, required when reading task_horizon_sec from the benchmark. | `None` |
Returns:
| Type | Description |
|---|---|
| `int` | Task horizon (in steps) to use for all episodes |
Source code in molmo_spaces/evaluation/eval_main.py
get_args
Source code in molmo_spaces/evaluation/eval_main.py
main
Command-line entry point for evaluation.
Source code in molmo_spaces/evaluation/eval_main.py
run_evaluation
run_evaluation(eval_config_cls: type[MlSpacesExpConfig] | str, benchmark_dir: Path, checkpoint_path: str | None = None, task_horizon_steps: int | None = None, task_horizon_sec: float | None = None, output_dir: str | Path | None = None, num_workers: int = 1, use_wandb: bool = False, wandb_project: str = 'mlspaces-online-eval', preloaded_policy: BasePolicy | None = None, max_episodes: int | None = None, camera_config_override: Any | None = None, camera_names_override: list[str] | None = None, environment_light_intensity: float | None = None, episode_idx: int | None = None, add_custom_object: bool = False, custom_object_path: str | Path | None = None, custom_object_name: str | None = None) -> EvaluationResults
Run evaluation on a JSON benchmark programmatically.
This is the primary entry point for running evaluations from external code. It can be imported and called directly without using command-line arguments.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `eval_config_cls` | `type[MlSpacesExpConfig] \| str` | Either an MlSpacesExpConfig subclass, or a string in the format "module.path:ClassName" (e.g., "myrepo.configs:MyEvalConfig"). | required |
| `benchmark_dir` | `Path` | Path to JSON benchmark directory containing benchmark.json. | required |
| `checkpoint_path` | `str \| None` | Path to model checkpoint. Overrides the checkpoint in policy_config. | `None` |
| `task_horizon_steps` | `int \| None` | Max steps per episode. If None, uses default for the task class. | `None` |
| `task_horizon_sec` | `float \| None` | Max seconds per episode, used to calculate horizon in steps. Cannot be used with task_horizon_steps. | `None` |
| `output_dir` | `str \| Path \| None` | Output directory for results. Defaults to eval_output/ | `None` |
| `num_workers` | `int` | Number of parallel worker processes. | `1` |
| `use_wandb` | `bool` | Whether to log results to Weights & Biases. | `False` |
| `wandb_project` | `str` | W&B project name (only used if use_wandb=True). | `'mlspaces-online-eval'` |
| `preloaded_policy` | `BasePolicy \| None` | Optional pre-initialized policy instance. If provided, skips policy creation from config. | `None` |
| `max_episodes` | `int \| None` | Maximum number of episodes to evaluate from benchmark. If None, evaluates all episodes. | `None` |
| `camera_config_override` | `Any \| None` | Optional camera system config (e.g. FrankaEvalCameraSystem) to replace the default camera_config on the experiment config. | `None` |
| `camera_names_override` | `list[str] \| None` | Optional list of camera names to override policy_config.camera_names (e.g. ["randomized_zed2_analogue_1", "wrist_camera"]). | `None` |
| `episode_idx` | `int \| None` | Index of a specific episode to evaluate. If None, evaluates all episodes. | `None` |
| `add_custom_object` | `bool` | Whether to replace the target object with a custom object. | `False` |
| `custom_object_path` | `str \| Path \| None` | Path to the custom object XML file. Required if add_custom_object is True. | `None` |
| `custom_object_name` | `str \| None` | Natural language name for the custom object (e.g., 'lemon', 'cup'). If not provided, will attempt to extract from the object path. | `None` |
Returns:
| Type | Description |
|---|---|
| `EvaluationResults` | EvaluationResults containing success counts, output paths, and per-episode details. |
Raises:
| Type | Description |
|---|---|
| `FileNotFoundError` | If benchmark_dir doesn't exist. |
| `ValueError` | If no episodes found in benchmark or config class not found. |
Example
```python
from molmo_spaces.evaluation import run_evaluation
from my_repo.configs import MyEvalConfig

results = run_evaluation(
    eval_config_cls=MyEvalConfig,
    benchmark_dir="/path/to/benchmark",
    checkpoint_path="/path/to/checkpoint.pt",
    task_horizon_steps=500,
)
print(f"Success rate: {results.success_rate:.1%}")
```
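When eval_config_cls is given as a "module.path:ClassName" string, it has to be resolved to a class before use. The sketch below shows one common way to do that with `importlib`; the helper name `resolve_config_class` is hypothetical, and the library's own resolution logic may differ. A standard-library class stands in for a real config class so the snippet runs anywhere.

```python
import importlib
from collections import OrderedDict


def resolve_config_class(spec):
    """Return a class from either a class object or 'module.path:ClassName'."""
    if not isinstance(spec, str):
        return spec  # already a class, nothing to resolve
    module_path, sep, class_name = spec.partition(":")
    if not sep or not module_path or not class_name:
        raise ValueError(f"expected 'module.path:ClassName', got {spec!r}")
    module = importlib.import_module(module_path)
    try:
        return getattr(module, class_name)
    except AttributeError as exc:
        raise ValueError(f"{class_name!r} not found in {module_path!r}") from exc


# Stand-in for a real config class such as "myrepo.configs:MyEvalConfig".
resolved = resolve_config_class("collections:OrderedDict")
```

Accepting either form lets callers pass a class directly from Python code or a string from a command line or config file.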
Source code in molmo_spaces/evaluation/eval_main.py
json_eval_runner
JSON-based benchmark evaluation runner.
This runner loads episode specifications from JSON benchmark files and runs policy evaluations against them. Unlike the pickle-based frozen config approach, JSON specs are fully self-contained and human-readable.
Key design principles:
- Each episode is fully self-contained in JSON (no external config dependencies)
- Timing parameters (policy_dt_ms, ctrl_dt_ms, sim_dt_ms) come from the eval config, NOT from individual episodes. This allows the same benchmark to be run at different control rates without modifying the benchmark files.
- Task type can vary per episode (mixed task types in same benchmark)
- No patch_config needed - JSON is authoritative

Usage
```python
from molmo_spaces.evaluation import JsonEvalRunner, load_benchmark

# Load benchmark and create config
metadata, episodes_by_house = load_benchmark(benchmark_dir)
runner = JsonEvalRunner(exp_config, benchmark_dir)
success_count, total_count = runner.run(preloaded_policy=policy)
```
Classes:
| Name | Description |
|---|---|
| `JsonEvalRunner` | Evaluation runner for JSON-based benchmarks. |
Attributes:
| Name | Type | Description |
|---|---|---|
| `log` | | |
JsonEvalRunner
Bases: ParallelRolloutRunner
Evaluation runner for JSON-based benchmarks.
This runner differs from the standard ParallelRolloutRunner in several ways:
1. Episodes are loaded from JSON files, not from H5 frozen configs
2. Each episode is fully self-contained (timing, cameras, task config)
3. Task samplers are created per-episode to support mixed task types
4. Uses patch_config to add evaluation-specific runtime parameters

The runner inherits process_single_house from ParallelRolloutRunner and customizes behavior by overriding hook methods.
Initialize the JSON eval runner.
The benchmark is authoritative - all episode data comes from the JSON files. No fallbacks or defaults; missing data is an error.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `exp_config` | `MlSpacesExpConfig` | Base experiment config (provides robot_config, policy_config) | required |
| `benchmark_dir` | `Path` | Path to benchmark directory containing benchmark.json | required |
Methods:
| Name | Description |
|---|---|
| `adjust_robot` | Apply robot-specific evaluation overrides if configured. |
| `get_episode_seed` | Get seed from episode spec, falling back to index. |
| `get_episode_spec_at_index` | Get episode specification at given index. |
| `get_episode_task_sampler` | Create per-episode JsonEvalTaskSampler. |
| `get_episodes_for_house` | Get all episode specs for a given house. |
| `get_max_episode_attempts` | Process all episodes in the benchmark - no retry multiplier. |
| `load_episodes_for_house` | Load episode specifications from JSON benchmark. |
| `patch_config` | Patch evaluation config with runtime evaluation-specific parameters. |
| `prepare_episode_config` | Prepare episode-specific config from JSON spec. |
| `process_single_house` | Process all episodes for a single house using customizable hooks. |
| `run` | Run house-by-house rollouts using multiprocessing workers. |
| `run_single_rollout` | Execute a single rollout with the given task and policy. |
| `sample_task_from_spec` | Sample task - episode spec is already in the JsonEvalTaskSampler. |
| `should_close_episode_task_sampler` | Close task sampler after each episode - we create per-episode. |
| `should_stop_early` | Stop early if evaluating a single episode (--idx provided) and it's been collected. |
Attributes:
Source code in molmo_spaces/evaluation/json_eval_runner.py
max_allowed_sequential_irrecoverable_failures
instance-attribute
max_allowed_sequential_rollout_failures
instance-attribute
max_allowed_sequential_task_sampler_failures
instance-attribute
adjust_robot
staticmethod
Apply robot-specific evaluation overrides if configured.
Source code in molmo_spaces/evaluation/json_eval_runner.py
get_episode_seed
staticmethod
get_episode_seed(episode_idx: int, episode_spec: EpisodeSpec, task_sampler: JsonEvalTaskSampler) -> int
Get seed from episode spec, falling back to index.
Source code in molmo_spaces/evaluation/json_eval_runner.py
get_episode_spec_at_index
staticmethod
get_episode_spec_at_index(episode_specs: list[EpisodeSpec], idx: int) -> EpisodeSpec
get_episode_task_sampler
staticmethod
get_episode_task_sampler(exp_config: MlSpacesExpConfig, episode_spec: EpisodeSpec, shared_task_sampler, datagen_profiler: DatagenProfiler | None) -> JsonEvalTaskSampler
Create per-episode JsonEvalTaskSampler.
Source code in molmo_spaces/evaluation/json_eval_runner.py
get_episodes_for_house
get_episodes_for_house(house_id: int) -> list[EpisodeSpec]
Get all episode specs for a given house.
Source code in molmo_spaces/evaluation/json_eval_runner.py
get_max_episode_attempts
staticmethod
get_max_episode_attempts(episode_specs: list[EpisodeSpec], samples_per_house: int, exp_config: MlSpacesExpConfig) -> int
Process all episodes in the benchmark - no retry multiplier.
Source code in molmo_spaces/evaluation/json_eval_runner.py
load_episodes_for_house
staticmethod
load_episodes_for_house(exp_config: MlSpacesExpConfig, house_id: int, batch_suffix: str, worker_task_sampler, worker_logger) -> tuple[list[EpisodeSpec], None]
Load episode specifications from JSON benchmark.
Source code in molmo_spaces/evaluation/json_eval_runner.py
patch_config
staticmethod
patch_config(exp_config: MlSpacesExpConfig, episode_idx: int | None = None, max_episodes: int | None = None, add_custom_object: bool = False, custom_object_path: str | Path | None = None, custom_object_name: str | None = None) -> MlSpacesExpConfig
Patch evaluation config with runtime evaluation-specific parameters.
This method modifies the config object to store evaluation-specific runtime parameters that are not part of the base config schema. These parameters are used by the evaluation runner to customize episode processing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `exp_config` | `MlSpacesExpConfig` | The experiment config to patch | required |
| `episode_idx` | `int \| None` | Optional index of a specific episode to evaluate. If provided, only that episode will be evaluated and the process will stop after it. | `None` |
| `max_episodes` | `int \| None` | Optional maximum number of episodes to evaluate. If provided, only the episodes for the houses used in the first N episodes will be evaluated. Note that the final number of episodes can differ from N if more than one episode is sampled for any of the houses among the first N episodes. | `None` |
| `add_custom_object` | `bool` | Whether to replace the target object with a custom object. | `False` |
| `custom_object_path` | `str \| Path \| None` | Path to the custom object XML file. Required if add_custom_object is True. | `None` |
| `custom_object_name` | `str \| None` | Natural language name for the custom object (e.g., 'lemon', 'cup'). | `None` |
Returns:
| Type | Description |
|---|---|
| `MlSpacesExpConfig` | The patched config (same object, modified in place) |
Note
These parameters are stored in an EvalRuntimeParams dataclass attached to the config object as exp_config.eval_runtime_params for access by worker processes. They are not part of the base MlSpacesExpConfig schema but are necessary for runtime evaluation customization.
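The max_episodes behavior described in the parameter table can be surprising, so here is a small sketch of the selection rule as documented: houses are taken from the first N episodes, then every episode belonging to those houses is kept. The function name `select_episodes` and the `(house_id, episode_name)` tuples are hypothetical simplifications, not the library's data structures.

```python
def select_episodes(episodes, max_episodes):
    """episodes: list of (house_id, episode_name) tuples in benchmark order."""
    if max_episodes is None:
        return list(episodes)
    # Houses that appear among the first N episodes.
    houses = {house_id for house_id, _ in episodes[:max_episodes]}
    # Keep every episode from those houses, even ones beyond the first N.
    return [ep for ep in episodes if ep[0] in houses]


benchmark = [(1, "a"), (2, "b"), (1, "c"), (3, "d")]
selected = select_episodes(benchmark, max_episodes=2)
```

With N = 2 the first two episodes use houses 1 and 2, and house 1 contributes a second episode later in the list, so three episodes are evaluated rather than two.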
Source code in molmo_spaces/evaluation/json_eval_runner.py
prepare_episode_config
staticmethod
prepare_episode_config(exp_config: MlSpacesExpConfig, episode_spec: EpisodeSpec, episode_idx: int) -> MlSpacesExpConfig
Prepare episode-specific config from JSON spec.
Note: task_horizon is NOT read from episode_spec. It's an evaluation parameter that comes from exp_config (set via command line or defaults).
Source code in molmo_spaces/evaluation/json_eval_runner.py
process_single_house
staticmethod
process_single_house(worker_id: int, worker_logger, house_id: int, exp_config: MlSpacesExpConfig, samples_per_house: int, shutdown_event, task_sampler, preloaded_policy: BasePolicy | None = None, max_allowed_sequential_task_sampler_failures: int = 10, max_allowed_sequential_rollout_failures: int = 10, filter_for_successful_trajectories: bool = False, runner_class=None, batch_num: int | None = None, total_batches: int | None = None, datagen_profiler: DatagenProfiler | None = None) -> tuple[int, int, bool]
Process all episodes for a single house using customizable hooks.
This method uses a while loop to iterate over episodes, calling hook methods via runner_class to allow subclasses to customize behavior without duplicating the entire method.
Hooks called (override in subclass to customize):
- load_episodes_for_house: Load episode specs from source (JSON, etc.)
- get_max_episode_attempts: Maximum iterations of the episode loop
- should_stop_early: Whether to stop before max attempts (e.g., enough successes)
- prepare_episode_config: Modify config per-episode
- get_episode_task_sampler: Get/create task sampler for episode
- sample_task_from_spec: Sample task from specification
- get_episode_seed: Get seed for episode
- should_close_episode_task_sampler: Whether to close sampler per-episode

Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `worker_id` | `int` | ID of the worker thread/process | required |
| `worker_logger` | | Logger instance for this worker | required |
| `house_id` | `int` | Index of the house to process | required |
| `exp_config` | `MlSpacesExpConfig` | Experiment configuration | required |
| `samples_per_house` | `int` | Number of episodes to collect for this house | required |
| `shutdown_event` | | Event to signal shutdown | required |
| `task_sampler` | | Task sampler instance (shared across houses for this worker) | required |
| `preloaded_policy` | `BasePolicy \| None` | Optional pre-initialized policy instance | `None` |
| `max_allowed_sequential_task_sampler_failures` | `int` | Max consecutive task sampling failures | `10` |
| `max_allowed_sequential_rollout_failures` | `int` | Max consecutive rollout failures | `10` |
| `filter_for_successful_trajectories` | `bool` | Whether to filter for successful trajectories only | `False` |
| `runner_class` | | Runner class with hook methods to call | `None` |
| `batch_num` | `int \| None` | Batch number for this house (for batched processing) | `None` |
| `total_batches` | `int \| None` | Total number of batches for this house | `None` |
| `datagen_profiler` | `DatagenProfiler \| None` | DatagenProfiler for per-worker timing (optional) | `None` |
Returns:
| Name | Type | Description |
|---|---|---|
| `tuple` | `tuple[int, int, bool]` | (house_success_count, house_total_count, irrecoverable_failure_flag) |
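The hook mechanism described above follows a template-method pattern: a shared loop calls static hooks through runner_class, and a subclass changes behavior by overriding only the hooks. The sketch below is a stripped-down illustration with three of the hooks; the class names, simplified signatures, and placeholder episode strings are assumptions and do not mirror the real ParallelRolloutRunner.

```python
class BaseRunner:
    """Hypothetical base runner with overridable static hooks."""

    @staticmethod
    def load_episodes_for_house(house_id):
        return [f"sampled_{i}" for i in range(3)]

    @staticmethod
    def get_max_episode_attempts(episode_specs, samples_per_house):
        return samples_per_house * 2  # allow retries (assumed multiplier)

    @staticmethod
    def should_stop_early(num_collected, samples_per_house):
        return num_collected >= samples_per_house

    @staticmethod
    def process_single_house(house_id, samples_per_house, runner_class=None):
        # Hooks are looked up on runner_class so subclasses can replace them.
        runner_class = runner_class or BaseRunner
        specs = runner_class.load_episodes_for_house(house_id)
        max_attempts = runner_class.get_max_episode_attempts(specs, samples_per_house)
        collected, attempt = [], 0
        while attempt < max_attempts and not runner_class.should_stop_early(
            len(collected), samples_per_house
        ):
            collected.append(specs[attempt % len(specs)])
            attempt += 1
        return collected


class JsonStyleRunner(BaseRunner):
    """Overrides hooks only; the episode loop itself is inherited."""

    @staticmethod
    def load_episodes_for_house(house_id):
        return [f"house_{house_id}/episode_{i:08d}" for i in range(2)]

    @staticmethod
    def get_max_episode_attempts(episode_specs, samples_per_house):
        return len(episode_specs)  # one pass over the benchmark, no retries

    @staticmethod
    def should_stop_early(num_collected, samples_per_house):
        return False


base_run = BaseRunner.process_single_house(5, samples_per_house=2)
json_run = BaseRunner.process_single_house(
    5, samples_per_house=10, runner_class=JsonStyleRunner
)
```

Passing runner_class explicitly matters because the hooks are static methods, so the loop has no `self` through which to reach a subclass's overrides.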
Source code in molmo_spaces/data_generation/pipeline.py
run
run(preloaded_policy: BasePolicy | None = None) -> tuple[int, int]
Run house-by-house rollouts using multiprocessing workers.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `preloaded_policy` | `BasePolicy \| None` | Optional pre-initialized policy instance to use for rollouts. If None, a new policy will be created for each rollout. | `None` |
Returns:
| Name | Type | Description |
|---|---|---|
| `tuple` | `tuple[int, int]` | (success_count, total_count) |
Source code in molmo_spaces/data_generation/pipeline.py
run_single_rollout
staticmethod
run_single_rollout(episode_seed: int, task: BaseMujocoTask, policy: Any, profiler: Profiler | None = None, viewer=None, shutdown_event=None, datagen_profiler: DatagenProfiler | None = None, end_on_success: bool = False) -> bool
Execute a single rollout with the given task and policy.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `episode_seed` | `int` | Seed for this episode | required |
| `task` | `BaseMujocoTask` | The task to run | required |
| `policy` | `Any` | Policy to use for action selection | required |
| `profiler` | `Profiler \| None` | Legacy Profiler instance (optional) | `None` |
| `viewer` | | MuJoCo viewer for visualization (optional) | `None` |
| `shutdown_event` | | Event to signal shutdown (optional) | `None` |
| `datagen_profiler` | `DatagenProfiler \| None` | DatagenProfiler for per-worker timing (optional) | `None` |
Returns:
| Name | Type | Description |
|---|---|---|
| `bool` | `bool` | Whether the episode was successful |
Source code in molmo_spaces/data_generation/pipeline.py
sample_task_from_spec
staticmethod
sample_task_from_spec(task_sampler: JsonEvalTaskSampler, house_id: int, episode_spec: EpisodeSpec, episode_idx: int) -> BaseMujocoTask | None
Sample task - episode spec is already in the JsonEvalTaskSampler.
Source code in molmo_spaces/evaluation/json_eval_runner.py
should_close_episode_task_sampler
staticmethod
should_stop_early
staticmethod
should_stop_early(num_collected: int, samples_per_house: int, exp_config: MlSpacesExpConfig | None = None) -> bool
Stop early if evaluating a single episode (--idx provided) and it's been collected.
Source code in molmo_spaces/evaluation/json_eval_runner.py
policy_server
Modified from: https://github.com/Physical-Intelligence/openpi/blob/main/src/openpi/serving/websocket_policy_server.py
Classes:
| Name | Description |
|---|---|
| `MutableFloat` | |
| `WebsocketPolicyServer` | Serves a policy using the websocket protocol. |
Functions:
| Name | Description |
|---|---|
| `measure_elapsed` | |
Attributes:
| Name | Type | Description |
|---|---|---|
| `logger` | | |
MutableFloat
dataclass
WebsocketPolicyServer
WebsocketPolicyServer(policies: InferencePolicy | list[InferencePolicy], model_name: str, host: str = '0.0.0.0', port: int | None = None, metadata: dict | None = None, max_concurrency: int = 100, force_concurrent: bool = False)
Serves a policy using the websocket protocol.
Concurrent inference is supported for stateful policies via state saving. Non-stateful policies default to nonconcurrent inference unless force_concurrent is True.
To support concurrent inference, the server tracks policy state internally.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `policies` | `InferencePolicy \| list[InferencePolicy]` | Multiple copies of the same policy to serve; requests are balanced across the copies for concurrent inference. If a single policy is passed instead of a list, it is used as the only policy. | required |
| `model_name` | `str` | The name of the model to serve. Will be included in the metadata. | required |
| `host` | `str` | The host to serve the policy on. | `'0.0.0.0'` |
| `port` | `int \| None` | The port to serve the policy on. | `None` |
| `metadata` | `dict \| None` | Additional metadata to serve with the policy. | `None` |
| `max_concurrency` | `int` | The maximum number of concurrent clients to serve. Ignored for non-stateful policies unless force_concurrent is True. | `100` |
| `force_concurrent` | `bool` | Whether to force concurrent inference for non-stateful policies. This may cause bugs if the policy is not safe for concurrency. | `False` |
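To make the interplay of policies and max_concurrency concrete, the sketch below balances requests over identical policy copies and caps the number in flight. It is an illustration only: the round-robin strategy, the semaphore, and the `RoundRobinDispatcher` class are assumptions, since the server's actual balancing and state-tracking logic is not shown here. The policies are plain callables standing in for InferencePolicy objects.

```python
import asyncio
import itertools


class RoundRobinDispatcher:
    """Hypothetical helper: spreads requests over identical policy copies."""

    def __init__(self, policies, max_concurrency=100):
        self._policies = itertools.cycle(policies)
        # Caps the number of requests being served at once.
        self._slots = asyncio.Semaphore(max_concurrency)

    async def infer(self, observation):
        async with self._slots:
            policy = next(self._policies)  # pick the next copy in turn
            # Run the blocking policy call off the event loop thread.
            return await asyncio.to_thread(policy, observation)


async def main():
    # Two stand-in "policies" that tag their output with their index.
    policies = [lambda obs, i=i: f"policy_{i}:{obs}" for i in range(2)]
    dispatcher = RoundRobinDispatcher(policies, max_concurrency=2)
    return await asyncio.gather(*(dispatcher.infer(n) for n in range(4)))


outputs = asyncio.run(main())
```

With two copies and four requests, each copy serves two requests, and at most two run at the same time.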
Methods:
| Name | Description |
|---|---|
| `serve_forever` | Prepares the policy and starts the server. |
Source code in molmo_spaces/evaluation/policy_server.py
serve_forever
Prepares the policy and starts the server.
Source code in molmo_spaces/evaluation/policy_server.py
robot_eval_overrides
Functions:
| Name | Description |
|---|---|
| `cap_robot_eval_override` | |
| `get_robot_override` | |
Attributes:
| Name | Type | Description |
|---|---|---|
| `OverrideFn` | | |
| `ROBOT_OVERRIDE_REGISTRY` | `dict[str, OverrideFn]` | |
| `log` | | |
ROBOT_OVERRIDE_REGISTRY
module-attribute
ROBOT_OVERRIDE_REGISTRY: dict[str, OverrideFn] = {'FrankaCAPRobotConfig': cap_robot_eval_override}
cap_robot_eval_override
cap_robot_eval_override(episode_spec: EpisodeSpec, camera_config: CameraSystemConfig) -> None
Source code in molmo_spaces/evaluation/robot_eval_overrides.py
get_robot_override
get_robot_override(robot_config) -> OverrideFn | None
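Since these functions are undocumented, the following sketch shows how a registry keyed by config class name is typically consulted. It assumes the lookup uses the class name of robot_config, which is a guess based on the registry's string keys; the stand-in classes, the dict-based camera config, and the body of the override are all hypothetical.

```python
from typing import Callable, Optional

# Signature modeled on cap_robot_eval_override(episode_spec, camera_config) -> None.
OverrideFn = Callable[[object, dict], None]


def cap_override(episode_spec: object, camera_config: dict) -> None:
    """Placeholder override: mutates the camera config in place."""
    camera_config["overridden"] = True


REGISTRY: dict[str, OverrideFn] = {"FrankaCAPRobotConfig": cap_override}


def lookup_override(robot_config: object) -> Optional[OverrideFn]:
    # Assumption: the registry key is the robot config's class name.
    return REGISTRY.get(type(robot_config).__name__)


class FrankaCAPRobotConfig:  # stand-in for the real config class
    pass


class OtherRobotConfig:  # a robot with no registered override
    pass


camera_config: dict = {}
override = lookup_override(FrankaCAPRobotConfig())
if override is not None:
    override(None, camera_config)
```

Returning None for unregistered robots lets the caller skip the override step without special-casing each robot type.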