Per-episode sensitivity recording (eval + metrics)#781
Conversation
Produce the episode_summary.jsonl that the sensitivity toolbox (PR #729) consumes: - episode_writer.write_episode_summaries: one JSONL row per episode (job_name, episode_idx, full arena_env_args, per-episode outcomes) - eval_runner / eval_runner_cli: opt-in --episode_summary flag - job_manager: expose arena_env_args_dict for logging - metrics_manager: compute_per_episode for the per-episode outcome values Stacked on cvolk/feature/sensitivity_analysis_mvp1 (#729). Signed-off-by: Clemens Volk <cvolk@nvidia.com>
Greptile SummaryThis PR adds per-episode sensitivity recording to the eval pipeline: a new
Confidence Score: 3/5The per-episode recording feature works correctly for the bundled 2-episode-per-job sweep configs, but the underlying HDF5 reading logic in compute_per_episode has unresolved data-integrity issues at higher episode counts and fragile error handling in the eval_runner integration that were raised in prior review rounds and remain in the code. The new compute_per_episode method inherits the same fragile HDF5 access pattern as its predecessor: it calls get_num_episodes and get_recorded_metric_data in separate file opens, so the loop bound and the array lengths can diverge. For jobs with ten or more episodes, iterating h5py's data group alphabetically produces demo_0, demo_1, demo_10 rather than numeric order, so episode_idx values written to the JSONL silently map to the wrong HDF5 demos. Additionally, write_episode_summaries lives inside the try-block that drives complete_job, so an I/O failure during JSONL writing discards already-computed rollout metrics and marks a successful job as failed. isaaclab_arena/metrics/metrics_manager.py and isaaclab_arena/metrics/metrics.py (get_recorded_metric_data iteration order), isaaclab_arena/evaluation/eval_runner.py (write_episode_summaries placement relative to the exception handler) Important Files Changed
Sequence DiagramsequenceDiagram
participant CLI as eval_runner (main)
participant RP as rollout_policy()
participant Rec as HDF5 Recorder
participant MM as MetricsManager
participant EW as episode_writer
participant JSONL as episode_summary.jsonl
CLI->>RP: "rollout_policy(env, policy, num_episodes=N)"
loop per episode
RP->>Rec: writes demo_i data to HDF5
end
RP-->>CLI: aggregate metrics dict
alt episode_summary_enabled
CLI->>EW: write_episode_summaries(env, job, path)
EW->>MM: compute_per_episode()
MM->>Rec: get_num_episodes(dataset_path) [HDF5 open 1]
loop per term
MM->>Rec: get_recorded_metric_data(dataset_path, term) [HDF5 open 2..N]
end
MM-->>EW: list[dict per episode]
loop per episode_idx (enumerate)
EW->>JSONL: "append {job_name, episode_idx, arena_env_args, outcomes}"
end
EW-->>CLI: rows written
end
CLI->>CLI: job_manager.complete_job(COMPLETED)
Reviews (3): Last reviewed commit: "Remove the single-factor light_intensity..." | Re-trigger Greptile |
| if episode_summary_enabled: | ||
| rows = write_episode_summaries(env, job, args_cli.episode_summary) | ||
| print(f"[INFO] Wrote {rows} episode summaries for job '{job.name}'") |
There was a problem hiding this comment.
Summary write failure silently fails a successful job
write_episode_summaries runs inside the outer try/except block, so any I/O failure (disk full, permission denied, bad JSON serialization) will trigger the except handler at line 273. That handler calls job_manager.complete_job(job, metrics={}, status=Status.FAILED), discarding the already-computed metrics from rollout_policy and marking a successful rollout as FAILED. Since the summary is opt-in and supplemental, its failure should be isolated — otherwise a temporary disk issue poisons your eval results.
| dataset_path = get_metric_recorder_dataset_path(self._env) | ||
| num_episodes = get_num_episodes(dataset_path) | ||
|
|
||
| # Recorded data arrives grouped by metric (each term -> one array per episode). | ||
| # Read it once here, then transpose into one metric dict per episode below. | ||
| episode_arrays_by_term = { | ||
| term_name: get_recorded_metric_data(dataset_path, term_cfg.recorder_term_name) | ||
| for term_name, term_cfg in zip(self._term_names, self._term_cfgs) | ||
| } | ||
|
|
||
| per_episode_metrics: list[dict[str, Any]] = [] | ||
| for episode_index in range(num_episodes): |
There was a problem hiding this comment.
Extra HDF5 open + potential
IndexError between two independent reads
get_num_episodes opens and closes the HDF5 file once more after episode_arrays_by_term has already been populated. The two reads are independent h5py.File opens, so if the returned list lengths don't match num_episodes (e.g., a term's dataset has fewer demos than the top-level demo count), episode_arrays_by_term[term_name][episode_index] will raise an IndexError. Deriving the episode count directly from the already-loaded lists avoids the redundant open and keeps both measurements consistent.
| dataset_path = get_metric_recorder_dataset_path(self._env) | |
| num_episodes = get_num_episodes(dataset_path) | |
| # Recorded data arrives grouped by metric (each term -> one array per episode). | |
| # Read it once here, then transpose into one metric dict per episode below. | |
| episode_arrays_by_term = { | |
| term_name: get_recorded_metric_data(dataset_path, term_cfg.recorder_term_name) | |
| for term_name, term_cfg in zip(self._term_names, self._term_cfgs) | |
| } | |
| per_episode_metrics: list[dict[str, Any]] = [] | |
| for episode_index in range(num_episodes): | |
| dataset_path = get_metric_recorder_dataset_path(self._env) | |
| # Recorded data arrives grouped by metric (each term -> one array per episode). | |
| # Read it once here, then transpose into one metric dict per episode below. | |
| episode_arrays_by_term = { | |
| term_name: get_recorded_metric_data(dataset_path, term_cfg.recorder_term_name) | |
| for term_name, term_cfg in zip(self._term_names, self._term_cfgs) | |
| } | |
| # Derive episode count from the already-loaded data to stay consistent with it. | |
| # Fall back to get_num_episodes only when there are no registered terms. | |
| if episode_arrays_by_term: | |
| num_episodes = min(len(v) for v in episode_arrays_by_term.values()) | |
| else: | |
| num_episodes = get_num_episodes(dataset_path) | |
| per_episode_metrics: list[dict[str, Any]] = [] | |
| for episode_index in range(num_episodes): |
| per_episode_metrics = unwrapped_env.metrics_manager.compute_per_episode() | ||
| arena_env_args_snapshot = dict(job.arena_env_args_dict) | ||
|
|
||
| output_path = Path(output_path) | ||
| output_path.parent.mkdir(parents=True, exist_ok=True) | ||
| with open(output_path, "a", encoding="utf-8") as jsonl_output: | ||
| for episode_idx, episode_metrics in enumerate(per_episode_metrics): |
There was a problem hiding this comment.
Silent empty
arena_env_args when Job constructed without arena_env_args_dict
job.arena_env_args_dict defaults to {} when a Job is constructed directly (i.e., not through Job.from_dict()). In that case the JSONL row is written as "arena_env_args": {} with no error or warning, producing rows that are silently useless to the sensitivity analyzer. At a minimum, adding a guard that logs a warning (and optionally skips writing) when the dict is empty would surface this misconfiguration before it silently corrupts an analysis run.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
…is_mvp1' into cvolk/feature/sensitivity_eval_recording
Bring the parked light_intensity sweep configs (jobs config + minimal variant + factors.yaml) onto the recording branch so eval_runner --episode_summary can be run end to end and the output analysed. factors.yaml is updated to the current factors-only schema (the parked copy still had the removed slice/outcomes blocks). Signed-off-by: Clemens Volk <cvolk@nvidia.com>
| for episode_index in range(num_episodes): | ||
| episode_metrics: dict[str, Any] = {} | ||
| for term_name, term_cfg in zip(self._term_names, self._term_cfgs): | ||
| # compute_metric_func reduces a list of per-episode arrays; give it just this one. | ||
| episode_array = episode_arrays_by_term[term_name][episode_index] |
There was a problem hiding this comment.
HDF5 alphabetical key order silently swaps episodes for 10+ episode jobs
get_recorded_metric_data iterates for demo in f["data"], which h5py returns in alphabetical order by default (creation-order tracking is off unless the file was created with track_order=True). For a job with 20 episodes the key order is demo_0, demo_1, demo_10, demo_11, …, demo_19, demo_2, demo_3, …, so episode_arrays_by_term[term][2] contains data from demo_10 (the simulation's 11th episode) rather than demo_2. All nine jobs in light_intensity_sweep_jobs_config.json use num_episodes: 20, so every JSONL row from those jobs will carry a wrong episode_idx. Any downstream use that joins on episode_idx (e.g. matching to video recordings) will silently cross-reference the wrong episode.
The fix is to sort the demo keys numerically before building the lists in get_recorded_metric_data, or to drive the outer loop from the sorted keys in episode_arrays_by_term directly instead of from a separate get_num_episodes call.
150 jobs (75 uniform-random light values over [0,2000] x {rubiks cube, tomato soup can}),
num_envs=2, 2 episodes each. Paired factors.yaml declares light_intensity (continuous) +
pick_up_object (categorical) -> MNPE. Run with eval_runner --chunk_size 20 --episode_summary,
then generate_report --outcome success_rate.
Signed-off-by: Clemens Volk <cvolk@nvidia.com>
Superseded by light_object_sweep (light x object, MNPE). The light-only sweep is the degenerate 1-D NPE case anyway. Still available on cvolk/feature/sensitivity_eval_configs_parked. Signed-off-by: Clemens Volk <cvolk@nvidia.com>
Summary
Per-episode sensitivity recording (eval + metrics)
Base branch:
cvolk/feature/sensitivity_analysis_mvp1(#729).MNPE (picked when any factor is categorical): Running sweep over 2 objects + light intensity:
&