Per-episode sensitivity recording (eval + metrics) by cvolkcvolk · Pull Request #781 · isaac-sim/IsaacLab-Arena

cvolkcvolk · 2026-06-11T14:01:01Z

Summary

Per-episode sensitivity recording (eval + metrics)
Base branch: cvolk/feature/sensitivity_analysis_mvp1 (#729).

MNPE (picked when any factor is categorical). Running a sweep over 2 objects + light intensity:

python isaaclab_arena/evaluation/eval_runner.py \
  --eval_jobs_config isaaclab_arena_environments/eval_jobs_configs/light_object_sweep_jobs_config.json \
  --episode_summary eval/light_object_sweep.jsonl \
  --chunk_size 20

and then:

python -m isaaclab_arena.analysis.sensitivity.generate_report \
  --factors_yaml isaaclab_arena_environments/eval_jobs_configs/light_object_sweep_factors.yaml \
  --episode_summary eval/light_object_sweep.jsonl \
  --outcome success_rate \
  --output eval/light_object_sweep_report.png

Produce the episode_summary.jsonl that the sensitivity toolbox (PR #729) consumes:
- episode_writer.write_episode_summaries: one JSONL row per episode (job_name, episode_idx, full arena_env_args, per-episode outcomes)
- eval_runner / eval_runner_cli: opt-in --episode_summary flag
- job_manager: expose arena_env_args_dict for logging
- metrics_manager: compute_per_episode for the per-episode outcome values

Stacked on cvolk/feature/sensitivity_analysis_mvp1 (#729).

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
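For illustration, one row of the episode summary might be assembled as sketched below. The top-level field names (job_name, episode_idx, arena_env_args, outcomes) follow the description above; the helper name build_episode_row and every sample value are hypothetical and are not taken from the PR.

```python
import json
from typing import Any


def build_episode_row(
    job_name: str,
    episode_idx: int,
    arena_env_args: dict[str, Any],
    outcomes: dict[str, Any],
) -> str:
    """Serialize one episode as a single JSONL line (hypothetical helper)."""
    row = {
        "job_name": job_name,
        "episode_idx": episode_idx,
        "arena_env_args": arena_env_args,
        "outcomes": outcomes,
    }
    # json.dumps emits no raw newlines by default, so each row stays on one line.
    return json.dumps(row, sort_keys=True)


# Sample values are made up for the example.
line = build_episode_row(
    job_name="light_object_sweep_000",
    episode_idx=0,
    arena_env_args={"light_intensity": 512.0, "pick_up_object": "rubiks_cube"},
    outcomes={"success_rate": 1.0},
)
parsed = json.loads(line)
```

Each line parses independently of the others, which is what lets the writer append to the file job by job.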

greptile-apps · 2026-06-11T14:06:59Z

Greptile Summary

This PR adds per-episode sensitivity recording to the eval pipeline: a new episode_writer.py module appends one JSONL row per recorded episode (containing arena_env_args + per-metric outcomes), a compute_per_episode() method on MetricsManager performs the per-episode transpose of HDF5 data, and a new --episode_summary CLI flag gates the feature. Example sweep configs (light_object_sweep_jobs_config.json / light_object_sweep_factors.yaml) demonstrate MNPE sensitivity analysis over light intensity × object type.

compute_per_episode() opens the HDF5 dataset separately for get_num_episodes and once per term in get_recorded_metric_data, then iterates range(num_episodes) indexing into those lists — the episode ordering, count consistency, and alphabetical-vs-numerical demo-key issues from prior review threads are inherited here.
write_episode_summaries() sits inside the rollout_policy try-block so a JSONL I/O failure will cause an already-succeeded rollout to be marked FAILED with empty metrics, as noted in prior review.

Confidence Score: 3/5

The per-episode recording feature works correctly for the bundled 2-episode-per-job sweep configs, but the underlying HDF5 reading logic in compute_per_episode has unresolved data-integrity issues at higher episode counts and fragile error handling in the eval_runner integration that were raised in prior review rounds and remain in the code.

The new compute_per_episode method inherits the same fragile HDF5 access pattern as its predecessor: it calls get_num_episodes and get_recorded_metric_data in separate file opens, so the loop bound and the array lengths can diverge. For jobs with ten or more episodes, iterating h5py's data group alphabetically produces demo_0, demo_1, demo_10 rather than numeric order, so episode_idx values written to the JSONL silently map to the wrong HDF5 demos. Additionally, write_episode_summaries lives inside the try-block that drives complete_job, so an I/O failure during JSONL writing discards already-computed rollout metrics and marks a successful job as failed.

Files that need attention: isaaclab_arena/metrics/metrics_manager.py and isaaclab_arena/metrics/metrics.py (get_recorded_metric_data iteration order), isaaclab_arena/evaluation/eval_runner.py (write_episode_summaries placement relative to the exception handler)

Important Files Changed

Filename	Overview
isaaclab_arena/evaluation/episode_writer.py	New module: writes per-episode JSONL rows with arena_env_args + outcomes. Clean design; episode_idx is derived from enumerate order (which inherits HDF5 alphabetical-key ordering from compute_per_episode). Sits inside the outer try-block in eval_runner, so I/O failure silently poisons job status.
isaaclab_arena/evaluation/eval_runner.py	write_episode_summaries() is called inside the main try/except block after rollout_policy, meaning any I/O failure marks a successful job as FAILED with empty metrics — previously flagged and unchanged.
isaaclab_arena/metrics/metrics_manager.py	New compute_per_episode() opens HDF5 N+1 times independently (get_num_episodes + once per term), then indexes into those lists by range(num_episodes) — carries over the alphabetical demo-key ordering bug and the IndexError risk from count mismatches, both flagged in prior review.
isaaclab_arena/evaluation/job_manager.py	Adds arena_env_args_dict field to Job (defaulting to {} when not supplied); from_dict() correctly passes the raw dict; minimal, clean change.
isaaclab_arena/evaluation/eval_runner_cli.py	Adds --episode_summary argparse argument with clear help text; no issues.
isaaclab_arena_environments/eval_jobs_configs/light_object_sweep_factors.yaml	New factors schema for MNPE analysis: continuous light_intensity [0, 2000] and categorical pick_up_object. All values in the paired JSON config fall within declared ranges.
isaaclab_arena_environments/eval_jobs_configs/light_object_sweep_jobs_config.json	Generated sweep config (161 jobs x 2 objects, num_episodes: 2 each). All jobs correctly set num_episodes: 2, so alphabetical HDF5 key ordering bug (demo_0, demo_1) does not manifest for this specific config.

Sequence Diagram

sequenceDiagram
    participant CLI as eval_runner (main)
    participant RP as rollout_policy()
    participant Rec as HDF5 Recorder
    participant MM as MetricsManager
    participant EW as episode_writer
    participant JSONL as episode_summary.jsonl

    CLI->>RP: "rollout_policy(env, policy, num_episodes=N)"
    loop per episode
        RP->>Rec: writes demo_i data to HDF5
    end
    RP-->>CLI: aggregate metrics dict

    alt episode_summary_enabled
        CLI->>EW: write_episode_summaries(env, job, path)
        EW->>MM: compute_per_episode()
        MM->>Rec: get_num_episodes(dataset_path) [HDF5 open 1]
        loop per term
            MM->>Rec: get_recorded_metric_data(dataset_path, term) [HDF5 open 2..N]
        end
        MM-->>EW: list[dict per episode]
        loop per episode_idx (enumerate)
            EW->>JSONL: "append {job_name, episode_idx, arena_env_args, outcomes}"
        end
        EW-->>CLI: rows written
    end

    CLI->>CLI: job_manager.complete_job(COMPLETED)

Reviews (3): Last reviewed commit: "Remove the single-factor light_intensity..."

greptile-apps · 2026-06-11T14:07:03Z

+                    if episode_summary_enabled:
+                        rows = write_episode_summaries(env, job, args_cli.episode_summary)
+                        print(f"[INFO] Wrote {rows} episode summaries for job '{job.name}'")


Summary write failure silently fails a successful job

write_episode_summaries runs inside the outer try/except block, so any I/O failure (disk full, permission denied, bad JSON serialization) will trigger the except handler at line 273. That handler calls job_manager.complete_job(job, metrics={}, status=Status.FAILED), discarding the already-computed metrics from rollout_policy and marking a successful rollout as FAILED. Since the summary is opt-in and supplemental, its failure should be isolated — otherwise a temporary disk issue poisons your eval results.
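One way to isolate the supplemental write is to give it its own try/except after the rollout has succeeded. The sketch below is a simplified stand-in, not the eval_runner code: Status, run_job, and the two callables are hypothetical placeholders for the real job manager, rollout_policy, and write_episode_summaries.

```python
from enum import Enum
from typing import Any, Callable


class Status(Enum):
    COMPLETED = "completed"
    FAILED = "failed"


def run_job(
    rollout: Callable[[], dict[str, Any]],
    write_summary: Callable[[], int],
) -> tuple[Status, dict[str, Any], list[str]]:
    """Run a rollout, then attempt the optional summary write in isolation."""
    messages: list[str] = []
    try:
        metrics = rollout()
    except Exception as exc:
        # A rollout failure is a genuine job failure.
        return Status.FAILED, {}, [f"rollout failed: {exc}"]

    try:
        rows = write_summary()
        messages.append(f"wrote {rows} episode summaries")
    except (OSError, TypeError, ValueError) as exc:
        # The summary is supplemental: report the problem but keep the metrics.
        messages.append(f"episode summary not written: {exc}")

    return Status.COMPLETED, metrics, messages


def failing_write() -> int:
    raise OSError("disk full")


status, metrics, messages = run_job(lambda: {"success_rate": 0.5}, failing_write)
```

With this shape, a disk or serialization error during the summary write leaves the job COMPLETED with its rollout metrics intact.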

greptile-apps · 2026-06-11T14:07:04Z

+        dataset_path = get_metric_recorder_dataset_path(self._env)
+        num_episodes = get_num_episodes(dataset_path)
+
+        # Recorded data arrives grouped by metric (each term -> one array per episode).
+        # Read it once here, then transpose into one metric dict per episode below.
+        episode_arrays_by_term = {
+            term_name: get_recorded_metric_data(dataset_path, term_cfg.recorder_term_name)
+            for term_name, term_cfg in zip(self._term_names, self._term_cfgs)
+        }
+
+        per_episode_metrics: list[dict[str, Any]] = []
+        for episode_index in range(num_episodes):


Extra HDF5 open + potential IndexError between two independent reads

get_num_episodes opens and closes the HDF5 file in addition to the reads that populate episode_arrays_by_term. These are independent h5py.File opens, so if the returned list lengths don't match num_episodes (e.g., a term's dataset has fewer demos than the top-level demo count), episode_arrays_by_term[term_name][episode_index] will raise an IndexError. Deriving the episode count directly from the already-loaded lists avoids the redundant open and keeps both measurements consistent.

Suggested change

-        dataset_path = get_metric_recorder_dataset_path(self._env)
-        num_episodes = get_num_episodes(dataset_path)
-
-        # Recorded data arrives grouped by metric (each term -> one array per episode).
-        # Read it once here, then transpose into one metric dict per episode below.
-        episode_arrays_by_term = {
-            term_name: get_recorded_metric_data(dataset_path, term_cfg.recorder_term_name)
-            for term_name, term_cfg in zip(self._term_names, self._term_cfgs)
-        }
-
-        per_episode_metrics: list[dict[str, Any]] = []
-        for episode_index in range(num_episodes):
+        dataset_path = get_metric_recorder_dataset_path(self._env)
+
+        # Recorded data arrives grouped by metric (each term -> one array per episode).
+        # Read it once here, then transpose into one metric dict per episode below.
+        episode_arrays_by_term = {
+            term_name: get_recorded_metric_data(dataset_path, term_cfg.recorder_term_name)
+            for term_name, term_cfg in zip(self._term_names, self._term_cfgs)
+        }
+
+        # Derive episode count from the already-loaded data to stay consistent with it.
+        # Fall back to get_num_episodes only when there are no registered terms.
+        if episode_arrays_by_term:
+            num_episodes = min(len(v) for v in episode_arrays_by_term.values())
+        else:
+            num_episodes = get_num_episodes(dataset_path)
+
+        per_episode_metrics: list[dict[str, Any]] = []
+        for episode_index in range(num_episodes):
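The transpose with a count derived from the loaded data can be sketched in isolation. This is a simplified illustration: transpose_per_episode is a hypothetical name, plain lists stand in for the recorded arrays, the per-term compute_metric_func reduction is omitted, and the no-terms case simply returns an empty list instead of falling back to get_num_episodes.

```python
from typing import Any


def transpose_per_episode(
    episode_arrays_by_term: dict[str, list[Any]],
) -> list[dict[str, Any]]:
    """Turn {term: [value per episode]} into [{term: value} per episode]."""
    if not episode_arrays_by_term:
        return []

    lengths = {name: len(values) for name, values in episode_arrays_by_term.items()}
    # Use the shortest list so that every index below is valid for every term.
    num_episodes = min(lengths.values())
    if len(set(lengths.values())) > 1:
        # Surface the mismatch instead of truncating silently.
        print(f"[WARN] per-term episode counts differ: {lengths}; using {num_episodes}")

    return [
        {name: values[index] for name, values in episode_arrays_by_term.items()}
        for index in range(num_episodes)
    ]


per_episode = transpose_per_episode({"success": [1, 0, 1], "distance": [0.1, 0.2]})
```

Here the mismatched inputs yield two episodes rather than an IndexError on the third.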

greptile-apps · 2026-06-11T14:07:05Z

+    per_episode_metrics = unwrapped_env.metrics_manager.compute_per_episode()
+    arena_env_args_snapshot = dict(job.arena_env_args_dict)
+
+    output_path = Path(output_path)
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    with open(output_path, "a", encoding="utf-8") as jsonl_output:
+        for episode_idx, episode_metrics in enumerate(per_episode_metrics):


Silent empty arena_env_args when Job constructed without arena_env_args_dict

job.arena_env_args_dict defaults to {} when a Job is constructed directly (i.e., not through Job.from_dict()). In that case the JSONL row is written as "arena_env_args": {} with no error or warning, producing rows that are silently useless to the sensitivity analyzer. At a minimum, adding a guard that logs a warning (and optionally skips writing) when the dict is empty would surface this misconfiguration before it silently corrupts an analysis run.
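A guard of the kind described might look like the following sketch. The function name snapshot_env_args and the choice to skip writing (rather than only warn) are assumptions made for the example; the PR itself may resolve this differently.

```python
import logging
from typing import Any, Optional

logger = logging.getLogger(__name__)


def snapshot_env_args(
    job_name: str, arena_env_args_dict: dict[str, Any]
) -> Optional[dict[str, Any]]:
    """Return a copy of the env args, or None when there is nothing useful to record."""
    if not arena_env_args_dict:
        # An empty dict carries no factor values for the sensitivity analysis.
        logger.warning(
            "Job '%s' has an empty arena_env_args_dict; skipping episode summary rows.",
            job_name,
        )
        return None
    # Shallow copy so later mutation of the job does not change what gets written.
    return dict(arena_env_args_dict)


source_args = {"light_intensity": 750.0}
snapshot = snapshot_env_args("example_job", source_args)
skipped = snapshot_env_args("example_job", {})
```

A caller would check for None and return zero rows written, which keeps the misconfiguration visible in the logs without raising.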

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

…is_mvp1' into cvolk/feature/sensitivity_eval_recording

Bring the parked light_intensity sweep configs (jobs config + minimal variant + factors.yaml) onto the recording branch so eval_runner --episode_summary can be run end to end and the output analysed. factors.yaml is updated to the current factors-only schema (the parked copy still had the removed slice/outcomes blocks). Signed-off-by: Clemens Volk <cvolk@nvidia.com>

greptile-apps · 2026-06-11T17:07:52Z

+        for episode_index in range(num_episodes):
+            episode_metrics: dict[str, Any] = {}
+            for term_name, term_cfg in zip(self._term_names, self._term_cfgs):
+                # compute_metric_func reduces a list of per-episode arrays; give it just this one.
+                episode_array = episode_arrays_by_term[term_name][episode_index]


HDF5 alphabetical key order silently swaps episodes for 10+ episode jobs

get_recorded_metric_data iterates for demo in f["data"], which h5py returns in alphabetical order by default (creation-order tracking is off unless the file was created with track_order=True). For a job with 20 episodes the key order is demo_0, demo_1, demo_10, demo_11, …, demo_19, demo_2, demo_3, …, so episode_arrays_by_term[term][2] contains data from demo_10 (the simulation's 11th episode) rather than demo_2. All nine jobs in light_intensity_sweep_jobs_config.json use num_episodes: 20, so every JSONL row from those jobs will carry a wrong episode_idx. Any downstream use that joins on episode_idx (e.g. matching to video recordings) will silently cross-reference the wrong episode.

The fix is to sort the demo keys numerically before building the lists in get_recorded_metric_data, or to drive the outer loop from the sorted keys in episode_arrays_by_term directly instead of from a separate get_num_episodes call.
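The ordering problem and the numeric sort can be reproduced without HDF5 at all. In the sketch below, plain strings stand in for the keys of the data group, and demo_index is a hypothetical helper that assumes every key has the form demo_<integer>.

```python
def demo_index(key: str) -> int:
    """Extract the integer suffix from a key such as 'demo_12'."""
    return int(key.rsplit("_", 1)[1])


# Twelve keys in the order plain string sorting yields.
alphabetical = sorted(f"demo_{i}" for i in range(12))

# Sorting on the integer suffix restores recording order.
numeric = sorted(alphabetical, key=demo_index)
```

In the alphabetical list, index 2 holds demo_10; after the numeric sort, index 2 holds demo_2, so a list position once again matches the episode number.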

150 jobs (75 uniform-random light values over [0,2000] x {rubiks cube, tomato soup can}), num_envs=2, 2 episodes each. Paired factors.yaml declares light_intensity (continuous) + pick_up_object (categorical) -> MNPE. Run with eval_runner --chunk_size 20 --episode_summary, then generate_report --outcome success_rate. Signed-off-by: Clemens Volk <cvolk@nvidia.com>

Superseded by light_object_sweep (light x object, MNPE). The light-only sweep is the degenerate 1-D NPE case anyway. Still available on cvolk/feature/sensitivity_eval_configs_parked. Signed-off-by: Clemens Volk <cvolk@nvidia.com>

cvolkcvolk requested review from alexmillane, peterd-NV, qianl-nv, viiik-inside, xyao-nv and zhx06 as code owners June 11, 2026 14:01

greptile-apps Bot reviewed Jun 11, 2026

cvolkcvolk mentioned this pull request Jun 11, 2026

Sensitivity analysis MVP: MNPE/NPE posterior #729

Open

cvolkcvolk added 2 commits June 11, 2026 17:45

Merge remote-tracking branch 'origin/cvolk/feature/sensitivity_analys…

82c25f1

…is_mvp1' into cvolk/feature/sensitivity_eval_recording

greptile-apps Bot reviewed Jun 11, 2026

cvolkcvolk added 2 commits June 11, 2026 20:09

Remove the single-factor light_intensity sweep configs

85446cf

Superseded by light_object_sweep (light x object, MNPE). The light-only sweep is the degenerate 1-D NPE case anyway. Still available on cvolk/feature/sensitivity_eval_configs_parked. Signed-off-by: Clemens Volk <cvolk@nvidia.com>

Per-episode sensitivity recording (eval + metrics) #781
cvolkcvolk wants to merge 5 commits into cvolk/feature/sensitivity_analysis_mvp1 from cvolk/feature/sensitivity_eval_recording
