
Per-episode sensitivity recording (eval + metrics)#781

Open
cvolk wants to merge 5 commits into cvolk/feature/sensitivity_analysis_mvp1 from cvolk/feature/sensitivity_eval_recording

cvolk (Collaborator) commented Jun 11, 2026

Summary

Per-episode sensitivity recording (eval + metrics)
Base branch: cvolk/feature/sensitivity_analysis_mvp1 (#729).

Running a sweep over 2 objects + light intensity (MNPE is picked when any factor is categorical):

python isaaclab_arena/evaluation/eval_runner.py \
  --eval_jobs_config isaaclab_arena_environments/eval_jobs_configs/light_object_sweep_jobs_config.json \
  --episode_summary eval/light_object_sweep.jsonl \
  --chunk_size 20

Then generate the report:

python -m isaaclab_arena.analysis.sensitivity.generate_report \
  --factors_yaml isaaclab_arena_environments/eval_jobs_configs/light_object_sweep_factors.yaml \
  --episode_summary eval/light_object_sweep.jsonl \
  --outcome success_rate \
  --output eval/light_object_sweep_report.png

Produce the episode_summary.jsonl that the sensitivity toolbox (PR #729) consumes:

- episode_writer.write_episode_summaries: one JSONL row per episode
  (job_name, episode_idx, full arena_env_args, per-episode outcomes)
- eval_runner / eval_runner_cli: opt-in --episode_summary flag
- job_manager: expose arena_env_args_dict for logging
- metrics_manager: compute_per_episode for the per-episode outcome values
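
As a rough illustration of the row shape listed above, the sketch below appends one JSON object per episode. The helper name `append_episode_rows` and the example values are hypothetical; only the four field names (job_name, episode_idx, arena_env_args, outcomes) are taken from this PR, and the real episode_writer may differ in detail.

```python
import json
from pathlib import Path
from typing import Any


def append_episode_rows(
    output_path: Path,
    job_name: str,
    arena_env_args: dict[str, Any],
    per_episode_outcomes: list[dict[str, Any]],
) -> int:
    """Append one JSONL row per episode and return the row count (hypothetical helper)."""
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    # Append mode, so rows from successive jobs accumulate in one file.
    with open(output_path, "a", encoding="utf-8") as jsonl_output:
        for episode_idx, outcomes in enumerate(per_episode_outcomes):
            row = {
                "job_name": job_name,
                "episode_idx": episode_idx,
                "arena_env_args": arena_env_args,
                "outcomes": outcomes,
            }
            jsonl_output.write(json.dumps(row) + "\n")
    return len(per_episode_outcomes)
```

Because the file is opened in append mode, re-running a sweep against the same path would add duplicate rows unless the file is removed first.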

Stacked on cvolk/feature/sensitivity_analysis_mvp1 (#729).

Signed-off-by: Clemens Volk <cvolk@nvidia.com>

greptile-apps bot (Contributor) commented Jun 11, 2026

Greptile Summary

This PR adds per-episode sensitivity recording to the eval pipeline: a new episode_writer.py module appends one JSONL row per recorded episode (containing arena_env_args + per-metric outcomes), a compute_per_episode() method on MetricsManager performs the per-episode transpose of HDF5 data, and a new --episode_summary CLI flag gates the feature. Example sweep configs (light_object_sweep_jobs_config.json / light_object_sweep_factors.yaml) demonstrate MNPE sensitivity analysis over light intensity × object type.

  • compute_per_episode() opens the HDF5 dataset separately for get_num_episodes and once per term in get_recorded_metric_data, then iterates range(num_episodes) indexing into those lists — the episode ordering, count consistency, and alphabetical-vs-numerical demo-key issues from prior review threads are inherited here.
  • write_episode_summaries() sits inside the rollout_policy try-block so a JSONL I/O failure will cause an already-succeeded rollout to be marked FAILED with empty metrics, as noted in prior review.
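
To make the "per-episode transpose" concrete, here is a minimal sketch that turns data grouped by metric term into one dict per episode. The function name `transpose_by_episode` is hypothetical, and the sketch leaves out the per-term reduction (compute_metric_func) that the actual method applies; it also rejects mismatched lengths outright, which is one possible policy rather than what the PR does.

```python
from typing import Any


def transpose_by_episode(arrays_by_term: dict[str, list[Any]]) -> list[dict[str, Any]]:
    """Turn {term: [value per episode]} into [{term: value} per episode]."""
    if not arrays_by_term:
        return []
    # Every term must report the same number of episodes, otherwise indexing is unsafe.
    lengths = {len(values) for values in arrays_by_term.values()}
    if len(lengths) != 1:
        raise ValueError(f"Terms disagree on episode count: {sorted(lengths)}")
    num_episodes = lengths.pop()
    return [
        {term: values[i] for term, values in arrays_by_term.items()}
        for i in range(num_episodes)
    ]
```

Checking the lengths once, before the loop, surfaces a count mismatch as an explicit error instead of an IndexError partway through.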

Confidence Score: 3/5

The per-episode recording feature works correctly for the bundled 2-episode-per-job sweep configs, but the underlying HDF5 reading logic in compute_per_episode has unresolved data-integrity issues at higher episode counts and fragile error handling in the eval_runner integration that were raised in prior review rounds and remain in the code.

The new compute_per_episode method inherits the same fragile HDF5 access pattern as its predecessor: it calls get_num_episodes and get_recorded_metric_data in separate file opens, so the loop bound and the array lengths can diverge. For jobs with ten or more episodes, iterating h5py's data group alphabetically produces demo_0, demo_1, demo_10 rather than numeric order, so episode_idx values written to the JSONL silently map to the wrong HDF5 demos. Additionally, write_episode_summaries lives inside the try-block that drives complete_job, so an I/O failure during JSONL writing discards already-computed rollout metrics and marks a successful job as failed.

isaaclab_arena/metrics/metrics_manager.py and isaaclab_arena/metrics/metrics.py (get_recorded_metric_data iteration order), isaaclab_arena/evaluation/eval_runner.py (write_episode_summaries placement relative to the exception handler)

Important Files Changed

| Filename | Overview |
|---|---|
| isaaclab_arena/evaluation/episode_writer.py | New module: writes per-episode JSONL rows with arena_env_args + outcomes. Clean design; episode_idx is derived from enumerate order (which inherits HDF5 alphabetical-key ordering from compute_per_episode). Sits inside the outer try-block in eval_runner, so I/O failure silently poisons job status. |
| isaaclab_arena/evaluation/eval_runner.py | write_episode_summaries() is called inside the main try/except block after rollout_policy, meaning any I/O failure marks a successful job as FAILED with empty metrics — previously flagged and unchanged. |
| isaaclab_arena/metrics/metrics_manager.py | New compute_per_episode() opens HDF5 N+1 times independently (get_num_episodes + once per term), then indexes into those lists by range(num_episodes) — carries over the alphabetical demo-key ordering bug and the IndexError risk from count mismatches, both flagged in prior review. |
| isaaclab_arena/evaluation/job_manager.py | Adds arena_env_args_dict field to Job (defaulting to {} when not supplied); from_dict() correctly passes the raw dict; minimal, clean change. |
| isaaclab_arena/evaluation/eval_runner_cli.py | Adds --episode_summary argparse argument with clear help text; no issues. |
| isaaclab_arena_environments/eval_jobs_configs/light_object_sweep_factors.yaml | New factors schema for MNPE analysis: continuous light_intensity [0, 2000] and categorical pick_up_object. All values in the paired JSON config fall within declared ranges. |
| isaaclab_arena_environments/eval_jobs_configs/light_object_sweep_jobs_config.json | Generated sweep config (161 jobs x 2 objects, num_episodes: 2 each). All jobs correctly set num_episodes: 2, so alphabetical HDF5 key ordering bug (demo_0, demo_1) does not manifest for this specific config. |

Sequence Diagram

sequenceDiagram
    participant CLI as eval_runner (main)
    participant RP as rollout_policy()
    participant Rec as HDF5 Recorder
    participant MM as MetricsManager
    participant EW as episode_writer
    participant JSONL as episode_summary.jsonl

    CLI->>RP: "rollout_policy(env, policy, num_episodes=N)"
    loop per episode
        RP->>Rec: writes demo_i data to HDF5
    end
    RP-->>CLI: aggregate metrics dict

    alt episode_summary_enabled
        CLI->>EW: write_episode_summaries(env, job, path)
        EW->>MM: compute_per_episode()
        MM->>Rec: get_num_episodes(dataset_path) [HDF5 open 1]
        loop per term
            MM->>Rec: get_recorded_metric_data(dataset_path, term) [HDF5 open 2..N]
        end
        MM-->>EW: list[dict per episode]
        loop per episode_idx (enumerate)
            EW->>JSONL: "append {job_name, episode_idx, arena_env_args, outcomes}"
        end
        EW-->>CLI: rows written
    end

    CLI->>CLI: job_manager.complete_job(COMPLETED)

Reviews (3): Last reviewed commit: "Remove the single-factor light_intensity..."

Comment on lines +263 to +265
if episode_summary_enabled:
    rows = write_episode_summaries(env, job, args_cli.episode_summary)
    print(f"[INFO] Wrote {rows} episode summaries for job '{job.name}'")

greptile-apps bot (Contributor):

P1 Summary write failure silently fails a successful job

write_episode_summaries runs inside the outer try/except block, so any I/O failure (disk full, permission denied, bad JSON serialization) will trigger the except handler at line 273. That handler calls job_manager.complete_job(job, metrics={}, status=Status.FAILED), discarding the already-computed metrics from rollout_policy and marking a successful rollout as FAILED. Since the summary is opt-in and supplemental, its failure should be isolated — otherwise a temporary disk issue poisons your eval results.
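
One way to isolate the supplemental write is sketched below. It is a standalone illustration, not the eval_runner code: `run_job_with_isolated_summary`, the callables it takes, and the plain string statuses are all hypothetical stand-ins for rollout_policy, write_episode_summaries and Status.

```python
from typing import Any, Callable


def run_job_with_isolated_summary(
    rollout: Callable[[], dict[str, Any]],
    write_summary: Callable[[], int],
) -> tuple[dict[str, Any], str]:
    """Return (metrics, status); a summary failure never changes the status."""
    try:
        metrics = rollout()
    except Exception as exc:
        print(f"[ERROR] Rollout failed: {exc}")
        return {}, "FAILED"

    # The summary is supplemental, so its errors are logged and swallowed here.
    try:
        rows = write_summary()
        print(f"[INFO] Wrote {rows} episode summaries")
    except Exception as exc:
        print(f"[WARNING] Episode summary write failed, keeping rollout metrics: {exc}")

    return metrics, "COMPLETED"
```

Catching a narrower set of exceptions (for example OSError and TypeError) in the second block may be preferable if other failures should still abort the job.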

Comment on lines +76 to +87
dataset_path = get_metric_recorder_dataset_path(self._env)
num_episodes = get_num_episodes(dataset_path)

# Recorded data arrives grouped by metric (each term -> one array per episode).
# Read it once here, then transpose into one metric dict per episode below.
episode_arrays_by_term = {
    term_name: get_recorded_metric_data(dataset_path, term_cfg.recorder_term_name)
    for term_name, term_cfg in zip(self._term_names, self._term_cfgs)
}

per_episode_metrics: list[dict[str, Any]] = []
for episode_index in range(num_episodes):

greptile-apps bot (Contributor):

P2 Extra HDF5 open + potential IndexError between two independent reads

get_num_episodes opens and closes the HDF5 file in a read that is separate from the ones that populate episode_arrays_by_term. Because these are independent h5py.File opens, if the returned list lengths don't match num_episodes (e.g., a term's dataset has fewer demos than the top-level demo count), episode_arrays_by_term[term_name][episode_index] will raise an IndexError. Deriving the episode count directly from the already-loaded lists avoids the redundant open and keeps both measurements consistent.

Suggested change
- dataset_path = get_metric_recorder_dataset_path(self._env)
- num_episodes = get_num_episodes(dataset_path)
- # Recorded data arrives grouped by metric (each term -> one array per episode).
- # Read it once here, then transpose into one metric dict per episode below.
- episode_arrays_by_term = {
-     term_name: get_recorded_metric_data(dataset_path, term_cfg.recorder_term_name)
-     for term_name, term_cfg in zip(self._term_names, self._term_cfgs)
- }
- per_episode_metrics: list[dict[str, Any]] = []
- for episode_index in range(num_episodes):
+ dataset_path = get_metric_recorder_dataset_path(self._env)
+ # Recorded data arrives grouped by metric (each term -> one array per episode).
+ # Read it once here, then transpose into one metric dict per episode below.
+ episode_arrays_by_term = {
+     term_name: get_recorded_metric_data(dataset_path, term_cfg.recorder_term_name)
+     for term_name, term_cfg in zip(self._term_names, self._term_cfgs)
+ }
+ # Derive episode count from the already-loaded data to stay consistent with it.
+ # Fall back to get_num_episodes only when there are no registered terms.
+ if episode_arrays_by_term:
+     num_episodes = min(len(v) for v in episode_arrays_by_term.values())
+ else:
+     num_episodes = get_num_episodes(dataset_path)
+ per_episode_metrics: list[dict[str, Any]] = []
+ for episode_index in range(num_episodes):

Comment on lines +47 to +53
per_episode_metrics = unwrapped_env.metrics_manager.compute_per_episode()
arena_env_args_snapshot = dict(job.arena_env_args_dict)

output_path = Path(output_path)
output_path.parent.mkdir(parents=True, exist_ok=True)
with open(output_path, "a", encoding="utf-8") as jsonl_output:
    for episode_idx, episode_metrics in enumerate(per_episode_metrics):

greptile-apps bot (Contributor):

P2 Silent empty arena_env_args when Job constructed without arena_env_args_dict

job.arena_env_args_dict defaults to {} when a Job is constructed directly (i.e., not through Job.from_dict()). In that case the JSONL row is written as "arena_env_args": {} with no error or warning, producing rows that are silently useless to the sensitivity analyzer. At a minimum, adding a guard that logs a warning (and optionally skips writing) when the dict is empty would surface this misconfiguration before it silently corrupts an analysis run.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
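
A guard along the lines suggested above could look like the following sketch. The helper name `snapshot_env_args` and the choice to skip writing (by returning None) are assumptions made for illustration; the PR may prefer to warn and still write the rows.

```python
import warnings
from typing import Any, Optional


def snapshot_env_args(job_name: str, arena_env_args_dict: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return a shallow copy of the args, or None (after warning) when they are empty."""
    if not arena_env_args_dict:
        warnings.warn(
            f"Job '{job_name}' has no arena_env_args; skipping episode summary rows.",
            stacklevel=2,
        )
        return None
    # Copy so later mutation of the job's dict does not affect rows already built.
    return dict(arena_env_args_dict)
```

A caller would check for None before opening the JSONL file, so an empty dict never reaches the output.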

…is_mvp1' into cvolk/feature/sensitivity_eval_recording
Bring the parked light_intensity sweep configs (jobs config + minimal variant + factors.yaml)
onto the recording branch so eval_runner --episode_summary can be run end to end and the
output analysed. factors.yaml is updated to the current factors-only schema (the parked copy
still had the removed slice/outcomes blocks).

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
Comment on lines +87 to +91
for episode_index in range(num_episodes):
    episode_metrics: dict[str, Any] = {}
    for term_name, term_cfg in zip(self._term_names, self._term_cfgs):
        # compute_metric_func reduces a list of per-episode arrays; give it just this one.
        episode_array = episode_arrays_by_term[term_name][episode_index]

greptile-apps bot (Contributor):

P1 HDF5 alphabetical key order silently swaps episodes for 10+ episode jobs

get_recorded_metric_data iterates for demo in f["data"], which h5py returns in alphabetical order by default (creation-order tracking is off unless the file was created with track_order=True). For a job with 20 episodes the key order is demo_0, demo_1, demo_10, demo_11, …, demo_19, demo_2, demo_3, …, so episode_arrays_by_term[term][2] contains data from demo_10 (the simulation's 11th episode) rather than demo_2. All nine jobs in light_intensity_sweep_jobs_config.json use num_episodes: 20, so every JSONL row from those jobs will carry a wrong episode_idx. Any downstream use that joins on episode_idx (e.g. matching to video recordings) will silently cross-reference the wrong episode.

The fix is to sort the demo keys numerically before building the lists in get_recorded_metric_data, or to drive the outer loop from the sorted keys in episode_arrays_by_term directly instead of from a separate get_num_episodes call.
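
The ordering difference can be reproduced without an HDF5 file. The sketch below assumes the keys follow the demo_<n> pattern quoted above; `demo_sort_key` is a hypothetical helper, not a function from this repository.

```python
import re


def demo_sort_key(demo_name: str) -> int:
    """Extract the trailing integer from a key such as 'demo_12'."""
    match = re.search(r"(\d+)$", demo_name)
    if match is None:
        raise ValueError(f"Unexpected demo key: {demo_name!r}")
    return int(match.group(1))


keys = [f"demo_{i}" for i in range(12)]
alphabetical = sorted(keys)                  # plain string order
numerical = sorted(keys, key=demo_sort_key)  # order by episode number

print(alphabetical[:4])  # ['demo_0', 'demo_1', 'demo_10', 'demo_11']
print(numerical[:4])     # ['demo_0', 'demo_1', 'demo_2', 'demo_3']
```

Applied to the recorder file, the same key function could be passed to sorted() over the group's keys before the per-episode lists are built, so that list index and episode number agree.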

150 jobs (75 uniform-random light values over [0,2000] x {rubiks cube, tomato soup can}),
num_envs=2, 2 episodes each. Paired factors.yaml declares light_intensity (continuous) +
pick_up_object (categorical) -> MNPE. Run with eval_runner --chunk_size 20 --episode_summary,
then generate_report --outcome success_rate.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>
Superseded by light_object_sweep (light x object, MNPE). The light-only sweep is the
degenerate 1-D NPE case anyway. Still available on cvolk/feature/sensitivity_eval_configs_parked.

Signed-off-by: Clemens Volk <cvolk@nvidia.com>