Extend eval runner: per-env camera recording, metrics file, and episode boundaries#776

Open
aiguldzh-nvidia wants to merge 8 commits into main from adzhumamurat/extend_episode_record_metrics

aiguldzh-nvidia (Collaborator) commented Jun 10, 2026

Summary

Extend eval runner with per-env camera recording, metrics file output, and timestamped run directories

Detailed description

  • CameraObsVideoRecorder now records all parallel envs (one file per env per camera) instead of env 0 only; each run gets a timestamped subdirectory under --video_dir to avoid stale files accumulating across runs
  • eval_runner gains --camera_video and --metrics_file flags; save_metrics_to_file() and camera video recording existed but were never called from the eval runner
  • rollout_policy() now returns (metrics, episode_boundaries) tracking {env_idx, start_step, end_step} per completed episode; frame index in CameraObsVideoRecorder output equals step index so boundaries can be used to slice per-env videos into individual episode clips
  • Job now preserves task_name, embodiment, and env_params extracted from arena_env_args before it is converted to a CLI list - used by the follow-on EpisodeRecord PR
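
A minimal sketch of how the boundaries could be used to cut per-env recordings into episode clips. It assumes end_step is inclusive and that frame i of an env's recording corresponds to step i; slice_episode_clips and the integer stand-in "frames" are hypothetical and not part of this PR.

```python
from collections import defaultdict


def slice_episode_clips(frames_by_env, episode_boundaries):
    """Group each env's frames into per-episode clips.

    frames_by_env: dict mapping env_idx -> list of frames (frame i == step i).
    episode_boundaries: list of dicts with env_idx, start_step, end_step.
    """
    clips = defaultdict(list)
    for boundary in episode_boundaries:
        frames = frames_by_env[boundary["env_idx"]]
        # end_step is assumed inclusive, hence the +1 on the slice end.
        clip = frames[boundary["start_step"] : boundary["end_step"] + 1]
        clips[boundary["env_idx"]].append(clip)
    return dict(clips)


# Integers stand in for decoded video frames.
frames_by_env = {0: list(range(6)), 1: list(range(6))}
boundaries = [
    {"env_idx": 0, "start_step": 0, "end_step": 2},
    {"env_idx": 0, "start_step": 3, "end_step": 5},
    {"env_idx": 1, "start_step": 0, "end_step": 4},
]
clips = slice_episode_clips(frames_by_env, boundaries)
```

Frames recorded after an env's last completed episode (step 5 of env 1 here) belong to no clip, which is one reason the trailing-boundary handling matters.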

isaaclab-review-bot (Bot, Contributor) left a comment

Code Review: EpisodeRecord with Task Metadata, Metrics Output, All Envs Recording and Episode Boundaries

Summary

This PR adds a well-structured EpisodeRecord schema and wires up episode boundaries, metrics persistence, and multi-env camera recording into the eval runner. The overall architecture is clean — dataclass-based schema with versioning, clear separation between building and writing records, and thoughtful extension points for future phases. A few issues worth addressing before merge.


🔴 Critical Issues

1. MetricsLogger receives non-timestamped path (race with path mutation)

File: eval_runner.py (lines ~205–215)

metrics_logger = MetricsLogger(metrics_file=args_cli.metrics_file or "metrics.json")
# ... later ...
if args_cli.metrics_file is not None:
    base, ext = os.path.splitext(args_cli.metrics_file)
    args_cli.metrics_file = f"{base}_{run_ts}{ext}"

MetricsLogger is initialized before args_cli.metrics_file is mutated to include the timestamp. The logger stores the original path at construction time, so save_metrics_to_file() writes to the non-timestamped path while the [INFO] print references metrics_logger.metrics_file (also non-timestamped). The timestamped path is effectively dead code.

Suggestion: Move MetricsLogger instantiation to after the timestamp mutation, or explicitly update metrics_logger.metrics_file post-mutation.
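
The ordering problem can be reproduced in isolation. The sketch below uses a stand-in MetricsLogger that only stores its constructor argument (an assumption about the real class, based on the description above); timestamped is a hypothetical helper mirroring the splitext logic in the diff.

```python
import os
from datetime import datetime


class MetricsLogger:
    """Stand-in for the real class: it captures the path once, at construction."""

    def __init__(self, metrics_file=None):
        self.metrics_file = metrics_file


def timestamped(path, run_ts):
    """Insert the run timestamp before the file extension."""
    base, ext = os.path.splitext(path)
    return f"{base}_{run_ts}{ext}"


# A fixed datetime keeps the example deterministic.
run_ts = datetime(2026, 6, 10, 12, 0, 0).strftime("%Y%m%dT%H%M%S")
requested = "out/metrics.json"

# Constructed before the path is rewritten: keeps the bare path.
early = MetricsLogger(metrics_file=requested)

# Constructed after the path is rewritten: holds the timestamped path.
requested = timestamped(requested, run_ts)
late = MetricsLogger(metrics_file=requested)
```

Reassigning the caller's variable does not affect the string already stored on early, which is why the construction has to move below the mutation.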


2. episode_boundaries lost on exception — NameError in caller

File: eval_runner.py (line ~268)

metrics, episode_boundaries = rollout_policy(...)

If rollout_policy raises an exception (which re-raises after pbar cleanup), the tuple unpacking never completes, so episode_boundaries is unbound in the except block's scope. While the failure-path build_episode_record(...) call doesn't pass episode_boundaries, any future code that references it in the except block would hit a NameError. The variable should be initialized before the call:

episode_boundaries = []
metrics, episode_boundaries = rollout_policy(...)

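Note that this only guarantees the name is bound: if rollout_policy raises, the assignment never runs and episode_boundaries stays empty. Including partial boundaries in the failure record would additionally require rollout_policy to expose them, for example by appending to a list owned by the caller.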
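
One way to keep boundaries collected before a crash is to let the caller own the list and have the rollout append to it. This is a hypothetical sketch with a toy rollout function (single env, an episode ending every 3 steps), not the signature of rollout_policy in this PR.

```python
def rollout(num_steps, boundaries, fail_at=None):
    """Toy rollout that appends completed episodes to a caller-owned list."""
    start = 0
    for step in range(num_steps):
        if step == fail_at:
            raise RuntimeError("simulated crash")
        if (step + 1) % 3 == 0:  # an episode ends on every third step
            boundaries.append({"env_idx": 0, "start_step": start, "end_step": step})
            start = step + 1
    return {"steps": num_steps}


episode_boundaries = []  # bound before the call, so it survives an exception
metrics = None
try:
    metrics = rollout(10, episode_boundaries, fail_at=7)
except RuntimeError:
    # Episodes completed before the crash are still available here.
    pass
```

With a tuple return value instead, the same crash would leave the caller with an empty list, because the assignment on the left-hand side never executes.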


🟡 Moderate Issues

3. MetricsLogger always receives a metrics_file argument even when user didn't request file output

File: eval_runner.py

metrics_logger = MetricsLogger(metrics_file=args_cli.metrics_file or "metrics.json")

When --metrics_file is not passed, args_cli.metrics_file is None, so this falls through to "metrics.json". If MetricsLogger writes on destruction or has side effects, this could produce an unexpected metrics.json file. The original code passed no argument — this changes the default behavior. Consider:

metrics_logger = MetricsLogger(metrics_file=args_cli.metrics_file)

and only calling save_metrics_to_file() when the file is set (which you already gate on args_cli.metrics_file is not None).

4. Off-by-one risk in num_episodes mode trailing boundary

File: policy_runner.py (lines ~141–148)

last_step = (num_steps_completed - 1) if num_steps is not None else num_steps_completed

In num_episodes mode, the comment says "break fired before increment." This is correct if the break at the episode-count check fires before num_steps_completed += 1. However, the break can also fire when num_steps is not None inside the inner if num_steps ... branch. Verify that both exit paths are covered by the same formula, especially when num_episodes mode terminates mid-step due to the num_episodes_completed >= num_episodes check happening after num_steps_completed is already incremented (the +=1 appears before the episode count check in the original code).

5. Memory accumulation unbounded — no chunked flush for multi-env recording

File: camera_video.py

The docstring warns about ~3.8 GB for 10 envs / 500 steps / 512×512×3, but there's no runtime guard. For longer runs where video_length is large, the in-memory buffers will grow linearly until _flush(). Consider adding a configurable max_buffer_frames parameter that triggers an intermediate flush-and-stitch, or at minimum a runtime warning when the estimated buffer exceeds a threshold (e.g., 2 GB).
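
A runtime guard along these lines could be sketched as follows. The function names, the 2 GiB default threshold, and the assumption of uint8 frames are illustrative and not taken from camera_video.py. For 10 envs, one camera, 500 frames of 512×512×3 uint8, the estimate is 3,932,160,000 bytes (about 3.66 GiB), in the same range as the docstring's figure.

```python
import warnings


def estimate_buffer_bytes(num_envs, num_cameras, num_frames, height, width,
                          channels=3, bytes_per_channel=1):
    """Upper bound on raw frame memory held before a flush (uint8 by default)."""
    return (num_envs * num_cameras * num_frames
            * height * width * channels * bytes_per_channel)


def warn_if_large(estimated_bytes, threshold_bytes=2 * 1024**3):
    """Emit a warning when the estimate exceeds the threshold; return whether it did."""
    if estimated_bytes > threshold_bytes:
        warnings.warn(
            f"Camera frame buffer may reach {estimated_bytes / 1024**3:.2f} GiB "
            f"before flushing (threshold {threshold_bytes / 1024**3:.2f} GiB).",
            RuntimeWarning,
        )
        return True
    return False


estimate = estimate_buffer_bytes(
    num_envs=10, num_cameras=1, num_frames=500, height=512, width=512
)
too_big = warn_if_large(estimate)
```

The estimate scales linearly with the number of cameras, so a two-camera setup with the same settings would double it.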


🟢 Suggestions / Nits

6. EpisodeRecord.arena_env_args typed as list[str] but populated with list(job.arena_env_args)

File: episode_record.py (field declaration) vs job_manager.py (arena_env_args is read from a dict)

In Job.from_dict, arena_env_args comes from converting a dict to a CLI list — confirm the runtime type is always list[str] and not list[Any]. If the config dict values are non-string, this will produce a JSON record with mixed types that breaks the schema contract.
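
If the conversion does not already coerce values, an explicit str() at the point where the CLI list is built would keep the list[str] contract. The sketch below is an assumption about what such a conversion might look like; dict_to_cli_args, the "--key value" layout, and the sample values are hypothetical and not the code in job_manager.py.

```python
def dict_to_cli_args(env_args):
    """Flatten a config dict into a CLI-style list where every element is a str."""
    cli_args = []
    for key, value in env_args.items():
        cli_args.append(f"--{key}")
        cli_args.append(str(value))  # coerce ints, floats, bools to str
    return cli_args


cli_args = dict_to_cli_args(
    {"num_envs": 4, "embodiment": "franka", "enable_cameras": True}
)
```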

7. _find_video_paths only searches one level deep

File: episode_record.py

job_video_dir = os.path.join(video_dir, job_name)

Now that video_dir already has the timestamp subdirectory appended (video_dir/run_ts), the video files are at video_dir/run_ts/job_name/*.mp4. But _find_video_paths receives args_cli.video_dir (already timestamped) and appends job_name. This seems correct, but worth a comment to clarify the expected directory layout for future maintainers.
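
The expected layout can be pinned down with a small self-contained check. It assumes the lookup is a non-recursive glob for *.mp4 under video_dir/job_name; find_video_paths and the file names are hypothetical stand-ins for the helper under review.

```python
import glob
import os
import tempfile


def find_video_paths(video_dir, job_name):
    """Return .mp4 files directly under <video_dir>/<job_name> (one level only)."""
    job_video_dir = os.path.join(video_dir, job_name)
    return sorted(glob.glob(os.path.join(job_video_dir, "*.mp4")))


with tempfile.TemporaryDirectory() as root:
    # video_dir as the helper sees it: already includes the run timestamp.
    run_dir = os.path.join(root, "20260610T120000")
    job_dir = os.path.join(run_dir, "job_a")
    os.makedirs(os.path.join(job_dir, "nested"))
    for rel in ("env0_cam0.mp4", os.path.join("nested", "env1_cam0.mp4")):
        open(os.path.join(job_dir, rel), "wb").close()

    found = [os.path.relpath(p, run_dir) for p in find_video_paths(run_dir, "job_a")]
```

Only the file directly under job_a is returned; anything a recorder writes into a deeper subdirectory would be missed by a one-level search.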

8. CI failures unrelated to this PR

The "Run tests" and "GR00T closed-loop E2E" checks are failing — confirm these are pre-existing failures on main and not regressions introduced here.


✅ What Looks Good

  • Clean dataclass schema with forward/backward compatibility (from_dict filters to known fields)
  • Schema versioning from day one
  • Episode boundary tracking is well-reasoned — start/end inclusive, handles both termination modes
  • Failure-path episode record writing (silently swallowed exceptions prevent cascading failures)
  • Timestamp-isolated output directories prevent stale file accumulation
  • Good docstrings explaining the frame-index ↔ step-index correspondence
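
The from_dict filtering pattern mentioned in the first bullet can be illustrated with a generic dataclass. Record and its fields are hypothetical and do not reproduce the EpisodeRecord schema.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Record:
    schema_version: int = 1
    job_name: str = ""
    metrics: dict = field(default_factory=dict)

    @classmethod
    def from_dict(cls, data):
        # Keep only keys that match declared fields, so payloads written by a
        # newer schema (extra keys) or an older one (missing keys) still load.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in known})


record = Record.from_dict(
    {"schema_version": 1, "job_name": "job_a", "added_in_v2": "ignored"}
)
```

Unknown keys are dropped silently and missing keys fall back to the field defaults, which gives both directions of compatibility at the cost of hiding typos in key names.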

Update 2 (commit 66dd66a)

The latest commits complete the scope reduction by:

  1. Deleting episode_record.py entirely (335 lines removed)
  2. Removing --episode_record_dir CLI argument from eval_runner_cli.py
  3. Cleaning up eval_runner.py to remove all episode record building/writing logic
  4. Adding new test coverage for task metadata extraction in Job.from_dict

Previous findings status:

| Finding | Status |
|---|---|
| 🔴 #1 (MetricsLogger timestamped-path race) | ⚠️ Still present: MetricsLogger initialized before timestamp mutation |
| 🔴 #2 (episode_boundaries NameError) | Moot: boundaries now discarded (_) |
| 🟡 #3 (MetricsLogger default fallback) | ⚠️ Still present |
| 🟡 #4 (Off-by-one in boundaries) | Moot: boundaries no longer consumed |
| 🟡 #5 (Memory accumulation in camera_video) | ⚠️ Unchanged: no modifications to camera_video.py |
| 🟢 #6, #7 (EpisodeRecord schema concerns) | Moot: file removed |

New observations:

  • ✅ New tests test_job_from_dict_task_metadata and test_job_task_metadata_defaults are well-structured
  • ✅ Minor docstring update in job_manager.py removes now-stale EpisodeRecord reference
  • The PR now focuses cleanly on task metadata extraction, metrics output, and camera recording — EpisodeRecord is deferred

Remaining actionable items:

  1. 🔴 Fix MetricsLogger initialization order — move instantiation after timestamp mutation (or update the path post-mutation)
  2. 🟡 Consider removing the or "metrics.json" fallback if file output is only intended when explicitly requested
  3. 🟡 camera_video.py memory warning is still relevant for long runs

greptile-apps (Bot, Contributor) commented Jun 10, 2026

Greptile Summary

This PR wires up metrics file output, per-env camera video recording, and episode-boundary tracking in the eval pipeline. It also adds task_name, embodiment, and env_params metadata fields to Job to support a future EpisodeRecord schema.

  • CameraObsVideoRecorder is extended from recording env-0-only to recording all parallel envs, producing one file per (env, camera) pair.
  • rollout_policy now tracks and returns episode_boundaries ({env_idx, start_step, end_step}) alongside metrics; eval_runner and policy_runner both accept the new tuple return type.
  • eval_runner gains --metrics_file and --camera_video flags and writes timestamped output subdirectories per run. However, episode_record.py, build_episode_record(), write_episode_record(), and the --episode_record_dir flag — all explicitly described in the PR — are absent from the diff and the repository, leaving episode_boundaries silently discarded.

Confidence Score: 4/5

The PR is safe to merge for its working parts (camera recording, metrics file flag, job metadata fields), but the episode record output described in the title and PR description is entirely absent.

The episode_boundaries data is computed and returned by rollout_policy but is immediately discarded in eval_runner.py; no episode_record.py, no --episode_record_dir flag, and no JSON-per-job output exists anywhere in the repository. The PR description explicitly promises these as deliverables in this change.

isaaclab_arena/evaluation/eval_runner.py and isaaclab_arena/evaluation/eval_runner_cli.py — the episode record persistence layer and its CLI flag are missing.

Important Files Changed

| Filename | Overview |
|---|---|
| isaaclab_arena/evaluation/camera_video.py | Extends recording from env-0-only to all envs; buffer structure and flush condition correctly updated for N envs × M cameras. |
| isaaclab_arena/evaluation/eval_runner.py | Wires up metrics file saving, camera video recording, and timestamped output dirs. MetricsLogger is constructed before the timestamp is applied to args_cli.metrics_file (already flagged); episode_boundaries from rollout_policy is silently discarded; --camera_video does not propagate into enable_cameras_if_required (already flagged). |
| isaaclab_arena/evaluation/eval_runner_cli.py | Adds --camera_video and --metrics_file flags. --episode_record_dir described in the PR is absent. |
| isaaclab_arena/evaluation/job_manager.py | Adds task_name, embodiment, env_params fields to Job; correctly extracted from arena_env_args before it is converted to a CLI list. |
| isaaclab_arena/evaluation/policy_runner.py | rollout_policy now tracks and returns episode_boundaries; off-by-one handling for num_steps vs num_episodes mode is correctly documented and implemented. |
| isaaclab_arena/tests/test_job_manager.py | Adds two tests covering the new task_name/embodiment/env_params fields on Job; straightforward and correct. |

Sequence Diagram

sequenceDiagram
    participant CLI as eval_runner CLI
    participant ER as eval_runner.main()
    participant JM as JobManager
    participant RP as rollout_policy()
    participant CV as CameraObsVideoRecorder
    participant ML as MetricsLogger

    CLI->>ER: --camera_video / --metrics_file / --video
    ER->>ER: "apply run timestamp to video_dir & metrics_file path"
    ER->>JM: iterate jobs
    loop per job
        ER->>CV: "wrap env (video_folder=job_video_dir)"
        ER->>RP: rollout_policy(env, policy, ...)
        RP->>CV: env.step() x N steps
        CV-->>CV: buffer frames[env][cam]
        CV-->>CV: "flush -> write mp4 per (env, cam)"
        RP-->>ER: (metrics, episode_boundaries)
        Note over ER: episode_boundaries discarded (_)
        ER->>ML: append_job_metrics(job.name, metrics)
    end
    ER->>ML: save_metrics_to_file()
    Note over ML: writes to original (non-timestamped) path

Comments Outside Diff (1)

  1. isaaclab_arena/evaluation/eval_runner.py, line 265-271 (link)

    P1 episode_boundaries silently discarded — EpisodeRecord output never written

    The PR description states that episode_record.py (with EpisodeRecord, build_episode_record(), write_episode_record()) and the --episode_record_dir flag should all be part of this PR, but none of these exist in the diff or the repo. The episode_boundaries data returned from rollout_policy is immediately thrown away with _. As a result, no per-job JSON is ever written and the primary feature named in the PR title ("Add EpisodeRecord") is absent. The --episode_record_dir flag documented in the PR description is also missing from eval_runner_cli.py.

Reviews (2): Last reviewed commit: "fix docstring and add tests for job task..." | Re-trigger Greptile

Comment on lines +208 to +222
+metrics_logger = MetricsLogger(metrics_file=args_cli.metrics_file or "metrics.json")

 job_manager.print_jobs_info()

-if args_cli.video:
-    os.makedirs(args_cli.video_dir, exist_ok=True)
-    print(f"[INFO] Video recording enabled. Videos will be saved to: {args_cli.video_dir}")
+if args_cli.video or args_cli.camera_video or args_cli.episode_record_dir or args_cli.metrics_file:
+    run_ts = datetime.now().strftime("%Y%m%dT%H%M%S")
+    if args_cli.video or args_cli.camera_video:
+        args_cli.video_dir = os.path.join(args_cli.video_dir, run_ts)
+        os.makedirs(args_cli.video_dir, exist_ok=True)
+        print(f"[INFO] Video recording enabled. Videos will be saved to: {args_cli.video_dir}")
+    if args_cli.episode_record_dir is not None:
+        args_cli.episode_record_dir = os.path.join(args_cli.episode_record_dir, run_ts)
+    if args_cli.metrics_file is not None:
+        base, ext = os.path.splitext(args_cli.metrics_file)
+        args_cli.metrics_file = f"{base}_{run_ts}{ext}"

P1 MetricsLogger gets the non-timestamped path — timestamp suffix is never applied

MetricsLogger is constructed at line 208 using args_cli.metrics_file (e.g. "out/metrics.json"). The timestamp is computed and appended to args_cli.metrics_file later at lines 220-222, but that mutation only updates args_cli.metrics_file — it never updates metrics_logger.metrics_file. When save_metrics_to_file() is called at line 326, it uses the original non-timestamped path and overwrites the previous run's output instead of writing a new file named out/metrics_20250610T120000.json.

Comment on lines 207 to +222
 job_manager = JobManager(eval_jobs_config["jobs"])
-metrics_logger = MetricsLogger()
+metrics_logger = MetricsLogger(metrics_file=args_cli.metrics_file or "metrics.json")

 job_manager.print_jobs_info()

-if args_cli.video:
-    os.makedirs(args_cli.video_dir, exist_ok=True)
-    print(f"[INFO] Video recording enabled. Videos will be saved to: {args_cli.video_dir}")
+if args_cli.video or args_cli.camera_video or args_cli.episode_record_dir or args_cli.metrics_file:
+    run_ts = datetime.now().strftime("%Y%m%dT%H%M%S")
+    if args_cli.video or args_cli.camera_video:
+        args_cli.video_dir = os.path.join(args_cli.video_dir, run_ts)
+        os.makedirs(args_cli.video_dir, exist_ok=True)
+        print(f"[INFO] Video recording enabled. Videos will be saved to: {args_cli.video_dir}")
+    if args_cli.episode_record_dir is not None:
+        args_cli.episode_record_dir = os.path.join(args_cli.episode_record_dir, run_ts)
+    if args_cli.metrics_file is not None:
+        base, ext = os.path.splitext(args_cli.metrics_file)
+        args_cli.metrics_file = f"{base}_{run_ts}{ext}"

P1 MetricsLogger should be constructed after the timestamp is applied to args_cli.metrics_file

Moving the MetricsLogger construction to after the timestamp suffix is appended ensures metrics_logger.metrics_file holds the timestamped path, so save_metrics_to_file() actually writes the timestamped file instead of overwriting the bare path on every run.

Suggested change
-job_manager = JobManager(eval_jobs_config["jobs"])
-metrics_logger = MetricsLogger()
-metrics_logger = MetricsLogger(metrics_file=args_cli.metrics_file or "metrics.json")
-job_manager.print_jobs_info()
-if args_cli.video:
-    os.makedirs(args_cli.video_dir, exist_ok=True)
-    print(f"[INFO] Video recording enabled. Videos will be saved to: {args_cli.video_dir}")
-if args_cli.video or args_cli.camera_video or args_cli.episode_record_dir or args_cli.metrics_file:
-    run_ts = datetime.now().strftime("%Y%m%dT%H%M%S")
-    if args_cli.video or args_cli.camera_video:
-        args_cli.video_dir = os.path.join(args_cli.video_dir, run_ts)
-        os.makedirs(args_cli.video_dir, exist_ok=True)
-        print(f"[INFO] Video recording enabled. Videos will be saved to: {args_cli.video_dir}")
-    if args_cli.episode_record_dir is not None:
-        args_cli.episode_record_dir = os.path.join(args_cli.episode_record_dir, run_ts)
-    if args_cli.metrics_file is not None:
-        base, ext = os.path.splitext(args_cli.metrics_file)
-        args_cli.metrics_file = f"{base}_{run_ts}{ext}"
+job_manager = JobManager(eval_jobs_config["jobs"])
+job_manager.print_jobs_info()
+if args_cli.video or args_cli.camera_video or args_cli.episode_record_dir or args_cli.metrics_file:
+    run_ts = datetime.now().strftime("%Y%m%dT%H%M%S")
+    if args_cli.video or args_cli.camera_video:
+        args_cli.video_dir = os.path.join(args_cli.video_dir, run_ts)
+        os.makedirs(args_cli.video_dir, exist_ok=True)
+        print(f"[INFO] Video recording enabled. Videos will be saved to: {args_cli.video_dir}")
+    if args_cli.episode_record_dir is not None:
+        args_cli.episode_record_dir = os.path.join(args_cli.episode_record_dir, run_ts)
+    if args_cli.metrics_file is not None:
+        base, ext = os.path.splitext(args_cli.metrics_file)
+        args_cli.metrics_file = f"{base}_{run_ts}{ext}"
+metrics_logger = MetricsLogger(metrics_file=args_cli.metrics_file or "metrics.json")

@aiguldzh-nvidia aiguldzh-nvidia changed the title Add EpisodeRecord with task metadata, metrics output, all envs recording and episode boundaries Extend eval runner: per-env camera recording, metrics file, and episode boundaries Jun 11, 2026