feat: Add new error type SOLVE_TIMEOUT by mpatel31415 · Pull Request #1049 · NVIDIA-NeMo/Evaluator

mpatel31415 · 2026-06-03T12:39:32Z

Adds ErrorKind.SOLVE_TIMEOUT and a new "solve_timeout" failure_category
covering rollouts where the Harbor solver hit the configured run_timeout
but were previously emitted with error=None / failure_category=null.

Summary by CodeRabbit

Improvements
- Enhanced timeout detection and refined error classification for solver attempts, including clearer distinction between timeouts with and without workspace progress.
- Improved diagnostics and reporting when solver timeouts occur.
Tests
- Added tests to verify timeout handling and ensure accurate failure categorization.
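
The distinction between timeouts with and without workspace progress could look roughly like the sketch below. It is an illustration only, not the code in `harbor.py`: the helper `tag_solve_timeout` and its parameters are hypothetical, and the pre-existing INFRA timeout branches are deliberately not modeled.

```python
from typing import Optional

SOLVE_TIMEOUT = "solve_timeout"  # value of ErrorKind.SOLVE_TIMEOUT per this PR


def tag_solve_timeout(
    agent_timed_out: bool,
    workspace_diff: str,
    prompt_tokens: int,
) -> Optional[str]:
    """Return the error kind for the two timeout cases this PR describes.

    Hypothetical helper: returns None for anything it does not model,
    including the pre-existing INFRA timeout branches.
    """
    if not agent_timed_out:
        return None
    if workspace_diff:
        # Timed out with partial work: the verifier can still run on the diff.
        return SOLVE_TIMEOUT
    if prompt_tokens > 0:
        # Timed out with no diff, but the model did consume prompt tokens.
        return SOLVE_TIMEOUT
    return None
```

In both modeled cases the rollout is tagged rather than silently emitted with no error kind, which is the gap the PR description calls out.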

copy-pr-bot · 2026-06-03T12:39:36Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-06-03T12:39:45Z

📝 Walkthrough

Adds an explicit SOLVE_TIMEOUT error kind, emits it from HarborSolver on agent timeouts, propagates that status through run_evaluation into scoring_details, and classifies it in ArtifactCollector with unit tests exercising the full path.
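
The propagation step in this walkthrough can be sketched as follows. This is a minimal stand-in under stated assumptions: `Step`, `finish_step`, and the `verify` callable are hypothetical names, and only the `scoring_details["error_category"]` key and the `"solve_timeout"` value come from the PR.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional


@dataclass
class Step:
    """Hypothetical stand-in for a step record carrying scoring details."""

    scoring_details: Dict[str, Any] = field(default_factory=dict)


def finish_step(
    step: Step,
    error_kind: Optional[str],
    verify: Callable[[], Dict[str, Any]],
) -> Step:
    """Run verification, then stamp the solve-timeout category if needed."""
    is_solve_timeout = error_kind == "solve_timeout"
    # The verifier still runs, so a passing partial workspace keeps its reward.
    step.scoring_details.update(verify())
    if is_solve_timeout:
        # Stamp after verification so a downstream classifier can key on it.
        step.scoring_details["error_category"] = "solve_timeout"
    return step
```

Stamping after verification means the category overrides whatever the verifier wrote under the same key, which matches the "override" wording in the change summary.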

Changes

Solve Timeout Error Classification

| Layer / File(s) | Summary |
| --- | --- |
| ErrorKind.SOLVE_TIMEOUT enum definition `src/nemo_evaluator/solvers/base.py` | `ErrorKind` enum gains new `SOLVE_TIMEOUT = "solve_timeout"` variant. |
| HarborSolver timeout detection and emission `src/nemo_evaluator/solvers/harbor.py` | `HarborSolver.solve()` detects timeouts with and without workspace changes, logs token counts when applicable, and sets `error_kind = ErrorKind.SOLVE_TIMEOUT`. |
| Eval loop timeout tracking and error_category override `src/nemo_evaluator/engine/eval_loop.py` | `run_evaluation()` initializes `_is_solve_timeout` before retries, sets it when `ErrorKind.SOLVE_TIMEOUT` is returned, and overrides final `scoring_details["error_category"]` to `"solve_timeout"`. |
| ArtifactCollector solve_timeout classification `src/nemo_evaluator/observability/collector.py` | `_classify_failure()` classifies `scoring_details["error_category"] == "solve_timeout"` as `failure_category = "solve_timeout"` and returns early. |
| Unit tests for SOLVE_TIMEOUT error flow `tests/test_engine/test_infra_error.py` | Adds `SolveResult` test for `SOLVE_TIMEOUT` and `TestCollectorSolveTimeoutClassification` suite covering classification correctness, precedence over substring-based bucketing, and distinction from `infra_error`. |
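
For orientation, the enum change in the first row might look like the sketch below. Only the `SOLVE_TIMEOUT = "solve_timeout"` member is taken from the table; the `str` mixin, the `INFRA` member and its value, and the two-field `SolveResult` stand-in are assumptions made so the example runs on its own.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ErrorKind(str, Enum):
    INFRA = "infra"  # assumed member and value, for illustration only
    SOLVE_TIMEOUT = "solve_timeout"  # value stated in the table above


@dataclass
class SolveResult:
    """Stand-in holding only the two fields discussed in this PR."""

    error: Optional[str] = None
    error_kind: Optional[ErrorKind] = None


# A timed-out rollout that keeps error=None but is still tagged.
result = SolveResult(error=None, error_kind=ErrorKind.SOLVE_TIMEOUT)
print(result.error_kind.value)  # prints "solve_timeout"
```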

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 12.50% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title accurately and concisely describes the primary change: adding a new ErrorKind.SOLVE_TIMEOUT enum variant to categorize solver timeouts. |

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tests/test_engine/test_infra_error.py (1)

24-24: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Missing module-level import for ArtifactCollector.

The tests use ArtifactCollector() but it's not imported. The PR removed inline imports without adding the module-level import.

🔧 Proposed fix

 from nemo_evaluator.engine.eval_loop import _get_error_category
 from nemo_evaluator.errors import GracefulError, InfraError
+from nemo_evaluator.observability.collector import ArtifactCollector
 from nemo_evaluator.observability.types import StepRecord
 from nemo_evaluator.solvers.base import ErrorKind, SolveResult

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_engine/test_infra_error.py` at line 24, Add a missing module-level
import for ArtifactCollector: at the top of the test file with the other
imports, add "from nemo_evaluator.observability.collector import
ArtifactCollector" so the tests that instantiate ArtifactCollector() can resolve
the class; ensure the import is not left inline inside a test.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 326b3ef0-9b46-470b-8f94-946cffd6859d

📥 Commits

Reviewing files that changed from the base of the PR and between 55ac58c and 06167e9.

⛔ Files ignored due to path filters (2)

docs/evaluation/scoring.md is excluded by !docs/** and included by none
pyproject.toml is excluded by none and included by none

📒 Files selected for processing (5)

src/nemo_evaluator/engine/eval_loop.py
src/nemo_evaluator/observability/collector.py
src/nemo_evaluator/solvers/base.py
src/nemo_evaluator/solvers/harbor.py
tests/test_engine/test_infra_error.py

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/test_engine/test_infra_error.py`:
- Line 92: Move the "from nemo_evaluator.observability.collector import
ArtifactCollector" import out of the individual test functions and add a single
module-level import alongside the other imports near the top of the file; then
remove the repeated imports of ArtifactCollector from the five test methods
where it currently appears so the tests use the shared module-level symbol
instead.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 75e95428-d60c-4495-9db2-19e180ddcca2

📥 Commits

Reviewing files that changed from the base of the PR and between 06167e9 and 6b04b03.

📒 Files selected for processing (1)

tests/test_engine/test_infra_error.py

coderabbitai

Caution

Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

Actionable comments posted: 1

🛑 Comments failed to post (1)

tests/test_engine/test_infra_error.py (1)

92-92: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Move ArtifactCollector import to module level.

The import is repeated inside five test methods. As per coding guidelines, import shared helpers at module level alongside the other imports (around line 24).

📦 Proposed fix

Add the import at module level (after line 24):

 from nemo_evaluator.observability.types import StepRecord
 from nemo_evaluator.solvers.base import ErrorKind, SolveResult
+from nemo_evaluator.observability.collector import ArtifactCollector

Then remove the five repeated imports inside the test methods:

 class TestCollectorInfraClassification:
     def test_infra_error_classified(self):
-        from nemo_evaluator.observability.collector import ArtifactCollector
-
         collector = ArtifactCollector()

(Repeat for lines 106, 121, 137, 151)

Also applies to: 106-106, 121-121, 137-137, 151-151

Adds ErrorKind.SOLVE_TIMEOUT and a new "solve_timeout" failure_category covering rollouts where the Harbor solver hit the configured run_timeout but were previously emitted with error=None / failure_category=null.

HarborSolver.solve()
- The existing `agent_timed_out and workspace_diff` branch (which intentionally keeps error=None so the verifier still runs on the partial workspace) now also tags the rollout error_kind=SOLVE_TIMEOUT.
- A new final `agent_timed_out` branch catches the case where the agent hit the deadline without producing a workspace diff but still recorded prompt tokens (the model was reasoning but never emitted a tool call).
- The three pre-existing INFRA timeout branches are untouched.

eval_loop._run_step()
- Tracks _is_solve_timeout alongside _is_infra. After verify completes, stamps step.scoring_details["error_category"] = "solve_timeout" so the categorizer can short-circuit on it. The verifier is allowed to run on the partial workspace (passing solutions are still rewarded).

ArtifactCollector._classify_failure
- New short-circuit: if scoring_details.error_category == "solve_timeout", failure_category = "solve_timeout". Placed after the infra_error rule and before the substring scan so the existing "timed out" / "Timeout" rule cannot mis-route Harbor wall-clock timeouts into the HTTP-408 `timeout` bucket.

Resume cache filter (eval_loop.py): unchanged. solve_timeout entries are NOT dropped on resume, since re-running a wall-clock-bound rollout with the same budget would just hit the wall again.

Tests: TestErrorKind.test_solve_timeout_kind plus TestCollectorSolveTimeoutClassification (3 cases) added in tests/test_engine/test_infra_error.py. Full suite passes (1618 / 1618, 29 skipped, 14 network-deselected).

Audit-skill scripts (extract_report_examples.py, audit_failures/SKILL.md) do not exist on origin/pjanuszewski/cluing_in; the matching updates need to be applied separately on origin/mpatelka/skill_extended.

Signed-off-by: Martyna Patelka <mpatelka@nvidia.com>
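
The rule ordering described for `ArtifactCollector._classify_failure` in the commit message above can be sketched as follows. This is a simplified stand-in, not the collector's code: the function name `classify_failure`, the `error_text` parameter, and the assumption that the infra rule keys on `error_category == "infra_error"` are all hypothetical.

```python
from typing import Any, Dict, Optional


def classify_failure(
    scoring_details: Dict[str, Any],
    error_text: str = "",
) -> Optional[str]:
    """Return a failure category, checking explicit tags before substrings."""
    category = scoring_details.get("error_category")
    if category == "infra_error":  # assumed shape of the infra_error rule
        return "infra_error"
    if category == "solve_timeout":
        # Short-circuit before the substring scan below.
        return "solve_timeout"
    if "timed out" in error_text or "Timeout" in error_text:
        # Generic bucket that a wall-clock solver timeout must not land in.
        return "timeout"
    return None
```

Because the explicit tag is checked first, a rollout whose message happens to contain "timed out" still lands in `solve_timeout` rather than the generic `timeout` bucket.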

piojanu · 2026-06-09T07:32:11Z

/ok to test 2e147ff

coderabbitai Bot reviewed Jun 3, 2026

mpatel31415 changed the title ~~Mpatelka/new error solve timeout public~~ Add - new error type SOLVE_TIMEOUT covering rollouts where the Harbor solver hit the configured run_timeout Jun 3, 2026

coderabbitai Bot reviewed Jun 3, 2026

mpatel31415 changed the title ~~Add - new error type SOLVE_TIMEOUT covering rollouts where the Harbor solver hit the configured run_timeout~~ feat: Add new error type SOLVE_TIMEOUT covering rollouts where the Harbor solver hit the configured run_timeout Jun 3, 2026

mpatel31415 added 5 commits June 3, 2026 15:01

apply code rabbit fix

82ed663

Signed-off-by: Martyna Patelka <mpatelka@nvidia.com>

fix pipeline

8e02f92

Signed-off-by: Martyna Patelka <mpatelka@nvidia.com>

chore: remove old comment

13ebc1e

Signed-off-by: Martyna Patelka <mpatelka@nvidia.com>

fix: add missing import in test_infra_error.py

2e147ff

Signed-off-by: Martyna Patelka <mpatelka@nvidia.com>

mpatel31415 force-pushed the mpatelka/new_error_solve_timeout_public branch from 6b04b03 to 2e147ff Compare June 3, 2026 13:01

piojanu changed the title ~~feat: Add new error type SOLVE_TIMEOUT covering rollouts where the Harbor solver hit the configured run_timeout~~ feat: Add new error type SOLVE_TIMEOUT Jun 9, 2026

NVIDIA-NeMo deleted a comment from copy-pr-bot Bot Jun 9, 2026


mpatel31415 wants to merge 5 commits into NVIDIA-NeMo:main from mpatel31415:mpatelka/new_error_solve_timeout_public
