Skip to content

feat: Add new error type SOLVE_TIMEOUT#1049

Open
mpatel31415 wants to merge 5 commits into
NVIDIA-NeMo:mainfrom
mpatel31415:mpatelka/new_error_solve_timeout_public
Open

feat: Add new error type SOLVE_TIMEOUT#1049
mpatel31415 wants to merge 5 commits into
NVIDIA-NeMo:mainfrom
mpatel31415:mpatelka/new_error_solve_timeout_public

Conversation

@mpatel31415

@mpatel31415 mpatel31415 commented Jun 3, 2026

Copy link
Copy Markdown

Adds ErrorKind.SOLVE_TIMEOUT and a new "solve_timeout" failure_category
covering rollouts where the Harbor solver hit the configured run_timeout
but were previously emitted with error=None / failure_category=null.

Summary by CodeRabbit

  • Improvements

    • Enhanced timeout detection and refined error classification for solver attempts, including clearer distinction between timeouts with and without workspace progress.
    • Improved diagnostics and reporting when solver timeouts occur.
  • Tests

    • Added tests to verify timeout handling and ensure accurate failure categorization.

@copy-pr-bot

copy-pr-bot Bot commented Jun 3, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 3, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

Adds an explicit SOLVE_TIMEOUT error kind, emits it from HarborSolver on agent timeouts, propagates that status through run_evaluation into scoring_details, and classifies it in ArtifactCollector with unit tests exercising the full path.

Changes

Solve Timeout Error Classification

Layer / File(s) Summary
ErrorKind.SOLVE_TIMEOUT enum definition
src/nemo_evaluator/solvers/base.py
ErrorKind enum gains new SOLVE_TIMEOUT = "solve_timeout" variant.
HarborSolver timeout detection and emission
src/nemo_evaluator/solvers/harbor.py
HarborSolver.solve() detects timeouts with and without workspace changes, logs token counts when applicable, and sets error_kind = ErrorKind.SOLVE_TIMEOUT.
Eval loop timeout tracking and error_category override
src/nemo_evaluator/engine/eval_loop.py
run_evaluation() initializes _is_solve_timeout before retries, sets it when ErrorKind.SOLVE_TIMEOUT is returned, and overrides final scoring_details["error_category"] to "solve_timeout".
ArtifactCollector solve_timeout classification
src/nemo_evaluator/observability/collector.py
_classify_failure() classifies scoring_details["error_category"] == "solve_timeout" as failure_category = "solve_timeout" and returns early.
Unit tests for SOLVE_TIMEOUT error flow
tests/test_engine/test_infra_error.py
Adds SolveResult test for SOLVE_TIMEOUT and TestCollectorSolveTimeoutClassification suite covering classification correctness, precedence over substring-based bucketing, and distinction from infra_error.

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 12.50% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title accurately and concisely describes the primary change: adding a new ErrorKind.SOLVE_TIMEOUT enum variant to categorize solver timeouts.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/test_engine/test_infra_error.py (1)

24-24: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Missing module-level import for ArtifactCollector.

The tests use ArtifactCollector() but it's not imported. The PR removed inline imports without adding the module-level import.

🔧 Proposed fix
 from nemo_evaluator.engine.eval_loop import _get_error_category
 from nemo_evaluator.errors import GracefulError, InfraError
+from nemo_evaluator.observability.collector import ArtifactCollector
 from nemo_evaluator.observability.types import StepRecord
 from nemo_evaluator.solvers.base import ErrorKind, SolveResult
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_engine/test_infra_error.py` at line 24, Add a missing module-level
import for ArtifactCollector: at the top of the test file with the other
imports, add "from nemo_evaluator.artifacts import ArtifactCollector" (or the
correct module that defines ArtifactCollector) so the tests that instantiate
ArtifactCollector() can resolve the class; ensure the import is not left inline
inside a test and matches the project's export location for ArtifactCollector.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@tests/test_engine/test_infra_error.py`:
- Line 24: Add a missing module-level import for ArtifactCollector: at the top
of the test file with the other imports, add "from nemo_evaluator.artifacts
import ArtifactCollector" (or the correct module that defines ArtifactCollector)
so the tests that instantiate ArtifactCollector() can resolve the class; ensure
the import is not left inline inside a test and matches the project's export
location for ArtifactCollector.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 326b3ef0-9b46-470b-8f94-946cffd6859d

📥 Commits

Reviewing files that changed from the base of the PR and between 55ac58c and 06167e9.

⛔ Files ignored due to path filters (2)
  • docs/evaluation/scoring.md is excluded by !docs/** and included by none
  • pyproject.toml is excluded by none and included by none
📒 Files selected for processing (5)
  • src/nemo_evaluator/engine/eval_loop.py
  • src/nemo_evaluator/observability/collector.py
  • src/nemo_evaluator/solvers/base.py
  • src/nemo_evaluator/solvers/harbor.py
  • tests/test_engine/test_infra_error.py

@mpatel31415 mpatel31415 changed the title Mpatelka/new error solve timeout public Add - new error type SOLVE_TIMEOUT covering rollouts where the Harbor solver hit the configured run_timeout Jun 3, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/test_engine/test_infra_error.py`:
- Line 92: Move the "from nemo_evaluator.observability.collector import
ArtifactCollector" import out of the individual test functions and add a single
module-level import alongside the other imports near the top of the file; then
remove the repeated imports of ArtifactCollector from the five test methods
where it currently appears so the tests use the shared module-level symbol
instead.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 75e95428-d60c-4495-9db2-19e180ddcca2

📥 Commits

Reviewing files that changed from the base of the PR and between 06167e9 and 6b04b03.

📒 Files selected for processing (1)
  • tests/test_engine/test_infra_error.py

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/test_engine/test_infra_error.py`:
- Line 92: Move the "from nemo_evaluator.observability.collector import
ArtifactCollector" import out of the individual test functions and add a single
module-level import alongside the other imports near the top of the file; then
remove the repeated imports of ArtifactCollector from the five test methods
where it currently appears so the tests use the shared module-level symbol
instead.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 75e95428-d60c-4495-9db2-19e180ddcca2

📥 Commits

Reviewing files that changed from the base of the PR and between 06167e9 and 6b04b03.

📒 Files selected for processing (1)
  • tests/test_engine/test_infra_error.py
🛑 Comments failed to post (1)
tests/test_engine/test_infra_error.py (1)

92-92: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Move ArtifactCollector import to module level.

The import is repeated inside five test methods. As per coding guidelines, import shared helpers at module level alongside the other imports (around line 24).

📦 Proposed fix

Add the import at module level (after line 24):

 from nemo_evaluator.observability.types import StepRecord
 from nemo_evaluator.solvers.base import ErrorKind, SolveResult
+from nemo_evaluator.observability.collector import ArtifactCollector

Then remove the five repeated imports inside the test methods:

 class TestCollectorInfraClassification:
     def test_infra_error_classified(self):
-        from nemo_evaluator.observability.collector import ArtifactCollector
-
         collector = ArtifactCollector()

(Repeat for lines 106, 121, 137, 151)

Also applies to: 106-106, 121-121, 137-137, 151-151

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_engine/test_infra_error.py` at line 92, Move the "from
nemo_evaluator.observability.collector import ArtifactCollector" import out of
the individual test functions and add a single module-level import alongside the
other imports near the top of the file; then remove the repeated imports of
ArtifactCollector from the five test methods where it currently appears so the
tests use the shared module-level symbol instead.

@mpatel31415 mpatel31415 changed the title Add - new error type SOLVE_TIMEOUT covering rollouts where the Harbor solver hit the configured run_timeout feat: Add new error type SOLVE_TIMEOUT covering rollouts where the Harbor solver hit the configured run_timeout Jun 3, 2026
Adds ErrorKind.SOLVE_TIMEOUT and a new "solve_timeout" failure_category
covering rollouts where the Harbor solver hit the configured run_timeout
but were previously emitted with error=None / failure_category=null.

HarborSolver.solve()
  - The existing `agent_timed_out and workspace_diff` branch (which
    intentionally keeps error=None so the verifier still runs on the
    partial workspace) now also tags the rollout error_kind=SOLVE_TIMEOUT.
  - A new final `agent_timed_out` branch catches the case where the agent
    hit the deadline without producing a workspace diff but still recorded
    prompt tokens (the model was reasoning but never emitted a tool call).
  - The three pre-existing INFRA timeout branches are untouched.

eval_loop._run_step()
  - Tracks _is_solve_timeout alongside _is_infra. After verify completes,
    stamps step.scoring_details["error_category"] = "solve_timeout" so the
    categorizer can short-circuit on it. The verifier is allowed to run on
    the partial workspace (passing solutions are still rewarded).

ArtifactCollector._classify_failure
  - New short-circuit: if scoring_details.error_category == "solve_timeout",
    failure_category = "solve_timeout". Placed after the infra_error rule
    and before the substring scan so the existing "timed out" / "Timeout"
    rule cannot mis-route Harbor wall-clock timeouts into the HTTP-408
    `timeout` bucket.

Resume cache filter (eval_loop.py): unchanged. solve_timeout entries are
NOT dropped on resume — re-running a wall-clock-bound rollout with the
same budget would just hit the wall again.

Tests: TestErrorKind.test_solve_timeout_kind plus
TestCollectorSolveTimeoutClassification (3 cases) added in
tests/test_engine/test_infra_error.py. Full suite passes (1618 / 1618,
29 skipped, 14 network-deselected).

Audit-skill scripts (extract_report_examples.py, audit_failures/SKILL.md)
do not exist on origin/pjanuszewski/cluing_in — the matching updates
need to be applied separately on origin/mpatelka/skill_extended.

Signed-off-by: Martyna Patelka <mpatelka@nvidia.com>
Signed-off-by: Martyna Patelka <mpatelka@nvidia.com>
Signed-off-by: Martyna Patelka <mpatelka@nvidia.com>
Signed-off-by: Martyna Patelka <mpatelka@nvidia.com>
Signed-off-by: Martyna Patelka <mpatelka@nvidia.com>
@mpatel31415 mpatel31415 force-pushed the mpatelka/new_error_solve_timeout_public branch from 6b04b03 to 2e147ff Compare June 3, 2026 13:01
@piojanu piojanu changed the title feat: Add new error type SOLVE_TIMEOUT covering rollouts where the Harbor solver hit the configured run_timeout feat: Add new error type SOLVE_TIMEOUT Jun 9, 2026
@NVIDIA-NeMo NVIDIA-NeMo deleted a comment from copy-pr-bot Bot Jun 9, 2026
@piojanu

piojanu commented Jun 9, 2026

Copy link
Copy Markdown

/ok to test 2e147ff

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants