feat: Add new error type SOLVE_TIMEOUT#1049
Conversation
📝 WalkthroughWalkthroughAdds an explicit SOLVE_TIMEOUT error kind, emits it from HarborSolver on agent timeouts, propagates that status through run_evaluation into scoring_details, and classifies it in ArtifactCollector with unit tests exercising the full path. ChangesSolve Timeout Error Classification
🎯 2 (Simple) | ⏱️ ~10 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/test_engine/test_infra_error.py (1)
24-24:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winMissing module-level import for ArtifactCollector.
The tests use
ArtifactCollector()but it's not imported. The PR removed inline imports without adding the module-level import.🔧 Proposed fix
from nemo_evaluator.engine.eval_loop import _get_error_category from nemo_evaluator.errors import GracefulError, InfraError +from nemo_evaluator.observability.collector import ArtifactCollector from nemo_evaluator.observability.types import StepRecord from nemo_evaluator.solvers.base import ErrorKind, SolveResult🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_engine/test_infra_error.py` at line 24, Add a missing module-level import for ArtifactCollector: at the top of the test file with the other imports, add "from nemo_evaluator.artifacts import ArtifactCollector" (or the correct module that defines ArtifactCollector) so the tests that instantiate ArtifactCollector() can resolve the class; ensure the import is not left inline inside a test and matches the project's export location for ArtifactCollector.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@tests/test_engine/test_infra_error.py`:
- Line 24: Add a missing module-level import for ArtifactCollector: at the top
of the test file with the other imports, add "from nemo_evaluator.artifacts
import ArtifactCollector" (or the correct module that defines ArtifactCollector)
so the tests that instantiate ArtifactCollector() can resolve the class; ensure
the import is not left inline inside a test and matches the project's export
location for ArtifactCollector.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 326b3ef0-9b46-470b-8f94-946cffd6859d
⛔ Files ignored due to path filters (2)
docs/evaluation/scoring.mdis excluded by!docs/**and included by nonepyproject.tomlis excluded by none and included by none
📒 Files selected for processing (5)
src/nemo_evaluator/engine/eval_loop.pysrc/nemo_evaluator/observability/collector.pysrc/nemo_evaluator/solvers/base.pysrc/nemo_evaluator/solvers/harbor.pytests/test_engine/test_infra_error.py
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/test_engine/test_infra_error.py`:
- Line 92: Move the "from nemo_evaluator.observability.collector import
ArtifactCollector" import out of the individual test functions and add a single
module-level import alongside the other imports near the top of the file; then
remove the repeated imports of ArtifactCollector from the five test methods
where it currently appears so the tests use the shared module-level symbol
instead.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 75e95428-d60c-4495-9db2-19e180ddcca2
📒 Files selected for processing (1)
tests/test_engine/test_infra_error.py
There was a problem hiding this comment.
Caution
Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/test_engine/test_infra_error.py`:
- Line 92: Move the "from nemo_evaluator.observability.collector import
ArtifactCollector" import out of the individual test functions and add a single
module-level import alongside the other imports near the top of the file; then
remove the repeated imports of ArtifactCollector from the five test methods
where it currently appears so the tests use the shared module-level symbol
instead.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 75e95428-d60c-4495-9db2-19e180ddcca2
📒 Files selected for processing (1)
tests/test_engine/test_infra_error.py
🛑 Comments failed to post (1)
tests/test_engine/test_infra_error.py (1)
92-92:
⚠️ Potential issue | 🟡 Minor | ⚡ Quick winMove
ArtifactCollectorimport to module level.The import is repeated inside five test methods. As per coding guidelines, import shared helpers at module level alongside the other imports (around line 24).
📦 Proposed fix
Add the import at module level (after line 24):
from nemo_evaluator.observability.types import StepRecord from nemo_evaluator.solvers.base import ErrorKind, SolveResult +from nemo_evaluator.observability.collector import ArtifactCollectorThen remove the five repeated imports inside the test methods:
class TestCollectorInfraClassification: def test_infra_error_classified(self): - from nemo_evaluator.observability.collector import ArtifactCollector - collector = ArtifactCollector()(Repeat for lines 106, 121, 137, 151)
Also applies to: 106-106, 121-121, 137-137, 151-151
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_engine/test_infra_error.py` at line 92, Move the "from nemo_evaluator.observability.collector import ArtifactCollector" import out of the individual test functions and add a single module-level import alongside the other imports near the top of the file; then remove the repeated imports of ArtifactCollector from the five test methods where it currently appears so the tests use the shared module-level symbol instead.
Adds ErrorKind.SOLVE_TIMEOUT and a new "solve_timeout" failure_category
covering rollouts where the Harbor solver hit the configured run_timeout
but were previously emitted with error=None / failure_category=null.
HarborSolver.solve()
- The existing `agent_timed_out and workspace_diff` branch (which
intentionally keeps error=None so the verifier still runs on the
partial workspace) now also tags the rollout error_kind=SOLVE_TIMEOUT.
- A new final `agent_timed_out` branch catches the case where the agent
hit the deadline without producing a workspace diff but still recorded
prompt tokens (the model was reasoning but never emitted a tool call).
- The three pre-existing INFRA timeout branches are untouched.
eval_loop._run_step()
- Tracks _is_solve_timeout alongside _is_infra. After verify completes,
stamps step.scoring_details["error_category"] = "solve_timeout" so the
categorizer can short-circuit on it. The verifier is allowed to run on
the partial workspace (passing solutions are still rewarded).
ArtifactCollector._classify_failure
- New short-circuit: if scoring_details.error_category == "solve_timeout",
failure_category = "solve_timeout". Placed after the infra_error rule
and before the substring scan so the existing "timed out" / "Timeout"
rule cannot mis-route Harbor wall-clock timeouts into the HTTP-408
`timeout` bucket.
Resume cache filter (eval_loop.py): unchanged. solve_timeout entries are
NOT dropped on resume — re-running a wall-clock-bound rollout with the
same budget would just hit the wall again.
Tests: TestErrorKind.test_solve_timeout_kind plus
TestCollectorSolveTimeoutClassification (3 cases) added in
tests/test_engine/test_infra_error.py. Full suite passes (1618 / 1618,
29 skipped, 14 network-deselected).
Audit-skill scripts (extract_report_examples.py, audit_failures/SKILL.md)
do not exist on origin/pjanuszewski/cluing_in — the matching updates
need to be applied separately on origin/mpatelka/skill_extended.
Signed-off-by: Martyna Patelka <mpatelka@nvidia.com>
Signed-off-by: Martyna Patelka <mpatelka@nvidia.com>
Signed-off-by: Martyna Patelka <mpatelka@nvidia.com>
Signed-off-by: Martyna Patelka <mpatelka@nvidia.com>
Signed-off-by: Martyna Patelka <mpatelka@nvidia.com>
6b04b03 to
2e147ff
Compare
|
/ok to test 2e147ff |
Adds ErrorKind.SOLVE_TIMEOUT and a new "solve_timeout" failure_category
covering rollouts where the Harbor solver hit the configured run_timeout
but were previously emitted with error=None / failure_category=null.
Summary by CodeRabbit
Improvements
Tests