Enhance Graph.update() and add whole-graph update tests #1843

Open
Andy-Jost wants to merge 8 commits into NVIDIA:main from Andy-Jost:graph-updates
Contributor

Andy-Jost commented Mar 31, 2026

Extend tests of the existing Graph.update function and refactor existing graph code in preparation for further work.

Summary

  • Extends Graph.update() to accept both GraphBuilder and GraphDef as sources, giving users flexibility to update instantiated graphs from either the stream-capture or explicit-graph API
  • Surfaces detailed CUgraphExecUpdateResultInfo on update failure (reason enum + docstring) instead of a generic CUDA_ERROR_GRAPH_EXEC_UPDATE_FAILURE
  • Splits the monolithic _graphdef.pyx (2000+ lines) into a _graph_def/ subpackage with three focused modules for maintainability
  • Reorganizes graph test files into thematic groups with module docstrings
  • Adds new tests for whole-graph update covering happy paths and error cases

Changes

  • cuda/core/_graph/_graph_builder.pyx: Refactored Graph.update() to dispatch on GraphBuilder vs GraphDef, call cuGraphExecUpdate with a CUgraphExecUpdateResultInfo struct, and raise a descriptive CUDAError on failure
  • cuda/core/_graph/_graph_def/: Split _graphdef.pyx into _graph_def.pyx (Condition, GraphAllocOptions, GraphDef), _graph_node.pyx (GraphNode base class and builder methods with GN_* inline helpers), and _subclasses.pyx (all concrete node subclasses). Handle property annotations updated to use driver.* types consistently.
  • tests/graph/: Renamed test files to reflect their scope (test_graph_builder.py, test_graph_builder_conditional.py, test_graph_memory_resource.py, test_graph_update.py, test_graphdef*.py, test_device_launch.py); added module docstrings; moved tests to appropriate files
  • tests/graph/test_graph_update.py: Added parametrized test_graph_update_kernel_args (GraphBuilder + GraphDef), test_graph_update_conditional, test_graph_update_unfinished_builder, test_graph_update_topology_mismatch, test_graph_update_wrong_type

Test Coverage

  • Parametrized happy path: kernel-only graph updated with new pointer args, tested via both GraphBuilder and GraphDef
  • Conditional switch update: existing test (renamed) exercising topology-compatible conditional graph updates
  • Unfinished builder: ValueError when source GraphBuilder hasn't finished capturing
  • Topology mismatch: CUDAError with descriptive reason from CUgraphExecUpdateResultInfo
  • Wrong type: TypeError for invalid argument types
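The dispatch and error cases listed above can be sketched in plain Python. This is a minimal illustration only, not the PR's Cython code: the two classes are hypothetical stand-ins for cuda.core's GraphBuilder and GraphDef, and the attribute names, the helper name resolve_update_source, and the error messages are assumptions made for the example.

```python
class GraphBuilder:
    """Hypothetical stand-in for a stream-capture graph builder."""

    def __init__(self, graph_handle, finished):
        self._graph_handle = graph_handle
        self._finished = finished  # True once capture has ended


class GraphDef:
    """Hypothetical stand-in for an explicitly constructed graph."""

    def __init__(self, graph_handle):
        self._graph_handle = graph_handle


def resolve_update_source(source):
    """Return the graph handle to update from, or raise on bad input."""
    if isinstance(source, GraphBuilder):
        # An unfinished builder has no complete graph to update from.
        if not source._finished:
            raise ValueError("GraphBuilder has not finished building")
        return source._graph_handle
    if isinstance(source, GraphDef):
        return source._graph_handle
    # Anything else is rejected before any driver call would be made.
    raise TypeError(
        f"expected GraphBuilder or GraphDef, got {type(source).__name__}"
    )
```

Checking the type and the builder state up front means both failure modes surface as ordinary Python exceptions rather than as driver errors.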

Related Work

Rename test files to reflect what they actually test:
- test_basic -> test_graph_builder (stream capture tests)
- test_conditional -> test_graph_builder_conditional
- test_advanced -> test_graph_update (moved child_graph and
  stream_lifetime tests into test_graph_builder)
- test_capture_alloc -> test_graph_memory_resource
- test_explicit* -> test_graphdef*

Made-with: Cursor
- Extend Graph.update() to accept both GraphBuilder and GraphDef sources
- Surface CUgraphExecUpdateResultInfo details on update failure instead
  of a generic CUDA_ERROR_GRAPH_EXEC_UPDATE_FAILURE message
- Release the GIL during cuGraphExecUpdate via nogil block
- Add parametrized happy-path test covering both GraphBuilder and GraphDef
- Add error-case tests: unfinished builder, topology mismatch, wrong type

Made-with: Cursor
@Andy-Jost Andy-Jost added this to the cuda.core v1.0.0 milestone Mar 31, 2026
@Andy-Jost Andy-Jost added P0 High priority - Must do! feature New feature or request cuda.core Everything related to the cuda.core module labels Mar 31, 2026
@Andy-Jost Andy-Jost self-assigned this Mar 31, 2026
@Andy-Jost Andy-Jost requested review from cpcloud, leofang, mdboom, rparolin and rwgk and removed request for leofang March 31, 2026 18:25
- Chain GraphDef kernel launches sequentially (n.launch instead of
  g.launch) to avoid concurrent writes to the same memory location
- Update GraphDef.handle and GraphNode.handle annotations to reflect
  that as_py returns driver types (CUgraph, CUgraphNode), not int

Made-with: Cursor
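To illustrate why chaining matters in the commit above, here is a toy dependency model. It is a sketch under stated assumptions, not cuda.core code: ToyGraph and ToyNode are hypothetical, and the only behavior borrowed from the commit message is that launching from the graph adds a node with no dependencies, while launching from a node adds a node that depends on it.

```python
class ToyNode:
    """Hypothetical node that records which nodes it depends on."""

    def __init__(self, graph, name, deps):
        self.graph = graph
        self.name = name
        self.deps = tuple(deps)
        graph.nodes.append(self)

    def launch(self, name):
        # Launching from a node orders the new node after this one.
        return ToyNode(self.graph, name, deps=[self])


class ToyGraph:
    """Hypothetical graph; launching from it creates a root node."""

    def __init__(self):
        self.nodes = []

    def launch(self, name):
        return ToyNode(self, name, deps=[])


def unordered_pairs(graph):
    """Return name pairs of nodes with no dependency path between them."""

    def depends_on(a, b):
        # True if b is a direct or transitive dependency of a.
        return any(d is b or depends_on(d, b) for d in a.deps)

    pairs = []
    for i, a in enumerate(graph.nodes):
        for b in graph.nodes[i + 1:]:
            if not depends_on(a, b) and not depends_on(b, a):
                pairs.append((a.name, b.name))
    return pairs
```

In this model, two launches from the graph produce one unordered pair, so two kernels writing the same location could race; launching the second from the first node leaves no unordered pairs.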
The monolithic _graphdef.pyx (2000+ lines) is split into three focused
modules under _graph_def/: _graph_def.pyx (Condition, GraphAllocOptions,
GraphDef), _graph_node.pyx (GraphNode base class and builder methods),
and _subclasses.pyx (all concrete node subclasses). Long method bodies
in GraphNode are factored into cdef inline GN_* helpers following
existing codebase conventions. Handle property annotations updated to
use driver.* types consistently.

Made-with: Cursor
@Andy-Jost
Contributor Author

_graphdef.pyx was broken into 3 parts under _graph_def/. No need to review those in detail.

Update two references that still used _graphdef instead of _graph_def
after the subpackage split.

Made-with: Cursor
Collaborator

@rwgk rwgk left a comment

I'm assuming we don't have to worry about the import/cimport export surface, is that a valid assumption?

        cdef cydriver.CUresult err
        with nogil:
            err = cydriver.cuGraphExecUpdate(cu_exec, cu_graph, &result_info)
        if err != cydriver.CUresult.CUDA_SUCCESS:
Collaborator

I used Cursor GPT-5.4 1M High to "comb" through this "very complex and very large" PR. It only found this one "High" item:


I think this would be a bit safer if it distinguished the graph-update failure case from ordinary driver errors, e.g.

        cdef cydriver.CUgraphExecUpdateResultInfo result_info
        cdef cydriver.CUresult err
        with nogil:
            err = cydriver.cuGraphExecUpdate(cu_exec, cu_graph, &result_info)
        if err == cydriver.CUresult.CUDA_SUCCESS:
            return
        if err == cydriver.CUresult.CUDA_ERROR_GRAPH_EXEC_UPDATE_FAILURE:
            reason = driver.CUgraphExecUpdateResult(result_info.result)
            msg = f"Graph update failed: {reason.__doc__.strip()} ({reason.name})"
            raise CUDAError(msg)
        raise CUDAError(err)

Rationale:

  • Using cydriver.cuGraphExecUpdate(...) directly here makes sense, since the higher-level binding drops resultInfo on non-success and would lose the detailed update reason entirely.
  • But resultInfo appears to be the structured explanation for the specific CUDA_ERROR_GRAPH_EXEC_UPDATE_FAILURE path, not necessarily for every possible non-success CUresult.
  • Even when result_info.result == CU_GRAPH_EXEC_UPDATE_ERROR, the enum docs say the actual explanation is described by the function return value. The current code discards err, so it may collapse distinct driver failures into the same generic resultInfo-based message.
  • This shape preserves the nice detailed message for graph-update incompatibilities while still surfacing ordinary driver errors accurately.
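The control flow suggested above can be exercised without a GPU. The following is a minimal pure-Python sketch, assuming stand-in enums and a stand-in CUDAError: the member names mirror the driver's, but the numeric values, the function name check_update, and the message format are illustrative assumptions rather than the PR's actual code.

```python
import enum


class CUresult(enum.IntEnum):
    """Stand-in for the driver result enum (values are illustrative)."""
    CUDA_SUCCESS = 0
    CUDA_ERROR_INVALID_VALUE = 1
    CUDA_ERROR_GRAPH_EXEC_UPDATE_FAILURE = 910


class UpdateResult(enum.IntEnum):
    """Stand-in for CUgraphExecUpdateResult (values are illustrative)."""
    CU_GRAPH_EXEC_UPDATE_SUCCESS = 0
    CU_GRAPH_EXEC_UPDATE_ERROR = 1
    CU_GRAPH_EXEC_UPDATE_ERROR_TOPOLOGY_CHANGED = 2


class CUDAError(Exception):
    """Stand-in for the library's error type."""


def check_update(err, reason):
    """Raise a CUDAError whose message depends on the kind of failure."""
    if err == CUresult.CUDA_SUCCESS:
        return
    if err == CUresult.CUDA_ERROR_GRAPH_EXEC_UPDATE_FAILURE:
        # Only on this path is the result-info reason trusted.
        raise CUDAError(f"Graph update failed: {UpdateResult(reason).name}")
    # Any other driver error is reported by its own code.
    raise CUDAError(CUresult(err).name)
```

The point of the ordering is that an unrelated driver error is never described using a result-info reason that may not apply to it.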

Contributor Author

According to the docs, cuGraphExecUpdate only returns CUDA_SUCCESS or CUDA_ERROR_GRAPH_EXEC_UPDATE_FAILURE.

Contributor Author

"very complex and very large"

To be clear, nearly all of this change is refactoring and code movement. The graph tests were regrouped slightly and renamed. The huge _graphdef module was split into three parts. There are not many functional changes here.

Collaborator

According to the docs, cuGraphExecUpdate only returns CUDA_SUCCESS or CUDA_ERROR_GRAPH_EXEC_UPDATE_FAILURE.

Documentation tends to be imprecise, or become imprecise over time without anyone noticing.

The suggested change improves the quality of implementation at a very small cost.

Contributor Author

I checked the driver code and the docs are indeed incorrect.

@Andy-Jost
Contributor Author

I'm assuming we don't have to worry about the import/cimport export surface, is that a valid assumption?

This is correct. We do not make any guarantees whatsoever about Cython interface stability. The public Python API consists of what we expose at cuda.core. All of these submodules are private and have underscore-prefixed names.

Check for CUDA_ERROR_GRAPH_EXEC_UPDATE_FAILURE first to provide the
rich error message with the update result reason, then fall through
to HANDLE_RETURN for any other error code (CUDA_ERROR_INVALID_VALUE,
CUDA_ERROR_NOT_SUPPORTED, etc.) or success.

Made-with: Cursor
Labels

cuda.core Everything related to the cuda.core module feature New feature or request P0 High priority - Must do!

2 participants