Enhance Graph.update() and add whole-graph update tests#1843
Enhance Graph.update() and add whole-graph update tests#1843Andy-Jost wants to merge 8 commits intoNVIDIA:mainfrom
Conversation
Rename test files to reflect what they actually test: - test_basic -> test_graph_builder (stream capture tests) - test_conditional -> test_graph_builder_conditional - test_advanced -> test_graph_update (moved child_graph and stream_lifetime tests into test_graph_builder) - test_capture_alloc -> test_graph_memory_resource - test_explicit* -> test_graphdef* Made-with: Cursor
- Extend Graph.update() to accept both GraphBuilder and GraphDef sources - Surface CUgraphExecUpdateResultInfo details on update failure instead of a generic CUDA_ERROR_GRAPH_EXEC_UPDATE_FAILURE message - Release the GIL during cuGraphExecUpdate via nogil block - Add parametrized happy-path test covering both GraphBuilder and GraphDef - Add error-case tests: unfinished builder, topology mismatch, wrong type Made-with: Cursor
|
- Chain GraphDef kernel launches sequentially (n.launch instead of g.launch) to avoid concurrent writes to the same memory location - Update GraphDef.handle and GraphNode.handle annotations to reflect that as_py returns driver types (CUgraph, CUgraphNode), not int Made-with: Cursor
The monolithic _graphdef.pyx (2000+ lines) is split into three focused modules under _graph_def/: _graph_def.pyx (Condition, GraphAllocOptions, GraphDef), _graph_node.pyx (GraphNode base class and builder methods), and _subclasses.pyx (all concrete node subclasses). Long method bodies in GraphNode are factored into cdef inline GN_* helpers following existing codebase conventions. Handle property annotations updated to use driver.* types consistently. Made-with: Cursor
|
|
Update two references that still used _graphdef instead of _graph_def after the subpackage split. Made-with: Cursor
rwgk
left a comment
There was a problem hiding this comment.
I'm assuming we don't have to worry about the import/cimport export surface, is that a valid assumption?
| cdef cydriver.CUresult err | ||
| with nogil: | ||
| err = cydriver.cuGraphExecUpdate(cu_exec, cu_graph, &result_info) | ||
| if err != cydriver.CUresult.CUDA_SUCCESS: |
There was a problem hiding this comment.
I used Cursor GPT-5.4 1M High to "comb" through this "very complex and very large" PR. It only found this one "High" item:
I think this would be a bit safer if it distinguished the graph-update failure case from ordinary driver errors, e.g.
cdef cydriver.CUgraphExecUpdateResultInfo result_info
cdef cydriver.CUresult err
with nogil:
err = cydriver.cuGraphExecUpdate(cu_exec, cu_graph, &result_info)
if err == cydriver.CUresult.CUDA_SUCCESS:
return
if err == cydriver.CUresult.CUDA_ERROR_GRAPH_EXEC_UPDATE_FAILURE:
reason = driver.CUgraphExecUpdateResult(result_info.result)
msg = f"Graph update failed: {reason.__doc__.strip()} ({reason.name})"
raise CUDAError(msg)
raise CUDAError(err)Rationale:
- Using
cydriver.cuGraphExecUpdate(...)directly here makes sense, since the higher-level binding dropsresultInfoon non-success and would lose the detailed update reason entirely. - But
resultInfoappears to be the structured explanation for the specificCUDA_ERROR_GRAPH_EXEC_UPDATE_FAILUREpath, not necessarily for every possible non-successCUresult. - Even when
result_info.result == CU_GRAPH_EXEC_UPDATE_ERROR, the enum docs say the actual explanation is described by the function return value. The current code discardserr, so it may collapse distinct driver failures into the same genericresultInfo-based message. - This shape preserves the nice detailed message for graph-update incompatibilities while still surfacing ordinary driver errors accurately.
There was a problem hiding this comment.
According to the docs, cuGraphExecUpdate only returns CUDA_SUCCESS or CUDA_ERROR_GRAPH_EXEC_UPDATE_FAILURE.
There was a problem hiding this comment.
"very complex and very large"
To be clear, nearly all of this change is refactoring and code movement. The graph tests were regrouped slightly and renamed. The huge _graphdef module was split into three parts. The are not many functional changes here.
There was a problem hiding this comment.
According to the docs,
cuGraphExecUpdateonly returnsCUDA_SUCCESSorCUDA_ERROR_GRAPH_EXEC_UPDATE_FAILURE.
Documentation tends to be imprecise, or become imprecise over time without anyone noticing.
The suggested change improves the quality of implementation at a very small cost.
There was a problem hiding this comment.
I checked the driver code and the docs are indeed incorrect.
This is correct. We do not make any guarantees whatsoever about Cython interface stability. The public Python API consists of what we expose at |
Made-with: Cursor
Check for CUDA_ERROR_GRAPH_EXEC_UPDATE_FAILURE first to provide the rich error message with the update result reason, then fall through to HANDLE_RETURN for any other error code (CUDA_ERROR_INVALID_VALUE, CUDA_ERROR_NOT_SUPPORTED, etc.) or success. Made-with: Cursor
Extend tests of the exsiting
Graph.updatefunction and refactor existing graph code in preparation for further work.Summary
Graph.update()to accept bothGraphBuilderandGraphDefas sources, giving users flexibility to update instantiated graphs from either the stream-capture or explicit-graph APICUgraphExecUpdateResultInfoon update failure (reason enum + docstring) instead of a genericCUDA_ERROR_GRAPH_EXEC_UPDATE_FAILURE_graphdef.pyx(2000+ lines) into a_graph_def/subpackage with three focused modules for maintainabilityChanges
cuda/core/_graph/_graph_builder.pyx: RefactoredGraph.update()to dispatch onGraphBuildervsGraphDef, callcuGraphExecUpdatewith aCUgraphExecUpdateResultInfostruct, and raise a descriptiveCUDAErroron failurecuda/core/_graph/_graph_def/: Split_graphdef.pyxinto_graph_def.pyx(Condition, GraphAllocOptions, GraphDef),_graph_node.pyx(GraphNode base class and builder methods withGN_*inline helpers), and_subclasses.pyx(all concrete node subclasses). Handle property annotations updated to usedriver.*types consistently.tests/graph/: Renamed test files to reflect their scope (test_graph_builder.py,test_graph_builder_conditional.py,test_graph_memory_resource.py,test_graph_update.py,test_graphdef*.py,test_device_launch.py); added module docstrings; moved tests to appropriate filestests/graph/test_graph_update.py: Added parametrizedtest_graph_update_kernel_args(GraphBuilder + GraphDef),test_graph_update_conditional,test_graph_update_unfinished_builder,test_graph_update_topology_mismatch,test_graph_update_wrong_typeTest Coverage
GraphBuilderandGraphDefValueErrorwhen sourceGraphBuilderhasn't finished capturingCUDAErrorwith descriptive reason fromCUgraphExecUpdateResultInfoTypeErrorfor invalid argument typesRelated Work