Commit cc3c29a

yzhaiustc, yuzhai, and hwu36 authored
CUTLASS 3.6.0 (#1850)
* v3.6
* update changelog
* update readme
* fix typo
* fixing typos
* hopper gemm with weight prefetch

---------

Co-authored-by: yuzhai <[email protected]>
Co-authored-by: Haicheng Wu <[email protected]>
1 parent 0837a2a commit cc3c29a

354 files changed, +105914 -8174 lines changed

CHANGELOG.md

+20
@@ -1,5 +1,25 @@
 # NVIDIA CUTLASS Changelog

+## [3.6.0](https://github.com/NVIDIA/cutlass/releases/tag/v3.6.0) (2024-10-03)
+
+- [Hopper structured sparse GEMM](./examples/62_hopper_sparse_gemm/62_hopper_sparse_gemm.cu).
+  + [FP16](./test/unit/gemm/device/sm90_sparse_gemm_f16_f16_f32_tensor_op_f32.cu)
+  + [FP8](./test/unit/gemm/device/sm90_sparse_gemm_f8_f8_f32_tensor_op_f32.cu)
+  + [INT8](./test/unit/gemm/device/sm90_sparse_gemm_s8_s8_s32_tensor_op_s32.cu)
+  + [TF32](./test/unit/gemm/device/sm90_sparse_gemm_tf32_tf32_f32_tensor_op_f32.cu)
+- A refactor of the CUTLASS 3.x convolution `kernel::ConvUniversal` [API](./include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp) to bring it in line with `gemm::GemmUniversal`. The 3.x convolution API is no longer considered a beta API.
+- [An improved mixed input GEMM](./examples/55_hopper_mixed_dtype_gemm/README.md) and a [lookup table implementation](./examples/55_hopper_mixed_dtype_gemm/55_hopper_int4_fp8_gemm.cu) for `INT4`x`FP8` scale-only mode.
+- [EVT nodes for Top-K selection and softmax](./include/cutlass/epilogue/fusion/sm90_visitor_topk_softmax.hpp) and a [GEMM example using those](./examples/61_hopper_gemm_with_topk_and_softmax/61_hopper_gemm_with_topk_and_softmax.cu).
+- [Programmatic Dependent Launch](./include/cutlass/arch/grid_dependency_control.h) (PDL), which leverages a new Hopper feature to speed up two back-to-back kernels, and its corresponding [documentation](./media/docs/dependent_kernel_launch.md).
+- [A new debugging tool, synclog](./include/cutlass/arch/synclog.hpp), for dumping all synchronization events from within a kernel to a file. Please see the [synclog documentation](./media/docs/utilities.md#debugging-asynchronous-kernels-with-cutlasss-built-in-synclog-tool) for details.
+- A new TMA-enabled [epilogue](./include/cutlass/epilogue/collective/sm90_epilogue_array_tma_warpspecialized.hpp) for grouped GEMM that brings a significant performance improvement, as well as its EVT support.
+- A SIMT-enabled pointer-array [epilogue](./include/cutlass/epilogue/collective/sm70_epilogue_vectorized_array.hpp).
+- A new [Ping-Pong kernel schedule for Grouped GEMM](./include/cutlass/gemm/kernel/sm90_gemm_array_tma_warpspecialized_pingpong.hpp) and some other optimizations.
+- [A new instantiation strategy for CUTLASS profiler kernels](./python/cutlass_library/sm90_shapes.py) along with [improved documentation for instantiation level in CUTLASS profiler](./media/docs/profiler.md#instantiating-more-kernels-with-hopper).
+- New hardware support for comparisons and computations of [`cutlass::bfloat16_t`](./include/cutlass/bfloat16.h).
+- Fixed use of isnan on Windows for [`half_t`](./test/unit/core/functional.cu).
+Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs!
+
 ## [3.5.1](https://github.com/NVIDIA/cutlass/releases/tag/v3.5.1) (2024-07-25)

 - [Minimal SM90 WGMMA + TMA GEMM example in 100 lines of code](./examples/cute/tutorial/wgmma_sm90.cu)

CMakeLists.txt

+42 -24
@@ -134,7 +134,6 @@ set(CUTLASS_ENABLE_PERFORMANCE ${CUTLASS_ENABLE_PROFILER} CACHE BOOL "Enable CUT
 set(CUTLASS_ENABLE_TESTS ${CUTLASS_ENABLE_TESTS_INIT} CACHE BOOL "Enable CUTLASS Tests")
 set(CUTLASS_ENABLE_GTEST_UNIT_TESTS ${CUTLASS_ENABLE_TESTS} CACHE BOOL "Enable CUTLASS GTest-based Unit Tests")
 set(CUTLASS_USE_SYSTEM_GOOGLETEST OFF CACHE BOOL "Use system/external installation of GTest")
-
 set(CUTLASS_USE_PACKED_TUPLE ON CACHE BOOL "If ON, make cute::tuple be new standard-layout tuple type; if OFF, use the original cute::tuple implementation that is _not_ standard-layout.")
 if (CUTLASS_USE_PACKED_TUPLE)
   list(APPEND CUTLASS_CUDA_NVCC_FLAGS -DCUTE_USE_PACKED_TUPLE=1)
@@ -234,7 +233,6 @@ set(CUTLASS_NVCC_EMBED_CUBIN ON CACHE BOOL "Embed compiled CUDA kernel binaries
 set(CUTLASS_NVCC_EMBED_PTX ON CACHE BOOL "Embed compiled PTX into executables.")
 set(CUTLASS_NVCC_KEEP OFF CACHE BOOL "Keep intermediate files generated by NVCC.")
 set(CUTLASS_ENABLE_F16C OFF CACHE BOOL "Enable F16C x86 extensions in host code.")
-set(CUTLASS_ENABLE_DIRECT_CUDA_DRIVER_CALL OFF CACHE BOOL "Enable CUTLASS to directly call driver API.")

 ################################################################################
 #
@@ -271,6 +269,7 @@ set(CUTLASS_LIBRARY_OPERATIONS "all" CACHE STRING "Comma-delimited list of opera
 set(CUTLASS_LIBRARY_KERNELS ${CUTLASS_LIBRARY_KERNELS_INIT} CACHE STRING "Comma-delimited list of kernel name filters. If unspecified, only the largest tile size is enabled. If the string 'all' is specified, all kernels are enabled.")
 set(CUTLASS_LIBRARY_IGNORE_KERNELS "" CACHE STRING "Comma-delimited list of kernels to exclude from build. This option ONLY takes effect if CUTLASS_LIBRARY_KERNELS is set.")
 set(CUTLASS_LIBRARY_EXCLUDE_KERNELS "" CACHE STRING "Comma-delimited list of kernels to exclude from build. This option always takes effect, whether or not CUTLASS_LIBRARY_KERNELS is set. It also can exclude kernels from the filter file (see KERNEL_FILTER_FILE).")
+set(CUTLASS_LIBRARY_INSTANTIATION_LEVEL "" CACHE STRING "Instantiation level for SM90 kernels. Set to `max` and make sure CUTLASS_LIBRARY_KERNELS is non-empty to stamp all possible kernel configurations.")

 ################################################################################

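Per its help string, the new `CUTLASS_LIBRARY_INSTANTIATION_LEVEL` option only has an effect when `CUTLASS_LIBRARY_KERNELS` is non-empty. A minimal sketch of an initial-cache script that sets both follows. The file name `sm90_max.cmake` and the `90a` value for `CUTLASS_NVCC_ARCHS` are assumptions chosen for illustration and are not part of this commit.

```cmake
# sm90_max.cmake (hypothetical file name): an initial-cache script,
# loaded with `cmake -C sm90_max.cmake <path-to-cutlass>`.

# Assumption: build only for Hopper. Adjust to the GPUs you target.
set(CUTLASS_NVCC_ARCHS "90a" CACHE STRING "CUDA architectures to build for")

# Must be non-empty for the instantiation level to take effect;
# 'all' enables every kernel, according to the option's help text.
set(CUTLASS_LIBRARY_KERNELS "all" CACHE STRING "Kernel name filters")

# 'max' stamps all possible SM90 kernel configurations (per the help string).
set(CUTLASS_LIBRARY_INSTANTIATION_LEVEL "max" CACHE STRING "SM90 instantiation level")
```

Because the entries are already in the cache when CUTLASS's own `set(... CACHE ...)` calls run, those calls leave the values untouched. Stamping every configuration is likely to lengthen the build considerably, so a narrower kernel filter may be preferable for day-to-day work.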
@@ -318,6 +317,8 @@ if(CUTLASS_ENABLE_SM90_EXTENDED_MMA_SHAPES)
   list(APPEND CUTLASS_CUDA_NVCC_FLAGS -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
 endif()

+set(CUTLASS_SKIP_REDUCTION_INIT OFF CACHE BOOL "Disable init reduction workspace")
+
 #
 # NOTE: running with asan and CUDA requires the following environment variable:
 #
@@ -345,6 +346,10 @@ if(CUTLASS_NVCC_EMBED_PTX)
   list(APPEND CUTLASS_CUDA_CLANG_FLAGS --cuda-include-ptx=all)
 endif()

+if (CUTLASS_SKIP_REDUCTION_INIT)
+  list(APPEND CUTLASS_CUDA_NVCC_FLAGS -DCUTLASS_SKIP_REDUCTION_INIT=1)
+endif()
+
 if (CUTLASS_ENABLE_TENSOR_CORE_MMA)
   list(APPEND CUTLASS_CUDA_FLAGS -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1)
 endif()
@@ -354,6 +359,18 @@ if (CUTLASS_PROFILER_DISABLE_REFERENCE)
   list(APPEND CUTLASS_CUDA_NVCC_FLAGS -DCUTLASS_PROFILER_DISABLE_REFERENCE=1)
 endif()

+if (CUTLASS_ENABLE_GDC_FOR_SM90)
+  message(STATUS "Grid Dependency Control (GDC) is enabled for SM90 kernels (required for programmatic dependent launches).")
+  list(APPEND CUTLASS_CUDA_NVCC_FLAGS -DCUTLASS_ENABLE_GDC_FOR_SM90=1)
+endif()
+
+set(CUTLASS_ENABLE_SYNCLOG OFF CACHE BOOL "Enable synchronization event logging for race condition debugging. WARNING: This redefines __syncthreads() and __syncwarp() in all downstream code!")
+
+if (CUTLASS_ENABLE_SYNCLOG)
+  set(CMAKE_CUDA_SEPARABLE_COMPILATION ON)
+  string(APPEND CMAKE_CXX_FLAGS " -DCUTLASS_ENABLE_SYNCLOG=1")
+  string(APPEND CMAKE_CUDA_FLAGS " -DCUTLASS_ENABLE_SYNCLOG=1")
+endif()



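The help string for `CUTLASS_ENABLE_SYNCLOG` warns that it redefines `__syncthreads()` and `__syncwarp()` in all downstream code, and the hunk above appends the define to the global `CMAKE_CXX_FLAGS` and `CMAKE_CUDA_FLAGS`. A downstream project that pulls CUTLASS in might want a guard against leaving the option on by accident. The following is a sketch under that assumption; the policy of allowing synclog only in `Debug` builds is illustrative and not something this commit prescribes.

```cmake
# Sketch for a hypothetical downstream CMakeLists.txt, placed after
# CUTLASS has been configured (e.g. via add_subdirectory).
# CUTLASS_ENABLE_SYNCLOG is a cache variable, so it is visible here.

# Note: CMAKE_BUILD_TYPE is only meaningful for single-config generators
# (Makefiles, Ninja); multi-config generators leave it empty.
if (CUTLASS_ENABLE_SYNCLOG AND NOT CMAKE_BUILD_TYPE STREQUAL "Debug")
  message(WARNING
    "CUTLASS_ENABLE_SYNCLOG is ON in a non-Debug build: "
    "__syncthreads() and __syncwarp() are redefined for all CUDA code, "
    "and separable compilation is forced on.")
endif()
```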
@@ -880,12 +897,27 @@ function(cutlass_add_executable_tests NAME TARGET)

   set(TEST_GROUP_NAME ${NAME})

+  # To run the tests from an install package with tests enabled, we need to generate test files
+  # that don't rely on the current directory structure in build.
+
+  set(TEST_NAME c${NAME})
+  set(TEST_GEN_DIR ${CMAKE_CURRENT_BINARY_DIR}/ctest/${TEST_NAME})
+  file(MAKE_DIRECTORY ${TEST_GEN_DIR})
+
+  set(TEST_EXE_PATH $<TARGET_FILE:${TARGET}>)
+  set(TEST_USE_EXTENDED_FORMAT ON)
+  configure_file("${CUTLASS_CTEST_TEMPLATE_FILE}" "${TEST_GEN_DIR}/CTestTestfile.${TEST_NAME}.cmake" @ONLY)
+
+  set(TEST_EXE_PATH $<TARGET_FILE_NAME:${TARGET}>)
+  set(TEST_USE_EXTENDED_FORMAT OFF) # ctest does not support extended add_test format.
+  configure_file("${CUTLASS_CTEST_TEMPLATE_FILE}" "${TEST_GEN_DIR}/CTestTestfile.${TEST_NAME}.install.cmake.in" @ONLY)
+
   foreach(CMD_OPTIONS_VAR IN LISTS __TEST_COMMAND_OPTIONS)

     if (CMD_COUNT GREATER 1)
-      string(TOLOWER "${NAME}_${CMD_OPTIONS_VAR}" TEST_NAME)
+      string(TOLOWER "${NAME}_${CMD_OPTIONS_VAR}" TESTCASE_NAME)
     else()
-      string(TOLOWER "${NAME}" TEST_NAME)
+      string(TOLOWER "${NAME}" TESTCASE_NAME)
     endif()

     # The following rigmarole is needed to deal with spaces and possible quotes in
@@ -899,42 +931,28 @@ function(cutlass_add_executable_tests NAME TARGET)
     separate_arguments(TEST_COMMAND_OPTIONS)

     add_custom_target(
-      ${TEST_NAME}
+      ${TESTCASE_NAME}
       COMMAND
       ${CUTLASS_TEST_EXECUTION_ENVIRONMENT} $<TARGET_FILE:${TARGET}> ${TEST_COMMAND_OPTIONS}
       DEPENDS
       ${TARGET}
     )

     if (CMD_COUNT GREATER 1)
-      add_dependencies(${NAME} ${TEST_NAME})
+      add_dependencies(${NAME} ${TESTCASE_NAME})
     endif()

     foreach(DEPENDEE ${__DEPENDEES})
-      add_dependencies(${DEPENDEE} ${TEST_NAME})
+      add_dependencies(${DEPENDEE} ${TESTCASE_NAME})
     endforeach()

-    set(TEST_NAME c${TEST_NAME})
+    set(TESTCASE_NAME c${TESTCASE_NAME})
     string(CONFIGURE "${_INLINE_PER_TEST_CODE_TEMPLATE}" _TEST_CODE @ONLY)
-    string(APPEND _INLINE_PER_TEST_CODE "${_TEST_CODE}")
+    file(APPEND "${TEST_GEN_DIR}/CTestTestfile.${TEST_NAME}.cmake" "${_TEST_CODE}")
+    file(APPEND "${TEST_GEN_DIR}/CTestTestfile.${TEST_NAME}.install.cmake.in" "${_TEST_CODE}")

   endforeach()

-  # To run the tests from an install package with tests enabled, we need to generate test files
-  # that don't rely on the current directory structure in build.
-
-  set(TEST_NAME c${NAME})
-  set(TEST_GEN_DIR ${CMAKE_CURRENT_BINARY_DIR}/ctest/${TEST_NAME})
-  file(MAKE_DIRECTORY ${TEST_GEN_DIR})
-
-  set(TEST_EXE_PATH $<TARGET_FILE:${TARGET}>)
-  set(TEST_USE_EXTENDED_FORMAT ON)
-  configure_file("${CUTLASS_CTEST_TEMPLATE_FILE}" "${TEST_GEN_DIR}/CTestTestfile.${TEST_NAME}.cmake" @ONLY)
-
-  set(TEST_EXE_PATH $<TARGET_FILE_NAME:${TARGET}>)
-  set(TEST_USE_EXTENDED_FORMAT OFF) # ctest does not support extended add_test format.
-  configure_file("${CUTLASS_CTEST_TEMPLATE_FILE}" "${TEST_GEN_DIR}/CTestTestfile.${TEST_NAME}.install.cmake.in" @ONLY)
-
   # The following line imports the tests for immediate run via `make test`.

   include(${TEST_GEN_DIR}/CTestTestfile.${TEST_NAME}.cmake)
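The two hunks above move the `configure_file` calls ahead of the `foreach` loop and replace the accumulated `_INLINE_PER_TEST_CODE` string with a `file(APPEND ...)` per test case. The per-case name also gets its own variable (`TESTCASE_NAME`), so it no longer overwrites the `TEST_NAME` used in the file paths. A minimal standalone sketch of that write-header-then-append pattern is shown below. The file name, template text, and test names are all hypothetical and much simpler than CUTLASS's real CTest template.

```cmake
# demo.cmake: run with `cmake -P demo.cmake`.
# All names below are illustrative; they are not taken from CUTLASS.
cmake_minimum_required(VERSION 3.19)

set(TEST_NAME "cdemo")                 # group-level name, used in the file path
set(TEST_EXE_PATH "demo_test")         # stand-in for the test executable
set(GEN_FILE "${CMAKE_CURRENT_LIST_DIR}/CTestTestfile.${TEST_NAME}.cmake")

# Step 1: create the file once, before the loop (CUTLASS uses configure_file here).
file(WRITE "${GEN_FILE}" "# Generated test file for ${TEST_NAME}\n")

# Per-test template; @VAR@ placeholders are expanded by string(CONFIGURE ... @ONLY).
set(PER_TEST_TEMPLATE "add_test(\"@TESTCASE_NAME@\" \"@TEST_EXE_PATH@\")\n")

# Step 2: append one entry per test case, using a separate variable for the
# per-case name so TEST_NAME (and therefore GEN_FILE) stays stable.
foreach(SUFFIX IN ITEMS a b)
  string(TOLOWER "${TEST_NAME}_${SUFFIX}" TESTCASE_NAME)
  string(CONFIGURE "${PER_TEST_TEMPLATE}" TEST_CODE @ONLY)
  file(APPEND "${GEN_FILE}" "${TEST_CODE}")
endforeach()
```

Writing each entry straight to the file means nothing has to be carried in a variable until a later `configure_file` call, which is presumably why the generation block had to move above the loop.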

PUBLICATIONS.md

+2
@@ -26,6 +26,8 @@

 - ["Mixed Precision Post Training Quantization of Neural Networks with Sensitivity Guided Search"](https://arxiv.org/abs/2302.01382). Clemens JS Schaefer, Elfie Guo, Caitlin Stanton, Xiaofan Zhang, Tom Jablin, Navid Lambert-Shirzad, Jian Li, Chiachen Chou, Siddharth Joshi, Yu Emma Wang. _arXiv_, February 2023.

+- ["Dynamic N:M Fine-Grained Structured Sparse Attention Mechanism"](https://dl.acm.org/doi/abs/10.1145/3572848.3577500). Zhaodong Chen, Zheng Qu, Yuying Quan, Liu Liu, Yufei Ding, Yuan Xie. _Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming_, February 2023.
+
 - ["Stream-K: Work-centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU"](https://arxiv.org/abs/2301.03598). Muhammad Osama, Duane Merrill, Cris Cecka, Michael Garland, John D. Owens. _arXiv_, January 2023.

 ## 2022
