-# CUTLASS 3.6.0
+# CUTLASS 3.7.0

-_CUTLASS 3.6.0 - October 2024_
+_CUTLASS 3.7.0 - January 2025_

CUTLASS is a collection of CUDA C++ template abstractions for implementing
high-performance matrix-matrix multiplication (GEMM) and related computations at all levels
@@ -41,27 +41,14 @@ and improves code composability and readability. More documentation specific to

In addition to GEMMs, CUTLASS implements high-performance convolution via the implicit GEMM algorithm. Implicit GEMM is the formulation of a convolution operation as a GEMM thereby taking advantage of CUTLASS's modular GEMM pipeline. This allows CUTLASS to build convolutions by reusing highly-optimized GEMM components.

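To make that formulation concrete, below is a minimal, CPU-only sketch of the implicit GEMM idea. It is illustrative only and is not CUTLASS code: the function name, the NHWC/KRSC layouts, and the stride-1, no-padding assumptions are chosen for the example. Each output element of the convolution is the dot product of one row of an implicit "im2col" matrix (gathered on the fly, never materialized) with one column of the filter matrix.

```cpp
#include <vector>

// Output(n, p, q, k) = sum over (r, s, c) of Input(n, p + r, q + s, c) * Filter(k, r, s, c).
// Viewed as a GEMM: M = N*P*Q rows, "GEMM N" = K columns, "GEMM K" = R*S*C inner dimension.
void conv2d_as_implicit_gemm(
    const std::vector<float>& input,   // N x H x W x C (NHWC)
    const std::vector<float>& filter,  // K x R x S x C (KRSC)
    std::vector<float>& output,        // N x P x Q x K (NPQK)
    int N, int H, int W, int C, int K, int R, int S) {
  int P = H - R + 1, Q = W - S + 1;                      // stride 1, no padding
  for (int row = 0; row < N * P * Q; ++row) {            // GEMM M dimension
    int n = row / (P * Q), p = (row / Q) % P, q = row % Q;
    for (int col = 0; col < K; ++col) {                  // GEMM N dimension
      float acc = 0.f;
      for (int inner = 0; inner < R * S * C; ++inner) {  // GEMM K dimension
        int r = inner / (S * C), s = (inner / C) % S, c = inner % C;
        // The im2col "gather": the A-operand element is read directly from the
        // input tensor by computing its index on the fly, never stored.
        acc += input[((n * H + (p + r)) * W + (q + s)) * C + c] *
               filter[((col * R + r) * S + s) * C + c];
      }
      output[row * K + col] = acc;
    }
  }
}
```

CUTLASS performs this same index mapping inside its optimized, tiled GEMM mainloop on the GPU rather than in a naive triple loop, which is what allows convolutions to reuse the GEMM pipeline.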
-# What's New in CUTLASS 3.6
+# What's New in CUTLASS 3.7

-CUTLASS 3.6.0 is an update to CUTLASS adding:
+CUTLASS 3.7.0 is an update to CUTLASS adding:

-- [Hopper structured sparse GEMM](./examples/62_hopper_sparse_gemm/62_hopper_sparse_gemm.cu).
-  + [FP16](./test/unit/gemm/device/sm90_sparse_gemm_f16_f16_f32_tensor_op_f32.cu)
-  + [FP8](./test/unit/gemm/device/sm90_sparse_gemm_f8_f8_f32_tensor_op_f32.cu)
-  + [INT8](./test/unit/gemm/device/sm90_sparse_gemm_s8_s8_s32_tensor_op_s32.cu)
-  + [TF32](./test/unit/gemm/device/sm90_sparse_gemm_tf32_tf32_f32_tensor_op_f32.cu)
-- A refactor to the CUTLASS 3.x convolution `kernel::ConvUniversal` [API](./include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp) to bring it in line with `gemm::GemmUniversal`. Now the 3.x convolution API is no longer considered as a beta API.
-- [An improved mixed input GEMM](./examples/55_hopper_mixed_dtype_gemm/README.md) and a [lookup table implementation](./examples/55_hopper_mixed_dtype_gemm/55_hopper_int4_fp8_gemm.cu) for `INT4`x`FP8` scale-only mode.
-- [EVT nodes for Top-K selection and softmax](./include/cutlass/epilogue/fusion/sm90_visitor_topk_softmax.hpp) and [GEMM example using those](./examples/61_hopper_gemm_with_topk_and_softmax/61_hopper_gemm_with_topk_and_softmax.cu).
-- [Programmatic Dependent Launch](./include/cutlass/arch/grid_dependency_control.h) (PDL) that leverages a new Hopper feature to speedup two back-to-back kernels, and its corresponding [documentations](./media/docs/dependent_kernel_launch.md).
-- [A new debugging tool, synclog](./include/cutlass/arch/synclog.hpp), for dumping out all synchronization events from within a kernel to a file. Please see [synclog documentation](./media/docs/utilities.md#debugging-asynchronous-kernels-with-cutlasss-built-in-synclog-tool) for details.
-- A new TMA-enabled [epilogue](./include/cutlass/epilogue/collective/sm90_epilogue_array_tma_warpspecialized.hpp) for grouped GEMM that brings significant performance improvement, as well as its EVT support.
-- A SIMT-enabled pointer-array [epilogue](./include/cutlass/epilogue/collective/sm70_epilogue_vectorized_array.hpp).
-- A new [Ping-Pong kernel schedule for Grouped GEMM](./include/cutlass/gemm/kernel/sm90_gemm_array_tma_warpspecialized_pingpong.hpp) and some other optimizations.
-- [A new instantiation strategy for CUTLASS profiler kernels](./python/cutlass_library/sm90_shapes.py) along with [improved documentation for instantiation level in CUTLASS profiler](./media/docs/profiler.md#instantiating-more-kernels-with-hopper).
-- A new hardware support for comparisons and computations of [`cutlass::bfloat16_t`](./include/cutlass/bfloat16.h)
-- Fixed use of isnan on Windows for [`half_t`](./test/unit/core/functional.cu).
+- A new [Hopper blockwise scaling FP8 GEMM](./examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling.cu) where the operands and block scaling tensor are staged via shared memory.
+- [Distributed GEMM](./examples/65_distributed_gemm/65_distributed_gemm.cu) is an experimental pipelined Tensor Parallelism implementation utilizing existing CUTLASS kernels and CUDA runtime features, which can hide most of the communication behind computation (a generic sketch of this overlap idea follows the list).
+- Improved persistent grid launch for Hopper kernels with large cluster sizes (>= 4) using the new `make_kernel_hardware_info` API, as shown in [example 48](./examples/48_hopper_warp_specialized_gemm/48_hopper_warp_specialized_gemm.cu).
+- Enabled high precision accumulation for Hopper FP8 Sparse GEMM.
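The communication/computation overlap claimed for Distributed GEMM can be illustrated with a generic CUDA sketch. This is not the Distributed GEMM implementation: `gemm_on_tile`, the launch configuration, and the two-GPU setup are placeholder assumptions. The point is only the scheduling principle, in which a stage's peer-to-peer tile transfer runs on a copy stream while the compute stream multiplies the tile that has already arrived.

```cpp
#include <cuda_runtime.h>

// Placeholder kernel standing in for a CUTLASS GEMM on one operand tile.
__global__ void gemm_on_tile(const float* a_tile, const float* b, float* c, int n) {
  // ... actual tile GEMM elided in this sketch ...
}

// One pipeline stage: pull the next A tile from a peer GPU on a copy stream
// while the compute stream works on the tile that is already resident.
void pipelined_stage(const float* local_tile, float* recv_tile,
                     const float* peer_tile, int peer_device,
                     const float* b, float* c, int n, size_t tile_bytes) {
  cudaStream_t compute, copy;
  cudaStreamCreate(&compute);
  cudaStreamCreate(&copy);
  cudaEvent_t tile_ready;
  cudaEventCreate(&tile_ready);

  // Communication: asynchronous peer-to-peer copy of the next tile.
  cudaMemcpyPeerAsync(recv_tile, /*dstDevice=*/0,
                      peer_tile, peer_device, tile_bytes, copy);
  cudaEventRecord(tile_ready, copy);

  // Computation: GEMM on the tile we already have, overlapping the copy above.
  gemm_on_tile<<<108, 256, 0, compute>>>(local_tile, b, c, n);

  // The next GEMM waits only for its own incoming tile, not for all traffic.
  cudaStreamWaitEvent(compute, tile_ready, 0);
  gemm_on_tile<<<108, 256, 0, compute>>>(recv_tile, b, c, n);

  cudaStreamSynchronize(compute);
  cudaEventDestroy(tile_ready);
  cudaStreamDestroy(copy);
  cudaStreamDestroy(compute);
}
```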

Minimum requirements:

@@ -540,7 +527,7 @@ The official list of CUTLASS developers and contributors is available here: [CON

# Copyright

-Copyright (c) 2017 - 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+Copyright (c) 2017 - 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: BSD-3-Clause

```