Name	Name	Last commit message	Last commit date
parent directory ..
00_basic_gemm	00_basic_gemm
01_cutlass_utilities	01_cutlass_utilities
02_dump_reg_shmem	02_dump_reg_shmem
03_visualize_layout	03_visualize_layout
04_tile_iterator	04_tile_iterator
05_batched_gemm	05_batched_gemm
06_splitK_gemm	06_splitK_gemm
07_volta_tensorop_gemm	07_volta_tensorop_gemm
08_turing_tensorop_gemm	08_turing_tensorop_gemm
09_turing_tensorop_conv2dfprop	09_turing_tensorop_conv2dfprop
10_planar_complex	10_planar_complex
11_planar_complex_array	11_planar_complex_array
12_gemm_bias_relu	12_gemm_bias_relu
13_two_tensor_op_fusion	13_two_tensor_op_fusion
14_ampere_tf32_tensorop_gemm	14_ampere_tf32_tensorop_gemm
15_ampere_sparse_tensorop_gemm	15_ampere_sparse_tensorop_gemm
16_ampere_tensorop_conv2dfprop	16_ampere_tensorop_conv2dfprop
17_fprop_per_channel_bias	17_fprop_per_channel_bias
18_ampere_fp64_tensorop_affine2_gemm	18_ampere_fp64_tensorop_affine2_gemm
19_tensorop_canonical	19_tensorop_canonical
20_simt_canonical	20_simt_canonical
21_quaternion_gemm	21_quaternion_gemm
22_quaternion_conv	22_quaternion_conv
23_ampere_gemm_operand_reduction_fusion	23_ampere_gemm_operand_reduction_fusion
24_gemm_grouped	24_gemm_grouped
25_ampere_fprop_mainloop_fusion	25_ampere_fprop_mainloop_fusion
26_ampere_wgrad_mainloop_fusion	26_ampere_wgrad_mainloop_fusion
27_ampere_3xtf32_fast_accurate_tensorop_gemm	27_ampere_3xtf32_fast_accurate_tensorop_gemm
28_ampere_3xtf32_fast_accurate_tensorop_fprop	28_ampere_3xtf32_fast_accurate_tensorop_fprop
29_ampere_3xtf32_fast_accurate_tensorop_complex_gemm	29_ampere_3xtf32_fast_accurate_tensorop_complex_gemm
30_wgrad_split_k	30_wgrad_split_k
31_basic_syrk	31_basic_syrk
32_basic_trmm	32_basic_trmm
33_ampere_3xtf32_tensorop_symm	33_ampere_3xtf32_tensorop_symm
34_transposed_conv2d	34_transposed_conv2d
35_gemm_softmax	35_gemm_softmax
36_gather_scatter_fusion	36_gather_scatter_fusion
37_gemm_layernorm_gemm_fusion	37_gemm_layernorm_gemm_fusion
38_syr2k_grouped	38_syr2k_grouped
39_gemm_permute	39_gemm_permute
40_cutlass_py	40_cutlass_py
41_fused_multi_head_attention	41_fused_multi_head_attention
42_ampere_tensorop_group_conv	42_ampere_tensorop_group_conv
43_ell_block_sparse_gemm	43_ell_block_sparse_gemm
44_multi_gemm_ir_and_codegen	44_multi_gemm_ir_and_codegen
45_dual_gemm	45_dual_gemm
46_depthwise_simt_conv2dfprop	46_depthwise_simt_conv2dfprop
47_ampere_gemm_universal_streamk	47_ampere_gemm_universal_streamk
48_hopper_warp_specialized_gemm	48_hopper_warp_specialized_gemm
49_hopper_gemm_with_collective_builder	49_hopper_gemm_with_collective_builder
50_hopper_gemm_with_epilogue_swizzle	50_hopper_gemm_with_epilogue_swizzle
51_hopper_gett	51_hopper_gett
52_hopper_gather_scatter_fusion	52_hopper_gather_scatter_fusion
53_hopper_gemm_permute	53_hopper_gemm_permute
54_hopper_fp8_warp_specialized_gemm	54_hopper_fp8_warp_specialized_gemm
55_hopper_mixed_dtype_gemm	55_hopper_mixed_dtype_gemm
56_hopper_ptr_array_batched_gemm	56_hopper_ptr_array_batched_gemm
57_hopper_grouped_gemm	57_hopper_grouped_gemm
58_ada_fp8_gemm	58_ada_fp8_gemm
59_ampere_gather_scatter_conv	59_ampere_gather_scatter_conv
60_cutlass_import	60_cutlass_import
61_hopper_gemm_with_topk_and_softmax	61_hopper_gemm_with_topk_and_softmax
62_hopper_sparse_gemm	62_hopper_sparse_gemm
63_hopper_gemm_with_weight_prefetch	63_hopper_gemm_with_weight_prefetch
64_ada_fp8_gemm_grouped	64_ada_fp8_gemm_grouped
65_distributed_gemm	65_distributed_gemm
67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling	67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling
68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling	68_hopper_fp8_warp_specialized_grouped_gemm_with_blockwise_scaling
69_hopper_mixed_dtype_grouped_gemm	69_hopper_mixed_dtype_grouped_gemm
70_blackwell_gemm	70_blackwell_gemm
71_blackwell_gemm_with_collective_builder	71_blackwell_gemm_with_collective_builder
72_blackwell_narrow_precision_gemm	72_blackwell_narrow_precision_gemm
73_blackwell_gemm_preferred_cluster	73_blackwell_gemm_preferred_cluster
74_blackwell_gemm_streamk	74_blackwell_gemm_streamk
75_blackwell_grouped_gemm	75_blackwell_grouped_gemm
76_blackwell_conv	76_blackwell_conv
77_blackwell_fmha	77_blackwell_fmha
78_blackwell_emulated_bf16x9_gemm	78_blackwell_emulated_bf16x9_gemm
79_blackwell_geforce_gemm	79_blackwell_geforce_gemm
81_blackwell_gemm_blockwise	81_blackwell_gemm_blockwise
common	common
cute	cute
python	python
CMakeLists.txt	CMakeLists.txt
README.md	README.md

CUTLASS - Programming Examples

00_basic_gemm

launches a basic GEMM with single precision inputs and outputs
01_cutlass_utilities

demonstrates CUTLASS Utilities for allocating and initializing tensors
02_dump_reg_smem

debugging utilities for printing register and shared memory contents
03_visualize_layout

utility for visualizing all layout functions in CUTLASS
04_tile_iterator

example demonstrating an iterator over tiles in memory
05_batched_gemm

example demonstrating CUTLASS's batched strided GEMM operation
06_splitK_gemm

example demonstrating CUTLASS's Split-K parallel reduction kernel
07_volta_tensorop_gemm

example demonstrating mixed precision GEMM using Volta Tensor Cores
08_turing_tensorop_gemm

example demonstrating integer GEMM using Turing Tensor Cores
09_turing_tensorop_conv2dfprop

example demonstrating integer implicit GEMM convolution (forward propagation) using Turing Tensor Cores
10_planar_complex

example demonstrating planar complex GEMM kernels
11_planar_complex_array

example demonstrating planar complex kernels with batch-specific problem sizes
12_gemm_bias_relu

example demonstrating GEMM fused with bias and relu
13_two_tensor_op_fusion

example demonstrating two GEMMs or convolutions fused in one kernel
14_ampere_tf32_tensorop_gemm

example demonstrating FP32 GEMM with implicit TF32 conversion
15_ampere_sparse_tensorop_gemm

example demonstrating usage of Sparse Tensor cores
16_ampere_tensorop_conv2dfprop

example demonstrating forward convolution on tensors of layout NHWC
17_fprop_per_channel_bias

example demonstrating convolution fused with per channel bias and relu
18_ampere_fp64_tensorop_affine2_gemm

example demonstrating Affine-2 GEMM
19_tensorop_canonical

Canonical GEMM using tensor cores
20_simt_canonical

Canonical GEMM using SIMT
21_quaternion_gemm

example demonstrating Quaternion GEMM computations
22_quaternion conv

example demonstrating Quaternion convolution
23_ampere_gemm_operand_reduction_fusion

example demonstrating how to reduce one of the operands of the GEMM along the k-dimension when computing GEMM
24_gemm_grouped

example demonstrating batch of GEMM operations with distinct problem sizes
25_ampere_fprop_mainloop_fusion

example demonstrating fusing activation's per channel scale+bias+relu into the fgrad mainloop
26_ampere_wgrad_mainloop_fusion

example demonstrating fusing activation's per channel scale+bias+relu into the wgrad mainloop
27_ampere_3xtf32_fast_accurate_tensorop_gemm

example demonstrating emulation of a fast accurate SGEMM with TF32 operations
28_ampere_3xtf32_fast_accurate_tensorop_fprop

example demonstrating emulation of a fast accurate FP32 convolution with TF32 operation
29_ampere_3xtf32_fast_accurate_tensorop_complex_gemm

example demonstrating emulation of a fast accurate CGEMM with TF32 operation
30_wgrad_split_k

example demonstrating how to compute conv2d gradient with respect to weight (wgrad) together with split-K
31_basic_syrk

example demonstrating Symmetric Rank-K update
32_basic_trmm

example demonstrating Triangular Matrix-Matrix multiplication
33_ampere_3xtf32_tensorop_symm

example demonstrating Symmetric Matrix-Matrix multiplication with FP32 emulation
34_transposed_conv2d

example demonstrating how to compute 2d transposed convolution, also known as deconvolution, using CUTLASS conv2d Dgrad kernels
35_gemm_softmax

example demonstrating GEMM fused with Softmax in mixed precision using Ampere Tensor Cores
36_gather_scatter_fusion

example demonstrating fuses gather before GEMM and scatter after GEMM into the same GEMM kernel
37_gemm_layernorm_gemm_fusion

example demonstrating fuses gemm->layernorm->gemm into one kernel.
38_syr2k_grouped

example demonstrating a batch of SYR2K operations with distinct problem sizes
39_gemm_permute

example demonstrating batched GEMM operations with output results permuted as reshaped tensors
40_cutlass_py

example demonstrating CUTLASS with Python interface
41_multi_head_attention

example demonstrating attention example with non-fixed sequence length input
42_ampere_tensorop_group_conv

example demonstrating how to run group convolution kernels using functions and data structures provided by CUTLASS using tensor cores
43_ell_block_sparse_gemm

example demonstrating a Block-Ell sparse gemm
44_fused_multi_head_attention

example demonstrating fused multihead attention (fixed & variable) using shared memory
45_dual_gemm

example demonstrating how to fuse two GEMMs sharing the same left input matrix into one kernel
46_depthwise_simt_conv2dfprop

example demonstrating depthwise 2d convolution kernels using functions and data structures provided by CUTLASS using SIMT instruction
47_ampere_gemm_universal_streamk

example contrasting the Stream-K parallel decomposition for GEMM threadblocks versus the "classic data-parallel" and "Split-K" decompositions.
48_hopper_warp_specialized_gemm

Simple tensorop GEMM example using CUTLASS 3.0 APIs targeting NVIDIA Hopper architecture
49_hopper_gemm_schedules_with_collective_builder

Hopper GEMM example leveraging collective operation builders to showcase the builder API and the various kernel scheduled supported in CUTLASS 3.0 such as warp specialized persistent mainloops.
50_hopper_gemm_with_epilogue_swizzle

Hopper GEMM example to create a GEMM kernel with custom a collective mainloop and a custom vectorized epilogue.
51_hopper_gett

Hopper GETT example illustrating the ease with which GETTs can be run due to CUTLASS 3.0's unified micro-kernels and CuTe's hierarchical layouts.
52_hopper_gather_scatter_fusion

Hopper example that fuses gather before GEMM and scatter after GEMM into the same kernel
53_hopper_gemm_permute

Hopper example demonstrating the fusion of tensor permutation operations with a GEMM kernel
54_hopper_fp8_warp_specialized_gemm

Hopper example of instantiating and running an FP8 GEMM kernel
55_hopper_mixed_dtype_gemm

Hopper GEMM example with different A and B data types using CUTLASS 3.x APIs for DL kernels with fused dequantization.
56_hopper_ptr_array_batched_gemm

Hopper Ptr-Array Batched GEMM example using CUTLASS 3.x API.
57_hopper_grouped_gemm

Hopper Grouped GEMM using CUTLASS 3.x API.
58_ada_fp8_gemm

Ada GEMM kernel targetting Ada FP8 tensor cores via the CUTLASS 2.x API.
59_ampere_gather_scatter_conv

CuTe and CUTLASS 3.x based Ampere convolution fprop kernel capable of operating on both affine and gather/scatter tensors, showing how kernel authors can re-use CUTLASS 3.x collectives in their custom kernels.
61_hopper_gemm_with_topk_and_softmax

Hopper GEMM kernel with Top-K and softmax epilogue fusion.
70_blackwell_gemm

Simple dense GEMM example targeting the NVIDIA Blackwell SM100 Tensor Core MMA using CUTLASS 3.x APIs.
71_blackwell_gemm_with_collective_builder

Blackwell SM100 GEMM example demonstrating compatible mainloop+epilogue builder schedules and epilogue visitor tree (EVT) construction
72_blackwell_narrow_precision_gemm

Block-scaled dense GEMM example targeting the NVIDIA Blackwell SM100 Tensor Core MMA using CUTLASS 3.x APIs.
73_blackwell_gemm_preferred_cluster

Blackwell SM100 GEMM kernel with preferred cluster feature.
74_blackwell_gemm_streamk

Blackwell SM100 GEMM kernel using the Stream-K scheduler
75_blackwell_grouped_gemm

Blackwell SM100 grouped GEMM kernel
76_blackwell_conv

Simple convolution(fprop/dgrad/wgrad) example targeting NVIDIA Blackwell SM100 Tensor Core MMA using CUTLASS 3.x APIs.
77_blackwell_fmha

Blackwell SM100 FMHA kernel
78_blackwell_emulated_bf16x9_gemm

Blackwell SM100 FastFP32 (using BF16 to emulate SGEMM) kernel
79_blackwell_geforce_gemm

Blackwell SM120 MMA kernel targeting GeForce RTX 50 series CUDA Cores

CuTe - Programming Examples

Examples that do not rely on CUTLASS and directly showcase the features of CuTe are located in cutlass/examples/cute.

Additionally, CuTe's core layout and layout algebra have their own test cases within cutlass/test/unit/cute/core/ that users might find useful as examples of CuTe.

Python Interface Examples

Examples leveraging CUTLASS's Python interface are located in cutlass/examples/python.

Copyright

  Redistribution and use in source and binary forms, with or without
  modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

  3. Neither the name of the copyright holder nor the names of its
  contributors may be used to endorse or promote products derived from
  this software without specific prior written permission.

  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
  AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
  DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
  FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
  SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
  CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
  OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

examples

examples

README.md

CUTLASS - Programming Examples

CuTe - Programming Examples

Python Interface Examples

Copyright

Files

examples

Directory actions

More options

Directory actions

More options

Latest commit

History

examples

Folders and files

parent directory

README.md

CUTLASS - Programming Examples

CuTe - Programming Examples

Python Interface Examples

Copyright