Releases · NVIDIA/tilus
v0.2.0
Tilus v0.2.0 Release Notes
Tilus v0.2.0 brings Blackwell GPU support, initial Hopper support, and a comprehensive set of tutorials, documentation, and optimizations. This release transforms Tilus from an Ampere-focused DSL into one that targets NVIDIA's latest GPU architectures, with fine-grained control over Tensor Memory, TMA, and Cluster Launch Control.
Highlights
- Blackwell (SM 10.0) support — Write kernels using 5th-gen Tensor Cores (tcgen05), Tensor Memory (TMEM), TMA bulk copies, and 2-CTA Tensor Core mode.
- Hopper (SM 9.0) support — Initial support for WGMMA instructions and warp-specialized GEMM.
- Step-by-step matmul tutorials — Seven tutorials (V0–V6) walk through building a high-performance Blackwell matmul from scratch, covering TMA, software pipelining, warp specialization, tile rasterization (sketched just after this list), CLC persistent kernels, and 2-CTA mode.
- Codegen optimizations — Faster generated code via fast divmod, predicated instruction emission, warp-uniform code generation, and named barriers.
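One technique from the tutorial series, tile rasterization, is easy to show in isolation: instead of walking output tiles row by row, consecutive thread blocks are mapped to a narrow group of tile rows so that neighboring blocks reuse the same slices of A and B from L2. Below is a minimal, library-agnostic sketch of the index mapping; the function name and the `group_m` parameter are illustrative, not Tilus API.

```python
def rasterize(block_id: int, grid_m: int, grid_n: int, group_m: int = 8):
    """Map a linear block id to an output tile (tile_m, tile_n), walking the
    grid in column-major order within groups of `group_m` tile rows."""
    blocks_per_group = group_m * grid_n
    group = block_id // blocks_per_group
    first_m = group * group_m
    rows = min(grid_m - first_m, group_m)   # the last group may have fewer rows
    local = block_id % blocks_per_group
    return first_m + local % rows, local // rows

# Consecutive blocks now cover a tall, narrow region of C, so their A rows
# and B columns stay resident in L2 across neighboring blocks. The mapping
# is a bijection over the grid:
order = [rasterize(b, grid_m=4, grid_n=4, group_m=2) for b in range(16)]
assert sorted(order) == [(m, n) for m in range(4) for n in range(4)]
```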
Blackwell Architecture
Instruction Support
- Tensor Memory (tcgen05): `alloc`, `dealloc`, `relinquish_alloc_permit`, `load`, `store`, `wait`, `copy`, `commit`, and `mma` — including 2-CTA mode (#47, #48, #50, #55, #89)
- TMA (Tensor Memory Accelerator): Asynchronous bulk copy instructions for global↔shared memory transfers (#46)
- Cluster Launch Control (CLC): Cross-CTA scheduling and synchronization primitives (#68)
- mbarrier: Full set of memory barrier instructions with explicit `arrive_and_expect_tx` (#38, #88); a toy model of the expect-tx protocol follows this list
- cp.async.bulk: Bulk async copy with `.read` modifier for `wait_group` (#40, #106)
- Fence instructions: Refactored fence support for proxy async and memory ordering (#110)
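To make the `arrive_and_expect_tx` semantics concrete, here is a toy Python model of a transaction-counting barrier. This models the protocol only; the real mbarrier lives in shared memory, tracks phases, and has its transaction count completed by the copy hardware.

```python
class ToyMBarrier:
    """Toy model of one mbarrier phase with transaction counting."""

    def __init__(self, arrive_count: int):
        self.arrive_count = arrive_count   # threads that must arrive per phase
        self.arrivals = 0
        self.pending_bytes = 0

    def arrive_and_expect_tx(self, nbytes: int) -> None:
        # A thread arrives and registers nbytes of in-flight async copies;
        # the phase cannot complete until those bytes have landed.
        self.arrivals += 1
        self.pending_bytes += nbytes

    def complete_tx(self, nbytes: int) -> None:
        # Performed by the copy engine as each chunk of data is delivered.
        self.pending_bytes -= nbytes

    def phase_complete(self) -> bool:
        return self.arrivals >= self.arrive_count and self.pending_bytes == 0

bar = ToyMBarrier(arrive_count=1)
bar.arrive_and_expect_tx(nbytes=4096)   # issue a 4 KiB async load
assert not bar.phase_complete()         # a consumer would spin-wait here
bar.complete_tx(4096)                   # "hardware" delivers the tile
assert bar.phase_complete()
```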
Layout System
- Tensor Memory layout for TMEM tensors (#80)
- Refactored shared memory layout system with byte-level swizzle support (#85, #109)
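The byte-level swizzle mentioned above is the standard XOR trick: permute small chunks within each shared-memory row so that a walk down a column touches a different bank group in every row. A library-agnostic sketch follows; the 16-byte chunk and 128-byte row sizes are illustrative parameters, not Tilus defaults.

```python
def swizzle_offset(offset: int, chunk: int = 16, row_bytes: int = 128) -> int:
    """XOR-swizzle a byte offset: permute chunk-sized units within each row
    so that accesses down a column hit different banks in every row."""
    chunks_per_row = row_bytes // chunk
    unit, within = divmod(offset, chunk)
    row_id, col_id = divmod(unit, chunks_per_row)
    col_id ^= row_id % chunks_per_row     # the swizzle: XOR column by row bits
    return (row_id * chunks_per_row + col_id) * chunk + within

# Each row is merely permuted, so the mapping is a bijection and no two
# addresses collide after swizzling.
assert sorted(swizzle_offset(o) for o in range(1024)) == list(range(1024))
```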
GEMM Examples (V0–V8)
A progressive series of Blackwell matmul examples demonstrating increasing optimization levels (#58, #59, #64, #66, #75, #81, #90, #91, #95), reorganized for clarity in #128.
Hopper Architecture
- WGMMA instructions for Hopper Tensor Cores (#83)
- Hopper GEMM examples: pipelined matmul and warp-specialized GEMM (#84, #86)
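Warp specialization splits a thread block into producer warps that move data and consumer warps that compute, coupled through a bounded pipeline. The sketch below mimics that structure with Python threads and a two-slot queue; it is purely illustrative, since real kernels use warps, shared-memory stages, and mbarriers.

```python
import queue
import threading

stages: queue.Queue = queue.Queue(maxsize=2)   # ~ a 2-stage shared-memory pipeline

def producer(num_tiles: int) -> None:
    for k in range(num_tiles):
        stages.put(("tile", k))                # ~ async copy into a free stage
    stages.put(None)                           # ~ signal that the K loop is done

def consumer() -> None:
    consumed = 0
    while (item := stages.get()) is not None:  # ~ wait on the stage's barrier
        consumed += 1                          # ~ run WGMMA on the delivered tile
    print(f"consumed {consumed} tiles")

t = threading.Thread(target=producer, args=(8,))
t.start()
consumer()
t.join()
```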
Language & IR
- Thread groups: Support for `thread_group`, `single_thread`, `single_warp`, and `warp_group` in Tilus Script (#41)
- State construct: Persistent state across kernel invocations (#71)
- Cluster dimensions: Specify cluster layout via `cluster_blocks` (#33)
- Tensor indexing/slicing: Support for shared and global tensor indexing (#37)
- `.item()`/`.item_ptr()`: Access scalar values and pointers from tensors (#60)
- Target suffixes: Added `a` (architecture-specific) and `f` (family-portable) target variants (#32)
Codegen & Optimizations
- Fast divmod: Faster integer division and modulo in generated code (#117); the standard multiply-and-shift trick is sketched after this list
- Predicated instruction emission: Reduced warp divergence in generated code (#114)
- Warp-uniform code generation: Uniform execution for `ThreadGroupStmt` (#96)
- Named barriers: Use named barriers for warpgroup sync instead of mbarrier (#113)
- Dead code elimination: New Tilus IR pass (#92)
- Barrier register spill avoidance: Prevent local memory spill for barrier tensors (#93)
- TVM-FFI runtime: Generated libraries now use the TVM-FFI ABI (#53)
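Fast divmod is, by standard practice, the multiply-and-shift trick: when a divisor is fixed at kernel-build time, a division becomes one multiplication plus a shift instead of a slow hardware divide. A self-contained sketch of the classic scheme (illustrative, not lifted from the Tilus codebase):

```python
def precompute_magic(d: int, bits: int = 32) -> tuple[int, int]:
    """Precompute (magic, shift) so that n // d == (n * magic) >> shift
    for every 0 <= n < 2**bits and fixed divisor d > 0."""
    shift = bits + (d - 1).bit_length()   # enough fractional precision
    magic = (1 << shift) // d + 1         # ~ ceil(2**shift / d)
    return magic, shift

def fast_divmod(n: int, d: int, magic: int, shift: int) -> tuple[int, int]:
    q = (n * magic) >> shift              # multiply + shift replaces the divide
    return q, n - q * d                   # remainder falls out of the quotient

magic, shift = precompute_magic(48)       # divisor fixed when the kernel is built
assert all(fast_divmod(n, 48, magic, shift) == divmod(n, 48) for n in range(1 << 16))
```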
Documentation & Tutorials
- Blackwell matmul tutorial series (V0–V6): Step-by-step guides covering the full optimization journey (#123–#131)
- Instruction documentation: Comprehensive API docs for all instruction groups (#122)
- Programming guides: Revised guides for thread groups, autotuning, caching, and targets (#122, #134)
- Interactive register layout demo (#98)
- Multi-version docs with Sphinx (#119)
Infrastructure
- Pre-commit hooks for lint and formatting (#62)
- Docstring lint enforcement (#74)
- Separate CI workflows for tests and docs (#121)
- Python version compatibility tests (#121)
- Nsight Compute report analysis skill (#105)
Bug Fixes
- Fix low-precision pointer assignment transformation (#35)
- Fix tcgen05.cp codegen (#51)
- Fix volatile specifier for MMA instruction (#44)
- Add `cuda_bf16.h` include header (#77)
- Fix semaphore codegen (#112)
New Contributors
- @soodoshll — Hopper WGMMA and GEMM examples (#83, #84, #86)
- @qiching — Fused softmax example (#99)
- @splint-disk-8i — CI improvements and README edits (#97, #102)
- @WilliamZhang20 — Vector addition example (#111)
Full Changelog: v0.1.1...v0.2.0
v0.1.1
This is a small enhancement release of Tilus.
Highlights
- Add more examples: flash attention with KV cache, flash linear attention decode
- Fix a bug when multiple Tilus processes access the dispatch table in the cache
- Add targets `sm_100`, `sm_103`, `sm_110`, `sm_120`, and `sm_121`
What's Changed
- [Docs] Update README.md by @yaoyaoding in #11
- [CI] Use RTX 4090 for docs building by @yaoyaoding in #12
- [Docs] Update README.md by @yaoyaoding in #13
- [Package] Rename to under @NVIDIA organization by @nekomeowww in #15
- [Docs] Update installation guide by @yaoyaoding in #17
- [CI] Fix concurrency issue by @yaoyaoding in #18
- [Docs] Correct gflops to tflops in examples by @YichengDWu in #19
- [Example] Add the attention example with kv-cache by @yaoyaoding in #21
- [Example] Add example for decoding kernel of flash linear attention by @yaoyaoding in #25
- [Example] Add a kernel in the flash linear attention by @yaoyaoding in #26
- [Example] Add the fused kernel for decoding of flash linear attention by @yaoyaoding in #27
- [Tuning] Add lock to cache dir when dumping the tuning result by @yaoyaoding in #28 (see the sketch after this list)
- [Target] Add target properties by @yaoyaoding in #29
- [Bump] Bump version of hidet from 0.6.0 to 0.6.1 by @yaoyaoding in #30
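The cache-locking change (#28) addresses a classic multi-process race: two processes autotuning at once can clobber each other's dispatch-table writes. Here is a generic illustration of the idea using only the standard library; the paths, file names, and JSON format are hypothetical, not Tilus's actual cache layout.

```python
import fcntl
import json
import os

def dump_tuning_result(cache_dir: str, key: str, entry: dict) -> None:
    """Update the on-disk table under an exclusive inter-process lock."""
    os.makedirs(cache_dir, exist_ok=True)
    table_path = os.path.join(cache_dir, "dispatch_table.json")
    with open(os.path.join(cache_dir, ".lock"), "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)   # block until we own the cache dir
        try:
            table = {}
            if os.path.exists(table_path):
                with open(table_path) as f:
                    table = json.load(f)
            table[key] = entry                   # read-modify-write is now atomic
            with open(table_path, "w") as f:
                json.dump(table, f, indent=2)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```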
New Contributors
- @nekomeowww made their first contribution in #15
- @YichengDWu made their first contribution in #19
Full Changelog: v0.1...v0.1.1
v0.1
The initial release of tilus.
What's Changed
- [CI] Add workflow to deploy wheel to pypi (#10)
- [CI] Use deep checkout for diff (#9)
- [Docs] Update copyright and remove some redundant descriptions (#8)
- [CI] Use nvidia github runners for docs building (#7)
- [CI] Update docs and ci runner for format (#6)
- [CI] Fix the permission issue of deploy github pages (#5)
- [CI][Docs] Add workflow to deploy docs (#4)
- [CI] Migrate CI runners (#2)
- [License] Add license header and contribution guide
- [Misc] Add vscode settings
- [Docs] Add documentation for layout system
- [Docs] Add more sections in programming guides
- [Docs] Add the framework of programming guide
- [Docs] Add documentation for the remaining matmuls
- [Docs] Add docs for two matmul examples
- [Docs] Add the documentation for naive matmul
- [Docs] Add initial version of docs
- [Bugfix] Improve the performance of HoistLoopInvariants pass
- [Pass] Add HoistLoopInvariants pass
- [Pass] Add affine to recursive transformation pass
- [Pass] Explicitly list the used hidet passes
- [Bugfix] Fix OOM issue in attention example
- [Example] Optimize the attention operator by splitting the sequence of kv
- [Example] Remove explicit layout in examples
- [Operator] Optimize attention operator
- [Kernel] Optimize attention kernel
- [Tool] Update tilus IRPrinter
- [Script] Support script procedure
- [Operator] Optimize the attention operator by pipelining
- [Operator] Optimize attention operator with software pipelining
- [Bugfix] Fix a bug in attention example
- [Example] Add attention example
- [Package] Update information in `pyproject.toml`
- [Submodule] Remove .gitmodules
- [Feature] Automatic Layout Inference
- [Layout] Remove old layout definition
- [Layout] Use the new layout system in the emitters
- [Layout] Add the unified representation of layout system
- [Bug] Unify the segments of dynamic shape for tuning
- [Layout] Add transpose operation for register layout
- [Cache] Add cache for instantiated_script
- [Enhancement] Add more instructions and functionality
- [Reduce] Support reduce instruction
- [Instruction] Add repeat and repeat_interleave instructions
- [Optimize] Optimize layout for cast kernel
- [Options] Configure hidet option to avoid ftz flag in nvcc
- [CUDA] Avoid any cuda runtime api call during import
- [Fix] Fix a bug in quantization example
- [Sync] Upstream changes to hidet
- [Refactor] Simplify and generalize the dot instruction
- [Example] Add quantized matmul with full range of quantized data types
- [Quantization] Support low-precision data types in cast kernel
- [Tensor] Add tensor class and cast kernel
- [Fix] Fix a bug in mma emitter
- [Pass] Update the bound aware simplification pass
- [Example] Add example matmul-v7 with parallel-k implemented
- [Example] Add matmul-v6 that implements an efficient write back
- [Example] Add matmul-v5 example that implements software pipelining
- [Example] Add matmul-v4 that uses the `copy_async` instruction
- [Instruction][Example] Refactor LoadMatrix instruction
- [Version] Use setuptools_scm to manage the version number
- [IR][Codegen] Add generic load/store instructions for shared tensor
- [Tools] Add `IRVerifier` to verify the integrity and correctness of IR
- [Examples] Add examples to the lint script and CI
- [Tuning] Add support of auto-tune
- [Script] Add `load_shared` and `store_shared` in Tilus Script
- [IR][GlobalTensor] Introduce GlobalTensor in the Tilus IR
- [Matmul] Add simple matmul example
- [Linter] Enable mypy disallow incomplete defs
- [Linter] Enable check-untyped-defs flag of mypy
- [IR] Refactor IR classes to enforce copy-on-write mechanism
- [Script] Add `tilus.script` module
- [Workflow] Refactor hidet installation and wheel building as actions
- [Build] Enable end to end compilation and build
- [IR] Add Program IR node to hold multiple functions
- [IR] Use short module name and fix some bugs
- [CI] Skip installation of hidet dependency to speedup format and lint
- [IR] Refactor the virtual machine IR
- [IR] Add the core IR of the tilus language
- [Init] Initial commit
Full Changelog: https://github.com/NVIDIA/tilus/commits/v0.1