Releases · NVIDIA/tilus
v0.2.0
Tilus v0.2.0 Release Notes
Tilus v0.2.0 brings Blackwell GPU support, initial Hopper support, and a comprehensive set of tutorials, documentation, and optimizations. This release transforms Tilus from an Ampere-focused DSL into one that targets NVIDIA's latest GPU architectures, with fine-grained control over Tensor Memory, TMA, and Cluster Launch Control.
Highlights
- Blackwell (SM 10.0) support — Write kernels using 5th-gen Tensor Cores (tcgen05), Tensor Memory (TMEM), TMA bulk copies, and 2-CTA Tensor Core mode.
- Hopper (SM 9.0) support — Initial support for WGMMA instructions and warp-specialized GEMM.
- Step-by-step matmul tutorials — Seven tutorials (V0–V6) walk through building a high-performance Blackwell matmul from scratch, covering TMA, software pipelining, warp specialization, tile rasterization (sketched just after this list), CLC persistent kernels, and 2-CTA mode.
- Codegen optimizations — Faster generated code via fast divmod, predicated instruction emission, warp-uniform code generation, and named barriers.
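One technique from the tutorial series, tile rasterization, is easy to show in isolation: instead of walking output tiles row by row, consecutive thread blocks are mapped to a narrow group of tile rows so that neighboring blocks reuse the same slices of A and B from L2. Below is a minimal, library-agnostic sketch of the index mapping; the function name and the `group_m` parameter are illustrative, not Tilus API.

```python
def rasterize(block_id: int, grid_m: int, grid_n: int, group_m: int = 8):
    """Map a linear block id to an output tile (tile_m, tile_n), walking the
    grid in column-major order within groups of `group_m` tile rows."""
    blocks_per_group = group_m * grid_n
    group = block_id // blocks_per_group
    first_m = group * group_m
    rows = min(grid_m - first_m, group_m)   # the last group may have fewer rows
    local = block_id % blocks_per_group
    return first_m + local % rows, local // rows

# Consecutive blocks now cover a tall, narrow region of C, so their A rows
# and B columns stay resident in L2 across neighboring blocks. The mapping
# is a bijection over the grid:
order = [rasterize(b, grid_m=4, grid_n=4, group_m=2) for b in range(16)]
assert sorted(order) == [(m, n) for m in range(4) for n in range(4)]
```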
Blackwell Architecture
Instruction Support
- Tensor Memory (tcgen05): `alloc`, `dealloc`, `relinquish_alloc_permit`, `load`, `store`, `wait`, `copy`, `commit`, and `mma` — including 2-CTA mode (#47, #48, #50, #55, #89)
- TMA (Tensor Memory Accelerator): Asynchronous bulk copy instructions for global↔shared memory transfers (#46)
- Cluster Launch Control (CLC): Cross-CTA scheduling and synchronization primitives (#68)
- mbarrier: Full set of memory barrier instructions with explicit `arrive_and_expect_tx` (#38, #88); a toy model of the expect-tx protocol follows this list
- cp.async.bulk: Bulk async copy with `.read` modifier for `wait_group` (#40, #106)
- Fence instructions: Refactored fence support for proxy async and memory ordering (#110)
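To make the `arrive_and_expect_tx` semantics concrete, here is a toy Python model of a transaction-counting barrier. This models the protocol only; the real mbarrier lives in shared memory, tracks phases, and has its transaction count completed by the copy hardware.

```python
class ToyMBarrier:
    """Toy model of one mbarrier phase with transaction counting."""

    def __init__(self, arrive_count: int):
        self.arrive_count = arrive_count   # threads that must arrive per phase
        self.arrivals = 0
        self.pending_bytes = 0

    def arrive_and_expect_tx(self, nbytes: int) -> None:
        # A thread arrives and registers nbytes of in-flight async copies;
        # the phase cannot complete until those bytes have landed.
        self.arrivals += 1
        self.pending_bytes += nbytes

    def complete_tx(self, nbytes: int) -> None:
        # Performed by the copy engine as each chunk of data is delivered.
        self.pending_bytes -= nbytes

    def phase_complete(self) -> bool:
        return self.arrivals >= self.arrive_count and self.pending_bytes == 0

bar = ToyMBarrier(arrive_count=1)
bar.arrive_and_expect_tx(nbytes=4096)   # issue a 4 KiB async load
assert not bar.phase_complete()         # a consumer would spin-wait here
bar.complete_tx(4096)                   # "hardware" delivers the tile
assert bar.phase_complete()
```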
Layout System
- Tensor Memory layout for TMEM tensors (#80)
- Refactored shared memory layout system with byte-level swizzle support (#85, #109)
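The byte-level swizzle mentioned above is the standard XOR trick: permute small chunks within each shared-memory row so that a walk down a column touches a different bank group in every row. A library-agnostic sketch follows; the 16-byte chunk and 128-byte row sizes are illustrative parameters, not Tilus defaults.

```python
def swizzle_offset(offset: int, chunk: int = 16, row_bytes: int = 128) -> int:
    """XOR-swizzle a byte offset: permute chunk-sized units within each row
    so that accesses down a column hit different banks in every row."""
    chunks_per_row = row_bytes // chunk
    unit, within = divmod(offset, chunk)
    row_id, col_id = divmod(unit, chunks_per_row)
    col_id ^= row_id % chunks_per_row     # the swizzle: XOR column by row bits
    return (row_id * chunks_per_row + col_id) * chunk + within

# Each row is merely permuted, so the mapping is a bijection and no two
# addresses collide after swizzling.
assert sorted(swizzle_offset(o) for o in range(1024)) == list(range(1024))
```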
GEMM Examples (V0–V8)
A progressive series of Blackwell matmul examples demonstrating increasing optimization levels (#58, #59, #64, #66, #75, #81, #90, #91, #95), reorganized for clarity in #128.
Hopper Architecture
- WGMMA instructions for Hopper Tensor Cores (#83)
- Hopper GEMM examples: pipelined matmul and warp-specialized GEMM (#84, #86)
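Warp specialization splits a thread block into producer warps that move data and consumer warps that compute, coupled through a bounded pipeline. The sketch below mimics that structure with Python threads and a two-slot queue; it is purely illustrative, since real kernels use warps, shared-memory stages, and mbarriers.

```python
import queue
import threading

stages: queue.Queue = queue.Queue(maxsize=2)   # ~ a 2-stage shared-memory pipeline

def producer(num_tiles: int) -> None:
    for k in range(num_tiles):
        stages.put(("tile", k))                # ~ async copy into a free stage
    stages.put(None)                           # ~ signal that the K loop is done

def consumer() -> None:
    consumed = 0
    while (item := stages.get()) is not None:  # ~ wait on the stage's barrier
        consumed += 1                          # ~ run WGMMA on the delivered tile
    print(f"consumed {consumed} tiles")

t = threading.Thread(target=producer, args=(8,))
t.start()
consumer()
t.join()
```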
Language & IR
- Thread groups: Support for `thread_group`, `single_thread`, `single_warp`, and `warp_group` in Tilus Script (#41)
- State construct: Persistent state across kernel invocations (#71)
- Cluster dimensions: Specify cluster layout via `cluster_blocks` (#33)
- Tensor indexing/slicing: Support for shared and global tensor indexing (#37)
- `.item()`/`.item_ptr()`: Access scalar values and pointers from tensors (#60)
- Target suffixes: Added `a` (architecture-specific) and `f` (family-portable) target variants (#32)
Codegen & Optimizations
- Fast divmod: Faster integer division and modulo in generated code (#117); the standard multiply-and-shift trick is sketched after this list
- Predicated instruction emission: Reduced warp divergence in generated code (#114)
- Warp-uniform code generation: Uniform execution for `ThreadGroupStmt` (#96)
- Named barriers: Use named barriers for warpgroup sync instead of mbarrier (#113)
- Dead code elimination: New Tilus IR pass (#92)
- Barrier register spill avoidance: Prevent local memory spill for barrier tensors (#93)
- TVM-FFI runtime: Generated libraries now use the TVM-FFI ABI (#53)
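Fast divmod is, by standard practice, the multiply-and-shift trick: when a divisor is fixed at kernel-build time, a division becomes one multiplication plus a shift instead of a slow hardware divide. A self-contained sketch of the classic scheme (illustrative, not lifted from the Tilus codebase):

```python
def precompute_magic(d: int, bits: int = 32) -> tuple[int, int]:
    """Precompute (magic, shift) so that n // d == (n * magic) >> shift
    for every 0 <= n < 2**bits and fixed divisor d > 0."""
    shift = bits + (d - 1).bit_length()   # enough fractional precision
    magic = (1 << shift) // d + 1         # ~ ceil(2**shift / d)
    return magic, shift

def fast_divmod(n: int, d: int, magic: int, shift: int) -> tuple[int, int]:
    q = (n * magic) >> shift              # multiply + shift replaces the divide
    return q, n - q * d                   # remainder falls out of the quotient

magic, shift = precompute_magic(48)       # divisor fixed when the kernel is built
assert all(fast_divmod(n, 48, magic, shift) == divmod(n, 48) for n in range(1 << 16))
```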
Documentation & Tutorials
- Blackwell matmul tutorial series (V0–V6): Step-by-step guides covering the full optimization journey (#123–#131)
- Instruction documentation: Comprehensive API docs for all instruction groups (#122)
- Programming guides: Revised guides for thread groups, autotuning, caching, and targets (#122, #134)
- Interactive register layout demo (#98)
- Multi-version docs with Sphinx (#119)
Infrastructure
- Pre-commit hooks for lint and formatting (#62)
- Docstring lint enforcement (#74)
- Separate CI workflows for tests and docs (#121)
- Python version compatibility tests (#121)
- Nsight Compute report analysis skill (#105)
Bug Fixes
- Fix low-precision pointer assignment transformation (#35)
- Fix tcgen05.cp codegen (#51)
- Fix volatile specifier for MMA instruction (#44)
- Add `cuda_bf16.h` include header (#77)
- Fix semaphore codegen (#112)
New Contributors
- @soodoshll — Hopper WGMMA and GEMM examples (#83, #84, #86)
- @qiching — Fused softmax example (#99)
- @splint-disk-8i — CI improvements and README edits (#97, #102)
- @WilliamZhang20 — Vector addition example (#111)
Full Changelog: v0.1.1...v0.2.0
v0.1.1
This is a small enhancement release of Tilus.
Highlights
- Add more examples: flash attention with KV cache, flash linear attention decode
- Fix a bug when multiple Tilus processes access the dispatch table in the cache
- Add targets `sm_100`, `sm_103`, `sm_110`, `sm_120`, and `sm_121`
What's Changed
- [Docs] Update README.md by @yaoyaoding in #11
- [CI] Use RTX 4090 for docs building by @yaoyaoding in #12
- [Docs] Update README.md by @yaoyaoding in #13
- [Package] Rename to under @NVIDIA organization by @nekomeowww in #15
- [Docs] Update installation guide by @yaoyaoding in #17
- [CI] Fix concurrency issue by @yaoyaoding in #18
- [Docs] Correct gflops to tflops in examples by @YichengDWu in #19
- [Example] Add the attention example with kv-cache by @yaoyaoding in #21
- [Example] Add example for decoding kernel of flash linear attention by @yaoyaoding in #25
- [Example] Add a kernel in the flash linear attention by @yaoyaoding in #26
- [Example] Add the fused kernel for decoding of flash linear attention by @yaoyaoding in #27
- [Tuning] Add lock to cache dir when dumping the tuning result by @yaoyaoding in #28 (see the sketch after this list)
- [Target] Add target properties by @yaoyaoding in #29
- [Bump] Bump version of hidet from 0.6.0 to 0.6.1 by @yaoyaoding in #30
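The cache-locking change (#28) addresses a classic multi-process race: two processes autotuning at once can clobber each other's dispatch-table writes. Here is a generic illustration of the idea using only the standard library; the paths, file names, and JSON format are hypothetical, not Tilus's actual cache layout.

```python
import fcntl
import json
import os

def dump_tuning_result(cache_dir: str, key: str, entry: dict) -> None:
    """Update the on-disk table under an exclusive inter-process lock."""
    os.makedirs(cache_dir, exist_ok=True)
    table_path = os.path.join(cache_dir, "dispatch_table.json")
    with open(os.path.join(cache_dir, ".lock"), "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)   # block until we own the cache dir
        try:
            table = {}
            if os.path.exists(table_path):
                with open(table_path) as f:
                    table = json.load(f)
            table[key] = entry                   # read-modify-write is now atomic
            with open(table_path, "w") as f:
                json.dump(table, f, indent=2)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```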
New Contributors
- @nekomeowww made their first contribution in #15
- @YichengDWu made their first contribution in #19
Full Changelog: v0.1...v0.1.1
v0.1
The initial release of tilus.
What's Changed
- [CI] Add workflow to deploy wheel to pypi (#10)
- [CI] Use deep checkout for diff (#9)
- [Docs] Update copyright and remove some redundant descriptions (#8)
- [CI] Use nvidia github runners for docs building (#7)
- [CI] Update docs and ci runner for format (#6)
- [CI] Fix the permission issue of deploy github pages (#5)
- [CI][Docs] Add workflow to deploy docs (#4)
- [CI] Migrate CI runners (#2)
- [License] Add license header and contribution guide
- [Misc] Add vscode settings
- [Docs] Add documentation for layout system
- [Docs] Add more sections in programming guides
- [Docs] Add the framework of programming guide
- [Docs] Add documentation for the remaining matmuls
- [Docs] Add docs for two matmul examples
- [Docs] Add the documentation for naive matmul
- [Docs] Add initial version of docs
- [Bugfix] Improve the performance of HoistLoopInvariants pass
- [Pass] Add HoistLoopInvariants pass
- [Pass] Add affine to recursive transformation pass
- [Pass] Explicitly list the used hidet passes
- [Bugfix] Fix OOM issue in attention example
- [Example] Optimize the attention operator by splitting the sequence of kv
- [Example] Remove explicit layout in examples
- [Operator] Optimize attention operator
- [Kernel] Optimize attention kernel
- [Tool] Update tilus IRPrinter
- [Script] Support script procedure
- [Operator] Optimize the attention operator by pipelining
- [Operator] Optimize attention operator with software pipelining
- [Bugfix] Fix a bug in attention example
- [Example] Add attention example
- [Package] Update information in `pyproject.toml`
- [Submodule] Remove .gitmodules
- [Feature] Automatic Layout Inference
- [Layout] Remove old layout definition
- [Layout] Use the new layout system in the emitters
- [Layout] Add the unified representation of layout system
- [Bug] Unify the segments of dynamic shape for tuning
- [Layout] Add transpose operation for register layout
- [Cache] Add cache for instantiated_script
- [Enhancement] Add more instructions and functionality
- [Reduce] Support reduce instruction
- [Instruction] Add repeat and repeat_interleave instructions
- [Optimize] Optimize layout for cast kernel
- [Options] Configure hidet option to avoid ftz flag in nvcc
- [CUDA] Avoid any cuda runtime api call during import
- [Fix] Fix a bug in quantization example
- [Sync] Upstream changes to hidet
- [Refactor] Simplify and generalize the dot instruction
- [Example] Add quantized matmul with full range of quantized data types
- [Quantization] Support low-precision data types in cast kernel
- [Tensor] Add tensor class and cast kernel
- [Fix] Fix a bug in mma emitter
- [Pass] Update the bound aware simplification pass
- [Example] Add example matmul-v7 with parallel-k implemented
- [Example] Add matmul-v6 that implements an efficient write back
- [Example] Add matmul-v5 example that implements software pipelining
- [Example] Add matmul-v4 that uses the `copy_async` instruction
- [Instruction][Example] Refactor LoadMatrix instruction
- [Version] Use setuptools_scm to manage the version number
- [IR][Codegen] Add generic load/store instructions for shared tensor
- [Tools] Add `IRVerifier` to verify the integrity and correctness of IR
- [Examples] Add examples to the lint script and CI
- [Tuning] Add support of auto-tune
- [Script] Add `load_shared` and `store_shared` in Tilus Script
- [IR][GlobalTensor] Introduce GlobalTensor in the Tilus IR
- [Matmul] Add simple matmul example
- [Linter] Enable mypy disallow incomplete defs
- [Linter] Enable check-untyped-defs flag of mypy
- [IR] Refactor IR classes to enforce copy-on-write mechanism
- [Script] Add `tilus.script` module
- [Workflow] Refactor hidet installation and wheel building as actions
- [Build] Enable end to end compilation and build
- [IR] Add Program IR node to hold multiple functions
- [IR] Use short module name and fix some bugs
- [CI] Skip installation of hidet dependency to speedup format and lint
- [IR] Refactor the virtual machine IR
- [IR] Add the core IR of the tilus language
- [Init] Initial commit
Full Changelog: https://github.com/NVIDIA/tilus/commits/v0.1