
Commit 03640d1

Begin migration of enqueue_function to enqueue_function_checked (#127)

* Begin migration of enqueue_function to enqueue_function_checked.
* Fix typo.
* Update to MutAnyOrigin, ImmutAnyOrigin.
* Formatting fix.
1 parent 843fc62 commit 03640d1
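In summary, the commit makes two mechanical changes across the book: `LayoutTensor` mutability moves from the `mut=True` keyword to an explicit origin parameter (`MutableAnyOrigin` is shortened to `MutAnyOrigin`), and unchecked kernel launches via `enqueue_function` become type-checked launches via `enqueue_function_checked`, which takes the kernel type twice. A before/after sketch of the pattern, assembled from lines appearing in the hunks of this commit (puzzle_04's `kernel` is used as the example):

```mojo
# Before: mutability via `mut=True`, pointer-based construction, unchecked launch
fn kernel[dtype: DType, layout: Layout](tensor: LayoutTensor[mut=True, dtype, layout]):
    ...

tensor = LayoutTensor[mut=True, dtype, layout](a.unsafe_ptr())
ctx.enqueue_function[kernel[dtype, layout]](tensor, grid_dim=1, block_dim=1)

# After: mutability via the origin parameter, buffer-based construction,
# and a checked launch that repeats the kernel type
fn kernel[dtype: DType, layout: Layout](tensor: LayoutTensor[dtype, layout, MutAnyOrigin]):
    ...

tensor = LayoutTensor[dtype, layout, MutAnyOrigin](a)
ctx.enqueue_function_checked[kernel[dtype, layout], kernel[dtype, layout]](tensor, grid_dim=1, block_dim=1)
```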


80 files changed: +641 −670 lines

book/src/puzzle_04/intro.mojo (3 additions, 3 deletions)

@@ -6,7 +6,7 @@ alias WIDTH = 3
 alias dtype = DType.float32
 alias layout = Layout.row_major(HEIGHT, WIDTH)

-fn kernel[dtype: DType, layout: Layout](tensor: LayoutTensor[mut=True, dtype, layout]):
+fn kernel[dtype: DType, layout: Layout](tensor: LayoutTensor[dtype, layout, MutAnyOrigin]):
     print("Before:")
     print(tensor)
     tensor[0, 0] += 1
@@ -17,8 +17,8 @@ def main():
     ctx = DeviceContext()

     a = ctx.enqueue_create_buffer[dtype](HEIGHT * WIDTH).enqueue_fill(0)
-    tensor = LayoutTensor[mut=True, dtype, layout](a.unsafe_ptr())
+    tensor = LayoutTensor[dtype, layout, MutAnyOrigin](a)
     # Note: since `tensor` is a device tensor we can't print it without the kernel wrapper
-    ctx.enqueue_function[kernel[dtype, layout]](tensor, grid_dim=1, block_dim=1)
+    ctx.enqueue_function_checked[kernel[dtype, layout], kernel[dtype, layout]](tensor, grid_dim=1, block_dim=1)

     ctx.synchronize()

book/src/puzzle_08/layout_tensor.md (2 additions, 2 deletions)

@@ -30,7 +30,7 @@ The key insight is how LayoutTensor simplifies shared memory management while ma
 shared = stack_allocation[TPB, Scalar[dtype]]()

 # LayoutTensor approach
-shared = LayoutTensor[dtype, Layout.row_major(TPB), MutableAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
+shared = LayoutTensor[dtype, Layout.row_major(TPB), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
 ```

 2. **Memory access**: Same syntax
@@ -168,7 +168,7 @@ This solution demonstrates how LayoutTensor simplifies shared memory usage while

 ```txt
 # Clean LayoutTensor API with address_space
-shared = LayoutTensor[dtype, Layout.row_major(TPB), MutableAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
+shared = LayoutTensor[dtype, Layout.row_major(TPB), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
 ```

 - Natural indexing for both global and shared:

book/src/puzzle_11/layout_tensor.md (2 additions, 2 deletions)

@@ -24,7 +24,7 @@ The key insight is how LayoutTensor simplifies shared memory management while ma

 Notes:

-- **LayoutTensor allocation**: Use `LayoutTensor[dtype, Layout.row_major(TPB), MutableAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()`
+- **LayoutTensor allocation**: Use `LayoutTensor[dtype, Layout.row_major(TPB), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()`
 - **Window access**: Natural indexing for 3-element windows
 - **Edge handling**: Special cases for first two positions
 - **Memory pattern**: One shared memory load per thread
@@ -116,7 +116,7 @@ The solution implements a sliding window sum using LayoutTensor with these key s
 - LayoutTensor creates block-local storage with address_space:

 ```txt
-shared = LayoutTensor[dtype, Layout.row_major(TPB), MutableAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
+shared = LayoutTensor[dtype, Layout.row_major(TPB), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
 ```

 - Each thread loads one element:

book/src/puzzle_12/layout_tensor.md (1 addition, 1 deletion)

@@ -25,7 +25,7 @@ The key insight is how LayoutTensor simplifies memory management while maintaini

 Notes:

-- **LayoutTensor allocation**: Use `LayoutTensor[dtype, Layout.row_major(TPB), MutableAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()`
+- **LayoutTensor allocation**: Use `LayoutTensor[dtype, Layout.row_major(TPB), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()`
 - **Element access**: Natural indexing with bounds checking
 - **Layout handling**: Separate layouts for input and output
 - **Thread coordination**: Same synchronization patterns with `barrier()`

book/src/puzzle_13/block_boundary.md (3 additions, 3 deletions)

@@ -32,7 +32,7 @@ Notes:

 <div class="solution-tips">

-1. Use `LayoutTensor[dtype, Layout.row_major(TPB + CONV_2 - 1), MutableAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()` for shared memory
+1. Use `LayoutTensor[dtype, Layout.row_major(TPB + CONV_2 - 1), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()` for shared memory
 2. Load main data: `shared_a[local_i] = a[global_i]`
 3. Load boundary: `if local_i < CONV_2 - 1` handle next block data
 4. Load kernel: `shared_b[local_i] = b[local_i]`
@@ -125,8 +125,8 @@ Size calculation:

 ```mojo
 # First: account for padding needed for convolution window
-shared_a = LayoutTensor[dtype, Layout.row_major(TPB + CONV_2 - 1), MutableAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
-shared_b = LayoutTensor[dtype, Layout.row_major(CONV_2), MutableAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
+shared_a = LayoutTensor[dtype, Layout.row_major(TPB + CONV_2 - 1), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
+shared_b = LayoutTensor[dtype, Layout.row_major(CONV_2), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
 ```

 This allocation pattern ensures we have enough space for both the block's data and the overlap region.

book/src/puzzle_14/complete.md (2 additions, 2 deletions)

@@ -343,10 +343,10 @@ The two kernel phases execute sequentially **without any explicit synchronizatio

 ```mojo
 # Phase 1: Local prefix sums
-ctx.enqueue_function[prefix_sum_local_phase[...]](...)
+ctx.enqueue_function_checked[prefix_sum_local_phase[...], prefix_sum_local_phase[...]](...)

 # Phase 2: Add block sums (automatically waits for Phase 1)
-ctx.enqueue_function[prefix_sum_block_sum_phase[...]](...)
+ctx.enqueue_function_checked[prefix_sum_block_sum_phase[...], prefix_sum_block_sum_phase[...]](...)
 ```

 **Key insight**: Mojo's `DeviceContext` uses a single execution stream (CUDA stream on NVIDIA GPUs, HIP stream on AMD ROCm GPUs), which guarantees that kernel launches execute in the exact order they are enqueued. No explicit synchronization is needed between kernels.

book/src/puzzle_16/shared_memory.md (2 additions, 2 deletions)

@@ -131,8 +131,8 @@ Matrix B: b_shared: (similar layout)

 ```mojo
 # Create 2D shared memory tensors using LayoutTensor with address_space
-a_shared = LayoutTensor[dtype, Layout.row_major(TPB, TPB), MutableAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
-b_shared = LayoutTensor[dtype, Layout.row_major(TPB, TPB), MutableAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
+a_shared = LayoutTensor[dtype, Layout.row_major(TPB, TPB), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
+b_shared = LayoutTensor[dtype, Layout.row_major(TPB, TPB), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
 ```

 2. **Thread Indexing**:

book/src/puzzle_17/puzzle_17.md (1 addition, 1 deletion)

@@ -189,7 +189,7 @@ Let's break down how this works in the larger context:
 ```mojo
 gpu_ctx = ctx.get_device_context()
 gpu_ctx.enqueue_memset(...)  # Zero output buffer
-gpu_ctx.enqueue_function[...](...)  # Schedule kernel
+gpu_ctx.enqueue_function_checked[..., ...](...)  # Schedule kernel
 ```

 - Device context manages GPU resources

book/src/puzzle_18/puzzle_18.md (2 additions, 2 deletions)

@@ -273,8 +273,8 @@ The kernel is parameterized with:
 #### Shared memory allocation

 ```mojo
-shared_max = LayoutTensor[dtype, Layout.row_major(BLOCK_DIM_X), MutableAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
-shared_sum = LayoutTensor[dtype, Layout.row_major(BLOCK_DIM_X), MutableAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
+shared_max = LayoutTensor[dtype, Layout.row_major(BLOCK_DIM_X), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
+shared_sum = LayoutTensor[dtype, Layout.row_major(BLOCK_DIM_X), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
 ```

 The kernel allocates two shared memory buffers:

book/src/puzzle_19/puzzle_19.md (1 addition, 1 deletion)

@@ -121,7 +121,7 @@ To complete this puzzle, we'll leverage the tiled matmul kernel from [Puzzle 16]

 **Transpose Kernel Implementation Guide:**

-1. **Shared Memory Setup**: Use `LayoutTensor[dtype, Layout.row_major(TRANSPOSE_BLOCK_DIM_XY, TRANSPOSE_BLOCK_DIM_XY), MutableAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()` to create a square `TRANSPOSE_BLOCK_DIM_XY` × `TRANSPOSE_BLOCK_DIM_XY` shared memory tile for efficient data exchange between threads
+1. **Shared Memory Setup**: Use `LayoutTensor[dtype, Layout.row_major(TRANSPOSE_BLOCK_DIM_XY, TRANSPOSE_BLOCK_DIM_XY), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()` to create a square `TRANSPOSE_BLOCK_DIM_XY` × `TRANSPOSE_BLOCK_DIM_XY` shared memory tile for efficient data exchange between threads

 2. **Thread Indexing**: Map threads to matrix elements:
    - `local_row = thread_idx.y`, `local_col = thread_idx.x` (position within the block)
