
Commit 03640d1

Begin migration of enqueue_function to enqueue_function_checked (#127)

* Begin migration of enqueue_function to enqueue_function_checked.
* Fix typo.
* Update to MutAnyOrigin, ImmutAnyOrigin.
* Formatting fix.
1 parent 843fc62 commit 03640d1
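In summary, the commit makes two mechanical changes across the book: `LayoutTensor` mutability moves from the `mut=True` keyword to an explicit origin parameter (`MutableAnyOrigin` is shortened to `MutAnyOrigin`), and unchecked kernel launches via `enqueue_function` become type-checked launches via `enqueue_function_checked`, which takes the kernel type twice. A before/after sketch of the pattern, assembled from lines appearing in the hunks of this commit (puzzle_04's `kernel` is used as the example):

```mojo
# Before: mutability via `mut=True`, pointer-based construction, unchecked launch
fn kernel[dtype: DType, layout: Layout](tensor: LayoutTensor[mut=True, dtype, layout]):
    ...

tensor = LayoutTensor[mut=True, dtype, layout](a.unsafe_ptr())
ctx.enqueue_function[kernel[dtype, layout]](tensor, grid_dim=1, block_dim=1)

# After: mutability via the origin parameter, buffer-based construction,
# and a checked launch that repeats the kernel type
fn kernel[dtype: DType, layout: Layout](tensor: LayoutTensor[dtype, layout, MutAnyOrigin]):
    ...

tensor = LayoutTensor[dtype, layout, MutAnyOrigin](a)
ctx.enqueue_function_checked[kernel[dtype, layout], kernel[dtype, layout]](tensor, grid_dim=1, block_dim=1)
```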


80 files changed: +641 −670 lines

book/src/puzzle_04/intro.mojo (3 additions, 3 deletions)

@@ -6,7 +6,7 @@ alias WIDTH = 3
 alias dtype = DType.float32
 alias layout = Layout.row_major(HEIGHT, WIDTH)

-fn kernel[dtype: DType, layout: Layout](tensor: LayoutTensor[mut=True, dtype, layout]):
+fn kernel[dtype: DType, layout: Layout](tensor: LayoutTensor[dtype, layout, MutAnyOrigin]):
     print("Before:")
     print(tensor)
     tensor[0, 0] += 1
@@ -17,8 +17,8 @@ def main():
     ctx = DeviceContext()

     a = ctx.enqueue_create_buffer[dtype](HEIGHT * WIDTH).enqueue_fill(0)
-    tensor = LayoutTensor[mut=True, dtype, layout](a.unsafe_ptr())
+    tensor = LayoutTensor[dtype, layout, MutAnyOrigin](a)
     # Note: since `tensor` is a device tensor we can't print it without the kernel wrapper
-    ctx.enqueue_function[kernel[dtype, layout]](tensor, grid_dim=1, block_dim=1)
+    ctx.enqueue_function_checked[kernel[dtype, layout], kernel[dtype, layout]](tensor, grid_dim=1, block_dim=1)

     ctx.synchronize()

book/src/puzzle_08/layout_tensor.md (2 additions, 2 deletions)

@@ -30,7 +30,7 @@ The key insight is how LayoutTensor simplifies shared memory management while ma
 shared = stack_allocation[TPB, Scalar[dtype]]()

 # LayoutTensor approach
-shared = LayoutTensor[dtype, Layout.row_major(TPB), MutableAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
+shared = LayoutTensor[dtype, Layout.row_major(TPB), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
 ```

 2. **Memory access**: Same syntax
@@ -168,7 +168,7 @@ This solution demonstrates how LayoutTensor simplifies shared memory usage while

 ```txt
 # Clean LayoutTensor API with address_space
-shared = LayoutTensor[dtype, Layout.row_major(TPB), MutableAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
+shared = LayoutTensor[dtype, Layout.row_major(TPB), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
 ```

 - Natural indexing for both global and shared:

book/src/puzzle_11/layout_tensor.md (2 additions, 2 deletions)

@@ -24,7 +24,7 @@ The key insight is how LayoutTensor simplifies shared memory management while ma

 Notes:

-- **LayoutTensor allocation**: Use `LayoutTensor[dtype, Layout.row_major(TPB), MutableAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()`
+- **LayoutTensor allocation**: Use `LayoutTensor[dtype, Layout.row_major(TPB), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()`
 - **Window access**: Natural indexing for 3-element windows
 - **Edge handling**: Special cases for first two positions
 - **Memory pattern**: One shared memory load per thread
@@ -116,7 +116,7 @@ The solution implements a sliding window sum using LayoutTensor with these key s
 - LayoutTensor creates block-local storage with address_space:

 ```txt
-shared = LayoutTensor[dtype, Layout.row_major(TPB), MutableAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
+shared = LayoutTensor[dtype, Layout.row_major(TPB), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
 ```

 - Each thread loads one element:

book/src/puzzle_12/layout_tensor.md (1 addition, 1 deletion)

@@ -25,7 +25,7 @@ The key insight is how LayoutTensor simplifies memory management while maintaini

 Notes:

-- **LayoutTensor allocation**: Use `LayoutTensor[dtype, Layout.row_major(TPB), MutableAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()`
+- **LayoutTensor allocation**: Use `LayoutTensor[dtype, Layout.row_major(TPB), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()`
 - **Element access**: Natural indexing with bounds checking
 - **Layout handling**: Separate layouts for input and output
 - **Thread coordination**: Same synchronization patterns with `barrier()`

book/src/puzzle_13/block_boundary.md (3 additions, 3 deletions)

@@ -32,7 +32,7 @@ Notes:

 <div class="solution-tips">

-1. Use `LayoutTensor[dtype, Layout.row_major(TPB + CONV_2 - 1), MutableAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()` for shared memory
+1. Use `LayoutTensor[dtype, Layout.row_major(TPB + CONV_2 - 1), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()` for shared memory
 2. Load main data: `shared_a[local_i] = a[global_i]`
 3. Load boundary: `if local_i < CONV_2 - 1` handle next block data
 4. Load kernel: `shared_b[local_i] = b[local_i]`
@@ -125,8 +125,8 @@ Size calculation:

 ```mojo
 # First: account for padding needed for convolution window
-shared_a = LayoutTensor[dtype, Layout.row_major(TPB + CONV_2 - 1), MutableAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
-shared_b = LayoutTensor[dtype, Layout.row_major(CONV_2), MutableAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
+shared_a = LayoutTensor[dtype, Layout.row_major(TPB + CONV_2 - 1), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
+shared_b = LayoutTensor[dtype, Layout.row_major(CONV_2), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
 ```

 This allocation pattern ensures we have enough space for both the block's data and the overlap region.

book/src/puzzle_14/complete.md (2 additions, 2 deletions)

@@ -343,10 +343,10 @@ The two kernel phases execute sequentially **without any explicit synchronizatio

 ```mojo
 # Phase 1: Local prefix sums
-ctx.enqueue_function[prefix_sum_local_phase[...]](...)
+ctx.enqueue_function_checked[prefix_sum_local_phase[...], prefix_sum_local_phase[...]](...)

 # Phase 2: Add block sums (automatically waits for Phase 1)
-ctx.enqueue_function[prefix_sum_block_sum_phase[...]](...)
+ctx.enqueue_function_checked[prefix_sum_block_sum_phase[...], prefix_sum_block_sum_phase[...]](...)
 ```

 **Key insight**: Mojo's `DeviceContext` uses a single execution stream (CUDA stream on NVIDIA GPUs, HIP stream on AMD ROCm GPUs), which guarantees that kernel launches execute in the exact order they are enqueued. No explicit synchronization is needed between kernels.

book/src/puzzle_16/shared_memory.md (2 additions, 2 deletions)

@@ -131,8 +131,8 @@ Matrix B: b_shared: (similar layout)

 ```mojo
 # Create 2D shared memory tensors using LayoutTensor with address_space
-a_shared = LayoutTensor[dtype, Layout.row_major(TPB, TPB), MutableAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
-b_shared = LayoutTensor[dtype, Layout.row_major(TPB, TPB), MutableAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
+a_shared = LayoutTensor[dtype, Layout.row_major(TPB, TPB), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
+b_shared = LayoutTensor[dtype, Layout.row_major(TPB, TPB), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
 ```

 2. **Thread Indexing**:

book/src/puzzle_17/puzzle_17.md (1 addition, 1 deletion)

@@ -189,7 +189,7 @@ Let's break down how this works in the larger context:
 ```mojo
 gpu_ctx = ctx.get_device_context()
 gpu_ctx.enqueue_memset(...)  # Zero output buffer
-gpu_ctx.enqueue_function[...](...)  # Schedule kernel
+gpu_ctx.enqueue_function_checked[..., ...](...)  # Schedule kernel
 ```

 - Device context manages GPU resources

book/src/puzzle_18/puzzle_18.md (2 additions, 2 deletions)

@@ -273,8 +273,8 @@ The kernel is parameterized with:
 #### Shared memory allocation

 ```mojo
-shared_max = LayoutTensor[dtype, Layout.row_major(BLOCK_DIM_X), MutableAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
-shared_sum = LayoutTensor[dtype, Layout.row_major(BLOCK_DIM_X), MutableAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
+shared_max = LayoutTensor[dtype, Layout.row_major(BLOCK_DIM_X), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
+shared_sum = LayoutTensor[dtype, Layout.row_major(BLOCK_DIM_X), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
 ```

 The kernel allocates two shared memory buffers:

book/src/puzzle_19/puzzle_19.md (1 addition, 1 deletion)

@@ -121,7 +121,7 @@ To complete this puzzle, we'll leverage the tiled matmul kernel from [Puzzle 16]

 **Transpose Kernel Implementation Guide:**

-1. **Shared Memory Setup**: Use `LayoutTensor[dtype, Layout.row_major(TRANSPOSE_BLOCK_DIM_XY, TRANSPOSE_BLOCK_DIM_XY), MutableAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()` to create a square `TRANSPOSE_BLOCK_DIM_XY` × `TRANSPOSE_BLOCK_DIM_XY` shared memory tile for efficient data exchange between threads
+1. **Shared Memory Setup**: Use `LayoutTensor[dtype, Layout.row_major(TRANSPOSE_BLOCK_DIM_XY, TRANSPOSE_BLOCK_DIM_XY), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()` to create a square `TRANSPOSE_BLOCK_DIM_XY` × `TRANSPOSE_BLOCK_DIM_XY` shared memory tile for efficient data exchange between threads

 2. **Thread Indexing**: Map threads to matrix elements:
    - `local_row = thread_idx.y`, `local_col = thread_idx.x` (position within the block)
