Commit 3adec01

fix typo
1 parent 2ce9650 commit 3adec01

File tree

1 file changed: +4 −4 lines changed

lib/Dialect/TritonGPU/Transforms/Pipeliner/MatmulLoopPipeline.cpp

Lines changed: 4 additions & 4 deletions
@@ -90,13 +90,13 @@ static void createAsyncCopy(scf::ForOp &forOp, tt::LoadOp loadOp, Value alloc,
 // If the following are true...
 // 1) Operand A is for WGMMA and is to be loaded in registers
 // 2) We upcast operand A in registers before the WGMMA
-//    (downcasting is not yet supporting)
+//    (downcasting is not yet supported)
 //
 // ...then the SharedEncoding vec will be less than BlockedEncoding's
-// sizePerThread, for k-dim. E.g. if shared vec is 8 and sizePerThread
-// for k is 16, then AsyncCopyGlobalToLocal will generate two 8B-LDGSTS
+// sizePerThread for k-dim. E.g. if shared vec is 8 and sizePerThread
+// for k is 16, then AsyncCopyGlobalToLocal will generate two 8B-LDGSTS's
 // for each contiguous 16B global data owned by each thread. This breaks
-// coalescing.
+// coalescing (i.e. results in 2x the minimum required transactions)
 //
 // The fix is to clip the BlockedEnc's sizePerThread using SharedEnc's vec.
 auto tensorTy = cast<RankedTensorType>(src.getType());
