Use linear-indexing broadcast kernel when possible #520

maleadt · 2024-03-05T08:18:03Z

Attempt to re-land #454, this time using a slightly nicer implementation.

It hasn't fundamentally changed though, so it should run into the same issues. Let's do this carefully.

The motivation is also unchanged: on certain platforms, like Metal.jl, the integer divisions required to convert a linear hardware index into a cartesian one for indexing the input/output containers are extremely expensive. With static iteration bounds, the compiler can replace the idiv with a series of bitshifts. This improves the performance of broadcast by 3-4x on those platforms.
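To make the cost concrete, here is a minimal Python sketch (not code from this PR) of the linear-to-cartesian conversion. The function names are hypothetical, and indices are 0-based and column-major for simplicity. The general version needs one division per dimension; when the extents are known ahead of time and happen to be powers of two, the same result falls out of shifts and masks. For other constant extents, compilers typically emit a multiply-and-shift sequence instead of a hardware division.

```python
def linear_to_cartesian(i, dims):
    """0-based linear index -> 0-based column-major cartesian index.

    One integer division (divmod) per dimension: this is the cost
    discussed above when the extents are only known at run time.
    """
    idx = []
    for d in dims:
        i, r = divmod(i, d)
        idx.append(r)
    return tuple(idx)


def linear_to_cartesian_pow2(i, log2_dims):
    """Same conversion when every extent is a known power of two.

    log2_dims holds log2 of each extent (hypothetical helper input),
    so each division becomes a mask and a shift.
    """
    idx = []
    for s in log2_dims:
        idx.append(i & ((1 << s) - 1))  # remainder by 2**s
        i >>= s                         # quotient by 2**s
    return tuple(idx)


if __name__ == "__main__":
    # A 4 x 8 array: both variants agree for every linear index.
    for i in range(4 * 8):
        assert linear_to_cartesian(i, (4, 8)) == linear_to_cartesian_pow2(i, (2, 3))
    print(linear_to_cartesian(13, (4, 8)))
```

The flip side, raised later in the thread, is that baking the extents into the kernel means a new kernel for every distinct shape.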

cc @maxwindiff


maleadt · 2024-03-05T08:34:18Z

This looks to be working well, so tagging people who ran into issues before: @ToucheSir and @chengchingwen. Note that this will still cause additional compilation, i.e. every time the size of any container involved in a broadcast changes, but I'm curious about which workloads would trigger that (once in the steady-state application regime, of course).

ToucheSir · 2024-03-05T17:53:26Z

I had a look back through the CI failure on the Flux side. Apparently this call was the one that failed:

   [15] broadcast(::typeof(+), ::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, ::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
      @ Base.Broadcast ./broadcast.jl:821

But that's strange, because surely broadcasting + was already tested by GPUArrays + CUDA.jl? Anyhow, I doubt this will cause any problems for FluxML as long as elementwise broadcasting of binary ops still works across the board.

maleadt · 2024-03-05T18:06:03Z

Looking back at the CUDA.jl CI logs, there seemed to be some issue with printing too, which is why I added a show method here. I'm not sure whether that was the cause of an issue, or whether it was just masking an actual error in CI...

maleadt · 2024-03-06T08:17:01Z

I tried testing Transformers.jl, but that doesn't seem possible right now (see chengchingwen/Transformers.jl#153 and linked PRs in NeuralAttentionlib.jl).

maleadt · 2024-03-06T09:09:32Z

One alternative would be that we expose 1d/2d/3d indices and only generate 4 broadcast kernels. I'll experiment with that, as it would lead to far fewer kernels being compiled (but the fact that the bounds aren't fully statically known may come at a cost again).

Given #451 the above would also mean that KA.jl would need to support 1d/2d/3d indices, so cc @vchuravy.

maleadt · 2024-03-06T11:24:54Z

... or, I should probably just confine this optimization to Metal.jl...

chengchingwen · 2024-03-06T11:29:19Z

> i.e. every time the size of any container involved in a broadcast changes, but I'm curious about which workloads would trigger that (once in the steady-state application regime, of course).

This, unfortunately, happens a lot when doing sequence generation inference with transformer models. It might also happen during training but can be avoided with padding.
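A rough Python sketch of why this workload is a worst case for shape-specialized kernels, assuming one compiled kernel per distinct shape. The hidden size of 512, the 100 generation steps, and the padding multiple of 32 are made-up numbers for illustration, and the helper names are hypothetical.

```python
def count_compilations(shapes):
    """Number of kernels compiled if each distinct shape needs its own."""
    compiled = set()
    for shape in shapes:
        compiled.add(shape)  # stand-in for "compile a kernel for this shape"
    return len(compiled)


def pad_to_multiple(n, multiple):
    """Round n up to the next multiple (ceiling division, then scale)."""
    return -(-n // multiple) * multiple


if __name__ == "__main__":
    hidden = 512  # hypothetical model width

    # Sequence generation: the sequence grows by one token per step,
    # so every step presents a shape that has not been seen before.
    growing = [(hidden, n) for n in range(1, 101)]
    print(count_compilations(growing))

    # Padding the sequence length to a multiple of 32 collapses those
    # shapes into a handful of buckets.
    padded = [(hidden, pad_to_multiple(n, 32)) for n in range(1, 101)]
    print(count_compilations(padded))
```

Padding trades wasted computation on the padded elements for far fewer distinct shapes.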

maleadt · 2024-03-06T12:19:12Z

OK, good to know. I have an alternative in JuliaGPU/Metal.jl#304, relying on hardware indices instead. That will only accelerate 2d and 3d broadcasts though, so it's a trade-off.

chengchingwen · 2024-03-07T12:00:57Z

I think we might be able to port the algorithms used in libdivide to implement a new CartesianIndices without integer division. It's similar to the method used in StaticCartesian.jl, but without requiring the divisor to be known at compile time.
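For reference, a minimal Python sketch of the general idea: precompute a "magic" multiplier once per divisor at run time, then replace each division with a multiply and a shift. This uses the classic round-up multiplier for unsigned 32-bit operands and is not libdivide's actual API or exact algorithm; `make_divider` and `divide` are hypothetical names. Python's arbitrary-precision integers also hide the fact that the multiplier can need 33 bits, which a real fixed-width implementation has to handle separately.

```python
N = 32  # operand width in bits; dividends must satisfy 0 <= n < 2**N


def make_divider(d):
    """Precompute (magic, shift) for dividing by d; done once per divisor."""
    if not 1 <= d < (1 << N):
        raise ValueError("divisor out of range")
    l = (d - 1).bit_length()       # ceil(log2(d))
    shift = N + l
    magic = (1 << shift) // d + 1  # rounded-up reciprocal, may need N + 1 bits
    return magic, shift


def divide(n, divider):
    """floor(n / d) using only a multiplication and a shift."""
    magic, shift = divider
    return (n * magic) >> shift


if __name__ == "__main__":
    # The only real division happens in make_divider, outside the hot loop.
    for d in (1, 2, 3, 7, 10, 641, 2**31, 2**32 - 1):
        divider = make_divider(d)
        for n in (0, 1, d - 1, d, 123456789, 2**32 - 1):
            assert divide(n, divider) == n // d
    print("ok")
```

In a broadcast kernel, the precomputation would happen on the host per array extent, and the kernel would only perform the multiply-and-shift per element.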

maleadt · 2024-03-08T10:57:24Z

I only noticed a significant impact of the idiv on Metal.jl, so I've opted to move the specialization to Metal.jl (forcing static bounds when a specific broadcast shape is used more than 10 times).
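A minimal Python sketch of such a usage-count heuristic, assuming the threshold applies per shape and that specialization kicks in on the first use past the threshold. The class and method names are hypothetical, and the actual Metal.jl logic may differ in its details.

```python
from collections import Counter

SPECIALIZE_AFTER = 10  # threshold mentioned in the comment above


class BroadcastDispatcher:
    """Picks a dynamic-bounds kernel until a shape has proven to be hot."""

    def __init__(self):
        self.uses = Counter()

    def choose_kernel(self, shape):
        self.uses[shape] += 1
        if self.uses[shape] > SPECIALIZE_AFTER:
            # Hot shape: worth compiling a kernel with static bounds.
            return "static"
        # Rare shape: reuse the generic kernel and pay for the divisions.
        return "dynamic"


if __name__ == "__main__":
    dispatcher = BroadcastDispatcher()
    picks = [dispatcher.choose_kernel((4, 4)) for _ in range(12)]
    print(picks)
```

This keeps one-off shapes, such as those seen during sequence generation, on the generic kernel, while steady-state shapes eventually get the faster specialized one.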

Introduce StaticCartesianIndices to eliminate expensive integer divisions.

ac3911e

maleadt added the performance label Mar 5, 2024

Bump Julia version used by CI.

90fb573

maleadt force-pushed the tb/static_cartesian_indices branch from fcc80ce to 90fb573 March 5, 2024 09:02

maleadt added 4 commits March 6, 2024 13:04

Simplify.

c3783d7

Remove StaticCartesian.

5b9d7c0

Revert map change.

548a8d9

Remove outdated comment.

fb8cf80

Fixes.

0f6bf2a

maleadt changed the title ~~Introduce StaticCartesianIndices to eliminate expensive integer divisions.~~ Use linear-indexing broadcast kernel when possible Mar 6, 2024

This was referenced Mar 6, 2024

Specialize broadcast to avoid integer divisions. JuliaGPU/Metal.jl#304

Merged

Shader validator error with linear broadcast kernel JuliaGPU/Metal.jl#308

Open

maleadt merged commit e4d40ea into master Mar 8, 2024

maleadt deleted the tb/static_cartesian_indices branch March 8, 2024 10:56

vchuravy mentioned this pull request Apr 4, 2024

Significant perf drop when using dynamic ranges in GPU kernel JuliaGPU/KernelAbstractions.jl#470

Open

charleskawczynski mentioned this pull request Oct 8, 2024

Add support for linear indexing for pointwise kernels CliMA/ClimaCore.jl#1922

Closed
