Add Neon f32-bf16-vcvt kernels #9922

Draft
GregoryComer wants to merge 1 commit into google:master from
GregoryComer:f32-bf16-vcvt-neon

@GregoryComer
Contributor

Add Neon kernels for f32 <-> bf16 vcvt. I've generally followed the existing fp16 pattern here. For f32->bf16, use the neonbf16 kernels where available; otherwise fall back to plain Neon with explicit rounding handling. For bf16->f32, just shift the 16 bits into the upper half of the 32-bit word.
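To make the two directions concrete, here is a minimal scalar sketch in portable C. It is not the code from this PR: the helper names `bf16_bits_to_f32` and `f32_to_bf16_bits_rne` are hypothetical, and the rounding shown is the common "add a bias, then shift" form of round-to-nearest-even, assumed here for illustration. NaN inputs are deliberately left out of this sketch and need separate handling.

```c
#include <stdint.h>
#include <string.h>

// Hypothetical helper names for illustration; not XNNPACK's actual API.

// bf16 -> f32: the 16 bf16 bits are exactly the upper half of the f32
// bit pattern, so widening is a left shift by 16.
static inline float bf16_bits_to_f32(uint16_t h) {
  const uint32_t w = (uint32_t) h << 16;
  float f;
  memcpy(&f, &w, sizeof f);
  return f;
}

// f32 -> bf16 with round-to-nearest-even.
// Valid for zeros, subnormals, normal values and infinities.
// NaN inputs are NOT handled in this sketch.
static inline uint16_t f32_to_bf16_bits_rne(float f) {
  uint32_t w;
  memcpy(&w, &f, sizeof w);
  // Lowest bit that survives the conversion; adding it to the bias
  // makes exact ties round toward the even result.
  const uint32_t lsb = (w >> 16) & 1u;
  const uint32_t bias = UINT32_C(0x7FFF) + lsb;
  return (uint16_t) ((w + bias) >> 16);
}
```

In this sketch the narrowing direction needs an extra shift, mask and two adds per element, while the widening direction is a single shift.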

The manual rounding logic runs at less than half the throughput of the feat_bf16 path (about 50 vs. 122 GiB/s on M4 Max), but I figure correct rounding is worth it, as bf16 has only 7 fraction bits. Also, note that the scalar kernels are getting auto-vectorized well on M4 (confirmed via disassembly), hence the negligible difference between Neon without feat_bf16 and scalar.

On the rounding logic, there is an existing edge case where fp32 NaNs whose upper 7 fraction bits are all zero get converted to infinities, because the fraction that remains in the bf16 result is zero. I fixed this in the math.h function that the scalar kernels call.
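The NaN edge case can be illustrated with a small scalar sketch. This is an assumption about one reasonable fix, not the actual change to the math.h function: the names `f32_from_bits` and `f32_to_bf16_bits_nan_safe` are hypothetical, and forcing the quiet bit is just one way to keep the bf16 fraction nonzero. For example, the f32 NaN with bit pattern 0x7F800001 maps to 0x7F80 (infinity) under plain truncation or biased rounding, whereas the version below returns 0x7FC0.

```c
#include <stdint.h>
#include <string.h>

// Hypothetical helpers for illustration; not the PR's actual code.

// Build a float from its raw bit pattern.
static inline float f32_from_bits(uint32_t w) {
  float f;
  memcpy(&f, &w, sizeof f);
  return f;
}

// f32 -> bf16 with round-to-nearest-even and explicit NaN handling.
static inline uint16_t f32_to_bf16_bits_nan_safe(float f) {
  uint32_t w;
  memcpy(&w, &f, sizeof w);
  if ((w & UINT32_C(0x7FFFFFFF)) > UINT32_C(0x7F800000)) {
    // NaN: keep the sign and the upper payload bits, and force the
    // quiet bit so the bf16 fraction can never become zero.
    return (uint16_t) ((w >> 16) | UINT32_C(0x0040));
  }
  // Non-NaN inputs: bias-and-shift rounding, ties to even.
  const uint32_t bias = UINT32_C(0x7FFF) + ((w >> 16) & 1u);
  return (uint16_t) ((w + bias) >> 16);
}
```

Checking for NaN before adding the rounding bias also avoids the carry from a large NaN payload spilling into the exponent and sign bits.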

Benchmarks

Comparison to fp16 kernels is given for reference.

M4 Max

| Kernel | bf16→f32 (GiB/s) | f16→f32 (GiB/s) | f32→bf16 (GiB/s) | f32→f16 (GiB/s) |
| --- | --- | --- | --- | --- |
| neon_u8 | 157.4 | 26.8–31.0 | 50.5 | 27.7 |
| neon_u16 | 166.0 | 27.0–31.3 | 50.4 | 27.6 |
| neon_u24 | 167.4 | 26.7–31.3 | 49.7 | 27.6 |
| neon_u32 | 166.4 | 26.3–31.3 | 49.2 | 27.8 |
| neonbf16_u8 | | | 122.5 | |
| neonbf16_u16 | | | 121.3 | |
| neonfp16_u8 | | 159.0 | | 145.1 |
| neonfp16_u16 | | 156.6 | | 115.5 |
| scalar_u1 | 24.2 | 11.6 | 18.0 | 8.2–9.0 |
| scalar_u2 | 154.4 | 25.8 | 49.1 | 20.1–20.8 |
| scalar_u3 | 155.9 | 25.9 | 41.5 | 20.7 |
| scalar_u4 | 164.1 | 23.3 | 49.7 | 18.5–18.9 |

Cortex-X1 (Pixel 7 Pro)

| Kernel | bf16→f32 (GiB/s) | f16→f32 (GiB/s) | f32→bf16 (GiB/s) | f32→f16 (GiB/s) |
| --- | --- | --- | --- | --- |
| neon_u8 | 80.8 | 13.4–17.9 | 28.4 | 13.5 |
| neon_u16 | 130.0 | 15.0–17.4 | 30.3 | 14.8 |
| neon_u24 | 101.4 | 15.7–18.3 | 31.0 | 16.6 |
| neon_u32 | 90.2 | 15.7–18.6 | 32.0 | 16.7 |
| neonbf16_u8 | | | unsupported | |
| neonbf16_u16 | | | unsupported | |
| neonfp16_u8 | | 67.7 | | 66.6 |
| neonfp16_u16 | | 67.8 | | 67.0 |
| scalar_u1 | 127.6 | 15.5 | 28.4 | 10.7–10.9 |
| scalar_u2 | 127.6 | 13.7 | 30.1 | 10.8–10.9 |
| scalar_u3 | 99.0 | 13.6 | 30.3 | 11.1–11.5 |
| scalar_u4 | 127.6 | 11.8 | 32.0 | 9.7–10.0 |
