Add Neon f32-bf16-vcvt kernels by GregoryComer · Pull Request #9922 · google/XNNPACK

GregoryComer · 2026-04-08T21:54:21Z

Add NEON kernels for f32 <-> bf16 vcvt. I've generally followed the fp16 pattern here. For fp32->bf16, use neonbf16 where available; otherwise fall back to plain NEON with explicit rounding handling. For bf16->fp32, just shift left by 16 bits.
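As a rough scalar illustration of the two directions described above: bf16 is the upper 16 bits of an fp32 value, so widening is a shift, while narrowing needs a rounding step. The function names below are hypothetical (not XNNPACK's API), and round-to-nearest-even with NaN quieting is an assumption about what "explicit rounding handling" means here.

```c
#include <stdint.h>
#include <string.h>

// Hypothetical helper: widen a bf16 bit pattern to fp32 by shifting it into
// the upper half of a 32-bit word. No rounding is needed in this direction.
static inline float bf16_bits_to_f32(uint16_t h) {
  uint32_t bits = (uint32_t)h << 16;
  float f;
  memcpy(&f, &bits, sizeof f);
  return f;
}

// Hypothetical helper: narrow fp32 to a bf16 bit pattern.
// Assumes round-to-nearest-even; NaNs are kept as NaNs by forcing the quiet bit.
static inline uint16_t f32_to_bf16_bits_rne(float f) {
  uint32_t bits;
  memcpy(&bits, &f, sizeof bits);
  if ((bits & 0x7FFFFFFFu) > 0x7F800000u) {
    // NaN: keep sign and upper payload bits, set the top fraction bit.
    return (uint16_t)((bits >> 16) | 0x0040u);
  }
  // Add 0x7FFF plus the bit that becomes the bf16 LSB, so ties round to even.
  const uint32_t lsb = (bits >> 16) & 1u;
  bits += 0x7FFFu + lsb;
  return (uint16_t)(bits >> 16);
}
```

The widening direction is a plain shift with no branches, which fits the much higher bf16→f32 throughput in the benchmarks.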

The manual rounding path runs at less than half the throughput of the feat_bf16 path (roughly 50 vs. 122 GiB/s on M4 Max), but I figure correct rounding is worth the cost, since bf16 has only 7 fraction bits. Also, note that the scalar kernels are getting auto-vectorized well on M4 (confirmed via disasm), hence the negligible difference between neon without feat_bf16 and scalar.
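For readers unfamiliar with what manual rounding looks like without feat_bf16, here is a minimal four-lane sketch. It is not the kernel from this PR: the function name is hypothetical, round-to-nearest-even and NaN quieting are assumptions, and a scalar fallback is included only so the snippet builds on non-Arm hosts.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper: convert 4 floats to 4 bf16 bit patterns.
static void f32_to_bf16_x4(const float* in, uint16_t* out) {
#if defined(__ARM_NEON)
  const uint32x4_t bits = vreinterpretq_u32_f32(vld1q_f32(in));
  // Rounding bias: 0x7FFF plus the bit that will become the bf16 LSB.
  const uint32x4_t lsb = vandq_u32(vshrq_n_u32(bits, 16), vdupq_n_u32(1));
  const uint32x4_t rounded =
      vaddq_u32(vaddq_u32(bits, vdupq_n_u32(0x7FFF)), lsb);
  // NaN lanes have magnitude bits strictly above the infinity pattern.
  const uint32x4_t is_nan =
      vcgtq_u32(vandq_u32(bits, vdupq_n_u32(0x7FFFFFFF)),
                vdupq_n_u32(0x7F800000));
  // For NaN lanes, skip rounding and force the quiet bit instead.
  const uint32x4_t quieted = vorrq_u32(bits, vdupq_n_u32(0x00400000));
  const uint32x4_t result = vbslq_u32(is_nan, quieted, rounded);
  // Keep the upper 16 bits of each lane.
  vst1_u16(out, vshrn_n_u32(result, 16));
#else
  // Portable fallback with the same semantics, one element at a time.
  for (size_t i = 0; i < 4; i++) {
    uint32_t bits;
    memcpy(&bits, &in[i], sizeof bits);
    if ((bits & 0x7FFFFFFFu) > 0x7F800000u) {
      bits |= 0x00400000u;
    } else {
      bits += 0x7FFFu + ((bits >> 16) & 1u);
    }
    out[i] = (uint16_t)(bits >> 16);
  }
#endif
}
```

The extra mask, add, compare, and select operations per vector give a sense of why this path would be slower than a single hardware conversion instruction.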

On the rounding logic, there is an existing edge case: fp32 NaNs whose upper 7 fraction bits are all zero get converted to infinities, because the only bits marking the value as a NaN are in the half that is discarded. I fixed this in the math.h function that the scalar kernels call.
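The edge case can be reproduced on raw bit patterns. The sketch below uses hypothetical function names and assumes a bias-and-shift rounding scheme; the actual fix in math.h may be implemented differently.

```c
#include <stdint.h>

// Hypothetical: round-to-nearest-even by bias-and-shift, with no NaN check.
static inline uint16_t bf16_round_naive(uint32_t f32_bits) {
  const uint32_t lsb = (f32_bits >> 16) & 1u;
  return (uint16_t)((f32_bits + 0x7FFFu + lsb) >> 16);
}

// NaN tests on raw bit patterns: exponent all ones and a nonzero fraction.
static inline int f32_bits_is_nan(uint32_t b) {
  return (b & 0x7FFFFFFFu) > 0x7F800000u;
}
static inline int bf16_bits_is_nan(uint16_t b) {
  return (b & 0x7FFFu) > 0x7F80u;
}

// Hypothetical fix: detect NaN first and force the top fraction bit, so the
// result stays a NaN even when only low payload bits were set.
static inline uint16_t bf16_round_nan_safe(uint32_t f32_bits) {
  if (f32_bits_is_nan(f32_bits)) {
    return (uint16_t)((f32_bits >> 16) | 0x0040u);
  }
  return bf16_round_naive(f32_bits);
}
```

For example, the fp32 NaN `0x7F800001` becomes `0x7F80` (+inf) under the naive version, but `0x7FC0` (a quiet NaN) under the NaN-safe version.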

Benchmarks

Comparison to fp16 kernels is given for reference.

M4 Max

Kernel	bf16→f32 GiB/s	f16→f32 GiB/s	f32→bf16 GiB/s	f32→f16 GiB/s
neon_u8	157.4	26.8–31.0	50.5	27.7
neon_u16	166.0	27.0–31.3	50.4	27.6
neon_u24	167.4	26.7–31.3	49.7	27.6
neon_u32	166.4	26.3–31.3	49.2	27.8
neonbf16_u8			122.5
neonbf16_u16			121.3
neonfp16_u8		159.0		145.1
neonfp16_u16		156.6		115.5
scalar_u1	24.2	11.6	18.0	8.2–9.0
scalar_u2	154.4	25.8	49.1	20.1–20.8
scalar_u3	155.9	25.9	41.5	20.7
scalar_u4	164.1	23.3	49.7	18.5–18.9

Cortex-X1 (Pixel 7 Pro)

Kernel	bf16→f32 GiB/s	f16→f32 GiB/s	f32→bf16 GiB/s	f32→f16 GiB/s
neon_u8	80.8	13.4–17.9	28.4	13.5
neon_u16	130.0	15.0–17.4	30.3	14.8
neon_u24	101.4	15.7–18.3	31.0	16.6
neon_u32	90.2	15.7–18.6	32.0	16.7
neonbf16_u8			unsupported
neonbf16_u16			unsupported
neonfp16_u8		67.7		66.6
neonfp16_u16		67.8		67.0
scalar_u1	127.6	15.5	28.4	10.7–10.9
scalar_u2	127.6	13.7	30.1	10.8–10.9
scalar_u3	99.0	13.6	30.3	11.1–11.5
scalar_u4	127.6	11.8	32.0	9.7–10.0

GregoryComer mentioned this pull request Apr 8, 2026: [Tracker] Expanded BF16 Support #9728 (Open, 23 tasks)

dsharlet approved these changes Apr 8, 2026

Add neon f32-bf16-vcvt kernels

ab685da

GregoryComer force-pushed the f32-bf16-vcvt-neon branch from 7387dfc to ab685da Compare April 8, 2026 23:41

dsharlet approved these changes Apr 9, 2026

GregoryComer wants to merge 1 commit into google:master from GregoryComer:f32-bf16-vcvt-neon
