
Consider using SimdUnicode.UTF8.GetPointerToFirstInvalidByte for UTF-8 validation #103781

@lemire


The runtime has a great and fast function for UTF-8 validation: Utf8Utility.GetPointerToFirstInvalidByte. But we might be able to do better.

We implemented, in C#, the 'lookup' UTF-8 validation algorithm from

The algorithm is used by Oracle GraalVM and by the Node.js and Bun JavaScript runtimes. For example, Node.js is capable of validating Arabic or Chinese strings at 17 GB/s on a 2 GHz Intel server (from JavaScript).

We adapted it so that we can match exactly the functionality of Utf8Utility.GetPointerToFirstInvalidByte with a function called SimdUnicode.UTF8.GetPointerToFirstInvalidByte. It is available on GitHub at simdutf/SimdUnicode. We have good tests, and decent benchmarks. We use .NET's excellent runtime dispatching functionality to select the best function (SSE4.2, AVX2, AVX-512, fallback, NEON). We used @EgorBo's Disasmo to help tune the code, although we make no claim that it is optimal (it probably is not).
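To make the contract concrete, here is a minimal scalar sketch in Python of what "first invalid byte" means. It is not the SIMD 'lookup' algorithm and not the SimdUnicode or .NET code: the function name `first_invalid_byte_offset` is hypothetical, it returns an offset rather than a pointer, and it assumes the reported position is the start of the first ill-formed sequence (or the input length when everything is valid). The byte ranges follow the well-formed UTF-8 table in the Unicode Standard.

```python
def first_invalid_byte_offset(data: bytes) -> int:
    """Offset of the first ill-formed UTF-8 sequence, or len(data) if valid.

    Hypothetical reference helper for illustration only (scalar, not SIMD).
    """
    n = len(data)
    i = 0
    while i < n:
        lead = data[i]
        if lead < 0x80:  # ASCII: a single byte
            i += 1
            continue
        # Pick the number of continuation bytes and the allowed range of
        # the *second* byte, which depends on the lead byte.
        if 0xC2 <= lead <= 0xDF:
            extra, lo, hi = 1, 0x80, 0xBF
        elif lead == 0xE0:
            extra, lo, hi = 2, 0xA0, 0xBF  # rejects overlong 3-byte forms
        elif lead == 0xED:
            extra, lo, hi = 2, 0x80, 0x9F  # rejects UTF-16 surrogates
        elif 0xE1 <= lead <= 0xEF:
            extra, lo, hi = 2, 0x80, 0xBF
        elif lead == 0xF0:
            extra, lo, hi = 3, 0x90, 0xBF  # rejects overlong 4-byte forms
        elif 0xF1 <= lead <= 0xF3:
            extra, lo, hi = 3, 0x80, 0xBF
        elif lead == 0xF4:
            extra, lo, hi = 3, 0x80, 0x8F  # rejects code points > U+10FFFF
        else:
            return i  # stray continuation byte, 0xC0, 0xC1, or 0xF5..0xFF
        if i + extra >= n:
            return i  # sequence is truncated at the end of the input
        if not lo <= data[i + 1] <= hi:
            return i
        for k in range(2, extra + 1):
            if not 0x80 <= data[i + k] <= 0xBF:
                return i
        i += extra + 1
    return n
```

With this sketch, valid input yields its own length, while an input such as `b"ab\xffcd"` yields the offset of the `0xFF` byte. A vectorized implementation has to report the same position while processing many bytes per instruction, which is where most of the difficulty lies.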

Intel Ice Lake results:

| data set | SimdUnicode AVX-512 (GB/s) | .NET speed (GB/s) | speed up |
|---|---|---|---|
| Twitter.json | 29 | 12 | 2.4 x |
| Arabic-Lipsum | 12 | 2.3 | 5.2 x |
| Chinese-Lipsum | 12 | 3.9 | 3.0 x |
| Emoji-Lipsum | 12 | 0.9 | 13 x |
| Hebrew-Lipsum | 12 | 2.3 | 5.2 x |
| Hindi-Lipsum | 12 | 2.1 | 5.7 x |
| Japanese-Lipsum | 10 | 3.5 | 2.9 x |
| Korean-Lipsum | 10 | 1.3 | 7.7 x |
| Latin-Lipsum | 76 | 76 | --- |
| Russian-Lipsum | 12 | 1.2 | 10 x |

Twitter.json
 SimdUnicode ▏   29 GB/s █████████████████████████
.NET Runtime ▏   12 GB/s ██████████▎

Arabic-Lipsum
 SimdUnicode ▏   12 GB/s █████████████████████████
.NET Runtime ▏  2.3 GB/s ████▊

Chinese-Lipsum
 SimdUnicode ▏   12 GB/s █████████████████████████
.NET Runtime ▏  3.9 GB/s ████████▏

Emoji-Lipsum
 SimdUnicode ▏   12 GB/s █████████████████████████
.NET Runtime ▏  0.9 GB/s █▉

Japanese-Lipsum
 SimdUnicode ▏   10 GB/s █████████████████████████
.NET Runtime ▏  3.5 GB/s ████████▊

Apple M2 results:

| data set | SimdUnicode speed (GB/s) | .NET speed (GB/s) | speed up |
|---|---|---|---|
| Twitter.json | 25 | 14 | 1.8 x |
| Arabic-Lipsum | 7.4 | 3.5 | 2.1 x |
| Chinese-Lipsum | 7.4 | 4.8 | 1.5 x |
| Emoji-Lipsum | 7.4 | 2.5 | 3.0 x |
| Hebrew-Lipsum | 7.4 | 3.5 | 2.1 x |
| Hindi-Lipsum | 7.3 | 3.0 | 2.4 x |
| Japanese-Lipsum | 7.3 | 4.6 | 1.6 x |
| Korean-Lipsum | 7.4 | 1.8 | 4.1 x |
| Latin-Lipsum | 87 | 38 | 2.3 x |
| Russian-Lipsum | 7.4 | 2.7 | 2.7 x |

On a Neoverse V1 (Graviton 3), our validation function is 1.3 to over five times
faster than the standard library.

| data set | SimdUnicode speed (GB/s) | .NET speed (GB/s) | speed up |
|---|---|---|---|
| Twitter.json | 14 | 8.7 | 1.4 x |
| Arabic-Lipsum | 4.2 | 2.0 | 2.1 x |
| Chinese-Lipsum | 4.2 | 2.6 | 1.6 x |
| Emoji-Lipsum | 4.2 | 0.8 | 5.3 x |
| Hebrew-Lipsum | 4.2 | 2.0 | 2.1 x |
| Hindi-Lipsum | 4.2 | 1.6 | 2.6 x |
| Japanese-Lipsum | 4.2 | 2.4 | 1.8 x |
| Korean-Lipsum | 4.2 | 1.3 | 3.2 x |
| Latin-Lipsum | 42 | 17 | 2.5 x |
| Russian-Lipsum | 4.2 | 0.95 | 4.4 x |

On a Qualcomm 8cx gen3 (Windows Dev Kit 2023), we get roughly the same relative performance
boost as the Neoverse V1.

| data set | SimdUnicode speed (GB/s) | .NET speed (GB/s) | speed up |
|---|---|---|---|
| Twitter.json | 17 | 10 | 1.7 x |
| Arabic-Lipsum | 5.0 | 2.3 | 2.2 x |
| Chinese-Lipsum | 5.0 | 2.9 | 1.7 x |
| Emoji-Lipsum | 5.0 | 0.9 | 5.5 x |
| Hebrew-Lipsum | 5.0 | 2.3 | 2.2 x |
| Hindi-Lipsum | 5.0 | 1.9 | 2.6 x |
| Japanese-Lipsum | 5.0 | 2.7 | 1.9 x |
| Korean-Lipsum | 5.0 | 1.5 | 3.3 x |
| Latin-Lipsum | 50 | 20 | 2.5 x |
| Russian-Lipsum | 5.0 | 1.2 | 5.2 x |

On a Neoverse N1 (Graviton 2), our validation function is up to over three times
faster than the standard library.

| data set | SimdUnicode speed (GB/s) | .NET speed (GB/s) | speed up |
|---|---|---|---|
| Twitter.json | 7.8 | 5.7 | 1.4 x |
| Arabic-Lipsum | 2.5 | 0.9 | 2.8 x |
| Chinese-Lipsum | 2.5 | 1.8 | 1.4 x |
| Emoji-Lipsum | 2.5 | 0.7 | 3.6 x |
| Hebrew-Lipsum | 2.5 | 0.9 | 2.7 x |
| Hindi-Lipsum | 2.3 | 1.0 | 2.3 x |
| Japanese-Lipsum | 2.4 | 1.7 | 1.4 x |
| Korean-Lipsum | 2.5 | 1.0 | 2.5 x |
| Latin-Lipsum | 23 | 13 | 1.8 x |
| Russian-Lipsum | 2.3 | 0.7 | 3.3 x |

Importantly, there is no patent involved, and no licensing issue. We are eager for reviews, feedback and so forth.

Note that we have other fast Unicode algorithms that could be implemented in C#, including fast transcoding functions. UTF-8 validation is simply the simplest non-trivial case.

This is joint work with @Nick-Nuon.

Further reading: Validating gigabytes of Unicode strings per second… in C#? (blog post)
