Consider using SimdUnicode.UTF8.GetPointerToFirstInvalidByte for UTF-8 validation #103781

The runtime has a great and fast function for UTF-8 validation: `Utf8Utility.GetPointerToFirstInvalidByte`. But we might be able to do better.

We implemented in C# the 'lookup' UTF-8 validation algorithm from:

- [Validating UTF-8 In Less Than One Instruction Per Byte](https://arxiv.org/abs/2010.03090), Software: Practice and Experience 51 (5), 2021

The algorithm is used by Oracle GraalVM and by the Node.js and Bun JavaScript runtimes. For example, [Node.js is capable of validating Arabic or Chinese strings at 17 GB/s on a 2 GHz Intel server (from JavaScript)](https://lemire.me/blog/2023/12/05/how-fast-can-you-validate-utf-8-strings-in-javascript/).

We adapted it so that it exactly matches the functionality of `Utf8Utility.GetPointerToFirstInvalidByte`, with a function called `SimdUnicode.UTF8.GetPointerToFirstInvalidByte`. It is available on GitHub at [simdutf/SimdUnicode](https://github.com/simdutf/SimdUnicode). We have good tests and decent benchmarks. We use .NET's excellent runtime dispatching functionality to select the best function (SSE4.2, AVX2, AVX-512, NEON, or a fallback). We used @EgorBo's [Disasmo](https://github.com/EgorBo/Disasmo) to help tune the code, although we make no claim that it is optimal (it probably is not).
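To make the contract concrete, here is a minimal scalar sketch of what "first invalid byte" means. It is not the SIMD lookup algorithm and not the SimdUnicode or runtime code: the names `Utf8Reference` and `IndexOfFirstInvalidByte` are hypothetical, it returns an index rather than a pointer, and it assumes the reported position is the start of the first ill-formed sequence. It also omits any additional outputs the real functions may produce.

```csharp
using System;

static class Utf8Reference
{
    // Hypothetical helper: returns the index of the first byte of the first
    // ill-formed UTF-8 sequence, or data.Length when the whole input is valid.
    public static int IndexOfFirstInvalidByte(ReadOnlySpan<byte> data)
    {
        int i = 0;
        while (i < data.Length)
        {
            byte b = data[i];
            if (b < 0x80) { i++; continue; }             // ASCII fast path

            int len; uint cp; uint min;
            if ((b & 0xE0) == 0xC0)      { len = 2; cp = (uint)(b & 0x1F); min = 0x80; }
            else if ((b & 0xF0) == 0xE0) { len = 3; cp = (uint)(b & 0x0F); min = 0x800; }
            else if ((b & 0xF8) == 0xF0) { len = 4; cp = (uint)(b & 0x07); min = 0x10000; }
            else return i;                               // stray continuation or invalid lead byte

            if (i + len > data.Length) return i;         // truncated sequence
            for (int k = 1; k < len; k++)
            {
                byte c = data[i + k];
                if ((c & 0xC0) != 0x80) return i;        // not a continuation byte
                cp = (cp << 6) | (uint)(c & 0x3F);
            }
            // Reject overlong encodings, values above U+10FFFF, and surrogates.
            if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return i;
            i += len;
        }
        return data.Length;
    }
}

static class Program
{
    static void Main()
    {
        byte[] ok  = { 0x68, 0xC3, 0xA9, 0x6C, 0x6C, 0x6F }; // "héllo"
        byte[] bad = { 0x61, 0xC0, 0x80 };                   // 'a' followed by an overlong sequence
        Console.WriteLine(Utf8Reference.IndexOfFirstInvalidByte(ok));  // prints 6
        Console.WriteLine(Utf8Reference.IndexOfFirstInvalidByte(bad)); // prints 1
    }
}
```

A byte-at-a-time loop like this is mainly useful as a test oracle: a vectorized validator should report the same position on every input.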

Intel Ice Lake results:

| data set        | SimdUnicode AVX-512 (GB/s) | .NET speed (GB/s) | speed up |
|:----------------|:------------------------|:-------------------|:-------------------|
| Twitter.json    | 29                      | 12                | 2.4 x |
| Arabic-Lipsum   | 12                    | 2.3               | 5.2 x |
| Chinese-Lipsum  | 12                    | 3.9               | 3.0 x |
| Emoji-Lipsum    | 12                     | 0.9               | 13 x |
| Hebrew-Lipsum   |12                    | 2.3               | 5.2 x |
| Hindi-Lipsum    | 12                     | 2.1               | 5.7 x |
| Japanese-Lipsum | 10                     | 3.5               | 2.9 x |
| Korean-Lipsum   | 10                     | 1.3               | 7.7 x |
| Latin-Lipsum    | 76                      | 76                | --- |
| Russian-Lipsum  | 12                    | 1.2               | 10 x |

```

Twitter.json
 SimdUnicode ▏   29 GB/s █████████████████████████
.NET Runtime ▏   12 GB/s ██████████▎

Arabic-Lipsum
 SimdUnicode ▏   12 GB/s █████████████████████████
.NET Runtime ▏  2.3 GB/s ████▊

Chinese-Lipsum
 SimdUnicode ▏   12 GB/s █████████████████████████
.NET Runtime ▏  3.9 GB/s ████████▏

Emoji-Lipsum
 SimdUnicode ▏   12 GB/s █████████████████████████
.NET Runtime ▏  0.9 GB/s █▉

Japanese-Lipsum
 SimdUnicode ▏   10 GB/s █████████████████████████
.NET Runtime ▏  3.5 GB/s ████████▊

```


Apple M2 results:

| data set      | SimdUnicode speed (GB/s) | .NET speed (GB/s) |  speed up |
|:----------------|:-----------|:--------------------------|:-------------------|
| Twitter.json    |  25        | 14                        | 1.8 x           |
| Arabic-Lipsum   |  7.4       | 3.5                       | 2.1 x           |
| Chinese-Lipsum  |  7.4       | 4.8                       | 1.5 x           |
| Emoji-Lipsum    |  7.4       | 2.5                       | 3.0 x           |
| Hebrew-Lipsum   |  7.4       | 3.5                       | 2.1 x           |
| Hindi-Lipsum    |  7.3       | 3.0                       | 2.4 x           |
| Japanese-Lipsum |  7.3       | 4.6                       | 1.6 x           |
| Korean-Lipsum   |  7.4       | 1.8                       | 4.1 x           |
| Latin-Lipsum    |  87        | 38                        | 2.3 x           |
| Russian-Lipsum  |  7.4       | 2.7                       | 2.7 x           |




On a Neoverse V1 (Graviton 3), our validation function is 1.3 to over five times
faster than the standard library.

| data set      | SimdUnicode speed (GB/s) | .NET speed (GB/s) |  speed up |
|:----------------|:-----------|:--------------------------|:-------------------|
| Twitter.json    |  14        | 8.7                        | 1.4 x           |
| Arabic-Lipsum   |  4.2       | 2.0                       | 2.1 x           |
| Chinese-Lipsum  |  4.2        | 2.6                       | 1.6 x           |
| Emoji-Lipsum    |  4.2        | 0.8                       | 5.3 x           |
| Hebrew-Lipsum   |  4.2        | 2.0                       | 2.1 x           |
| Hindi-Lipsum    |  4.2        | 1.6                       | 2.6 x           |
| Japanese-Lipsum |  4.2        | 2.4                       | 1.8 x           |
| Korean-Lipsum   |  4.2        | 1.3                       | 3.2 x           |
| Latin-Lipsum    |  42        | 17                        | 2.5 x           |
| Russian-Lipsum  |  4.2        | 0.95                       | 4.4 x           |



On a Qualcomm 8cx Gen 3 (Windows Dev Kit 2023), we get roughly the same relative performance
boost as on the Neoverse V1.

| data set      | SimdUnicode speed (GB/s) | .NET speed (GB/s) |  speed up |
|:----------------|:-----------|:--------------------------|:-------------------|
| Twitter.json    |  17        | 10                        | 1.7 x           |
| Arabic-Lipsum   |  5.0       | 2.3                       | 2.2 x           |
| Chinese-Lipsum  |  5.0       | 2.9                       | 1.7 x           |
| Emoji-Lipsum    |  5.0       | 0.9                       | 5.5 x           |
| Hebrew-Lipsum   |  5.0       | 2.3                       | 2.2 x           |
| Hindi-Lipsum    |  5.0       | 1.9                       | 2.6 x           |
| Japanese-Lipsum |  5.0       | 2.7                       | 1.9 x           |
| Korean-Lipsum   |  5.0       | 1.5                       | 3.3 x           |
| Latin-Lipsum    |  50        | 20                       | 2.5 x           |
| Russian-Lipsum  |  5.0       | 1.2                       | 5.2 x           |


On a Neoverse N1 (Graviton 2), our validation function is up to over three times
faster than the standard library.

| data set      | SimdUnicode speed (GB/s) | .NET speed (GB/s) |  speed up |
|:----------------|:-----------|:--------------------------|:-------------------|
| Twitter.json    |  7.8        | 5.7                        | 1.4 x           |
| Arabic-Lipsum   |  2.5       | 0.9                       | 2.8 x           |
| Chinese-Lipsum  |  2.5       | 1.8                       | 1.4 x           |
| Emoji-Lipsum    |  2.5       | 0.7                       | 3.6 x           |
| Hebrew-Lipsum   |  2.5       | 0.9                       | 2.7 x           |
| Hindi-Lipsum    |  2.3       | 1.0                       | 2.3 x           |
| Japanese-Lipsum |  2.4       | 1.7                       | 1.4 x           |
| Korean-Lipsum   |  2.5       | 1.0                       | 2.5 x           |
| Latin-Lipsum    |  23        | 13                        | 1.8 x           |
| Russian-Lipsum  |  2.3      | 0.7                       | 3.3 x           |



Importantly, there is no patent involved, and no licensing issue. We are eager for reviews, feedback and so forth. 

Note that we have other fast Unicode algorithms that could be implemented in C#, including fast transcoding functions. UTF-8 validation is just the simplest non-trivial case.


This is joint work with @Nick-Nuon.

**Further reading**: [Validating gigabytes of Unicode strings per second… in C#?](https://lemire.me/blog/2024/06/20/validating-gigabytes-of-unicode-strings-per-second-in-c/) (blog post)

