Add SSE2 implementations of to_chars and from_chars
#190
base: develop
4a1133b to 665de40
Thanks very much for all the work you've been doing.
... but please let's not do that.
665de40 to bdde014
Ok, I've removed the changes to CMakeLists.txt.
This adds SSE2 code paths to to_chars_x86.hpp. The performance effect on Intel Golden Cove (Core i7-12700K), gcc 13.3, in millions of to_chars() calls per second with a 16-byte aligned output buffer:

Char     | Generic | SSE2            | SSE4.1           | AVX2             | AVX10.1
=========+=========+=================+==================+==================+=================
char     | 202.314 | 564.857 (2.79x) | 1194.772 (5.91x) | 1192.094 (5.89x) | 1191.838 (5.89x)
char16_t | 188.532 | 457.281 (2.43x) |  795.798 (4.22x) |  935.016 (4.96x) |  938.368 (4.98x)
char32_t | 193.151 | 345.612 (1.79x) |  489.620 (2.53x) |  688.829 (3.57x) |  689.617 (3.57x)

Here, the Generic column was generated with BOOST_UUID_NO_SIMD defined, and the SSE2 column with -march=x86-64. SSE2 support can be useful when users need to stay compatible with the base x86-64 ISA.
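To illustrate the kind of work an SSE2-only to_chars path has to do, here is a minimal sketch that encodes 16 bytes as 32 lowercase hex digits using only SSE2 intrinsics. It is not the code from to_chars_x86.hpp: the function name bytes_to_hex_sse2 is hypothetical, the output omits the dashes of the UUID text format, and only the char case is shown.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>
#include <string>

// Illustrative only (hypothetical helper, not the library's implementation):
// encodes 16 bytes as 32 lowercase hex digits, without UUID dashes.
inline std::string bytes_to_hex_sse2(const std::uint8_t (&bytes)[16])
{
    const __m128i in = _mm_loadu_si128(reinterpret_cast<const __m128i*>(bytes));
    const __m128i low_mask = _mm_set1_epi8(0x0F);

    // Split each byte into its high and low nibble (values 0..15).
    // SSE2 has no 8-bit shift, so shift 16-bit lanes and mask off stray bits.
    const __m128i hi = _mm_and_si128(_mm_srli_epi16(in, 4), low_mask);
    const __m128i lo = _mm_and_si128(in, low_mask);

    // Interleave so that each byte's high nibble precedes its low nibble.
    const __m128i first  = _mm_unpacklo_epi8(hi, lo);  // nibbles of bytes 0..7
    const __m128i second = _mm_unpackhi_epi8(hi, lo);  // nibbles of bytes 8..15

    // Map 0..9 -> '0'..'9' and 10..15 -> 'a'..'f' without a table lookup
    // (SSE2 has no pshufb): add '0', then add an extra offset where nibble > 9.
    const __m128i nine = _mm_set1_epi8(9);
    const __m128i zero_char = _mm_set1_epi8('0');
    const __m128i alpha_adj = _mm_set1_epi8('a' - '0' - 10);
    auto to_ascii = [&](__m128i n) {
        const __m128i is_alpha = _mm_cmpgt_epi8(n, nine);  // 0xFF where n > 9
        return _mm_add_epi8(_mm_add_epi8(n, zero_char),
                            _mm_and_si128(is_alpha, alpha_adj));
    };

    char out[32];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), to_ascii(first));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out + 16), to_ascii(second));
    return std::string(out, 32);
}
```

The compare-and-add step stands in for the byte shuffle that SSSE3 and later instruction sets can use as a 16-entry lookup table, which is one plausible reason the SSE2 column trails the SSE4.1 column in the table above.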
This adds SSE2 and SSSE3 code paths to from_chars_x86.hpp. The performance effect on Intel Golden Cove (Core i7-12700K), gcc 13.3, in millions of successful from_chars() calls per second:

Char     | Generic | SSE2            | SSSE3           | SSE4.1          | AVX2            | AVX512v1
=========+=========+=================+=================+=================+=================+================
char     |  40.475 | 327.791 (8.10x) | 465.857 (11.5x) | 555.346 (13.7x) | 504.648 (12.5x) | 539.700 (13.3x)
char16_t |  38.757 | 292.048 (7.54x) | 401.117 (10.3x) | 478.574 (12.3x) | 426.188 (11.0x) | 416.205 (10.7x)
char32_t |  50.200 | 150.900 (3.01x) | 204.588 (4.08x) | 389.882 (7.77x) | 359.591 (7.16x) | 349.663 (6.97x)

In addition, the workarounds to avoid (v)pblendvb instructions have been extended to Intel Haswell and Broadwell, as these microarchitectures have poor performance with these instructions (including the SSE4.1 pblendvb).

Two new experimental control macros were added: BOOST_UUID_FROM_CHARS_X86_SLOW_PBLENDVB and BOOST_UUID_FROM_CHARS_X86_USE_PBLENDVB. The former indicates that (v)pblendvb instructions are slow and should be avoided on the target microarchitectures. The latter indicates that (v)pblendvb should be used by the implementation. The latter macro is derived from the former and takes precedence. As before, these macros can be used for experimenting and fine-tuning performance for specific target CPUs. By default, BOOST_UUID_FROM_CHARS_X86_SLOW_PBLENDVB is defined for Haswell/Broadwell or if AVX is detected.

Lastly, unified the selection between blend-based and shuffle-based character code conversion in various places. It is now controlled by a single internal macro, BOOST_UUID_DETAIL_FROM_CHARS_X86_USE_BLENDS.
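For readers unfamiliar with what avoiding (v)pblendvb involves, the following sketch shows one common SSE2-only substitute: a byte blend built from and/andnot/or, applied to a blend-based hex digit conversion. The names blend_bytes_sse2 and hex_digits_to_values are hypothetical, the input is assumed to contain only valid hex digits, and none of this is taken from from_chars_x86.hpp.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <array>
#include <cstdint>

// Hypothetical helper: selects b where mask is 0xFF and a where mask is 0x00.
// Equivalent to pblendvb when every mask byte is all-ones or all-zeros.
inline __m128i blend_bytes_sse2(__m128i a, __m128i b, __m128i mask)
{
    // _mm_andnot_si128(x, y) computes (~x) & y.
    return _mm_or_si128(_mm_and_si128(mask, b), _mm_andnot_si128(mask, a));
}

// Illustrative only: converts 16 ASCII hex digits to their values 0..15.
// Assumes every input character is a valid hex digit (no error detection).
inline std::array<std::uint8_t, 16> hex_digits_to_values(const char (&chars)[16])
{
    const __m128i c = _mm_loadu_si128(reinterpret_cast<const __m128i*>(chars));

    // Letters sort after '9' in ASCII, so one signed compare classifies them.
    const __m128i is_letter = _mm_cmpgt_epi8(c, _mm_set1_epi8('9'));

    // Candidate result for '0'..'9'.
    const __m128i digit_val = _mm_sub_epi8(c, _mm_set1_epi8('0'));

    // Candidate result for letters: fold to lowercase, then map 'a' to 10.
    const __m128i lower = _mm_or_si128(c, _mm_set1_epi8(0x20));
    const __m128i letter_val = _mm_sub_epi8(lower, _mm_set1_epi8('a' - 10));

    // Pick the right candidate per byte without using pblendvb.
    const __m128i v = blend_bytes_sse2(digit_val, letter_val, is_letter);

    std::array<std::uint8_t, 16> out;
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out.data()), v);
    return out;
}
```

The emulated blend costs three simple bitwise instructions instead of one pblendvb, so whether it is a win depends on how the target microarchitecture executes pblendvb, which is the trade-off the control macros above expose.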
This allows for testing that the ISA-specific code at least compiles, even if running the tests isn't possible. The support is only added to b2, CMake still always compiles and runs the tests to keep using boost_test_jamfile for easier maintenance. In the future, similar support can be added to CMake as well.
The targets verify the respective code paths in SIMD algorithms. The recently added SSE2 paths are already tested in the other, unspecialized jobs. Also added jobs to compile tests with BOOST_UUID_TO_FROM_CHARS_X86_USE_ZMM and BOOST_UUID_FROM_CHARS_X86_USE_VPERMI2B experimental macros defined.
bdde014 to 3af17a4
to_chars() calls per second with a 16-byte aligned output buffer:
Test source code: uuid_to_chars_perftest.cpp
Compile with:
Full results: uuid_to_chars_perftest.txt
Test source code: uuid_from_chars_perftest.cpp
Compile with:
Full results: uuid_from_chars_perftest.txt
In addition, the workarounds to avoid (v)pblendvb instructions have been extended to Intel Haswell and Broadwell, as these microarchitectures have poor performance with these instructions (including the SSE4.1 pblendvb).

Two new experimental control macros were added: BOOST_UUID_FROM_CHARS_X86_SLOW_PBLENDVB and BOOST_UUID_FROM_CHARS_X86_USE_PBLENDVB. The former indicates that (v)pblendvb instructions are slow and should be avoided on the target microarchitectures. The latter indicates that (v)pblendvb should be used by the implementation. The latter macro is derived from the former and takes precedence. As before, these macros can be used for experimenting and fine-tuning performance for specific target CPUs. By default, BOOST_UUID_FROM_CHARS_X86_SLOW_PBLENDVB is defined for Haswell/Broadwell or if AVX is detected.

Also, unified the selection between blend-based and shuffle-based character code conversion in various places. It is now controlled by a single internal macro, BOOST_UUID_DETAIL_FROM_CHARS_X86_USE_BLENDS.

NOTE: The tests used in this PR were modified, so the performance numbers presented here may not be comparable with the previous PRs.
As a side effect of this, CMakeLists.txt no longer uses boost_test_jamfile as it doesn't support custom logic for test type selection.

Also added jobs to compile tests with BOOST_UUID_TO_FROM_CHARS_X86_USE_ZMM and BOOST_UUID_FROM_CHARS_X86_USE_VPERMI2B experimental macros defined.