* implement faster floating-point `isless`
Previously `isless` relied on the C intrinsic `fpislt` in
`src/runtime_intrinsics.c`, while the new implementation in Julia
arguably generates better code, namely:
1. The NaN check compiles to a single instruction plus a branch that
is amenable to branch prediction in most use cases (i.e. comparing
non-NaN floats), thus speeding up execution.
2. The compiler now often manages to remove the NaN check entirely if
the embedding code has already proven the arguments to be non-NaN.
3. The actual operation compares both arguments as sign-magnitude
integers instead of doing a case analysis based on the sign of one
argument. This symmetric treatment may generate vectorized
instructions for the sign-magnitude conversion depending on how the
arguments are laid out.
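The three points above can be illustrated with a minimal sketch. It is
restricted to `Float64`, and the names `fpint_sketch` and
`isless_sketch` are hypothetical, chosen for illustration; the code
merged in this commit may differ in detail.

```julia
# Hypothetical helper: map the bit pattern of a Float64 to an Int64 whose
# ordinary signed ordering matches the sign-magnitude ordering of the float.
@inline function fpint_sketch(x::Float64)
    ix = reinterpret(Int64, x)
    # For negative floats (sign bit set) flip all magnitude bits so that a
    # larger magnitude maps to a smaller (more negative) integer.
    return ifelse(ix < 0, ix ⊻ typemax(Int64), ix)
end

# Hypothetical stand-in for the new `isless` on Float64.
@inline function isless_sketch(a::Float64, b::Float64)
    # NaN handling: NaN is never less than anything, and every non-NaN
    # value is less than NaN.
    (isnan(a) || isnan(b)) && return !isnan(a)
    # Both arguments get the same treatment, with no case split on sign.
    return fpint_sketch(a) < fpint_sketch(b)
end
```

Both arguments go through the same `ifelse`-based conversion, which is
what leaves room for the compiler to vectorize it, while the early
return keeps the NaN handling to one predictable branch.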
The actual behaviour of `isless` did not change and apart from the
Julia-specific NaN-handling (which may be up for debate) the resulting
total order corresponds to the IEEE-754 specified `totalOrder`.
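As a small illustration of where the two orders agree and differ (a
sketch using only `Base` functions; the variable `sorted` is introduced
here for the example): `isless` distinguishes signed zeros as
`totalOrder` does, but it places every NaN last regardless of its sign
bit, whereas `totalOrder` puts negative NaNs before all other values.

```julia
# Signed zeros are ordered, unlike with `<`.
@assert isless(-0.0, 0.0)
@assert !(-0.0 < 0.0)

# Every NaN sorts after all non-NaN values, whatever its sign bit.
@assert isless(Inf, NaN)
@assert isless(Inf, -NaN)

# NaNs are not ordered among themselves.
@assert !isless(NaN, -NaN) && !isless(-NaN, NaN)

# `sort` uses `isless` by default, so NaN ends up last.
sorted = sort([NaN, 1.0, 0.0, -0.0, -Inf])
@assert isequal(sorted, [-Inf, -0.0, 0.0, 1.0, NaN])
```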
While the new implementation no longer generates fully branchless code,
I did not manage to construct a use case where this was detrimental:
the saved work seems to outweigh the potential cost of a branch
misprediction in all of my tests with various NaN-polluted data. Also,
auto-vectorization was not effective with the previous `fpislt` either.
Quick benchmarks (AMD A10-7860K) on `sort`, avoiding the specialized
algorithm:
```julia
using BenchmarkTools  # provides the @btime macro

a = rand(1000);
@btime sort($a, lt=(a,b)->isless(a,b));
# before: 56.030 μs (1 allocation: 7.94 KiB)
# after: 40.853 μs (1 allocation: 7.94 KiB)
a = rand(1000000);
@btime sort($a, lt=(a,b)->isless(a,b));
# before: 159.499 ms (2 allocations: 7.63 MiB)
# after: 120.536 ms (2 allocations: 7.63 MiB)
a = [rand((rand(), NaN)) for _ in 1:1000000];
@btime sort($a, lt=(a,b)->isless(a,b));
# before: 111.925 ms (2 allocations: 7.63 MiB)
# after: 77.669 ms (2 allocations: 7.63 MiB)
```
* Remove old intrinsic `fpislt` code
Co-authored-by: Mustafa Mohamad <mus-m@outlook.com>