Distressing performance unless promotion is manual #21065

@timholy

Description

Given the impact of the splatting penalty in other areas, I've long wondered why the generic fallback definitions in promotion.jl (which splat the result of promote) aren't problematic. It turns out that, in some circumstances, they are:
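For reference, the definitions in question are the generic Number fallbacks in base/promotion.jl. A minimal sketch using standalone names (these are not the actual Base methods, which are of the form `*(x::Number, y::Number) = *(promote(x, y)...)`):

```julia
# Stand-in for the Base fallback: splat the tuple returned by promote.
splat_mul(x::Number, y::Number) = *(promote(x, y)...)

# The manual-promotion alternative explored below: destructure explicitly,
# then multiply, so no splat is involved.
manual_mul(x::Number, y::Number) = ((xp, yp) = promote(x, y); xp * yp)
```

Both compute the same value; the question below is why the splatted form is so much slower for some types.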

julia> using FixedPointNumbers

julia> x = 0.2f0
0.2f0

julia> y = N0f8(0.2)
0.2N0f8

julia> @code_llvm x*y

define float @"julia_*_68175"(float, %Normed*) #0 !dbg !5 {
top:
  %ptls_i8 = call i8* asm "movq %fs:0, $0;\0Aaddq $$-10896, $0", "=r,~{dirflag},~{fpsr},~{flags}"() #3
  %ptls = bitcast i8* %ptls_i8 to i8****
  %2 = alloca [4 x i8**], align 8
  %.sub = getelementptr inbounds [4 x i8**], [4 x i8**]* %2, i64 0, i64 0
  %3 = getelementptr [4 x i8**], [4 x i8**]* %2, i64 0, i64 2
  %4 = bitcast i8*** %3 to i8*
  call void @llvm.memset.p0i8.i32(i8* %4, i8 0, i32 16, i32 8, i1 false)
  %5 = bitcast [4 x i8**]* %2 to i64*
  store i64 4, i64* %5, align 8
  %6 = getelementptr [4 x i8**], [4 x i8**]* %2, i64 0, i64 1
  %7 = bitcast i8* %ptls_i8 to i64*
  %8 = load i64, i64* %7, align 8
  %9 = bitcast i8*** %6 to i64*
  store i64 %8, i64* %9, align 8
  store i8*** %.sub, i8**** %ptls, align 8
  %10 = getelementptr [4 x i8**], [4 x i8**]* %2, i64 0, i64 3
  %11 = call [2 x float] @julia_promote_68176(float %0, %Normed* %1)
  %.elt = extractvalue [2 x float] %11, 0
  %.elt2 = extractvalue [2 x float] %11, 1
  store i8** inttoptr (i64 139997130231784 to i8**), i8*** %3, align 8
  %12 = call i8** @jl_gc_pool_alloc(i8* %ptls_i8, i32 1440, i32 16)
  %13 = getelementptr i8*, i8** %12, i64 -1
  %14 = bitcast i8** %13 to i8***
  store i8** inttoptr (i64 139997180171536 to i8**), i8*** %14, align 8
  %15 = bitcast i8** %12 to float*
  store float %.elt, float* %15, align 8
  %.sroa_raw_cast5 = bitcast i8** %12 to i8*
  %.sroa_raw_idx6 = getelementptr inbounds i8, i8* %.sroa_raw_cast5, i64 4
  %16 = bitcast i8* %.sroa_raw_idx6 to float*
  store float %.elt2, float* %16, align 4
  store i8** %12, i8*** %10, align 8
  %17 = call i8** @jl_f__apply(i8** null, i8*** %3, i32 2)
  %18 = bitcast i8** %17 to float*
  %19 = load float, float* %18, align 16
  %20 = load i64, i64* %9, align 8
  store i64 %20, i64* %7, align 8
  ret float %19
}

julia> function myprod1(x, y)
           xp, yp = promote(x, y)
           xp*yp
       end
myprod1 (generic function with 1 method)

julia> function myprod2(x, y)
           T = promote_type(typeof(x), typeof(y))
           convert(T, x) * convert(T, y)
       end
myprod2 (generic function with 1 method)

julia> @code_llvm myprod1(x, y)

define float @julia_myprod1_68215(float, %Normed*) #0 !dbg !5 {
top:
  %2 = call [2 x float] @julia_promote_68176(float %0, %Normed* %1)
  %.elt = extractvalue [2 x float] %2, 0
  %.elt2 = extractvalue [2 x float] %2, 1
  %3 = fmul float %.elt, %.elt2
  ret float %3
}

julia> @code_llvm myprod2(x, y)

define float @julia_myprod2_68216(float, %Normed*) #0 !dbg !5 {
top:
  %2 = getelementptr inbounds %Normed, %Normed* %1, i64 0, i32 0
  %3 = load i8, i8* %2, align 1
  %4 = uitofp i8 %3 to float
  %5 = fmul fast float %4, 0x3F70101020000000
  %6 = fmul float %5, %0
  ret float %6
}
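Aside (my annotation, not part of the original output): the constant `0x3F70101020000000` in the `fmul fast` above appears to be LLVM's hex notation for the Float64 widening of `Float32(1/255)`, i.e. the Normed scale factor folded into a single multiply. This can be checked directly:

```julia
# Reinterpret LLVM's hex double constant and compare against 1/255 in Float32.
c = reinterpret(Float64, 0x3F70101020000000)
@assert Float32(c) == 1.0f0 / 255.0f0
```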

julia> using BenchmarkTools

julia> @benchmark $x*$y seconds=1
BenchmarkTools.Trial: 
  memory estimate:  64 bytes
  allocs estimate:  4
  --------------
  minimum time:     73.504 ns (0.00% GC)
  median time:      74.888 ns (0.00% GC)
  mean time:        79.331 ns (3.29% GC)
  maximum time:     1.387 μs (87.63% GC)
  --------------
  samples:          10000
  evals/sample:     972
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark myprod1($x, $y) seconds=1
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.037 ns (0.00% GC)
  median time:      5.041 ns (0.00% GC)
  mean time:        5.101 ns (0.00% GC)
  maximum time:     18.131 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark myprod2($x, $y) seconds=1
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.703 ns (0.00% GC)
  median time:      3.706 ns (0.00% GC)
  mean time:        3.738 ns (0.00% GC)
  maximum time:     14.419 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000
  time tolerance:   5.00%
  memory tolerance: 1.00%

Note that this is in stark contrast to the case with "conventional" types:

julia> z = 3
3

julia> @which x*z
*(x::Number, y::Number) in Base at promotion.jl:247

julia> @which x*y
*(x::Number, y::Number) in Base at promotion.jl:247

julia> @benchmark $x*$z seconds=1
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.025 ns (0.00% GC)
  median time:      2.030 ns (0.00% GC)
  mean time:        2.060 ns (0.00% GC)
  maximum time:     12.631 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @code_llvm x*z

define float @"julia_*_68579"(float, i64) #0 !dbg !5 {
top:
  %2 = sitofp i64 %1 to float
  %3 = fmul float %2, %0
  ret float %3
}

Before I make the obvious fix, is there a deeper issue to understand here?

Thanks to @Evizero for the observation that led to this investigation. CC @vchuravy, since it seriously affects FixedPointNumbers (FPN).
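For reference, one shape the "obvious fix" could take is direct methods that convert manually, bypassing the splatting fallback entirely. A hypothetical sketch with a stand-in type (`MyNormed` is an illustration, not the real FixedPointNumbers type):

```julia
# Stand-in for a FixedPointNumbers-style fixed-point wrapper (hypothetical).
struct MyNormed <: Real
    i::UInt8
end

# Conversion and promotion to Float32; convert falls back to the constructor.
Base.Float32(x::MyNormed) = Float32(x.i) / 255f0
Base.promote_rule(::Type{MyNormed}, ::Type{Float32}) = Float32

# Direct mixed-type methods: promote manually, no splatted promote call.
Base.:*(x::Float32, y::MyNormed) = x * Float32(y)
Base.:*(x::MyNormed, y::Float32) = Float32(x) * y
```

With these definitions, `0.5f0 * MyNormed(0xff)` hits the direct method rather than the generic `*(x::Number, y::Number)` fallback.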
