Julia: sum|mapreduce versus unrolled for-loop. performance disparity

Created on 8 Feb 2017  ·  3Comments  ·  Source: JuliaLang/julia

mapreduce can be much slower than the equivalent for-loop. A small (real-life) example:

function dsum(A::Matrix)
    z = zero(A[1,1])
    n = Base.LinAlg.checksquare(A)
    B = Vector{typeof(z)}(n)

    @inbounds for j in 1:n
        B[j] = mapreduce(k -> A[j,k]*A[k,j], +, z, 1:j)
    end
    B
end
function dfor(A::Matrix)
    z = zero(A[1,1])
    n = Base.LinAlg.checksquare(A)
    B = Vector{typeof(z)}(n)

    @inbounds for j in 1:n
        d = z
        for k in 1:j
            d += A[j,k]*A[k,j]
        end
        B[j] = d
    end
    B
end

A = randn(127,127)
time(median(@benchmark dsum(A)))/time(median(@benchmark dfor(A)))

gives me a performance ratio of about x50 on Julia 0.5, juliabox.com. I think this could be because the for-loop can be automatically simd, and the mapreduce isn't? When A = randn(N,N) and N is 16, the gap is around x75, and for N = 10000, the gap is around x25. Replacing the array access A[j,k] with A[rand(1:size(A,1)),rand(1:size(A,2))] destroys the performance on both, but the ratio becomes x1.

  1. Is simd the reason why one is x50 faster?
  2. Should this be described in Performance Tips? mapreduce underlies sum, so this could be a popular trap that isn't currently mentioned
  3. Would this be a useful benchmark? on nanosoldier?
  4. Could the performance gap be smaller?

(Benchmarking mapreduce versus for-loops without array access, I still see a x2 performance gap. E.g. mapreduce(identity, +, 0, i for i in 1:n) versus the equivalent integer-summing for loop. It looks like this gap used to be smaller? Worth another benchmark in CI?)

performance

Most helpful comment

+1 for putting a benchmark of this in BaseBenchmarks; a PR there would be great.

All 3 comments

Looks like dup of https://github.com/JuliaLang/julia/issues/15276.

Doing:

function dsum(A::Matrix)
    z = zero(A[1,1])
    n = Base.LinAlg.checksquare(A)
    B = Vector{typeof(z)}(n)
    @inbounds for j::Int in 1:n
        B[j] = _help(A, j, z)
    end
    B
end

_help(A, j, z) = mapreduce(k -> A[j,k]*A[k,j], +, z, 1:j)

gives

julia> time(median(@benchmark dsum(A)))/time(median(@benchmark dfor(A)))
1.0013213312412255

You can see the problem by @code_warntype and looking for Core.Box.

+1 for putting a benchmark of this in BaseBenchmarks; a PR there would be great.

Closing as dup of #15276.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

Keno picture Keno  ·  3Comments

omus picture omus  ·  3Comments

i-apellaniz picture i-apellaniz  ·  3Comments

tkoolen picture tkoolen  ·  3Comments

helgee picture helgee  ·  3Comments