julia: efficient ways to vcat n arrays

2019-08-05 10:01发布

I have a data structure that I have loaded in from json that resembles the below

json_in =
  [ Dict("customer" => "cust1", "transactions" => 1:10^6)
  , Dict("customer" => "cust2", "transactions" => 1:10^6)
  , Dict("customer" => "cust3", "transactions" => 1:10^6)]

I know of two methods to collapse the transactions into one array

@time methodA = reduce(vcat,[cust["transactions"] for cust in json_in])
@time methodB = vcat(json_in[1]["transactions"],json_in[2]["transactions"],json_in[3]["transactions"])

However the timing of methodA is ~0.22s vs ~0.02s for methodB on my computer. I intend to perform this thousands of times so 10x quicker performance is a big deal.

I see methodB is not very robust as it can only deal with 3 Dicts (customers) so even though it's performant it doesn't generalise.

What would be the most efficient way to concatenate arrays that are elements in an array of Dict efficiently?

标签: arrays julia
2条回答
Anthone
2楼-- · 2019-08-05 10:50

As @Gnimuc states in his comment, you should not benchmark in global scope, and benchmarks are best done using BenchmarkTools.jl - here are the timings done right:

julia> methodA(json_in) = reduce(vcat,[cust["transactions"] for cust in json_in])
method1 (generic function with 1 method)

julia> methodB(json_in) = vcat(json_in[1]["transactions"],json_in[2]["transactions"],json_in[3]["transactions"])
method2 (generic function with 1 method)

#Gnimuc's syntax from his comment
julia> methodC(json_in) = mapreduce(x->x["transactions"], vcat, json_in)
method3 (generic function with 1 method)

julia> using BenchmarkTools

julia> @benchmark methodA(json_in)
BenchmarkTools.Trial:
  memory estimate:  38.15 MiB
  allocs estimate:  15
  --------------
  minimum time:     10.584 ms (3.10% GC)
  median time:      14.781 ms (32.02% GC)
  mean time:        15.112 ms (32.19% GC)
  maximum time:     69.341 ms (85.28% GC)
  --------------
  samples:          331
  evals/sample:     1

julia> @benchmark methodB(json_in)
BenchmarkTools.Trial:
  memory estimate:  22.89 MiB
  allocs estimate:  2
  --------------
  minimum time:     5.921 ms (5.92% GC)
  median time:      8.402 ms (32.48% GC)
  mean time:        8.701 ms (33.46% GC)
  maximum time:     69.268 ms (91.09% GC)
  --------------
  samples:          574
  evals/sample:     1

julia> @benchmark methodC(json_in)
BenchmarkTools.Trial:
  memory estimate:  38.15 MiB
  allocs estimate:  12
  --------------
  minimum time:     10.599 ms (3.37% GC)
  median time:      14.843 ms (32.12% GC)
  mean time:        15.228 ms (32.24% GC)
  maximum time:     71.954 ms (85.95% GC)
  --------------
  samples:          328
  evals/sample:     1

Method B is still like twice as fast. That is exactly because it is more specialized, on an array with exactly three elements.

An alternative solution that might work well here is to use a MappedArray, which creates a lazy view into the original array:

using MappedArrays
method4(json_in) = mappedarray(x->x["transactions"], json_in)

Of course this doesn't concatenate the arrays, but you can concatenate views using the CatView package:

using CatViews
julia> method5(json_in) = reduce(CatView, mappedarray(x->x["transactions"], json_in))
method5 (generic function with 1 method)

julia> @benchmark method5(json_in)
BenchmarkTools.Trial:
  memory estimate:  1.73 KiB
  allocs estimate:  46
  --------------
  minimum time:     23.320 μs (0.00% GC)
  median time:      23.916 μs (0.00% GC)
  mean time:        25.466 μs (0.00% GC)
  maximum time:     179.092 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Because it doesn't allocate it is like 300x faster than method B (but it's possible it's slower to use the result because of nonlocality - worth benchmarking).

查看更多
狗以群分
3楼-- · 2019-08-05 10:58

Thanks for the help, after some research I came up with this idea to inline expand the code using macros, see code below, and it performs pretty well on the benchmarks (on Juliabox.com 21Sep2017)

macro inline_vcat(a)
  quote
    astr = $(string(a))
    s = reduce(string, string(astr,"[",aa,"][\"transactions\"],") for aa in 1:length($a))    
    string("vcat(", s[1:(end-1)],")")
  end
end

methodE(json_in) = (@inline_vcat json_in) |> parse |> eval

using BenchmarkTools
@benchmark methodE(json_in)

enter image description here

One shortcoming of this method is that if there are a large (~1million) customers in the JSON then the code generated will be long and parsing it would take a long time I assume well. Hence it's probably not a good idea for large datasets.

查看更多
登录 后发表回答