Vectorized “in” function in julia?

Posted 2019-01-07 23:37

I often want to loop over a long array or column of a dataframe, and for each item, see if it is a member of another array. Rather than doing

giant_list = ["a", "c", "j"]
good_letters = ["a", "b"]
isin = falses(size(giant_list,1))
for i=1:size(giant_list,1)
    isin[i] = giant_list[i] in good_letters
end

Is there any vectorized (doubly-vectorized?) way to do this in Julia? By analogy with the basic operators, I want to do something like

isin = giant_list .in good_letters

I realize this may not be possible, but I just wanted to make sure I wasn't missing something. I know I could probably use DefaultDict from DataStructures to do something similar, but I don't know of anything in Base.

4 answers
兄弟一词,经得起流年.
#2 · 2019-01-08 00:13

findin() doesn't give you a boolean mask, but you can easily use it to subset an array/DataFrame for values that are contained in another array:

julia> giant_list[findin(giant_list, good_letters)]
1-element Array{String,1}:
 "a"
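
For readers on Julia 1.0 or later, where findin is no longer available, the same subsetting can be sketched with findall or filter. This is only an illustration under that version assumption; the variable names idx, subset_by_index and subset_by_filter are made up for the example.

```julia
giant_list = ["a", "c", "j"]
good_letters = ["a", "b"]

# in(good_letters) builds a one-argument predicate x -> x in good_letters
# (available since Julia 1.0).
# Indices of the elements of giant_list that occur in good_letters:
idx = findall(in(good_letters), giant_list)

# Subset by index, or filter directly without the intermediate indices.
subset_by_index = giant_list[idx]
subset_by_filter = filter(in(good_letters), giant_list)
```

Both forms return the matching elements in their original order; filter avoids materializing the index vector when only the values are needed.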
时光不老,我们不散
#3 · 2019-01-08 00:18

The indexin function does something similar to what you want:

indexin(a, b)

Returns a vector containing the highest index in b for each value in a that is a member of b. The output vector contains 0 wherever a is not a member of b.

Since you want a boolean for each element in your giant_list (instead of the index in good_letters), you can simply do:

julia> indexin(giant_list, good_letters) .> 0
3-element BitArray{1}:
  true
 false
 false
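
The description and the `.> 0` comparison above match the Julia version this answer was written for. In Julia 1.x, indexin returns the first matching index and uses nothing (rather than 0) for non-members, so the comparison against 0 no longer applies. A minimal sketch of the equivalent mask, assuming Julia 1.0 or later (the names idx and mask are illustrative):

```julia
giant_list = ["a", "c", "j"]
good_letters = ["a", "b"]

# In Julia 1.x each entry is either an Int index into good_letters
# or `nothing` when the element is not found.
idx = indexin(giant_list, good_letters)

# Turn the indices into a boolean mask by testing against `nothing`.
mask = [i !== nothing for i in idx]
```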

The implementation of indexin is very straightforward, and points the way to how you might optimize this if you don't care about the indices in b:

function vectorin(a, b)
    bset = Set(b)
    [i in bset for i in a]
end

Only a limited set of names may be used as infix operators, so a function such as vectorin cannot be called with infix syntax.
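
The Set-based idea behind vectorin can also be written as a broadcast, which is one way to sketch it on Julia 1.0 or later. The Ref wrapper is there so that the set is treated as a single value rather than iterated over; the names bset and mask are illustrative.

```julia
giant_list = ["a", "c", "j"]
good_letters = ["a", "b"]

# Same helper as above: build the Set once, then test each element.
function vectorin(a, b)
    bset = Set(b)
    [i in bset for i in a]
end

# Broadcast form: hashing-based lookup in the Set for every element.
bset = Set(good_letters)
mask = in.(giant_list, Ref(bset))
```

Building the Set costs one pass over good_letters, after which each membership test is a hash lookup instead of a linear scan, which tends to matter only when good_letters is long.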

戒情不戒烟
#4 · 2019-01-08 00:25

You can vectorize in quite easily in Julia v0.6, using the unified broadcasting syntax.

julia> in.(giant_list, (good_letters,))
3-element Array{Bool,1}:
  true
 false
 false

Note the scalarification of good_letters by using a one-element tuple. Alternatively, you can use a Scalar type such as the one introduced in StaticArrays.jl.

Julia v0.5 supports the same syntax, but requires a specialized function for scalarification (or the Scalar type mentioned earlier):

scalar(x) = setindex!(Array{typeof(x)}(), x)

after which

julia> in.(giant_list, scalar(good_letters))
3-element Array{Bool,1}:
  true
 false
 false
太酷不给撩
#5 · 2019-01-08 00:27

There are a handful of modern (i.e. Julia v1.0) solutions to this problem:

First, an update to the scalar strategy. Rather than using a 1-element tuple or array, scalar broadcasting can be achieved using a Ref object:

julia> in.(giant_list, Ref(good_letters))
3-element BitArray{1}:
  true
 false
 false
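
To see why the Ref wrapper matters, here is a small sketch (assuming Julia 1.0 or later) of what happens without it: broadcasting tries to pair the elements of the two arrays one-to-one, and with lengths 3 and 2 that fails. The name unwrapped_fails is made up for the example.

```julia
giant_list = ["a", "c", "j"]
good_letters = ["a", "b"]

# With Ref, good_letters is held fixed while giant_list is iterated.
mask = in.(giant_list, Ref(good_letters))

# Without Ref, broadcast tries to zip the two arrays element by element,
# which cannot work for lengths 3 and 2.
unwrapped_fails = try
    in.(giant_list, good_letters)
    false
catch err
    err isa DimensionMismatch
end
```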

The same result can be achieved by broadcasting the infix ∈ operator (typed as \in followed by Tab):

julia> giant_list .∈ Ref(good_letters)
3-element BitArray{1}:
  true
 false
 false

Additionally, calling in with one argument creates a Base.Fix2, which may later be applied via a broadcasted call. This seems to have limited benefits compared to simply defining a function, though.

julia> is_good1 = in(good_letters);
       is_good2(x) = x in good_letters;

julia> is_good1.(giant_list)
3-element BitArray{1}:
  true
 false
 false

julia> is_good2.(giant_list)
3-element BitArray{1}:
  true
 false
 false

All in all, using .∈ with a Ref will probably lead to the shortest, cleanest code.
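
As a usage sketch for the original goal of filtering a long array or column, the resulting mask can be used directly for indexing and counting. This assumes Julia 1.0 or later; the names kept, dropped and n_good are illustrative.

```julia
giant_list = ["a", "c", "j"]
good_letters = ["a", "b"]

# Boolean mask: true where the element is a member of good_letters.
mask = giant_list .∈ Ref(good_letters)

kept = giant_list[mask]        # elements that are in good_letters
dropped = giant_list[.!mask]   # elements that are not
n_good = count(mask)           # number of matching elements
```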
