Similar questions have already been asked on SO, but they have more specific constraints and their answers don't apply to my question.
Generally speaking, what is the most pythonic way to determine if an arbitrary numpy array is a subset of another array? More specifically, I have a roughly 20000x3 array and I need to know the indices of the 1x3 elements that are entirely contained within a set. More generally, is there a more pythonic way of writing the following:
master=[12,155,179,234,670,981,1054,1209,1526,1667,1853] #some indices of interest
triangles=np.random.randint(2000,size=(20000,3)) #some data
for i,x in enumerate(triangles):
if x[0] in master and x[1] in master and x[2] in master:
print i
For my use case, I can safely assume that len(master) << 20000. (Consequently, it is also safe to assume that master is sorted because this is cheap).
You can do this easily via iterating over an array in list comprehension. A toy example is as follows:
gives
Now iterate over the elements:
The result is
You can modify it for being a subset as in your question:
If you need the indices, then use
It gives
A more natural (and possibly faster) solution for set operations in numpy is to use the functions in
numpy.lib.arraysetops
. These generally allow you to avoid having to convert back and forth between Python'sset
type. To check if one array is a subset of another, usenumpy.setdiff1d()
and test if the returned array has 0 length:You can also optionally set
assume_unique=True
for a potential speed boost.I'm actually a bit surprised that
numpy
doesn't have something like a built-inissubset()
function to do the above (analogous toset.issubset()
).Another option is to use
numpy.in1d()
(see https://stackoverflow.com/a/37262010/2020363)Edit: I just realized that at some point in the distant past this bothered me enough that I wrote my own simple function:
Here are two approaches you could try:
1, Use sets. Sets are implemented much like python dictionaries and have have constant time lookups. That would look much like the code you already have, just create a set from master:
2, Use search sorted. This is nice because it doesn't require you to use hashable types and uses numpy builtins.
searchsorted
is log(N) in the size of master and O(N) in the size of triangels so it should also be pretty fast, maybe faster depending on the size of your arrays and such.This second case assumes master is sorted, as
searchsorted
implies.