This is definitely more of a notional question, but I wanted to get others' expert input on this topic here at SO. Most of my programming lately involves Numpy arrays. I've been matching items across two or more arrays of different sizes. Most of the time I reach for a for-loop, or even worse, a nested for-loop. I'm ultimately trying to avoid for-loops as I gain more experience in Data Science, because for-loops perform slower.
I am well aware of Numpy and the predefined functions I can research, but for those of you who are experienced: do you have a general school of thought for when you iterate through something?
Something similar to the following:
```python
import numpy as np

small_array = np.array(["a", "b"])
big_array = np.array(["a", "b", "c", "d"])

for i in range(len(small_array)):
    for p in range(len(big_array)):
        if small_array[i] == big_array[p]:
            print("This item is matched:", small_array[i])
```
I'm well aware there are more than one way to skin a cat with this, but I am interested in others approach and way of thinking.
As you said, you'd better use vectorized operations to speed things up. Learning this is a long path. You have to get used to matrix multiplication if you aren't already. Once you are, try to translate your data into matrices and see which multiplications you can do. Often you can't do exactly what you want this way and end up with higher-dimensional arrays (more than 2 dimensions). That's where numpy gets useful.
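As a toy sketch of the matrix-multiplication step (the values here are arbitrary, just to show the operation):

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

# the @ operator is numpy's matrix product (same as np.matmul)
print(a @ b)  # [[19 22]
              #  [43 50]]
```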
Numpy provides functions like `np.where`; know how to use them. Know shortcuts like `small_array[small_array == 'a'] = 'z'`. Try to combine numpy functions with native Python ones (map, filter...). To handle multi-dimensional matrices there's no secret: practice, and use paper to understand what you're doing. But above 4 dimensions it starts getting very tricky.
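A minimal sketch of the two tools just mentioned, using the question's small array:

```python
import numpy as np

small_array = np.array(["a", "b"])

# np.where returns the indices where a condition holds
idx = np.where(small_array == "a")
print(idx[0])  # [0]

# the boolean-mask assignment shortcut: replace every "a" with "z"
small_array[small_array == "a"] = "z"
print(small_array)  # ['z' 'b']
```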
I will interpret your question in a more specific way:
How do I quit using index variables?
How do I start writing list comprehensions instead of normal loops?
To quit using index variables, the key is to understand that "for" in Python is not the "for" of other languages. It should be called "for each".
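Restating the question's loop in that "for each" style (plain lists used here for brevity):

```python
small_array = ["a", "b"]
big_array = ["a", "b", "c", "d"]

# iterate over the elements themselves; no index variables needed
for item in small_array:
    if item in big_array:
        print("This item is matched:", item)
```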
That's much better.
I also find myself in situations where I would write code with normal loops (or actually do it) and then start wondering whether it would be clearer and more elegant with a list comprehension.
List comprehensions are really a domain-specific language to create lists, so the first step would be to learn its basics. A typical statement would be:
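Sketching the generic shape (with `f`, `list_expression`, and `g` as the placeholder names used below), plus one concrete instance:

```python
# generic shape of a list comprehension:
#     [f(x) for x in list_expression if g(x)]

# concrete instance: squares of the even numbers below 10
result = [x * x for x in range(10) if x % 2 == 0]
print(result)  # [0, 4, 16, 36, 64]
```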
Meaning "give me a list of f(x), for all x out of list_expression that meet condition g"
So you could write it in this way:
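One possible rewrite of the question's nested loop as a comprehension (a sketch; the `in` membership test replaces the inner loop):

```python
import numpy as np

small_array = np.array(["a", "b"])
big_array = np.array(["a", "b", "c", "d"])

# one comprehension instead of two index-based loops
matched = [x for x in small_array if x in big_array]
print(matched)
```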
Et voilà, you are on the road to pythonic style!
Since I've been working with array languages for decades (APL, MATLAB, numpy), I can't help much with the starting steps. But I suspect I work mostly from patterns: things I've seen and used in the past. And I do a lot of experimentation in an interactive session.
To take your example:
Often I run the iterative case just to get a clear(er) idea of what is desired.
I've seen this before: iterating over two arrays and doing something with the paired values. This is a kind of `outer` operation. There are various tools, but the one I like best makes use of `numpy` broadcasting. Turn one array into an (n,1) array and use it with the other (m,) array. The result of (n,1) operating with (1,m) is an (n,m) array:
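A minimal sketch of that broadcasting step, using the question's arrays:

```python
import numpy as np

small_array = np.array(["a", "b"])
big_array = np.array(["a", "b", "c", "d"])

# (2,1) compared against (4,) broadcasts to a (2,4) boolean array
matches = small_array[:, None] == big_array
print(matches.shape)  # (2, 4)
print(matches)
```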
Now I can take an `any` or `all` reduction on either axis. I could also use `np.where` to reduce that boolean to indices. Oops, I should have used `any`.
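Continuing the sketch: the reductions and `np.where` applied to that (n,m) boolean array:

```python
import numpy as np

small_array = np.array(["a", "b"])
big_array = np.array(["a", "b", "c", "d"])
matches = small_array[:, None] == big_array   # (2, 4) boolean

# reduce along axis 1: is each small item anywhere in big_array?
print(matches.any(axis=1))   # [ True  True]

# reduce along axis 0: which big items were matched?
print(matches.any(axis=0))   # [ True  True False False]

# np.where reduces the boolean to index pairs
rows, cols = np.where(matches)
print(rows, cols)            # [0 1] [0 1]
```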
Having played with this, I remember that there's an `in1d` that does something similar. But when I look at the code for `in1d` (see the `[source]` link in the docs), I see that in some cases it actually iterates on the small array. Compare that to `Out[277]`: `x == big_array` compares a scalar with an array. In `numpy`, doing something like `==`, `+`, `*`, etc. with an array and a scalar is easy, and should become second nature. Doing the same thing with two arrays of matching shapes is the next step, and from there, doing it with broadcastable shapes. In other cases `in1d` uses `np.unique` and `np.argsort`.

This pattern of creating a higher-dimensional array by broadcasting the inputs against each other, and then combining values with some sort of reduction (any, all, sum, mean, etc.) is very common.
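To make the `in1d` discussion concrete, here is a simplified paraphrase of its looping branch (not numpy's exact source; the real `in1d` picks this path based on the relative sizes of the inputs), followed by the direct library call (`np.isin` is the modern spelling of `in1d`):

```python
import numpy as np

small_array = np.array(["a", "b"])
big_array = np.array(["a", "b", "c", "d"])

def in1d_loop_sketch(ar1, ar2):
    # simplified paraphrase: loop over the shorter array ar2,
    # comparing each scalar against all of ar1 at once
    mask = np.zeros(len(ar1), dtype=bool)
    for x in ar2:
        mask |= (ar1 == x)
    return mask

print(in1d_loop_sketch(big_array, small_array))  # [ True  True False False]

# the library call itself
print(np.isin(small_array, big_array))           # [ True  True]
```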
For-loops are not necessarily slow. That's MATLAB nonsense, spread over time because of MATLAB's own faults. Vectorization is still "for" looping, just at a lower level. You need to get a handle on what kind of data and architecture you are working with, and what kind of function you are executing over your data.