How do I shift my thinking to 'vectorize my code'?

Asked 2020-06-25 07:20

This is definitely more of a notional question, but I wanted to get others' expert input on this topic here at SO. Most of my programming lately involves Numpy arrays. I've been matching items across two or more arrays of different sizes. Most of the time I reach for a for-loop, or even worse, a nested for-loop. I'm ultimately trying to avoid for-loops as I gain more experience in Data Science, because for-loops perform slower.

I am well aware of Numpy and the predefined functions I can research, but for those of you who are experienced, do you have a general school of thought when you iterate through something?

Something similar to the following:

small_array = np.array(["a", "b"])
big_array = np.array(["a", "b", "c", "d"])

for i in range(len(small_array)):
    for p in range(len(big_array)):
        if small_array[i] == big_array[p]:
            print("This item is matched: ", small_array[i])

I'm well aware there are more than one way to skin a cat with this, but I am interested in others approach and way of thinking.

4 answers
爱情/是我丢掉的垃圾
Answer #2 · 2020-06-25 07:35

As you said, you'd better use vectorized operations to speed things up. Learning this is a long path. You have to get used to matrix multiplication if you aren't already. Once you are, try to translate your data into matrices and see which multiplications you can do. Often you can't do what you want that way and need higher-dimensional arrays (more than 2 dimensions). That's where numpy gets useful.

Numpy provides functions like np.where; learn how to use them. Know shortcuts like small_array[small_array == 'a'] = 'z'. Try to combine numpy functions with native Python built-ins (map, filter, ...).
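A minimal sketch of those two idioms, boolean-mask assignment and np.where as a vectorized if/else (the array values here are just illustrative):

```python
import numpy as np

arr = np.array(["a", "b", "c", "a"])

# Boolean mask assignment: replace every 'a' with 'z' in place
masked = arr.copy()
masked[masked == "a"] = "z"

# np.where as a vectorized if/else: keep 'b', blank out everything else
picked = np.where(arr == "b", arr, "-")
```

Both expressions operate on the whole array at once, with no Python-level loop.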

To handle multi-dimensional arrays there's no secret: practice, and use paper to understand what you're doing. But beyond 4 dimensions it starts getting very tricky.

何必那么认真
Answer #3 · 2020-06-25 07:38

I will interpret your question in a more specific way:

  1. How do I quit using index variables?

  2. How do I start writing list comprehensions instead of normal loops?

To quit using index variables, the key is to understand that "for" in Python is not the "for" of other languages. It should be called "for each".

for x in small_array:
    for y in big_array:
        if x == y:
            print("This item is matched: ", x)

That's much better.

I also find myself in situations where I would write code with normal loops (or actually do it) and then start wondering whether it would be clearer and more elegant with a list comprehension.

List comprehensions are really a domain-specific language to create lists, so the first step would be to learn its basics. A typical statement would be:

l = [f(x) for x in list_expression if g(x)]

Meaning "give me a list of f(x), for all x out of list_expression that meet condition g(x)".

So you could write it in this way:

matched = [x for x in small_array if x in big_array]

Et voilà, you are on the road to pythonic style!
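One caveat worth noting (my addition, not part of the answer above): `x in big_array` scans the whole array for each x. Converting to a plain Python set first makes each membership test effectively constant time:

```python
import numpy as np

small_array = np.array(["a", "b"])
big_array = np.array(["a", "b", "c", "d"])

big_set = set(big_array)  # one-time O(m) conversion
matched = [x for x in small_array if x in big_set]  # each test is ~O(1)
```

For arrays of this size it makes no difference, but for large inputs it changes the comprehension from O(n*m) to O(n+m).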

迷人小祖宗
Answer #4 · 2020-06-25 07:46

Since I've been working with array languages for decades (APL, MATLAB, numpy), I can't help much with the starting steps. But I suspect I work mostly from patterns, things I've seen and used in the past. And I do a lot of experimentation in an interactive session.

To take your example:

In [273]: small_array = np.array(["a", "b"])
     ...: big_array = np.array(["a", "b", "c", "d"])
     ...: 
     ...: for i in range(len(small_array)):
     ...:     for p in range(len(big_array)):
     ...:         if small_array[i] == big_array[p]:
     ...:             print( "This item is matched: ", small_array[i])
     ...:             
This item is matched:  a
This item is matched:  b

Often I run the iterative case just to get a clear(er) idea of what is desired.

In [274]: small_array
Out[274]: 
array(['a', 'b'],
      dtype='<U1')
In [275]: big_array
Out[275]: 
array(['a', 'b', 'c', 'd'],
      dtype='<U1')

I've seen this before: iterating over two arrays and doing something with the paired values. This is a kind of outer operation. There are various tools, but the one I like best makes use of numpy broadcasting. It turns one array into an (n,1) array and uses it with the other (m,) array:

In [276]: small_array[:,None]
Out[276]: 
array([['a'],
       ['b']],
      dtype='<U1')

The result of (n,1) operating with (1,m) is a (n,m) array:

In [277]: small_array[:,None]==big_array
Out[277]: 
array([[ True, False, False, False],
       [False,  True, False, False]], dtype=bool)

Now I can take an any or all reduction on either axis:

In [278]: _.all(axis=0)
Out[278]: array([False, False, False, False], dtype=bool)

In [280]: __.all(axis=1)
Out[280]: array([False, False], dtype=bool)

I could also use np.where to reduce that boolean to indices.
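For instance, applied to the Out[277] boolean, np.where returns a pair of index arrays, row indices into one array and column indices into the other:

```python
import numpy as np

small_array = np.array(["a", "b"])
big_array = np.array(["a", "b", "c", "d"])

# np.where on a boolean array yields the coordinates of the True cells
rows, cols = np.where(small_array[:, None] == big_array)
# rows index into small_array, cols into big_array, pairing up the matches
```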


Oops, I should have used any

In [284]: (small_array[:,None]==big_array).any(0)
Out[284]: array([ True,  True, False, False], dtype=bool)
In [285]: (small_array[:,None]==big_array).any(1)
Out[285]: array([ True,  True], dtype=bool)

Having played with this, I remember that there's an in1d that does something similar:

In [286]: np.in1d(big_array, small_array)
Out[286]: array([ True,  True, False, False], dtype=bool)

But when I look at the code for in1d (see the [source] link in the docs), I see that in some cases it actually iterates on the small array:

In [288]: for x in small_array:
     ...:     print(x==big_array)
     ...:     
[ True False False False]
[False  True False False]

Compare that to Out[277]. x==big_array compares a scalar with an array. In numpy, doing something like ==, +, * etc with an array and scalar is easy, and should become second nature. Doing the same thing with 2 arrays of matching shapes is the next step. And from there do it with broadcastable shapes.

In other cases it uses np.unique and np.argsort.

This pattern of creating a higher dimension array by broadcasting the inputs against each other, and then combining values with some sort of reduction (any, all, sum, mean, etc) is very common.
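The same broadcast-then-reduce shape appears with numeric data too. As an illustrative sketch (my own, not from the answer): the distance from each point in one set to its nearest point in another:

```python
import numpy as np

a = np.array([0.0, 5.0, 9.0])   # shape (3,)
b = np.array([1.0, 8.0])        # shape (2,)

# Broadcast (3,1) against (2,) to get all pairwise distances, shape (3,2)
diffs = np.abs(a[:, None] - b)

# Reduce over the b axis: distance from each element of a to its nearest b
nearest = diffs.min(axis=1)
```

Same recipe as the string matching above: lift to a higher-dimensional array via broadcasting, then collapse one axis with a reduction (here min instead of any).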

唯我独甜
Answer #5 · 2020-06-25 07:50

For loops are not necessarily slow. That's MATLAB folklore, spread over time through MATLAB's own fault. Vectorization is still "for" looping, just at a lower level. You need to get a handle on what kind of data and architecture you are working with, and which kind of function you are executing over your data.
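A quick way to see what "the same loop, at a lower level" means in practice (a sketch; exact timings depend on your machine, and the function names are just illustrative):

```python
import numpy as np
import timeit

x = np.arange(100_000)

def python_loop():
    total = 0
    for v in x:          # each iteration pays Python-level dispatch cost
        total += v
    return total

def vectorized():
    return x.sum()       # the same summation loop, run in compiled C

loop_t = timeit.timeit(python_loop, number=1)
vec_t = timeit.timeit(vectorized, number=1)
# Both compute the same result; the vectorized version is typically far faster,
# but underneath it is still a loop over the same elements.
```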

查看更多