Iterate two or more lists / numpy arrays… and comp

Published 2020-07-26 05:15

Question:

I am new to Python, and my problem is the following:

I have defined a function func(a, b) that returns a value, given two input values.

Now I have my data stored in lists or NumPy arrays A and B, and I would like to apply func to every combination. (A and B each have over one million entries.)

At the moment I use this snippet:

for p in A:
  for k in B:
    value = func(p,k)

This takes a very long time.

So I was thinking that maybe something like this would help:

C = map(func, A, B)

But this method only works pairwise... Any ideas?

Thanks for the help.

Answer 1:

I suppose itertools.product does what you need:

from itertools import product

pro = product(A, B)
C = map(lambda x: func(*x), pro)

Since product returns a generator (and so does map in Python 3), this doesn't require additional memory.
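A minimal runnable sketch of this approach, using itertools.starmap to unpack each pair (func here is just a stand-in for the poster's function):

```python
from itertools import product, starmap

def func(a, b):          # stand-in for the poster's two-argument function
    return a * b

A = [1, 2, 3]
B = [10, 20]

# product(A, B) yields every (a, b) pair lazily;
# starmap unpacks each pair into func's two arguments.
C = list(starmap(func, product(A, B)))
print(C)  # [10, 20, 20, 40, 30, 60]
```

starmap(func, pairs) is equivalent to the lambda above but avoids the extra function call per pair.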



Answer 2:

First issue

You need to calculate the output of f for many pairs of values. The "standard" way to speed up this kind of loop is to make your function f accept (NumPy) arrays as input, and do the calculation on the whole array at once (i.e., no looping as seen from Python). Check any NumPy tutorial for an introduction.
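As a sketch of what "accept arrays as input" means, assuming f is built from NumPy ufuncs (the function body here is hypothetical):

```python
import numpy as np

def f(a, b):
    # hypothetical function composed of ufuncs, so it works on whole arrays
    return a * b + np.sin(a)

A = np.array([0.0, 1.0, 2.0])
B = np.array([10.0, 20.0])

# Broadcasting: A becomes a column, B a row, so every (a, b) pair
# is evaluated in one vectorized call with no Python-level loop.
values = f(A[:, None], B[None, :])
print(values.shape)  # (3, 2)
```

Each element values[i, j] equals f(A[i], B[j]), computed in compiled NumPy code.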

Second issue

If A and B have over a million entries each, there are over one trillion combinations. For 64-bit numbers, that means you'd need about 7.3 TiB of space just to store the result of your calculation. Do you have enough disk space to even store the result?

Third issue

If A and B were much smaller, in your particular case you'd be able to do this:

values = f(*np.meshgrid(A, B))

meshgrid returns the cartesian product of A and B, so it's simply a way to generate the points that have to be evaluated.
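A small self-contained sketch of the meshgrid approach (f here is a stand-in that accepts arrays):

```python
import numpy as np

def f(a, b):
    return a + b   # stand-in; must work element-wise on arrays

A = np.array([1, 2, 3])
B = np.array([10, 20])

# np.meshgrid expands A and B into two 2-D coordinate arrays that
# together cover the cartesian product of their values.
AA, BB = np.meshgrid(A, B)
values = f(AA, BB)
print(values.shape)  # (2, 3)
```

Note that meshgrid materializes both full 2-D arrays, which is exactly why this only works when A and B are small.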

Summary

  • You need to use NumPy effectively to avoid Python loops. (Or if all else fails or they can't easily be vectorized, write those loops in a compiled language, for instance by using Cython)

  • Working with terabytes of data is hard. Do you really need that much data?

  • Any solution that calls a function f 1e12 times in a loop is bound to be slow, especially in CPython (the default Python implementation; if you're not sure which one you're using and you're using NumPy, it's CPython).



Answer 3:

One million times one million is one trillion. Calling f one trillion times will take a while.

Unless you have a way of reducing the number of values to compute, you can't do better than the answers above.



Answer 4:

If you use NumPy, you should definitely look at the np.vectorize function, which is designed for this kind of problem...
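A short sketch of np.vectorize applied to all pairs, with a stand-in scalar function:

```python
import numpy as np

def func(a, b):
    return a * b   # stand-in scalar function

# np.vectorize wraps func so it broadcasts over array inputs
vfunc = np.vectorize(func)

A = np.array([1, 2, 3])
B = np.array([10, 20])

# Column against row: result has shape (3, 2), one entry per (a, b) pair.
values = vfunc(A[:, None], B[None, :])
```

Worth noting: np.vectorize is a convenience wrapper, not a performance tool; internally it still calls func once per element in a Python-level loop, so it won't help with the trillion-call problem raised in the other answers.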