No binary operators for structured arrays in Numpy

Posted 2019-04-06 19:50

Okay, so after going through the tutorials on numpy's structured arrays I am able to create some simple examples:

from numpy import array, ones
names=['scalar', '1d-array', '2d-array']
formats=['float64', '(3,)float64', '(2,2)float64']
my_dtype = dict(names=names, formats=formats)
struct_array1 = ones(1, dtype=my_dtype)
struct_array2 = array([(42., [0., 1., 2.], [[5., 6.],[4., 3.]])], dtype=my_dtype)

(My intended use case would have more than three entries and would use very long 1d-arrays.) So, all goes well until we try to perform some basic math. I get errors for all of the following:

struct_array1 + struct_array2
struct_array1 * struct_array2
1.0 + struct_array1
2.0 * struct_array2

Apparently, simple operators (+, -, *, /) are not supported for even the simplest structured arrays. Or am I missing something? Should I be looking at some other package (and don't say Pandas, because it is total overkill for this)? This seems like an obvious capability, so I'm a little dumbfounded. But it's difficult to find any chatter about this on the net. Doesn't this severely limit the usefulness of structured arrays? Why would anyone use a structured array rather than arrays packed into a dict? Is there a technical reason why this might be intractable? Or, if the correct solution is to perform the arduous work of overloading, then how is that done while keeping the operations fast?

3 answers
霸刀☆藐视天下
#2 · 2019-04-06 20:11

Another way to operate on the whole array is to use the 'union' dtype described in the documentation. In your example, you could expand your dtype by adding a 'union' field, and specifying overlapping 'offsets':

from numpy import array, ones, zeros

names=['scalar', '1d-array', '2d-array', 'union']
formats=['float64', '(3,)float64', '(2,2)float64', '(8,)float64']
offsets=[0, 8, 32, 0]
my_dtype = dict(names=names, formats=formats, offsets=offsets)
struct_array3=zeros((4,), dtype=my_dtype)

['union'] now gives access to all the data as an (n, 8) array:

struct_array3['union'] # == struct_array3.view('(8,)f8')
struct_array3['union'].shape  # (4,8)

You can operate on 'union' or any other fields:

struct_array3['union'] += 2
struct_array3['scalar']= 1

The 'union' field could have another compatible shape, such as '(2,4)float64'. A 'row' of such an array might look like:

array([ (3.0, [0.0, 0.0, 0.0], [[2.0, 2.0], [0.0, 0.0]], 
      [[3.0, 0.0, 0.0, 0.0], [2.0, 2.0, 0.0, 0.0]])], 
      dtype={'names':['scalar','1d-array','2d-array','union'], 
             'formats':['<f8',('<f8', (3,)),('<f8', (2, 2)),('<f8', (2, 4))], 
             'offsets':[0,8,32,0], 
             'itemsize':64})
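
To check that the 'union' field really aliases the other fields, here is a small self-contained sketch. The names `union_dtype`, `arr` and `first_row_flat` are mine, chosen for illustration. It assumes every field is float64, so that the 8-element overlay lines up exactly with the 1 + 3 + 4 values in each record:

```python
import numpy as np

# Same layout as above: three named fields plus an overlapping 'union' field
# that starts at offset 0 and spans the whole 64-byte record.
union_dtype = np.dtype(dict(
    names=['scalar', '1d-array', '2d-array', 'union'],
    formats=['float64', '(3,)float64', '(2,2)float64', '(8,)float64'],
    offsets=[0, 8, 32, 0],
))
arr = np.zeros((4,), dtype=union_dtype)

arr['union'] += 2      # touches every float in every record
arr['scalar'] = 1      # overwrites only the first float of each record

# The first record, seen through the overlay as 8 plain floats.
first_row_flat = arr['union'][0].tolist()
print(first_row_flat)
print(arr['2d-array'][0])
```

Because the fields share memory, a write through one name can clobber data seen through another, so this only makes sense when that aliasing is what you want.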
Fickle 薄情
#3 · 2019-04-06 20:19

On the numpy structured array doc pages, most of the examples involve mixed data types - floats, ints, and strings. On SO most of the structured array questions have to do with loading mixed data from CSV files. On the other hand, in your example it appears that the main purpose of the structure is to give names to the 'columns'.

You can do math on the named columns, e.g.

struct_array1['scalar']+struct_array2['scalar']
struct_array1['2d-array']+struct_array2['2d-array']

You can also 'iterate' over the fields:

for n in my_dtype['names']:
    print(struct_array1[n] + struct_array2[n])

And yes, for that purpose, making those arrays values in a dictionary, or attributes of an object, works just as well.
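
If the per-field loop is needed in more than one place, it could be wrapped in a small helper. This is only a sketch: `fieldwise` is a hypothetical name, not a numpy function, and it assumes the second operand is either a structured array with the same field names or a plain scalar.

```python
import operator
import numpy as np

def fieldwise(op, a, b):
    """Apply a binary operator field by field and return a new structured array."""
    out = np.empty_like(a)
    b_is_structured = isinstance(b, np.ndarray) and b.dtype.names is not None
    for name in a.dtype.names:
        rhs = b[name] if b_is_structured else b
        out[name] = op(a[name], rhs)
    return out

my_dtype = dict(names=['scalar', '1d-array', '2d-array'],
                formats=['float64', '(3,)float64', '(2,2)float64'])
struct_array1 = np.ones(1, dtype=my_dtype)
struct_array2 = np.array([(42., [0., 1., 2.], [[5., 6.], [4., 3.]])], dtype=my_dtype)

total = fieldwise(operator.add, struct_array1, struct_array2)
doubled = fieldwise(operator.mul, struct_array2, 2.0)
print(total)
print(doubled)
```

Each field is still processed with ordinary vectorized numpy operations, so the Python-level loop runs once per field, not once per element.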

However, thinking about the CSV case, sometimes we want to talk about specific 'rows' of a CSV or structured array, e.g. struct_array[0]. Such a 'row' is a tuple of values.

In any case, the primary data structures in numpy are multidimensional arrays of numeric values, and most of the code revolves around numeric data types - float, int, etc. Structured arrays are a generalization of this, using elements that are, fundamentally, just fixed sets of bytes. How those bytes are interpreted is determined by the dtype.
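
To make the "fixed sets of bytes" point concrete, here is a short sketch that inspects the layout of the dtype from the question. The variable names (`dt`, `rec`, `offsets`, `as_floats`) are mine; the layout shown assumes the default packed (unaligned) dtype.

```python
import numpy as np

dt = np.dtype(dict(names=['scalar', '1d-array', '2d-array'],
                   formats=['float64', '(3,)float64', '(2,2)float64']))
rec = np.array([(42., [0., 1., 2.], [[5., 6.], [4., 3.]])], dtype=dt)

# Each record is one block of dt.itemsize bytes; dt.fields maps a field
# name to (field dtype, byte offset within the record).
offsets = {name: dt.fields[name][1] for name in dt.names}
print(dt.itemsize, offsets)

# Reinterpreting the same bytes as plain float64 values.
as_floats = np.frombuffer(rec.tobytes(), dtype='float64')
print(as_floats)
```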

Think about how MATLAB evolved - Matrices came first, then cells (like Python lists), then structures, and finally classes and objects. Python already had the lists, dictionaries and objects. numpy adds the arrays. It doesn't need to reinvent the general Python structures.

I'd lean toward defining a class like this:

import numpy as np

class Foo(object):
    def __init__(self):
        self.scalar = 1
        self._1d_array = np.arange(10)
        self._2d_array = np.array([[1,2],[3,4]])

and implementing only the binary operations that are really needed for the application.
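
As a rough sketch of what that could look like, here is a class that supports only + and *. The class name `Record`, its attribute names and the `_combine` helper are my own choices for illustration, and I'm assuming the other operand is either another `Record` or a plain number.

```python
import operator
import numpy as np

class Record:
    def __init__(self, scalar, arr1d, arr2d):
        self.scalar = float(scalar)
        self.arr1d = np.asarray(arr1d, dtype=float)
        self.arr2d = np.asarray(arr2d, dtype=float)

    def _combine(self, other, op):
        # Pair up attributes when 'other' is a Record; otherwise
        # broadcast the plain number across every attribute.
        if isinstance(other, Record):
            parts = (other.scalar, other.arr1d, other.arr2d)
        else:
            parts = (other, other, other)
        return Record(op(self.scalar, parts[0]),
                      op(self.arr1d, parts[1]),
                      op(self.arr2d, parts[2]))

    def __add__(self, other):
        return self._combine(other, operator.add)

    def __mul__(self, other):
        return self._combine(other, operator.mul)

    # + and * are commutative here, so the reflected versions can reuse them.
    __radd__ = __add__
    __rmul__ = __mul__

a = Record(42., [0., 1., 2.], [[5., 6.], [4., 3.]])
b = Record(1., np.ones(3), np.ones((2, 2)))
summed = a + b
scaled = 2.0 * a
print(summed.scalar, summed.arr1d, scaled.arr2d)
```

The arithmetic inside `_combine` is still done by numpy on whole arrays, so the overloading adds only a small constant overhead per operation.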

smile是对你的礼貌
#4 · 2019-04-06 20:21

Okay, after more research I stumbled upon an answer. (No fault to hpaulj - the question was not posed all that well.) But I wanted to post in case someone else has a similar frustration.

The answer comes from the numpy documentation on ndarray.view. They specifically provide an example in which they "[create] a view on a structured array so it can be used in calculations".

So, I was frustrated that I couldn't operate on my example structured arrays. After all, I "see" my structured array as simply a collection of floating point numbers! Well, in the end all I needed was to inform numpy of this abstraction using "view". The errors in the question can be avoided using:

( struct_array1.view(dtype='float64') + struct_array2.view(dtype='float64') ).view(dtype=my_dtype)
( struct_array1.view(dtype='float64') * struct_array2.view(dtype='float64') ).view(dtype=my_dtype)
( 1.0 + struct_array1.view(dtype='float64') ).view(dtype=my_dtype)
( 2.0 * struct_array2.view(dtype='float64') ).view(dtype=my_dtype)

This is not as elegant as one might want, but at least numpy has the capability.
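
To cut down on the repetition, the round trip through "view" could be hidden in a helper. This is a sketch only: `flat_op` and `_flat` are hypothetical names of mine, and the approach assumes every field is float64 with no padding between fields, as in my example dtype.

```python
import operator
import numpy as np

def _flat(x):
    """Return (float64 view, structured dtype) for structured arrays, else (x, None)."""
    if isinstance(x, np.ndarray) and x.dtype.names is not None:
        return x.view(np.float64), x.dtype
    return x, None

def flat_op(op, a, b):
    """Apply op on the flat float64 views, then view the result as the structured dtype."""
    flat_a, dtype_a = _flat(a)
    flat_b, dtype_b = _flat(b)
    out_dtype = dtype_a if dtype_a is not None else dtype_b
    return op(flat_a, flat_b).view(out_dtype)

my_dtype = dict(names=['scalar', '1d-array', '2d-array'],
                formats=['float64', '(3,)float64', '(2,2)float64'])
struct_array1 = np.ones(1, dtype=my_dtype)
struct_array2 = np.array([(42., [0., 1., 2.], [[5., 6.], [4., 3.]])], dtype=my_dtype)

added = flat_op(operator.add, struct_array1, struct_array2)
scaled = flat_op(operator.mul, 2.0, struct_array2)
print(added)
print(scaled)
```

The input views share memory with the original arrays, so the only copy made is the result of the operation itself.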
