I want to understand the NumPy behavior.
When I try to get the reference of an inner array of a NumPy array, and then compare it to the object itself, I get as returned value False
.
Here is the example:
In [198]: x = np.array([[1,2,3], [4,5,6]])
In [201]: x0 = x[0]
In [202]: x0 is x[0]
Out[202]: False
While on the other hand, with Python native objects, the returned is True
.
In [205]: c = [[1,2,3],[1]]
In [206]: c0 = c[0]
In [207]: c0 is c[0]
Out[207]: True
My question, is that the intended behavior of NumPy? If so, what should I do if I want to create a reference of inner objects of NumPy arrays.
2d slicing
When I first wrote this I constructed and indexed a 1d array. But the OP is working with a 2d array, so x[0]
is a 'row', a slice of the original.
In [81]: arr = np.array([[1,2,3], [4,5,6]])
In [82]: arr.__array_interface__['data']
Out[82]: (181595128, False)
In [83]: x0 = arr[0,:]
In [84]: x0.__array_interface__['data']
Out[84]: (181595128, False) # same databuffer pointer
In [85]: id(x0)
Out[85]: 2886887088
In [86]: x1 = arr[0,:] # another slice, different id
In [87]: x1.__array_interface__['data']
Out[87]: (181595128, False)
In [88]: id(x1)
Out[88]: 2886888888
What I wrote earlier about slices still applies. Indexing an individual elements, as with arr[0,0]
works the same as with a 1d array.
This 2d arr has the same databuffer as the 1d arr.ravel()
; the shape and strides are different. And the distinction between view
, copy
and item
still applies.
A common way of implementing 2d arrays in C is to have an array of pointers to other arrays. numpy
takes a different, strided
approach, with just one flat array of data, and usesshape
and strides
parameters to implement the transversal. So a subarray requires its own shape
and strides
as well as a pointer to the shared databuffer.
1d array indexing
I'll try to illustrate what is going on when you index an array:
In [51]: arr = np.arange(4)
The array is an object with various attributes such as shape, and a data buffer. The buffer stores the data as bytes (in a C array), not as Python numeric objects. You can see information on the array with:
In [52]: np.info(arr)
class: ndarray
shape: (4,)
strides: (4,)
itemsize: 4
aligned: True
contiguous: True
fortran: True
data pointer: 0xa84f8d8
byteorder: little
byteswap: False
type: int32
or
In [53]: arr.__array_interface__
Out[53]:
{'data': (176486616, False),
'descr': [('', '<i4')],
'shape': (4,),
'strides': None,
'typestr': '<i4',
'version': 3}
One has the data pointer in hex, the other decimal. We usually don't reference it directly.
If I index an element, I get a new object:
In [54]: x1 = arr[1]
In [55]: type(x1)
Out[55]: numpy.int32
In [56]: x1.__array_interface__
Out[56]:
{'__ref': array(1),
'data': (181158400, False),
....}
In [57]: id(x1)
Out[57]: 2946170352
It has some properties of an array, but not all. For example you can't assign to it. Notice also that its 'data` value is totally different.
Make another selection from the same place - different id and different data:
In [58]: x2 = arr[1]
In [59]: id(x2)
Out[59]: 2946170336
In [60]: x2.__array_interface__['data']
Out[60]: (181143288, False)
Also if I change the array at this point, it does not affect the earlier selections:
In [61]: arr[1] = 10
In [62]: arr
Out[62]: array([ 0, 10, 2, 3])
In [63]: x1
Out[63]: 1
x1
and x2
don't have the same id
, and thus won't match with is
, and they don't use the arr
data buffer either. There's no record that either variable was derived from arr
.
With slicing
it is possible get a view
of the original array,
In [64]: y = arr[1:2]
In [65]: y.__array_interface__
Out[65]:
{'data': (176486620, False),
'descr': [('', '<i4')],
'shape': (1,),
....}
In [66]: y
Out[66]: array([10])
In [67]: y[0]=4
In [68]: arr
Out[68]: array([0, 4, 2, 3])
In [69]: x1
Out[69]: 1
It's data pointer is 4 bytes larger than arr
- that is, it points to the same buffer, just a different spot. And changing y
does change arr
(but not the independent x1
).
I could even make a 0d view of this item
In [71]: z = y.reshape(())
In [72]: z
Out[72]: array(4)
In [73]: z[...]=0
In [74]: arr
Out[74]: array([0, 0, 2, 3])
In Python code we normally don't work with objects like this. When we use the c-api
or cython
is it possible to access the data buffer directly. nditer
is an iteration mechanism that works with 0d objects like this (either in Python or the c-api). In cython
typed memoryviews
are particularly useful for low level access.
http://cython.readthedocs.io/en/latest/src/userguide/memoryviews.html
https://docs.scipy.org/doc/numpy/reference/arrays.nditer.html
https://docs.scipy.org/doc/numpy/reference/c-api.iterator.html#c.NpyIter
elementwise ==
In response to comment, Comparing NumPy object references
np.array([1]) == np.array([2]) will return array([False], dtype=bool)
==
is defined for arrays as an elementwise operation. It compares the values of the respective elements and returns a matching boolean array.
If such a comparison needs to be used in a scalar context (such as an if
) it needs to be reduced to a single value, as with np.all
or np.any
.
The is
test compares object id's (not just for numpy objects). It has limited value in practical coding. I used it most often in expressions like is None
, where None
is an object with a unique id, and which does not play nicely with equality tests.
I think that you have a miss understanding about Numpy arrays. You think that sub arrays in a multidimensional array in Numpy (like in Python lists) are separate objects, well, they're not.
A Numpy array, regardless of its dimension is just one object. And that's because Numpy creates the arrays at C levels and when loads them up as a python object it can't be break down to multiple objects. That makes Python to create a new object for preserving new parts when you use some attributes like split()
, __getitem__
, take()
or etc., which as a mater of fact, its just the way that python abstracts the list-like behavior for Numpy arrays.
You can also check thin in real-time like following:
In [7]: x
Out[7]:
array([[1, 2, 3],
[4, 5, 6]])
In [8]: x[0] is x[0]
Out[8]: False
So as soon as you have an array or any mutable object that can hols other object in it you'll have a python mutable object and therefore you will lose the performance and all other Numpy array's cool features.
Also as @Imanol mentioned in comments you may want to use Numpy view objects if you want to have a memory optimized and flexible operation when you want to modify an array(s) with reference(s). view
objects can be constructed in following two ways:
a.view(some_dtype)
or a.view(dtype=some_dtype)
constructs a view of
the array’s memory with a different data-type. This can cause a
reinterpretation of the bytes of memory.
a.view(ndarray_subclass)
or a.view(type=ndarray_subclass)
just returns
an instance of ndarray_subclass that looks at the same array (same
shape, dtype, etc.) This does not cause a reinterpretation of the
memory.
For a.view(some_dtype)
, if some_dtype
has a different number of bytes
per entry than the previous dtype
(for example, converting a regular
array to a structured array), then the behavior of the view cannot be
predicted just from the superficial appearance of a (shown by
print(a)
). It also depends on exactly how a is stored in memory.
Therefore if a is C-ordered versus fortran-ordered, versus defined as
a slice or transpose, etc., the view may give different results.