I have a rank-1 numpy.array of which I want to make a boxplot. However, I want to exclude all values equal to zero in the array ... Currently, I solve this by looping over the array and copying each value to a new array if it is not equal to zero. However, as the array consists of 86 000 000 values and I have to do this multiple times, this takes a lot of patience.
Is there a more intelligent way to do this?
This is a case where you want to use masked arrays: a masked array keeps the shape of your array and is recognized by most NumPy and Matplotlib functions.
import numpy as np
import matplotlib.pyplot as plt
X = np.random.randn(1000, 5)
X[np.abs(X) < .1] = 0 # some zeros
X = np.ma.masked_equal(X,0)
plt.boxplot(X) #masked values are not plotted
#other functionalities of masked arrays
X.compressed() # get normal array with masked values removed
X.mask # get a boolean array of the mask
X.mean() # it automatically discards masked values
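As a small check of the behaviour described above, here is a minimal sketch on a made-up array (`data` and its values are assumptions for illustration, not from the answer). It shows that a masked-array reduction agrees with the same reduction on the zero-free values, and that `compressed()` applied per column gives plain 1-D arrays of possibly different lengths.

```python
import numpy as np

# Hypothetical sample data; zeros stand for the values to exclude.
data = np.array([[0.0, 2.0, 4.0],
                 [1.0, 0.0, 6.0],
                 [3.0, 5.0, 0.0]])

masked = np.ma.masked_equal(data, 0)

# Reductions skip the masked entries.
col_means = masked.mean(axis=0)

# One plain 1-D array per column, with the masked zeros removed.
columns = [masked[:, j].compressed() for j in range(masked.shape[1])]
```

A list of 1-D arrays like `columns` is also a form that plotting functions generally accept, which may be useful if a given function turns out not to honour the mask.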
For a NumPy array a, you can use a[a != 0] to extract the values not equal to zero.
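A minimal sketch of that expression on made-up arrays (the names `a`, `b` and their contents are assumptions for illustration). The comparison builds a Boolean mask, and indexing with it returns a new 1-D array, even when the input has more dimensions.

```python
import numpy as np

a = np.array([0, 1, 0, 3, 4, 5, 0])
nonzero_values = a[a != 0]   # keeps entries where the mask is True
print(nonzero_values)        # prints [1 3 4 5]

# With a 2-D input the result is flattened to 1-D.
b = np.array([[0, 7], [8, 0]])
flat = b[b != 0]
```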
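A simple line of code can get you the indices of all nonzero values:
np.argwhere(array)
Note that it returns indices rather than the values themselves, so index the array with the result to exclude the zeros.
example:
import numpy as np
array = np.array([0, 1, 0, 3, 4, 5, 0])
indices = np.argwhere(array).ravel()  # positions of the nonzero entries
array2 = array[indices]
print(array2)
[1 3 4 5]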
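Note that np.argwhere returns the positions of the nonzero entries, not the entries themselves; in the example above the two happen to coincide. A sketch with a different, made-up array (`values` is an assumption for illustration) makes the difference visible:

```python
import numpy as np

values = np.array([0, 10, 0, 30, 40])    # hypothetical data

positions = np.argwhere(values)          # shape (3, 1): one row per nonzero entry
flat_positions = np.flatnonzero(values)  # shape (3,): the same indices, flattened
nonzero_values = values[flat_positions]  # index back into the array to get values
```

Here `positions` holds the indices 1, 3 and 4, while `nonzero_values` holds 10, 30 and 40.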
I would suggest simply using NaN for cases like this, where you would like to ignore some values but still want to keep the procedure as statistically meaningful as possible. So
In []: X = randn(1000, 5)
In []: X[abs(X) < .1] = NaN
In []: isnan(X).sum(0)
Out[]: array([82, 84, 71, 81, 73])
In []: boxplot(X)
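One thing to keep in mind with the NaN approach is that ordinary reductions such as mean or median propagate NaN, so the NaN-aware variants are needed for statistics. The sketch below uses explicit NumPy names instead of the pylab-style session above; the seed and the variable names are assumptions for illustration. Whether a given plotting function skips NaN is worth verifying for your version, and dropping the NaN entries per column avoids relying on it.

```python
import numpy as np

rng = np.random.default_rng(0)        # seed chosen arbitrarily
X = rng.standard_normal((1000, 5))
X[np.abs(X) < 0.1] = np.nan           # mark the values to ignore

nan_counts = np.isnan(X).sum(axis=0)  # ignored values per column
col_medians = np.nanmedian(X, axis=0) # NaN-aware statistic

# One 1-D array per column with the NaN entries dropped.
columns = [col[~np.isnan(col)] for col in X.T]
```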
You can index with a Boolean array. For a NumPy array A:
res = A[A != 0]
You can use Boolean array indexing as above, bool type conversion, np.nonzero, or np.where. Here's some performance benchmarking:
# Python 3.7, NumPy 1.14.3
import numpy as np
np.random.seed(0)
A = np.random.randint(0, 5, 10**8)
%timeit A[A != 0] # 768 ms
%timeit A[A.astype(bool)] # 781 ms
%timeit A[np.nonzero(A)] # 1.49 s
%timeit A[np.where(A)] # 1.58 s
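The timings above only make sense as a comparison if the four expressions select the same elements. A quick sanity check on a smaller, made-up array (the size and seed are assumptions chosen to keep it fast) might look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(0, 5, size=1000)   # integers in [0, 5), so zeros are included

by_comparison = A[A != 0]           # Boolean mask from a comparison
by_bool_cast = A[A.astype(bool)]    # Boolean mask from a type conversion
by_nonzero = A[np.nonzero(A)]       # integer indices of the nonzero entries
by_where = A[np.where(A)]           # one-argument np.where behaves like np.nonzero
```

All four results contain the same values in the same order; the two mask-based forms skip building an intermediate index array.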