Let's say I have arrays a and b:
a = np.array([1,2,3])
b = np.array(['red','red','red'])
If I were to apply some fancy indexing like this to these arrays
b[a<3]="blue"
the output I get is
array(['blu', 'blu', 'red'], dtype='<U3')
I understand that the issue is that numpy initially allocated space for only 3 characters, so it can't fit the whole word "blue" into the array. What workaround can I use?
Currently I am doing
b = np.array([" "*100 for i in range(3)])
b[a>2] = "red"
b[a<3] = "blue"
but it's just a workaround. Is this a fault in my code, or is it some issue with numpy? How can I fix this?
You can handle variable-length strings by setting the dtype of b to be "object":
import numpy as np
a = np.array([1,2,3])
b = np.array(['red','red','red'], dtype="object")
b[a<3] = "blue"
print(b)
This outputs:
['blue' 'blue' 'red']
This dtype will handle strings or other general Python objects. It also necessarily means that under the hood you'll have a numpy array of pointers, so don't expect the performance you get when using a primitive datatype.
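To give a rough sense of that tradeoff (a minimal sketch with made-up sizes; the actual timings depend on your machine and numpy version):

import numpy as np
import timeit

n = 100_000
fixed = np.array(["red"] * n)                 # fixed-width dtype '<U3', one contiguous buffer
boxed = np.array(["red"] * n, dtype=object)   # pointers to Python str objects

# Elementwise comparison works on both, but the object array has to
# dereference and compare Python objects one by one.
print(timeit.timeit(lambda: fixed == "blue", number=100))
print(timeit.timeit(lambda: boxed == "blue", number=100))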
A marginal improvement on your current approach (which is potentially very wasteful in space):
import numpy as np
a = np.array([1,2,3])
b = np.array(['red','red','red'])
replacement = "blue"
# b.dtype.itemsize is in bytes (4 bytes per character for numpy's UCS-4
# unicode storage), so divide to get the current width in characters
current_width = b.dtype.itemsize // np.dtype('U1').itemsize
b = b.astype('<U{}'.format(max(len(replacement), current_width)))
b[a<3] = replacement
print(b)
This accounts for strings already in the array, so the allocated space only increases if the replacement is longer than all existing strings in the array.
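For example (a small sketch with a made-up array that already contains a longer string), the computed width stays at the existing maximum and no extra space is allocated:

import numpy as np

b = np.array(['red', 'yellow', 'red'])   # dtype '<U6' because of "yellow"
replacement = "blue"

# current width in characters (itemsize is in bytes, 4 bytes per character)
current_width = b.dtype.itemsize // np.dtype('U1').itemsize
b = b.astype('<U{}'.format(max(len(replacement), current_width)))
print(b.dtype)   # still <U6 -- "blue" already fits

b[0] = replacement
print(b)         # ['blue' 'yellow' 'red']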
If you construct such an array, the dtype looks like:
>>> b
array(['red', 'red', 'red'], dtype='<U3')
This means that the strings can be at most 3 characters long; if you assign longer strings, they are truncated.
You can change the data type to make the maximum length longer, for example:
b2 = b.astype('<U10')
So now we have an array that can store strings of up to 10 characters. Note however that if you make the maximum length larger, the size of the array in memory will increase.
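For instance, picking up the arrays from the question (a minimal sketch), the widened array holds "blue" without truncation, at the cost of a larger per-element size:

import numpy as np

a = np.array([1, 2, 3])
b = np.array(['red', 'red', 'red'])   # dtype '<U3'
b2 = b.astype('<U10')                 # dtype '<U10'

b2[a < 3] = "blue"
print(b2)                             # ['blue' 'blue' 'red'] -- no truncation
print(b.itemsize, b2.itemsize)        # 12 40 (4 bytes per character)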