I have a NumPy matrix and I want to convert it to a 3D array with the one-hot encoding of the elements as the third dimension. Is there a way to do this without looping over each row? For example,
a=[[1,3],
[2,4]]
should be made into
b=[[[1,0,0,0], [0,0,1,0]],
[[0,1,0,0], [0,0,0,1]]]
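Approach #1
Here's a cheeky one-liner that abuses broadcasted comparison: give the array a new trailing axis, compare it against the range of class values, and cast the boolean result to `int`.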
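A minimal sketch of such a broadcasted comparison for the 1-based values in the question might look like the following. The function name `onehot_broadcast` is a hypothetical label chosen for illustration, and the sketch assumes the classes run from 1 to `a.max()`.

```python
import numpy as np

def onehot_broadcast(a):
    """One-hot encode 1-based integer labels via broadcasted comparison."""
    a = np.asarray(a)
    # Assumption: classes are 1, 2, ..., a.max().
    classes = np.arange(1, a.max() + 1)
    # a[..., None] has shape a.shape + (1,); comparing it against `classes`
    # of shape (n_classes,) broadcasts to a.shape + (n_classes,).
    return (a[..., None] == classes).astype(int)

a = np.array([[1, 3],
              [2, 4]])
b = onehot_broadcast(a)
print(b.shape)  # (2, 2, 4)
print(b)
```

The comparison materializes a full boolean array of the output shape before the cast, so memory use grows with the number of classes.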
For 0-based indexing, the comparison runs against a range starting at 0 instead, so no offset by one is needed.
If the one-hot encoding is to cover the range from the minimum to the maximum value, offset the array by its minimum value and then feed the result to the proposed method for 0-based indexing. This applies to the rest of the approaches discussed later in this post as well.
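As a sketch of the 0-based variant and the minimum offset described above (the name `onehot_broadcast0` and the sample values are assumptions made for illustration):

```python
import numpy as np

def onehot_broadcast0(a):
    """One-hot encode 0-based integer labels via broadcasted comparison."""
    a = np.asarray(a)
    # Classes are 0, 1, ..., a.max().
    return (a[..., None] == np.arange(a.max() + 1)).astype(int)

# Hypothetical input whose values span 6..9 rather than starting at 0.
a = np.array([[6, 9],
              [7, 8]])
shifted = a - a.min()          # values now span 0..3
b_min = onehot_broadcast0(shifted)
print(b_min.shape)  # (2, 2, 4)
print(b_min)
```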
If you are okay with a boolean array with `True` for 1's and `False` for 0's, you can skip the `.astype(int)` conversion.
Approach #2
We can also initialize a zeros array and index into the output with advanced indexing: for 0-based indexing, each element's value selects the position along the last axis that gets set to 1.
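One way to sketch this initialization-plus-indexing idea is shown below. It is not necessarily the helper the answer used: the name `onehot_initialization` is a label for illustration, and `np.indices(..., sparse=True)` is swapped in here to build the index grids for the leading axes.

```python
import numpy as np

def onehot_initialization(a):
    """One-hot encode 0-based integer labels by indexing into a zeros array."""
    a = np.asarray(a)
    ncols = int(a.max()) + 1
    out = np.zeros(a.shape + (ncols,), dtype=int)
    # Open index grids for every axis of `a`, followed by the values of `a`
    # themselves as the index into the new last axis.
    idx = np.indices(a.shape, sparse=True) + (a,)
    out[idx] = 1
    return out

a = np.array([[0, 2],
              [1, 3]])
out = onehot_initialization(a)
print(out.shape)  # (2, 2, 4)
print(out)
```

Only `a.size` entries are written after the zeros array is allocated, instead of comparing every element against every class.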
This should be considerably more performant when dealing with a larger range of values.
For 1-based indexing, simply feed in `a-1` as the input.
Approach #3 : Sparse matrix solution
Now, if you are looking for a sparse array as output: AFAIK, scipy's built-in sparse matrices support only 2D formats, so you can get a sparse output that is a reshaped version of the output shown earlier, with the first two axes merged and the third axis kept intact. For 0-based indexing, the implementation treats each flattened element as the column index of the single nonzero entry in its row.
Again, for 1-based indexing, simply feed in `a-1` as the input.
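A minimal sketch of such a sparse construction, assuming SciPy is installed and the labels are 0-based, could use `scipy.sparse.coo_matrix`. The function name `onehot_sparse` is hypothetical.

```python
import numpy as np
from scipy.sparse import coo_matrix

def onehot_sparse(a):
    """Return a 2D sparse one-hot matrix of shape (a.size, n_classes)."""
    a = np.asarray(a)
    n = a.size
    cols = a.ravel()                  # column index of the 1 in each row
    rows = np.arange(n)               # one row per flattened element
    data = np.ones(n, dtype=int)
    return coo_matrix((data, (rows, cols)), shape=(n, int(cols.max()) + 1))

a = np.array([[0, 2],
              [1, 3]])
s = onehot_sparse(a)
print(s.shape)      # (4, 4)
print(s.toarray())
```

Row `i` of the result corresponds to the `i`-th element of `a.ravel()`, which is the merged first-two-axes layout described above.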
This would be much better than the previous two approaches if you are okay with having a sparse output.
Runtime comparison for 0-based indexing
Case #1 :
Case #2 :
Squeezing out best performance
To squeeze out the best performance, we could modify approach #2 to index into a 2D-shaped output array and use the `uint8` dtype for memory efficiency, which also leads to much faster assignments.
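The modification described above can be sketched as follows, assuming 0-based labels. The function name `onehot_initialization_v2` is used here for illustration.

```python
import numpy as np

def onehot_initialization_v2(a):
    """One-hot encode 0-based labels via a flat 2D uint8 buffer."""
    a = np.asarray(a)
    ncols = int(a.max()) + 1
    # 2D buffer: one row per element of `a`, one column per class.
    out = np.zeros((a.size, ncols), dtype=np.uint8)
    out[np.arange(a.size), a.ravel()] = 1
    # Restore the leading axes of `a`; the class axis stays last.
    return out.reshape(a.shape + (ncols,))

a = np.array([[0, 2],
              [1, 3]])
out = onehot_initialization_v2(a)
print(out.dtype, out.shape)  # uint8 (2, 2, 4)
```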
If you are trying to create a one-hot tensor for your machine learning models (and you have `tensorflow` or `keras` installed), then you can use the `one_hot` function from https://www.tensorflow.org/api_docs/python/tf/keras/backend/one_hot or https://www.tensorflow.org/api_docs/python/tf/one_hot
It's what I'm using, and it works well for high-dimensional data.
Here's example usage:
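A minimal sketch using `tf.one_hot`, assuming TensorFlow 2.x with eager execution and the 1-based values from the question (so they are shifted to 0-based first):

```python
import numpy as np
import tensorflow as tf

a = np.array([[1, 3],
              [2, 4]])
# tf.one_hot expects 0-based class indices, so shift the 1-based values.
b = tf.one_hot(a - 1, depth=4)
print(b.shape)    # (2, 2, 4)
print(b.numpy())  # float32 by default; pass dtype=... to change it
```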
Edit: I just realized that my answer is covered already in the accepted answer. Unfortunately, as an unregistered user, I cannot delete it any more.
As an addendum to the accepted answer: if you have a very small number of classes to encode and you can accept `np.bool` arrays as output, I found a variant that returns the boolean array directly to be even slightly faster (timed with 10 classes).
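One way such a boolean variant could look, as a sketch under the assumption of 0-based labels rather than this answer's exact code (the name `onehot_bool` is hypothetical):

```python
import numpy as np

def onehot_bool(a):
    """Boolean one-hot encoding of 0-based labels, with no integer cast."""
    a = np.asarray(a)
    # The broadcasted comparison already yields a boolean array of shape
    # a.shape + (n_classes,), so it is returned as-is.
    return a[..., None] == np.arange(a.max() + 1)

a = np.array([[0, 2],
              [1, 3]])
mask = onehot_bool(a)
print(mask.dtype, mask.shape)  # bool (2, 2, 4)
```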
This changes, however, when the number of classes increases to 100.
So, depending on your problem, either might be the faster version.