I have a numpy array with these dimensions:

data.shape (categories, models, types, events): (10, 11, 50, 100)

Now I want to sample with replacement along the innermost axis (the 100 events) only. For a single array such as this:
data[0][0][0]
array([ 40.448624 , 39.459843 , 33.76762 , 38.944622 , 21.407362 ,
35.55499 , 68.5111 , 16.512974 , 21.118315 , 18.447166 ,
16.026619 , 21.596252 , 41.798622 , 63.01645 , 46.886642 ,
68.874756 , 17.472408 , 53.015724 , 85.41213 , 59.388977 ,
17.352108 , 61.161705 , 23.430847 , 20.203123 , 22.73194 ,
77.40547 , 43.02974 , 29.745787 , 21.50163 , 13.820962 ,
46.91466 , 41.43656 , 18.008326 , 13.122162 , 59.79936 ,
94.555305 , 24.798452 , 30.362497 , 13.629236 , 10.792178 ,
35.298515 , 20.904285 , 15.409604 , 20.567234 , 46.376335 ,
13.82727 , 17.970661 , 18.408686 , 21.987917 , 21.30094 ,
24.26776 , 27.399046 , 49.16879 , 21.831453 , 66.577 ,
15.524615 , 18.091696 , 24.346598 , 24.709772 , 19.068447 ,
24.221592 , 25.244864 , 52.865868 , 22.860783 , 23.586731 ,
18.928782 , 21.960285 , 74.77856 , 15.176119 , 20.795431 ,
14.3638935, 35.937237 , 29.993324 , 30.848495 , 48.145336 ,
38.02541 , 101.15249 , 49.801117 , 38.123184 , 12.041505 ,
18.788296 , 20.53382 , 31.20367 , 19.76104 , 92.56279 ,
41.62944 , 23.53344 , 18.967432 , 14.781404 , 20.02018 ,
27.736559 , 16.108913 , 44.935062 , 12.629299 , 34.65672 ,
20.60169 , 21.779675 , 31.585844 , 23.768578 , 92.463196 ],
dtype=float32)
I can sample with replacement using np.random.choice(data[0][0][0], 100), which I will be doing thousands of times:
array([ 13.629236, 92.56279 , 21.960285, 20.567234, 21.50163 ,
16.026619, 20.203123, 23.430847, 16.512974, 15.524615,
18.967432, 22.860783, 85.41213 , 21.779675, 23.586731,
24.26776 , 66.577 , 20.904285, 19.068447, 21.960285,
68.874756, 31.585844, 23.586731, 61.161705, 101.15249 ,
59.79936 , 16.512974, 43.02974 , 16.108913, 24.26776 ,
23.430847, 14.781404, 40.448624, 13.629236, 24.26776 ,
19.068447, 16.026619, 16.512974, 16.108913, 77.40547 ,
12.629299, 31.585844, 24.798452, 18.967432, 14.781404,
23.430847, 49.16879 , 18.408686, 22.73194 , 10.792178,
16.108913, 18.967432, 12.041505, 85.41213 , 41.62944 ,
31.20367 , 17.970661, 29.745787, 39.459843, 10.792178,
43.02974 , 21.831453, 21.50163 , 24.798452, 30.362497,
21.50163 , 18.788296, 20.904285, 17.352108, 41.798622,
18.447166, 16.108913, 19.068447, 61.161705, 52.865868,
20.795431, 85.41213 , 49.801117, 13.82727 , 18.928782,
41.43656 , 46.886642, 92.56279 , 41.62944 , 18.091696,
20.60169 , 48.145336, 20.53382 , 40.448624, 20.60169 ,
23.586731, 22.73194 , 92.56279 , 94.555305, 22.73194 ,
17.352108, 46.886642, 27.399046, 18.008326, 15.176119],
dtype=float32)
But since np.random.choice has no axis argument, how can I do this for all of the arrays at once (i.e., across (categories, models, types))? Or is looping through them the only option?
You can draw the indices of your samples and then apply fancy indexing:

data     -> (10, 11, 50, 100)
databoot -> (5, 10, 11, 50, 100)
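Something along these lines, where the leading 5 in databoot's shape is the number of bootstrap replicates (here called nboot; the variable names other than data and databoot are assumptions):

```python
import numpy as np

categories, models, types, events = 10, 11, 50, 100
nboot = 5  # number of bootstrap replicates -> leading axis of databoot

rng = np.random.default_rng()
data = rng.random((categories, models, types, events), dtype=np.float32)

# draw all resampling indices at once: for every replicate and every
# (category, model, type), pick `events` positions with replacement
idx = rng.integers(0, events, size=(nboot, categories, models, types, events))

# index arrays for the leading axes, shaped to broadcast against idx
I = np.arange(categories)[:, None, None, None]
J = np.arange(models)[:, None, None]
K = np.arange(types)[:, None]

databoot = data[I, J, K, idx]  # shape (5, 10, 11, 50, 100)
```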
Here is a small explicit example for concreteness. The fields are labeled with "category" (A or B), "model" (a or b), and "type" (1 or 2) to make it easy to verify that sampling does preserve these.
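A sketch of what such an example could look like (the tiny shapes, the label format, and the use of np.take_along_axis are my assumptions):

```python
import numpy as np

# 2 categories x 2 models x 2 types x 4 events; each entry encodes its
# own (category, model, type, event) so resampling is easy to verify
labels = np.array([[[[f"{c}{m}{t}-{e}" for e in range(4)]
                     for t in "12"]
                    for m in "ab"]
                   for c in "AB"])

rng = np.random.default_rng(0)
idx = rng.integers(0, labels.shape[-1], size=labels.shape)
boot = np.take_along_axis(labels, idx, axis=-1)  # resample events only

# every entry of boot[0, 0, 0] still starts with "Aa1"; only the
# event part "-e" varies, so category/model/type are preserved
print(boot[0, 0, 0])
```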
The fastest/simplest answer turns out to be based on indexing a flattened version of your array:
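A sketch of that idea (the function body is a reconstruction; the name resampFlat matches the timings below, and reps is the number of resamples per distribution):

```python
import numpy as np

def resampFlat(arr, reps):
    n = arr.shape[-1]

    # flat index of the start of each length-n block, one entry per element
    shift = np.repeat(np.arange(0, arr.size, n), n).reshape(arr.shape)

    # random offsets within each block plus the block starts give
    # valid flat indices into arr.ravel()
    idx = np.random.randint(0, n, (reps,) + arr.shape) + shift

    return arr.ravel()[idx]  # shape: (reps,) + arr.shape
```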
Timings confirm that this is the fastest answer.
Timings
I tested out the above resampFlat function alongside a simpler for loop based solution and a solution based on Paul Panzer's fancy indexing approach:
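Sketches of those two functions (the bodies are reconstructions for illustration; only the names resampFor and resampFancyIdx come from the original):

```python
def resampFor(arr, reps):
    # loop over every (category, model, type) and resample its events
    out = np.empty((reps,) + arr.shape, dtype=arr.dtype)
    for ijk in np.ndindex(*arr.shape[:-1]):
        out[(slice(None),) + ijk] = np.random.choice(
            arr[ijk], size=(reps, arr.shape[-1]))
    return out

def resampFancyIdx(arr, reps):
    # draw all indices up front, then fancy-index as in Paul Panzer's answer
    idx = np.random.randint(0, arr.shape[-1], (reps,) + arr.shape)
    I = np.arange(arr.shape[0])[:, None, None, None]
    J = np.arange(arr.shape[1])[:, None, None]
    K = np.arange(arr.shape[2])[:, None]
    return arr[I, J, K, idx]
```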
I tested with the following data:
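Roughly like this (the shape comes from the question; the values, dtype, and replicate count are assumptions):

```python
data = np.random.rand(10, 11, 50, 100).astype(np.float32)

# e.g. in IPython:
# %timeit resampFlat(data, 100)
# %timeit resampFor(data, 100)
# %timeit resampFancyIdx(data, 100)
```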
The timing results ranked the array flattening approach fastest, Paul's fancy indexing second, and the for loop approach last.
Contrary to my expectations, resampFancyIdx beat resampFor, and I actually had to work fairly hard to come up with something better. At this point I would really like a better explanation of how fancy indexing works at the C level, and why it's so performant.