可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

I have a numpy array of these dimensions

data.shape (categories, models, types, events): (10, 11, 50, 100)

Now I want to do sample with replacement in the innermost array (100) only. For a single array such as this:

data[0][0][0]

array([ 40.448624 , 39.459843 , 33.76762 , 38.944622 , 21.407362 , 35.55499 , 68.5111 , 16.512974 , 21.118315 , 18.447166 , 16.026619 , 21.596252 , 41.798622 , 63.01645 , 46.886642 , 68.874756 , 17.472408 , 53.015724 , 85.41213 , 59.388977 , 17.352108 , 61.161705 , 23.430847 , 20.203123 , 22.73194 , 77.40547 , 43.02974 , 29.745787 , 21.50163 , 13.820962 , 46.91466 , 41.43656 , 18.008326 , 13.122162 , 59.79936 , 94.555305 , 24.798452 , 30.362497 , 13.629236 , 10.792178 , 35.298515 , 20.904285 , 15.409604 , 20.567234 , 46.376335 , 13.82727 , 17.970661 , 18.408686 , 21.987917 , 21.30094 , 24.26776 , 27.399046 , 49.16879 , 21.831453 , 66.577 , 15.524615 , 18.091696 , 24.346598 , 24.709772 , 19.068447 , 24.221592 , 25.244864 , 52.865868 , 22.860783 , 23.586731 , 18.928782 , 21.960285 , 74.77856 , 15.176119 , 20.795431 , 14.3638935, 35.937237 , 29.993324 , 30.848495 , 48.145336 , 38.02541 , 101.15249 , 49.801117 , 38.123184 , 12.041505 , 18.788296 , 20.53382 , 31.20367 , 19.76104 , 92.56279 , 41.62944 , 23.53344 , 18.967432 , 14.781404 , 20.02018 , 27.736559 , 16.108913 , 44.935062 , 12.629299 , 34.65672 , 20.60169 , 21.779675 , 31.585844 , 23.768578 , 92.463196 ], dtype=float32)

I can do sample with replacement using this: np.random.choice(data[0][0][0], 100), which I will be doing thousands of times.

array([ 13.629236, 92.56279 , 21.960285, 20.567234, 21.50163 , 16.026619, 20.203123, 23.430847, 16.512974, 15.524615, 18.967432, 22.860783, 85.41213 , 21.779675, 23.586731, 24.26776 , 66.577 , 20.904285, 19.068447, 21.960285, 68.874756, 31.585844, 23.586731, 61.161705, 101.15249 , 59.79936 , 16.512974, 43.02974 , 16.108913, 24.26776 , 23.430847, 14.781404, 40.448624, 13.629236, 24.26776 , 19.068447, 16.026619, 16.512974, 16.108913, 77.40547 , 12.629299, 31.585844, 24.798452, 18.967432, 14.781404, 23.430847, 49.16879 , 18.408686, 22.73194 , 10.792178, 16.108913, 18.967432, 12.041505, 85.41213 , 41.62944 , 31.20367 , 17.970661, 29.745787, 39.459843, 10.792178, 43.02974 , 21.831453, 21.50163 , 24.798452, 30.362497, 21.50163 , 18.788296, 20.904285, 17.352108, 41.798622, 18.447166, 16.108913, 19.068447, 61.161705, 52.865868, 20.795431, 85.41213 , 49.801117, 13.82727 , 18.928782, 41.43656 , 46.886642, 92.56279 , 41.62944 , 18.091696, 20.60169 , 48.145336, 20.53382 , 40.448624, 20.60169 , 23.586731, 22.73194 , 92.56279 , 94.555305, 22.73194 , 17.352108, 46.886642, 27.399046, 18.008326, 15.176119], dtype=float32)

But since there is no axis in np.random.choice, how can I do it for all arrays (i.e. (categories, models, types))? Or is looping through it the only option?

回答1:

The fastest/simplest answer turns out to be based on indexing a flattened version of your array:

def resampFlat(arr, reps):
    n = arr.shape[-1]

    # create an array to shift random indexes as needed
    shift = np.repeat(np.arange(0, arr.size, n), n).reshape(arr.shape)

    # get a flat view of the array
    arrflat = arr.ravel()
    # sample the array by generating random ints and shifting them appropriately
    return np.array([arrflat[np.random.randint(0, n, arr.shape) + shift] 
                     for i in range(reps)])

Timings confirm that this is the fastest answer.

Timings

I tested out the above resampFlat function alongside a simpler for loop based solution:

def resampFor(arr, reps):
    # store the shape for the return value
    shape = arr.shape
    # flatten all dimensions of arr except the last
    arr = arr.reshape(-1, arr.shape[-1])
    # preallocate the return value
    ret = np.empty((reps, *arr.shape), dtype=arr.dtype)
    # generate the indices of the resampled values
    idxs = np.random.randint(0, arr.shape[-1], (reps, *arr.shape))

    for rep,idx in zip(ret, idxs):
        # iterate over the resampled replicates
        for row,rowrep,i in zip(arr, rep, idx):
            # iterate over the event arrays within a replicate
            rowrep[...] = row[i]

    # give the return value the appropriate shape
    return ret.reshape((reps, *shape))

and a solution based on Paul Panzer's fancy indexing approach:

def resampFancyIdx(arr, reps):
    idx = np.random.randint(0, arr.shape[-1], (reps, *data.shape))
    _, I, J, K, _ = np.ogrid[tuple(map(slice, (0, *arr.shape[:-1], 0)))]

    return arr[I, J, K, idx]

I tested with the following data:

shape = ((10, 11, 50, 100))
data = np.arange(np.prod(shape)).reshape(shape)

Here's the results from the array flattening approach:

%%timeit
resampFlat(data, 100)

1.25 s ± 9.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

the results from the for loop approach:

%%timeit
resampFor(data, 100)

1.66 s ± 16.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

and from Paul's fancy indexing:

%%timeit
resampFancyIdx(data, 100)

1.42 s ± 16.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Contrary to my expectations, resampFancyIdx beat resampFor, and I actually had to work fairly hard to come up with something better. At this point I would really like a better explanation of how fancy indexing works at the C-level, and why it's so performant.

回答2:

You can draw the indices of your samples and then apply fancy indexing:

>>> import numpy as np
>>> 
>>> (categories, models, types, events) = (10, 11, 50, 100)
>>> data = np.random.random((categories, models, types, events))
>>> N_samples = 1000
>>> 
>>> idx = np.random.randint(0, events, (categories, models, types, N_samples))
>>> I, J, K, _ = np.ogrid[:categories, :models, :types, :0]
>>> 
>>> resampled = data[I, J, K, idx]

A small explicit example for concreteness. The fields are labeled with "category" (A or B), "model" (a or b) and "type" (1 or 2) to make it easy to verify that sampling does preserve these.

>>> I, J, K, L = np.ix_(*(np.array(list(x), 'O') for x in ('AB', 'ab', '12', 'xyzw')))
>>> data = I+J+K+L
>>> data
array([[[['Aa1x', 'Aa1y', 'Aa1z', 'Aa1w'],
         ['Aa2x', 'Aa2y', 'Aa2z', 'Aa2w']],

        [['Ab1x', 'Ab1y', 'Ab1z', 'Ab1w'],
         ['Ab2x', 'Ab2y', 'Ab2z', 'Ab2w']]],


       [[['Ba1x', 'Ba1y', 'Ba1z', 'Ba1w'],
         ['Ba2x', 'Ba2y', 'Ba2z', 'Ba2w']],

        [['Bb1x', 'Bb1y', 'Bb1z', 'Bb1w'],
         ['Bb2x', 'Bb2y', 'Bb2z', 'Bb2w']]]], dtype=object)
>>> N_samples = 3
>>> 
>>> idx = np.random.randint(0, data.shape[-1], (N_samples, *data.shape))
>>> _, I, J, K, _ = np.ogrid[tuple(map(slice, (0, *data.shape[:-1], 0)))]
>>> 
>>> resampled = data[I, J, K, idx]
>>> res
ResourceWarning  resampled        
>>> resampled
array([[[[['Aa1z', 'Aa1y', 'Aa1y', 'Aa1x'],
          ['Aa2y', 'Aa2z', 'Aa2z', 'Aa2z']],

         [['Ab1w', 'Ab1z', 'Ab1y', 'Ab1x'],
          ['Ab2y', 'Ab2w', 'Ab2y', 'Ab2w']]],


        [[['Ba1z', 'Ba1y', 'Ba1y', 'Ba1x'],
          ['Ba2x', 'Ba2x', 'Ba2z', 'Ba2x']],

         [['Bb1x', 'Bb1x', 'Bb1y', 'Bb1z'],
          ['Bb2y', 'Bb2w', 'Bb2y', 'Bb2z']]]],



       [[[['Aa1x', 'Aa1w', 'Aa1x', 'Aa1z'],
          ['Aa2y', 'Aa2y', 'Aa2x', 'Aa2z']],

         [['Ab1y', 'Ab1x', 'Ab1w', 'Ab1z'],
          ['Ab2w', 'Ab2x', 'Ab2w', 'Ab2w']]],


        [[['Ba1x', 'Ba1z', 'Ba1x', 'Ba1z'],
          ['Ba2x', 'Ba2y', 'Ba2y', 'Ba2w']],

         [['Bb1z', 'Bb1w', 'Bb1y', 'Bb1w'],
          ['Bb2w', 'Bb2x', 'Bb2w', 'Bb2z']]]],



       [[[['Aa1w', 'Aa1w', 'Aa1w', 'Aa1y'],
          ['Aa2z', 'Aa2x', 'Aa2y', 'Aa2x']],

         [['Ab1z', 'Ab1z', 'Ab1x', 'Ab1y'],
          ['Ab2w', 'Ab2x', 'Ab2x', 'Ab2y']]],


        [[['Ba1w', 'Ba1x', 'Ba1y', 'Ba1y'],
          ['Ba2z', 'Ba2x', 'Ba2x', 'Ba2x']],

         [['Bb1z', 'Bb1w', 'Bb1x', 'Bb1x'],
          ['Bb2z', 'Bb2x', 'Bb2w', 'Bb2z']]]]], dtype=object)

回答3:

databoot = []
for i in range(5):
    idx = np.random.choice(100, 100)
    databoot.append(data[:,:,:,idx])

shape of databoot -> (5, 10, 11, 50, 100)
shape of data -> (10, 11, 50, 100)

How can I bootstrap the innermost array of a numpy

问题:

回答1:

Timings

回答2:

回答3:

收藏的人(0)

How can I bootstrap the innermost array of a numpy

问题:

回答1:

Timings

回答2:

回答3:

收藏的人(0)

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮