Is there a way to reshape an array that does not m

Posted 2019-05-10 17:37

Question:

As a simplified example, suppose I have a dataset composed of 40 sorted values. The values of this example are all integers, though this is not necessarily the case for the actual dataset.

import numpy as np
data = np.linspace(1,40,40)

I am trying to find the maximum value inside the dataset for certain window sizes. The formula that computes the window sizes yields a pattern that is best handled with arrays (in my opinion). For simplicity's sake, let's say the indices denoting the window sizes are the list [1,2,3,4,5]; this corresponds to window sizes of [2,4,8,16,32] (the pattern is 2**index).

## this code looks long because I've provided docstrings
## just in case the explanation was unclear

def shapeshifter(num_col, my_array=data):
    """
    This function reshapes an array to have 'num_col' columns, where 
    'num_col' corresponds to index.
    """
    return my_array.reshape(-1, num_col)

def looper(num_col, my_array=data):
    """
    This function calls 'shapeshifter' and returns a list of the 
    MAXimum values of each row in 'my_array' for 'num_col' columns. 
    The length of each row (or the number of columns per row if you 
    prefer) denotes the size of each window.
    EX:
        num_col = 2
        ==> window_size = 2
        ==> check max( data[1], data[2] ),
                  max( data[3], data[4] ),
                  max( data[5], data[6] ), 
                               .
                               .
                               .
                  max( data[39], data[40] )
            for k rows, where k = len(my_array)//num_col
    """
    # pass the caller's array through instead of always using the global 'data'
    my_array = shapeshifter(num_col=num_col, my_array=my_array)
    rows = [my_array[index] for index in range(len(my_array))]
    res = []
    for index in range(len(rows)):
        res.append( max(rows[index]) )
    return res

So far, the code is fine. I checked it with the following:

check1 = looper(2)
check2 = looper(4)
print(check1)
>> [2.0, 4.0, ..., 38.0, 40.0] 
print(len(check1))
>> 20
print(check2)
>> [4.0, 8.0, ..., 36.0, 40.0] 
print(len(check2))
>> 10

So far so good. Now here is my problem.

def metalooper(col_ls, my_array=data):
    """
    This function calls 'looper' - which calls
    'shapeshifter' - for every 'col' in 'col_ls'.

    EX:
        j_list = [1,2,3,4,5]
        ==> col_ls = [2,4,8,16,32]
        ==> looper(2), looper(4),
            looper(8), ..., looper(32)
        ==> shapeshifter(2), shapeshifter(4),
            shapeshifter(8), ..., shapeshifter(32)
                such that looper(2^j) ==> shapeshifter(2^j)
                for j in j_list
    """
    res = []
    for col in col_ls:
        res.append(looper(num_col=col, my_array=my_array))
    return res

j_list = [1,2,3,4,5]
col_ls = [2**j for j in j_list]   # [2,4,8,16,32]
check3 = metalooper(col_ls)

Running the code above provides this error:

ValueError: total size of new array must be unchanged

With 40 data points, the array can be reshaped into 20 rows of 2 columns, 10 rows of 4 columns, or 5 rows of 8 columns. At 16 columns, however, the array cannot be reshaped without clipping data, since 40/16 is not an integer. I believe this is the problem with my code, but I do not know how to fix it.
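
The divisibility condition can be checked directly. This is a minimal sketch, assuming the same 40-point data and the window sizes listed earlier; the names fits and leftover are introduced here purely for illustration.

```python
import numpy as np

data = np.linspace(1, 40, 40)
window_sizes = [2, 4, 8, 16, 32]

# reshape(-1, w) only succeeds when w divides the number of elements exactly
fits = {w: data.size % w == 0 for w in window_sizes}
print(fits)

# number of trailing values that do not fill a complete window
leftover = {w: data.size % w for w in window_sizes}
print(leftover)
```

For window sizes 16 and 32, eight values are left over, which is why reshape raises the error.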

I am hoping there is a way to cut off the trailing values that do not fill a complete window. If this is not possible, I am hoping I can append zeros so that the padded array can be reshaped, and then remove the zeros afterwards. Or maybe even some complicated if/try/break block. What are some ways around this problem?

Answer 1:

I think this will give you what you want in one step:

def windowFunc(a, window, f = np.max):
    return np.array([f(i) for i in np.split(a, range(window, a.size, window))])

With the default f, that will give you an array of the maximums for your windows.
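
As a quick check, here is a sketch that applies windowFunc to the 40-point data from the question for each window size. The dictionary name result is only illustrative.

```python
import numpy as np

def windowFunc(a, window, f=np.max):
    # split at every multiple of `window`; the last piece may be shorter
    return np.array([f(i) for i in np.split(a, range(window, a.size, window))])

data = np.linspace(1, 40, 40)
result = {w: windowFunc(data, w) for w in [2, 4, 8, 16, 32]}

# window 16 covers data[0:16], data[16:32] and the short tail data[32:40]
print(result[16])
print(result[32])
```

Note that the final, incomplete window is kept rather than dropped, so window 16 yields three maxima instead of two.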

More generally, np.split with a range of split points divides the array into a (possibly ragged) list of arrays:

def shapeshifter(num_col, my_array=data):    
    return np.split(my_array, range(num_col, my_array.size, num_col))

You need a list of arrays because a 2D array can't be ragged (every row needs the same number of columns).
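
To see the raggedness concretely, this small sketch (using the question's 40-point data and a window of 16, with illustrative names pieces and lengths) prints the length of each piece:

```python
import numpy as np

data = np.linspace(1, 40, 40)

# split points at 16 and 32 give pieces of length 16, 16 and 8
pieces = np.split(data, range(16, data.size, 16))
lengths = [p.size for p in pieces]
print(lengths)
```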

If you really want to pad with zeros, you can use np.pad (spelled np.lib.pad in older code):

def shapeshifter(num_col, my_array=data):
    # zeros needed to reach the next multiple of num_col (0 if it already fits)
    n_pad = (-my_array.size) % num_col
    return np.pad(my_array, (0, n_pad), 'constant', constant_values=0).reshape(-1, num_col)
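
A self-contained sketch of the padding approach follows. The helper name pad_and_reshape is hypothetical, and the pad length is computed so that no extra row is added when the array already fits. Keep in mind that the padded zeros take part in any reduction: with np.max they only go unnoticed as long as the real values in the last row are not all negative.

```python
import numpy as np

def pad_and_reshape(my_array, num_col):
    # (-size) % num_col is 0 when the array already divides evenly
    n_pad = (-my_array.size) % num_col
    padded = np.pad(my_array, (0, n_pad), mode='constant', constant_values=0)
    return padded.reshape(-1, num_col)

data = np.linspace(1, 40, 40)
grid = pad_and_reshape(data, 16)   # 40 values + 8 zeros -> 3 rows of 16
row_max = grid.max(axis=1)
print(grid.shape, row_max)
```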

Warning:

It is also technically possible to use, for example, a.resize(32,2), which resizes a in place into an ndarray padded with zeros (as you requested). But there are some big caveats:

  1. You would need to calculate the second axis because -1 tricks don't work with resize.
  2. If the original array a is referenced by anything else, a.resize will fail with the following error:

    ValueError: cannot resize an array that references or is referenced
    by another array in this way.  Use the resize function
    
  3. The resize function (i.e. np.resize(a, new_shape)) is not equivalent to a.resize: instead of padding with zeros, it fills the extra entries with repeated copies of a, looping back to the beginning.

Since you seem to want to reference a by a number of windows, a.resize isn't very useful. But it's a rabbit hole that's easy to fall into.
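
The difference between the method and the function can be seen on a tiny array. This is only a sketch: the five-element array and the names padded and wrapped are made up for illustration, and refcheck=False is passed so that the in-place resize does not trip the reference check from caveat 2.

```python
import numpy as np

a = np.arange(1, 6)            # [1 2 3 4 5]

# ndarray.resize works in place and fills the new entries with zeros
padded = a.copy()
padded.resize((2, 4), refcheck=False)
print(padded)

# np.resize returns a new array and repeats `a` to fill the new entries
wrapped = np.resize(a, (2, 4))
print(wrapped)
```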

EDIT:

Looping through a list is slow. If your input is long and windows are small, the windowFunc above will bog down in the for loops. This should be more efficient:

def windowFunc2(a, window, f = np.max):
    tail = - (a.size % window)
    if tail == 0:
        return f(a.reshape(-1, window), axis = -1)
    else:
        body = a[:tail].reshape(-1, window)
        return np.r_[f(body, axis = -1), f(a[tail:])]
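
For reference, a short sketch of windowFunc2 on the question's data, once with a window that divides 40 evenly and once with one that does not. The names full and ragged are illustrative.

```python
import numpy as np

def windowFunc2(a, window, f=np.max):
    tail = -(a.size % window)           # negative count of leftover values
    if tail == 0:
        return f(a.reshape(-1, window), axis=-1)
    body = a[:tail].reshape(-1, window)  # complete windows only
    return np.r_[f(body, axis=-1), f(a[tail:])]

data = np.linspace(1, 40, 40)
full = windowFunc2(data, 8)      # 5 complete windows
ragged = windowFunc2(data, 16)   # 2 complete windows plus a tail of 8 values
print(full, ragged)
```

As with windowFunc, the leftover values are reduced as one extra, shorter window rather than being discarded.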


Answer 2:

Here's a generalized way to reshape with truncation:

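def reshape_and_truncate(arr, shape):
    # product of the explicit dimensions; int() keeps it usable as a slice
    # bound even when shape is (-1,) and the product of an empty list is 1.0
    desired_size_factor = int(np.prod([n for n in shape if n != -1]))
    if -1 in shape:  # implicit array size
        # largest multiple of the explicit dimensions that fits in arr
        desired_size = arr.size // desired_size_factor * desired_size_factor
    else:
        desired_size = desired_size_factor
    # drop the trailing values that do not fit, then reshape
    return arr.flat[:desired_size].reshape(shape)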

Your shapeshifter could use this in place of reshape.
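
A sketch of how this might be applied to the window sizes from the question. The dictionary maxima is an illustrative name, and the row-wise maximum via max(axis=1) stands in for the question's looper.

```python
import numpy as np

def reshape_and_truncate(arr, shape):
    desired_size_factor = int(np.prod([n for n in shape if n != -1]))
    if -1 in shape:  # implicit array size
        desired_size = arr.size // desired_size_factor * desired_size_factor
    else:
        desired_size = desired_size_factor
    return arr.flat[:desired_size].reshape(shape)

data = np.linspace(1, 40, 40)

# trailing values that do not fill a complete window are dropped
maxima = {w: reshape_and_truncate(data, (-1, w)).max(axis=1)
          for w in [2, 4, 8, 16, 32]}
print(maxima[16], maxima[32])
```

Unlike the np.split approach, the incomplete last window is discarded here, so window 16 gives two maxima and window 32 gives one.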