Aggregate Numpy Array By Summing

2020-04-11 05:56发布

问题:

I have a Numpy array of shape (4320,8640). I would like to have an array of shape (2160,4320).

You'll notice that each cell of the new array maps to a 2x2 set of cells in the old array. I would like a cell's value in the new array to be the sum of the values in this block in the old array.

I can achieve this as follows:

import numpy

#Generate an example array
arr = numpy.random.randint(10,size=(4320,8640))

#Perform the transformation
arrtrans = numpy.array([ [ arr[y][x]+arr[y+1][x]+arr[y][x+1]+arr[y+1][x+1] for x in range(0,8640,2)] for y in range(0,4320,2)])

But this is slow and more than a little ugly.

Is there a way to do this using Numpy (or an interoperable package)?

回答1:

I'm not sure there exists the package you want, but this code will compute much faster.

>>> arrtrans2 = arr[::2, ::2] + arr[::2, 1::2] + arr[1::2, ::2] + arr[1::2, 1::2]
>>> numpy.allclose(arrtrans, arrtrans2)
True

Where ::2 and 1::2 are translated by 0, 2, 4, ... and 1, 3, 5, ... respectively.



回答2:

When the window fits exactly into the array, reshaping to more dimensions and collapsing the extra dimensions with np.sum is sort of the canonical way of doing this with numpy:

>>> a = np.random.rand(4320,8640)
>>> a.shape
(4320, 8640)
>>> a_small = a.reshape(2160, 2, 4320, 2).sum(axis=(1, 3))
>>> a_small.shape
(2160, 4320)
>>> np.allclose(a_small[100, 203], a[200:202, 406:408].sum())
True


回答3:

You are operating on sliding windows of the original array. There are numerous questions and answers on SO regarding. sliding windows and numpy and python. By manipulating the strides of an array, this process can be sped up considerably. Here is a generic function that will return (x,y) windows of the array with or without overlap. Using this stride trick appears to be just a hair slower than @mskimm's solution. It's a nice thing to have in your toolkit. This function is not mine - it was found at Efficient Overlapping Windows with Numpy

import numpy as np
from numpy.lib.stride_tricks import as_strided as ast
from itertools import product

def norm_shape(shape):
    '''
    Normalize numpy array shapes so they're always expressed as a tuple, 
    even for one-dimensional shapes.

    Parameters
        shape - an int, or a tuple of ints

    Returns
        a shape tuple

    from http://www.johnvinyard.com/blog/?p=268
    '''
    try:
        i = int(shape)
        return (i,)
    except TypeError:
        # shape was not a number
        pass

    try:
        t = tuple(shape)
        return t
    except TypeError:
        # shape was not iterable
        pass

    raise TypeError('shape must be an int, or a tuple of ints')


def sliding_window(a,ws,ss = None,flatten = True):
    '''
    Return a sliding window over a in any number of dimensions

    Parameters:
        a  - an n-dimensional numpy array
        ws - an int (a is 1D) or tuple (a is 2D or greater) representing the size 
             of each dimension of the window
        ss - an int (a is 1D) or tuple (a is 2D or greater) representing the 
             amount to slide the window in each dimension. If not specified, it
             defaults to ws.
        flatten - if True, all slices are flattened, otherwise, there is an 
                  extra dimension for each dimension of the input.

    Returns
        an array containing each n-dimensional window from a

    from http://www.johnvinyard.com/blog/?p=268
    '''

    if None is ss:
        # ss was not provided. the windows will not overlap in any direction.
        ss = ws
    ws = norm_shape(ws)
    ss = norm_shape(ss)

    # convert ws, ss, and a.shape to numpy arrays so that we can do math in every 
    # dimension at once.
    ws = np.array(ws)
    ss = np.array(ss)
    shape = np.array(a.shape)


    # ensure that ws, ss, and a.shape all have the same number of dimensions
    ls = [len(shape),len(ws),len(ss)]
    if 1 != len(set(ls)):
        error_string = 'a.shape, ws and ss must all have the same length. They were{}'
        raise ValueError(error_string.format(str(ls)))

    # ensure that ws is smaller than a in every dimension
    if np.any(ws > shape):
        error_string = 'ws cannot be larger than a in any dimension. a.shape was {} and ws was {}'
        raise ValueError(error_string.format(str(a.shape),str(ws)))

    # how many slices will there be in each dimension?
    newshape = norm_shape(((shape - ws) // ss) + 1)
    # the shape of the strided array will be the number of slices in each dimension
    # plus the shape of the window (tuple addition)
    newshape += norm_shape(ws)
    # the strides tuple will be the array's strides multiplied by step size, plus
    # the array's strides (tuple addition)
    newstrides = norm_shape(np.array(a.strides) * ss) + a.strides
    strided = ast(a,shape = newshape,strides = newstrides)
    if not flatten:
        return strided

    # Collapse strided so that it has one more dimension than the window.  I.e.,
    # the new array is a flat list of slices.
    meat = len(ws) if ws.shape else 0
    firstdim = (np.product(newshape[:-meat]),) if ws.shape else ()
    dim = firstdim + (newshape[-meat:])
    # remove any dimensions with size 1
    dim = filter(lambda i : i != 1,dim)
    return strided.reshape(dim)

Usage:

# 2x2 windows with NO overlap
b = sliding_window(arr, (2,2), flatten = False)
c = b.sum((1,2))

Approximate 24% performance improvement using numpy.einsum

c = np.einsum('ijkl -> ij', b)

One SO Q&A example How can I efficiently process a numpy array in blocks similar to Matlab's blkproc (blockproc) function, the selected answer would work for you.