Concatenate range arrays given start, stop numbers

Posted 2019-02-19 12:34

Question:

I have two matrices of interest. The first is a "bag of words" matrix with two columns: the document ID and the term ID. For example:

bow[0:10]

Out[1]:
    array([[ 0, 10],
           [ 0, 12],
           [ 0, 19],
           [ 0, 20],
           [ 1,  9],
           [ 1, 24],
           [ 2, 33],
           [ 2, 34],
           [ 2, 35],
           [ 3,  2]])

In addition, I have an "index" matrix, where each row holds the start (inclusive) and stop (exclusive) row indices of one document ID in the bag of words matrix. Row 0 covers doc ID 0, row 1 covers doc ID 1, and so on. For example:

index[0:4]

Out[2]:
    array([[ 0,  4],
           [ 4,  6],
           [ 6,  9],
           [ 9, 10]])
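
To make the relationship between the two matrices concrete, here is a minimal sketch of how such an index matrix could be derived from the bag of words matrix. It assumes the document ID column is sorted and that IDs run contiguously from 0; the toy `bow` below only mirrors the ten rows shown above, and the use of `np.searchsorted` is an illustration, not necessarily how the original index was built.

```python
import numpy as np

# Toy bag of words matrix: column 0 is the (sorted) document ID, column 1 the term ID.
bow = np.array([[0, 10], [0, 12], [0, 19], [0, 20],
                [1,  9], [1, 24],
                [2, 33], [2, 34], [2, 35],
                [3,  2]])

doc_ids = bow[:, 0]
all_ids = np.arange(doc_ids[-1] + 1)  # assumes IDs are 0..max with no gaps

# starts[d] is the first row of document d; stops[d] is one past its last row.
starts = np.searchsorted(doc_ids, all_ids, side="left")
stops = np.searchsorted(doc_ids, all_ids, side="right")
index = np.column_stack([starts, stops])
print(index)
```

Each row of `index` is a half-open `[start, stop)` pair, so `bow[start:stop]` is exactly the block of rows for that document.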

What I'd like to do is take a random sample of document IDs and get all of the bag of words rows for those document IDs. The bag of words matrix is roughly 150M rows (~1.5 GB), so using numpy.in1d() is too slow. We need to return these rows quickly to feed a downstream task.
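
For context, the membership-mask approach being ruled out would look roughly like the sketch below. It uses `np.isin`, which NumPy documents as the preferred replacement for `np.in1d` in new code; the toy `bow` and the `sample_ids` name are assumptions made for illustration.

```python
import numpy as np

# Toy bag of words matrix (document ID, term ID).
bow = np.array([[0, 10], [0, 12], [0, 19], [0, 20],
                [1,  9], [1, 24],
                [2, 33], [2, 34], [2, 35],
                [3,  2]])

sample_ids = np.array([1, 3])  # hypothetical sample of document IDs

# The mask is computed over every row of bow, so the cost grows with the
# full matrix size rather than with the number of rows actually selected.
mask = np.isin(bow[:, 0], sample_ids)
rows = bow[mask]
print(rows)
```

Note that the mask returns rows in the order they appear in `bow`, not in the order of the sampled IDs.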

The naive solution I have come up with is as follows:

import numpy as np

def get_rows(ids):
    # index[ids] gives one (start, stop) pair per sampled document ID
    indices = np.concatenate([np.arange(x1, x2) for x1, x2 in index[ids]])
    return bow[indices]

get_rows([4,10,3,5])

Generic sample

Stripped of the bag of words context, the problem can be stated with a generic sample like this -

indices = np.array([[ 4,  7],
                    [10, 16],
                    [11, 18]])

The expected output would be -

array([ 4,  5,  6, 10, 11, 12, 13, 14, 15, 11, 12, 13, 14, 15, 16, 17])
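
As a reference point, the list-comprehension approach from the question can be run directly on this generic sample. The function name `naive_ranges` is introduced here for illustration only; it is just the inner expression of `get_rows` pulled out on its own.

```python
import numpy as np

def naive_ranges(a):
    # One np.arange per (start, stop) row, joined with a single concatenate.
    # Simple and correct, but the loop over rows runs in Python.
    return np.concatenate([np.arange(start, stop) for start, stop in a])

indices = np.array([[ 4,  7],
                    [10, 16],
                    [11, 18]])
out = naive_ranges(indices)
print(out)
```

A baseline like this is handy for checking any faster version against known-good output.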

Answer 1:

I think I have finally cracked it with a cumsum trick that gives a vectorized solution -

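def create_ranges(a):
    # Length of each range; assumes every row has stop > start
    l = a[:,1] - a[:,0]
    clens = l.cumsum()
    # Step array: 1 everywhere, so the final cumsum counts up within a range
    ids = np.ones(clens[-1],dtype=int)
    ids[0] = a[0,0]
    # At each range boundary, jump from the previous range's last value to the next start
    ids[clens[:-1]] = a[1:,0] - a[:-1,1]+1
    out = ids.cumsum()
    return out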

Sample runs -

In [416]: a = np.array([[4,7],[10,16],[11,18]])

In [417]: create_ranges(a)
Out[417]: array([ 4,  5,  6, 10, 11, 12, 13, 14, 15, 11, 12, 13, 14, 15, 16, 17])

In [425]: a = np.array([[-2,4],[-5,2],[11,12]])

In [426]: create_ranges(a)
Out[426]: array([-2, -1,  0,  1,  2,  3, -5, -4, -3, -2, -1,  0,  1, 11])
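
One caveat worth noting: the cumsum version above appears to rely on every range containing at least one element. If a row has start == stop, two boundary steps are written to the same position and one of them is lost. The sketch below is an alternative built on `np.repeat` that tolerates empty ranges; the name `create_ranges_repeat` is hypothetical, and it still assumes stop >= start in every row.

```python
import numpy as np

def create_ranges_repeat(a):
    # Hypothetical variant of create_ranges that tolerates empty ranges.
    a = np.asarray(a)
    starts = a[:, 0]
    lens = a[:, 1] - starts            # must be non-negative
    # Position in the flat output where each range begins.
    offsets = np.cumsum(lens) - lens
    # arange minus the repeated offsets yields 0..len-1 within each range;
    # adding the repeated starts shifts that to start..stop-1.
    return np.arange(lens.sum()) - np.repeat(offsets, lens) + np.repeat(starts, lens)

a = np.array([[4, 7], [10, 16], [11, 18]])
same_as_above = create_ranges_repeat(a)

b = np.array([[4, 7], [9, 9], [11, 13]])   # middle range is empty
with_empty = create_ranges_repeat(b)       # elements: 4, 5, 6, 11, 12
```

Either function can replace the list comprehension inside `get_rows`, for example `bow[create_ranges_repeat(index[ids])]`. Which one is faster on a given workload would need to be measured.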