I have general data, e.g. strings:
np.random.seed(343)
arr = np.sort(np.random.randint(5, size=(10, 10)), axis=1).astype(str)
print (arr)
[['0' '1' '1' '2' '2' '3' '3' '4' '4' '4']
['1' '2' '2' '2' '3' '3' '3' '4' '4' '4']
['0' '2' '2' '2' '2' '3' '3' '4' '4' '4']
['0' '1' '2' '2' '3' '3' '3' '4' '4' '4']
['0' '1' '1' '1' '2' '2' '2' '2' '4' '4']
['0' '0' '1' '1' '2' '3' '3' '3' '4' '4']
['0' '0' '2' '2' '2' '2' '2' '2' '3' '4']
['0' '0' '1' '1' '1' '2' '2' '2' '3' '3']
['0' '1' '1' '2' '2' '2' '3' '4' '4' '4']
['0' '1' '1' '2' '2' '2' '2' '2' '4' '4']]
I need a counter of consecutive equal values that resets whenever the value changes, so pandas is used.
First create DataFrame:
df = pd.DataFrame(arr)
print (df)
0 1 2 3 4 5 6 7 8 9
0 0 1 1 2 2 3 3 4 4 4
1 1 2 2 2 3 3 3 4 4 4
2 0 2 2 2 2 3 3 4 4 4
3 0 1 2 2 3 3 3 4 4 4
4 0 1 1 1 2 2 2 2 4 4
5 0 0 1 1 2 3 3 3 4 4
6 0 0 2 2 2 2 2 2 3 4
7 0 0 1 1 1 2 2 2 3 3
8 0 1 1 2 2 2 3 4 4 4
9 0 1 1 2 2 2 2 2 4 4
How it works for one column:
First compare with the shifted data and take the cumulative sum:
a = (df[0] != df[0].shift()).cumsum()
print (a)
0 1
1 2
2 3
3 3
4 3
5 3
6 3
7 3
8 3
9 3
Name: 0, dtype: int32
And then call GroupBy.cumcount:
b = a.groupby(a).cumcount() + 1
print (b)
0 1
1 1
2 1
3 2
4 3
5 4
6 5
7 6
8 7
9 8
dtype: int64
If you want to apply the solution to all columns, it is possible to use apply:
print (df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumcount() + 1))
0 1 2 3 4 5 6 7 8 9
0 1 1 1 1 1 1 1 1 1 1
1 1 1 1 2 1 2 2 2 2 2
2 1 2 2 3 1 3 3 3 3 3
3 2 1 3 4 1 4 4 4 4 4
4 3 2 1 1 1 1 1 1 5 5
5 4 1 2 2 2 1 1 1 6 6
6 5 2 1 1 3 1 1 1 1 7
7 6 3 1 1 1 2 2 2 2 1
8 7 1 2 1 1 3 1 1 1 1
9 8 2 3 2 2 4 1 1 2 2
But it is slow, because the data are large. Is it possible to create some fast NumPy solution?
I found solutions that work only for 1d arrays.
General Idea
Consider the generic case where we perform this cumulative counting, or, if you think of them as ranges, we could call them grouped ranges.
Now, the idea starts off simple: compare one-off slices along the respective axis to look for inequalities, and pad with True at the start of each row/col (depending on the axis of counting).

Then it gets more involved: set up an ID array with the intention that a final cumsum, taken in flattened order, gives the desired output. The setup starts by initializing a 1s array with the same shape as the input array; at each group start in the input, the ID array is offset by the previous group's length. The code (which should give more insight) shows how we would do it for each row, and we then extend this to the generic case of rows and columns. For the columns case, we simply transpose, feed the array into the earlier row solution, and finally transpose back, as in the sketch below.
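A minimal sketch of that idea (the names grp_range_2drow and grouped_ranges are illustrative, not necessarily the original ones):

import numpy as np

def grp_range_2drow(a, start=1):
    # Group starts: pad with True at the start of each row, then compare
    # one-off slices along axis=1 to look for inequalities
    starts = np.ones(a.shape, dtype=bool)
    starts[:, 1:] = a[:, 1:] != a[:, :-1]

    # ID array of 1s in flattened order; at each group start offset by the
    # previous group's length, so the running cumsum resets to `start`
    flat_starts = np.flatnonzero(starts.ravel())
    grp_lens = np.diff(np.append(flat_starts, a.size))
    id_arr = np.ones(a.size, dtype=int)
    id_arr[flat_starts[1:]] = 1 - grp_lens[:-1]
    id_arr[0] = start
    return id_arr.cumsum().reshape(a.shape)

def grouped_ranges(a, axis=1, start=1):
    # Columns case: transpose, reuse the row solution, transpose back
    if axis == 0:
        return grp_range_2drow(a.T, start=start).T
    return grp_range_2drow(a, start=start)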
Sample run
Let's consider a sample run that finds grouped ranges along each column, with each group starting at 1. Thus, to solve our case for DataFrame input & output, it would be as sketched below.
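One way such a run could look, assuming the grouped_ranges sketch above:

# grouped ranges down each column of the sample data, each group starting at 1
out = grouped_ranges(df.values, axis=0, start=1)

# wrap back into a DataFrame to match the pandas solution's output
res = pd.DataFrame(out, index=df.index, columns=df.columns)
print (res)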
Using the method of Divakar column-wise is quite a bit faster, even though there is probably a fully vectorized way.
To check the equality:
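For example, along these lines (assuming the grouped_ranges sketch from the answer above; the names dd and nn are only illustrative):

# pandas reference solution
dd = df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumcount() + 1)

# NumPy solution applied column-wise
nn = grouped_ranges(df.values, axis=0)

print (np.array_equal(dd.values, nn))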
and the speed:
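For instance, in IPython (the actual numbers depend on the data size and the machine):

%timeit df.apply(lambda x: x.groupby((x != x.shift()).cumsum()).cumcount() + 1)
%timeit grouped_ranges(df.values, axis=0)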
EDIT: With NumPy, you can also use np.maximum.accumulate together with np.arange.
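A sketch of how np.maximum.accumulate with np.arange could be used column-wise (an assumption; the running maximum of the group-start indices tells each position where its current group began):

def grouped_ranges_accumulate(a):
    # counter down each column that resets where consecutive values differ
    idx = np.arange(a.shape[0])[:, None]

    # True in the first row and wherever a value differs from the one above
    starts = np.ones(a.shape, dtype=bool)
    starts[1:] = a[1:] != a[:-1]

    # index of the most recent group start at or above each position
    last_start = np.maximum.accumulate(np.where(starts, idx, 0), axis=0)
    return idx - last_start + 1

print (grouped_ranges_accumulate(df.values))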
Some timing:

Solution with np.maximum.accumulate
Solution of Divakar
Solution with Numba of B. M.
And the Numba solution. For such a tricky problem it always wins, here by a 7x factor vs. NumPy, since only one pass over res is done.
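A sketch of what such a Numba loop could look like (the function name is illustrative; as noted below, the string comparison arr[1:] == arr[:-1] is passed in precomputed):

import numpy as np
from numba import njit

@njit
def numba_counts(same):
    # same[i, j] is True where arr[i + 1, j] == arr[i, j]
    n, m = same.shape
    res = np.ones((n + 1, m), dtype=np.int64)
    for i in range(n):
        for j in range(m):
            if same[i, j]:
                res[i + 1, j] = res[i, j] + 1
    return res

# the string comparison is done outside the jitted function
res = numba_counts(arr[1:] == arr[:-1])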
I need to externalize arr[1:] == arr[:-1] since Numba doesn't support strings.

For a bigger array (100 000 rows x 100 cols), the gap is not so wide:
If arr can be converted to 'S8', we can win a lot of time:
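One possible reading of this (an assumption, since the conversion step is not spelled out): fixed-width 8-byte strings can be viewed as 64-bit integers, so the comparison itself can also run inside the jitted loop:

from numba import njit

@njit
def numba_counts_ints(a):
    # same counting loop, but the comparison now happens inside Numba
    n, m = a.shape
    res = np.ones((n, m), dtype=np.int64)
    for i in range(1, n):
        for j in range(m):
            if a[i, j] == a[i - 1, j]:
                res[i, j] = res[i - 1, j] + 1
    return res

arr8 = arr.astype('S8').view(np.int64)   # 8-byte strings reinterpreted as int64
res = numba_counts_ints(arr8)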