Per https://stackoverflow.com/a/48981834/1840471, this is an implementation of the weighted Gini coefficient in Python:
import numpy as np

def gini(x, weights=None):
    if weights is None:
        weights = np.ones_like(x)
    # Calculate mean absolute deviation in two steps, for weights.
    count = np.multiply.outer(weights, weights)
    mad = np.abs(np.subtract.outer(x, x) * count).sum() / count.sum()
    rmad = mad / np.average(x, weights=weights)
    # Gini equals half the relative mean absolute deviation.
    return 0.5 * rmad
This is clean and works well for medium-sized arrays, but, as warned in the original suggestion it builds on (https://stackoverflow.com/a/39513799/1840471), it's O(n²). On my computer that means it breaks somewhere past ~20k rows:
n = 20000 # Works, 30000 fails.
gini(np.random.rand(n), np.random.rand(n))
Can this be adjusted to work for larger datasets? Mine is ~150k rows.
Here is a version which is much faster than the one you provided above, and which also uses a simplified formula for the unweighted case to get even faster results there.
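A minimal sketch of that approach: sort once, then work with cumulative sums of the weights and of the weighted values, falling back to a closed form when there are no weights. The names and details here are illustrative, not necessarily the answer's exact code:

import numpy as np

def gini(x, w=None):
    # O(n log n) weighted Gini: sort by value, then use cumulative sums.
    x = np.asarray(x, dtype=float)
    if w is not None:
        w = np.asarray(w, dtype=float)
        order = np.argsort(x)
        x, w = x[order], w[order]
        cumw = np.cumsum(w)
        cumxw = np.cumsum(x * w)
        # Sum over consecutive points of the weighted Lorenz curve.
        return (np.sum(cumxw[1:] * cumw[:-1] - cumxw[:-1] * cumw[1:])
                / (cumxw[-1] * cumw[-1]))
    # Unweighted case: the same quantity has a closed form on sorted values.
    x = np.sort(x)
    n = x.size
    cumx = np.cumsum(x)
    return (n + 1 - 2 * np.sum(cumx) / cumx[-1]) / n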
Here is some test code to check we get (mostly) the same results:
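Something along these lines works (gini_quadratic is just a renamed copy of the question's O(n²) implementation, introduced here so the two can live side by side):

import numpy as np

def gini_quadratic(x, weights=None):
    # The question's O(n^2) implementation, renamed to avoid clashing with gini.
    if weights is None:
        weights = np.ones_like(x)
    count = np.multiply.outer(weights, weights)
    mad = np.abs(np.subtract.outer(x, x) * count).sum() / count.sum()
    rmad = mad / np.average(x, weights=weights)
    return 0.5 * rmad

rng = np.random.default_rng(0)
x, w = rng.random(2000), rng.random(2000)
assert np.isclose(gini(x, w), gini_quadratic(x, w))
assert np.isclose(gini(x), gini_quadratic(x))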
But the speed is very different:
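One way to measure that on your own machine (reusing gini and gini_quadratic from above; no numbers quoted here since they depend on hardware):

import timeit
import numpy as np

rng = np.random.default_rng(0)
x, w = rng.random(5000), rng.random(5000)
print("sort-based:", timeit.timeit(lambda: gini(x, w), number=10))
print("O(n^2):    ", timeit.timeit(lambda: gini_quadratic(x, w), number=3))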
If you remove the pandas ops from the function, it is already much faster.
If you want to squeeze out the last drop of performance, you could use numba or cython, but that would only gain a few percent because most of the time is spent in sorting.
Edit: gini_max_ghenis is the code used in Max Ghenis' answer.
Adapting the StatsGini R function from here, this works for large vectors, at least up to 10M rows:
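A sketch of such an adaptation, using the cumulative-share (Lorenz curve) formula from StatsGini, with pandas only for the bookkeeping; the details are illustrative rather than the answer's exact code:

import numpy as np
import pandas as pd

def gini(x, w=None):
    # Weighted Gini from cumulative shares, adapted from the StatsGini R function.
    x = pd.Series(x).reset_index(drop=True)
    w = pd.Series(np.ones_like(x) if w is None else w).reset_index(drop=True)
    order = x.argsort().to_numpy()
    sx = x.to_numpy()[order].astype(float)
    sw = w.to_numpy()[order].astype(float)
    pxi = np.cumsum(sx * sw) / np.sum(sx * sw)  # cumulative share of weighted x
    pci = np.cumsum(sw) / np.sum(sw)            # cumulative share of weight
    return np.sum(pxi[1:] * pci[:-1] - pci[1:] * pxi[:-1])

A large-vector smoke test in that spirit would be something like gini(np.random.rand(10_000_000), np.random.rand(10_000_000)).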
It also produces the same result as the function provided in the question, e.g. giving 0.2553 for this example:
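The exact snippet isn't shown above, but as a hypothetical stand-in, this input gives ≈0.2553 with both the question's function and the version here:

import numpy as np

# Hypothetical example input, not necessarily the original one.
x = np.array([3, 1, 6, 2, 1])
w = np.array([4, 2, 2, 10, 1])
print(round(gini(x, w), 4))  # 0.2553 from either implementation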