I have a list of data points. For the full run of my program I'll use all of them, but for testing the code I want to use only a small percentage of them so that the program runs in a short time. I do not want to simply take the first n elements of the list, though; I want to select an even distribution of elements from the list. So, if I'm using 50% of the data points, I might want to select every second data point.
Basically, I want a function that takes a list and a percentage as arguments and returns a list consisting of an even distribution of elements from the input list, the number of which corresponds as closely as possible to the requested percentage.
What would be a good way to do this?
For completeness, consider the following.
The problem can be split into two parts:

1. Determine the number of elements to pick, given a certain percentage or fraction.
2. Choose which elements from the list should be picked.
The first point is straightforward. If you would like to have `percentage = 35.` percent of your list, you'd ideally choose `round(len(my_list) * (percentage / 100.))` elements. Note that you'd get the percentage exactly right only if `len(my_list) * (percentage / 100.)` is an integer.
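For instance (a quick sketch with made-up numbers, Python 3):

```python
my_list = list(range(10))
percentage = 35.  # %
n = round(len(my_list) * (percentage / 100.))
# 10 * 0.35 = 3.5 is not an integer, so n = 4 and we actually get 40%
```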
This inaccuracy is inevitable, as a continuous measure (the percentage) is converted into a discrete one (the number of elements).

The second point depends on your specific requirements for which elements should be returned. Choosing elements that are as equally distributed as possible is doable, but certainly not the easiest option.
Here is how you would do this conceptually (see below for an implementation):
If you have a list of length `l` of which you'd like a certain equally distributed fraction `f` (`f = percentage / 100.`), you will have to bin the indexes of your list into `round(l * f)` bins of size `l / round(l * f)`. What you want is the most central element of each bin.

Why does this work?
For the first point, note that if we make bins of size `l / round(l * f)`, we end up with `l / (l / round(l * f)) = round(l * f)` bins, which is the ideal number (see point 1 above). If for each of these equally sized bins we then choose the most central element, we get a list of elements that are as equally distributed as possible.

Here is a simple (and neither speed-optimized nor very pretty) implementation of this:
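A minimal Python 3 sketch of the binning idea (the name `equal_dist_els` matches the comparison below; the exact signature is an assumption):

```python
def equal_dist_els(my_list, fraction):
    """Return a fraction of my_list, as equally distributed as possible."""
    n_bins = max(1, round(len(my_list) * fraction))  # ideal number of elements
    bin_size = len(my_list) / n_bins                 # possibly fractional bin width
    # pick the most central index of each bin
    return [my_list[int(i * bin_size + bin_size / 2.)] for i in range(n_bins)]
```

E.g. `equal_dist_els(list(range(10)), 0.5)` gives `[1, 3, 5, 7, 9]`.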
We can now compare this ideal algorithm (`equal_dist_els`) with the slicing approach by @jonrsharpe:
See further below for the code.
Along the x-axis is the wanted fraction of elements to be returned, and on the y-axis the absolute difference between the wanted fraction and the fraction returned by the two methods. The deviation of the slicing method is remarkable for fractions above 0.5, where its step truncates to 1: if you ask for e.g. ~55%, the slicing method returns all the elements (100%), which is a deviation of almost 45%.
To conclude, we can say that the slicing method by @jonrsharpe works well for small fractions (roughly 0.1 and below) but becomes increasingly inaccurate when choosing bigger fractions. Also note that its inaccuracy is independent of the length of the list. The binning algorithm is certainly slightly more complicated to implement, and most probably also much slower; however, its inaccuracy is just the inevitable one mentioned above, which decreases with increasing length of the list.

Code for the plots:
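A minimal matplotlib sketch of such a comparison (the list length and the grid of fractions are assumptions):

```python
import matplotlib.pyplot as plt

def slicing_fraction(n, pct):
    """Fraction of a length-n list that the slicing approach actually returns."""
    return len(range(0, n, int(100.0 / pct))) / n

def binning_fraction(n, pct):
    """Fraction of a length-n list that the binning approach actually returns."""
    return max(1, round(n * pct / 100.0)) / n

n = 1000                                    # assumed list length
pcts = [p / 10. for p in range(1, 1000)]    # 0.1% .. 99.9%
fracs = [p / 100. for p in pcts]            # the same grid as fractions

plt.plot(fracs, [abs(slicing_fraction(n, p) - f) for p, f in zip(pcts, fracs)], label='slicing')
plt.plot(fracs, [abs(binning_fraction(n, p) - f) for p, f in zip(pcts, fracs)], label='binning')
plt.xlabel('wanted fraction')
plt.ylabel('absolute deviation')
plt.legend()
plt.show()
```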
This can trivially be achieved by setting a slice with a step:
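A sketch (the function name `percentage_slice` is made up; the step expression matches the description below):

```python
def percentage_slice(lst, percentage):
    # keep every (100 / percentage)-th element; int() truncates the step
    return lst[::int(100.0 / percentage)]
```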
In use:
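(A sketched session with the assumed `percentage_slice` from above.)

```python
>>> percentage_slice(list(range(1, 11)), 50)
[1, 3, 5, 7, 9]
>>> percentage_slice(list(range(1, 11)), 35)  # int(100.0 / 35) == 2, so this returns 50%
[1, 3, 5, 7, 9]
```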
You could also add `round`, as `int` will truncate:
If you want to use a proportion rather than a percentage (e.g. `0.5` instead of `50`), simply replace `100.0` with `1`.
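A sketch of that variant (the name `proportion_slice` is made up):

```python
def proportion_slice(lst, proportion):
    # same idea, but the argument is a fraction between 0 and 1
    return lst[::round(1 / proportion)]
```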