I have a list a_tot with 1500 elements and I would like to divide it randomly into two lists: a_1 with 1300 elements and a_2 with 200 elements. My question is about the best way to randomize the original list. Once the list is randomized, I can take one slice of 1300 elements and another of 200. One way is to use random.shuffle; another is to use random.sample. Are there any differences in the quality of the randomization between the two methods? The data in a_1 should be a random sample, as should the data in a_2. Any recommendations?
Using shuffle:
import random
random.shuffle(a_tot)  # randomize the list in place (a_tot itself is modified)
a_1 = a_tot[:1300]     # pick the first 1300
a_2 = a_tot[1300:]     # pick the last 200
Using sample:
new_t = random.sample(a_tot,len(a_tot)) #get a randomized list
a_1 = new_t[0:1300] #pick the first 1300
a_2 = new_t[1300:] #pick the last 200
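If a_tot should stay untouched, the same split can be wrapped in a small helper. This is only a minimal sketch: split_randomly and its seed parameter are hypothetical names chosen for illustration, not part of the random module.

```python
import random

def split_randomly(items, n_first, seed=None):
    """Return two lists: n_first randomly chosen items and the remaining items."""
    rng = random.Random(seed)            # private generator; seed=None means unseeded
    pool = list(items)                   # copy, so the caller's data is not modified
    shuffled = rng.sample(pool, len(pool))
    return shuffled[:n_first], shuffled[n_first:]

a_tot = list(range(1500))
a_1, a_2 = split_randomly(a_tot, 1300, seed=42)
print(len(a_1), len(a_2))  # 1300 200
```

Passing a seed makes the split reproducible, which can help when the two lists are used as, say, training and test sets.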
The source for shuffle:
The source for sample:
As you can see, in both cases, the randomization is essentially done by the line int(random() * n). So, the underlying algorithm is essentially the same. The randomization should be just as good with both options. I'd say go with shuffle, because it's more immediately clear to the reader what it does.
random.shuffle() shuffles the given list in-place. Its length stays the same.
random.sample() picks n items out of the given sequence without replacement (the sequence might also be a tuple or whatever, as long as it has a __len__()) and returns them in randomized order.
There are two major differences between shuffle() and sample():
1) shuffle() alters data in place, so its input must be a mutable sequence. In contrast, sample() produces a new list and its input can be much more varied (tuple, string, range (xrange in Python 2), bytearray, etc.; sets were also accepted before Python 3.11).
2) sample() lets you potentially do less work (i.e. a partial shuffle), because it only has to draw as many items as you ask for.
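Both differences can be seen in a short sketch (the variable names here are purely illustrative):

```python
import random

data = list(range(10))

# shuffle() reorders the list in place and returns None
result = random.shuffle(data)
print(result)                            # None
print(sorted(data) == list(range(10)))   # True: same items, possibly new order

# sample() accepts an immutable sequence and returns a new list;
# only 3 items are drawn, so the full tuple is never shuffled
letters = ("a", "b", "c", "d", "e")
picked = random.sample(letters, 3)
print(len(picked))                       # 3

# shuffle() needs item assignment, so a tuple is rejected
try:
    random.shuffle(letters)
except TypeError as exc:
    print("TypeError:", exc)
```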
It is interesting to show the conceptual relationship between the two by demonstrating that it would have been possible to implement shuffle() in terms of sample():
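A minimal sketch of that idea, where shuffle_via_sample is a hypothetical name and the body is an illustration rather than the library's own code:

```python
import random

def shuffle_via_sample(seq):
    """Shuffle a mutable sequence in place by sampling all of its items."""
    # Slice assignment replaces the contents but keeps the same list object.
    seq[:] = random.sample(seq, len(seq))

deck = list(range(10))
shuffle_via_sample(deck)
print(sorted(deck) == list(range(10)))  # True: same items, possibly new order
```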
Or vice-versa, implementing sample() in terms of shuffle():
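Again as a minimal sketch, with sample_via_shuffle as a hypothetical name and not the library's actual implementation:

```python
import random

def sample_via_shuffle(population, k):
    """Return k items chosen without replacement, leaving the input untouched."""
    pool = list(population)   # copy, so the caller's data is not modified
    random.shuffle(pool)      # full shuffle, even though only k items are needed
    return pool[:k]

picked = sample_via_shuffle("abcdefgh", 3)
print(len(picked))  # 3
```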
Neither of these is as efficient as the real implementations of shuffle() and sample(), but they do show the conceptual relationship between the two.
I think they are much the same, except that one updates the original list in place and the other only reads it. There is no difference in quality.