What is the difference between random.sample and r

Posted 2020-04-07 16:16

I have a list a_tot with 1500 elements that I would like to split randomly into two lists: a_1 with 1300 elements and a_2 with 200 elements. My question is about the best way to randomize the original 1500-element list. Once it is randomized, I can take one slice of 1300 elements and another of 200.

One way is to use random.shuffle; another is to use random.sample. Is there any difference in the quality of the randomization between the two methods? The data in a_1 should be a random sample, and so should the data in a_2. Any recommendations?

Using shuffle:

random.shuffle(a_tot)    #get a randomized list
a_1 = a_tot[0:1300]     #pick the first 1300
a_2 = a_tot[1300:]      #pick the last 200

Using sample:

new_t = random.sample(a_tot,len(a_tot))    #get a randomized list
a_1 = new_t[0:1300]     #pick the first 1300
a_2 = new_t[1300:]      #pick the last 200

Tags: python random
6 answers
你好瞎i
#2 · 2020-04-07 16:39

The source for shuffle (Python 2):

def shuffle(self, x, random=None, int=int):
    """x, random=random.random -> shuffle list x in place; return None.

    Optional arg random is a 0-argument function returning a random
    float in [0.0, 1.0); by default, the standard random.random.
    """

    if random is None:
        random = self.random
    for i in reversed(xrange(1, len(x))):
        # pick an element in x[:i+1] with which to exchange x[i]
        j = int(random() * (i+1))
        x[i], x[j] = x[j], x[i]

The source for sample (Python 2):

def sample(self, population, k):
    """Chooses k unique random elements from a population sequence.

    Returns a new list containing elements from the population while
    leaving the original population unchanged.  The resulting list is
    in selection order so that all sub-slices will also be valid random
    samples.  This allows raffle winners (the sample) to be partitioned
    into grand prize and second place winners (the subslices).

    Members of the population need not be hashable or unique.  If the
    population contains repeats, then each occurrence is a possible
    selection in the sample.

    To choose a sample in a range of integers, use xrange as an argument.
    This is especially fast and space efficient for sampling from a
    large population:   sample(xrange(10000000), 60)
    """

    # XXX Although the documentation says `population` is "a sequence",
    # XXX attempts are made to cater to any iterable with a __len__
    # XXX method.  This has had mixed success.  Examples from both
    # XXX sides:  sets work fine, and should become officially supported;
    # XXX dicts are much harder, and have failed in various subtle
    # XXX ways across attempts.  Support for mapping types should probably
    # XXX be dropped (and users should pass mapping.keys() or .values()
    # XXX explicitly).

    # Sampling without replacement entails tracking either potential
    # selections (the pool) in a list or previous selections in a set.

    # When the number of selections is small compared to the
    # population, then tracking selections is efficient, requiring
    # only a small set and an occasional reselection.  For
    # a larger number of selections, the pool tracking method is
    # preferred since the list takes less space than the
    # set and it doesn't suffer from frequent reselections.

    n = len(population)
    if not 0 <= k <= n:
        raise ValueError, "sample larger than population"
    random = self.random
    _int = int
    result = [None] * k
    setsize = 21        # size of a small set minus size of an empty list
    if k > 5:
        setsize += 4 ** _ceil(_log(k * 3, 4)) # table size for big sets
    if n <= setsize or hasattr(population, "keys"):
        # An n-length list is smaller than a k-length set, or this is a
        # mapping type so the other algorithm wouldn't work.
        pool = list(population)
        for i in xrange(k):         # invariant:  non-selected at [0,n-i)
            j = _int(random() * (n-i))
            result[i] = pool[j]
            pool[j] = pool[n-i-1]   # move non-selected item into vacancy
    else:
        try:
            selected = set()
            selected_add = selected.add
            for i in xrange(k):
                j = _int(random() * n)
                while j in selected:
                    j = _int(random() * n)
                selected_add(j)
                result[i] = population[j]
        except (TypeError, KeyError):   # handle (at least) sets
            if isinstance(population, list):
                raise
            return self.sample(tuple(population), k)
    return result

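As you can see, both functions choose each element by computing an index of the form int(random() * n), where n is the number of candidates still available (or, in the set-based branch of sample, the whole population, with repeated picks rejected). So the underlying algorithm, and hence the quality of the randomization, is essentially the same.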
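To make the comparison concrete, here is a minimal Python 3 sketch of the two loops from the quoted source, placed side by side. The names fisher_yates_shuffle and pool_sample are hypothetical, and the code is a simplified restatement for illustration (only the pool branch of sample is covered), not the library's actual implementation.

```python
import random

def fisher_yates_shuffle(x, rand=random.random):
    """Shuffle list x in place, mirroring the loop in the quoted shuffle source."""
    for i in reversed(range(1, len(x))):
        j = int(rand() * (i + 1))      # pick an index in x[:i+1]
        x[i], x[j] = x[j], x[i]        # exchange x[i] with the chosen element

def pool_sample(population, k, rand=random.random):
    """Return k items chosen without replacement, mirroring the pool branch of sample."""
    n = len(population)
    if not 0 <= k <= n:
        raise ValueError("sample larger than population")
    pool = list(population)            # working copy; the input is left unchanged
    result = [None] * k
    for i in range(k):
        j = int(rand() * (n - i))      # pick an index among the n-i unselected items
        result[i] = pool[j]
        pool[j] = pool[n - i - 1]      # move a non-selected item into the vacancy
    return result

data = list(range(10))
shuffled = data[:]
fisher_yates_shuffle(shuffled)          # permutes the copy in place
picked = pool_sample(data, len(data))   # full-length sample, also a permutation
```

Both functions draw one index per output position from a shrinking range; the only structural difference is that one swaps inside the caller's list and the other fills a new list.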

SAY GOODBYE
#3 · 2020-04-07 16:40
from random import shuffle
from random import sample 
x = [[i] for i in range(10)]
shuffle(x)       # reorders x in place and returns None
sample(x, 10)    # returns a new 10-element list; x itself is unchanged

shuffle reorders the list in place, whereas sample returns a new list. sample also lets you choose how many elements to pick, while shuffle always leaves you with a list of the same length as the input.
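A short sketch of that difference, using a stand-in list of ten integers (the variable names are only illustrative):

```python
import random

x = list(range(10))
original = x[:]                        # copy kept for comparison

result = random.shuffle(x)             # reorders x in place; the return value is None
picked = random.sample(original, 4)    # new 4-element list; original is not modified
```

After this runs, x holds the same ten values in a new order, result is None, and original still has its initial order.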

Melony?
#4 · 2020-04-07 16:42

The randomization should be just as good with both options. I'd say go with shuffle, because it's more immediately clear to the reader what it does.
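One rough way to check this empirically is to count how often each ordering of a tiny list comes up under both methods. The sketch below is only an illustration, not a rigorous statistical test; the helper names and the trial count are arbitrary choices.

```python
import random
from collections import Counter

def count_orderings(reorder, trials=6000):
    """Tally how often each ordering of [1, 2, 3] is produced by reorder."""
    counts = Counter()
    for _ in range(trials):
        counts[tuple(reorder([1, 2, 3]))] += 1
    return counts

def via_shuffle(items):
    random.shuffle(items)              # in-place shuffle of the fresh list
    return items

def via_sample(items):
    return random.sample(items, len(items))   # full-length sample as a new list

shuffle_counts = count_orderings(via_shuffle)
sample_counts = count_orderings(via_sample)
```

With 6000 trials and 6 possible orderings, each ordering should appear roughly 1000 times in both tallies if the two methods are equally uniform.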

The star\"
#5 · 2020-04-07 16:43

random.shuffle() shuffles the given list in-place. Its length stays the same.

random.sample() picks n items without replacement from the given sequence (which may also be a tuple or whatever, as long as it has a __len__()) and returns them in a new list in random order.

时光不老,我们不散
#6 · 2020-04-07 16:43

There are two major differences between shuffle() and sample():

1) Shuffle alters data in place, so its input must be a mutable sequence. In contrast, sample produces a new list, and its input can be much more varied (tuple, string, xrange, bytearray, set, etc.).

2) Sample lets you potentially do less work (i.e. a partial shuffle).
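Both points can be illustrated with a small Python 3 sketch. It assumes Python 3, where range plays the role of xrange; the variable names are only for illustration.

```python
import random

letters = "abcdefghij"                 # a string is immutable, so shuffle() cannot reorder it
three = random.sample(letters, 3)      # sample() returns a new list of 3 characters

big = range(10_000_000)                # lazy sequence; never materialized as a list
few = random.sample(big, 60)           # picks 60 values without shuffling all ten million
```

The second call is the "less work" case: only 60 selections are made, whereas a full shuffle would first need a ten-million-element list.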

It is interesting to show the conceptual relationship between the two by demonstrating that it would have been possible to implement shuffle() in terms of sample():

from random import sample

def shuffle(p):
    p[:] = sample(p, len(p))   # overwrite the contents in place with a full-length sample

Or vice-versa, implementing sample() in terms of shuffle():

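from random import shuffle

def sample(p, k):
    p = list(p)      # copy, so the caller's data is left untouched
    shuffle(p)       # fully shuffle the copy
    return p[:k]     # the first k items form the sample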

Neither of these is as efficient as the real implementations of shuffle() and sample(), but they do show how the two are conceptually related.

We Are One
#7 · 2020-04-07 16:50

I think they are much the same, except that one updates the original list while the other only reads it. There is no difference in quality.
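Applied to the numbers in the question, a minimal sketch of both options might look like this. The contents of a_tot here are stand-in integers, since the question does not show the real data, and b_1, b_2, and shuffled_copy are hypothetical names.

```python
import random

a_tot = list(range(1500))              # stand-in data; the question's actual list is not shown

# Option 1: sample() reads a_tot and leaves it untouched
new_t = random.sample(a_tot, len(a_tot))
a_1, a_2 = new_t[:1300], new_t[1300:]

# Option 2: shuffle a copy when the original order must be preserved
shuffled_copy = a_tot[:]
random.shuffle(shuffled_copy)
b_1, b_2 = shuffled_copy[:1300], shuffled_copy[1300:]
```

If the original order of a_tot does not matter, shuffling it directly (as in the question) avoids the extra copy.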
