best way to pick a random subset from a collection

I have a set of objects in a Vector from which I'd like to select a random subset (e.g. 100 items coming back; pick 5 randomly). In my first (very hasty) pass I did an extremely simple and perhaps overly clever solution:

Vector itemsVector = getItems();

Collections.shuffle(itemsVector);
itemsVector.setSize(5);

While this has the advantage of being nice and simple, I suspect it's not going to scale very well, i.e. Collections.shuffle() must be O(n) at least. My less clever alternative is

Vector itemsVector = getItems();

Random rand = new Random(System.currentTimeMillis()); // would make this static to the class    

List subsetList = new ArrayList(5);
for (int i = 0; i < 5; i++) {
     // be sure to use Vector.remove() or you may get the same item twice
     subsetList.add(itemsVector.remove(rand.nextInt(itemsVector.size())));
}

Any suggestions on better ways to draw out a random subset from a Collection?

Tags: java algorithm collections random subset

10 answers

Bombasti

#2 · 2019-01-05 03:14

Jon Bentley discusses this in either 'Programming Pearls' or 'More Programming Pearls'. You need to be careful with your N of M selection process, but I think the code shown works correctly. Rather than shuffling all M items, you can shuffle only the first N positions, which is a useful saving when N << M.

Knuth also discusses these algorithms. I believe that is in Vol 2, "Seminumerical Algorithms" (the section on random sampling and shuffling), but my set is packed pending a house move so I can't formally check that.

地球回转人心会变

#3 · 2019-01-05 03:14

Your second solution of using Random to pick elements seems sound, however:

- Depending on how sensitive your data is, I suggest using some sort of hashing method to scramble the random number seed. For a good case study, see How We Learned to Cheat at Online Poker (the original link was 404 as of 2015-12-18; copies can be found by searching for the article title in double quotes).
- Vector is synchronized. If possible, use ArrayList instead to improve performance.
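
As a rough illustration of both points, here is a sketch of the question's second approach using an ArrayList copy instead of a Vector. Note that it swaps in java.security.SecureRandom, which seeds itself from the platform's entropy source, rather than hashing a time-based seed as suggested above; that substitution, and the names SecureSubset and pick, are assumptions of this example and not something the answer prescribes.

```java
import java.security.SecureRandom;
import java.util.ArrayList;
import java.util.List;

class SecureSubset {
    // One shared generator; SecureRandom seeds itself, so no currentTimeMillis() seed is needed.
    private static final SecureRandom RAND = new SecureRandom();

    /** Returns k distinct items drawn at random; the caller's list is not modified. */
    static <T> List<T> pick(List<T> items, int k) {
        if (k < 0 || k > items.size()) {
            throw new IllegalArgumentException("k must be between 0 and items.size()");
        }
        List<T> pool = new ArrayList<>(items);   // unsynchronized working copy
        List<T> subset = new ArrayList<>(k);
        for (int i = 0; i < k; i++) {
            // remove() takes the item out of the pool, so it cannot be drawn twice
            subset.add(pool.remove(RAND.nextInt(pool.size())));
        }
        return subset;
    }
}
```

SecureRandom is slower than Random, so this is probably only worth it when the predictability of the selection actually matters.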

不美不萌又怎样

#4 · 2019-01-05 03:17

I wrote an efficient implementation of this a few weeks back. It's in C# but the translation to Java is trivial (essentially the same code). The plus side is that it's also completely unbiased (which some of the existing answers aren't) - a way to test that is here.

It's based on a Durstenfeld implementation of the Fisher-Yates shuffle.
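
The linked C# code is not reproduced here, so the following is only a guess at what a Java version of the idea might look like: a Durstenfeld-style Fisher-Yates shuffle that stops after the first k positions. The class and method names (PartialShuffle, pickRandom) are made up for this sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class PartialShuffle {
    /** Returns k items chosen uniformly at random from items, without repeats. */
    static <T> List<T> pickRandom(List<T> items, int k, Random rand) {
        if (k < 0 || k > items.size()) {
            throw new IllegalArgumentException("k must be between 0 and items.size()");
        }
        // O(n) copy so the caller's list is untouched; drop it if mutating the input is acceptable
        List<T> pool = new ArrayList<>(items);
        for (int i = 0; i < k; i++) {
            // pick j uniformly from [i, size) and swap that element into position i
            int j = i + rand.nextInt(pool.size() - i);
            Collections.swap(pool, i, j);
        }
        // the first k positions now hold the random subset
        return new ArrayList<>(pool.subList(0, k));
    }
}
```

Only k swaps are performed, so if the copy is skipped the cost depends on the subset size rather than on the size of the collection.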

手持菜刀，她持情操

#5 · 2019-01-05 03:22

I'd personally opt for your initial implementation: it is very concise. Performance testing will show how well it scales. I've implemented a very similar block of code in a fairly heavily used method and it scaled sufficiently; that code also worked on arrays containing more than 10,000 items.

beautiful°

#6 · 2019-01-05 03:26

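// assumes rand is a java.util.Random, out is a List, and itemsVector holds at least 5 items
// (with fewer than 5 items the while loop below never terminates)
Set<Integer> s = new HashSet<Integer>();
// add random indexes to s; the set silently ignores repeated indexes
while (s.size() < 5)
{
    s.add(rand.nextInt(itemsVector.size()));
}
// iterate over s and put the items in the list
for (Integer i : s)
{
    out.add(itemsVector.get(i));
}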

可以哭但决不认输i

#7 · 2019-01-05 03:27

@Jonathan,

I believe this is the solution you're talking about:

void genknuth(int m, int n)
{    for (int i = 0; i < n; i++)
         /* select m of remaining n-i */
         if ((bigrand() % (n-i)) < m) {
             cout << i << "\n";
             m--;
         }
}

It's on page 127 of Programming Pearls by Jon Bentley and is based off of Knuth's implementation.
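
Since the question is about Java, here is one possible translation of genknuth, offered as a sketch rather than as Bentley's own code. It returns the chosen indexes instead of printing them and uses Random.nextInt in place of bigrand(); the names SelectionSampling and sampleIndexes are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class SelectionSampling {
    /** Returns m distinct indexes from 0..n-1, in increasing order. */
    static List<Integer> sampleIndexes(int m, int n, Random rand) {
        if (m < 0 || m > n) {
            throw new IllegalArgumentException("need 0 <= m <= n");
        }
        List<Integer> chosen = new ArrayList<>(m);
        for (int i = 0; i < n && m > 0; i++) {
            // select m of the remaining n - i: index i is kept with probability m / (n - i)
            if (rand.nextInt(n - i) < m) {
                chosen.add(i);
                m--;
            }
        }
        return chosen;
    }
}
```

The indexes can then be used with itemsVector.get(i). Unlike the shuffle-based approaches this needs no copy of the collection, but it walks up to n candidates and always returns the subset in the original order.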

EDIT: I just saw a further modification on page 129:

void genshuf(int m, int n)
{    int i,j;
     int *x = new int[n];
     for (i = 0; i < n; i++)
         x[i] = i;
     for (i = 0; i < m; i++) {
         j = randint(i, n-1);
         int t = x[i]; x[i] = x[j]; x[j] = t;
     }
     sort(x, x+m);
     for (i = 0; i< m; i++)
         cout << x[i] << "\n";
}

This is based on the idea that "...we need shuffle only the first m elements of the array..."
