I have a set of objects in a Vector from which I'd like to select a random subset (e.g. 100 items coming back; pick 5 randomly). In my first (very hasty) pass I did an extremely simple and perhaps overly clever solution:
Vector itemsVector = getItems();
Collections.shuffle(itemsVector);
itemsVector.setSize(5);
While this has the advantage of being nice and simple, I suspect it's not going to scale very well, i.e. Collections.shuffle() must be O(n) at least. My less clever alternative is
Vector itemsVector = getItems();
Random rand = new Random(System.currentTimeMillis()); // would make this static to the class
List subsetList = new ArrayList(5);
for (int i = 0; i < 5; i++) {
// be sure to use Vector.remove() or you may get the same item twice
subsetList.add(itemsVector.remove(rand.nextInt(itemsVector.size())));
}
Any suggestions on better ways to draw out a random subset from a Collection?
If you're trying to select k distinct elements from a list of n, the methods you gave above will be O(n) or O(kn), because removing an element from a Vector will cause an arraycopy to shift all the elements down.
Since you're asking for the best way, it depends on what you are allowed to do with your input list.
If it's acceptable to modify the input list, as in your examples, then you can simply swap k random elements to the beginning of the list and return them in O(k) time like this:
If the list must end up in the same state it began, you can keep track of the positions you swapped, and then return the list to its original state after copying your selected sublist. This is still an O(k) solution.
If, however, you cannot modify the input list at all and k is much less than n (like 5 from 100), it would be much better not to remove selected elements each time, but simply select each element, and if you ever get a duplicate, toss it out and reselect. This will give you O(kn / (n-k)) which is still close to O(k) when n dominates k. (For example, if k is less than n / 2, then it reduces to O(k)).
If k not dominated by n, and you cannot modify the list, you might as well copy your original list, and use your first solution, because O(n) will be just as good as O(k).
As others have noted, if you are depending on strong randomness where every sublist is possible (and unbiased), you'll definitely need something stronger than
java.util.Random
. Seejava.security.SecureRandom
.two solutions I don't think appear here - the corresponds is quite long, and contains some links, however, I don't think all of the posts relate to the problem of choosing a subst of K elemetns out of a set of N elements. [By "set", I refer to the mathematical term, i.e. all elements appear once, order is not important].
Sol 1:
This looks similar to the answer daniel gave, but it actually is very different. It is of O(k) run time.
Another solution is to use some math: consider the array indexes as Z_n and so we can choose randomly 2 numbers, x which is co-prime to n, i.e. chhose gcd(x,n)=1, and another, a, which is "starting point" - then the series: a % n,a+x % n, a+2*x % n,...a+(k-1)*x%n is a sequence of distinct numbers (as long as k<=n).
How much does remove cost? Because if that needs to rewrite the array to a new chunk of memory, then you've done O(5n) operations in the second version, rather than the O(n) you wanted before.
You could create an array of booleans set to false, and then:
This approach works if your subset is smaller than your total size by a significant margin. As those sizes get close to one another (ie, 1/4 the size or something), you'd get more collisions on that random number generator. In that case, I'd make a list of integers the size of your larger array, and then shuffle that list of integers, and pull off the first elements from that to get your (non-colliding) indeces. That way, you have the cost of O(n) in building the integer array, and another O(n) in the shuffle, but no collisions from an internal while checker and less than the potential O(5n) that remove may cost.
This is a very similar question on stackoverflow.
To summarize my favorite answers from that page (furst one from user Kyle):
Here is some pseudopython -
I'm saying that the time complexity is O(k2) or O(k log k) because it depends on how quickly you can search and insert into your container for s. If s is a normal list, one of those operations is linear, and you get k^2. However, if you're willing to build s as a balanced binary tree, you can get out the O(k log k) time.