If we have a Map<T, Integer>, let's say the Integer value represents "how many" of each T there are. I want to uniformly select a T based on its Integer value: if the map contains Strings with "a"=4 and "b"=6, then 40% of the time "a" should be selected and 60% of the time "b".
Most importantly, I'd like this in O(n), n being two (not ten) in my previous example. I originally made an ArrayList containing each key repeated once per unit of its value (and simply returned a random index), but this process is not only very slow, it is completely counterintuitive for what the Map<T, Integer> represents.
OP here.
I came up with an elegant solution! To clear up any misunderstandings: my original idea of storing each key in an ArrayList once per unit of its value completely disregarded the point of using a Map to count instances of each key with Integers; any similar solutions are counterproductive! Assuming the Map is unordered, here is my solution:
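A minimal sketch of what that solution looks like, assuming the random value is drawn from the total of all weights (the class name, the select method, and the use of java.util.Random are my own; currentSum and lastElement follow the explanation below):

    import java.util.Map;
    import java.util.Random;

    public class WeightedRandom {
        private static final Random RAND = new Random();

        // Uniformly selects a key, weighted by its Integer value.
        public static <T> T select(Map<T, Integer> map) {
            int total = 0;
            for (int weight : map.values()) {
                total += weight;
            }
            int randomValue = RAND.nextInt(total); // uniform in [0, total)

            int currentSum = 0;
            T lastElement = null;
            for (Map.Entry<T, Integer> entry : map.entrySet()) {
                // Compare the random value with (current sum + current element's value).
                if (randomValue < currentSum + entry.getValue()) {
                    return entry.getKey();
                }
                currentSum += entry.getValue();
                lastElement = entry.getKey();
            }
            // The random value was never less than any cumulative sum:
            return lastElement;
        }
    }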
It compares the random value with (the current sum + the current element's value). If the random value is less than that, we return the current key; otherwise, we add the current element's value to the sum and keep going. If the random value is never less than any cumulative sum, we return the lastElement. Hope this clears it up.
Using an ArrayList would actually be even faster than using a Map, because selection becomes O(1): expand each key into the list once per unit of weight and pick a random index, as sketched below. The only way this is a bad thing is if order matters (A A B B A B vs. A B B A B A, or something), but it obviously doesn't, because you're using a Map, which has no ordering...
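For illustration, a sketch of that expansion (all names here are mine):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.Random;

    public class ExpandedList {
        // One slot per unit of weight; selection is then a single nextInt().
        public static <T> List<T> expand(Map<T, Integer> weights) {
            List<T> slots = new ArrayList<T>();
            for (Map.Entry<T, Integer> e : weights.entrySet()) {
                for (int i = 0; i < e.getValue(); i++) {
                    slots.add(e.getKey());
                }
            }
            return slots;
        }

        public static <T> T pick(List<T> slots, Random rnd) {
            return slots.get(rnd.nextInt(slots.size())); // O(1) per selection
        }
    }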
Sorry for the delay, but I think I have a relatively elegant solution with O(n lg n) construction time and O(lg n) fetch-a-random-element time. Here goes.

WeightedProbMap: This class implements the random element generator. It is constructed from an Iterable; see Test.java below.

Pair.java: Just a simple Pair class.
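Based on the explanation further down (a SortedMap from cumulative weights to elements, queried with headMap(...).lastKey()), a sketch of these two files could look like the following; the constructor signature, the accessor names on Pair, and the skipping of non-positive weights are my guesses:

    // Pair.java (sketch)
    public class Pair<F, S> {
        private final F first;
        private final S second;

        public Pair(F first, S second) {
            this.first = first;
            this.second = second;
        }

        public F first()  { return first; }
        public S second() { return second; }
    }

    // WeightedProbMap.java (sketch)
    import java.util.Random;
    import java.util.SortedMap;
    import java.util.TreeMap;

    public class WeightedProbMap<E> {
        private final SortedMap<Integer, E> elts = new TreeMap<Integer, E>();
        private final Random rand = new Random();
        private int sum = 0; // total weight of all elements

        // Construction is O(n lg n): one TreeMap insertion per element.
        public WeightedProbMap(Iterable<Pair<Integer, E>> weights) {
            for (Pair<Integer, E> p : weights) {
                if (p.first() > 0) {
                    // Key = cumulative weight of everything before this element.
                    this.elts.put(this.sum, p.second());
                    this.sum += p.first();
                }
            }
        }

        // O(lg n): the chosen element is the one whose key is the greatest
        // key strictly smaller than a random index in [1, sum].
        public E nextElt() {
            int index = this.rand.nextInt(this.sum) + 1;
            SortedMap<Integer, E> view = this.elts.headMap(index);
            return view.get(view.lastKey());
        }
    }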
Test.java: This is a very simple test harness for the WeightedProbMap (WPM) class. We build an ArrayList of elements with associated weights, use that to construct a WPM, and then get 10,000 samples from the WPM to see if elements appear with the expected frequency.

Testing this: change the elts.add(...) lines in Test.java.
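A sketch of such a harness (the element strings and weights are illustrative; "Hello" matches the grep in the run command below):

    // Test.java (sketch)
    import java.util.ArrayList;

    public class Test {
        public static void main(String[] args) {
            ArrayList<Pair<Integer, String>> elts = new ArrayList<Pair<Integer, String>>();
            elts.add(new Pair<Integer, String>(4, "Hello")); // expect ~40% of samples
            elts.add(new Pair<Integer, String>(6, "world")); // expect ~60% of samples

            WeightedProbMap<String> wpm = new WeightedProbMap<String>(elts);
            for (int i = 0; i < 10000; i++) {
                System.out.println(wpm.nextElt());
            }
        }
    }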
Compile with:
$ javac Pair.java WeightedProbMap.java Test.java
Run with (for example, in Unix):
$ java Test | grep "Hello" | wc -l
This will give you the count for that particular execution.
Explanation:

constructor: The WeightedProbMap (WPM) class uses a java.util.SortedMap to map cumulative weights to elements.

nextElt(): A SortedMap stores its data by key order, which allows it to cheaply provide 'views' of subsets of the map. In particular, the call this.elts.headMap(index) returns a view of the original map (this.elts) with only the keys that are strictly smaller than index. This operation (headMap) is constant time: view takes O(1) time to construct, and if you were to change this.elts later on, the changes would be reflected in view as well.

Once we create the view of everything less than a random number, we now just have to find the greatest key in that subset. We do that with SortedMap.lastKey(), which, for a TreeMap, should take Θ(lg n) time.
time.To do this, you have to cache the relative frequency of each value T. This gives you your O(n) probability-distribution for the price of an O(n) insertion-cost (you have to update the relative frequency of every T upon every insertion).
If you can store the total sum, that's quite easily done: just store the pairs (T, int) as a small class (or whatever) in an ordinary array and then go over it:
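A sketch of that loop (the WeightedPair class and the totalSum parameter are my naming):

    import java.util.Random;

    public class LinearScan {
        static class WeightedPair<T> {
            final T value;
            final int weight;
            WeightedPair(T value, int weight) {
                this.value = value;
                this.weight = weight;
            }
        }

        // totalSum must equal the sum of all weights in pairs.
        public static <T> T select(WeightedPair<T>[] pairs, int totalSum, Random rnd) {
            int r = rnd.nextInt(totalSum); // uniform in [0, totalSum)
            int cumulative = 0;
            for (WeightedPair<T> p : pairs) {
                cumulative += p.weight;
                if (r < cumulative) {
                    return p.value;
                }
            }
            throw new IllegalStateException("totalSum did not match the weights");
        }
    }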
You can't get much faster, considering that looping through an ArrayList is the most efficient way to iterate through n values, and you obviously can't do better than O(n). The only overhead is the nextInt(), and you need that (or something similar) in every solution. Depending on how you organize the ArrayList (sorted or not), other operations get cheaper or more expensive, but it's unimportant for this particular action.
Edit: Although, thinking about it, the "you obviously need O(n)" isn't true. If you change the values in the array rarely, can allow an expensive preparation step, and memory isn't a problem, you can do better by storing a HashMap. If you've got, for example, the distribution T0: 2, T1: 3, T2: 1, you could insert (0, T0), (1, T0), (2, T1), (3, T1), (4, T1), (5, T2) into the HashMap.
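A sketch of that table for the example distribution (building it generically would follow the same pattern):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Random;

    public class PrecomputedTable {
        public static void main(String[] args) {
            // Distribution T0: 2, T1: 3, T2: 1 -> one entry per unit of weight.
            Map<Integer, String> table = new HashMap<Integer, String>();
            table.put(0, "T0"); table.put(1, "T0");
            table.put(2, "T1"); table.put(3, "T1"); table.put(4, "T1");
            table.put(5, "T2");

            Random rnd = new Random();
            String picked = table.get(rnd.nextInt(table.size())); // O(1) lookup
            System.out.println(picked);
        }
    }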
Edit2: Or see phooji's approach, which should be feasible for larger sets of data.
Build an inverse map, Map<Integer,T>, so that every key is the sum of all the weights processed so far. For example, with the map from the question ("a"=4, "b"=6), the inverse map is (4="a", 10="b").

(For better performance, you can arrange your weights in descending order first.)

Then generate an evenly distributed random number between 0 and the sum of all weights, and perform a binary search for this number in the key set of the inverse map.
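A sketch of that approach, using a TreeMap whose higherEntry() plays the role of the binary search over the key set (the class and method names are mine):

    import java.util.Map;
    import java.util.Random;
    import java.util.TreeMap;

    public class InversePicker<T> {
        private final TreeMap<Integer, T> inverse = new TreeMap<Integer, T>();
        private final Random rnd = new Random();
        private int total = 0;

        public InversePicker(Map<T, Integer> weights) {
            for (Map.Entry<T, Integer> e : weights.entrySet()) {
                total += e.getValue();          // key = running sum of weights
                inverse.put(total, e.getKey());
            }
        }

        public T next() {
            int r = rnd.nextInt(total);               // uniform in [0, total)
            return inverse.higherEntry(r).getValue(); // least key strictly greater than r
        }
    }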