How can I randomly generate letters according to their frequency of use in common speech?
Any pseudo-code appreciated, but an implementation in Java would be fantastic. Otherwise just a poke in the right direction would be helpful.
Note: I don't need to generate the frequencies of usage - I'm sure I can look that up easily enough.
Not even a pseudo-code, but a possible approach is as follows:
Let p1, p2, ..., pk be the frequencies that you want to match.
Depending on how you implement the interval-finding, the procedure is usually more efficient if the p1,p2,... are sorted in decreasing order, because you will usually find the interval containing x sooner.
Using a binary tree gives you a nice, clean way to find the right entry. Here, you start with a
frequency
map, where the keys are the symbols (English letters), and the values are the frequency of their occurrence. This gets inverted, and aNavigableMap
is created where the keys are cumulative probability, and the values are symbols. That makes the lookup easy.What I would do is scale the relative frequencies as floating point numbers such that their sum is 1.0. Then I would create an array of the cumulative totals per letter, i.e. the number that must be topped to get that letter and all those "below" it. Say the frequency of A is 10%, b is 2% and z is 1%; then your table would look something like this:
Then you generate yourself a random number between 0.0 and 1.0 and do a binary search in the array for the first number smaller than your random number. Then pick the letter at that position. Done.
I am assuming that you store the frequencies as floating point numbers between 0 and 1 that total to make 1.
First you should prepare a table of cumulative frequencies, i.e. the sum of the frequency of that letter and all letters before it.
To simplify, if you start with this frequency distribution:
Your cumulative frequency table would be:
Now generate a random number between 0 and 1 and see where in this list that number lies. Choose the letter that has the smallest cumulative frequency larger than your random number. Some examples:
Say you randomly pick 0.612. This lies between 0.4 and 0.8, i.e. between B and C, so you'd choose C.
If your random number was 0.039, that comes before 0.1, i.e. before A, so choose A.
I hope that makes sense, otherwise feel free to ask for clarifications!
One quick way to do it would be to generate a list of letters, where each letter appeared in the list in accordance with its frequency. Say, if "e" was used 25.6% of the time, and your list had length 1000, it would have 256 "e"s.
Then you could just randomly pick spots from the list by using
(int) (Math.random() * 1000)
to generate random numbers between 0 and 999.