I have a set of integers, each 8, 9, or 10 digits long, and I have millions of them. I want to map each of them to an integer in the range 1 to 1000. I cannot simply take the integers mod 1000, because there are systematic biases in the way these numbers were issued (for example, even numbers are more likely than odd numbers), so
$id % 1000
would yield even results more often than odd ones. Are there any simple functions (either mathematical ones or tricks based on bitwise operations) that would give me this mapping in either Perl or R? Thanks in advance.
You're essentially asking for a hash function that maps numbers to values between 0 and 999 (add 1 if you need the range 1 to 1000).
To construct that, you could first apply a hash function to remove any systematic pattern from the inputs, and then use mod to restrict the output to values between 0 and 999.
Here's an R implementation of that idea:
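A minimal sketch of this idea in base R, not necessarily the one-liner originally posted: it uses a multiply-add hash modulo a prime as the mixing step. The function name `hash_bucket` and the constants `a`, `b`, and `p` are illustrative assumptions, chosen only so that the arithmetic stays exact in double precision for ids below 1e10.

```r
# Hypothetical helper: mix the id with a multiply-add hash modulo a prime,
# then reduce to 0..999. The constants are illustrative, not tuned.
hash_bucket <- function(id, a = 387433, b = 1234567, p = 2147483647) {
  id <- as.numeric(id)  # 10-digit ids do not fit in R's 32-bit integers
  # Range check keeps a * id + b below 2^53, so the doubles stay exact.
  stopifnot(all(id >= 0), all(id < 1e10))
  ((a * id + b) %% p) %% 1000
}

ids <- c(12345678, 123456789, 1234567890)
hash_bucket(ids)       # values in 0..999
hash_bucket(ids) + 1   # shift to 1..1000 if that range is needed
```

A linear hash like this is about the weakest mixing step that still breaks the even/odd pattern; a cryptographic digest would scramble the inputs more thoroughly at some cost in speed.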
Breaking that one-liner down into pieces should make what it does a bit clearer:
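As a rough illustration of the separate steps (mix, then mod, then shift), here is a step-by-step sketch in base R. The multiply-add mixing step and the constants `a`, `b`, and `p` are assumptions made for the example, not the original code.

```r
# Step-by-step hash-then-mod mapping (illustrative constants).
id <- c(10000000, 10000002, 10000004, 9999999998)  # all even, like the biased input

a <- 387433       # multiplier (assumed; small enough that a * id < 2^53)
b <- 1234567      # offset (assumed)
p <- 2147483647   # the prime 2^31 - 1

scrambled <- (a * id + b) %% p   # 1. mix: break up the pattern in the ids
bucket0   <- scrambled %% 1000   # 2. restrict to 0..999
bucket    <- bucket0 + 1         # 3. shift to 1..1000

# Optional check on a larger all-even sample: share of even buckets.
big        <- seq(10000000, by = 2, length.out = 1e6)
big_bucket <- ((a * big + b) %% p) %% 1000
mean(big_bucket %% 2 == 0)
```

A share close to 0.5 in the last line would suggest the even/odd bias of the input no longer carries through to the output; a plain `big %% 1000` would give a share of exactly 1.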
Unless you can define the mathematical properties of the available numbers (e.g., they are even, exponentially distributed, or whatever), no deterministic function is guaranteed to map them evenly into a given range.
Every function you choose will map some class of inputs into a small region of the output range. If the hash function is complex, it may be difficult to determine a priori which class will be mishandled. Of course, this is a general problem with hash functions: you always have to assume something about the input.
In theory, the only solution (if you don't know anything about the numbers or can't analyze them) is to xor the input numbers with a truly random sequence and then use a mod operation. In practice, Josh's solution will probably work.
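The xor-then-mod idea could be sketched in base R roughly as follows. This is only an illustration: `xor_pad_bucket`, `pad_hi`, and `pad_lo` are hypothetical names, and a seeded pseudo-random generator stands in for the truly random sequence. Because `bitwXor()` works on 32-bit integers, each id is split into an upper and a lower part first.

```r
# Sketch: XOR each id with a random pad, then take mod 1000.
xor_pad_bucket <- function(id, pad_hi, pad_lo) {
  id <- as.numeric(id)
  stopifnot(all(id >= 0), all(id < 2^34))
  hi <- bitwXor(id %/% 65536, pad_hi)  # upper bits (below 2^18)
  lo <- bitwXor(id %% 65536, pad_lo)   # lower 16 bits
  (hi * 65536 + lo) %% 1000            # recombine, then reduce to 0..999
}

set.seed(1)  # assumption: a PRNG replaces the "truly random" source
ids    <- c(10000000, 10000002, 9999999998)
pad_hi <- sample.int(2^18, length(ids), replace = TRUE) - 1L  # 0..2^18-1
pad_lo <- sample.int(2^16, length(ids), replace = TRUE) - 1L  # 0..2^16-1
xor_pad_bucket(ids, pad_hi, pad_lo)
```

Note that the pad has to be stored alongside the ids if the same id must map to the same bucket later, which is one reason a fixed hash function is usually more practical.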
NOTE: If you can analyze the resulting array while you're hashing the numbers you might be able to change the hash function to evenly distribute the results. This might work for creating a hash table for later searching. However, this does not seem to be your application.