I've a set of integers each of which have 8,9 or 10 digits in size. I've millions of them. I want to map each one of them to an integer in the range of 1 to 1000. I cannot do a simple mod on the integers as there are systemic biases in the way these numbers have been issued (for example even numbers are more likely than odd numbers), so
$id % 1000
would yield more frequent even numbers and less frequent odd numbers. Are there any simple functions (either mathematical or tricky functions that do bitwise operations) which would help me get to this mapping either in Perl or R? Thanks much in advance.
You're essentially asking for a hash function that maps numbers to values between 0 and 999.
To construct that, you could first use a hash function to get rid of any systematic pattern in the mapped-to values, and then use mod to restrict the output to values between 0 and 999.
Here's an R implementation of that idea:
library(digest)
set.seed(1)
(x <- sample(1e9, size=6))
# [1] 265508664 372123900 572853364 908207790 201681932 898389685
## To hash R's internal representation of these numbers
strtoi(substr(sapply(x, digest), 28, 32), 16L) %% 1e3
# [1] 552 511 233 293 607 819
## Or, for a hash mapping that's comparable to other programs' md5 hash
## implementations
strtoi(substr(sapply(as.character(x), digest, serialize=FALSE),28,32),16L) %% 1e3
# [1] 153 180 892 294 267 807
Breaking that one-liner down into pieces should make what it does a bit clearer:
## Compute md5 hash of R representation of each input number
(sapply(x, digest))
# [1] "a276b4d73a46e5a827ccc1ad970dc780" "328dd60879c478d49ee9f3488d71a0af"
# [3] "e312c7f09be7f2e8391bee2b85f77c11" "e4ac99a3f0a904b385bfdcd45aca93e5"
# [5] "470d800a40ad5bc34abf2bac4ce88f37" "0008f4edeebbafcc995f7de0d5c0e5cb"
## Only really need the last few hex digits
substr(sapply(x, digest), 28, 32)
# [1] "dc780" "1a0af" "77c11" "a93e5" "88f37" "0e5cb"
## Convert hex strings to decimal integers
strtoi(substr(sapply(x, digest), 28, 32), 16L)
# [1] 903040 106671 490513 693221 560951 58827
## Map those to range between 0 and 999
strtoi(substr(sapply(x, digest), 28, 32), 16L) %% 1e3
# [1] 40 671 513 221 951 827
Unless you can define the mathematical properties of the available numbers (e.g., they are even, exponentially distributed or whatever) there is no way that any deterministic function would map these numbers into any given range evenly.
Every function you choose will have to map a certain class of numbers into a small region in the output range. If the hash function is complex, it may be difficult to determine a-priori the class that will be mishandled. Of course, this is a general problem of hash functions. You always have to assume something on the input.
In theory, the only solution (if you don't know anything about the numbers or can't analyze them) is to xor the input numbers with a truly random sequence and then use a mod
operation.
In practice, Josh's solution will probably work.
NOTE: If you can analyze the resulting array while you're hashing the numbers you might be able to change the hash function to evenly distribute the results. This might work for creating a hash table for later searching. However, this does not seem to be your application.