I have ~2 TB of CSVs where the first two columns contain ID numbers. These need to be anonymized so the data can be used in academic research. The anonymization can be (but does not have to be) irreversible. These are NOT medical records, so I do not need the fanciest cryptographic algorithm.
The Question:
Standard hashing algorithms produce really long strings, but I will have to do a lot of ID-matching (i.e. 'for the subset of rows containing ID XXX, do ...') to process the anonymized data, so this is not ideal. Is there a better way?
For example, if I know there are ~10 million unique account numbers, is there a standard way of using the set of integers [1:10 million] as replacement/anonymized IDs?
The computational constraint is that data will likely be anonymized on a 32-core ~500GB server machine.
I will assume that you want to make a single pass, one CSV with ID
numbers as input, another CSV with anonymized numbers as output. I will
also assume the number of unique IDs is somewhere on the order of 10
million or less.
It is my thought that it would be best to use some totally arbitrary
one-to-one function from the set of ID numbers (N) to the set of
de-identified numbers (D). This would be more secure. If you used some
sort of hash function, and an adversary learned what the hash was, the
numbers in N could be recovered without too much trouble with a
dictionary attack. Instead I suggest a simple lookup table: ID 1234567
maps to de-identified number 4672592, etc. The correspondence would be
stored in another file, and an adversary without that file would not be
able to do much.
With 10 million or fewer records, on a machine such as you describe,
this is not a big problem. A sketch program in Python:

import csv
import random

mapping = {}
unused_numbers = list(range(10_000_000))
random.shuffle(unused_numbers)        # so pop() draws a random unused number

with open('input.csv', newline='') as src, \
     open('output.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    for record in csv.reader(src):
        for i in (0, 1):              # the first two columns hold the IDs
            N = record[i]
            if N in mapping:
                D = mapping[N]
            else:
                D = unused_numbers.pop()
                mapping[N] = D
            record[i] = D
        writer.writerow(record)

# write mapping to lookup table file
with open('lookup_table.csv', 'w', newline='') as f:
    csv.writer(f).writerows(mapping.items())
It seems you don't care about the IDs being reversible, but if it helps, you can try one of the format-preserving encryption schemes. They are pretty much designed for this use case.
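As a rough illustration of the idea, here is a toy format-preserving permutation of [0, 10 million) built from a balanced 24-bit Feistel network with cycle-walking. This is not one of the standardized FPE modes (e.g. NIST's FF1); the key, round function, and round count are all illustrative assumptions, so treat it as a sketch rather than something vetted:

```python
import hashlib

DOMAIN = 10**7                      # 2**24 = 16,777,216 >= DOMAIN
KEY = b"replace-with-a-secret-key"  # placeholder key, not a real secret
ROUNDS = 4

def _round(half, r):
    """Keyed pseudo-random function on a 12-bit half, per round r."""
    digest = hashlib.sha256(KEY + bytes([r]) + half.to_bytes(2, "big")).digest()
    return int.from_bytes(digest[:2], "big") & 0xFFF

def _feistel(x):
    """One pass of a 4-round balanced Feistel cipher on 24 bits."""
    left, right = x >> 12, x & 0xFFF
    for r in range(ROUNDS):
        left, right = right, left ^ _round(right, r)
    return (left << 12) | right

def encrypt_id(n):
    """Map n in [0, DOMAIN) to a unique value in [0, DOMAIN)."""
    x = _feistel(n)
    while x >= DOMAIN:              # cycle-walk until back inside the domain
        x = _feistel(x)
    return x
```

Because the Feistel network is a permutation of the 24-bit space, cycle-walking is guaranteed to terminate and the restriction to [0, DOMAIN) is itself one-to-one, so no lookup table needs to be stored — only the key.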
Otherwise, if the hashes are too long, you can simply truncate them. Even if you keep only as many hex digits (from the hash) as there were digits in the original ID, collisions are unlikely. You could first read the file and check for collisions, though.
PS. If you end up doing hashing, make sure you prepend a salt of a reasonable size. Hashes of IDs in the range [1:10M] would be trivial to brute-force otherwise.
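Concretely, the salt-then-truncate approach could look like the sketch below; the 16-byte salt and the 10-hex-digit truncation are arbitrary choices, not requirements:

```python
import hashlib
import secrets

# A fresh random salt per run; persist it if the mapping must be
# reproducible across runs (and keep it secret to resist brute force).
SALT = secrets.token_bytes(16)

def anonymize(raw_id, digits=10):
    """Salted SHA-256 of the ID, truncated to `digits` hex characters."""
    return hashlib.sha256(SALT + raw_id.encode()).hexdigest()[:digits]

def count_collisions(ids):
    """How many distinct IDs collide after truncation (should be 0)."""
    return len(set(ids)) - len({anonymize(i) for i in ids})
```

Running `count_collisions` over the full set of real IDs before committing to a truncation length is the "read the file and check" step from above.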