Anonymization of Account Numbers in 2TB of CSVs

Posted 2019-07-05 03:01

I have ~2TB of CSVs in which the first two columns contain ID numbers. These need to be anonymized so the data can be used in academic research. The anonymization can be (but does not have to be) irreversible. These are NOT medical records, so I do not need the fanciest cryptographic algorithm.

The Question:

Standard hashing algorithms produce really long strings, but I will have to do a lot of ID matching (e.g. "for the subset of rows containing ID XXX, do ...") to process the anonymized data, so this is not ideal. Is there a better way?

For example, if I know there are ~10 million unique account numbers, is there a standard way of using the integers 1 to 10 million as replacement (anonymized) IDs?

The computational constraint is that data will likely be anonymized on a 32-core ~500GB server machine.

2 Answers
做自己的国王
#2 · 2019-07-05 04:02

It seems you don't need the IDs to be reversible, but if it helps, you can try one of the format-preserving encryption schemes. They are pretty much designed for this use case.
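To give a rough feel for the format-preserving idea, here is a toy keyed permutation of the integers [0, domain), built from a small Feistel network with cycle walking. It is only a sketch under stated assumptions: it is not a standardized scheme such as NIST's FF1, the names `permute` and `_round_value` and the choice of 8 rounds are made up for illustration, and it assumes the IDs are (or have been converted to) integers below `domain`. For real use, a vetted FPE library would be the safer choice.

```python
import hashlib
import hmac


def _round_value(key: bytes, round_no: int, half: int, half_bits: int) -> int:
    """Keyed round function (hypothetical helper): HMAC-SHA256 cut to half_bits bits."""
    msg = bytes([round_no]) + half.to_bytes(8, "big")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") & ((1 << half_bits) - 1)


def permute(x: int, key: bytes, domain: int, rounds: int = 8) -> int:
    """Map x in [0, domain) to another value in [0, domain), one-to-one for a fixed key."""
    if not 0 <= x < domain:
        raise ValueError("x must lie in [0, domain)")
    # Smallest even bit width that covers the domain, split into two halves.
    half_bits = max(1, ((domain - 1).bit_length() + 1) // 2)
    mask = (1 << half_bits) - 1
    while True:
        left, right = x >> half_bits, x & mask
        for r in range(rounds):
            left, right = right, left ^ _round_value(key, r, right, half_bits)
        x = (left << half_bits) | right
        # Cycle walking: if the result fell outside the domain, encrypt again.
        if x < domain:
            return x


key = b"demo key, replace with a secret"  # placeholder key for illustration only
print(permute(1234567, key, 10_000_000))
```

Because the Feistel network is a permutation of the padded range, walking until the value lands back inside the domain yields a permutation of the domain itself, so two different IDs can never receive the same replacement.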

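Otherwise, if the hashes are too long, you can simply truncate them. If you replace each digit of the original ID with one hex digit from the hash, collisions are unlikely as long as the IDs are long enough: by the birthday bound, ~10 million IDs truncated to 8 hex digits (32 bits) would be expected to collide thousands of times, whereas 16 hex digits (64 bits) makes a collision very unlikely. Either way, you could first read the file and check for collisions.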

PS. If you end up hashing, make sure you prepend a salt of reasonable size and keep it secret. Hashes of IDs in the range [1:10M] would be trivial to brute-force otherwise.
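The truncation, collision check and salting advice above could be sketched roughly as follows. This is a minimal illustration, not a vetted design: it swaps the plain "prepend a salt" step for a keyed HMAC-SHA256, which serves the same purpose here, and the names `pseudonym` and `find_collisions`, the 16-hex-digit default and the sample IDs are all assumptions made for the example.

```python
import hashlib
import hmac
import secrets


def pseudonym(account_id: str, key: bytes, hex_digits: int = 16) -> str:
    """Keyed hash of the ID, truncated to the first hex_digits hex characters."""
    digest = hmac.new(key, account_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:hex_digits]


def find_collisions(ids, key: bytes, hex_digits: int = 16):
    """Return pairs of distinct IDs that end up with the same truncated pseudonym."""
    seen = {}
    collisions = []
    for account_id in set(ids):
        p = pseudonym(account_id, key, hex_digits)
        if p in seen:
            collisions.append((seen[p], account_id))
        else:
            seen[p] = account_id
    return collisions


key = secrets.token_bytes(32)  # secret key: store it safely, or discard it for irreversibility
ids = ["1234567", "7654321", "1234567"]  # made-up sample IDs
print(find_collisions(ids, key))
print(pseudonym("1234567", key))
```

Running `find_collisions` over the full set of unique IDs before rewriting any data shows whether the chosen length is long enough; if it reports anything, increase `hex_digits` and check again.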

你好瞎i
#3 · 2019-07-05 04:03

I will assume that you want to make a single pass, one CSV with ID numbers as input, another CSV with anonymized numbers as output. I will also assume the number of unique IDs is somewhere on the order of 10 million or less.

I think it would be best to use a totally arbitrary one-to-one function from the set of ID numbers (N) to the set of de-identified numbers (D). This would be more secure: if you used some sort of hash function and an adversary learned which one it was, the numbers in N could be recovered without much trouble by a dictionary attack. Instead I suggest a simple lookup table: ID 1234567 maps to de-identified number 4672592, and so on. The correspondence would be stored in a separate file, and an adversary without that file would not be able to do much.
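To make the dictionary-attack concern concrete, here is a small sketch of how an unsalted hash of a short numeric ID can be inverted once the hash function is known. SHA-256, the helper names `sha256_hex` and `build_dictionary`, and the narrow candidate range are assumptions chosen to keep the example quick; covering all 10 million candidates works the same way and just takes a little longer.

```python
import hashlib


def sha256_hex(account_id: int) -> str:
    """Unsalted hash of the decimal form of the ID (the weak scheme being attacked)."""
    return hashlib.sha256(str(account_id).encode("ascii")).hexdigest()


def build_dictionary(candidates) -> dict:
    """Precompute hash -> ID for every candidate ID."""
    return {sha256_hex(c): c for c in candidates}


leaked = sha256_hex(1234567)  # what the adversary sees in the "anonymized" data
table = build_dictionary(range(1_230_000, 1_240_000))
print(table[leaked])  # prints 1234567
```

A random lookup table has no such shortcut, because nothing about the replacement number can be recomputed from the original ID.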

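With 10 million or fewer unique IDs, the whole lookup table fits easily in memory on a machine such as you describe, so this is not a big problem. A sketch program in Python (it assumes no header row and at least two columns in every row):

import csv
import random
import sys

# Usage (file names are placeholders): python anonymize.py in.csv out.csv mapping.csv
in_path, out_path, map_path = sys.argv[1:4]

# Pool of replacement numbers, shuffled once so that each draw is an O(1) pop.
# SystemRandom uses the OS entropy source, so the order cannot be replayed from a seed.
unused_numbers = list(range(10_000_000))
random.SystemRandom().shuffle(unused_numbers)
mapping = {}

with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
    writer = csv.writer(dst)
    for record in csv.reader(src):
        for col in (0, 1):                          # the two ID columns
            N = record[col]
            if N not in mapping:
                mapping[N] = unused_numbers.pop()   # IndexError if there are more than 10M unique IDs
            record[col] = mapping[N]                # replace N with D
        writer.writerow(record)

# Write the mapping to the lookup table file, and keep that file secret.
with open(map_path, "w", newline="") as f:
    csv.writer(f).writerows(mapping.items())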