I'm on a Linux machine (Red Hat) and I have an 11GB text file. Each line in the text file contains data for a single record, and the first n characters of the line contain a unique identifier for the record. The file contains a little over 27 million records.
I need to verify that there are not multiple records with the same unique identifier in the file. I also need to perform this process on an 80GB text file, so any solution that requires loading the entire file into memory would not be practical.
Right tool for the job: put your records into a database. Unless you already have a Postgres or MySQL installation handy, I'd go with SQLite.
Then I'd insert each unique identifier and its line number into a table, possibly using a Python script like this:
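(A rough sketch; the input file name records.txt, the database file records.db, and the identifier length ID_LENGTH are placeholders you would adjust to your data.)

    import sqlite3

    ID_LENGTH = 10  # assumed length of the unique identifier; adjust to your record format

    conn = sqlite3.connect("records.db")
    conn.execute("CREATE TABLE IF NOT EXISTS records (id TEXT, line_no INTEGER)")

    with open("records.txt", "r") as infile:
        # A generator keeps memory usage flat: lines are read and inserted one at a time.
        rows = ((line[:ID_LENGTH], line_no) for line_no, line in enumerate(infile, start=1))
        conn.executemany("INSERT INTO records (id, line_no) VALUES (?, ?)", rows)

    conn.commit()
    conn.close()

Since commit is only called once, all the inserts go into a single transaction, which keeps the load reasonably fast even for tens of millions of rows.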
After this, you can index the table and use SQL to look for duplicate identifiers:
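Something along these lines (again assuming the records.db layout from the sketch above); the duplicate check itself is just a GROUP BY / HAVING query:

    import sqlite3

    conn = sqlite3.connect("records.db")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_records_id ON records (id)")

    # Any identifier that appears more than once is a duplicate; print it with its count.
    query = "SELECT id, COUNT(*) FROM records GROUP BY id HAVING COUNT(*) > 1"
    for record_id, count in conn.execute(query):
        print(record_id, count)

    conn.close()

If the query returns no rows, every identifier in the file is unique; otherwise the stored line numbers let you go back and inspect the offending records.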
Yes, I tried most of this code; it should work :)