Find duplicate records in large text file

Posted 2019-05-07 13:00

I'm on a Linux machine (Red Hat) and I have an 11GB text file. Each line in the text file contains the data for a single record, and the first n characters of the line contain a unique identifier for the record. The file contains a little over 27 million records.

I need to verify that there are not multiple records with the same unique identifier in the file. I also need to perform this process on an 80GB text file, so any solution that requires loading the entire file into memory is not practical.

7 Answers
Answer 2 · 2019-05-07 13:59

Right tool for the job: put your records into a database. Unless you already have a Postgres or MySQL installation handy, I'd use SQLite.

$ sqlite3 uniqueness.sqlite
create table chk (
  ident char(n), -- n as in first n characters
  lineno integer -- for convenience
);
^D
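
If you'd rather not open an interactive sqlite3 shell, the table can also be created from Python. A minimal sketch under the same file and table names as above; note that SQLite treats char(n) as plain TEXT and does not enforce the length, so a generic text column is equivalent:

import sqlite3

conn = sqlite3.connect("uniqueness.sqlite")
# SQLite does not enforce char(n) lengths, so a plain text column works here.
conn.execute(
    "create table if not exists chk ("
    "  ident text,"     # first n characters of each line
    "  lineno integer"  # for convenience
    ")"
)
conn.commit()
conn.close()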

Then I'd insert the unique identifier and line number into that table, possibly using a Python script like this:

import sqlite3  # part of the Python standard library, no separate install needed

n = ...  # how many chars are in the key part
lineno = 0

conn = sqlite3.connect("uniqueness.sqlite")
cur = conn.cursor()
with open("giant-file") as infile:
    for line in infile:
        lineno += 1
        ident = line[:n]  # the identifier is the first n characters
        cur.execute("insert into chk(ident, lineno) values(?, ?)", [ident, lineno])
conn.commit()  # without an explicit commit the inserts are rolled back on close
cur.close()
conn.close()
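
For 27 million rows (and eventually far more), one execute() call per line can be slow. Here is a sketch of the same load batched with executemany() and periodic commits; the batch size of 100000 is an arbitrary choice, and the file, table, and column names are the same as in the script above:

import sqlite3

n = ...         # how many chars are in the key part
BATCH = 100000  # arbitrary batch size, tune to taste

conn = sqlite3.connect("uniqueness.sqlite")
cur = conn.cursor()
batch = []
with open("giant-file") as infile:
    for lineno, line in enumerate(infile, start=1):
        batch.append((line[:n], lineno))
        if len(batch) >= BATCH:
            cur.executemany("insert into chk(ident, lineno) values(?, ?)", batch)
            conn.commit()
            batch = []
if batch:  # flush whatever is left over
    cur.executemany("insert into chk(ident, lineno) values(?, ?)", batch)
conn.commit()
cur.close()
conn.close()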

After this, you can index the table and use SQL:

$ sqlite3 uniqueness.sqlite
create index x_ident on chk(ident); -- may take a bit of time

-- quickly find duplicates, if any
select ident, count(ident) as how_many
from chk
group by ident
having count(ident) > 1;

-- find lines of specific violations, if needed
select lineno 
from chk
where ident = ...; -- substitute one of the duplicate idents found above

Yes, I tried most of this code, it should work :)
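
If you want the duplicate report from Python rather than from the sqlite3 shell, here is a small sketch that runs the same GROUP BY query against the table above and prints the offenders:

import sqlite3

conn = sqlite3.connect("uniqueness.sqlite")
cur = conn.cursor()
cur.execute(
    "select ident, count(ident) as how_many"
    " from chk group by ident having count(ident) > 1"
)
dups = cur.fetchall()
if dups:
    for ident, how_many in dups:
        print("duplicate identifier %r appears %d times" % (ident, how_many))
else:
    print("no duplicate identifiers found")
conn.close()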
