I'm on a Linux machine (Red Hat) and I have an 11GB text file. Each line in the text file contains data for a single record, and the first n characters of the line contain a unique identifier for the record. The file contains a little over 27 million records.
I need to verify that there are not multiple records with the same unique identifier in the file. I also need to perform this process on an 80GB text file, so any solution that requires loading the entire file into memory would not be practical.
Read the file line-by-line, so you don't have to load it all into memory.
For each line (record), compute a SHA-256 hash (32 bytes) as a fixed-size key, or just use the identifier itself if it is shorter than that.
Store the hashes/identifiers in a numpy.array; that is probably the most compact way to store them. 27 million records times 32 bytes per hash is 864 MB, which should fit into the memory of a decent machine these days.
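For illustration, here is a minimal sketch of the numpy variant, assuming the file name bigdata.txt used in the code below; note that np.unique sorts a copy of the array of digests, so expect roughly twice the 864 MB while it runs:
import hashlib
import numpy as np

HASH_LEN = 32  # bytes per SHA-256 digest

# Collect one fixed-size digest per record into a flat byte buffer.
digests = bytearray()
with open("bigdata.txt", "rb") as datafile:
    for line in datafile:
        digests += hashlib.sha256(line).digest()
        # or hashlib.sha256(line[:n]) to key on just the identifier

# View the buffer as one row per record; np.unique sorts the rows, so
# duplicated hashes (i.e. duplicated records) show up as counts > 1.
arr = np.frombuffer(digests, dtype=np.uint8).reshape(-1, HASH_LEN)
rows, counts = np.unique(arr, axis=0, return_counts=True)
print("records appearing more than once:", int((counts > 1).sum()))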
To speed up access you could use e.g. the first 2 bytes of the hash as the key of a collections.defaultdict and put the rest of the hash in a list in the value. This would in effect create a hash table with 65536 buckets. For 27e6 records, each bucket would contain on average a list of around 400 entries.
It would mean faster searching than a numpy array, but it would use more memory.
import collections
import hashlib

d = collections.defaultdict(list)
with open('bigdata.txt', 'rb') as datafile:  # binary mode so hashlib gets bytes
    for line in datafile:
        id = hashlib.sha256(line).digest()
        # Or id = line[:n] if you use the identifier itself
        k = id[:2]   # first 2 bytes pick the bucket
        v = id[2:]   # the rest is stored in the bucket's list
        if v in d[k]:
            print("double found:", id)
        else:
            d[k].append(v)
Right tool for the job: put your records into a database. Unless you already have a Postgres or MySQL installation handy, I'd take SQLite.
$ sqlite3 uniqueness.sqlite
create table chk (
ident char(n), -- n as in first n characters
lineno integer -- for convenience
);
^D
Then I'd insert the unique identifier and line number into that table, possibly using a Python script like this:
import sqlite3  # sqlite3 ships with the Python standard library

n = ...  # how many chars are in the key part
lineno = 0
conn = sqlite3.connect("uniqueness.sqlite")
cur = conn.cursor()
with open("giant-file") as input:
    for line in input:
        lineno += 1
        ident = line[:n]
        cur.execute("insert into chk(ident, lineno) values(?, ?)", [ident, lineno])
conn.commit()  # commit before closing, or the inserts are rolled back
cur.close()
conn.close()
After this, you can index the table and use SQL:
$ sqlite3 uniqueness.sqlite
create index x_ident on chk(ident); -- may take a bit of time
-- quickly find duplicates, if any
select ident, count(ident) as how_many
from chk
group by ident
having count(ident) > 1;
-- find lines of specific violations, if needed
select lineno
from chk
where ident = ...; -- insert a duplicate ident
Yes, I tried most of this code, it should work :)
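If you prefer to stay in Python rather than the sqlite3 shell, the same index and duplicate query can be issued through a connection to the database built above; a small sketch, reusing the table and column names from the answer:
import sqlite3

conn = sqlite3.connect("uniqueness.sqlite")
cur = conn.cursor()
cur.execute("create index if not exists x_ident on chk(ident)")
cur.execute("""
    select ident, count(ident) as how_many
    from chk
    group by ident
    having count(ident) > 1
""")
for ident, how_many in cur:
    print(ident, "appears", how_many, "times")
cur.close()
conn.close()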
Assuming I couldn't use a database, I'd try something like this:
# Read the file one line at a time: http://stackoverflow.com/a/6475407/322909
# (be sure to read the comments there)
keys = set()
with open("bigfile.txt") as f:
    for line in f:
        key = get_key(line)  # e.g. line[:n] for the first n characters
        if key in keys:
            print("dup")
        else:
            keys.add(key)
Try this:
n=...   # set to the length of the unique identifier
cut -c-$n 11gb_file | sort | uniq -cd
This will output any duplicate identifiers and how many times each appeared. (sort does an external merge sort, spilling to temporary files on disk, so it does not need to hold the whole file in memory.)
I haven't tried this on a file quite that large, but assuming the identifier is the first 7 characters and the lines aren't longer than 7+999 characters, this might do the job:
awk 'BEGIN{FIELDWIDTHS="7 999"} !a[$1]++' file > newfile
(FIELDWIDTHS needs gawk; this writes a de-duplicated copy to newfile, keeping the first record for each identifier, rather than listing the duplicates.)
I would never recommend that you try to filter such a massive text file in Python. No matter how you tackle it, you will need to go through some complicated steps to make sure that you do not run out of memory.
The first thing that comes to mind is creating a hash of the lines and then using the hash to find duplicates. Since you save the line number as well, you can then compare the text directly to make sure that there are no hash collisions.
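A minimal sketch of that idea, adapted to hash just the first-n-character identifier (which is what has to be unique) and to record byte offsets instead of line numbers, so the earlier line can be re-read and compared directly whenever two hashes match; n and the file name are placeholders:
import hashlib

n = 20                     # hypothetical identifier length; adjust to your data
filename = "bigfile.txt"   # hypothetical file name

seen = {}  # digest of identifier -> byte offset of the first line that produced it
with open(filename, "rb") as datafile, open(filename, "rb") as lookup:
    while True:
        offset = datafile.tell()
        line = datafile.readline()
        if not line:
            break
        ident = line[:n]
        digest = hashlib.sha256(ident).digest()
        if digest in seen:
            # Same hash seen before: re-read that earlier line and compare the
            # identifiers directly, so a hash collision is not reported as a duplicate.
            lookup.seek(seen[digest])
            if lookup.readline()[:n] == ident:
                print("duplicate identifier at byte offsets", seen[digest], "and", offset)
        else:
            seen[digest] = offset
The dictionary of digests still has to fit in RAM, so this avoids loading the file itself but not the per-record bookkeeping.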
But the easiest solution would be to convert the text file into a database that lets you quickly sort, search, and filter out duplicate items. You can then re-create the text file from it if that is really a requirement.
Read large text files in Python, line by line without loading it in to memory
The answer to that question was this:
with open("log.txt") as infile:
for line in infile:
do_something_with(line)
Perhaps that will help you somehow. Good luck.