I have text files totaling about 100 GB in the format below (with duplicate lines, IPs, and domains):
domain|ip
yahoo.com|89.45.3.5
bbc.com|45.67.33.2
yahoo.com|89.45.3.5
myname.com|45.67.33.2
etc.
I am trying to parse them using the following Python code, but I still get a MemoryError. Does anybody know a more optimal way of parsing such files? (Time is an important factor for me.)
from glob import glob

d = {}
files = glob(path)   # path is assumed to be defined earlier
for filename in files:
    print(filename)
    with open(filename) as f:
        for line in f:
            try:
                domain = line.split('|')[0]
                ip = line.split('|')[1].strip('\n')
                if ip in d:
                    d[ip].add(domain)
                else:
                    d[ip] = set([domain])
            except Exception:
                print(line)   # malformed line, skip it
    print("this file is finished")

for ip, domains in d.items():   # iteritems() only exists in Python 2
    for domain in domains:
        print("%s|%s" % (ip, domain), file=output)   # output is an open file object
Before thinking about multiprocessing, I would divide the lines into intervals. Then split the work into stages and process the data one interval at a time; with l total lines and an interval of step lines, the number of iterations is nbrIterations = l // step.
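A minimal sketch of that interval idea (the interval size, the flush_chunk() helper, and the file names here are placeholders, not part of the answer):

from itertools import islice

STEP = 1_000_000   # assumed interval size; tune it to your memory budget

def flush_chunk(pairs, output):
    # Placeholder stage: write this interval's results out so the dict can be freed.
    for ip, domains in pairs.items():
        for domain in domains:
            print("%s|%s" % (ip, domain), file=output)

with open(filename) as f, open("partial_output.txt", "w") as output:
    while True:
        chunk = list(islice(f, STEP))          # one interval of lines
        if not chunk:
            break
        pairs = {}
        for line in chunk:
            domain, _, ip = line.rstrip("\n").partition("|")
            pairs.setdefault(ip, set()).add(domain)
        flush_chunk(pairs, output)

The per-interval results still contain duplicates across intervals, so a final merge/de-duplication pass is needed afterwards.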
Your second option is to explore the possibilities of multiprocessing (it depends on your machine's performance).
Python objects take a little more memory than the same values do on disk; there is some overhead for the reference count, and in sets there is also a cached hash value per entry to consider.
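You can measure that overhead yourself (the exact numbers vary by Python version and platform):

import sys

# 'yahoo.com' is 9 bytes in the file, but as a Python str object it is several
# times that, and a set holding it adds a per-entry slot plus the cached hash.
print(sys.getsizeof("yahoo.com"))
print(sys.getsizeof({"yahoo.com"}))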
Don't read all those objects into (Python) memory; use a database instead. Python comes with a library for the SQLite database; use that to convert your file into a database. You can then build your output file from that:
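A minimal sketch of that approach, reusing the path glob from the question (the database and output file names are placeholders):

import sqlite3
from glob import glob
from itertools import islice

def read_pairs(fileobj):
    # Yield (domain, ip) tuples from one input file, skipping malformed lines.
    for line in fileobj:
        domain, _, ip = line.rstrip('\n').partition('|')
        if domain and ip:
            yield domain, ip

conn = sqlite3.connect('ip_domains.db')
conn.execute('CREATE TABLE IF NOT EXISTS pairs (domain TEXT, ip TEXT)')
# The UNIQUE index created up front lets INSERT OR IGNORE drop duplicate pairs.
conn.execute('CREATE UNIQUE INDEX IF NOT EXISTS uq_pair ON pairs (domain, ip)')

for filename in glob(path):
    with open(filename) as f:
        reader = read_pairs(f)
        while True:
            batch = list(islice(reader, 10000))   # insert in batches of 10000
            if not batch:
                break
            conn.executemany(
                'INSERT OR IGNORE INTO pairs (domain, ip) VALUES (?, ?)', batch)
            conn.commit()

# Build the index to sort on only after all inserts are done.
conn.execute('CREATE INDEX IF NOT EXISTS ix_ip ON pairs (ip, domain)')

with open('output.txt', 'w') as output:
    for ip, domain in conn.execute('SELECT ip, domain FROM pairs ORDER BY ip, domain'):
        output.write('%s|%s\n' % (ip, domain))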
This handles your input data in batches of 10000, and produces an index to sort on after inserting. Producing the index is going to take some time, but it'll all fit in your available memory.
The UNIQUE index created at the start ensures that only unique domain-IP address pairs are inserted (so only unique domains per IP address are tracked); the INSERT OR IGNORE statement skips any pair that is already present in the database.

Short demo with just the sample input you gave:
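With the four sample lines from the question, the resulting output file would look something like this:

45.67.33.2|bbc.com
45.67.33.2|myname.com
89.45.3.5|yahoo.com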
A different, simpler solution might be using the sort(1) utility:
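For example, with GNU sort (the file names and the exact --batch-size / --buffer-size values here are placeholders):

sort -t'|' -k2,2 -k1,1 -u -T . --batch-size=32 --buffer-size=50% -o sorted.txt input*.txt

Keying on both columns (-k2,2 -k1,1) makes -u de-duplicate whole domain|ip pairs instead of collapsing every domain that shares an IP.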
This will sort the file by the second column, where columns are separated by a |; -T sets the directory for temporary files to the current directory (the default is /tmp/, which is often a memory device). The -u flag removes duplicates, and the other flags may (or may not...) increase performance.

I tested this with a 5.5 GB file, and that took ~200 seconds on my laptop; I don't know how well that ranks vs. the other solutions posted. You may also get better performance with a different --batch-size or --buffer-size. In any case, this is certainly the simplest solution, as it requires no programming at all :)