How to parse files larger than 100GB in Python?

Posted 2019-04-13 00:46

I have text files of about 100 GB each in the format below (with duplicate lines, IPs, and domains):

domain|ip 
yahoo.com|89.45.3.5
bbc.com|45.67.33.2
yahoo.com|89.45.3.5
myname.com|45.67.33.2
etc.

I am trying to parse them with the following Python code, but I still get a MemoryError. Does anybody know a more memory-efficient way to parse such files? (Time also matters to me.)

from glob import glob

d = {}
files = glob(path)
for filename in files:
    print(filename)
    with open(filename) as f:
        for line in f:
            try:
                domain, ip = line.rstrip('\n').split('|')
                if ip in d:
                    d[ip].add(domain)
                else:
                    d[ip] = set([domain])
            except ValueError:  # malformed line
                print(line)

    print("this file is finished")

for ip, domains in d.items():
    for domain in domains:
        print("%s|%s" % (ip, domain), file=output)

3 Answers
家丑人穷心不美
#2 · 2019-04-13 00:52

Before thinking about multiprocessing, I would divide the lines into intervals.

  • Calculate the number of lines in the file (counting lazily, so the whole file is never held in memory):
  with open(filename) as f:
      l = sum(1 for _ in f)

Then divide your work into stages and process the data, taking two things into consideration:

  1. Load/store your intermediate data somewhere outside memory (a database, CSV, JSON, ...), whatever you find useful.
  2. Divide the processing work into stages, incrementing the range of lines you process each time until you finish (so you can reuse the code you already wrote).

nbrIterations = l // step

  • Pack the code inside a function that takes the start of a line interval, and increment it every time:

def dataProcessing(numberOfLine):
    if numberOfLine > l:
        print("this file is finished")
        return False
    else:
        files = glob(path)
        for filename in files:
            print(filename)
            with open(filename) as f:
                for lineNumber, line in enumerate(f):
                    # only handle lines that fall inside the current interval
                    if numberOfLine <= lineNumber < numberOfLine + step:
                        domain, ip = line.rstrip('\n').split('|')
                        if ip in d:
                            d[ip].add(domain)
                        else:
                            d[ip] = set([domain])

        # write out this interval's results (better stored in another file, or
        # loaded into a CSV/DB with pandas or a DB connector), then free the memory
        for ip, domains in d.items():
            for domain in domains:
                print("%s|%s" % (ip, domain), file=output)
        d.clear()
        return True

  • Define your "step" and a start position so you can walk through the lines of the file:

numberOfLine = 0
while dataProcessing(numberOfLine):
    numberOfLine += step

Your second option is to explore multiprocessing (how much it helps depends on your machine).
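
A minimal sketch of what that could look like, with one worker process per file (the glob pattern, the four-worker pool size, and the per-file .out output files are just assumptions for illustration; each worker still holds one file's mapping in memory, so this complements the chunking above rather than replacing it):

from glob import glob
from multiprocessing import Pool

def parse_one_file(filename):
    # each worker builds its own ip -> domains mapping and writes its own
    # partial output file, so nothing needs to be shared between processes
    d = {}
    with open(filename) as f:
        for line in f:
            try:
                domain, ip = line.rstrip('\n').split('|')
            except ValueError:
                continue  # skip malformed lines
            d.setdefault(ip, set()).add(domain)
    with open(filename + '.out', 'w') as output:
        for ip, domains in d.items():
            for domain in domains:
                print("%s|%s" % (ip, domain), file=output)
    return filename

if __name__ == '__main__':
    files = glob('/data/*.txt')        # hypothetical path pattern
    with Pool(processes=4) as pool:    # tune to your CPU count
        for finished in pool.map(parse_one_file, files):
            print("%s is finished" % finished)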

Juvenile、少年°
#3 · 2019-04-13 01:02

Python objects take a little more memory than the same value does on disk; there is a little overhead in a reference count, and in sets there is also the cached hash value per value to consider.
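
To get a feel for that overhead, here is a quick illustrative check with sys.getsizeof (the exact numbers depend on the Python build, so treat them as ballpark figures):

import sys

ip = "89.45.3.5"                 # 9 characters, so 9 bytes on disk
print(len(ip.encode('ascii')))   # 9
print(sys.getsizeof(ip))         # several dozen bytes for the str object alone
print(sys.getsizeof({ip}))       # a one-element set is larger still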

Don't read all those objects into (Python) memory; use a database instead. Python comes with the sqlite3 module for SQLite databases; use it to convert your file into a database, then build your output file from that:

import csv
import sqlite3
from itertools import islice

conn = sqlite3.connect('/tmp/ipaddresses.db')
conn.execute('CREATE TABLE IF NOT EXISTS ipaddress (domain, ip)')
conn.execute('''\
    CREATE UNIQUE INDEX IF NOT EXISTS domain_ip_idx 
    ON ipaddress(domain, ip)''')

for filename in files:
    print(filename)
    with open(filename, newline='') as f:  # text mode, so csv.reader works on Python 3
        reader = csv.reader(f, delimiter='|')
        cursor = conn.cursor()
        while True:
            with conn:
                batch = list(islice(reader, 10000))
                if not batch:
                    break
                cursor.executemany(
                    'INSERT OR IGNORE INTO ipaddress VALUES(?, ?)',
                    batch)

conn.execute('CREATE INDEX IF NOT EXISTS ip_idx ON ipaddress(ip)')
with open(outputfile, 'w', newline='') as outfh:
    writer = csv.writer(outfh, delimiter='|')
    cursor = conn.cursor()
    cursor.execute('SELECT ip, domain from ipaddress order by ip')
    writer.writerows(cursor)

This handles your input data in batches of 10000, and produces an index to sort on after inserting. Producing the index is going to take some time, but it'll all fit in your available memory.

The UNIQUE index created at the start ensures that only unique domain/IP pairs are inserted (so only unique domains per IP address are tracked); the INSERT OR IGNORE statement skips any pair that is already present in the database.

Short demo with just the sample input you gave:

>>> import sqlite3
>>> import csv
>>> import sys
>>> from itertools import islice
>>> conn = sqlite3.connect('/tmp/ipaddresses.db')
>>> conn.execute('CREATE TABLE IF NOT EXISTS ipaddress (domain, ip)')
<sqlite3.Cursor object at 0x106c62730>
>>> conn.execute('''\
...     CREATE UNIQUE INDEX IF NOT EXISTS domain_ip_idx 
...     ON ipaddress(domain, ip)''')
<sqlite3.Cursor object at 0x106c62960>
>>> reader = csv.reader('''\
... yahoo.com|89.45.3.5
... bbc.com|45.67.33.2
... yahoo.com|89.45.3.5
... myname.com|45.67.33.2
... '''.splitlines(), delimiter='|')
>>> cursor = conn.cursor()
>>> while True:
...     with conn:
...         batch = list(islice(reader, 10000))
...         if not batch:
...             break
...         cursor.executemany(
...             'INSERT OR IGNORE INTO ipaddress VALUES(?, ?)',
...             batch)
... 
<sqlite3.Cursor object at 0x106c62810>
>>> conn.execute('CREATE INDEX IF NOT EXISTS ip_idx ON ipaddress(ip)')
<sqlite3.Cursor object at 0x106c62960>
>>> writer = csv.writer(sys.stdout, delimiter='|')
>>> cursor = conn.cursor()
>>> cursor.execute('SELECT ip, domain from ipaddress order by ip')
<sqlite3.Cursor object at 0x106c627a0>
>>> writer.writerows(cursor)
45.67.33.2|bbc.com
45.67.33.2|myname.com
89.45.3.5|yahoo.com
Lonely孤独者°
#4 · 2019-04-13 01:16

A different, simpler solution might be to use the sort(1) utility:

sort input -u -t\| -k2 -T . --batch-size=50 --buffer-size=1G > output 

This will sort the file by the second column, where columns are separated by |; -T sets the directory for temporary files to the current directory (the default is /tmp/, which is often an in-memory device). The -u flag removes duplicates, and the other flags may (or may not...) improve performance.

I tested this with a 5.5GB file, and that took ~200 seconds on my laptop; I don't know how well that ranks vs. the other solutions posted. You may also get better performance with a different --batch-size or --buffer-size.

In any case, this is certainly the simplest solution as it requires no programming at all :)
