Parsing Gigantic Log File in Python

I am trying to parse a gigantic log file (around 5 GB).

I only want to parse the first 500,000 lines and I don't want to read the whole file into memory.

Basically, I want to do what the below is code is doing but with a while loop instead of a for loop and if conditional. I also want to be sure not read the entire file into memory.

import re
from collections import defaultdict
FILE = open('logs.txt', 'r')
count_words=defaultdict(int)
import pickle
i=0
for line in FILE.readlines():
    if i < 500000:
        m = re.search('key=([^&]*)', line)
        count_words[m.group(1)]+=1
    i+=1

csv=[]
for k, v in count_words.iteritems():
    csv.append(k+","+str(v))
print "\n".join(csv)

标签： python parsing text logging

3条回答

地球回转人心会变

2楼-- · 2019-04-15 22:11

Here is a simple way to do it:

with open('logs.txt', 'r') as f:
    for line_number, line in enumerate(f, start=1):
        do_stuff(line)
        if line_number > 500000:
            break

0人赞添加讨论(0) 举报

成全新的幸福

3楼-- · 2019-04-15 22:12

Replace

for line in FILE.readlines():

with

for line in FILE:

to avoid reading it into memory in its entirety. Then, to process only the first 500000 lines, do

from itertools import islice

for line in islice(FILE, 500000):
    m = re.search('key=([^&]*)', line)
    count_words[m.group(1)] += 1

so that you only actually load the prefix of the file you're working with. (Your current program will actually loop through the entire file, whether or not it loads it into memory entirely.)

There's no need for a while loop with an if check to solve this problem.

0人赞添加讨论(0) 举报

▲ chillily

4楼-- · 2019-04-15 22:25

Calling readlines() will call the entire file into memory, so you'll have to read line by line until you reach line 500,000 or hit the EOF, whichever comes first. Here's what you should do instead:

i = 0
while i < 500000:
    line = FILE.readline()
    if line == "": # Cuts off if end of file reached
        break
    m = re.search('key=([^&]*)', line)
    count_words[m.group(1)]+=1
    i += 1

0人赞添加讨论(0) 举报

Parsing Gigantic Log File in Python

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间