Reading Huge File in Python

I have a 384MB text file with 50 million lines. Each line contains 2 space-separated integers: a key and a value. The file is sorted by key. I need an efficient way of looking up the values of a list of about 200 keys in Python.

My current approach is included below. It takes 30 seconds. There must be more efficient Python foo to get this down to a reasonable efficiency of a couple of seconds at most.

# list contains a sorted list of the keys we need to lookup
# there is a sentinel at the end of list to simplify the code
# we use pointer to iterate through the list of keys
for line in fin:
  line = map(int, line.split())
  while line[0] == list[pointer].key:
    list[pointer].value = line[1]
    pointer += 1
  while line[0] > list[pointer].key:
    pointer += 1
  if pointer >= len(list) - 1:
    break # end of list; -1 is due to sentinel

Coded binary search + seek solution (thanks kigurai!):

entries = 24935502 # number of entries
width   = 18       # fixed width of an entry in the file padded with spaces
                   # at the end of each line
for i, search in enumerate(list): # list contains the list of search keys
  left, right = 0, entries-1 
  key = None
  while key != search and left <= right:
    mid = (left + right) / 2
    fin.seek(mid * width)
    key, value = map(int, fin.readline().split())
    if search > key:
      left = mid + 1
    else:
      right = mid - 1
  if key != search:
    value = None # for when search key is not found
  search.result = value # store the result of the search

标签： python performance file-io large-files

8条回答

啃猪蹄的小仙女

2楼-- · 2019-01-22 04:32

If you have any control over the format of the file, the "sort and binary search" responses are correct. The detail is that this only works with records of a fixed size and offset (well, I should say it only works easily with fixed length records).

With fixed length records, you can easily seek() around the sorted file to find your keys.

0人赞添加讨论(0) 举报

来，给爷笑一个

3楼-- · 2019-01-22 04:37

It's not clear what "list[pointer]" is all about. Consider this, however.

from collections import defaultdict
keyValues= defaultdict(list)
targetKeys= # some list of keys
for line in fin:
    key, value = map( int, line.split())
    if key in targetKeys:
        keyValues[key].append( value )

0人赞添加讨论(0) 举报

叛逆

4楼-- · 2019-01-22 04:38

If you only need 200 of 50 million lines, then reading all of it into memory is a waste. I would sort the list of search keys and then apply binary search to the file using seek() or something similar. This way you would not read the entire file to memory which I think should speed things up.

0人赞添加讨论(0) 举报

Bombasti

5楼-- · 2019-01-22 04:45

You need to implement binary search using seek()

0人赞添加讨论(0) 举报

啃猪蹄的小仙女

6楼-- · 2019-01-22 04:46

One possible optimization is to do a bit of buffering using the sizehint option in file.readlines(..). This allows you to load multiple lines in memory totaling to approximately sizehint bytes.

0人赞添加讨论(0) 举报

虎瘦雄心在

7楼-- · 2019-01-22 04:48

Slight optimization of S.Lotts answer:

from collections import defaultdict
keyValues= defaultdict(list)
targetKeys= # some list of keys as strings
for line in fin:
    key, value = line.split()
    if key in targetKeys:
        keyValues[key].append( value )

Since we're using a dictionary rather than a list, the keys don't have to be numbers. This saves the map() operation and a string to integer conversion for each line. If you want the keys to be numbers, do the conversion a the end, when you only have to do it once for each key, rather than for each of 50 million lines.

0人赞添加讨论(0) 举报

1 2 下一页

Reading Huge File in Python

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间