How to read a large file, line by line in Python

Posted 2018-12-31 03:02

I want to iterate over each line of an entire file. One way to do this is by reading the entire file, saving it to a list, then going over the line of interest. This method uses a lot of memory, so I am looking for an alternative.

My code so far:

import fileinput

for each_line in fileinput.input(input_file):
    do_something(each_line)

    for each_line_again in fileinput.input(input_file):
        do_something(each_line_again)

Executing this code gives an error message: device active.

Any suggestions?

The purpose is to calculate pair-wise string similarity, meaning that for each line in the file, I want to calculate the Levenshtein distance to every other line.

10 Answers
无与为乐者. · 2018-12-31 03:07

Katrielalex provided the way to open & read one file.

However, the way your algorithm works, it reads the whole file once for every line of the file. That means the file will be read - and the Levenshtein distance computed - N*N times, where N is the number of lines in the file. Since you're concerned about file size and don't want to keep it in memory, I am concerned about the resulting quadratic runtime. Your algorithm is in the O(n^2) class of algorithms, which can often be improved with specialization.

I suspect that you already know the tradeoff of memory versus runtime here, but maybe you would want to investigate if there's an efficient way to compute multiple Levenshtein distances in parallel. If so it would be interesting to share your solution here.

How many lines do your files have, and on what kind of machine (mem & cpu power) does your algorithm have to run, and what's the tolerated runtime?

Code would look like:

with open(input_file, 'r') as f_outer:
    for line_outer in f_outer:
        with open(input_file, 'r') as f_inner:
            for line_inner in f_inner:
                compute_distance(line_outer, line_inner)

But the questions are how you store the distances (a matrix?), and whether you can gain an advantage by preparing, e.g., the outer line for processing, or by caching some intermediate results for reuse.
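As one illustration of that matrix idea, here is a minimal sketch that reads the file once, keeps the stripped lines for reuse, and fills a symmetric distance matrix. It assumes the lines and an N x N matrix fit in memory, and levenshtein() is a hypothetical helper you would supply, not something defined in this thread:

# Sketch only: levenshtein() is a placeholder for whatever distance function you use,
# and this assumes the list of lines plus the N x N matrix fit in memory.
with open(input_file, 'r') as f:
    lines = [line.rstrip('\n') for line in f]   # read and strip once, reuse below

n = len(lines)
distances = [[0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):                   # the distance is symmetric, so compute each pair once
        d = levenshtein(lines[i], lines[j])
        distances[i][j] = d
        distances[j][i] = d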

无色无味的生活 · 2018-12-31 03:19

This is a possible way to read a file in Python:

f = open(input_file)
for line in f:
    do_stuff(line)
f.close()

It does not allocate a full list; it iterates over the lines.
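For contrast, a quick sketch of the pattern this avoids, which does build the whole file as a list in memory:

f = open(input_file)
all_lines = f.readlines()   # allocates a list holding every line of the file at once
f.close()

for line in all_lines:
    do_stuff(line)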

春风洒进眼中 · 2018-12-31 03:21

To strip newlines:

with open(file_path, 'rU') as f:
    for line_terminated in f:
        line = line_terminated.rstrip('\n')
        ...

With universal newline support all text file lines will seem to be terminated with '\n', whatever the terminators in the file, '\r', '\n', or '\r\n'.

EDIT - To specify universal newline support:

  • Python 2 on Unix - open(file_path, mode='rU') - required [thanks @Dave]
  • Python 2 on Windows - open(file_path, mode='rU') - optional
  • Python 3 - open(file_path, newline=None) - optional

The newline parameter is only supported in Python 3 and defaults to None. The mode parameter defaults to 'r' in all cases. The U is deprecated in Python 3. In Python 2 on Windows some other mechanism appears to translate \r\n to \n.
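As a quick illustration of that behavior, here is a small sketch (the file name is just an example): write a file with mixed terminators, then read it back in Python 3's default text mode, and every line appears to end in '\n':

with open('mixed.txt', 'wb') as f:
    f.write(b'unix line\n' + b'windows line\r\n' + b'old mac line\r')

with open('mixed.txt') as f:        # Python 3 text mode, newline=None (universal newlines)
    for line in f:
        print(repr(line))           # every line ends in '\n', whatever the file contains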


To preserve native line terminators:

with open(file_path, 'rb') as f:
    for line_native_terminated in f:
        ...

Binary mode still parses the file into lines when you iterate with in. Each line keeps whatever terminators it has in the file.
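A small sketch of what that looks like; binary-mode iteration splits on b'\n' only, so '\n' and '\r\n' endings come through untouched:

with open(file_path, 'rb') as f:
    for line_native_terminated in f:
        # e.g. b'some line\r\n' from a Windows-style file, b'some line\n' from a Unix-style file
        print(repr(line_native_terminated))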

Thanks to @katrielalex's answer, Python's open() doc, and IPython experiments.

倾城一夜雪 · 2018-12-31 03:26

The correct, fully Pythonic way to read a file is the following:

with open(...) as f:
    for line in f:
        # Do something with 'line'

The with statement handles opening and closing the file, including if an exception is raised in the inner block. The for line in f treats the file object f as an iterable, which automatically uses buffered I/O and memory management so you don't have to worry about large files.
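For reference, a rough sketch of what the with statement is doing for you here; file_path and process() are just placeholders, not names from this answer:

f = open(file_path)
try:
    for line in f:
        process(line)    # placeholder for whatever you do with each line
finally:
    f.close()            # runs even if the loop body raises an exception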

There should be one -- and preferably only one -- obvious way to do it.

初与友歌 · 2018-12-31 03:26

A good way to read a large file line by line, while keeping track of the line number, is to use Python's enumerate function:

with open(file_name, "rU") as read_file:
    for i, row in enumerate(read_file, 1):
        # do something
        # i is the line number of that line (starting at 1)
        # row contains all data of that line
情到深处是孤独 · 2018-12-31 03:27

I would strongly recommend not using the default file loading as it is horrendously slow. You should look into the numpy functions and the IOpro functions (e.g. numpy.loadtxt()).

http://docs.scipy.org/doc/numpy/user/basics.io.genfromtxt.html

https://store.continuum.io/cshop/iopro/

Then you can break your pairwise operation into chunks:

import numpy as np
import math

lines_total = n                           # n = total number of lines in the file
similarity = np.zeros((n, n))             # pair-wise similarity matrix
lines_per_chunk = m
n_chunks = int(math.ceil(float(n) / m))
for i in xrange(n_chunks):
    for j in xrange(n_chunks):
        chunk_i = ...  # function of your choice to read lines i*lines_per_chunk to (i+1)*lines_per_chunk
        chunk_j = ...  # function of your choice to read lines j*lines_per_chunk to (j+1)*lines_per_chunk
        similarity[i*lines_per_chunk:(i+1)*lines_per_chunk,
                   j*lines_per_chunk:(j+1)*lines_per_chunk] = fast_operation(chunk_i, chunk_j)

It's almost always much faster to load data in chunks and then do matrix operations on it than to do it element by element!!
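One way the chunk-reading placeholders above could be filled in is with itertools.islice, which never holds more than one chunk in memory; this is my own sketch, not part of the original answer, and note it still has to scan past earlier lines on each call:

from itertools import islice

def read_chunk(path, start, stop):
    # Return lines start..stop-1 (0-based) of the file as a list of stripped strings.
    with open(path) as f:
        return [line.rstrip('\n') for line in islice(f, start, stop)]

# e.g. chunk_i = read_chunk(input_file, i * lines_per_chunk, (i + 1) * lines_per_chunk)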
