I want to iterate over each line of an entire file. One way to do this is to read the whole file into a list and then iterate over the lines. That method uses a lot of memory, so I am looking for an alternative.
My code so far:
    import fileinput

    for each_line in fileinput.input(input_file):
        do_something(each_line)
        for each_line_again in fileinput.input(input_file):
            do_something(each_line_again)
Executing this code gives an error message: device active.
Any suggestions?
The purpose is to calculate pairwise string similarity: for each line in the file, I want to calculate the Levenshtein distance to every other line.
Katrielalex provided the way to open and read one file.

However, the way your algorithm works, it reads the whole file for each line of the file. That means the overall amount of reading the file - and computing the Levenshtein distance - will be N*N, if N is the number of lines in the file. Since you're concerned about file size and don't want to keep it in memory, I am concerned about the resulting quadratic runtime: your algorithm is in the O(n^2) class of algorithms, which can often be improved with specialization.
I suspect you already know the memory-versus-runtime tradeoff here, but maybe you would want to investigate whether there's an efficient way to compute multiple Levenshtein distances in parallel. If so, it would be interesting to share your solution here.

How many lines do your files have, on what kind of machine (memory and CPU power) does your algorithm have to run, and what is the tolerated runtime?
The code would look like this:
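A minimal sketch of that shape, using a textbook dynamic-programming Levenshtein implementation and a temporary stand-in for the real `input_file`:

```python
import os
import tempfile

def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                    # deletion
                           cur[j - 1] + 1,                 # insertion
                           prev[j - 1] + (ca != cb)))      # substitution
        prev = cur
    return prev[-1]

# stand-in input file; replace with your real path
fd, input_file = tempfile.mkstemp()
with os.fdopen(fd, "w") as f:
    f.write("kitten\nsitting\nkitchen\n")

distances = {}
with open(input_file) as outer:
    for i, outer_line in enumerate(outer):
        a = outer_line.rstrip("\n")
        # the inner loop re-reads the whole file for every outer line: N*N reads
        with open(input_file) as inner:
            for j, inner_line in enumerate(inner):
                distances[(i, j)] = levenshtein(a, inner_line.rstrip("\n"))
os.remove(input_file)
```

Note that each pair is computed twice here (`(i, j)` and `(j, i)`); since the distance is symmetric, half of that work could be skipped.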
But the questions are how you store the distances (a matrix?) and whether you can gain an advantage by, e.g., preparing the outer_line for processing, or caching some intermediate results for reuse.
This is a possible way of reading a file in Python: it does not allocate a full list; it iterates over the lines.
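A sketch of that pattern; the file here is a small stand-in created just for the example:

```python
# create a small stand-in file; in practice you would open your real file
with open("big.log", "w") as f:
    f.write("a\nb\nc\n")

count = 0
f = open("big.log")
for line in f:   # lazy iteration: one line in memory at a time, no full list
    count += 1
f.close()
```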
To strip newlines:
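In text mode each line arrives with a trailing `'\n'`, which `rstrip` removes:

```python
line = "some text\n"           # a line as yielded by text-mode iteration
stripped = line.rstrip("\n")   # drop the trailing newline
```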
With universal newline support all text file lines will seem to be terminated with `'\n'`, whatever the terminators in the file: `'\r'`, `'\n'`, or `'\r\n'`.

EDIT - To specify universal newline support:

Python 2 on Unix - `open(file_path, mode='rU')` - required [thanks @Dave]
Python 2 on Windows - `open(file_path, mode='rU')` - optional
Python 3 - `open(file_path, newline=None)` - optional

The `newline` parameter is only supported in Python 3 and defaults to `None`. The `mode` parameter defaults to `'r'` in all cases. The `U` is deprecated in Python 3. In Python 2 on Windows some other mechanism appears to translate `\r\n` to `\n`.

Docs:
To preserve native line terminators:
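A sketch of the binary-mode variant, using a stand-in file with mixed terminators:

```python
# write a stand-in file containing both Unix and Windows line endings
with open("native.txt", "wb") as f:
    f.write(b"unix line\nwindows line\r\n")

raw_lines = []
with open("native.txt", "rb") as f:   # binary mode: terminators are preserved
    for line in f:                    # iteration still splits after b'\n'
        raw_lines.append(line)
```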
Binary mode can still parse the file into lines with `in`. Each line will have whatever terminators it has in the file.

Thanks to @katrielalex's answer, Python's open() doc, and iPython experiments.
The correct, fully Pythonic way to read a file is the following:
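A sketch of that pattern, with a stand-in file created for the example:

```python
# stand-in content; use your real path instead
with open("example.txt", "w") as f:
    f.write("first line\nsecond line\n")

lines = []
with open("example.txt") as f:   # the file is closed automatically, even on exceptions
    for line in f:               # f is iterated lazily with buffered I/O
        lines.append(line)
```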
The `with` statement handles opening and closing the file, including if an exception is raised in the inner block. The `for line in f` treats the file object `f` as an iterable, which automatically uses buffered I/O and memory management, so you don't have to worry about large files.

The best way to read a large file line by line is to use Python's enumerate function.
I would strongly recommend not using the default file loading, as it is horrendously slow. You should look into the NumPy functions and the IOpro functions (e.g. numpy.loadtxt()).
http://docs.scipy.org/doc/numpy/user/basics.io.genfromtxt.html
https://store.continuum.io/cshop/iopro/
Then you can break your pairwise operation into chunks:
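One way to sketch the chunked approach; the chunk size, the sample data, and the absolute-difference kernel are illustrative stand-ins for the real pairwise computation:

```python
import numpy as np

data = np.arange(10, dtype=float)   # stand-in for values loaded with numpy.loadtxt()
chunk = 4                           # illustrative chunk size
n = data.size

result = np.empty((n, n))
for i in range(0, n, chunk):
    a = data[i:i + chunk]
    for j in range(0, n, chunk):
        b = data[j:j + chunk]
        # one vectorized matrix operation per pair of chunks,
        # instead of one Python-level call per pair of elements
        result[i:i + chunk, j:j + chunk] = np.abs(a[:, None] - b[None, :])
```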
It's almost always much faster to load data in chunks and then do matrix operations on it than to do it element by element!