I have quite a big text file to parse. The main pattern is as follows:
step 1
[n1 lines of headers]
3 3 2
0.25 0.43 12.62 1.22 8.97
12.89 89.72 34.87 55.45 17.62
4.25 16.78 98.01 1.16 32.26
0.90 0.78 11.87
step 2
[n2 != n1 lines of headers]
3 3 2
0.25 0.43 12.62 1.22 8.97
12.89 89.72 34.87 55.45 17.62
4.25 16.78 98.01 1.16 32.26
0.90 0.78 11.87
step 3
[(n3 != n1) and (n3 !=n2) lines of headers]
3 3 2
0.25 0.43 12.62 1.22 8.97
12.89 89.72 34.87 55.45 17.62
4.25 16.78 98.01 1.16 32.26
0.90 0.78 11.87
In other words:
- A separator: step #
- Headers of known length (a number of lines, not bytes)
- The 3-dimensional shape of the data: nz, ny, nx
- Data: Fortran formatting, ~10 floats/line in the original dataset
I just want to extract the data, convert it to floats, put it in a numpy array and reshape it (ndarray.reshape) to the given shape.
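For the conversion step itself, here is a minimal sketch. It assumes the values are separated by whitespace and written in a form that float() accepts (Fortran D exponents such as 1.0D+00 would need replacing first). parse_block is a hypothetical helper name, and the default C-order reshape assumes that the last index (nx) varies fastest in the file.

```python
import numpy as np


def parse_block(lines):
    """Turn a shape line followed by data lines into a 3-D array."""
    # The first line holds the shape, in the order nz, ny, nx.
    nz, ny, nx = (int(tok) for tok in lines[0].split())
    # The remaining lines hold floats, any number per line.
    values = [float(tok) for line in lines[1:] for tok in line.split()]
    if len(values) != nz * ny * nx:
        raise ValueError("expected %d values, got %d" % (nz * ny * nx, len(values)))
    return np.array(values, dtype=float).reshape(nz, ny, nx)


# The data block from the example above, one string per line.
sample = [
    "3 3 2",
    "0.25 0.43 12.62 1.22 8.97",
    "12.89 89.72 34.87 55.45 17.62",
    "4.25 16.78 98.01 1.16 32.26",
    "0.90 0.78 11.87",
]
data = parse_block(sample)
```

The length check is there because a wrong offset usually shows up first as a block with too few or too many values.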
I've already done a bit of programming... The main idea is
- to get the offsets of each separator first ("step X")
- skip nX (n1, n2...) lines + 1 to reach the data
- read bytes from there all the way to the next separator.
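The three steps above could be sketched as follows for a single dataset, assuming the offset of its separator line and its number of header lines are already known. read_step and its parameters are hypothetical names. The file is opened in binary mode so that the offset is a plain byte count.

```python
def read_step(path, offset, n_header, sep=b"step "):
    """Return (shape, values) for the step whose separator line starts at offset."""
    with open(path, "rb") as f:
        f.seek(offset)
        f.readline()                   # the "step X" separator line itself
        for _ in range(n_header):      # header lines, count assumed known
            f.readline()
        # The shape line: nz, ny, nx.
        shape = tuple(int(tok) for tok in f.readline().decode("ascii").split())
        values = []
        # iter(f.readline, b"") yields lines until the end of the file.
        for line in iter(f.readline, b""):
            if line.startswith(sep):   # next separator: this step is done
                break
            values.extend(float(tok) for tok in line.decode("ascii").split())
    return shape, values
```

The returned list can then be passed to numpy and reshaped with the returned shape.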
I wanted to avoid regex at first since these would slow things down a lot. It already takes 3-4 minutes just to get the first step done (browsing the file to get the offset of each part).
The problem is that I'm basically using the file.tell() method to get the separator positions:
[file.tell() - len(sep) for line in file if sep in line]
The problem is two-fold:
- For smaller files, file.tell() gives the right separator positions; for longer files, it does not. I suspect that file.tell() should not be used in loops, neither with an explicit file.readline() nor with the implicit for line in file (I tried both). I don't know, but the result is there: with big files, [file.tell() for line in file if sep in line] does not systematically give the position of the line right after a separator.
- len(sep) does not give the right offset correction to go back to the beginning of the "separator" line. sep is a string (bytes) containing the first line of the file (the first separator).
Does anyone know how I should parse that?
NB: I find the offsets first because I want to be able to browse inside the file: I might just want the 10th dataset or the 50000th one...
1- Finding the offsets
sep = "step "
with open("myfile") as f_in:
    offsets = [f_in.tell() for line in f_in if sep in line]
As I said, this is working in the simple example, but not on the big file.
New test:
sep = "step "
offsets = []
with open("myfile") as f_in:
    for line in f_in:
        if sep in line:
            print(line)
            offsets.append(f_in.tell())
The lines printed are the separators, no doubt about it. But the offsets obtained with f_in.tell() do not correspond to the next line. I guess the file is buffered in memory and, as I try to use f_in.tell() in the implicit loop, I do not get the current position but the end of the buffer. This is just a wild guess.
I got the answer: for-loops on a file and tell() do not get along very well, just like mixing for i in file and file.readline() raises an error.
So, use file.tell() with file.readline() or file.read() only. Never ever call file.tell() inside a for line in file loop.
This is really a shame but that's the way it is.
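For what it's worth, here is a minimal sketch of the offset search done with readline() and tell() only. find_offsets is a hypothetical name, and it assumes that every separator line starts with "step " (the code above tests sep in line instead). The position is recorded before each line is read, so no len(sep) correction is needed, and binary mode keeps tell() a plain byte count.

```python
def find_offsets(path, sep=b"step "):
    """Return the byte offset of the start of every separator line."""
    offsets = []
    with open(path, "rb") as f:
        while True:
            pos = f.tell()         # position before the line is read
            line = f.readline()
            if not line:           # empty bytes object means end of file
                break
            if line.startswith(sep):
                offsets.append(pos)
    return offsets
```

Each offset can be passed straight to file.seek() to jump to the 10th or the 50000th dataset without rereading the file.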