Imagine you have a .txt file of the following structure:
>>> header
>>> header
>>> header
K L M
200 0.1 1
201 0.8 1
202 0.01 3
...
800 0.4 2
>>> end of file
50 0.1 1
75 0.78 5
...
I would like to read all the data except lines denoted by >>> and lines below the >>> end of file line.
So far I've solved this using read.table(comment.char = ">", skip = x, nrow = y) (x and y are currently fixed). This reads the data between the header and >>> end of file.
However, I would like to make my function a bit more flexible regarding the number of rows. The data may contain values larger than 800, and consequently more rows.
I could scan or readLines the file and see which row corresponds to the >>> end of file line and calculate the number of lines to be read. What approach would you use?
Here are a couple of ways.
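As a rough sketch of the first approach, described in 1) just below: the names L, skip, end.of.file and File follow that description, while the temporary file, the grep patterns and the nrows arithmetic are assumptions based on the sample in the question.

```r
# "File" is a stand-in path; the sample is recreated from the question's
# snippet with the "..." lines removed.
File <- tempfile(fileext = ".txt")
writeLines(c(
  ">>> header", ">>> header", ">>> header",
  "K L M",
  "200 0.1 1", "201 0.8 1", "202 0.01 3", "800 0.4 2",
  ">>> end of file",
  "50 0.1 1", "75 0.78 5"
), File)

L <- readLines(File)

# Lines to skip at the top: everything before the first line that does
# not start with ">>>" (assumed to be the column-name line).
skip <- grep("^>>>", L, invert = TRUE)[1] - 1

# Line number of the row marking the end of the data.
end.of.file <- grep("^>>> end of file", L)[1]

# Data rows = lines above the marker, minus the skipped lines and the
# column-name line, hence the "- 2" (one for the names, one for the marker).
DF <- read.table(File, header = TRUE, skip = skip,
                 nrows = end.of.file - skip - 2)
```

Because skip and nrows are computed from the file itself, the same call should keep working when the data block grows beyond the row for 800.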
1) Use readLines to read the lines of the file into L, then set skip to the number of lines to skip at the beginning and end.of.file to the line number of the row marking the end of the data. The read.table command then uses these two variables to re-read the data.
A variation would be to use textConnection in place of File in the read.table line.
2) Another possibility is to use sed or awk/gawk. Consider a one-line gawk program that exits if it sees the line marking the end of the data; otherwise, it skips the current line if that line starts with >>>, and if neither of those happens it prints the line. We can pipe foo.txt through the gawk program and read it using read.table.
A variation of this is that we could omit the /^>>>/ {next}; portion of the gawk program, which skips over the >>> lines at the beginning, and use comment.char = ">" in the read.table call instead.
Here is one way to do it:
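A minimal sketch of one such way, not necessarily the code this answer originally showed. It assumes the end marker is a line starting with ">>> end of file" and that every other line starting with ">" can be treated as a comment; read_until_marker and its marker argument are hypothetical names introduced here.

```r
# Read the table that sits between the ">>>" header lines and the
# ">>> end of file" marker, however many data rows there are.
read_until_marker <- function(file, marker = "^>>> end of file") {
  L <- readLines(file)

  # Line number of the first end marker; if there is none, keep all lines.
  end <- grep(marker, L)[1]
  if (is.na(end)) end <- length(L) + 1

  # Parse only the lines above the marker. comment.char = ">" makes
  # read.table ignore the ">>> header" lines, so no skip count is needed.
  read.table(text = L[seq_len(end - 1)], header = TRUE, comment.char = ">")
}
```

It would be called as read_until_marker("foo.txt"). Parsing the already-read lines through the text argument avoids reading the file a second time, at the cost of holding the whole file in memory.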
Which gives:
On the data snippet you provide (in file foo.txt, and after removing the ... lines).
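The gawk route described in 2) can also be driven from R through pipe(). The sketch below is an illustration under assumptions: gawk or awk is on the PATH, the marker line starts with ">>> end of file", and a temporary file stands in for foo.txt. The program text is reconstructed from the description in 2), not copied from it.

```r
# Recreate the question's snippet (without the "..." lines) in a temp file
# standing in for foo.txt.
foo <- tempfile(fileext = ".txt")
writeLines(c(
  ">>> header", ">>> header", ">>> header",
  "K L M",
  "200 0.1 1", "201 0.8 1", "202 0.01 3", "800 0.4 2",
  ">>> end of file",
  "50 0.1 1", "75 0.78 5"
), foo)

# Prefer gawk, fall back to awk; the program below uses only POSIX awk features.
awk <- if (nzchar(Sys.which("gawk"))) "gawk" else "awk"

# Exit at the end marker, skip any other ">>>" line, print everything else.
prog <- "/^>>> end of file/ {exit}; /^>>>/ {next}; 1"

# shQuote() protects the program text and the path from the shell.
cmd <- paste(awk, shQuote(prog), shQuote(foo))

# read.table opens the pipe, reads the filtered lines, and closes it again.
DF <- read.table(pipe(cmd), header = TRUE)
```

Following the variation mentioned in 2), the /^>>>/ {next}; part could be dropped from prog and comment.char = ">" passed to read.table instead.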