Imagine you have a .txt file with the following structure:
>>> header
>>> header
>>> header
K L M
200 0.1 1
201 0.8 1
202 0.01 3
...
800 0.4 2
>>> end of file
50 0.1 1
75 0.78 5
...
I would like to read all the data except the lines marked with >>> and everything below the >>> end of file line.
So far I've solved this using read.table(comment.char = ">", skip = x, nrow = y) (x and y are currently fixed). This reads the data between the header and the >>> end of file line.
However, I would like to make my function more flexible regarding the number of rows. The data may contain values larger than 800, and consequently more rows.
I could scan or readLines the file, find which row corresponds to the >>> end of file line, and calculate the number of lines to read. What approach would you use?
Here is one way to do it:
Lines <- readLines("foo.txt")
markers <- grepl(">", Lines)
want <- rle(markers)$lengths[1:2]
want <- seq.int(want[1] + 1, sum(want), by = 1)
read.table(textConnection(Lines[want]), sep = " ", header = TRUE)
Which gives:
> read.table(textConnection(Lines[want]), sep = " ", header = TRUE)
K L M
1 200 0.10 1
2 201 0.80 1
3 202 0.01 3
4 800 0.40 2
That is the result for the data snippet you provided (saved in the file foo.txt, after removing the ... lines).
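To make the rle() step less opaque, here is a minimal sketch that runs the same idea on an in-memory copy of the question's snippet instead of foo.txt. It assumes the input starts with at least one >>> line, so the first run is the header and the second run is the data; the names runs, first and last are introduced only for illustration, and read.table(text = ...) stands in for the textConnection() call.

```r
# Sample lines mirroring the question's file (the "..." rows are left out).
Lines <- c(">>> header", ">>> header", ">>> header",
           "K L M",
           "200 0.1 1", "201 0.8 1", "202 0.01 3", "800 0.4 2",
           ">>> end of file",
           "50 0.1 1", "75 0.78 5")

markers <- grepl("^>>>", Lines)   # TRUE for every marker line
runs <- rle(markers)              # lengths of consecutive TRUE / FALSE runs

# Assumption: run 1 is the header block and run 2 is the data block.
first <- runs$lengths[1] + 1      # first line after the header run
last <- sum(runs$lengths[1:2])    # last line before the end marker

dat <- read.table(text = Lines[first:last], header = TRUE)
print(dat)
```

Because the row range is derived from the run lengths, nothing is hard-coded: a longer header or more data rows only changes the values of first and last.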
Here are a couple of ways.
1) readLines reads the lines of the file into L. skip is then set to the number of lines to skip at the beginning, and end.of.file to the line number of the row marking the end of the data. The read.table command then uses these two variables to re-read the data.
File <- "foo.txt"
L <- readLines(File)
# first line whose first three characters are not all ">" is the column header
skip <- grep("^.{0,2}[^>]", L)[1] - 1
end.of.file <- grep("^>>> end of file", L)
read.table(File, header = TRUE, skip = skip, nrow = end.of.file - skip - 2)
A variation would be to use textConnection(L) in place of File in the read.table line:
read.table(textConnection(L), header = TRUE,
skip = skip, nrow = end.of.file - skip - 2)
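Since the question asks for a function that adapts to the number of rows, the readLines approach could be wrapped up roughly as follows. This is only a sketch: the function name read_between_markers and its end_marker argument are made up for illustration, and instead of computing skip and nrow it simply filters the lines before parsing them. If no end marker is found, it assumes the whole file should be read.

```r
# Hypothetical helper: read the table that sits between the ">>>" header
# lines and the ">>> end of file" line, however many rows it has.
read_between_markers <- function(file, end_marker = "^>>> end of file") {
  L <- readLines(file)
  end <- grep(end_marker, L)[1]            # NA when the marker is absent
  if (is.na(end)) end <- length(L) + 1     # assumption: then read everything
  body <- L[seq_len(end - 1)]              # keep lines above the end marker
  body <- body[!grepl("^>>>", body)]       # drop the remaining marker lines
  read.table(text = body, header = TRUE)
}

# Usage on a temporary copy of the question's snippet.
tmp <- tempfile(fileext = ".txt")
writeLines(c(">>> header", ">>> header", ">>> header",
             "K L M",
             "200 0.1 1", "201 0.8 1", "202 0.01 3", "800 0.4 2",
             ">>> end of file",
             "50 0.1 1", "75 0.78 5"), tmp)
dat <- read_between_markers(tmp)
print(dat)
```

The file is read only once here, at the cost of holding all of its lines in memory.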
2) Another possibility is to use sed or awk/gawk. Consider this one-line gawk program. It exits if it sees the line marking the end of the data; otherwise, it skips the current line if that line starts with >>>, and if neither of those happens it prints the line. We can pipe foo.txt through the gawk program and read the result using read.table.
cat("/^>>> end of file/ { exit }; /^>>>/ { next }; 1\n", file = "foo.awk")
read.table(pipe('gawk -f foo.awk foo.txt'), header = TRUE)
A variation of this is to omit the /^>>>/ { next }; portion of the gawk program, which skips over the >>> lines at the beginning, and use comment.char = ">" in the read.table call instead.
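A sketch of that variation, assuming gawk is installed and on the PATH (the code does nothing otherwise). It writes the question's snippet and the shortened program to temporary files rather than foo.txt and foo.awk, purely so that it can be run on its own.

```r
# Temporary stand-ins for foo.txt and foo.awk.
data_file <- tempfile(fileext = ".txt")
awk_file <- tempfile(fileext = ".awk")
writeLines(c(">>> header", ">>> header", ">>> header",
             "K L M",
             "200 0.1 1", "201 0.8 1", "202 0.01 3", "800 0.4 2",
             ">>> end of file",
             "50 0.1 1", "75 0.78 5"), data_file)

# Shortened program: stop at the end marker, print every other line.
writeLines("/^>>> end of file/ { exit }; 1", awk_file)

dat <- NULL
if (nzchar(Sys.which("gawk"))) {   # assumption: gawk is available
  cmd <- paste("gawk -f", shQuote(awk_file), shQuote(data_file))
  # The leading >>> lines reach read.table and are dropped as comments.
  dat <- read.table(pipe(cmd), header = TRUE, comment.char = ">")
  print(dat)
}
```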