An algorithm for filtering text files

Posted 2019-03-13 03:38

Question:

Imagine you have a .txt file of the following structure:

>>> header
>>> header
>>> header
K L M
200 0.1 1
201 0.8 1
202 0.01 3
...
800 0.4 2
>>> end of file
50 0.1 1
75 0.78 5
...

I would like to read all the data except lines denoted by >>> and lines below the >>> end of file line. So far I've solved this using read.table(comment.char = ">", skip = x, nrow = y) (x and y are currently fixed). This reads the data between the header and >>> end of file.

However, I would like to make my function more flexible regarding the number of rows. The data may contain values larger than 800, and consequently more rows.

I could scan or readLines the file and see which row corresponds to the >>> end of file and calculate the number of lines to be read. What approach would you use?

Answer 1:

Here is one way to do it:

Lines <- readLines("foo.txt")
markers <- grepl(">", Lines)
want <- rle(markers)$lengths[1:2]
want <- seq.int(want[1] + 1, sum(want), by = 1)
read.table(textConnection(Lines[want]), sep = " ", header = TRUE)

Which gives:

> read.table(textConnection(Lines[want]), sep = " ", header = TRUE)
    K    L M
1 200 0.10 1
2 201 0.80 1
3 202 0.01 3
4 800 0.40 2

This is the result for the data snippet you provided (saved in the file foo.txt, with the ... lines removed).
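To see why the first two run lengths are enough, it may help to look at what rle returns for the sample file. The sketch below builds the marker vector by hand instead of reading foo.txt (the names runs and idx are only illustrative). It assumes the file starts with at least one ">>>" line; if it did not, the first run would already be data and the indexing would be off.

```r
# Marker vector for the sample file: TRUE where a line contains ">".
markers <- c(TRUE, TRUE, TRUE,                   # three ">>> header" lines
             FALSE, FALSE, FALSE, FALSE, FALSE,  # "K L M" plus four data rows
             TRUE,                               # ">>> end of file"
             FALSE, FALSE)                       # trailing rows to ignore

runs <- rle(markers)
runs$lengths  # 3 5 1 2
runs$values   # TRUE FALSE TRUE FALSE

# First run = leading marker lines, second run = the block we want to keep.
want <- runs$lengths[1:2]
idx <- seq.int(want[1] + 1, sum(want), by = 1)
idx           # 4 5 6 7 8
```

Lines 4 to 8 are the column header plus the four data rows, which is exactly what is passed to read.table above.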



Answer 2:

Here are a couple of ways.

1) readLines reads the lines of the file into L. skip is then set to the number of lines to skip at the beginning, and end.of.file to the line number of the row marking the end of the data. The read.table command uses these two variables to re-read the data.

File <- "foo.txt"

L <- readLines(File)
skip <- grep("^.{0,2}[^>]", L)[1] - 1
end.of.file <- grep("^>>> end of file", L)

read.table(File, header = TRUE, skip = skip, nrow = end.of.file - skip - 2)

A variation would be to use textConnection in place of File in the read.table line:

read.table(textConnection(L), header = TRUE, 
   skip = skip, nrow = end.of.file - skip - 2)
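Since the question asks for a function that adapts to the number of rows, approach 1 could be wrapped roughly as follows. This is only a sketch: read_between_markers and its end_marker argument are hypothetical names, not part of the original answer, and the temporary file simply mirrors the structure shown in the question. It uses read.table(text = ...) rather than an explicit textConnection so that the connection is closed automatically.

```r
# Hypothetical helper: computes skip and the row count per file
# instead of hard-coding them.
read_between_markers <- function(file, end_marker = "^>>> end of file") {
  L <- readLines(file)
  # Index of the first line that does not begin with ">>>", minus one
  skip <- grep("^.{0,2}[^>]", L)[1] - 1
  # Line number of the end marker (first match if there are several)
  end.of.file <- grep(end_marker, L)[1]
  if (is.na(skip) || is.na(end.of.file)) {
    stop("could not locate the data block in ", file)
  }
  # Between the two: one column-header line plus the data rows
  read.table(text = L, header = TRUE,
             skip = skip, nrows = end.of.file - skip - 2)
}

# Sample file with the structure from the question ("..." lines omitted)
tmp <- tempfile(fileext = ".txt")
writeLines(c(">>> header", ">>> header", ">>> header",
             "K L M",
             "200 0.1 1", "201 0.8 1", "202 0.01 3", "800 0.4 2",
             ">>> end of file",
             "50 0.1 1", "75 0.78 5"), tmp)

dat <- read_between_markers(tmp)
dat
```

Because both positions are found with grep, the same call should keep working when the data block grows beyond the value 800, as long as the end marker line is present.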

2) Another possibility is to use sed or awk/gawk. Consider this one-line gawk program. The program exits if it sees the line marking the end of the data; otherwise, it skips the current line if that line starts with >>>, and if neither condition holds it prints the line. We can pipe foo.txt through the gawk program and read the result using read.table.

cat("/^>>> end of file/ { exit }; /^>>>/ { next }; 1\n", file = "foo.awk")
read.table(pipe('gawk -f foo.awk foo.txt'), header = TRUE)

A variation of this is to omit the /^>>>/ { next }; portion of the gawk program, which skips over the >>> lines at the beginning, and use comment.char = ">" in the read.table call instead.
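That variation might look like the sketch below. It assumes gawk is installed and on the PATH (the call is skipped otherwise), and it writes the sample data and the gawk program to a temporary directory so the example is self-contained; work_dir, txt_file and awk_file are illustrative names.

```r
# gawk only stops at the end marker; read.table drops the ">>>" header
# lines itself because ">" is treated as the comment character.
work_dir <- tempdir()
txt_file <- file.path(work_dir, "foo.txt")
awk_file <- file.path(work_dir, "foo.awk")

writeLines(c(">>> header", ">>> header", ">>> header",
             "K L M",
             "200 0.1 1", "201 0.8 1", "202 0.01 3", "800 0.4 2",
             ">>> end of file",
             "50 0.1 1", "75 0.78 5"), txt_file)
cat("/^>>> end of file/ { exit }; 1\n", file = awk_file)

dat <- NULL
if (nzchar(Sys.which("gawk"))) {  # run only when gawk is available
  cmd <- paste("gawk -f", shQuote(awk_file), shQuote(txt_file))
  dat <- read.table(pipe(cmd), header = TRUE, comment.char = ">")
}
dat
```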



Tags: r import