An algorithm for filtering text files

Imagine you have a .txt file of the following structure:

>>> header
>>> header
>>> header
K L M
200 0.1 1
201 0.8 1
202 0.01 3
...
800 0.4 2
>>> end of file
50 0.1 1
75 0.78 5
...

I would like to read all the data except the lines marked with >>> and everything below the >>> end of file line. So far I've solved this using read.table(comment.char = ">", skip = x, nrows = y), where x and y are currently hard-coded. This reads the data between the header and >>> end of file.

However, I would like to make my function more flexible regarding the number of rows. The data may contain values larger than 800, and consequently more rows.

I could scan or readLines the file and see which row corresponds to the >>> end of file and calculate the number of lines to be read. What approach would you use?

Tags: r import

2 answers

Fickle 薄情

Post #2 · 2019-03-13 03:43

Here are a couple of ways.

1) readLines reads the lines of the file into L. skip is then set to the number of lines to skip at the beginning, and end.of.file to the line number of the row marking the end of the data. The read.table command uses these two variables to re-read the data.

File <- "foo.txt"

L <- readLines(File)
skip <- grep("^.{0,2}[^>]", L)[1] - 1
end.of.file <- grep("^>>> end of file", L)

read.table(File, header = TRUE, skip = skip, nrows = end.of.file - skip - 2)

A variation would be to use textConnection in place of File in the read.table line:

read.table(textConnection(L), header = TRUE, 
   skip = skip, nrows = end.of.file - skip - 2)
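As a rough sketch, approach 1 could be wrapped in a small function so that nothing is hard-coded. The function name read_between_markers and its end_pattern argument are made up for illustration, and the demo file is just the snippet from the question with the ... lines removed.

```r
# Hypothetical helper; the name and arguments are illustrative only.
read_between_markers <- function(path, end_pattern = "^>>> end of file") {
  L <- readLines(path)
  # Number of leading lines to skip: everything before the first
  # line that has a non-">" character among its first three characters
  skip <- grep("^.{0,2}[^>]", L)[1] - 1
  # Line number of the (first) end-of-data marker
  end_line <- grep(end_pattern, L)[1]
  # Data rows sit between the column-name line and the end marker
  read.table(path, header = TRUE, skip = skip,
             nrows = end_line - skip - 2)
}

# Demo on a temporary file shaped like the example in the question
tmp <- tempfile(fileext = ".txt")
writeLines(c(">>> header", ">>> header", ">>> header", "K L M",
             "200 0.1 1", "201 0.8 1", "202 0.01 3", "800 0.4 2",
             ">>> end of file", "50 0.1 1", "75 0.78 5"), tmp)
dat <- read_between_markers(tmp)
print(dat)
```

This assumes the end marker is present and that the first non->>> line holds the column names; a file without the marker would need extra handling.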

2) Another possibility is to use sed or awk/gawk. Consider this one-line gawk program. The program exits if it sees the line marking the end of the data; otherwise, it skips the current line if that line starts with >>>, and if neither of those happens it prints the line. We can pipe foo.txt through the gawk program and read the result using read.table.

cat("/^>>> end of file/ { exit }; /^>>>/ { next }; 1\n", file = "foo.awk")
read.table(pipe('gawk -f foo.awk foo.txt'), header = TRUE)

A variation of this is to omit the /^>>>/ { next }; portion of the gawk program, which skips over the >>> lines at the beginning, and use comment.char = ">" in the read.table call instead.
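If gawk is not available, the same exit/next/print logic can be mimicked in plain R. This is only a sketch of the idea, not part of the answer above; filter_lines is a made-up name, and the sample lines are the question's snippet without the ... lines.

```r
# Hypothetical pure-R counterpart of the one-line gawk filter.
filter_lines <- function(lines) {
  end_line <- grep("^>>> end of file", lines)
  # "exit": keep only the lines before the end marker, if there is one
  if (length(end_line) > 0) lines <- lines[seq_len(end_line[1] - 1)]
  # "next": drop the remaining lines that start with ">>>"
  lines[!grepl("^>>>", lines)]
}

sample_lines <- c(">>> header", ">>> header", ">>> header", "K L M",
                  "200 0.1 1", "201 0.8 1", "202 0.01 3", "800 0.4 2",
                  ">>> end of file", "50 0.1 1", "75 0.78 5")

kept <- filter_lines(sample_lines)
dat <- read.table(text = kept, header = TRUE)
print(dat)
```

With a real file, sample_lines would be replaced by readLines("foo.txt").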


对你真心纯属浪费

Post #3 · 2019-03-13 03:45

Here is one way to do it:

Lines <- readLines("foo.txt")
markers <- grepl(">", Lines)
want <- rle(markers)$lengths[1:2]
want <- seq.int(want[1] + 1, sum(want), by = 1)
read.table(textConnection(Lines[want]), sep = " ", header = TRUE)

Which gives:

> read.table(textConnection(Lines[want]), sep = " ", header = TRUE)
    K    L M
1 200 0.10 1
2 201 0.80 1
3 202 0.01 3
4 800 0.40 2

This is for the data snippet you provided (saved in foo.txt, after removing the ... lines).
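To see why taking the first two runs works, it may help to look at what rle returns for the marker vector. The sketch below uses an in-memory copy of the question's snippet (without the ... lines) instead of foo.txt; the variable name runs is just for illustration.

```r
Lines <- c(">>> header", ">>> header", ">>> header", "K L M",
           "200 0.1 1", "201 0.8 1", "202 0.01 3", "800 0.4 2",
           ">>> end of file", "50 0.1 1", "75 0.78 5")

markers <- grepl(">", Lines)
runs <- rle(markers)
runs$values   # TRUE FALSE TRUE FALSE
runs$lengths  # 3 5 1 2

# First run: the 3 header lines. Second run: the column names plus 4 data rows.
want <- seq.int(runs$lengths[1] + 1, sum(runs$lengths[1:2]), by = 1)
want          # 4 5 6 7 8
Lines[want]
```

Note that this relies on the file starting with at least one >>> line; if the data started on line 1, the first run would be the data itself and the indices would be off.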
