Importing only every Nth row from a .csv file in R

Posted 2019-01-17 23:24

Question:

just a quick question. Is there a way to use read.csv to import every Nth row from a large file:

Example, a 50-60 million line file where you only need every 4th row starting at row 2.

I thought about maybe incorporating the 'seq' function, but I am not sure if that is possible.

Any suggestions?

Answer 1:

For a large data file, the best option is to filter out the unnecessary rows before they get imported into R. The simplest way to do this is by means of OS commands such as sed, awk, or grep. For example, the following code reads every 4th line from the file:

write.csv(1:1000, file='test.csv')

file.pipe <- pipe("awk 'BEGIN{i=0}{i++;if (i%4==0) print $1}' < test.csv ")
res <- read.csv(file.pipe)
res

> res
     X3 X3.1
1     7    7
2    11   11
3    15   15
4    19   19
5    23   23
6    27   27
7    31   31
8    35   35
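One caveat with the modulo trick above: the first line awk emits (line 4 of the file) is what read.csv then treats as the header, which is why the columns come out as X3 and X3.1. A sketch of a variant that keeps the real header and starts at row 2, as the question asked (the demo file here is made up):

```shell
# Tiny stand-in for the real CSV: a header line plus data rows 1..8
{ echo "x"; seq 1 8; } > test.csv

# Keep line 1 (the header) plus every 4th line starting at line 2
awk 'NR == 1 || NR % 4 == 2' test.csv > filtered.csv

cat filtered.csv
# prints: x, 1, 5 (each on its own line)
```

In R this can again be read through a pipe: read.csv(pipe("awk 'NR == 1 || NR % 4 == 2' test.csv")).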


Answer 2:

read.csv("filename.csv")[c(FALSE, TRUE, FALSE, FALSE), ]

will do the trick.

This works since the logical vector is recycled until it matches the number of rows of the data frame returned by read.csv.
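The same selection can also be written with seq(), as the question anticipated. A quick sketch on a toy data frame standing in for the result of read.csv (the data here is made up):

```r
# Toy stand-in for read.csv("filename.csv")
df <- data.frame(x = 1:12)

# Logical recycling: the 4-element pattern repeats down all 12 rows
a <- df[c(FALSE, TRUE, FALSE, FALSE), ]

# Equivalent explicit indexing: rows 2, 6, 10
b <- df[seq(2, nrow(df), by = 4), ]

identical(a, b)
# [1] TRUE
```

The seq() form has the advantage that it still reads correctly if the step size later changes.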



Answer 3:

As @df239 suggested, it's much better to filter the rows beforehand using a command-line tool.

Here's a simpler version using sed:

df <- read.csv(pipe("sed -n '2~4p' test.csv")) 

The 2~4p address (the first~step form, a GNU sed extension) tells sed to print every 4th line, starting at line 2.



Answer 4:

Sven gave a great answer for moderately sized files. But if the reason you are doing this is that the entire file does not fit into memory, then you need to take a different approach.

It may be simplest to use an external tool like Perl or AWK to preprocess the file so that it contains only the lines that you want; you can use pipe() to read from the output of another program, so you do not have to create an intermediate file.

Another approach would be to transfer the file to a database, then select just the rows that you want from the database.

You can also loop through the file. If you explicitly open the file, then you can read a few rows at a time, keep just the ones that you want, then read the next chunk starting where you left off. The skip and nrows arguments to read.csv would be helpful here.
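The chunked loop can be sketched with readLines() plus read.csv(text = ...). Everything here (the file name, the chunk size, the every-4th-row-from-row-2 rule) is illustrative:

```r
# Demo input: 20 data rows standing in for a 50-60 million row file
write.csv(data.frame(x = 1:20), "big.csv", row.names = FALSE)

con <- file("big.csv", open = "r")
header <- readLines(con, n = 1)        # consume the header line once
chunk_size <- 7                        # deliberately small for the demo
offset <- 0
pieces <- list()
repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break
  idx <- offset + seq_along(lines)     # global data-row numbers
  wanted <- lines[idx %% 4 == 2]       # every 4th row, starting at row 2
  if (length(wanted) > 0) {
    pieces[[length(pieces) + 1]] <-
      read.csv(text = paste(c(header, wanted), collapse = "\n"))
  }
  offset <- offset + length(lines)
}
close(con)

res <- do.call(rbind, pieces)
res$x
# [1]  2  6 10 14 18
```

Because each chunk is parsed with the saved header line, column names and types stay consistent across chunks.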



Answer 5:

While the sed and awk solutions are great, it might be nice to do this within R itself (say, on Windows machines, or to avoid GNU sed vs. BSD sed differences). Using readr::read_*_chunked from the tidyverse with a callback that samples every nth row works rather well:

library(readr)   # read_tsv_chunked, DataFrameCallback
library(dplyr)   # %>%, bind_rows

read_tsv_sample <- function(fn, nth, ...) {
  # Keep every nth row of each chunk, starting at the chunk's first row
  sample_df_cb <- function(df, idx) {
    df[seq(1, nrow(df), nth), ]
  }

  read_tsv_chunked(fn,
                   ...,
                   chunk_size = 10000,
                   callback = DataFrameCallback$new(sample_df_cb)
  ) %>%
    bind_rows()
}

For example...

iris %>% write_tsv("iris.tsv")

iris %>% dim
#> [1] 150   5

"iris.tsv" %>%
    read_tsv_sample(10,
                    col_types = cols(.default = col_double())
                    ) %>%
    dim
#> [1] 15  5


Tags: r csv text import