Just a quick question: is there a way to use read.csv to import every Nth row from a large file?
For example, a 50-60 million line file where you only need every 4th row, starting at row 2.
I thought about maybe incorporating the seq function, but I am not sure whether that is possible.
Any suggestions?
For a large data file, the best option is to filter out the unnecessary rows before they get imported into R. The simplest way to do this is by means of OS commands such as sed, awk, or grep. The following example reads every 4th line from the file:
write.csv(1:1000, file = 'test.csv')  # create a small sample file
# awk prints every 4th line of the file; read.csv then parses awk's output
file.pipe <- pipe("awk 'BEGIN{i=0}{i++; if (i%4==0) print $1}' < test.csv")
res <- read.csv(file.pipe)
res
> res
X3 X3.1
1 7 7
2 11 11
3 15 15
4 19 19
5 23 23
6 27 27
7 31 31
8 35 35
read.csv("filename.csv")[c(FALSE, TRUE, FALSE, FALSE), ]
will do the trick.
This works since the logical vector is recycled until it matches the number of rows of the data frame returned by read.csv
.
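If you prefer the seq approach mentioned in the question, an equivalent sketch (note that this, too, reads the entire file into memory before dropping rows):

df <- read.csv("filename.csv")
df[seq(2, nrow(df), by = 4), ]  # every 4th row, starting at row 2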
As @df239 suggested, it's much better to filter the rows beforehand using a command-line tool. Here's a simpler version using sed:
df <- read.csv(pipe("sed -n '2~4p' test.csv"))
The 2~4p tells sed to get every 4th line, starting at line 2.
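One caveat not in the original answer: 2~4p is a GNU sed extension (BSD sed does not support it), and it skips line 1, so a header line is lost and read.csv will treat the first sampled row as the header, just as in the awk output above. A sketch that also keeps the header line, assuming GNU sed and a hypothetical data.csv:

df <- read.csv(pipe("sed -n '1p; 2~4p' data.csv"))  # header line plus every 4th line from line 2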
Sven gave a great answer for moderately sized files. But if the reason you are doing this is that the entire file does not fit into memory, you need to take a different approach.
It may be simplest to use an external tool like Perl or AWK to preprocess the file so that it contains only the lines that you want; you can use pipe to read from the output of another program, so you do not have to create an intermediate file.
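For instance, a sketch with a Perl one-liner (big.csv is a placeholder name): $. is Perl's current line number, so this keeps lines 2, 6, 10, and so on. header = FALSE because line 1 of the file is never among the sampled lines:

res <- read.csv(pipe("perl -ne 'print if $. % 4 == 2' big.csv"), header = FALSE)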
Another approach would be to transfer the file to a database, then select just the rows that you want from the database.
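As a sketch of the database route, the sqldf package can stream the file through SQLite without loading it all into R first; SQLite numbers the imported rows with rowid starting at 1, so every 4th row starting at row 2 satisfies rowid % 4 = 2 (big.csv is again a placeholder):

library(sqldf)
res <- read.csv.sql("big.csv", sql = "select * from file where rowid % 4 = 2")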
You can also loop through the file. If you explicitly open the file connection, you can read a few rows at a time, keep just the ones that you want, then read the next chunk starting where you left off. The skip and nrows arguments to read.csv are helpful here.
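A minimal sketch of that loop, assuming a well-formed CSV with a header row (read_every_nth is a made-up helper, not an existing function):

# Read every nth data row, starting at row `start`, without ever holding
# the full file in memory.
read_every_nth <- function(path, n = 4, start = 2, chunk = 10000) {
  header <- names(read.csv(path, nrows = 1))   # column names from the header
  con <- file(path, open = "r")
  on.exit(close(con))
  invisible(readLines(con, n = 1))             # skip past the header line
  pieces <- list()
  rows_read <- 0
  repeat {
    block <- tryCatch(
      read.csv(con, header = FALSE, nrows = chunk, col.names = header),
      error = function(e) NULL                 # read.csv errors once input is exhausted
    )
    if (is.null(block) || nrow(block) == 0) break
    i <- rows_read + seq_len(nrow(block))      # global data-row indices
    keep <- i >= start & (i - start) %% n == 0
    pieces[[length(pieces) + 1]] <- block[keep, , drop = FALSE]
    rows_read <- rows_read + nrow(block)
  }
  do.call(rbind, pieces)
}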
While the sed and awk solutions are great, it might be nice to do this within R itself (say, on Windows machines, or to avoid GNU sed vs BSD sed differences). Using readr::read_*_chunked from the tidyverse with a callback that samples every nth row works rather well:
library(readr)
library(dplyr)

read_tsv_sample <- function(fn, nth, ...) {
  # keep every nth row within each chunk (chunk_size should be a
  # multiple of nth so the spacing stays even across chunk boundaries)
  sample_df_cb <- function(df, idx) {
    df[seq(1, nrow(df), nth), ]
  }
  read_tsv_chunked(fn,
                   ...,
                   chunk_size = 10000,
                   callback = DataFrameCallback$new(sample_df_cb)) %>%
    bind_rows()
}
For example...
iris %>% write_tsv("iris.tsv")
iris %>% dim
#> [1] 150 5
"iris.tsv" %>%
read_tsv_sample(10,
col_types = cols(.default = col_double())
) %>%
dim
#> [1] 15 5