I am using read.csv.sql
to conditionally read in data (my data set is extremely large so this was the solution I chose to filter it and reduce it in size prior to reading the data in). I was running into memory issues by reading in the full data and then filtering it so that is why it is important that I use the conditional read so that the subset is read in, versus the full data set.
Here is a small data set so my problem can be reproduced:
write.csv(iris, "iris.csv", row.names = F)
I am finding that the notation you have to use is extremely awkward using read.csv.sql
the following is the first way I tried reading in the file and it works but it is messy:
library(sqldf)
csvFile <- "iris.csv"
spec <- 'setosa'
sql <- paste0("select * from file where Species = '\"", spec,"\"'")
d1 <- read.csv.sql(file = csvFile, sql = sql)
I then found another way of writing the same notation in a slightly cleaner manner is:
sql <- paste0("select * from file where Species = '", spec,"'")
d2 <- read.csv.sql(file = csvFile, sql = sql,
filter = list('gawk -f prog', prog = '{ gsub(/"/, ""); print }'))
Next, I wanted to read in a case where I select multiple values from the same column, so I tried this and it works:
d3 <- read.csv.sql(file = csvFile,
sql = "select * from file where Species in
('\"setosa\"', '\"versicolor\"') ")
However, I want to avoid hard coding the values like that so I tried:
spec2 <- c('setosa', 'versicolor')
sql2 <- paste0("select * from file where Species in '", spec2,"'")
d4 <- read.csv.sql(file = csvFile, sql = sql2,
filter = list('gawk -f prog', prog = '{ gsub(/"/, ""); print }'))
But this does not work (it seems to only read the first value from the vector and tries to match it as a table). I'm sure this is another notation issue again and would like help to clear up this chunk of code.
Also, if you have any tips/tricks on using read.csv.sql
and dealing with the notation issues, I would like to hear them!
The problem is that sqldf provides text preprocessing faciliities but the code shown in the question does not use them making it overly complex.
1) Regarding text substitution, use fn$
(from gsubfn which sqldf automatically loads) as discussed on the github page for sqldf. Assuming that we used quote = FALSE in the write.csv since sqlite does not handle quotes natively:
spec <- 'setosa'
out <- fn$read.csv.sql("iris.csv", "select * from file where Species = '$spec' ")
spec <- c("setosa", "versicolor")
string <- toString(sprintf("'%s'", spec)) # add quotes and make comma-separated
out <- fn$read.csv.sql("iris.csv", "select * from file where Species in ($string) ")
2) Regarding deleting double quotes, a simpler way would be to use the following filter=
argument:
read.csv.sql("iris.csv", filter = "tr -d \\042") # Windows
or
read.csv.sql("iris.csv", filter = "tr -d \\\\042") # Linux / bash
depending on your shell. The first one worked for me on Windows (with Rtools installed and on the PATH) and the second worked for me on Linux with bash. It is possible that other variations could be needed for other shells.
2a) Another possibility for removing quotes is to install the free csvfix utility (available on Windows, Linux and Mac) on your system and then use the following filter=
argument which should work in all shells since it does not involve any characters that are typically interpreted specially by either R or most shells. Thus the following should work on all platforms.
read.csv.sql("iris.csv", filter = "csvfix echo -smq")
2b) Another cross platform utility that could be used is xsv. The eol=
argument is only needed on Windows since xsv
produces UNIX style line endings but won't hurt on other platforms so the following line should work on all platforms.
read.csv.sql("iris.csv", eol = "\n", filter = "xsv fmt")
2c) sqldf also includes an awk program (csv.awk) that can be used. It outputs UNIX style newlines so specify eol = "\n" on Windows. On other platforms it won't hurt if you specify it but you can omit it if you wish since that is the default on those platforms.
csv.awk <- system.file("csv.awk", package = "sqldf")
rm_quotes_cmd <- sprintf('gawk -f "%s"', csv.awk)
read.csv.sql("iris.csv", eol = "\n", filter = rm_quotes_cmd)
3) Regarding general tips, note that the verbose=TRUE
argument to read.csv.sql
can be useful to see what it is going on.
read.csv.sql("iris.csv", verbose = TRUE)