Quotation issues reading data into R

Posted 2019-06-21 22:21

Question:

I have some data that I am trying to load into R. It is in .csv files, and I can view the data in both Excel and OpenOffice. (If you are curious, it is the 2011 poll results data from Elections Canada.)

The data is coded in an unusual manner. A typical line is:

12002,Central Nova","Nova-Centre"," 1","River John",N,N,"",1,299,"Chisholm","","Matthew","Green Party","Parti Vert",N,N,11

There is a " at the end of Central Nova but not at the beginning. So, in order to read in the data, I suppressed the quotes, which worked fine for the first few files, i.e.:

test<-read.csv("pollresults_resultatsbureau11001.csv",header = TRUE,sep=",",fileEncoding="latin1",as.is=TRUE,quote="")

Now here is the problem: in another file (eg. pollresults_resultatsbureau12002.csv), there is a line of data like this:

12002,Central Nova","Nova-Centre"," 6-1","Pictou, Subd. A",N,N,"",0,168,"Parker","","David K.","NDP-New Democratic Party","NPD-Nouveau Parti democratique",N,N,28

Because I need to suppress the quotes, the comma inside the entry "Pictou, Subd. A" makes R want to split it into 2 variables. The data can't be read in, since R would have to add a column halfway through constructing the data frame.
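To make the problem concrete, here is a minimal sketch that counts the fields R sees in the two sample lines above when quoting is switched off. The variable names (good, bad, n_fields) are only illustrative; count.fields() is part of base R's utils package.

```r
# The two sample lines from above, held in single-quoted R strings so the
# embedded double quotes need no escaping
good <- '12002,Central Nova","Nova-Centre"," 1","River John",N,N,"",1,299,"Chisholm","","Matthew","Green Party","Parti Vert",N,N,11'
bad  <- '12002,Central Nova","Nova-Centre"," 6-1","Pictou, Subd. A",N,N,"",0,168,"Parker","","David K.","NDP-New Democratic Party","NPD-Nouveau Parti democratique",N,N,28'

# Count comma-separated fields per line with quote handling disabled,
# which mirrors read.csv(..., quote = "")
con <- textConnection(c(good, bad))
n_fields <- count.fields(con, sep = ",", quote = "")
close(con)

print(n_fields)
```

The second line yields one more field than the first, because the comma inside "Pictou, Subd. A" is treated as a separator. That mismatch is what stops read.csv() from building the data frame.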

Excel and OpenOffice both can open these files no problem. Somehow, Excel and OpenOffice know that quotation marks only matter if they are at the beginning of a variable entry.

Do you know what option I need to set in R to get this data in? I have >300 files that I need to load (~1000 rows each), so a manual fix is not an option...

I have looked all over the place for a solution but can't find one.

Answer 1:

Building on my comments, here is a solution that would read all the CSV files into a single list.

# Deal with French properly
options(encoding="latin1")

# Set your working directory to where you have
#   unzipped all of your 308 CSV files
setwd("path/to/unzipped/files")

# Get the names of the poll result files only
temp <- list.files(pattern = "^pollresults_resultatsbureau.*\\.csv$")

# Extract the 5-digit code which we can use as names
Codes <- gsub("^pollresults_resultatsbureau|\\.csv$", "", temp)

# Read all the files into a single list named "pollResults"
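pollResults <- lapply(temp, function(f) {
  T0 <- readLines(f)
  # Insert the missing opening quote after the first 6 characters
  # (the 5-digit code plus its comma); the header line is left untouched
  T0[-1] <- gsub('^(.{6})(.*)$', '\\1\\"\\2', T0[-1])
  read.csv(text = T0, header = TRUE)
})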
names(pollResults) <- Codes

You can easily work with this list in different ways. If you wanted to just see the 90th data.frame you can access it by using pollResults[[90]] or by using pollResults[["24058"]] (in other words, either by index number or by district number).

Having the data in this format means you can also do a lot of other convenient things. For instance, if you wanted to fix all 308 of the CSVs in one go, you can use the following code, which will create new CSVs with the file name prefixed with "Corrected_".

invisible(lapply(seq_along(pollResults), function(x) {
  NewFilename <- paste("Corrected", temp[x], sep = "_")
  write.csv(pollResults[[x]], file = NewFilename, 
            quote = TRUE, row.names = FALSE)
}))

Hope this helps!



Answer 2:

This answer is mainly addressed to @AnandaMahto (see the comments on the original question).

First, it helps to set some options globally because of the French accents in the data:

options(encoding="latin1")

Next, read in the data verbatim using readLines():

temp <- readLines("pollresults_resultatsbureau13001.csv")

Following this, simply replace the first comma in each line of data with a comma followed by a quotation mark. This works because the first field is always 5 characters long, so the first comma is always the 6th character. Note that it leaves the header untouched.

temp[-1] <- gsub('^(.{6})(.*)$', '\\1\\"\\2', temp[-1])
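As a quick check of what the substitution does, here is a small sketch that applies it to one of the sample lines from the question and parses the result. It also shows a variant anchored on the first comma rather than on a fixed width of 6 characters. That variant is an assumption on my part, not part of the original approach, and the variable names are only illustrative.

```r
# One problematic line from the question (note the missing opening quote
# before Central Nova and the comma inside "Pictou, Subd. A")
line <- '12002,Central Nova","Nova-Centre"," 6-1","Pictou, Subd. A",N,N,"",0,168,"Parker","","David K.","NDP-New Democratic Party","NPD-Nouveau Parti democratique",N,N,28'

# The fixed-width substitution used above: insert a quote after 6 characters
fixed_width <- gsub('^(.{6})(.*)$', '\\1\\"\\2', line)

# Variant (assumption): insert the quote after the first comma, whatever
# the length of the first field
first_comma <- sub('^([^,]*,)', '\\1"', line)

# With balanced quotes, read.csv() keeps "Pictou, Subd. A" in one field
parsed <- read.csv(text = first_comma, header = FALSE,
                   stringsAsFactors = FALSE)

print(fixed_width)
print(parsed$V5)
```

Both substitutions produce the same repaired line here. The first-comma variant only matters if the leading code were ever not exactly 5 characters long.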

Penultimately, write over the original file.

fileConn <- file("pollresults_resultatsbureau13001.csv")
writeLines(temp, fileConn)
close(fileConn)

Finally, simply read the data back into R:

data<-read.csv(file="pollresults_resultatsbureau13001.csv",header = TRUE,sep=",")

There is probably a more parsimonious way to do this (and one that can be iterated more easily) but this process made sense to me.
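One possible way to make this easier to iterate is to wrap the steps in a function that repairs the lines in memory and never overwrites the original file. The sketch below is only that: the function name read_poll_csv, the column names in the demo header, and the shortened demo line are all made up for illustration, and it assumes the first field never contains a comma.

```r
# Hypothetical helper: read one poll file, add the missing opening quote
# after the first comma of every data line, and parse the repaired text
read_poll_csv <- function(path, encoding = "latin1") {
  con <- file(path, open = "r", encoding = encoding)
  on.exit(close(con))
  lines <- readLines(con)
  # Leave the header (line 1) untouched, as in the steps above
  lines[-1] <- sub('^([^,]*,)', '\\1"', lines[-1])
  read.csv(text = lines, header = TRUE, stringsAsFactors = FALSE)
}

# Demo on a temporary file with a made-up header and a shortened data line
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  'district,name_en,name_fr,poll,place',
  '12002,Central Nova","Nova-Centre"," 6-1","Pictou, Subd. A"'
), tmp)

res <- read_poll_csv(tmp)
print(res$place)

unlink(tmp)
```

With a function like this, all files could be read with something along the lines of lapply(list.files(pattern = "\\.csv$"), read_poll_csv), without setting options(encoding = ...) globally.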