I have a CSV file where some of the numerical values are expressed as strings with commas as the thousands separator, e.g. "1,513" instead of 1513. What is the simplest way to read the data into R?
I can use read.csv(..., colClasses="character"), but then I have to strip out the commas from the relevant elements before converting those columns to numeric, and I can't find a neat way to do that.
I want to use R rather than pre-processing the data as it makes it easier when the data are revised. Following Shane's suggestion of using gsub, I think this is about as neat as I can do:

This question is several years old, but I stumbled upon it, which means maybe others will.
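For reference, the gsub route mentioned in the question can be sketched in base R roughly as follows. This is only a minimal illustration and not the original poster's code; the inline CSV text and the column names item and amount are made up for the example.

```r
# Hypothetical CSV content standing in for the real file
csv_text <- 'item,amount\nwidget,"1,513"\ngadget,"27"\ngizmo,"1,234,567"'

# Read every column as character so the quoted "1,513" survives intact
raw <- read.csv(text = csv_text, colClasses = "character")

# Strip the thousands separators, then convert the column to numeric
raw$amount <- as.numeric(gsub(",", "", raw$amount, fixed = TRUE))

str(raw)
```

With a real file, the text argument would be replaced by the file path, and the conversion would be applied to each affected column.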
The readr package has some nice features. One of them is a nice way to interpret "messy" columns, like these.

This yields

Source: local data frame [4 x 1]
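A minimal sketch of this kind of readr call, assuming readr 2.0 or later (needed for passing literal text via I()). The inline data and the column name numbers are illustrative only, and col_number() is used here as the column type that drops grouping marks while parsing.

```r
library(readr)

# Hypothetical one-column input mixing plain and comma-grouped values
txt <- "numbers\n800\n\"1,800\"\n\"3500\"\n6.5"

# Declare the column type up front so the commas are handled while reading
df <- read_csv(I(txt), col_types = cols(numbers = col_number()))

# The same parser is available for a character vector already in memory
parsed <- parse_number(c("800", "1,800", "3500", "6.5"))

df
```

Declaring the type in col_types means the parsing rule is stated explicitly instead of being left to type guessing.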
An important point when reading in files: you either have to pre-process, like the comment above regarding sed, or you have to process while reading. Often, if you try to fix things after the fact, there are some dangerous assumptions made that are hard to find. (Which is why flat files are so evil in the first place.)

For instance, if I had not flagged the col_types, I would have gotten this:

(Notice that it is now a chr (character) instead of a numeric.)

Or, more dangerously, if it were long enough and most of the early elements did not contain commas:

(such that the last few elements look like:)

Then you'll find trouble reading that comma at all!
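One concrete way the after-the-fact route can go wrong, sketched in base R with made-up values: converting a character vector that still contains commas does not stop with an error, it only yields NA with a warning, so it may be worth checking for NAs explicitly.

```r
# Hypothetical column that was read in as character
x <- c("800", "1,800", "3500", "6.5")

# Naive conversion: the comma-grouped value becomes NA (normally with a warning)
naive <- suppressWarnings(as.numeric(x))
is.na(naive)

# Strip the separators first, then fail loudly if anything is still unparsed
cleaned <- as.numeric(gsub(",", "", x, fixed = TRUE))
stopifnot(!anyNA(cleaned))
```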
A dplyr solution using mutate_each and pipes.

Say you have a data frame and want to remove the commas from the year variables X2014-X2016 and convert them to numeric. Also, let's say X2014-X2016 are read in as factors (the default for read.csv before R 4.0.0).
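A rough sketch of that pipeline, assuming dplyr 1.0.0 or later. Note that it swaps in mutate() with across() instead of mutate_each() and funs(), since the latter two are deprecated in recent dplyr releases; the data frame dft and its values are invented for illustration.

```r
library(dplyr)

# Hypothetical data: year columns stored as factors with comma separators
dft <- data.frame(
  id    = c("a", "b", "c"),
  X2014 = factor(c("1,200", "350", "12,000")),
  X2015 = factor(c("1,300", "400", "13,500")),
  X2016 = factor(c("1,450", "425", "14,250"))
)

# One step at a time: factor -> character, strip commas, character -> numeric
dft_clean <- dft %>%
  mutate(across(X2014:X2016, as.character)) %>%
  mutate(across(X2014:X2016, ~ gsub(",", "", .x, fixed = TRUE))) %>%
  mutate(across(X2014:X2016, as.numeric))

str(dft_clean)
```

Converting to character first matters: calling as.numeric directly on a factor returns the internal level codes, not the displayed values.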
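mutate_each applies the function(s) inside funs to the specified columns.

I did it sequentially, one function at a time (if you use multiple functions inside funs, you create additional, unnecessary columns).

If the number uses "." as the thousands separator and "," as the decimal mark (1.200.000,00), you must set fixed=TRUE when calling gsub, because an unescaped "." is a regular-expression wildcard that matches any character: gsub(".","",y,fixed=TRUE). The decimal comma then also has to be replaced by "." before as.numeric can parse the result.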
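A small base R sketch of the fixed=TRUE point, using made-up values in y. It shows what happens without fixed=TRUE and then completes the conversion by also swapping the decimal comma for a dot.

```r
# Hypothetical values with "." as thousands separator and "," as decimal mark
y <- c("1.200.000,00", "3.500,75", "12,5")

# Without fixed = TRUE, "." is a regex wildcard and every character is removed
wiped <- gsub(".", "", y)

# With fixed = TRUE, only literal dots are removed
no_dots <- gsub(".", "", y, fixed = TRUE)

# Turn the decimal comma into a dot so as.numeric can parse the value
num <- as.numeric(sub(",", ".", no_dots, fixed = TRUE))

num
```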
A very convenient way is the readr::read_delim family. Taking the example from here: Importing csv with multiple separators into R, you can do it as follows:

Which gives the expected result:
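As a hedged illustration of the read_delim family (this is not the data from the linked question), the sketch below assumes readr 2.0 or later and an invented semicolon-delimited input that uses "." for thousands and "," for decimals. The locale() arguments tell the number parser which marks to expect.

```r
library(readr)

# Hypothetical input: ";" as delimiter, "." for thousands, "," for decimals
txt <- "name;value\na;1.200.000,50\nb;3.500,00\nc;42"

df <- read_delim(
  I(txt),
  delim = ";",
  locale = locale(decimal_mark = ",", grouping_mark = "."),
  col_types = cols(name = col_character(), value = col_number())
)

df
```

For a file on disk, I(txt) would be replaced by the file path; the locale and col_types arguments stay the same.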
Another solution:
It will be considerably slower than gsub, though.