How to read data when some numbers contain commas

Posted 2018-12-31 09:56

I have a csv file where some of the numerical values are expressed as strings with commas as thousand separator, e.g. "1,513" instead of 1513. What is the simplest way to read the data into R?

I can use read.csv(..., colClasses="character"), but then I have to strip out the commas from the relevant elements before converting those columns to numeric, and I can't find a neat way to do that.

Tags: r csv
12 answers
明月照影归
Reply #2 · 2018-12-31 10:28

I want to use R rather than pre-processing the data as it makes it easier when the data are revised. Following Shane's suggestion of using gsub, I think this is about as neat as I can do:

x <- read.csv("file.csv", header = TRUE, colClasses = "character")
col2cvt <- 15:41  # positions of the columns that contain comma-formatted numbers
x[, col2cvt] <- lapply(x[, col2cvt], function(col) as.numeric(gsub(",", "", col)))
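For anyone who wants to try the gsub approach without a file at hand, here is a minimal self-contained sketch. The inline data and the column names (id, sales, units) are made up for illustration; the real file would be read with read.csv("file.csv", ...) as above.

```r
# Hypothetical inline data standing in for file.csv
csv_text <- 'id,sales,units
1,"1,513","2,000"
2,"987","15"
3,"12,345,678","1,001"'

# Read everything as character so the quoted "1,513" style values survive intact
dat <- read.csv(text = csv_text, header = TRUE, colClasses = "character")

# Columns to convert, selected by name here (the answer above uses positions 15:41)
cols <- c("sales", "units")

# Strip the thousands separators, then convert to numeric
dat[cols] <- lapply(dat[cols], function(col) {
  as.numeric(gsub(",", "", col, fixed = TRUE))
})

str(dat)
```

Selecting by name rather than by position is just a choice for the sketch; it tends to survive column reordering when the data are revised.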
泪湿衣
Reply #3 · 2018-12-31 10:33

This question is several years old, but I stumbled upon it, which means maybe others will.

The readr package has some nice features. One of them is a clean way to interpret "messy" columns like these.

library(readr)
read_csv("numbers\n800\n\"1,800\"\n\"3500\"\n6.5",
          col_types = list(col_numeric())
        )

This yields

Source: local data frame [4 x 1]

  numbers
    (dbl)
1   800.0
2  1800.0
3  3500.0
4     6.5

An important point when reading in files: you either have to pre-process, like the comment above regarding sed, or you have to process while reading. Often, if you try to fix things after the fact, there are some dangerous assumptions made that are hard to find. (Which is why flat files are so evil in the first place.)
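As far as I know, more recent readr releases spell this column type col_number() and expect literal data to be wrapped in I(). A minimal sketch of processing while reading under those assumptions (it needs a reasonably current readr installed):

```r
library(readr)

# I() marks the string as literal data rather than a file path
# (assumption: readr 2.0 or later)
csv_text <- I("numbers\n800\n\"1,800\"\n\"3500\"\n6.5\n")

# col_number() drops grouping marks such as "," while parsing the column
dat <- read_csv(csv_text, col_types = cols(numbers = col_number()))

dat$numbers
```

Naming the column in cols() makes the intent explicit, so a value like "1,800" is never left to type guessing.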

For instance, if I had not flagged the col_types, I would have gotten this:

> read_csv("numbers\n800\n\"1,800\"\n\"3500\"\n6.5")
Source: local data frame [4 x 1]

  numbers
    (chr)
1     800
2   1,800
3    3500
4     6.5

(Notice that it is now a chr (character) instead of a numeric.)

Or, more dangerously, if it were long enough and most of the early elements did not contain commas:

> set.seed(1)
> tmp <- as.character(sample(c(1:10), 100, replace=TRUE))
> tmp <- c(tmp, "1,003")
> tmp <- paste(tmp, collapse="\"\n\"")

(such that the last few elements look like:)

\"5\"\n\"9\"\n\"7\"\n\"1,003"

Then you'll find trouble reading that comma at all!

> tail(read_csv(tmp))
Source: local data frame [6 x 1]

     3"
  (dbl)
1 8.000
2 5.000
3 5.000
4 9.000
5 7.000
6 1.003
Warning message:
1 problems parsing literal data. See problems(...) for more details. 
公子世无双
Reply #4 · 2018-12-31 10:36

A dplyr solution using mutate_each and pipes.

Say you have the following:

> dft
Source: local data frame [11 x 5]

   Bureau.Name Account.Code   X2014   X2015   X2016
1       Senate          110 158,000 211,000 186,000
2       Senate          115       0       0       0
3       Senate          123  15,000  71,000  21,000
4       Senate          126   6,000  14,000   8,000
5       Senate          127 110,000 234,000 134,000
6       Senate          128 120,000 159,000 134,000
7       Senate          129       0       0       0
8       Senate          130 368,000 465,000 441,000
9       Senate          132       0       0       0
10      Senate          140       0       0       0
11      Senate          140       0       0       0

and you want to remove the commas from the year variables X2014-X2016 and convert them to numeric. Also, let's say X2014-X2016 were read in as factors (the read.csv default before R 4.0).

dft %>%
    mutate_each(funs(as.character(.)), X2014:X2016) %>%
    mutate_each(funs(gsub(",", "", .)), X2014:X2016) %>%
    mutate_each(funs(as.numeric(.)), X2014:X2016)

mutate_each applies the function(s) inside funs to the specified columns.

I did it sequentially, one function at a time (if you put multiple functions inside one funs call, you create additional, unnecessary columns).
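Note that mutate_each and funs have since been deprecated in dplyr. A rough equivalent in newer versions (dplyr 1.0 or later) uses across; the small data frame below is a made-up miniature of dft, not the original data.

```r
library(dplyr)

# Hypothetical miniature of the dft table above, with factor year columns
dft <- data.frame(
  Bureau.Name  = c("Senate", "Senate", "Senate"),
  Account.Code = c(110, 115, 123),
  X2014 = factor(c("158,000", "0", "15,000")),
  X2015 = factor(c("211,000", "0", "71,000")),
  X2016 = factor(c("186,000", "0", "21,000"))
)

# One pass per column: factor -> character -> strip commas -> numeric
dft_num <- dft %>%
  mutate(across(X2014:X2016, ~ as.numeric(gsub(",", "", as.character(.x)))))

str(dft_num)
```

The explicit as.character step matters for factors: calling as.numeric directly on a factor returns the internal level codes, not the printed values.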

明月照影归
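Reply #5 · 2018-12-31 10:38

If thousands are separated by "." and decimals by "," (1.200.000,00), you must set fixed=TRUE when calling gsub, because an unescaped "." is a regular expression that matches any character. After removing the dots, the decimal comma still has to be turned into a "." before as.numeric can parse the value: as.numeric(sub(",", ".", gsub(".", "", y, fixed=TRUE), fixed=TRUE))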
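For numbers in this style, the grouping dots have to go and the decimal comma has to become a dot before as.numeric can parse them. A minimal base R sketch with made-up values:

```r
# Made-up values in "1.200.000,00" style (dot groups, comma decimals)
y <- c("1.200.000,00", "3,5", "12.345,678")

# fixed = TRUE makes gsub treat "." as a literal dot
no_groups <- gsub(".", "", y, fixed = TRUE)

# Turn the decimal comma into a dot so as.numeric can parse it
nums <- as.numeric(sub(",", ".", no_groups, fixed = TRUE))
print(nums)

# Without fixed = TRUE, "." is a regex that matches every character,
# so the whole string is deleted
emptied <- gsub(".", "", "1.200")
print(emptied)
```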

谁念西风独自凉
Reply #6 · 2018-12-31 10:38

A very convenient way is the readr::read_delim family. Taking the example from the question "Importing csv with multiple separators into R", you can do it as follows:

txt <- 'OBJECTID,District_N,ZONE_CODE,COUNT,AREA,SUM
1,Bagamoyo,1,"136,227","8,514,187,500.000000000000000","352,678.813105723350000"
2,Bariadi,2,"88,350","5,521,875,000.000000000000000","526,307.288878142830000"
3,Chunya,3,"483,059","30,191,187,500.000000000000000","352,444.699742995200000"'

require(readr)
read_csv(txt) # = read_delim(txt, delim = ",")

This gives the expected result:

# A tibble: 3 × 6
  OBJECTID District_N ZONE_CODE  COUNT        AREA      SUM
     <int>      <chr>     <int>  <dbl>       <dbl>    <dbl>
1        1   Bagamoyo         1 136227  8514187500 352678.8
2        2    Bariadi         2  88350  5521875000 526307.3
3        3     Chunya         3 483059 30191187500 352444.7
几人难应
Reply #7 · 2018-12-31 10:38

Another solution:

y <- c("1,200", "20,000", "100", "12,111")

as.numeric(unlist(lapply(strsplit(y, ","), paste, collapse = "")))

It will be considerably slower than gsub, though.
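To check the speed difference on your own machine, something like the sketch below could be used. The vector size is arbitrary and timings will vary, so no numbers are quoted here; the sketch only confirms that both approaches return the same values.

```r
# Arbitrary test vector: the four sample values repeated many times
y <- rep(c("1,200", "20,000", "100", "12,111"), times = 25000)

# strsplit approach: split on commas, then glue the pieces back together
t_split <- system.time(
  via_split <- as.numeric(
    vapply(strsplit(y, ",", fixed = TRUE), paste, character(1), collapse = "")
  )
)

# gsub approach: delete the commas in a single vectorised call
t_gsub <- system.time(
  via_gsub <- as.numeric(gsub(",", "", y, fixed = TRUE))
)

# Both approaches should agree exactly
stopifnot(identical(via_split, via_gsub))

print(t_split)
print(t_gsub)
```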
