Count values separated by a comma in a character s

Posted 2019-01-18 20:39

Question:

I have this example data

d<-"30,3"
class(d)

I have character values like this in one column of my working data frame, and I need to find out how many numbers each one contains.

I tried length(d), but it returns 1.

After looking for a solution here, I also tried

eval(parse(text='d'))
as.numeric(d)
as.vector.character(d)

But it still doesn't work.

Is there a straightforward way to solve this problem?

Answer 1:

These two approaches are each short, work on vectors of strings, avoid the expense of explicitly constructing the split string, and use no packages. Here d is a vector of strings such as d <- c("1,2,3", "5,2"):

1) count.fields

count.fields(textConnection(d), sep = ",")

2) gregexpr

lengths(gregexpr(",", d)) + 1
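
One caveat about the gregexpr approach: when a string contains no comma, gregexpr() returns -1 as a length-1 vector, so that element is counted as 2 values rather than 1. A minimal sketch of the difference, assuming single-value strings can occur in the data (the names naive and safe are introduced here for illustration only):

```r
d <- c("1,2,3", "5,2", "7")   # "7" has no comma

# gregexpr() returns -1 (a length-1 vector) when nothing matches,
# so the no-comma element is reported as 2 values instead of 1
naive <- lengths(gregexpr(",", d, fixed = TRUE)) + 1

# regmatches() keeps only the actual matches, so "7" yields zero commas
safe <- lengths(regmatches(d, gregexpr(",", d, fixed = TRUE))) + 1

naive
#[1] 3 2 2
safe
#[1] 3 2 1
```

If every string is known to contain at least one comma, the shorter form gives the same result.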


Answer 2:

You could use scan.

 v1 <- scan(text=d, sep=',', what=numeric(), quiet=TRUE)
 v1
 #[1] 30  3

Or use stri_split from stringi. It should accept both character and factor input without an explicit conversion via as.character

library(stringi)
v2 <- as.numeric(unlist(stri_split(d,fixed=',')))
v2
#[1] 30  3

You can then get the count in base R with

length(v1)
#[1] 2

Or

nchar(gsub('[^,]', '', d))+1
#[1] 2
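
To see why the nchar/gsub expression works, it can be broken into steps: the pattern [^,] matches every character that is not a comma, so removing those matches leaves only the commas, and the number of values is the number of commas plus one. A minimal sketch, with the intermediate name commas_only introduced here for illustration:

```r
d <- c("30,3,5", "30,5")

# delete everything that is not a comma
commas_only <- gsub("[^,]", "", d)
commas_only
#[1] ",," ","

# number of values = number of commas + 1
n_values <- nchar(commas_only) + 1
n_values
#[1] 3 2
```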

Update

If d is a column in a data frame df and you want to subset the rows that contain exactly 2 values

  d<-c("30,3,5","30,5") 
  df <- data.frame(d,stringsAsFactors=FALSE)
  df[nchar(gsub('[^,]', '',df$d))+1==2,,drop=FALSE]
  #    d
  #2 30,5

As a check, no row contains 10 values

  df[nchar(gsub('[^,]', '',df$d))+1==10,,drop=FALSE]
  #[1] d
  #<0 rows> (or 0-length row.names)


Answer 3:

Here is a possibility

as.numeric(unlist(strsplit("30,3", ",")))
#[1] 30  3
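
This returns the values themselves rather than how many there are. To get the count per string, one option is to skip unlist() and take the length of each split piece. A minimal sketch, assuming d is a character vector (the names parts and n_values are illustrative):

```r
d <- c("30,3,5", "30,5")

# strsplit() returns a list with one character vector per input string
parts <- strsplit(d, ",", fixed = TRUE)

# lengths() gives the number of pieces in each list element
n_values <- lengths(parts)
n_values
#[1] 3 2
```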


Answer 4:

You could also try the stri_count_* functions from the stringi package (they should be very efficient)

library(stringi)
stri_count_regex(d, "\\d+")
## [1] 2
stri_count_fixed(d, ",") + 1
## [1] 2

The stringr package has similar functionality

library(stringr)
str_count(d, "\\d+")
## [1] 2
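
Note that counting runs of digits and counting commas agree only when every value is a plain integer. A minimal sketch of where they diverge, assuming decimal values might appear in the data; base R's gregexpr with [0-9]+ is used here as a stand-in for the "\\d+" pattern so the example needs no packages:

```r
d <- c("30,3", "30.5,3")

# count runs of digits: "30.5" contributes two runs, "30" and "5"
digit_runs <- lengths(regmatches(d, gregexpr("[0-9]+", d)))

# count separators instead
by_comma <- nchar(gsub("[^,]", "", d)) + 1

digit_runs
#[1] 2 3
by_comma
#[1] 2 2
```

For data like the question's, which holds integers only, either form gives the same counts.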

Update:

If you want to subset your data set to the rows that contain exactly 2 values, you could try

df[stri_count_regex(df$d, "\\d+") == 2,, drop = FALSE]
#      d
# 2 30,5

Or simpler

subset(df, stri_count_regex(d, "\\d+") == 2)
#      d
# 2 30,5

Update #2

Here's a benchmark that illustrates why one should consider using external packages (@rengis's answer wasn't included because it doesn't answer the question)

library(microbenchmark)
library(stringi)
d <- rep("30,3", 1e4)

microbenchmark( akrun = nchar(gsub('[^,]', '', d))+1,
                GG1 = count.fields(textConnection(d), sep = ","),
                GG2 = sapply(gregexpr(",", d), length) + 1,
                DA1 = stri_count_regex(d, "\\d+"),
                DA2 = stri_count_fixed(d, ",") + 1)

# Unit: microseconds
#  expr       min         lq       mean     median        uq       max neval
# akrun  8817.950  9479.9485 11489.7282 10642.4895 12480.845  46538.39   100
#   GG1 55451.474 61906.2460 72324.0820 68783.9935 78980.216 150673.72   100
#   GG2 33026.455 43349.5900 60960.8762 51825.6845 72293.923 203126.27   100
#   DA1  4730.302  5120.5145  6206.8297  5550.7930  7179.536  10507.09   100
#   DA2   380.147   418.2395   534.6911   448.2405   597.259   2278.11   100
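
Timings are only comparable if the expressions return the same counts, so it can be worth checking that first. A minimal sketch covering the three base R expressions from the benchmark (the stringi ones are left out so the check runs without packages; the names con, res and agree are illustrative):

```r
d <- rep("30,3", 1e4)

# open the text connection explicitly so it can be closed afterwards
con <- textConnection(d)
res <- list(
  akrun = nchar(gsub("[^,]", "", d)) + 1,
  GG1   = count.fields(con, sep = ","),
  GG2   = sapply(gregexpr(",", d), length) + 1
)
close(con)

# each approach should report 2 values for every one of the 10,000 strings
agree <- vapply(res, function(x) length(x) == length(d) && all(x == 2),
                logical(1))
agree
```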