Chopping a string into a vector of fixed width cha

Posted 2019-01-02 17:59

I have an object containing a text string:

x <- "xxyyxyxy"

and I want to split that into a vector with each element containing two letters:

[1] "xx" "yy" "xy" "xy"

It seems like strsplit should be my ticket, but since I have no regular expression foo, I can't figure out how to make it chop the string into chunks the way I want. How should I do this?

Tags: r strsplit
12 answers
墨雨无痕
Reply #2 · 2019-01-02 18:43

With C++ (via Rcpp) this can be even faster. Here is a comparison with GSee's version:

GSee <- function(x) {
  sst <- strsplit(x, "")[[1]]
  paste0(sst[c(TRUE, FALSE)], sst[c(FALSE, TRUE)])
}

rstub <- Rcpp::cppFunction( code = '
CharacterVector strsplit2(const std::string& hex) {
  unsigned int length = hex.length()/2;
  CharacterVector res(length);
  for (unsigned int i = 0; i < length; ++i) {
    res(i) = hex.substr(2*i, 2);
  }
  return res;
}')

x <- "xxyyxyxy"
all.equal(GSee(x), rstub(x))
#> [1] TRUE
microbenchmark::microbenchmark(GSee(x), rstub(x))
#> Unit: microseconds
#>      expr   min     lq      mean median     uq       max neval
#>   GSee(x) 4.272 4.4575  41.74284 4.5855 4.7105  3702.289   100
#>  rstub(x) 1.710 1.8990 139.40519 2.0665 2.1250 13722.075   100

set.seed(42)
x <- paste(sample(c("xx", "yy", "xy"), 1e5, replace = TRUE), collapse = "")
all.equal(GSee(x), rstub(x))
#> [1] TRUE
microbenchmark::microbenchmark(GSee(x), rstub(x))
#> Unit: milliseconds
#>      expr       min        lq      mean    median       uq       max neval
#>   GSee(x) 17.931801 18.431504 19.282877 18.738836 19.47943 27.191390   100
#>  rstub(x)  3.197587  3.261109  3.404973  3.341099  3.45852  4.872195   100
刘海飞了
Reply #3 · 2019-01-02 18:44

How about

strsplit(gsub("([[:alnum:]]{2})", "\\1 ", x), " ")[[1]]

Basically, add a separator (here " ") after every two characters and then use strsplit.
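The same separator trick can be parameterised by chunk width. This is only a sketch: `chop_with_sep` and its `n` and `sep` arguments are hypothetical names, and it assumes `sep` never occurs in the input string. It uses `.` instead of `[[:alnum:]]` so that non-alphanumeric characters are grouped as well.

```r
# Hypothetical helper: insert `sep` after every n characters, then split on it.
# Assumption: `sep` does not appear anywhere in `x`.
chop_with_sep <- function(x, n = 2, sep = "|") {
  pattern <- sprintf("(.{%d})", n)            # e.g. "(.{2})" for n = 2
  marked  <- gsub(pattern, paste0("\\1", sep), x)
  strsplit(marked, sep, fixed = TRUE)[[1]]    # fixed = TRUE: treat sep literally
}

chop_with_sep("xxyyxyxy")
# [1] "xx" "yy" "xy" "xy"
chop_with_sep("xxyyxyx")   # leftover characters form a shorter last chunk
# [1] "xx" "yy" "xy" "x"
```

The separator added after the final full chunk does no harm, because strsplit drops a trailing empty string.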

回忆,回不去的记忆
Reply #4 · 2019-01-02 18:45

Here's one way that doesn't use regular expressions:

a <- "xxyyxyxy"
n <- 2
sapply(seq(1,nchar(a),by=n), function(x) substr(a, x, x+n-1))
无色无味的生活
Reply #5 · 2019-01-02 18:45

Here is one option using stringi::stri_sub(). Try:

x <- "xxyyxyxy"
stringi::stri_sub(x, seq(1, stringi::stri_length(x), by = 2), length = 2)
# [1] "xx" "yy" "xy" "xy"
倾城一夜雪
Reply #6 · 2019-01-02 18:52

Careful with substring: if the string length is not a multiple of the requested chunk length n, you need a +(n-1) in the second sequence, otherwise the final, shorter chunk comes back as an empty string:

substring(x,seq(1,nchar(x),n),seq(n,nchar(x)+n-1,n)) 
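To make the caveat concrete, here is a small illustration with a 7-character string and n = 2. The variable names `without_fix` and `with_fix` are just for this example.

```r
x <- "xxyyxyx"   # 7 characters, not a multiple of n
n <- 2

# Without the + n - 1, the end positions (2, 4, 6) are one short of the
# start positions (1, 3, 5, 7) and get recycled, so the last chunk is empty.
without_fix <- substring(x, seq(1, nchar(x), n), seq(n, nchar(x), n))
without_fix
# [1] "xx" "yy" "xy" ""

# With the + n - 1, the end positions are 2, 4, 6, 8 and the tail is kept.
with_fix <- substring(x, seq(1, nchar(x), n), seq(n, nchar(x) + n - 1, n))
with_fix
# [1] "xx" "yy" "xy" "x"
```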
爱死公子算了
Reply #7 · 2019-01-02 18:55

In my testing, the code below was faster than the previously benchmarked methods. stri_sub is pretty fast, and seq.int is faster than seq. It's also easy to change the chunk size by changing all the 2Ls to something else.

library(stringi)

split_line <- function(x) {
  row_length <- stri_length(x)
  stri_sub(x, seq.int(1L, row_length, 2L), seq.int(2L, row_length, 2L))
}

I didn't notice a difference when the chunks were 2 characters long, but for bigger chunks this version, which uses the length argument, is slightly better:

split_line <- function(x) {
  stri_sub(x, seq.int(1L, stri_length(x), 109L), length = 109L)
}
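Rather than editing the literal chunk size, the width can be passed as an argument. The following is a sketch, not a benchmarked replacement: `split_fixed` and its parameter `n` are hypothetical names, and it assumes `x` is a single non-empty string. Because it uses the `length` argument of stri_sub, a string whose length is not a multiple of `n` simply yields a shorter last chunk.

```r
library(stringi)

# Hypothetical helper: split one non-empty string into chunks of width n.
split_fixed <- function(x, n = 2L) {
  starts <- seq.int(1L, stri_length(x), by = n)  # start position of each chunk
  stri_sub(x, from = starts, length = n)         # chunks are cut off at the end of x
}

split_fixed("xxyyxyxy")
# [1] "xx" "yy" "xy" "xy"
split_fixed("xxyyxyx", 3L)
# [1] "xxy" "yxy" "x"
```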