Split a vector into chunks in R

2018-12-31 06:53发布

I have to split a vector into n chunks of equal size in R. I couldn't find any base function to do that. Also Google didn't get me anywhere. So here is what I came up with, hopefully it helps someone some where.

x <- 1:10
n <- 3
chunk <- function(x,n) split(x, factor(sort(rank(x)%%n)))
chunk(x,n)
$`0`
[1] 1 2 3

$`1`
[1] 4 5 6 7

$`2`
[1]  8  9 10

Any comments, suggestions or improvements are really welcome and appreciated.

Cheers, Sebastian

标签: r vector
19条回答
与君花间醉酒
2楼-- · 2018-12-31 07:45

Try the ggplot2 function, cut_number:

library(ggplot2)
x <- 1:10
n <- 3
cut_number(x, n) # labels = FALSE if you just want an integer result
#>  [1] [1,4]  [1,4]  [1,4]  [1,4]  (4,7]  (4,7]  (4,7]  (7,10] (7,10] (7,10]
#> Levels: [1,4] (4,7] (7,10]

# if you want it split into a list:
split(x, cut_number(x, n))
#> $`[1,4]`
#> [1] 1 2 3 4
#> 
#> $`(4,7]`
#> [1] 5 6 7
#> 
#> $`(7,10]`
#> [1]  8  9 10
查看更多
何处买醉
3楼-- · 2018-12-31 07:48

I need a function that takes the argument of a data.table (in quotes) and another argument that is the upper limit on the number of rows in the subsets of that original data.table. This function produces whatever number of data.tables that upper limit allows for:

library(data.table)    
split_dt <- function(x,y) 
    {
    for(i in seq(from=1,to=nrow(get(x)),by=y)) 
        {df_ <<- get(x)[i:(i + y)];
            assign(paste0("df_",i),df_,inherits=TRUE)}
    rm(df_,inherits=TRUE)
    }

This function gives me a series of data.tables named df_[number] with the starting row from the original data.table in the name. The last data.table can be short and filled with NAs so you have to subset that back to whatever data is left. This type of function is useful because certain GIS software have limits on how many address pins you can import, for example. So slicing up data.tables into smaller chunks may not be recommended, but it may not be avoidable.

查看更多
梦该遗忘
4楼-- · 2018-12-31 07:50

Simple function for splitting a vector by simply using indexes - no need to over complicate this

vsplit <- function(v, n) {
    l = length(v)
    r = l/n
    return(lapply(1:n, function(i) {
        s = max(1, round(r*(i-1))+1)
        e = min(l, round(r*i))
        return(v[s:e])
    }))
}
查看更多
素衣白纱
5楼-- · 2018-12-31 07:50

Yet another possibility is the splitIndices function from package parallel:

library(parallel)
splitIndices(20, 3)

Gives:

[[1]]
[1] 1 2 3 4 5 6 7

[[2]]
[1]  8  9 10 11 12 13

[[3]]
[1] 14 15 16 17 18 19 20
查看更多
闭嘴吧你
6楼-- · 2018-12-31 07:54

This will split it differently to what you have, but is still quite a nice list structure I think:

chunk.2 <- function(x, n, force.number.of.groups = TRUE, len = length(x), groups = trunc(len/n), overflow = len%%n) { 
  if(force.number.of.groups) {
    f1 <- as.character(sort(rep(1:n, groups)))
    f <- as.character(c(f1, rep(n, overflow)))
  } else {
    f1 <- as.character(sort(rep(1:groups, n)))
    f <- as.character(c(f1, rep("overflow", overflow)))
  }

  g <- split(x, f)

  if(force.number.of.groups) {
    g.names <- names(g)
    g.names.ordered <- as.character(sort(as.numeric(g.names)))
  } else {
    g.names <- names(g[-length(g)])
    g.names.ordered <- as.character(sort(as.numeric(g.names)))
    g.names.ordered <- c(g.names.ordered, "overflow")
  }

  return(g[g.names.ordered])
}

Which will give you the following, depending on how you want it formatted:

> x <- 1:10; n <- 3
> chunk.2(x, n, force.number.of.groups = FALSE)
$`1`
[1] 1 2 3

$`2`
[1] 4 5 6

$`3`
[1] 7 8 9

$overflow
[1] 10

> chunk.2(x, n, force.number.of.groups = TRUE)
$`1`
[1] 1 2 3

$`2`
[1] 4 5 6

$`3`
[1]  7  8  9 10

Running a couple of timings using these settings:

set.seed(42)
x <- rnorm(1:1e7)
n <- 3

Then we have the following results:

> system.time(chunk(x, n)) # your function 
   user  system elapsed 
 29.500   0.620  30.125 

> system.time(chunk.2(x, n, force.number.of.groups = TRUE))
   user  system elapsed 
  5.360   0.300   5.663 

EDIT: Changing from as.factor() to as.character() in my function made it twice as fast.

查看更多
姐姐魅力值爆表
7楼-- · 2018-12-31 07:57

I needed the same function and have read the previous solutions, however i also needed to have the unbalanced chunk to be at the end i.e if i have 10 elements to split them into vectors of 3 each, then my result should have vectors with 3,3,4 elements respectively. So i used the following (i left the code unoptimised for readability, otherwise no need to have many variables):

chunk <- function(x,n){
  numOfVectors <- floor(length(x)/n)
  elementsPerVector <- c(rep(n,numOfVectors-1),n+length(x) %% n)
  elemDistPerVector <- rep(1:numOfVectors,elementsPerVector)
  split(x,factor(elemDistPerVector))
}
set.seed(1)
x <- rnorm(10)
n <- 3
chunk(x,n)
$`1`
[1] -0.6264538  0.1836433 -0.8356286

$`2`
[1]  1.5952808  0.3295078 -0.8204684

$`3`
[1]  0.4874291  0.7383247  0.5757814 -0.3053884
查看更多
登录 后发表回答