How can I partition a matrix or data frame into N equally sized chunks in R? I want to cut the matrix or data frame horizontally, that is, by rows.
For example, given:
r = 8
c = 10
number_of_chunks = 4
data = matrix(seq(r*c), nrow = r, ncol=c)
> data
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    1    9   17   25   33   41   49   57   65    73
[2,]    2   10   18   26   34   42   50   58   66    74
[3,]    3   11   19   27   35   43   51   59   67    75
[4,]    4   12   20   28   36   44   52   60   68    76
[5,]    5   13   21   29   37   45   53   61   69    77
[6,]    6   14   22   30   38   46   54   62   70    78
[7,]    7   15   23   31   39   47   55   63   71    79
[8,]    8   16   24   32   40   48   56   64   72    80
I would like to cut data into a list of 4 elements:
Element 1:
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    1    9   17   25   33   41   49   57   65    73
[2,]    2   10   18   26   34   42   50   58   66    74
Element 2:
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[3,]    3   11   19   27   35   43   51   59   67    75
[4,]    4   12   20   28   36   44   52   60   68    76
Element 3:
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[5,]    5   13   21   29   37   45   53   61   69    77
[6,]    6   14   22   30   38   46   54   62   70    78
Element 4:
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[7,]    7   15   23   31   39   47   55   63   71    79
[8,]    8   16   24   32   40   48   56   64   72    80
With NumPy in Python, I can use numpy.array_split.
Here's an attempt in base R. Calculate "pretty" cut values for the sequence of row numbers using pretty. Categorize the row numbers with cut, then use split to return a list of the row numbers split at the cut values. Finally, run through the list of split row numbers using lapply and extract the matrix subsets with [.
lapply(split(seq_len(nrow(data)),
             cut(seq_len(nrow(data)), pretty(seq_len(nrow(data)), number_of_chunks))),
       function(x) data[x, , drop = FALSE])  # drop = FALSE keeps one-row chunks as matrices
$`(0,2]`
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    1    9   17   25   33   41   49   57   65    73
[2,]    2   10   18   26   34   42   50   58   66    74

$`(2,4]`
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    3   11   19   27   35   43   51   59   67    75
[2,]    4   12   20   28   36   44   52   60   68    76

$`(4,6]`
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    5   13   21   29   37   45   53   61   69    77
[2,]    6   14   22   30   38   46   54   62   70    78

$`(6,8]`
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    7   15   23   31   39   47   55   63   71    79
[2,]    8   16   24   32   40   48   56   64   72    80
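To see what each step contributes, the intermediate values can be inspected one at a time. This is only a sketch for the 8-row example; row_idx, breaks, groups and chunks are names introduced here for illustration.

```r
# Illustrative names (row_idx, breaks, groups, chunks) are not from the answer.
row_idx <- seq_len(8)            # row numbers of the 8-row example matrix
breaks  <- pretty(row_idx, 4)    # "pretty" break points covering 1..8
groups  <- cut(row_idx, breaks)  # factor assigning each row number to an interval
chunks  <- split(row_idx, groups)  # list of row numbers, one element per interval

print(breaks)          # 0 2 4 6 8 for this example
print(levels(groups))  # "(0,2]" "(2,4]" "(4,6]" "(6,8]"
print(chunks)
```

Note that pretty treats its second argument as a desired number of intervals rather than a guarantee, so for other row counts the number of chunks may differ from number_of_chunks.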
Roll this into a function:
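array_split <- function(data, number_of_chunks) {
  # Group the row indices at "pretty" break points, then subset each group of rows.
  rowIdx <- seq_len(nrow(data))
  lapply(split(rowIdx, cut(rowIdx, pretty(rowIdx, number_of_chunks))),
         function(x) data[x, , drop = FALSE])  # drop = FALSE keeps one-row chunks as matrices
}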
Then, you can use
array_split(data=data, number_of_chunks=number_of_chunks)
to return the same result as above.
A nice simplification suggested by @user20650 is
split.data.frame(data,
cut(seq_len(nrow(data)), pretty(seq_len(nrow(data)), number_of_chunks)))
To my surprise, split.data.frame returns a list of matrices when its first argument is a matrix.
number_of_chunks = 4
chunk_size = ceiling(NROW(data) / number_of_chunks)  # rows per chunk
lapply(seq(1, NROW(data), chunk_size),
       function(i) data[i:min(i + chunk_size - 1, NROW(data)), , drop = FALSE])
Or, if the number of rows is a multiple of number_of_chunks:
lapply(split(data, rep(1:number_of_chunks, each = NROW(data)/number_of_chunks)),
       function(a) matrix(a, ncol = NCOL(data)))
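When the number of rows is not a multiple of the number of chunks, the chunks cannot all be the same size. One possible convention, the one numpy.array_split documents, is to give the first nrow %% number_of_chunks chunks one extra row. Below is a minimal sketch of that convention; array_split_rows is a hypothetical helper name, not something from base R or from the answers here.

```r
# Hypothetical helper: split the rows of a matrix or data frame into n_chunks
# pieces whose sizes differ by at most one row (larger pieces come first).
array_split_rows <- function(x, n_chunks) {
  n <- NROW(x)
  base_size <- n %/% n_chunks   # minimum number of rows per chunk
  n_larger  <- n %% n_chunks    # how many chunks receive one extra row
  sizes <- rep(c(base_size + 1, base_size),
               times = c(n_larger, n_chunks - n_larger))
  # Label every row with its chunk number; fixed levels keep empty chunks.
  chunk_id <- factor(rep(seq_len(n_chunks), times = sizes),
                     levels = seq_len(n_chunks))
  lapply(split(seq_len(n), chunk_id),
         function(rows) x[rows, , drop = FALSE])
}

m <- matrix(seq(10 * 3), nrow = 10, ncol = 3)
parts <- array_split_rows(m, 4)
print(sapply(parts, nrow))  # chunk sizes: 3 3 2 2
```

Splitting the row indices and subsetting with drop = FALSE means the same function works for both matrices and data frames.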
Try not to split the data explicitly, because that makes another copy of it. Instead, split the row indices you want to access.
With the function below, you can split either by number of chunks (for example, for parallelism) or by chunk size.
CutBySize <- function(m, block.size, nb = ceiling(m / block.size)) {
int <- m / nb
upper <- round(1:nb * int)
lower <- c(1, upper[-nb] + 1)
size <- c(upper[1], diff(upper))
cbind(lower, upper, size)
}
CutBySize(nrow(data), nb = number_of_chunks)
     lower upper size
[1,]     1     2    2
[2,]     3     4    2
[3,]     5     6    2
[4,]     7     8    2
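The returned bounds can then be used to process one block of rows at a time, without building a list of all chunks up front. A minimal sketch, reusing CutBySize as defined above; the per-chunk sum is only a placeholder computation, and intervals and chunk_sums are illustrative names.

```r
# CutBySize repeated from the answer so that this block runs on its own.
CutBySize <- function(m, block.size, nb = ceiling(m / block.size)) {
  int <- m / nb
  upper <- round(1:nb * int)
  lower <- c(1, upper[-nb] + 1)
  size <- c(upper[1], diff(upper))
  cbind(lower, upper, size)
}

data <- matrix(seq(8 * 10), nrow = 8, ncol = 10)
intervals <- CutBySize(nrow(data), nb = 4)

# Visit each block through its row bounds; only one block is subset at a time.
chunk_sums <- vapply(seq_len(nrow(intervals)), function(k) {
  rows <- intervals[k, "lower"]:intervals[k, "upper"]
  sum(data[rows, , drop = FALSE])  # placeholder for the real per-chunk work
}, numeric(1))

print(chunk_sums)  # 750 790 830 870
```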