Question:
I need to split a sorted vector of unknown length in R into "top 10%, ..., bottom 10%".
So, for example, if I have vector <- order(c(1:98928))
, I want to split it into 10 different vectors, each one representing approximately 10% of the total length.
I've tried using split <- split(vector, 1:10)
but as I don't know the length of the vector, I get this warning if the length is not a multiple of 10:
data length is not a multiple of split variable
And even if it is a multiple and the function works, split()
does not keep the order of my original vector. This is what split gives:
split(c(1:10) , 1:2)
$`1`
[1] 1 3 5 7 9
$`2`
[1] 2 4 6 8 10
And this is what I want:
$`1`
[1] 1 2 3 4 5
$`2`
[1] 6 7 8 9 10
I'm a newbie in R and I've been trying lots of things without success. Does anyone know how to do this?
Answer 1:
Problem statement
Break a sorted vector x
every 10% into 10 chunks.
Note that there are two interpretations of this:
Cutting by vector index:
split(x, floor(10 * seq.int(0, length(x) - 1) / length(x)))
Cutting by vector values (say, quantiles):
split(x, cut(x, quantile(x, probs = 0:10 / 10, names = FALSE), include.lowest = TRUE))
In the following, I will demonstrate both using this data:
set.seed(0); x <- sort(round(rnorm(23),1))
Note that the example data are normally distributed rather than uniformly distributed, so cutting by index and cutting by value give substantially different results.
Result
cutting by index
#$`0`
#[1] -1.5 -1.2 -1.1
#
#$`1`
#[1] -0.9 -0.9
#
#$`2`
#[1] -0.8 -0.4
#
#$`3`
#[1] -0.3 -0.3 -0.3
#
#$`4`
#[1] -0.3 -0.2
#
#$`5`
#[1] 0.0 0.1
#
#$`6`
#[1] 0.3 0.4 0.4
#
#$`7`
#[1] 0.4 0.8
#
#$`8`
#[1] 1.3 1.3
#
#$`9`
#[1] 1.3 2.4
cutting by quantile
#$`[-1.5,-1.06]`
#[1] -1.5 -1.2 -1.1
#
#$`(-1.06,-0.86]`
#[1] -0.9 -0.9
#
#$`(-0.86,-0.34]`
#[1] -0.8 -0.4
#
#$`(-0.34,-0.3]`
#[1] -0.3 -0.3 -0.3 -0.3
#
#$`(-0.3,-0.2]`
#[1] -0.2
#
#$`(-0.2,0.14]`
#[1] 0.0 0.1
#
#$`(0.14,0.4]`
#[1] 0.3 0.4 0.4 0.4
#
#$`(0.4,0.64]`
#numeric(0)
#
#$`(0.64,1.3]`
#[1] 0.8 1.3 1.3 1.3
#
#$`(1.3,2.4]`
#[1] 2.4
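One caveat with the quantile version: when the data are heavily tied, several quantiles can coincide, and cut() refuses duplicated break points. The sketch below uses a made-up vector (x_tied is a hypothetical name, not from the question) to show the failure, and one possible workaround, keeping only the distinct break points. It is an assumption that getting fewer than 10 groups is acceptable in that situation.

```r
# Hypothetical, heavily tied data: seven 1s followed by 2, 3, 4
x_tied <- c(1, 1, 1, 1, 1, 1, 1, 2, 3, 4)
brks <- quantile(x_tied, probs = 0:10 / 10, names = FALSE)

# Several deciles coincide here, so cut() signals an error
failed <- inherits(try(cut(x_tied, brks, include.lowest = TRUE), silent = TRUE),
                   "try-error")

# Possible workaround (an assumption, not part of the answer above):
# drop the repeated break points, accepting fewer groups
chunks_tied <- split(x_tied, cut(x_tied, unique(brks), include.lowest = TRUE))
lengths(chunks_tied)
```

Cutting by index does not have this problem, since it never looks at the values.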
Answer 2:
x <- 1:98
y <- split(x, ((seq(length(x))-1)*10)%/%length(x)+1)
Explanation:
seq(length(x)) = 1..98
seq(length(x))-1 = 0..97
(seq(length(x))-1)*10 = (0, 10, ..., 970)
# 98 values in total
((seq(length(x))-1)*10)%/%length(x) = (0, ..., 0, 1, ..., 1, ..., 9, ..., 9)
# each number covers about 10% of the values, 98 in total
((seq(length(x))-1)*10)%/%length(x)+1 = (1, ..., 1, 2, ..., 2, ..., 10, ..., 10)
# assigns the first ~10% of the values to group 1, the next ~10% to group 2, etc.
split(x, ((seq(length(x))-1)*10)%/%length(x)+1)
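The same steps can be wrapped in a small helper so the number of groups is a parameter. This is only a sketch: split_by_index and its argument n are names made up here for illustration, and seq_along(x) is used in place of seq(length(x)) so that an empty vector does not produce the sequence 1, 0.

```r
# Hypothetical helper: split x into (up to) n consecutive chunks by position
split_by_index <- function(x, n = 10) {
  # 0-based position, scaled to 0..(n - 1) by integer division, then shifted to 1..n
  grp <- ((seq_along(x) - 1) * n) %/% length(x) + 1
  split(x, grp)
}

x <- 1:98
chunks <- split_by_index(x)
lengths(chunks)   # sizes of the chunks; they sum to length(x)
```

Because the group ids are non-decreasing along the vector, the original order is kept within and across chunks. If x has fewer than n elements, fewer than n chunks come back.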
Answer 3:
If you have your vector as a column (named vec
) in a data frame, you can simply do something like this:
df$new_vec <- cut(df$vec, breaks = quantile(df$vec, probs = 0:10 / 10),
                  labels = 1:10, include.lowest = TRUE)
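For a self-contained illustration, here is a sketch on made-up data. The data frame df, the seed, and the final split() call are assumptions added for the example; only the cut() call comes from the answer.

```r
set.seed(1)                          # arbitrary seed, for reproducibility only
df <- data.frame(vec = rnorm(100))   # hypothetical example data

# Label each value with its decile, 1 (lowest 10%) to 10 (highest 10%)
df$new_vec <- cut(df$vec,
                  breaks = quantile(df$vec, probs = 0:10 / 10),
                  labels = 1:10, include.lowest = TRUE)

# If separate vectors are wanted, split on the new factor column
by_decile <- split(df$vec, df$new_vec)
lengths(by_decile)
```

Note that this groups by value, not by position, so it assumes the decile break points are all distinct.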
Answer 4:
If the vector is sorted, you can just create a grouping variable with the same length as the vector and split on it. In a real case it will require a little more effort, since the length of the vector may not be a multiple of 10, but for your toy example you can do:
x = 1:10
n = 2
split(x, rep(1:n, each = length(x)/n))
# $`1`
# [1] 1 2 3 4 5
# $`2`
# [1] 6 7 8 9 10
A real case example, where the vector's length is not a multiple of the number of groups:
vec = 1:13
n = 3
split(vec, sort(seq_along(vec)%%n))
# $`0`
# [1] 1 2 3 4
# $`1`
# [1] 5 6 7 8 9
# $`2`
# [1] 10 11 12 13
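Applied to the length from the question with 10 groups, the same idea might look like the sketch below. It swaps the modulo expression for rep_len(), which is a variation on the answer rather than the answer's exact code: the groups are then labelled 1 to n instead of 0 to n - 1, and any larger groups come first.

```r
vec <- 1:98928   # length taken from the question
n <- 10

# Recycle the ids 1..n over the whole vector, then sort them so that
# each id covers one consecutive block; block sizes differ by at most 1
grp <- sort(rep_len(seq_len(n), length(vec)))

chunks <- split(vec, grp)
lengths(chunks)
```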