Random sample of rows from subset of an R dataframe

Posted 2020-02-26 07:03


做个烂人


Question:

This question already has answers here:

Sample random rows in dataframe (10 answers)

Closed 6 years ago.

Is there a good way of getting a sample of rows from part of a dataframe?

If I just have data such as

gender <- c("F", "M", "M", "F", "F", "M", "F", "F")
age    <- c(23, 25, 27, 29, 31, 33, 35, 37)

then I can easily sample the ages of three of the Fs with

sample(age[gender == "F"], 3)

and get something like

[1] 31 35 29

but if I turn this data into a dataframe

mydf <- data.frame(gender, age)

I cannot use the obvious

sample(mydf[mydf$gender == "F", ], 3)
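A short sketch of why the obvious call fails: a data frame is a list of columns, so sample() treats the columns, not the rows, as the population. The data is the question's own; the tryCatch wrapper and the names females and result are added here only for illustration.

```r
gender <- c("F", "M", "M", "F", "F", "M", "F", "F")
age    <- c(23, 25, 27, 29, 31, 33, 35, 37)
mydf   <- data.frame(gender, age)

females <- mydf[mydf$gender == "F", ]

# length() of a data frame is its number of columns (2 here), not its rows
length(females)

# sample() therefore tries to draw 3 of 2 columns without replacement and
# stops with an error; tryCatch() captures the message instead of halting
result <- tryCatch(sample(females, 3), error = function(e) conditionMessage(e))
result
```

Sampling row numbers and then indexing the data frame with them, as in the longer expression in the question, avoids this.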

though I can concoct something convoluted with an absurd number of brackets like

mydf[sample((1:nrow(mydf))[mydf$gender == "F"], 3), ]

and get what I want which is something like

  gender age
7      F  35
4      F  29
1      F  23

Is there a better way that takes me less time to work out how to write?

Answer 1:

Your convoluted way is pretty much how to do it - I think all the answers will be variations on that theme.

For example, I like to generate the mydf$gender=="F" indices first:

idx <- which(mydf$gender=="F")

Then I sample from that:

mydf[ sample(idx,3), ]

So in one line (although splitting it over several lines reduces the absurd number of brackets and probably makes the code easier to understand):

mydf[ sample( which(mydf$gender=='F'), 3 ), ]

While the "wheee I'm a hacker!" part of me prefers the one-liner, the sensible part of me says that the two-liner, though longer, is much easier to understand. The choice is yours.
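One caveat with sample(idx, 3) is worth a sketch: when idx happens to contain a single number, sample() draws from 1:idx rather than from idx itself. The helper below, whose name sample_rows_where is hypothetical and not part of any package, draws positions with sample.int() instead, which sidesteps that case. It is a minimal sketch under those assumptions, not a definitive implementation.

```r
gender <- c("F", "M", "M", "F", "F", "M", "F", "F")
age    <- c(23, 25, 27, 29, 31, 33, 35, 37)
mydf   <- data.frame(gender, age)

# Hypothetical helper: sample n rows among those where `keep` is TRUE
sample_rows_where <- function(df, keep, n) {
  idx <- which(keep)
  stopifnot(n <= length(idx))
  # sample.int() draws positions within idx, so a length-one idx is handled
  # correctly; sample(idx, n) would instead draw from 1:idx in that case
  df[idx[sample.int(length(idx), n)], , drop = FALSE]
}

set.seed(1)  # only to make the draw repeatable
picked <- sample_rows_where(mydf, mydf$gender == "F", 3)
picked

# A subset with a single matching row still returns that row
one_row <- sample_rows_where(mydf, mydf$age == 33, 1)
one_row
```

The stopifnot() guard is a design choice: it fails early when more rows are requested than the subset contains, rather than relying on the error raised inside the sampling call.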

Answer 2:

You say you cannot use the obvious:

sample(mydf[mydf$gender == "F", ], 3)

but you could write your own function for doing it:

sample.df <- function(df, n) df[sample(nrow(df), n), , drop = FALSE]

then run it on your subset selection:

sample.df(mydf[mydf$gender == "F", ], 3)
#   gender age
# 5      F  31
# 4      F  29
# 1      F  23

(Personally I find sample.df(subset(mydf, gender == "F"), 3) easier to read.)

Answer 3:

This is now simpler with the enhanced version of sample in my package:

library(devtools); install_github("krlmlr/kimisc")

library(kimisc)
sample.rows(subset(mydf, gender == "F"), 3)

See also this related answer for more detail.

Tags: r dataframe sample
