If I have a data frame like the following:
v2 <- c(4.5, 2.5, 3.5, 5.5, 7.5, 6.5, 2.5, 1.5, 3.5)
v1 <- c(2.2, 3.2, 1.2, 4.2, 2.2, 3.2, 2.2, 1.2, 5.2)
lvl <- c("a","a","a","b","b","b","c","c","c")
d <- data.frame(v1,v2,lvl)
> d
v1 v2 lvl
1 2.2 4.5 a
2 3.2 2.5 a
3 1.2 3.5 a
4 4.2 5.5 b
5 2.2 7.5 b
6 3.2 6.5 b
7 2.2 2.5 c
8 1.2 1.5 c
9 5.2 3.5 c
Within each level of d$lvl
, I want to extract the row with value of d$v1
being median (for the simplest case, each level of d$lvl
has three rows). So I want to get:
v1 v2 l
1 2.2 4.5 a
6 3.2 6.5 b
7 2.2 2.5 c
For groups with odd number of rows this works. You need to think of how to handle groups with even number of rows. For example you may wish to round the median in one or the other direction, see ?round
.
library(plyr)
d2 <- ddply(.data = d, .variables = .(lvl), function(x)
x[which(x$v1 == median(x$v1)), ])
# v1 v2 lvl
# 1 2.2 4.5 a
# 2 3.2 6.5 b
# 3 2.2 2.5 c
There are a couple of ways you can do this:
Check out the plyr
package, which is really useful for operating on subsets of data:
library(plyr)
ddply(d, .(lvl), summarize, v1 = median(v1), v2 = median(v2))
Or if you're comfortable with SQL
queries, you can use the sqldf
package:
library(sqldf)
sqldf("SELECT median(v1) as v1, median(v2) as v2, lvl FROM d GROUP BY lvl")
First, calculate the median of v1 by lvl with the function ddply (rounded with 1 decimal)
(install.packages("plyr")
df <- ddply(d, .(lvl), summarize, v1 = round(median(v1),1))
Second, merge the original df (d) with the calculated one (df), the merge compares where lvl and v1 are the same in the original data (d) and only takes those rows
df1 <- merge(df, d, by = c("lvl","v1"))
View(df1)
lvl v1 v2
1 a 2.2 4.5
2 b 3.2 6.5
3 c 2.2 2.5
I like to present a approach that is working with odd and even numbers of rows:
## example data
v2 <- c(4.5, 2.5, 3.5, 5.5, 7.5, 6.5, 2.5, 1.5, 3.5, 1, 1, 1, 1)
v1 <- c(2.2, 3.2, 1.2, 4.2, 2.2, 3.2, 2.2, 1.2, 5.2, 1.5, 2.5, 3.5, 4.5)
lvl <- c("a","a","a","b","b","b","c","c","c", "d", "d", "d", "d")
d <- data.frame(v1,v2,lvl)
## define own median index function
medIdx <- function(x) {
n <- length(x)
## even: p == n/2
## odd: p == (n+1)/2
p <- ceiling(n/2)
return(which(x == sort(x, partial=p)[p])[1])
}
## run blockwise (blocks defined by d$lvl) and bind results
do.call(rbind, by(d, INDICES=d$lvl, FUN=function(x){ return(x[medIdx(x$v1), ]) }))
# v1 v2 lvl
#a 2.2 4.5 a
#b 3.2 6.5 b
#c 2.2 2.5 c
#d 2.5 1.0 d