I would like to select a subset of elements from a whole that satisfies certain conditions. There are about 20 elements, each having multiple attributes. I would like to select five elements that offer the least discrepancy from a fixed criterion on one attribute and the highest average value on another attribute.
Lastly, I would like to apply the function over multiple sets of 20 elements.
Thus far, I have been able to identify the subsets "by hand," but I'd like to be able to return the index of the values in addition to returning the values themselves.
Objectives:
I would like to find the set of five values for X1 that are the least discrepant from a fixed value (55), and provide the largest value for the average of X2.
I would like to do this for multiple sets.
##### generating example data
##### this has five groups, each with two variables x1 and x2
set.seed(271828)
grp <- gl(5,20)
x1 <- round(rnorm(100,45, 12), digits=0)
x2 <- round(rbeta(100,2,4), digits = 2)
id <- seq(1,100,1)
##### this is how the data would arrive for me to analyze
dat <- as.data.frame(cbind(id,grp,x1,x2))
The data would arrive in this format, with id as a unique identifier for each element.
##### pulling out the first group for demonstration
dat.grp.1 <- dat[ which(grp == 1), ]
crit <- 55
x <- t(combn(dat.grp.1$x1, 5))
y <- t(combn(dat.grp.1$x2, 5))
mean.x <- rowMeans(x)
mean.y <- rowMeans(y)
k <- (mean.x - crit)^2
out <- cbind(x, mean.x, k, y, mean.y)
##### finding the sets with the least amount of discrepancy
pick <- out[ which(k == min(k)), ]
pick
##### finding the sets with low discrepancy and high values of y (means of X2) by "hand"
sorted <- out[order(k), ]
head(sorted, n=20)
With respect to the values in pick, I can see that the values of X1 are:
> pick
                    mean.x  k                          mean.y
[1,] 55 47 48 48 52     50 25 0.62 0.08 0.31 0.18 0.54  0.346
[2,] 55 48 48 47 52     50 25 0.62 0.31 0.18 0.48 0.54  0.426
I would like to return the id value for these elements, so that I know to pick elements 3, 8, 10, 11, and 18 (choosing set 2, since the discrepancy k is the same but the mean of y is higher).
> dat.grp.1
id grp x1 x2
1 1 1 45 0.12
2 2 1 27 0.34
3 3 1 55 0.62
4 4 1 39 0.32
5 5 1 41 0.18
6 6 1 29 0.47
7 7 1 47 0.08
8 8 1 48 0.31
9 9 1 35 0.48
10 10 1 48 0.18
11 11 1 47 0.48
12 12 1 31 0.29
13 13 1 39 0.15
14 14 1 36 0.54
15 15 1 36 0.20
16 16 1 38 0.40
17 17 1 30 0.31
18 18 1 52 0.54
19 19 1 44 0.37
20 20 1 31 0.20
Doing this "by hand" works for now, but it would be good to make this as "hands-off" as possible.
Any help is greatly appreciated.
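You are almost there. You can change your definition of sorted so that it orders by k and breaks ties by the largest mean.y. Then sorted[1,] (or, if you prefer, sorted[1,,drop=FALSE]) is your selected set.

If you want the indexes rather than (or in addition to) the points, you can include them earlier. Replace the two combn calls that build x and y with an idx matrix of row-index combinations, use idx to look up the x1 and x2 values, and include idx in out later.

Putting it all together: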
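The steps above could be sketched roughly as follows. This is a reconstruction, not the original listing: the tie-break order(k, -mean.y) is an assumption based on the stated goal (smallest k, then largest mean of x2), and the name best is only illustrative.

```r
# Example data, generated as in the question
set.seed(271828)
grp <- gl(5, 20)
x1 <- round(rnorm(100, 45, 12), digits = 0)
x2 <- round(rbeta(100, 2, 4), digits = 2)
id <- seq(1, 100, 1)
dat <- as.data.frame(cbind(id, grp, x1, x2))

dat.grp.1 <- dat[which(dat$grp == 1), ]
crit <- 55

# Every combination of 5 row positions, one combination per row of idx
idx <- t(combn(seq_len(nrow(dat.grp.1)), 5))

# Look up x1 and x2 for each combination; apply() returns one column
# per row of idx, hence the transpose
x <- t(apply(idx, 1, function(i) dat.grp.1[i, "x1"]))
y <- t(apply(idx, 1, function(i) dat.grp.1[i, "x2"]))

mean.x <- rowMeans(x)
mean.y <- rowMeans(y)
k <- (mean.x - crit)^2

out <- cbind(idx, x, mean.x, k, y, mean.y)

# Smallest k first; ties broken by the largest mean.y (assumed tie-break)
sorted <- out[order(k, -mean.y), ]
best <- sorted[1, , drop = FALSE]
best

# Map the selected row positions (first 5 columns) back to id values
dat.grp.1$id[best[1, 1:5]]
```

The first five columns of best are row positions within dat.grp.1, which is why the last line maps them back to id.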
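which gives, for the group 1 data shown in the question, the second set in pick: the elements with id 3, 8, 10, 11, and 18.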
EDIT: A description of applying over idx was requested; I want more options than what I can do in a comment, so I'm adding it to my answer. I will also address looping over subsets.

idx is a matrix (15504 x 5), each row of which is a set of five indexes into the data frame. apply allows going through it row by row (rows are margin 1) to do something with each row. That something is to take the values and use them to index the desired rows of dat.grp.1 and pull out the corresponding x1 values. I could have written dat.grp.1[i,"x1"] as dat.grp.1$x1[i]. Each row of idx becomes a column and the results of indexing into dat.grp.1 are the rows, so the whole thing needs to be transposed.

You can break the loop apart to see how each step works if you like. Make the function into a named (non-anonymous) function and pass one row of idx at a time to it. The results are what get bundled into x.
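As a small illustration of taking the loop apart, the sketch below uses the x1 values of group 1 exactly as listed in the question. The function name get.x1 is made up for this example.

```r
# x1 values of group 1, copied from the listing in the question
dat.grp.1 <- data.frame(
  id = 1:20,
  x1 = c(45, 27, 55, 39, 41, 29, 47, 48, 35, 48,
         47, 31, 39, 36, 36, 38, 30, 52, 44, 31)
)

idx <- t(combn(seq_len(nrow(dat.grp.1)), 5))

# Named version of the anonymous function (get.x1 is an illustrative name)
get.x1 <- function(i) dat.grp.1[i, "x1"]

idx[1, ]          # 1 2 3 4 5
get.x1(idx[1, ])  # 45 27 55 39 41
get.x1(idx[2, ])  # rows 1 2 3 4 6, so 45 27 55 39 29

# apply() stacks each result as a column, so transpose to get
# one row per combination
x <- t(apply(idx, 1, get.x1))
dim(x)            # 15504 5
```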
As for looping over subsets, the plyr library is very handy for this. The way you have set it up (assign the subset of interest to a variable and work with that) makes the transformation easy. Everything you do to create the answer for one subset goes into a function with that subset as a parameter. This is basically what you had before, but getting rid of some unnecessary assignments.
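A function along these lines might look like the sketch below. The name get.best.subset, its arguments, and the choice to return the selected rows of the data frame are all assumptions made for illustration; the tie-break on mean.y is likewise inferred from the question. The data are the group 1 values as listed in the question.

```r
# Sketch: pick the n rows of d whose mean x1 is closest to crit,
# breaking ties by the largest mean x2 (names and defaults are illustrative)
get.best.subset <- function(d, crit = 55, n = 5) {
  idx <- t(combn(seq_len(nrow(d)), n))
  x <- t(apply(idx, 1, function(i) d[i, "x1"]))
  y <- t(apply(idx, 1, function(i) d[i, "x2"]))
  mean.x <- rowMeans(x)
  mean.y <- rowMeans(y)
  k <- (mean.x - crit)^2
  best <- order(k, -mean.y)[1]   # position of the winning combination
  d[idx[best, ], ]               # return the selected rows, ids included
}

# Group 1 as listed in the question
dat.grp.1 <- data.frame(
  id  = 1:20,
  grp = 1,
  x1  = c(45, 27, 55, 39, 41, 29, 47, 48, 35, 48,
          47, 31, 39, 36, 36, 38, 30, 52, 44, 31),
  x2  = c(0.12, 0.34, 0.62, 0.32, 0.18, 0.47, 0.08, 0.31, 0.48, 0.18,
          0.48, 0.29, 0.15, 0.54, 0.20, 0.40, 0.31, 0.54, 0.37, 0.20)
)

res <- get.best.subset(dat.grp.1)
res$id   # 3 8 10 11 18
```

Returning rows of the data frame rather than a row of out keeps the id column attached, which is what the question asks for.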
Now wrap this in a plyr call, which applies the function to each group and combines the results.
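One way the wrapping could look is sketched here, assuming plyr is installed and using ddply to split on grp. As before, get.best.subset is an illustrative name and its return format (the five selected rows per group) is an assumption, not necessarily the original layout.

```r
library(plyr)  # assumed to be installed

# Example data, generated as in the question
set.seed(271828)
grp <- gl(5, 20)
x1 <- round(rnorm(100, 45, 12), digits = 0)
x2 <- round(rbeta(100, 2, 4), digits = 2)
id <- seq(1, 100, 1)
dat <- as.data.frame(cbind(id, grp, x1, x2))

# Per-group selection (illustrative name; tie-break on mean x2 is assumed)
get.best.subset <- function(d, crit = 55, n = 5) {
  idx <- t(combn(seq_len(nrow(d)), n))
  x <- t(apply(idx, 1, function(i) d[i, "x1"]))
  y <- t(apply(idx, 1, function(i) d[i, "x2"]))
  k <- (rowMeans(x) - crit)^2
  best <- order(k, -rowMeans(y))[1]
  d[idx[best, ], ]
}

# Split dat by grp, run the function on each piece, and bind the results
res <- ddply(dat, .(grp), get.best.subset)
res
```

Extra arguments such as crit can be passed through ddply after the function, for example ddply(dat, .(grp), get.best.subset, crit = 55).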
I don't know that that is the best format for your results, but it mirrors the example you gave.