I have some data which looks like this (fake data for example's sake):
dressId color
6 yellow
9 red
10 green
10 purple
10 yellow
12 purple
12 red
where color is a factor vector. It is not guaranteed that all possible levels of the factor actually appear in the data (e.g. the color "blue" could also be one of the levels).
I need a list of vectors which groups the available colors of each dress:
[[1]]
yellow
[[2]]
red
[[3]]
green purple yellow
[[4]]
purple red
Preserving the IDs of the dresses would be nice (e.g. a dataframe where this list is the second column and the IDs are the first), but not necessary.
I wrote a loop which goes through the dataframe row by row, and while the next ID is the same, it adds the color to a vector (I am sure that the data is sorted by ID). When the ID in the first column changes, it adds the vector to a list:
result <- NULL
while(blah blah)
{
some code which creates the vector called "colors"
result[[dressCounter]] <- colors
dressCounter <- dressCounter + 1
}
After wrestling with getting all the necessary counting variables correct, I found out to my dismay that it doesn't work. The first time, colors is
[1] yellow
Levels: green yellow purple red blue
and it gets coerced into an integer, so result becomes 2.
In the second loop repetition, colors only contains red, and result becomes a simple integer vector, [1] 2 4.
In the third repetition, colors is now a longer vector,
[1] green purple yellow
Levels: green yellow purple red blue
and I get
result[[3]] <- colors
Error in result[[3]] <- colors :
more elements supplied than there are to replace
What am I doing wrong? Is there a way to initialize result so it doesn't get converted into a numeric vector, but becomes a list of vectors?
Also, is there another way to do the whole thing than "roll my own"?
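On the initialization question, here is a minimal sketch. It assumes the cause is that result starts out as NULL, so the first [[<- assignment of a length-one factor creates an atomic vector instead of a list. Starting from an empty list() avoids that. The loop body below is a simplified stand-in for the elided code in the question, and the names ids and i are illustrative only.

```r
# Sample data from the question; "blue" is a level that never occurs
d <- data.frame(
  dressId = c(6, 9, 10, 10, 10, 12, 12),
  color = factor(c("yellow", "red", "green", "purple", "yellow", "purple", "red"),
                 levels = c("green", "yellow", "purple", "red", "blue"))
)

# Start from an empty list rather than NULL, so that [[<- stores each
# factor vector as one list element instead of coercing it
result <- list()

ids <- unique(d$dressId)  # hypothetical helper: one entry per dress
for (i in seq_along(ids)) {
  colors <- d$color[d$dressId == ids[i]]  # all colors of the i-th dress
  result[[i]] <- colors
}
names(result) <- ids  # keep the dress IDs as list names
```

Each element of result is still a factor with the full set of levels, so an unused level such as "blue" is preserved.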
split.data.frame is a good way to organize this; then extract the color component.
d <- data.frame(dressId=c(6,9,10,10,10,12,12),
color=factor(c("yellow","red","green",
"purple","yellow",
"purple","red"),
levels=c("red","orange","yellow",
"green","blue","purple")))
I think the version you want is actually this:
ss <- split.data.frame(d,d$dressId)
You can get something more like the list you requested by extracting the color component:
lapply(ss,"[[","color")
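As a shorter variant (a sketch, not part of the code above), split also works on the factor itself, which gives the list of color vectors in one step. The list names carry the dress IDs, and they can be turned into a data frame with a list column if that layout is preferred. The names byDress and res are illustrative only.

```r
# Same sample data as above
d <- data.frame(
  dressId = c(6, 9, 10, 10, 10, 12, 12),
  color = factor(c("yellow", "red", "green", "purple", "yellow", "purple", "red"),
                 levels = c("red", "orange", "yellow", "green", "blue", "purple"))
)

# Split the factor directly; one list element per dress ID
byDress <- split(d$color, d$dressId)

# Optional: one row per dress, with the colors stored as a list column
res <- data.frame(dressId = as.numeric(names(byDress)))
res$color <- unname(byDress)
```

Individual groups can then be looked up by ID, for example byDress[["10"]].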
In addition to split, you should consider aggregate. Use c() or I() as the aggregation function to get your list column (here mydf is the question's data with color stored as a character column):
out <- aggregate(color ~ dressId, mydf, c)
out
# dressId color
# 1 6 yellow
# 2 9 red
# 3 10 green, purple, yellow
# 4 12 purple, red
str(out)
# 'data.frame': 4 obs. of 2 variables:
# $ dressId: int 6 9 10 12
# $ color :List of 4
# ..$ 0: chr "yellow"
# ..$ 1: chr "red"
# ..$ 2: chr "green" "purple" "yellow"
# ..$ 3: chr "purple" "red"
out$color
# $`0`
# [1] "yellow"
#
# $`1`
# [1] "red"
#
# $`2`
# [1] "green" "purple" "yellow"
#
# $`3`
# [1] "purple" "red"
Note: This works even if the "color" variable is a factor, as in Ben's sample data (I missed that point when I posted the answer above), but you need to use I() as the aggregation function instead of c():
out <- aggregate(color ~ dressId, d, I)
str(out)
# 'data.frame': 4 obs. of 2 variables:
# $ dressId: num 6 9 10 12
# $ color :List of 4
# ..$ 0: Factor w/ 6 levels "red","orange",..: 3
# ..$ 1: Factor w/ 6 levels "red","orange",..: 1
# ..$ 2: Factor w/ 6 levels "red","orange",..: 4 6 3
# ..$ 3: Factor w/ 6 levels "red","orange",..: 6 1
out$color
# $`0`
# [1] yellow
# Levels: red orange yellow green blue purple
#
# $`1`
# [1] red
# Levels: red orange yellow green blue purple
#
# $`2`
# [1] green purple yellow
# Levels: red orange yellow green blue purple
#
# $`3`
# [1] purple red
# Levels: red orange yellow green blue purple
Strangely, however, the default display shows the integer values:
out
# dressId color
# 1 6 3
# 2 9 1
# 3 10 4, 6, 3
# 4 12 6, 1
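If the integer codes in the printed output are a nuisance, one possible workaround (a sketch, assuming the factor levels are no longer needed afterwards) is to convert each element of the list column to character once the aggregation is done:

```r
# Ben's sample data, with color as a factor
d <- data.frame(
  dressId = c(6, 9, 10, 10, 10, 12, 12),
  color = factor(c("yellow", "red", "green", "purple", "yellow", "purple", "red"),
                 levels = c("red", "orange", "yellow", "green", "blue", "purple"))
)

out <- aggregate(color ~ dressId, d, I)

# Replace each factor in the list column by its labels;
# note that this drops the information about unused levels
out$color <- lapply(out$color, as.character)
```

The column is still a list with one element per dress, but each element is now a plain character vector.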
Assuming your data frame is saved in a variable called df, you can simply use group_by and summarize from the dplyr package, with list as the summary function:
library('dplyr')
df %>%
group_by(dressId) %>%
summarize(colors = list(color))
Applied to your example:
df <- tribble(
~dressId, ~color,
6, 'yellow',
9, 'red',
10, 'green',
10, 'purple',
10, 'yellow',
12, 'purple',
12, 'red'
)
df %>%
group_by(dressId) %>%
summarize(colors = list(color))
# dressId colors
# 6 yellow
# 9 red
# 10 green, purple, yellow
# 12 purple, red