Suppose I have some count data that looks like this:
library(tidyr)
library(dplyr)
X.raw <- data.frame(
x = as.factor(c("A", "A", "A", "B", "B", "B")),
y = as.factor(c("i", "ii", "ii", "i", "i", "i")),
z = 1:6)
X.raw
# x y z
# 1 A i 1
# 2 A ii 2
# 3 A ii 3
# 4 B i 4
# 5 B i 5
# 6 B i 6
I'd like to tidy and summarise like this:
X.tidy <- X.raw %>% group_by(x,y) %>% summarise(count=sum(z))
X.tidy
# Source: local data frame [3 x 3]
# Groups: x
#
# x y count
# 1 A i 1
# 2 A ii 5
# 3 B i 15
I know that for x=="B" and y=="ii" we have an observed count of zero, rather than a missing value, i.e. the field worker was actually there, but because there wasn't a positive count, no row was entered into the raw data. I can add the zero count explicitly by doing this:
X.fill <- X.tidy %>% spread(y, count, fill=0) %>% gather(y, count, -x)
X.fill
# Source: local data frame [4 x 3]
#
# x y count
# 1 A i 1
# 2 B i 15
# 3 A ii 5
# 4 B ii 0
But that seems a bit of a roundabout way of doing things. Is there a cleaner idiom for this?
Just to clarify: my code already does what I need it to do, using spread then gather, so what I'm interested in is finding a more direct route within tidyr and dplyr.
You can use tidyr's expand to make all combinations of the factor levels, and then left_join the summary onto them. Then you may keep the values as NAs or replace them with 0 or any other value. This isn't a complete solution to the problem either, but it's faster and more RAM-friendly than spread & gather.
You could explicitly make all possible combinations and then join them with the tidy summary.
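The expand-then-join idea can be sketched roughly as follows. This is not the original answer's code (which did not survive); the names all.combos and X.fill are illustrative, the `.groups` argument assumes dplyr 1.0 or later, and replacing NA with 0L via coalesce is just one way to fill the gaps.

```r
library(dplyr)
library(tidyr)

X.raw <- data.frame(
  x = factor(c("A", "A", "A", "B", "B", "B")),
  y = factor(c("i", "ii", "ii", "i", "i", "i")),
  z = 1:6
)

# Summarise as in the question; .groups = "drop" (assumes dplyr >= 1.0)
# returns an ungrouped result.
X.tidy <- X.raw %>%
  group_by(x, y) %>%
  summarise(count = sum(z), .groups = "drop")

# Every combination of the factor levels of x and y (2 x 2 = 4 rows).
all.combos <- X.raw %>% expand(x, y)

# Join the summary onto the full grid; unobserved combinations come back
# as NA and are then replaced with an explicit zero.
X.fill <- all.combos %>%
  left_join(X.tidy, by = c("x", "y")) %>%
  mutate(count = coalesce(count, 0L))
```

Leaving out the final mutate keeps the unobserved combinations as NA, if that is preferred.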
plyr has the functionality you're looking for, but dplyr doesn't (yet), so you need some extra code to include the zero-count groups, as shown by @momeara. Also see this question. In plyr::ddply you just add .drop=FALSE to keep zero-count groups in the final result.
Since dplyr 0.8 you can do it by setting the parameter .drop = FALSE in group_by.
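As a minimal sketch of the group_by route, assuming dplyr 0.8 or later and that x and y are factors (with `.drop = FALSE`, empty groups are kept only for factor levels), something like this should work on the question's data:

```r
library(dplyr)

X.raw <- data.frame(
  x = factor(c("A", "A", "A", "B", "B", "B")),
  y = factor(c("i", "ii", "ii", "i", "i", "i")),
  z = 1:6
)

# .drop = FALSE keeps groups for factor-level combinations that never occur
# in the data; sum() over an empty group is 0, so B/ii gets a zero count.
X.fill <- X.raw %>%
  group_by(x, y, .drop = FALSE) %>%
  summarise(count = sum(z)) %>%
  ungroup()
```

If x or y were character columns, they would need converting to factors first, since there would be no unused levels to keep.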
The complete function from tidyr is made for just this situation.
You could use it in two ways. First, you could use it on the original dataset before summarizing, "completing" the dataset with all combinations of x and y, and filling z with 0 (you could also use the default NA fill and use na.rm = TRUE in sum).
You can also use complete on your pre-summarized dataset. Note that complete respects grouping. X.tidy is grouped, so you can either ungroup and complete the dataset by x and y, or just list the variable you want completed within each group, in this case y.
The result is the same for each option.
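The options described above might look roughly like this. It is a sketch rather than the answer's original code; the names X.fill1, X.fill2 and X.fill3 are illustrative, and 0L is used as the fill value so that the columns stay integer.

```r
library(dplyr)
library(tidyr)

X.raw <- data.frame(
  x = factor(c("A", "A", "A", "B", "B", "B")),
  y = factor(c("i", "ii", "ii", "i", "i", "i")),
  z = 1:6
)

# Option 1: complete the raw data first, filling z with 0, then summarise.
X.fill1 <- X.raw %>%
  complete(x, y, fill = list(z = 0L)) %>%
  group_by(x, y) %>%
  summarise(count = sum(z)) %>%
  ungroup()

# The summary from the question; after summarise() it is still grouped by x.
X.tidy <- X.raw %>%
  group_by(x, y) %>%
  summarise(count = sum(z))

# Option 2a: ungroup, then complete over both x and y.
X.fill2 <- X.tidy %>%
  ungroup() %>%
  complete(x, y, fill = list(count = 0L))

# Option 2b: keep the grouping by x and complete only y within each group.
X.fill3 <- X.tidy %>%
  complete(y, fill = list(count = 0L)) %>%
  ungroup()
```

In each case the B/ii combination should appear with a count of 0 alongside the three observed combinations.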