Can dplyr perform chained summarise operations on a data.frame?
My data.frame has the structure:
data_df = tbl_df(data)
data_df %.%
group_by(col_1) %.%
summarise(number_of= length(col_2)) %.%
summarise(sum_of = sum(col_3))
This causes RStudio to abort with a fatal error and an "R Session Aborted" message.
Usually, with plyr, I would include these summarise functions without problems.
UPDATE
Data are here.
Code is:
library(dplyr)
orth <- read.csv('orth0106.csv')
orth_df = tbl_df(orth)
orth_df %.%
group_by(Hospital) %.%
summarise(Procs = length(Procedure)) %.%
summarise(SSIs = sum(SSI))
I can reproduce the error on a Windows 7 machine running RStudio 0.97.551.
It may be because you're calling summarise a second time and chaining it onto something that's no longer there. You can summarise two different columns in a single call, as I've done here:
url <- "https://raw.github.com/johnmarquess/some.data/master/orth0106.csv"
library(dplyr)
orth <- read.csv(url)
orth_df <- tbl_df(orth)
orth_df %.%
group_by(Hospital) %.%
summarise(Procs = length(Procedure), SSIs = sum(SSI))
## Source: local data frame [18 x 3]
##
##    Hospital Procs SSIs
## 1         A   865   80
## 2         B  1069   38
## 3         C   796   24
## 4         D   891   35
## 5         E   997   39
## 6         F   550   30
## 7         G  2598  128
## 8         H   373   27
## 9         I  1079   70
## 10        J   714   30
## 11        K   477   30
## 12        L   227    2
## 13        M   125    6
## 14        N   589   38
## 15        O   292    3
## 16        P   149    9
## 17        Q  1984   52
## 18        R   351   13
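For readers without the CSV, here is a minimal sketch of the same single-call pattern on a made-up data frame. The `toy` data and its values are hypothetical and are not taken from orth0106.csv. The sketch also assumes a later dplyr release, in which `%>%` replaced `%.%`; `n()` is dplyr's row-count helper and gives the same count as `length(Procedure)` here.

```r
library(dplyr)

# Hypothetical toy data standing in for orth0106.csv
toy <- data.frame(
  Hospital  = c("A", "A", "A", "B", "B"),
  Procedure = c("hip", "knee", "hip", "hip", "knee"),
  SSI       = c(1, 0, 1, 0, 0)
)

# Both summaries are computed in one summarise() call, while the
# grouped data still contains every original column.
res <- toy %>%
  group_by(Hospital) %>%
  summarise(Procs = n(), SSIs = sum(SSI))

print(res)
```

Because both expressions are evaluated against the grouped input, neither depends on a column that an earlier summarise has already dropped.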
In any event, this seems like either an RStudio or a dplyr bug. I'd open an issue with Hadley, as he probably cares either way: https://github.com/hadley/dplyr/issues
EDIT: This (your first call) also causes Rgui (Windows) and the terminal to crash on:
R version 3.0.2 (2013-09-25)
Platform: i386-w64-mingw32/i386 (32-bit)
This indicates a dplyr problem that Hadley and Romain will want to know about.
To illustrate my first point, run:
orth_df %.%
group_by(Hospital) %.%
summarise(Procs = length(Procedure))
Source: local data frame [18 x 2]
   Hospital Procs
1         A   865
2         B  1069
3         C   796
4         D   891
5         E   997
6         F   550
7         G  2598
8         H   373
9         I  1079
10        J   714
11        K   477
12        L   227
13        M   125
14        N   589
15        O   292
16        P   149
17        Q  1984
18        R   351
Where is %.% summarise(SSIs = sum(SSI)) supposed to find SSI?
So the chaining you think is happening fails. To my understanding, %.% is similar to how ggplot2 works, but not exactly the same. In ggplot2, once you pass the data in the initial mapping, you can access it later on. Here, %.% seems to grab the left chunk and operate only on that.
So you're grabbing:
   Hospital Procs
1         A   865
2         B  1069
3         C   796
.
.
.
17        Q  1984
18        R   351
when you use %.% summarise(SSIs = sum(SSI)), and there is no SSI to be gotten. The analogy that comes to mind is wiring Christmas lights in series vs. in parallel: %.% is serial, ggplot() + is parallel. This is a nonprogrammer's understanding of things, and the R gurus may come and tell me I'm stupid, but for now that's the best theory you've got.
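The "serial" behaviour described above can be checked by looking at what the first summarise hands to the next step. This is only a sketch on hypothetical toy data (not the original CSV), and it assumes a later dplyr release where `%>%` replaced `%.%`. In such versions the second call should fail with an ordinary R error rather than crash the session.

```r
library(dplyr)

# Hypothetical toy data, not the original orth0106.csv
toy <- data.frame(
  Hospital  = c("A", "A", "B"),
  Procedure = c("hip", "knee", "hip"),
  SSI       = c(1, 0, 0)
)

# First link in the chain: one row per hospital
step1 <- toy %>%
  group_by(Hospital) %>%
  summarise(Procs = n())

names(step1)             # "Hospital" "Procs"
"SSI" %in% names(step1)  # FALSE

# The second summarise() only sees step1, which has no SSI column,
# so the attempt is wrapped in tryCatch() to capture the failure.
second <- tryCatch(
  summarise(step1, SSIs = sum(SSI)),
  error = function(e) "error"
)
```

Each step receives only the result of the step to its left, which is why both summaries have to be requested while SSI is still present.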