dplyr masks GGally and breaks ggparcoord

Posted 2019-04-06 04:20

Question:

In a fresh session, running a small ggparcoord(.) example taken from the function's documentation

library(GGally)

data(diamonds, package="ggplot2")
diamonds.samp <- diamonds[sample(1:dim(diamonds)[1], 100), ]
ggparcoord(data = diamonds.samp, columns = c(1, 5:10))

produces the expected parallel coordinate plot.

Starting again in a fresh session and running the same script with dplyr loaded

library(GGally)
library(dplyr)

data(diamonds, package="ggplot2")
diamonds.samp <- diamonds[sample(1:dim(diamonds)[1], 100), ]
ggparcoord(data = diamonds.samp, columns = c(1, 5:10))

results in:

Error: (list) object cannot be coerced to type 'double'

Note that the order of the library(.) statements does not matter.

Questions

  1. Is there something wrong with the code samples?
  2. Is there a way to work around the problem (for example, via namespace functions)?
  3. Or is this a bug?

I need both dplyr and ggparcoord(.) in a bigger analysis, but this minimal example reflects the problem I am facing.

Versions

  • R @ 3.2.3
  • dplyr @ 0.4.3
  • GGally @ 1.0.1
  • ggplot2 @ 2.0.0

UPDATE

To wrap up the excellent answer given by Joran:

Answers

  1. The code samples are in fact wrong: ggparcoord(.) expects a plain data.frame, but the diamonds data set is a tbl_df, and its [ behaves differently once dplyr is loaded.
  2. The problem is solved by coercing the tbl_df to a data.frame with as.data.frame(.).
  3. No, it is not a bug.

Working code sample:

library(GGally)
library(dplyr)

data(diamonds, package="ggplot2")
diamonds.samp <- diamonds[sample(1:dim(diamonds)[1], 100), ]
ggparcoord(data = as.data.frame(diamonds.samp), columns = c(1, 5:10))

Answer 1:

Converting my comments to an answer...

The GGally package here is making the reasonable assumption that using [ on a data frame should behave the way it always does and always has. However, this all being in the Hadley-verse, the diamonds data set is a tbl_df as well as a data.frame.

When dplyr is loaded, the behavior of [ is overridden such that drop = FALSE is always the default for a tbl_df. So there's a place in GGally where data[,"cut"] is expected to return a vector, but instead it returns another data frame.

Specifically, the error is thrown in your example while attempting to execute:

data[, fact.var] <- as.numeric(data[, fact.var])

Since data[,fact.var] remains a data frame, and hence a list, as.numeric won't work.
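
The difference described above can be reproduced in base R alone. The following is a minimal sketch, not GGally's actual code: the data frame df and the variable fact.var are made-up stand-ins, and drop = FALSE is passed explicitly to imitate what [ does by default for a tbl_df when dplyr is loaded.

```r
# Base R only; `df` and `fact.var` are illustrative stand-ins, not GGally internals.
df <- data.frame(cut = factor(c("Fair", "Good", "Ideal")),
                 price = c(326, 400, 550))
fact.var <- "cut"

# Default data.frame behaviour: a single column is dropped to a vector (a factor here).
as_vector <- df[, fact.var]
codes <- as.numeric(as_vector)  # level codes of the factor, as doubles

# With drop = FALSE the result stays a one-column data frame, i.e. a list.
as_frame <- df[, fact.var, drop = FALSE]

# Capture the error instead of stopping, so the script keeps running.
result <- tryCatch(as.numeric(as_frame), error = function(e) e)

print(is.factor(as_vector))       # the dropped column is a factor
print(is.data.frame(as_frame))    # the undropped column is still a data frame
print(inherits(result, "error"))  # as.numeric() failed on the data frame
```

The last check is the same failure as in the question: as.numeric receives a list rather than a vector and cannot coerce it.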

As for your conclusion that this isn't a bug, I'd say... maybe. Probably. At least there probably isn't anything the GGally package author ought to do to address it. You just have to be aware that using tbl_df's with packages not written by Hadley may break things.

As you noted, removing the extra class attributes fixes the problem, as it returns R to using the normal [ method.
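
To check whether an object carries the extra classes, and to confirm that coercion restores the usual dropping behaviour, something like the sketch below may help. It assumes the tibble package is installed (the code is skipped otherwise), and the data frame small is a made-up stand-in for the sampled diamonds rows.

```r
# Illustrative data; `small` is a made-up stand-in for the sampled diamonds rows.
small <- data.frame(cut = factor(c("Fair", "Good", "Ideal")),
                    price = c(326, 400, 550))

# Assumption: the tibble package is available; skip quietly if it is not.
if (requireNamespace("tibble", quietly = TRUE)) {
  tb <- tibble::as_tibble(small)

  # The extra classes sit in front of "data.frame".
  print(class(tb))

  # Selecting one column from the tibble still yields a data frame.
  print(is.data.frame(tb[, "cut"]))

  # Coercing strips the extra classes, so [ drops to a vector again.
  plain <- as.data.frame(tb)
  print(class(plain))
  print(is.factor(plain[, "cut"]))
}
```

Checking inherits(x, "tbl_df") before handing data to a package that expects plain data frames is one cheap way to catch this early.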



Answer 2:

Workaround: coerce the data you pass to ggparcoord with as.data.table(...), or with as.data.table(..., keep.rownames=TRUE) if you do not want to lose your rownames.

Cause: as per @joran's investigation, when dplyr is loaded, tbl_df overrides [ so that drop = FALSE.

Solution: file a pull request on GGally.