I would like to use Spark to generate the output of the combn() function for a relatively large list of inputs (200 ish) and for varying values of m (2-5); however, I am having trouble including this in spark_apply().
An MWE of my current approach (based on this):
names_df <- data.frame(name = c("Alice", "Bob", "Cat"),
                       types = c("Human", "Human", "Animal"))
combn(names_df$name, 2)
name_tbl <- sdf_copy_to(sc = sc,
                        x = names_df,
                        name = "name_table")
name_tbl %>%
  select(name) %>%
  spark_apply(function(e) combn(e, 2))
The error message output is large, but I am having trouble understanding how to use that information to refine my approach.
I expected an output such as that of the second line of the MWE. Is the problem that combn() is expecting a "vector source", which is not what I am providing via select()? Or is it that select() is not returning "An object (usually a spark_tbl) coercable to a Spark DataFrame"? Either way, is there a method I can use to achieve the desired result?
I have also tried the following, with no success:
name_tbl %>%
  select(name) %>% # removing this also doesn't work
  spark_apply(function(e) combn(e$name, 2))
EDIT: expand.grid works fine, which suggests to me that the issue is that the return value of combn cannot be coerced into a data.frame.
Working expand.grid:
name_tbl %>%
  spark_apply(function(e) expand.grid(e))
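The difference in return types between the two functions can be checked locally, without Spark. The following is a minimal sketch in base R using the same names_df as above; stringsAsFactors = FALSE is an assumption added here so that the name column is a character vector regardless of R version.

```r
# Same toy data as in the MWE; stringsAsFactors = FALSE is an added assumption.
names_df <- data.frame(name = c("Alice", "Bob", "Cat"),
                       types = c("Human", "Human", "Animal"),
                       stringsAsFactors = FALSE)

# combn() returns a matrix with m rows and one column per combination.
combn_out <- combn(names_df$name, 2)

# expand.grid() returns a data.frame with one row per combination of inputs.
grid_out <- expand.grid(names_df)

is.matrix(combn_out)     # combn() output is a matrix, not a data.frame
dim(combn_out)           # 2 rows, choose(3, 2) = 3 columns
is.data.frame(grid_out)  # expand.grid() output is already a data.frame
dim(grid_out)            # 3 * 3 = 9 rows, 2 columns
```

So expand.grid() already hands back a data.frame, while combn() hands back a matrix whose combinations run along the columns rather than the rows.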
EDIT 2:
Having read the documentation more closely, I have now also tried coercing the function's return value into a data.frame, as it says:
Your R function should be designed to operate on an R data frame. The R function passed to spark_apply expects a DataFrame and will return an object that can be cast as a DataFrame.
However, the following are also unsuccessful:
name_tbl %>%
  spark_apply(function(e) data.frame(combn(e$name, 2)))
name_tbl %>%
  select(name) %>%
  spark_apply(function(e) data.frame(combn(e, 2)))
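One way to test the "data frame in, data frame out" requirement before involving Spark is to call the candidate function directly on the local names_df. The sketch below is only an illustration: pairs_as_df and the name_1, name_2 column names are hypothetical, and the as.character() call is a defensive assumption in case the column arrives as a factor.

```r
# Hypothetical helper: takes a data frame with a `name` column and returns
# one row per combination of m names, with explicitly named columns.
pairs_as_df <- function(e, m = 2) {
  # m x choose(n, m) character matrix; as.character() guards against factors
  combos <- combn(as.character(e$name), m)
  # Transpose so that each combination becomes a row
  out <- as.data.frame(t(combos), stringsAsFactors = FALSE)
  names(out) <- paste0("name_", seq_len(m))
  out
}

names_df <- data.frame(name = c("Alice", "Bob", "Cat"),
                       types = c("Human", "Human", "Animal"))

res <- pairs_as_df(names_df, 2)
str(res)
```

If a function like this behaves as expected locally, it could then be tried as the argument to spark_apply(). Note that spark_apply() runs the function on each partition separately, so this would presumably only yield all combinations if the rows sit in a single partition.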
The problem seems to be that combn() does not work properly with factors; the code also needs named columns, as in: