How can I use spark_apply() to generate combinatio

Posted 2019-05-30 08:41

I would like to use Spark to generate the output of the combn() function for a relatively large list of inputs (around 200) and for varying values of m (2 to 5). However, I am having trouble including this in spark_apply().

An MWE of my current approach (based on this):

names_df <- data.frame(name = c("Alice", "Bob", "Cat"), 
                   types = c("Human", "Human", "Animal"))

combn(names_df$name, 2)

name_tbl <- sdf_copy_to(sc = sc,
                        x = names_df,
                        name = "name_table")

name_tbl %>%
  select(name) %>%
  spark_apply(function(e) combn(e, 2))

The error message output is large, and I am having trouble understanding how to use that information to refine my approach.

I expected an output such as that of the second line of the MWE. Is the problem that combn() is expecting a "vector source" which is not what I am providing by select()? Or is it that select is not returning "An object (usually a spark_tbl) coercable to a Spark DataFrame"? Either way, is there a method I can use to achieve the desired result?

I have also tried the following, without success:

name_tbl %>%
  select(name) %>% # removing this also doesn't work
  spark_apply(function(e) combn(e$name, 2))

EDIT: so expand.grid works fine, which suggests to me that there is some issue with the return of combn not being able to be coerced into a data.frame.

Working expand.grid:

name_tbl %>%
  spark_apply(function(e) expand.grid(e))
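
The difference in return types can be checked locally in base R, without a Spark connection. This is only a minimal sketch: the vector `x` and the variable names `combos` and `grid` are made up for illustration, and it shows only what each function returns, not how spark_apply() handles the result.

```r
# Plain character vector standing in for the name column (illustrative data)
x <- c("Alice", "Bob", "Cat")

# combn() with the default simplify = TRUE returns a matrix,
# one combination per column
combos <- combn(x, 2)

# expand.grid() returns a data.frame
grid <- expand.grid(x, stringsAsFactors = FALSE)

is.matrix(combos)      # TRUE
is.data.frame(combos)  # FALSE
is.data.frame(grid)    # TRUE
dim(combos)            # 2 rows (m) by 3 columns (number of combinations)
```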

EDIT 2:

Having read the documentation more closely, I have now also tried coercing the function's output to a data.frame, since the documentation says:

Your R function should be designed to operate on an R data frame. The R function passed to spark_apply expects a DataFrame and will return an object that can be cast as a DataFrame.

However, the following are also unsuccessful:

name_tbl %>%
  spark_apply(function(e) data.frame(combn(e$name, 2)))

name_tbl %>%
  select(name) %>%
  spark_apply(function(e) data.frame(combn(e, 2)))

1 answer
Viruses.
#2 · 2019-05-30 09:18

The problem seems to be that combn() does not work properly with factors. The code also needs named columns, as in:

name_tbl %>%
  spark_apply(
    function(e) data.frame(combn(as.character(e$name), 2)),
    names = c("1", "2", "3")
  )

# Source:   table<sparklyr_tmp_626bc0dd927> [?? x 3]
# Database: spark_connection
    `1`   `2`   `3`
  <chr> <chr> <chr>
1 Alice Alice   Bob
2   Bob   Cat   Cat
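
The layout above puts one combination in each column, so the number of column names has to match the number of combinations (choose(200, 2) is already 19900). One possible alternative, shown here only as a local base-R sketch, is to transpose the result so that each combination becomes a row and the column count stays at m. The helper `combn_rows` and the column names `item1`, `item2` are hypothetical and not part of sparklyr, and this has not been run against a cluster.

```r
# Hypothetical helper (not a sparklyr function): one combination per row.
combn_rows <- function(e, m = 2) {
  # as.character() sidesteps the factor issue described in the answer
  combos <- t(combn(as.character(e$name), m))
  out <- as.data.frame(combos, stringsAsFactors = FALSE)
  # Illustrative column names: item1, item2, ...
  names(out) <- paste0("item", seq_len(m))
  out
}

# Local stand-in for one partition, with name deliberately stored as a factor
local_df <- data.frame(name = c("Alice", "Bob", "Cat"),
                       stringsAsFactors = TRUE)

res <- combn_rows(local_df, 2)
print(res)
```

If the body of this helper were used inside the function passed to spark_apply(), the column names would presumably still need to be supplied as in the answer. Note also that spark_apply() runs the function on each partition separately, so combinations would only be formed among names that share a partition.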