sparklyr: create new column with mutate function

2019-05-18 15:30发布

I'm very surprised if this kind of problems cannot be solved with sparklyr:

iris_tbl <- copy_to(sc, aDataFrame)

# date_vector is a character vector of element
# in this format: YYYY-MM-DD (year, month, day)
for (d in date_vector) {
   ...
   aDataFrame %>% mutate(newValue=gsub("-","",d)))
   ...
}

I receive this error:

Error: org.apache.spark.sql.AnalysisException: Undefined function: 'GSUB'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 2 pos 86
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.failFunctionLookup(SessionCatalog.scala:787)
    at org.apache.spark.sql.hive.HiveSessionCatalog.lookupFunction0(HiveSessionCatalog.scala:200)
    at org.apache.spark.sql.hive.HiveSessionCatalog.lookupFunction(HiveSessionCatalog.scala:172)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$6$$anonfun$applyOrElse$39.apply(Analyzer.scala:884)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$6$$anonfun$applyOrElse$39.apply(Analyzer.scala:884)
    at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun

But with this line:

aDataFrame %>% mutate(newValue=toupper("hello"))

things work. Some help?

2条回答
地球回转人心会变
2楼-- · 2019-05-18 15:36

I would strongly recommend you read the sparklyr documentation before proceeding. In particular, you're going to want to read the section on how R is translated to SQL (http://spark.rstudio.com/dplyr.html#sql_translation). In short, a very limited subset of R functions are available for use on sparklyr dataframes, and gsub is not one of those functions (but toupper is). If you really need gsub you're going to have to collect the data in to a local dataframe, then gsub it (you can still use mutate), then copy_to back to spark.

查看更多
Melony?
3楼-- · 2019-05-18 15:39

It may be worth adding that the available documentation states:

Hive Functions

Many of Hive’s built-in functions (UDF) and built-in aggregate functions (UDAF) can be called inside dplyr’s mutate and summarize. The Languange Reference UDF page provides the list of available functions.

Hive

As stated in the documentation, a viable solution should be achievable with use of regexp_replace:

Returns the string resulting from replacing all substrings in INITIAL_STRING that match the java regular expression syntax defined in PATTERN with instances of REPLACEMENT. For example, regexp_replace("foobar", "oo|ar", "") returns 'fb.' Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\\s' is necessary to match whitespace, etc.

sparklyr approach

Considering the above it should be possible to combine sparklyr pipeline with regexp_replace to achieve effect cognate to applying gsub on the desired column. Tested code removing the - character within sparklyr in variable d could be build as follows:

aDataFrame %>% 
  mutate(clnD = regexp_replace(d, "-", "")) %>%
  # ...

where class(aDataFrame ) returns: "tbl_spark" ....

查看更多
登录 后发表回答