I would be very surprised if this kind of problem could not be solved with sparklyr:
iris_tbl <- copy_to(sc, aDataFrame)
# date_vector is a character vector of elements
# in this format: YYYY-MM-DD (year, month, day)
for (d in date_vector) {
...
aDataFrame %>% mutate(newValue = gsub("-", "", d))
...
}
I receive this error:
Error: org.apache.spark.sql.AnalysisException: Undefined function: 'GSUB'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 2 pos 86
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.failFunctionLookup(SessionCatalog.scala:787)
at org.apache.spark.sql.hive.HiveSessionCatalog.lookupFunction0(HiveSessionCatalog.scala:200)
at org.apache.spark.sql.hive.HiveSessionCatalog.lookupFunction(HiveSessionCatalog.scala:172)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$6$$anonfun$applyOrElse$39.apply(Analyzer.scala:884)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$6$$anonfun$applyOrElse$39.apply(Analyzer.scala:884)
at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun
But with this line:
aDataFrame %>% mutate(newValue=toupper("hello"))
everything works. Can anyone help?
I would strongly recommend you read the sparklyr documentation before proceeding. In particular, you're going to want to read the section on how R is translated to SQL (http://spark.rstudio.com/dplyr.html#sql_translation). In short, a very limited subset of R functions is available for use on sparklyr dataframes, and gsub is not one of those functions (but toupper is). If you really need gsub, you're going to have to collect the data into a local dataframe, then gsub it (you can still use mutate), then copy_to back to Spark.

It may be worth adding that the available documentation states that Hive functions can be called inside dplyr's mutate and summarize.
Hive
As stated in the documentation, a viable solution should be achievable with the use of regexp_replace.

sparklyr approach
Considering the above, it should be possible to combine a sparklyr pipeline with regexp_replace to achieve an effect equivalent to applying gsub to the desired column. Tested code that removes the "-" character from the variable d within sparklyr could be built as follows:
where class(aDataFrame) returns: "tbl_spark" ...
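A minimal sketch of the regexp_replace approach described above might look like the following. It assumes a local Spark installation reachable through spark_connect(master = "local"), and it treats d as a character column of the Spark table rather than a loop variable. The sample data, the table name dates_tbl, and the column name newValue are hypothetical and only serve the illustration.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally (for example via spark_install()).
sc <- spark_connect(master = "local")

# Hypothetical sample data: a character column `d` in YYYY-MM-DD format.
local_df <- data.frame(
  d = c("2017-01-15", "2017-02-20"),
  stringsAsFactors = FALSE
)

# Copy the local data frame to Spark; the result is a tbl_spark.
aDataFrame <- copy_to(sc, local_df, "dates_tbl", overwrite = TRUE)

# regexp_replace is not an R function. dplyr does not recognise it,
# so it is passed through unchanged to Spark SQL, where it is
# resolved as a built-in function.
result <- aDataFrame %>%
  mutate(newValue = regexp_replace(d, "-", "")) %>%
  collect()

spark_disconnect(sc)
```

Because the replacement runs inside Spark, the data never has to be collected into a local dataframe just to apply gsub. Note that the pattern argument is interpreted as a Java regular expression, so characters with a special meaning in regular expressions would need escaping.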