I am trying to replace some values in a Spark dataframe by using a UDF, but I keep getting the same error.
While debugging I found out that the error doesn't really depend on the dataframe I am using, nor on the function that I write. Here is an MWE featuring a simple lambda function that I can't get to execute properly. It should just modify every value in the first column by concatenating the value with itself.
from pyspark.sql.functions import udf, lit
from pyspark.sql.types import StringType

l = [('Alice', 1)]
df = sqlContext.createDataFrame(l)
df.show()
#+-----+---+
#| _1| _2|
#+-----+---+
#|Alice| 1|
#+-----+---+
df = df.withColumn("_1", udf(lambda x: lit(x + x), StringType())(df["_1"]))
df.show()
#Alice should now become AliceAlice
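To back up the claim that the function itself doesn't matter: continuing from the snippet above, I also tried a named function instead of the lambda (double_value is just a throwaway name I picked for the test), and it dies with exactly the same traceback:

def double_value(x):
    # same concatenation as the lambda above, just as a named function
    return lit(x + x)

df = df.withColumn("_1", udf(double_value, StringType())(df["_1"]))
df.show()  # fails with the same AttributeError as below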
This is the error that I get, ending in a rather cryptic "AttributeError: 'NoneType' object has no attribute '_jvm'":
File "/cdh/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/pyspark/worker.py", line 111, in main
process()
File "/cdh/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/pyspark/worker.py", line 106, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/cdh/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/pyspark/serializers.py", line 263, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/cdh/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/pyspark/sql/functions.py", line 1566, in <lambda>
func = lambda _, it: map(lambda x: returnType.toInternal(f(*x)), it)
File "<stdin>", line 1, in <lambda>
File "/cdh/opt/cloudera/parcels/CDH-5.11.1-1.cdh5.11.1.p0.4/lib/spark/python/pyspark/sql/functions.py", line 39, in _
jc = getattr(sc._jvm.functions, name)(col._jc if isinstance(col, Column) else col)
AttributeError: 'NoneType' object has no attribute '_jvm'
I am sure I am getting confused with the syntax and can't get the types right (thanks, duck typing!), but every example of withColumn with a lambda function that I have found looks just like this one.
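For what it's worth, the concatenation logic itself seems fine outside Spark; this pure-Python check (no Spark involved) behaves exactly as I would expect:

f = lambda x: x + x
print(f("Alice"))  # prints AliceAlice

So the problem must be in how I am wiring the function into udf/withColumn rather than in the function itself.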