I am writing a function which:
- takes an RDD as input
- splits the comma-separated values
- converts each row into a LabeledPoint object
- finally returns the output as a DataFrame
code:

    def parse_points(raw_rdd):
        cleaned_rdd = raw_rdd.map(lambda line: line.split(","))
        new_df = cleaned_rdd.map(lambda line: LabeledPoint(line[0], [line[1:]])).toDF()
        return new_df

    output = parse_points(input_rdd)
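For context, this is roughly how the input RDD is built; the import, the file name, and the sample line below are only illustrative, the real data is just lines with a numeric label followed by numeric features:

    # illustrative setup, not my exact code
    from pyspark.mllib.regression import LabeledPoint

    # each line in the file looks like: "1.0,2.3,4.5,6.7" (label first, then the features)
    input_rdd = sc.textFile("data.csv")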
Up to this point the code runs without any error. (I understand the transformations are lazy, so nothing is actually evaluated yet.) But as soon as I call an action on the result, e.g.

output.take(5)

or output.show(), I get the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 129.0 failed 1 times, most recent failure: Lost task 0.0 in stage 129.0 (TID 152, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
Py4JJavaError Traceback (most recent call last)
<ipython-input-100-a68c448b64b0> in <module>()
20
21 output = parse_points(raw_rdd)
---> 22 print output.show()
Please suggest what the mistake is.