I have had a similar problem before, but I am looking for a generalizable answer. I am using spark-corenlp to get sentiment scores on e-mails. Sometimes sentiment() crashes on some input (maybe it is too long, maybe it contains an unexpected character). It does not tell me that it crashed on those instances; it just returns the Column sentiment('email). Thus, when I try to show() beyond a certain point or save() my data frame, I get a java.util.NoSuchElementException, because sentiment() must have returned nothing at that row.
My initial code loads the data and applies sentiment() as shown in the spark-corenlp API:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._
import com.databricks.spark.corenlp.functions._
import sqlContext.implicits._ // enables the 'column symbol syntax

val customSchema = StructType(Array(
  StructField("contactId", StringType, true),
  StructField("email", StringType, true)
))

// Load dataframe
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("delimiter", "\t")        // Delimiter is tab
  .option("parserLib", "UNIVOCITY") // Parser, which deals better with the email formatting
  .schema(customSchema)             // Schema of the table
  .load("emails")                   // Input file

// Add sentiment analysis output to dataframe
val sent = df.select('contactId, sentiment('email).as('sentiment))
I tried to filter for null and NaN values:
val sentFiltered = sent
  .filter('sentiment.isNotNull)
  .filter(!'sentiment.isNaN)
  .filter(col("sentiment").between(0, 4))
I even tried to do it via a SQL query:
sent.registerTempTable("sent")
val test = sqlContext.sql("SELECT * FROM sent WHERE sentiment IS NOT NULL")
I don't know which input is making spark-corenlp crash. How can I find out? Failing that, how can I filter these nonexistent values out of col("sentiment")? Or should I catch the exception and skip the row? Is that even possible?
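
To make the "catch the exception" idea concrete, here is a minimal plain-Scala sketch of what I have in mind. It is not spark-corenlp code: safely and toyScore are hypothetical names of mine, and toyScore is only a stand-in that throws on empty input the way I suspect the real scorer does. As far as I can tell, the library's sentiment() is already a UDF, so I assume I would have to run the scoring logic myself inside my own function to wrap it like this.

```scala
import scala.util.Try

object SafeScoring {
  // Wrap any function so that a thrown (non-fatal) exception becomes None
  // instead of propagating to the caller.
  def safely[A, B](f: A => B): A => Option[B] =
    (a: A) => Try(f(a)).toOption

  // Hypothetical stand-in for the real scorer (NOT the spark-corenlp API):
  // throws on blank input, otherwise returns a value in 0..4.
  def toyScore(text: String): Int = {
    if (text.trim.isEmpty) throw new NoSuchElementException("no sentences found")
    text.split("\\s+").length % 5
  }

  def main(args: Array[String]): Unit = {
    val safeScore = safely(toyScore)
    val emails = Seq("hello there", "", "thanks a lot")

    // Score every input; failures show up as None rather than crashing.
    val scored = emails.map(e => e -> safeScore(e))

    // The inputs that failed can be inspected separately...
    val failed = scored.collect { case (e, None) => e }
    // ...and the rows that succeeded can be kept.
    val ok = scored.collect { case (e, Some(s)) => (e, s) }

    println(s"failed inputs: ${failed.size}, scored inputs: ${ok.size}")
  }
}
```

If something like safely(...) could be passed to org.apache.spark.sql.functions.udf, I would expect the None cases to show up as null in the column, so that the isNotNull filter above would finally have something to match, and the failing rows could be selected to see which e-mails cause the crash. I have not verified this against spark-corenlp itself.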