Spark Scala - java.util.NoSuchElementException & D

2019-09-10 04:44发布

问题:

I have had a similar problem before, but I am looking for a generalizable answer. I am using spark-corenlp to get Sentiment scores on e-mails. Sometimes, sentiment() crashes on some input (maybe it's too long, maybe it had an unexpected character). It does not tell me it crashes on some instances, and just returns the Column sentiment('email). Thus, when I try to show() beyond a certain point or save() my data frame, I get a java.util.NoSuchElementException because sentiment() must have returned nothing at that row.

My initial code is loading the data, and applying sentiment() as shown in spark-corenlp API.

       val customSchema = StructType(Array(
                        StructField("contactId", StringType, true),
                        StructField("email", StringType, true))
                        )

// Load dataframe   
val df = sqlContext.read
                        .format("com.databricks.spark.csv")
                        .option("delimiter","\t")          // Delimiter is tab
                        .option("parserLib", "UNIVOCITY")  // Parser, which deals better with the email formatting
                        .schema(customSchema)              // Schema of the table
                        .load("emails")                        // Input file


    val sent = df.select('contactId, sentiment('email).as('sentiment)) // Add sentiment analysis output to dataframe

I tried to filter for null and NaN values:

val sentFiltered = sent.filter('sentiment.isNotNull)
                .filter(!'sentiment.isNaN)
                .filter(col("sentiment").between(0,4))

I even tried to do it via SQL query:

sent.registerTempTable("sent")
val test = sqlContext.sql("SELECT * FROM sent WHERE sentiment IS NOT NULL")

I don't know what input is making the spark-corenlp crash. How can I find out? Else, how can I filter these non existing values from col("sentiment")? Or else, should I try catching the Exception and ignore the row? Is this even possible?