Spark SQL sql(“”).first().getDouble(0) give

2019-07-30 17:16发布

问题:

I have the below query which is supposed to find an average of the column values and return me the result which is a single number.

val avgVal = hiveContext.sql("select round(avg(amount), 4) from users.payment where dt between '2018-05-09' and '2018-05-09'").first().getDouble(0)

I'm facing inconsistent behavior at this statement. This often fails with below error however it gives non-NULL results when executed through Hive."

18/05/10 11:01:12 ERROR ApplicationMaster: User class threw exception: java.lang.NullPointerException: Value at index 0 in null
java.lang.NullPointerException: Value at index 0 in null
    at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:475)
    at org.apache.spark.sql.Row$class.getDouble(Row.scala:243)
    at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:192)

The reason why I use HiveContext instead of SQLContext is that the later doesn't support some of the aggregation functions which I use extensively in my code.

Could you please help me understand why this problem occurs and how to solve?

回答1:

You need to divide query and get into two parts:

var result = hiveContext.sql("select round(avg(amount), 4) from users.payment where dt between '2018-05-09' and '2018-05-09'");
var first = result.first();
if (first != null && !first.isNullAt(0)) {
var avgVal = first.getDouble(0);
}

This would avoid NPE. This would also be needed in List and array.

For insert or update query, you even need to surround with try...catch block to catch runtime exception.



回答2:

Let's analyze the case and possible reasons when this exception can be thrown.

Row row = hiveContext.sql("select info, name, desc, id from users.payment where dt between '2018-05-09' and '2018-05-09'").first();

If the value of the row above returned something like:

[null, Kevin, cash, 300]

Trying to get getDouble(0) will lead to java.lang.NullPointerException: Value at index 0 in null

You can try the following approach:

Row row = hiveContext.sql("select round(avg(amount), 4) from users.payment where dt between '2018-05-09' and '2018-05-09'").first();

if (!row.isNullAt(0))
   double d = row.getDouble(0);
else
   logger.error("Value at index zero is null");

If you'll check the sources, the library class doing opposite:

private static Object getAnyValAs(Row $this, int i) {
    if($this.isNullAt(i)) {
        throw new NullPointerException((new StringContext(scala.Predef..MODULE$.wrapRefArray((Object[])(new String[]{"Value at index ", " is null"})))).s(scala.Predef..MODULE$.genericWrapArray(new Object[]{BoxesRunTime.boxToInteger(i)})));
    } else {
        return $this.getAs(i);
    }
}