Why is predicate pushdown not working?

Question:

Program Sketch

  • I create a HiveContext hiveContext.
  • With that context, I create a DataFrame df from a JDBC relational table.
  • I register the DataFrame df via df.registerTempTable("TESTTABLE").
  • I start a HiveThriftServer2 via HiveThriftServer2.startWithContext(hiveContext).

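A minimal Scala sketch of this setup, assuming Spark 1.x APIs; the JDBC URL, credentials, and driver details are placeholders, not taken from the original post:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object ThriftServerSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pushdown-sketch"))
    val hiveContext = new HiveContext(sc)

    // Load the relational table "test" over JDBC (connection details are illustrative).
    val df = hiveContext.read.format("jdbc").options(Map(
      "url"      -> "jdbc:postgresql://dbhost:5432/mydb",
      "dbtable"  -> "test",
      "user"     -> "dbuser",
      "password" -> "secret"
    )).load()

    // Make the DataFrame queryable from the Thrift server as TESTTABLE.
    df.registerTempTable("TESTTABLE")

    // Start the JDBC/ODBC endpoint (default port 10000) backed by this context.
    HiveThriftServer2.startWithContext(hiveContext)
  }
}
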
The TESTTABLE contains 1,000,000 entries; its columns are ID (INT) and NAME (VARCHAR):

+-----+--------+
| ID  |  NAME  |
+-----+--------+
| 1   | Hello  |
| 2   | Hello  |
| 3   | Hello  |
| ... | ...    |

With Beeline I access the SQL endpoint (at port 10000) of the HiveThriftServer and perform a query, e.g.:

SELECT * FROM TESTTABLE WHERE ID='3'
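
For reference, the Beeline connection to that endpoint could be opened with something like the following (host and port are assumed defaults, not taken from the original post):

beeline -u jdbc:hive2://localhost:10000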

When I inspect the database's query log for the SQL statements that were executed, I see

/*SQL #:1000000 t:657*/ SELECT "ID","NAME" FROM test;

So no predicate pushdown happens, as the WHERE clause is missing.

Questions

This gives rise to the following questions:

  • Why is no predicate pushdown performed?
  • Can this be changed by not using registerTempTable?
  • If so, how? Or is this a known restriction of the HiveThriftServer?

Counterexample

If I create a DataFrame df via a Spark SQLContext and call

df.filter( df("ID") === 3).show()

I observe

/*SQL #:1*/SELECT "ID","NAME" FROM test WHERE ID = 3;

as expected.

Answer 1:

It's probably too late to answer. In this scenario, pushdown did not work because ID is defined as Int while the original query passes a string ('3'). Predicate pushdown requires not only the column name but also the type to match.
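
A sketch of how this can be checked from the Spark side, assuming the same df as above: compare the physical plans of a filter whose literal matches the column type with one that does not. In recent Spark versions the pushed predicate appears in the JDBC scan node of the plan (e.g. as a PushedFilters entry):

// Matching type: the Int comparison can be pushed down to the database.
df.filter(df("ID") === 3).explain(true)

// Mismatched type: comparing the INT column to a string keeps the filter in Spark,
// so the database only sees the unfiltered SELECT.
df.filter(df("ID") === "3").explain(true)

Likewise, issuing SELECT * FROM TESTTABLE WHERE ID = 3 (without quotes) through Beeline should allow the WHERE clause to be pushed down to the database.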