ExecuteSQL processor returns corrupted data

Posted 2019-08-27 07:45

Question:

I have a flow in NiFi in which I use the ExecuteSQL processor to fetch a merge of sub-partitions named dt from a Hive table. For example, my table is partitioned by sikid and dt, so I have dt=1000 under sikid=1 and dt=1000 under sikid=2. What I did is select * from my_table where dt=1000.
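
For reference, here is a minimal plain-JDBC sketch of the same query, roughly what ExecuteSQL does under the hood; the host, port, database, credentials, and column names are placeholders, not values from my setup:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class PartitionQuerySketch {
        public static void main(String[] args) throws Exception {
            // Placeholder HiveServer2 URL and credentials; the exact driver class and
            // URL options depend on whether the Apache or Simba/Cloudera driver jar is used.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://hive-host:10000/default", "user", "");
                 Statement stmt = conn.createStatement();
                 // Same shape of query as above: one dt value across all sikid partitions.
                 ResultSet rs = stmt.executeQuery("select * from my_table where dt=1000")) {
                while (rs.next()) {
                    // dt should never come back NULL here, since the predicate pins it to 1000.
                    System.out.println(rs.getObject("sikid") + " / " + rs.getObject("dt"));
                }
            }
        }
    }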

Unfortunately, what I get back from the ExecuteSQL processor is corrupted data, including rows with dt=NULL, even though the original table does not contain a single row where dt is NULL.

The DBCPConnectionPool is configured to use the HiveJDBC4 jar. I later tried the jar matching my CDH release, but that didn't fix it either.

The ExecuteSQL processor is configured as follows:

Normalize Table/Column Names: true

Use Avro Logical Types: false

Hive version: 1.1.0

CDH: 5.7.1

Any ideas what's happening? Thanks!

EDIT: Apparently the returned data includes extra rows... a few thousand of them, which is quite weird.

Answer 1:

Does HiveJDBC4 (I assume the Simba Hive driver) parse the table name off the column names? This was one place where there was an incompatibility with the Apache Hive JDBC driver: it doesn't support getTableName(), so it doesn't work with ExecuteSQL, and even if it did, when the column names are retrieved from the ResultSetMetaData they have the table name prepended, separated by a period (.). This is some of the custom code in HiveJdbcCommon (used by SelectHiveQL) versus JdbcCommon (used by ExecuteSQL).
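
A quick way to check what your driver actually reports is to inspect the ResultSetMetaData yourself. This is just a diagnostic sketch (connection details and the query are placeholders), not code from NiFi:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.ResultSetMetaData;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class ColumnNameProbe {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://hive-host:10000/default", "user", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("select * from my_table where dt=1000 limit 1")) {
                ResultSetMetaData md = rs.getMetaData();
                for (int i = 1; i <= md.getColumnCount(); i++) {
                    // With the Apache Hive driver this can come back as "my_table.dt" rather than "dt".
                    String column = md.getColumnName(i);
                    String table;
                    try {
                        // The Apache Hive driver does not support this call.
                        table = md.getTableName(i);
                    } catch (SQLException e) {
                        table = "<unsupported: " + e.getMessage() + ">";
                    }
                    System.out.println(i + ": column=" + column + ", table=" + table);
                }
            }
        }
    }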

If you're trying to use ExecuteSQL because you had trouble with the authentication method, how is that alleviated with the Simba driver? Do you specify the auth information on the JDBC URL rather than in a hive-site.xml file, for example? If you post your auth question (using SelectHiveQL) as a separate SO question and link to it here, I'll do my best to help out on that front and get you past this.



Answer 2:

Eventually it was solved by setting the Hive property hive.query.result.fileformat=SequenceFile.
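
As a sketch of how that property can be applied per session over plain JDBC (connection details are placeholders; it can also be set in hive-site.xml or passed along with the connection configuration instead):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class SequenceFileFix {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://hive-host:10000/default", "user", "");
                 Statement stmt = conn.createStatement()) {
                // Apply the fix for this session before running the query.
                stmt.execute("SET hive.query.result.fileformat=SequenceFile");
                try (ResultSet rs = stmt.executeQuery("select * from my_table where dt=1000")) {
                    while (rs.next()) {
                        System.out.println(rs.getObject("dt"));
                    }
                }
            }
        }
    }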