Question:

I have some eventlog data in HDFS that, in its raw format, looks like this:

2015-11-05 19:36:25.764 INFO    [...etc...]

An external table points to this HDFS location:

CREATE EXTERNAL TABLE `log_stage`(
  `event_time` timestamp, 
  [...])
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY '\t' 
  LINES TERMINATED BY '\n' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'

For performance, we'd like to query this in Impala. The log_stage data is inserted into a Hive/Impala Parquet-backed table by executing a Hive query: INSERT INTO TABLE log SELECT * FROM log_stage. Here's the DDL for the Parquet table:

CREATE TABLE `log`(
  `event_time` timestamp,
  [...])
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'

The problem: when queried in Impala, the timestamps are 7 hours ahead:

Hive time:   2015-11-05 19:36:25.764
Impala time: 2015-11-06 02:36:25.764

> as.POSIXct("2015-11-06 02:36:25") - as.POSIXct("2015-11-05 19:36:25")
Time difference of 7 hours

Note: The time zone on all of the servers (from /etc/sysconfig/clock) is set to "America/Denver", which is currently 7 hours behind UTC.

It seems that Impala is taking events that are already in UTC, incorrectly assuming they're in America/Denver time, and adding another 7 hours.

Do you know how to sync the times so that the Impala table matches the Hive table?

Answer 1:

Hive writes timestamps to Parquet differently from Impala. You can use the impalad flag -convert_legacy_hive_parquet_utc_timestamps to tell Impala to do the conversion on read. See the TIMESTAMP documentation for more details.

This blog post has a brief description of the issue:

When Hive stores a timestamp value into Parquet format, it converts local time into UTC time, and when it reads data out, it converts back to local time. Impala, however on the other hand, does no conversion when reads the timestamp field out, hence, UTC time is returned instead of local time.
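
To make the arithmetic concrete, here is a minimal Python sketch of the behaviour described in the quote. It is only a model, not Hive or Impala code: the function names hive_write, hive_read and impala_read are hypothetical, and a fixed UTC-7 offset stands in for America/Denver on the date in question (no DST handling).

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC-7 offset standing in for America/Denver in November 2015.
MST = timezone(timedelta(hours=-7))


def hive_write(local_value: datetime) -> datetime:
    """Model of the Hive write path: interpret the value as local time, store it as UTC."""
    return local_value.replace(tzinfo=MST).astimezone(timezone.utc).replace(tzinfo=None)


def hive_read(stored: datetime) -> datetime:
    """Model of the Hive read path: convert the stored UTC value back to local time."""
    return stored.replace(tzinfo=timezone.utc).astimezone(MST).replace(tzinfo=None)


def impala_read(stored: datetime) -> datetime:
    """Model of the Impala read path without the flag: return the stored value unchanged."""
    return stored


event = datetime(2015, 11, 5, 19, 36, 25, 764000)  # value from the raw log
stored = hive_write(event)

print("Hive time:  ", hive_read(stored))    # 2015-11-05 19:36:25.764000
print("Impala time:", impala_read(stored))  # 2015-11-06 02:36:25.764000
```

Under these assumptions the sketch reproduces the 7-hour gap from the question: the value on disk is the UTC instant, and only Hive shifts it back to local time on read.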

The impalad flag tells Impala to do the conversion when reading timestamps from Parquet files produced by Hive. The conversion adds a small cost on every read (likely minimal), so if that is an issue for you, consider writing your timestamps with Impala instead.
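
As a rough illustration of what conversion on read means, the following Python sketch models the decision described above. It is not Impala's implementation: read_timestamp and its parameters written_by_hive and convert_legacy are hypothetical names, and a fixed UTC-7 offset again stands in for America/Denver.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC-7 offset standing in for America/Denver in November 2015.
MST = timezone(timedelta(hours=-7))


def read_timestamp(stored: datetime, written_by_hive: bool, convert_legacy: bool) -> datetime:
    """Return the value a reader would show for a stored Parquet timestamp.

    Only files produced by Hive hold UTC values, so the UTC-to-local
    conversion is applied only to them, and only when the flag is on.
    """
    if written_by_hive and convert_legacy:
        return stored.replace(tzinfo=timezone.utc).astimezone(MST).replace(tzinfo=None)
    return stored


hive_value = datetime(2015, 11, 6, 2, 36, 25, 764000)     # UTC, as stored by Hive
impala_value = datetime(2015, 11, 5, 19, 36, 25, 764000)  # local, as written by Impala

print(read_timestamp(hive_value, written_by_hive=True, convert_legacy=False))
print(read_timestamp(hive_value, written_by_hive=True, convert_legacy=True))
print(read_timestamp(impala_value, written_by_hive=False, convert_legacy=True))
```

In this model the first call shows the shifted value from the question, while the second and third both yield the local time, because timestamps that were not written by Hive are left untouched.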

Answer 2:

On a related note, as of Hive v1.2, you can also disable the timezone conversion behaviour with this flag:

hive.parquet.timestamp.skip.conversion

"Current Hive implementation of parquet stores timestamps to UTC, this flag allows skipping of the conversion on reading parquet files from other tools."

This was added as part of https://issues.apache.org/jira/browse/HIVE-9482

Lastly, though not strictly a time zone issue, there is also a flag for compatibility between Spark (v1.3 and up) and Impala on Parquet files: