I have some eventlog data in HDFS that, in its raw format, looks like this:
2015-11-05 19:36:25.764 INFO [...etc...]
An external table points to this HDFS location:
CREATE EXTERNAL TABLE `log_stage`(
`event_time` timestamp,
[...])
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
For performance, we'd like to query this in Impala. The log_stage
data is inserted into a Hive/Impala Parquet-backed table by executing a Hive query: INSERT INTO TABLE log SELECT * FROM log_stage
. Here's the DDL for the Parquet table:
CREATE TABLE `log`(
`event_time` timestamp,
[...])
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
The problem: when queried in Impala, the timestamps are 7 hours ahead:
Hive time: 2015-11-05 19:36:25.764
Impala time: 2015-11-06 02:36:25.764
> as.POSIXct("2015-11-06 02:36:25") - as.POSIXct("2015-11-05 19:36:25")
Time difference of 7 hours
Note: The timezone of the servers (from /etc/sysconfig/clock
) are all set to "America/Denver", which is currently 7 hours behind UTC.
It seems that Impala is taking events that are already in UTC, incorrectly assuming they're in America/Denver time, and adding another 7 hours.
Do you know how to sync the times so that the Impala table matches the Hive table?