I am facing a very weird issue with Hive in a production environment (Cloudera 5.5) which I cannot reproduce on my local server (I don't know why): for some records I get a wrong timestamp value when inserting from the temp table into the main table, where a string such as "2017-10-21 23" is converted to the timestamp datatype (expected value "2017-10-21 23:00:00") during insertion.
example::
2017-10-21 23 -> 2017-10-21 22:00:00
2017-10-22 15 -> 2017-10-22 14:00:00
It happens very infrequently; the affected (delta) records are only about 1% of the data.
Flow::: Data in the temp table (an external table) is populated hourly using Oozie. The insert statement below is then executed hourly by an Oozie workflow to insert from the temp table into the main table (an internal ORC table) in Hive.
Flow summary::: Linux logs >> copy logs into the temp table (external Hive table) >> insert into the main Hive table.
Insert from temp table to main table:::
FROM
temp
INSERT INTO TABLE
main
PARTITION(event_date,service_id)
SELECT
from_unixtime(unix_timestamp(event_timestamp ,'yyyy-MM-dd HH'), 'yyyy-MM-dd HH:00:00.0'),
col3,
col4,
"" as partner_nw_name,
col5,
"" as country_name,
col6,
col7,
col8,
col9,
col10,
col11,
col12,
col13,
col14,
col15,
kpi_id,
col18,
col19,
col20,
col21,
col23,
col24,
col25,
from_unixtime(unix_timestamp(event_timestamp ,'yyyy-MM-dd HH'), 'yyyy-MM-dd') as event_date,
service_id;
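For reference, here is a standalone sketch (my own illustration, not part of the original workflow) of what the two conversion expressions in the SELECT should produce, using the sample value '2017-10-21 23' from above:

-- Sketch only; run in the same Hive session/timezone as the insert.
-- Expected output: 2017-10-21 23:00:00.0   and   2017-10-21
SELECT
  from_unixtime(unix_timestamp('2017-10-21 23', 'yyyy-MM-dd HH'), 'yyyy-MM-dd HH:00:00.0') AS event_timestamp_value,
  from_unixtime(unix_timestamp('2017-10-21 23', 'yyyy-MM-dd HH'), 'yyyy-MM-dd')            AS event_date_value;

Both expressions rely on unix_timestamp/from_unixtime, which parse and format the value in the server's local timezone.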
Temp Table:::
hive> desc temp;
OK
event_timestamp string
col2 int
col3 int
col4 int
col5 int
col6 string
col7 string
col8 string
col9 string
col10 string
col11 int
col12 int
col13 string
col14 string
col15 string
service_id int
kpi_id int
col18 bigint
col19 bigint
col20 bigint
col21 bigint
col22 double
col23 string
col24 int
col25 int
Time taken: 0.165 seconds, Fetched: 25 row(s)
Main Table:::
hive> desc main;
OK
event_timestamp timestamp
col3 int
col4 int
partner_nw_name string
col5 int
country_name string
col6 string
col7 string
col8 string
col9 string
col10 int
col11 int
col12 int
col13 string
col14 string
col15 string
kpi_id int
col18 bigint
col19 bigint
col20 bigint
col21 bigint
col23 double
col24 int
col25 int
event_date date
service_id int
# Partition Information
# col_name data_type comment
event_date date
service_id int
Time taken: 0.175 seconds, Fetched: 32 row(s)
If you are writing your data in Parquet format using Hive, then Hive adjusts the timestamp by the local timezone offset. For more information, please go through the links below.
There is a JIRA ticket related to that for Impala (#2716).
The Cloudera Impala Timestamp documentation is here.
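To make the timezone dependence concrete, here is a small sketch of my own (not from the linked pages) using only Hive built-ins; the same hour string can map to different epoch values depending on the timezone the HiveServer JVM runs in:

SELECT
  unix_timestamp('2017-10-21 23', 'yyyy-MM-dd HH')                AS epoch_seconds,   -- depends on the server timezone
  from_unixtime(unix_timestamp('2017-10-21 23', 'yyyy-MM-dd HH')) AS rendered_back;   -- formatted back in the same timezone

Running this on nodes configured with different timezones is one quick way to check whether the offset comes from this conversion or from how the stored timestamps are read back.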
Seems like you are adding an extra 00 for the hours place too.
Try the query sketched below; is its output what you are expecting? You can add 'yyyy-MM-dd HH:00:00.0' as the output format if needed.
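A minimal sketch of the kind of query this suggestion refers to (my reconstruction, not the original snippet), using the sample value from the question and from_unixtime's default output format:

SELECT from_unixtime(unix_timestamp('2017-10-21 23', 'yyyy-MM-dd HH'));
-- default format is 'yyyy-MM-dd HH:mm:ss', so this should return
-- 2017-10-21 23:00:00 when the server timezone does not shift the value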