Spark SQL is not converting timezone correctly [du

Using Scala 2.10.4 and spark 1.5.1 and spark 1.6

sqlContext.sql(
  """
    |select id,
    |to_date(from_utc_timestamp(from_unixtime(at), 'US/Pacific')),
    |from_utc_timestamp(from_unixtime(at), 'US/Pacific'),
    |from_unixtime(at),
    |to_date(from_unixtime(at)),
    | at
    |from events
    | limit 100
  """.stripMargin).collect().foreach(println)

Spark-Submit options: --driver-java-options '-Duser.timezone=US/Pacific'

result:

[56d2a9573bc4b5c38453eae7,2016-02-28,2016-02-27 16:01:27.0,2016-02-28 08:01:27,2016-02-28,1456646487]
[56d2aa1bfd2460183a571762,2016-02-28,2016-02-27 16:04:43.0,2016-02-28 08:04:43,2016-02-28,1456646683]
[56d2aaa9eb63bbb63456d5b5,2016-02-28,2016-02-27 16:07:05.0,2016-02-28 08:07:05,2016-02-28,1456646825]
[56d2aab15a21fa5f4c4f42a7,2016-02-28,2016-02-27 16:07:13.0,2016-02-28 08:07:13,2016-02-28,1456646833]
[56d2aac8aeeee48b74531af0,2016-02-28,2016-02-27 16:07:36.0,2016-02-28 08:07:36,2016-02-28,1456646856]
[56d2ab1d87fd3f4f72567788,2016-02-28,2016-02-27 16:09:01.0,2016-02-28 08:09:01,2016-02-28,1456646941]

The time in US/Pacific should be 2016-02-28 00:01:27 etc but some how it subtracts "8" hours twice

标签： scala apache-spark hive timezone

2条回答

ゆ、 Hurt°

2楼-- · 2020-07-06 05:41

For the record, here we convert Long values like that using an UDF.

For our purpose, we are interested in only the Date string representation of the timestamp (in ms since epoch in UTC)

val udfToDateUTC = udf((epochMilliUTC: Long) => {
  val dateFormatter = java.time.format.DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(java.time.ZoneId.of("UTC"))
  dateFormatter.format(java.time.Instant.ofEpochMilli(epochMilliUTC))
})

This way, we control the parsing as well as the rendering of the dates.

0人赞添加讨论(0) 举报

狗以群分

3楼-- · 2020-07-06 05:49

after reading for sometime following are the conclusions:

Spark-Sql doesn't support date-time, and nor timezones
Using timestamp is the only solution
from_unixtime(at) parses the epoch time correctly, just that the printing of it as a string changes it due to timezone. It is safe to assume that the from_unixtime will convert it correctly ( although printing it might show different results)
from_utc_timestamp will shift ( not just convert) the timestamp to that timezone, in this case it will subtract 8 hours to the time since (-08:00)
printing sql results messes up the times with respect to timezone param

0人赞添加讨论(0) 举报

Spark SQL is not converting timezone correctly [du

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间