I have some unix times that I convert to timestamps in sparklyr
and for some reasons I also need to convert them into strings.
Unfortunately, it seems that during the conversion to string hive
converts to EST (my locale).
df_new <- spark_read_parquet(sc, "/mypath/parquet_*",
overwrite = TRUE,
name = "df_new",
memory = FALSE,
options = list(mergeSchema = "true"))
> df_new %>%
mutate(unix_t = from_utc_timestamp(timestamp(t) ,'UTC'),
date_str = date_format(unix_t, 'yyyy-MM-dd HH:mm:ss z'),
date_alt = to_date(from_utc_timestamp(timestamp(t) ,'UTC'))) %>%
select(t, unix_t, date_str, date_alt) %>% head(5)
# Source: lazy query [?? x 4]
# Database: spark_connection
t unix_t date_str date_alt
<dbl> <dttm> <chr> <date>
1 1419547405. 2014-12-25 22:43:25 2014-12-25 17:43:25 EST 2014-12-25
2 1418469714. 2014-12-13 11:21:54 2014-12-13 06:21:54 EST 2014-12-13
3 1419126103. 2014-12-21 01:41:43 2014-12-20 20:41:43 EST 2014-12-20
4 1419389856. 2014-12-24 02:57:36 2014-12-23 21:57:36 EST 2014-12-23
5 1418271811. 2014-12-11 04:23:31 2014-12-10 23:23:31 EST 2014-12-10
As you can see both date_str
and date_alt
use the EST
timezone. I need UTC
here. How can I do that?
Thanks!
From the Hive function reference, date_format uses Java's SimpleDateFormat, which I believe always defaults to the JVM time zone, this explains why this gets you a character string converted to your time zone.
One option is to detect the time zone and manually add the hours to get UTC.
Another option would be to use lubridate
with spark_apply()
:
sdf_len(sc, 1) %>%
mutate(unix_t = from_utc_timestamp(timestamp(1522371003) , 'UDT')) %>%
spark_apply(
function(e) {
dplyr::mutate(
e,
time_str = as.character(
lubridate::with_tz(
as.POSIXct(unix_t, origin="1970-01-01"),
"GMT"
)
)
)
},
columns = c("id", "unix_t", "time_str"))
It's possible that sparklyr is doing some weird translation of timezones into the hive functions. I'd try registering the dataframe as a table and doing the manipulation with pure HQL:
createOrReplaceTempView(df_new, "df_new")
result <- sql("select from_utc_timestamp(timestamp(t) ,'UTC'),
cast(from_utc_timestamp(timestamp(t) ,'UTC') as STRING),
cast(from_utc_timestamp(timestamp(t) ,'UTC') as DATE)
from df_new")
head(result)
edit
If you're unfamiliar with SQL-languages, you can add any of the variables from df_new
as a comma separated list like so (and rename your selections with as
)
select var1, var2, t,
from_utc_timestamp(timestamp(t) ,'UTC') as unix_t,
cast(from_utc_timestamp(timestamp(t) ,'UTC') as STRING) as date_str,
cast(from_utc_timestamp(timestamp(t) ,'UTC') as DATE) as date_alt
from df_new
You can also use * to represent all variables from the data frame:
select *,
from_utc_timestamp(timestamp(t) ,'UTC') as unix_t,
cast(from_utc_timestamp(timestamp(t) ,'UTC') as STRING) as date_str,
cast(from_utc_timestamp(timestamp(t) ,'UTC') as DATE) as date_alt
from df_new
Try using as.POSIXct() ?
format(as.POSIXct(unix_t, origin = unix_t, tz = "UTC", usetz=TRUE),"%Y-%m-%d %H:mm:ss")
This will first convert unix timestamp to UTC and then formatted to desired string.