I have an RDD containing a timestamp named time of type long:
root
|-- id: string (nullable = true)
|-- value1: string (nullable = true)
|-- value2: string (nullable = true)
|-- time: long (nullable = true)
|-- type: string (nullable = true)
I am trying to group by value1, value2 and time as YYYY-MM-DD. I tried to group by cast(time as Date) but then I got the following error:
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:40)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.RuntimeException: [1.21] failure: ``DECIMAL'' expected but identifier Date found
Does that mean there is no way to group by a date? I even tried to add another level of casting to have it as a String:
cast(cast(time as Date) as String)
which returns the same error.
I've read that I could probably use aggregateByKey on the RDD, but I don't understand how to use it across several columns, nor how to convert that long to a YYYY-MM-DD string. How should I proceed?
I'm using Spark 1.4.0, and since 1.2.0 DATE appears to be present in the Spark SQL API (SPARK-2562). DATE should allow you to group by the time as YYYY-MM-DD.

I also have a similar data structure, where my created_on field is analogous to your time field. I solved it using FROM_UNIXTIME(created_on, 'YYYY-MM-dd')
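A minimal sketch of that query (a Hive-enabled context is required, since FROM_UNIXTIME is a Hive UDF; the events table name is illustrative and created_on is assumed to hold epoch seconds):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)

    // FROM_UNIXTIME takes epoch seconds and a SimpleDateFormat pattern.
    val byDay = hiveContext.sql(
      """SELECT FROM_UNIXTIME(created_on, 'YYYY-MM-dd') AS day, COUNT(*) AS cnt
         FROM events
         GROUP BY FROM_UNIXTIME(created_on, 'YYYY-MM-dd')""")
    byDay.show()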
From here on you can do the normal operations, execute the query into a DataFrame, and so on.
FROM_UNIXTIME probably worked because I have Hive included in my Spark installation, as it's a Hive UDF. However, it will be included as part of the Spark SQL native syntax in a future release (SPARK-8175).

Not sure if this is what you meant/needed, but I've felt the same struggle dealing with dates/timestamps in spark-sql, and the only thing I came up with was casting a string into a timestamp, since it seems impossible (to me) to have a Date type in spark-sql.
Anyway, this is my code to accomplish something similar to your need (with a Long in place of a String, maybe):
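A minimal sketch of that approach, assuming Spark 1.4 with a plain SQLContext; the Record case class, the sample rows, and the records table name are all illustrative:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Illustrative schema: a string timestamp plus one grouping column.
    case class Record(dt: String, value1: String)

    sc.parallelize(Seq(
      Record("2015-06-01 14:30:00", "a"),
      Record("2015-06-01 16:00:00", "a"),
      Record("2015-06-02 09:15:00", "b")
    )).toDF().registerTempTable("records")

    // Cast the string to a timestamp and group on the result.
    val grouped = sqlContext.sql(
      """SELECT CAST(dt AS timestamp) AS ts, value1, COUNT(*) AS cnt
         FROM records
         GROUP BY CAST(dt AS timestamp), value1""")
    grouped.show()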
and the output is something like this:
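(Roughly, given the illustrative rows above:)

    ts                    value1  cnt
    2015-06-01 14:30:00.0 a       1
    2015-06-01 16:00:00.0 a       1
    2015-06-02 09:15:00.0 b       1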
Could this be useful for you as well, even though I'm using a timestamp and not a Date?

Hope it helps,
FF
EDIT: in order to test a "single cast" from Long to Timestamp, I've tried this simple change:
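A sketch of that variant; here time is assumed to hold epoch seconds and is cast to a timestamp in a single step (names are again illustrative):

    // Single cast: Long (assumed epoch seconds) straight to timestamp.
    case class LongRecord(time: Long, value1: String)

    sc.parallelize(Seq(
      LongRecord(1433160000L, "a"),   // 2015-06-01 12:00:00 UTC
      LongRecord(1433246400L, "b")    // 2015-06-02 12:00:00 UTC
    )).toDF().registerTempTable("long_records")

    val casted = sqlContext.sql(
      "SELECT CAST(time AS timestamp) AS ts, value1 FROM long_records")
    casted.show()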
and all worked fine with the result:
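(Roughly; the timestamps are rendered in the session's local timezone, UTC in this example:)

    ts                    value1
    2015-06-01 12:00:00.0 a
    2015-06-02 12:00:00.0 b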
I solved the issue by adding this function:
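A sketch of such a helper; the name convert is illustrative, and it assumes time holds epoch milliseconds (use time * 1000L first if it holds seconds):

    // Formats an epoch-millisecond Long as a yyyy-MM-dd string.
    // Assumes milliseconds; multiply by 1000L first if `time` is in seconds.
    def convert(time: Long): String = {
      val sdf = new java.text.SimpleDateFormat("yyyy-MM-dd")
      sdf.format(new java.util.Date(time))
    }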
And registering it into the sqlContext like this:
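In Spark 1.4 the registration would look like this (the SQL-side name convert is illustrative):

    sqlContext.udf.register("convert", convert _)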
Then I could finally group by date:
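A sketch of the final query, assuming the DataFrame was registered as a temp table named myTable:

    // "myTable" is an illustrative temp-table name.
    val byDate = sqlContext.sql(
      """SELECT convert(time) AS day, value1, value2, COUNT(*) AS cnt
         FROM myTable
         GROUP BY convert(time), value1, value2""")
    byDate.show()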