I have a column open_date in my Spark DataFrame with string values in two different formats, yyyyMMdd and yyyyMM, as below:
+---------+
|open_date|
+---------+
| 19500102|
| 195001|
+---------+
and my expected output is
+----------+
| open_date|
+----------+
|1950-01-02|
|1950-01-01|
+----------+
I tried converting the strings to dates using pyspark.sql.functions.substring, pyspark.sql.functions.split, and pyspark.sql.functions.regexp_extract, but with my limited knowledge of these functions none of my attempts succeeded.
How can I convert a string column with different formats to date type?
You can require that the yyyy and mm are present but make the dd optional. Break each into its own capture group, filter out the dd if it is missing, then join using '-' as the delimiter.
>>> import re
>>> s = '19500102 195001'
>>> ['-'.join(filter(None, i)) for i in re.findall(r'(\d{4})(\d{2})(\d{2})?', s)]
['1950-01-02', '1950-01']
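The same idea carries over to Spark with pyspark.sql.functions.regexp_extract, which the question mentions. A minimal sketch, assuming regexp_extract returns an empty string when the optional day group is absent (the yyyy/mm/dd intermediate columns are just for illustration):

import pyspark.sql.functions as f

pattern = r'^(\d{4})(\d{2})(\d{2})?$'

df = df.withColumn('yyyy', f.regexp_extract('open_date', pattern, 1))
df = df.withColumn('mm', f.regexp_extract('open_date', pattern, 2))
df = df.withColumn('dd', f.regexp_extract('open_date', pattern, 3))
df = df.withColumn(
    'open_date',
    f.concat_ws(
        '-',
        f.col('yyyy'),
        f.col('mm'),
        # default a missing day to the first of the month
        f.when(f.col('dd') == '', '01').otherwise(f.col('dd'))
    )
).drop('yyyy', 'mm', 'dd')

This leaves open_date as a 'yyyy-MM-dd' string, which you can turn into a date with f.to_date('open_date') or .cast('date').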
Update 2019-06-24
You can try each of the valid date formats and use pyspark.sql.functions.coalesce to return the first non-null result. List the more specific format (yyyyMMdd) first; with the lenient legacy parser in Spark 2.x, a short pattern like yyyyMM can yield a spurious non-null date for an eight-digit string.
import pyspark.sql.functions as f

def date_from_string(date_str, fmt):
    try:
        # For Spark 2.2 and above, to_date takes the format as a second argument
        return f.to_date(date_str, fmt).cast("date")
    except TypeError:
        # For Spark 2.1 and below, you'll have to parse with unix_timestamp instead
        return f.from_unixtime(f.unix_timestamp(date_str, fmt)).cast("date")
possible_date_formats = ["yyyyMMdd", "yyyyMM"]
df = df.withColumn(
    "open_date",
    f.coalesce(*[date_from_string("open_date", fmt) for fmt in possible_date_formats])
)
df.show()
#+----------+
#| open_date|
#+----------+
#|1950-01-02|
#|1950-01-01|
#+----------+
Original Answer
If you're guaranteed to only have strings that are 6 or 8 characters in length, the simplest thing would be to append "01"
to the end of the short strings to specify the first of the month.
Here is an example using pyspark.sql.functions.length()
and pyspark.sql.functions.concat()
:
import pyspark.sql.functions as f

df = df.withColumn(
    'open_date',
    f.when(
        f.length(f.col('open_date')) == 6,
        # append a literal "01" as the day of the month
        f.concat(f.col('open_date'), f.lit('01'))
    ).otherwise(f.col('open_date'))
)
df.show()
#+---------+
#|open_date|
#+---------+
#| 19500102|
#| 19500101|
#+---------+
Then use the techniques described in this post (paraphrased below) to convert to a date.
For Spark 2.1 and below:
df = df.withColumn('open_date', f.from_unixtime(f.unix_timestamp('open_date', 'yyyyMMdd')).cast('date'))
For Spark 2.2+:
df = df.withColumn('open_date', f.to_date('open_date', 'yyyyMMdd'))
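Putting both steps together for Spark 2.2+, the padding and the parsing can be done in a single expression (a sketch of the same approach, not a different technique):

import pyspark.sql.functions as f

df = df.withColumn(
    'open_date',
    f.to_date(
        f.when(
            f.length(f.col('open_date')) == 6,
            f.concat(f.col('open_date'), f.lit('01'))
        ).otherwise(f.col('open_date')),
        'yyyyMMdd'
    )
)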