How to convert string to date on a column with different formats

Posted 2019-07-28 04:12

Question:

I have a column open_date in my Spark DataFrame with string-type values in two different formats, yyyyMMdd and yyyyMM, as shown below:

+---------+
|open_date|
+---------+
| 19500102| 
|   195001| 
+---------+

and my expected output is

+----------+
| open_date|
+----------+
|1950-01-02|
|1950-01-01|
+----------+

I tried converting this string to a date using pyspark.sql.functions.substring, pyspark.sql.functions.split, and pyspark.sql.functions.regexp_extract, but with only limited knowledge of these functions I didn't succeed with any of them.

How can I convert string to date type on a column with different formats?

Answer 1:

You can require that the yyyy and MM parts are present but make dd optional. Put each part in its own capture group, filter out the empty group when dd is missing, then join the groups with '-' delimiters:

>>> import re
>>> s = '19500102 195001'
>>> ['-'.join(filter(None, i)) for i in re.findall(r'(\d{4})(\d{2})(\d{2})?', s)]
['1950-01-02', '1950-01']
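The snippet above operates on a plain Python string. To apply the same idea to the DataFrame column, one option is to wrap it in a UDF. A minimal sketch, assuming the DataFrame is named df (normalize_open_date is a hypothetical helper; note the result is still a string, and a yyyyMM value becomes '1950-01' rather than the '1950-01-01' shown in the expected output):

import re
import pyspark.sql.functions as f
from pyspark.sql.types import StringType

@f.udf(returnType=StringType())
def normalize_open_date(s):
    # Match yyyy, MM, and an optional dd, then join the non-empty groups
    m = re.match(r'(\d{4})(\d{2})(\d{2})?$', s or '')
    return '-'.join(filter(None, m.groups())) if m else None

df = df.withColumn('open_date', normalize_open_date(f.col('open_date')))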


Answer 2:

Update 2019-06-24

You can try each of the valid date formats and use pyspark.sql.functions.coalesce to return the first non-null result.

import pyspark.sql.functions as f

def date_from_string(date_str, fmt):
    try:
        # For spark version 2.2 and above, to_date takes in a second argument
        return f.to_date(date_str, fmt).cast("date")
    except TypeError:
        # For spark version 2.1 and below, you'll have to do it this way
        return f.from_unixtime(f.unix_timestamp(date_str, fmt)).cast("date")

possible_date_formats = ["yyyyMMdd", "yyyyMM"]

df = df.withColumn(
    "open_date",
    f.coalesce(*[date_from_string("open_date", fmt) for fmt in possible_date_formats])
)

df.show()
#+----------+
#| open_date|
#+----------+
#|1950-01-02|
#|1950-01-01|
#+----------+
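To confirm that the coalesce of casts actually produced a DateType column, rather than a date-shaped string, you can inspect the schema:

df.printSchema()
#root
# |-- open_date: date (nullable = true)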

Original Answer

If you're guaranteed to only have strings that are 6 or 8 characters in length, the simplest thing would be to append "01" to the end of the short strings to specify the first of the month.

Here is an example using pyspark.sql.functions.length() and pyspark.sql.functions.concat():

import pyspark.sql.functions as f

df = df.withColumn(
    'open_date',
    f.when(
        f.length(f.col('open_date')) == 6,
        # wrap "01" in f.lit: a bare string here would be treated as a column name
        f.concat(f.col('open_date'), f.lit('01'))
    ).otherwise(f.col('open_date'))
)
df.show()
#+---------+
#|open_date|
#+---------+
#| 19500102| 
#| 19500101| 
#+---------+

Then use the techniques described in this post (paraphrased below) to convert to a date.

For Spark 2.1 and below:

df = df.withColumn('open_date', f.from_unixtime(f.unix_timestamp('open_date', 'yyyyMMdd')).cast('date'))

For Spark 2.2+:

df = df.withColumn('open_date', f.to_date('open_date', 'yyyyMMdd'))
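The padding and the parsing can also be folded into a single expression. A minimal sketch, assuming Spark 2.2+ and the same open_date column:

import pyspark.sql.functions as f

df = df.withColumn(
    'open_date',
    f.to_date(
        # pad yyyyMM values with '01' so everything parses as yyyyMMdd
        f.when(
            f.length('open_date') == 6,
            f.concat('open_date', f.lit('01'))
        ).otherwise(f.col('open_date')),
        'yyyyMMdd'
    )
)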