I have a column open_date in my Spark DataFrame with string values in two different formats, yyyyMMdd and yyyyMM, as below:
+---------+
|open_date|
+---------+
| 19500102|
| 195001|
+---------+
and my expected output is
+----------+
| open_date|
+----------+
|1950-01-02|
|1950-01-01|
+----------+
I tried converting the strings to dates using pyspark.sql.functions.substring, pyspark.sql.functions.split, and pyspark.sql.functions.regexp_extract, but with my limited knowledge of these functions none of my attempts succeeded.
How can I convert a string column with different formats to date type?
You can require that the yyyy and mm are present but make the dd optional. Break each into its own capture group, filter out the dd if it is missing, then join using '-' as the delimiter.
>>> import re
>>> s = '19500102 195001'
>>> ['-'.join(filter(None, i)) for i in re.findall(r'(\d{4})(\d{2})(\d{2})?', s)]
['1950-01-02', '1950-01']
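The same idea carries over to Spark with pyspark.sql.functions.regexp_extract, which the question mentions. A minimal sketch, assuming regexp_extract returns an empty string when the optional day group is absent (the yyyy/mm/dd intermediate columns are just for illustration):

import pyspark.sql.functions as f

pattern = r'^(\d{4})(\d{2})(\d{2})?$'

df = df.withColumn('yyyy', f.regexp_extract('open_date', pattern, 1))
df = df.withColumn('mm', f.regexp_extract('open_date', pattern, 2))
df = df.withColumn('dd', f.regexp_extract('open_date', pattern, 3))
df = df.withColumn(
    'open_date',
    f.concat_ws(
        '-',
        f.col('yyyy'),
        f.col('mm'),
        # default a missing day to the first of the month
        f.when(f.col('dd') == '', '01').otherwise(f.col('dd'))
    )
).drop('yyyy', 'mm', 'dd')

This leaves open_date as a 'yyyy-MM-dd' string, which you can turn into a date with f.to_date('open_date') or .cast('date').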
Update 2019-06-24
You can try each of the valid date formats and use pyspark.sql.functions.coalesce to return the first non-null result. List the more specific format (yyyyMMdd) first; with the lenient legacy parser in Spark 2.x, a short pattern like yyyyMM can yield a spurious non-null date for an eight-digit string.
import pyspark.sql.functions as f

def date_from_string(date_str, fmt):
    try:
        # For Spark 2.2 and above, to_date takes the format as a second argument
        return f.to_date(date_str, fmt).cast("date")
    except TypeError:
        # For Spark 2.1 and below, you'll have to parse with unix_timestamp instead
        return f.from_unixtime(f.unix_timestamp(date_str, fmt)).cast("date")
possible_date_formats = ["yyyyMMdd", "yyyyMM"]
df = df.withColumn(
    "open_date",
    f.coalesce(*[date_from_string("open_date", fmt) for fmt in possible_date_formats])
)
df.show()
#+----------+
#| open_date|
#+----------+
#|1950-01-02|
#|1950-01-01|
#+----------+
Original Answer
If you're guaranteed to only have strings that are 6 or 8 characters in length, the simplest thing would be to append "01"
to the end of the short strings to specify the first of the month.
Here is an example using pyspark.sql.functions.length()
and pyspark.sql.functions.concat()
:
import pyspark.sql.functions as f

df = df.withColumn(
    'open_date',
    f.when(
        f.length(f.col('open_date')) == 6,
        # append a literal "01" as the day of the month
        f.concat(f.col('open_date'), f.lit('01'))
    ).otherwise(f.col('open_date'))
)
df.show()
#+---------+
#|open_date|
#+---------+
#| 19500102|
#| 19500101|
#+---------+
Then use the techniques described in this post (paraphrased below) to convert to a date.
For Spark 2.1 and below:
df = df.withColumn('open_date', f.from_unixtime(f.unix_timestamp('open_date', 'yyyyMMdd')).cast('date'))
For Spark 2.2+:
df = df.withColumn('open_date', f.to_date('open_date', 'yyyyMMdd'))
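Putting both steps together for Spark 2.2+, the padding and the parsing can be done in a single expression (a sketch of the same approach, not a different technique):

import pyspark.sql.functions as f

df = df.withColumn(
    'open_date',
    f.to_date(
        f.when(
            f.length(f.col('open_date')) == 6,
            f.concat(f.col('open_date'), f.lit('01'))
        ).otherwise(f.col('open_date')),
        'yyyyMMdd'
    )
)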