Date conversion without specifying the format

Posted 2019-08-02 18:27

Question:

I do not understand how the ymd function from the lubridate package works in R. I am trying to build a feature that converts dates correctly without having to specify the format. To do this, I check which of dmy(), mdy() and ymd() produces the fewest NAs.

However, ymd() sometimes returns NA and sometimes does not for the same date value. Are there any other functions or packages in R that would help me get around this problem?

> data$DTTM[1:5]
[1] "4-Sep-06"  "27-Oct-06" "8-Jan-07"  "28-Jan-07" "5-Jan-07" 

> ymd(data$DTTM[1])
[1] NA
Warning message:
All formats failed to parse. No formats found. 
> ymd(data$DTTM[2])
[1] "2027-10-06 UTC"
> ymd(data$DTTM[3])
[1] NA
Warning message:
All formats failed to parse. No formats found. 
> ymd(data$DTTM[4])
[1] "2028-01-07 UTC"
> ymd(data$DTTM[5])
[1] NA
Warning message:
All formats failed to parse. No formats found. 
> 

> ymd(data$DTTM[1:5])
[1] "2004-09-06 UTC" "2027-10-06 UTC" "2008-01-07 UTC" "2028-01-07 UTC"
[5] "2005-01-07 UTC"

Thanks

Answer 1:

@user1317221_G has already pointed out that your dates are in day-month-year format, which suggests that you should use dmy instead of ymd. Furthermore, because your month is in %b format ("Abbreviated month name in the current locale"; see ?strptime), your problem may have something to do with your locale. The month names you have appear to be English, and their spelling may differ from the abbreviations used in your current locale.

Let's see what happens when I try dmy on the dates in my locale:

date_english <- c("4-Sep-06",  "27-Oct-06", "8-Jan-07",  "28-Jan-07", "5-Jan-07")
dmy(date_english)

# [1] "2006-09-04 UTC" NA               "2007-01-08 UTC" "2007-01-28 UTC" "2007-01-05 UTC"
# Warning message:
#  1 failed to parse.

"27-Oct-06" failed to parse. Let's check my time locale:

Sys.getlocale("LC_TIME")
# [1] "Norwegian (Bokmål)_Norway.1252"

dmy does not recognize "oct" as a valid %b month in my locale.

One way to deal with this issue would be to change "oct" to the corresponding Norwegian abbreviation, "okt":

date_nor <- c("4-Sep-06",  "27-Okt-06", "8-Jan-07",  "28-Jan-07", "5-Jan-07" )
dmy(date_nor)
# [1] "2006-09-04 UTC" "2006-10-27 UTC" "2007-01-08 UTC" "2007-01-28 UTC" "2007-01-05 UTC"

Another possibility is to use the original dates (i.e. in their original 'locale') and set the locale argument in dmy. Exactly how this is done is platform dependent (see ?locales). Here is how I would do it on Windows:

dmy(date_english, locale = "English")
# [1] "2006-09-04 UTC" "2006-10-27 UTC" "2007-01-08 UTC" "2007-01-28 UTC" "2007-01-05 UTC"
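
Because locale names differ between platforms, a more portable variant is sketched below. It swaps in base R's as.Date for lubridate's dmy and temporarily switches LC_TIME to the "C" locale, which uses English month names. This is only a sketch, assuming that every date in the vector follows the day-abbreviated month-two-digit year pattern:

```r
date_english <- c("4-Sep-06", "27-Oct-06", "8-Jan-07", "28-Jan-07", "5-Jan-07")

# Remember the current time locale so it can be restored afterwards
old_locale <- Sys.getlocale("LC_TIME")

# The "C" locale is available on every platform and uses English month names
Sys.setlocale("LC_TIME", "C")
parsed <- as.Date(date_english, format = "%d-%b-%y")

# Restore the original locale
Sys.setlocale("LC_TIME", old_locale)

parsed
# [1] "2006-09-04" "2006-10-27" "2007-01-08" "2007-01-28" "2007-01-05"
```

Restoring the locale afterwards keeps the rest of the session (for example, output produced by format()) unaffected.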


Answer 2:

Using the guess_formats function in the lubridate package would be the closest to what you are after.

library(lubridate)
x <- c("4-Sep-06", "27-Oct-06","8-Jan-07" ,"28-Jan-07","5-Jan-2007")
format <- guess_formats(x, c("mdY", "BdY", "Bdy", "bdY", "bdy", "mdy", "dby"))
strptime(x, format)
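
The idea described in the question, keeping whichever of dmy(), mdy() and ymd() yields the fewest NAs, could be sketched roughly as follows. Note that pick_parser is a hypothetical helper name and not part of lubridate; only dmy, mdy, ymd and their quiet and locale arguments come from the package. Treat it as a heuristic at best: the question's own output shows ymd() parsing all five values without any NA yet returning the wrong dates, so an NA count alone cannot confirm that the chosen order is correct. Ties go to whichever parser is listed first.

```r
library(lubridate)

# Hypothetical helper: run several lubridate parsers over x and keep
# the one that produces the fewest NAs (first parser wins on ties).
pick_parser <- function(x, locale = Sys.getlocale("LC_TIME")) {
  parsers <- list(dmy = dmy, mdy = mdy, ymd = ymd)

  # quiet = TRUE suppresses the "failed to parse" warnings
  parsed <- lapply(parsers, function(f) f(x, quiet = TRUE, locale = locale))

  # Count the NAs each parser produced
  na_counts <- vapply(parsed, function(d) sum(is.na(d)), integer(1))

  best <- names(which.min(na_counts))
  list(order = best, na_counts = na_counts, dates = parsed[[best]])
}

x <- c("4-Sep-06", "27-Oct-06", "8-Jan-07", "28-Jan-07", "5-Jan-07")
res <- pick_parser(x)
res$order      # name of the selected parser
res$na_counts  # NA count per parser, worth inspecting before trusting the result
```

Inspecting res$na_counts together with a few of the parsed dates is a reasonable sanity check before applying the selected parser to a whole column.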

HTH



Answer 3:

From the documentation on ymd (page 70):

As long as the order of formats is correct, these functions will parse dates correctly even when the input vectors contain differently formatted dates

ymd() expects year-month-day, but your dates are day-month-year. This is the kind of input ymd() is meant for:

x <- c("2009-01-01", "2009-01-02", "2009-01-03")
ymd(x)

Maybe you need something like:

y <- c("4-Sep-06",  "27-Oct-06", "8-Jan-07",  "28-Jan-07", "5-Jan-07" )
as.POSIXct(y, format = "%d-%b-%y")

PS: I think the reason you get NAs for some values is that their first field, which ymd reads as the year, has only a single digit, and ymd doesn't know what to do with that. Parsing goes through (although with day and year swapped) when the first field has two digits, e.g. "27-Oct-06" and "28-Jan-07", but fails for "5-Jan-07", etc.