Convert variable with mixed date formats to one fo

Posted 2019-01-02 18:57

A sample of my dataframe:

                date
1   25 February 1987
2     20 August 1974
3     9 October 1984
4     18 August 1992
5  19 September 1995
6          16-Oct-63
7          30-Sep-65
8        22 Jan 2008
9         13-11-1961
10    18 August 1987
11         15-Sep-70
12    5 October 1994
13   5 December 1984
14          03/23/87
15    30 August 1988
16        26-10-1993
17    22 August 1989
18         13-Sep-97

I have a large dataframe with a date variable that has multiple formats for dates. Most of the formats in the variable are shown above; there are a couple of very rare others too. There are multiple formats because the data were pulled together from various websites that each used a different format.

I have tried using straightforward conversions e.g.

strftime(mydf$date,"%d/%m/%Y")

but these sorts of conversions will not work if there are multiple formats. I don't want to resort to multiple gsub-type edits. Am I missing a simpler solution?

Code for example:

    structure(list(date = structure(c(12L, 8L, 18L, 6L, 7L, 4L, 14L, 
10L, 1L, 5L, 3L, 17L, 16L, 11L, 15L, 13L, 9L, 2L), .Label = c("13-11-1961", 
"13-Sep-97", "15-Sep-70", "16-Oct-63", "18 August 1987", "18 August 1992", 
"19 September 1995", "20 August 1974", "22 August 1989", "22 Jan 2008", 
"03/23/87", "25 February 1987", "26-10-1993", "30-Sep-65", "30 August 1988", 
"5 December 1984", "5 October 1994", "9 October 1984"), class = "factor")), .Names = "date", row.names = c(NA, 
-18L), class = "data.frame")

3 answers
还给你的自由
Reply #2 · 2019-01-02 19:24

Here is a base solution:

fmts <- c("%d-%b-%y", "%d %b %Y", "%d-%m-%Y", "%m/%d/%y")
d <- as.Date(as.numeric(apply(outer(DF$date, fmts, as.Date), 1, na.omit)), "1970-01-01")

We have made the simplifying assumption that exactly one format works for each input date. That seems to be the case in the example, but if not, replace na.omit with function(x) c(na.omit(x), NA)[1], which keeps the first successful parse, or NA if no format matches.
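A minimal sketch of the same "first format that parses wins" idea, written as an explicit loop in base R. The helper name parse_mixed is hypothetical and not part of the answer above. It assumes the formats are tried in the order given, that month names are matched in the current locale, and that %y years still carry the two-digit-year ambiguity.

```r
# Hypothetical helper (an illustration, not the answer's code): try each
# format in turn and keep the first one that parses. Entries that match
# no format stay NA.
parse_mixed <- function(x, fmts) {
  x <- as.character(x)                  # factors must be converted first
  out <- rep(as.Date(NA), length(x))    # start with all-NA Dates
  for (f in fmts) {
    todo <- is.na(out)                  # only retry entries not yet parsed
    out[todo] <- as.Date(x[todo], format = f)
  }
  out
}

fmts <- c("%d-%b-%y", "%d %b %Y", "%d-%m-%Y", "%m/%d/%y")
res <- parse_mixed(c("13-11-1961", "03/23/87", "not a date"), fmts)
print(res)
```

Checking which entries of the result are still NA is a quick way to find the rare formats that are not yet covered by fmts.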

Note that a two-digit year can be ambiguous, but here it seems it should always be in the past, so we subtract 100 years if it is not:

past <- function(x) ifelse(x > Sys.Date(), seq(from=x, length=2, by="-100 year")[2], x)
as.Date(sapply(d, past), "1970-01-01")

For the sample data the last line gives:

[1] "1987-02-25" "1974-08-20" "1984-10-09" "1992-08-18" "1995-09-19"
[6] "1963-10-16" "1965-09-30" "2008-01-22" "1961-11-13" "1987-08-18"
[11] "1970-09-15" "1994-10-05" "1984-12-05" "1987-03-23" "1988-08-30"
[16] "1993-10-26" "1989-08-22" "1997-09-13"
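The two-digit-year correction can also be sketched in vectorised form, without sapply. This is an alternative under stated assumptions, not the answer's method: the function name shift_future and its today argument are hypothetical, and it simply moves any date later than today back by 100 calendar years.

```r
# Hypothetical vectorised variant of past(): dates later than `today`
# are assumed to be mis-resolved two-digit years and are moved back
# by exactly 100 calendar years.
shift_future <- function(d, today = Sys.Date()) {
  lt <- as.POSIXlt(d)                       # exposes the year component
  future <- !is.na(d) & d > today           # NA dates are left untouched
  lt$year[future] <- lt$year[future] - 100
  as.Date(lt)
}

shifted <- shift_future(as.Date(c("2063-10-16", "1987-03-23")),
                        today = as.Date("2019-01-02"))
print(shifted)
# "1963-10-16" "1987-03-23"
```

Fixing today to a constant, as in the usage line, makes the result reproducible; the default of Sys.Date() matches the behaviour described in the answer.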
有味是清欢
Reply #3 · 2019-01-02 19:24

Try writing a function and calling it later. For example, if you have a character string in "dd-mm-yyyy" form and want to extract only the month:

month <- function(date_var) {
  # Parse the "dd-mm-yyyy" string as a Date
  parsed <- as.Date(date_var, format = "%d-%m-%Y")
  # Return the month as a two-digit string
  format(parsed, "%m")
}

Now pass a value from your vector; the function converts the string to a Date and returns the month. The output would be "04":

month("12-04-2014")
无与为乐者.
Reply #4 · 2019-01-02 19:25

You may try parse_date_time in package lubridate which "allows the user to specify several format-orders to handle heterogeneous date-time character representations" using the orders argument. Something like...

library(lubridate)
parse_date_time(x = df$date,
                orders = c("d m y", "d B Y", "m/d/y"),
                locale = "eng")

...should be able to handle most of your formats. Please note that b/B formats are locale sensitive.
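To illustrate the locale point in base R only (no lubridate involved): a rough sketch, assuming the month names in the data are English. Switching LC_TIME to the "C" locale for the duration of the parse makes %b match English names regardless of the session's default locale; the variable names old_locale and parsed are just placeholders.

```r
# Remember the current time locale so it can be restored afterwards
old_locale <- Sys.getlocale("LC_TIME")

# In the "C" locale, month names are the English ones
Sys.setlocale("LC_TIME", "C")
parsed <- as.Date("22 Jan 2008", format = "%d %b %Y")

# Restore the previous setting
Sys.setlocale("LC_TIME", old_locale)

print(parsed)
```

Without such a step, the same call can return NA in a session whose locale uses non-English month names.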
