Subset dates in the wrong format in R

2019-07-25 13:44发布

I have questionnaire data where participants have inputted their date of birth in a wide variety of formats:

ID <- c(101,102,103,104,105,106,107)
dob <- c("20/04/2001","29/10/2000","September 1 2012","15/11/00","20.01.1999","April 20th 1999", "04/08/01")
df <- data.frame(ID, dob)

Before doing any analysis, I need to be able to subset the data when it is not in the correct format (i.e. dd/mm/yr) and then correct each cell in turn manually.

I tried using:

df$dob <- strptime(dob, "%d/%m/%Y")

... to highlight which of my dates were in the correct format, but I just get NAs for the dates that are inputted incorrectly which is not helpful if I want to subsequently change them (using the ID as a reference).

Does anyone have any ideas which may be able to help me?

2条回答
女痞
2楼-- · 2019-07-25 14:03

Disclaimer: I'm not sure if I understood your question completely.

In the snippet below, dob2 will contain a date or NA based on whether dob is in the correct format. You should be able to filter for is.na(dob2) to get the incorrect data. Note that 03/04/2013 can be interpreted as 3rd March or 4th April but you seem to be assuming it to be 3rd April so I went with it.

ID <- c(101,102,103,104,105,106,107)
dob <- c("20/04/2001","29/10/2000","September 1 2012","15/11/00","20.01.1999","April 20th 1999", "04/08/01")
df <- data.table(ID, dob)

df[,dob2 := as.Date(dob, "%d/%m/%Y")]

EDIT- added output. btw, you could also have done something like df[is.na(as.Date(dob, "%d/%m/%Y"))]

    ID              dob       dob2
1: 101       20/04/2001 2001-04-20
2: 102       29/10/2000 2000-10-29
3: 103 September 1 2012       <NA>
4: 104         15/11/00 0000-11-15
5: 105       20.01.1999       <NA>
6: 106  April 20th 1999       <NA>
7: 107         04/08/01 0001-08-04
查看更多
我想做一个坏孩纸
3楼-- · 2019-07-25 14:18

Check out the lubridate package.

library(lubridate)
parse_date_time(dob, c("dmy", "Bdy"))
# [1] "2001-04-20 UTC" "2000-10-29 UTC" "2012-09-01 UTC" "0000-11-15 UTC" "1999-01-20 UTC"
# [6] "1999-04-20 UTC" "0001-08-04 UTC"
查看更多
登录 后发表回答