I have questionnaire data where participants have inputted their date of birth in a wide variety of formats:
ID <- c(101,102,103,104,105,106,107)
dob <- c("20/04/2001","29/10/2000","September 1 2012","15/11/00","20.01.1999","April 20th 1999", "04/08/01")
df <- data.frame(ID, dob)
Before doing any analysis, I need to be able to subset the data when it is not in the correct format (i.e. dd/mm/yr) and then correct each cell in turn manually.
I tried using:
df$dob <- strptime(dob, "%d/%m/%Y")
... to highlight which of my dates were in the correct format, but I just get NAs for the dates that were entered incorrectly, which is not helpful if I want to change them afterwards (using the ID as a reference).
Does anyone have any ideas which may be able to help me?
Disclaimer: I'm not sure if I understood your question completely.
In the snippet below, dob2 will contain a date if dob is in the correct format and NA otherwise. You should be able to filter on is.na(dob2) to get the incorrect rows. Note that 03/04/2013 can be interpreted as 3rd April or 4th March, but you seem to be assuming 3rd April (dd/mm), so I went with that.
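A minimal sketch of this approach, assuming base R's as.Date and the sample data from the question. The strict pattern check and the names strict and bad are additions for illustration; the check is there because %Y may also accept a two-digit year such as 00 instead of returning NA.

```r
# Sample data from the question
df <- data.frame(
  ID  = c(101, 102, 103, 104, 105, 106, 107),
  dob = c("20/04/2001", "29/10/2000", "September 1 2012", "15/11/00",
          "20.01.1999", "April 20th 1999", "04/08/01"),
  stringsAsFactors = FALSE
)

# Parse with the expected dd/mm/yyyy format; strings that do not match become NA
df$dob2 <- as.Date(df$dob, format = "%d/%m/%Y")

# Assumption: additionally require exactly two digits, two digits, four digits,
# so that entries with a two-digit year are flagged as well
strict <- grepl("^[0-9]{2}/[0-9]{2}/[0-9]{4}$", df$dob)
df$dob2[!strict] <- NA

# Rows that still need manual correction, keeping ID as the reference
bad <- df[is.na(df$dob2), c("ID", "dob")]
print(bad)
```

Once a value has been corrected by hand, it can be written back by ID, for example df$dob[df$ID == 103] <- "01/09/2012", and the check rerun.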
EDIT: added output. By the way, you could also have done something like:
# Rows whose dob does not parse as dd/mm/yyyy (the trailing comma selects rows, not columns)
df[is.na(as.Date(df$dob, "%d/%m/%Y")), ]
Check out the lubridate package.
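As a rough sketch of what that could look like, assuming lubridate is installed: parse_date_time() accepts several candidate orders at once, so it can suggest a date for many of the free-form entries. The choice of orders here is an assumption, and genuinely ambiguous strings such as 04/08/01 cannot be resolved automatically, so the suggestions would still need checking against the ID.

```r
# Sample dates from the question
dob <- c("20/04/2001", "29/10/2000", "September 1 2012", "15/11/00",
         "20.01.1999", "April 20th 1999", "04/08/01")

# Guarded so the script still runs where lubridate is not installed
if (requireNamespace("lubridate", quietly = TRUE)) {
  # Try day-month-year first, then month-day-year (assumed orders);
  # entries that match neither order come back as NA
  parsed <- lubridate::parse_date_time(dob, orders = c("dmy", "mdy"), quiet = TRUE)
  print(data.frame(dob, parsed))
}
```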