I'm working on address data and use regular expressions to detect errors. Although my approach works, it is far from efficient.
First, let's create a dataset.
library(stringr)  # str_detect()
library(dplyr)    # mutate() and %>%
try_detection <- data.frame(address = c('444+MLK+Street',
'444+3rd+Avenue',
'5th+MLK+Avenue',
'55th+MLK+Avenue',
'555th+MLK+Avenue',
'5555th+MLK+Avenue',
'55555th+MLK+Avenue'),
stringsAsFactors = FALSE)
Only the first 2 observations are correct, as they start with an address number. The goal is to flag the first 2 observations as FALSE and the rest as TRUE.
I could see that the incorrect pattern is a number followed by a letter. So, here is what I tried.
Method 1
try_detection$summary <- str_detect(try_detection$address, '^[:digit:]{1}[:alpha:]')
The result is that only the 3rd observation gets flagged. So, I thought I could simply use '|' and change the number in {}.
Method 2
try_detection$summary <- str_detect(try_detection$address,
'^[:digit:]{1}[:alpha:] |
^[:digit:]{2}[:alpha:] |
^[:digit:]{3}[:alpha:] |
^[:digit:]{4}[:alpha:] |
^[:digit:]{5}[:alpha:]')
But all of the observations get flagged as FALSE.
Method 3
So, this is what I ended up using.
try_detection$detect1 <- str_detect(try_detection$address, '^[:digit:]{1}[:alpha:]')
try_detection$detect2 <- str_detect(try_detection$address, '^[:digit:]{2}[:alpha:]')
try_detection$detect3 <- str_detect(try_detection$address, '^[:digit:]{3}[:alpha:]')
try_detection$detect4 <- str_detect(try_detection$address, '^[:digit:]{4}[:alpha:]')
try_detection$detect5 <- str_detect(try_detection$address, '^[:digit:]{5}[:alpha:]')
try_detection <- try_detection %>% mutate(summary =
ifelse(detect1 == TRUE |
detect2 == TRUE |
detect3 == TRUE |
detect4 == TRUE |
detect5 == TRUE, "Y", "N"))
Although it works and correctly flags the incorrect addresses, it is not efficient at all. Please advise on how I can do this more efficiently.
You may use
^[[:digit:]]+[[:alpha:]]
Or
^[0-9]+[[:alpha:]]
See the regex demo.
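As a minimal sketch, assuming the stringr package from the question is installed, the single pattern described here can be applied to the sample data in one call. The column name summary is reused from the question; everything else is the question's own data.

```r
library(stringr)

# Sample data from the question
try_detection <- data.frame(
  address = c('444+MLK+Street', '444+3rd+Avenue', '5th+MLK+Avenue',
              '55th+MLK+Avenue', '555th+MLK+Avenue', '5555th+MLK+Avenue',
              '55555th+MLK+Avenue'),
  stringsAsFactors = FALSE
)

# Start of string, one or more digits, then a letter
try_detection$summary <- str_detect(try_detection$address,
                                    '^[[:digit:]]+[[:alpha:]]')

print(try_detection$summary)
# FALSE FALSE TRUE TRUE TRUE TRUE TRUE
```

The first two addresses have '+' right after the digits, so they are not flagged; the rest have a letter right after the digits and are flagged.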
Details
- ^ - start of string
- [[:digit:]]+ / [0-9]+ - 1 or more digits (the + quantifier matches one or more occurrences)
- [[:alpha:]] - a letter.

NOTE: if you plan to only match strings that have 1 to 5 digits at the beginning followed with a letter, you may replace + with the {1,5} limiting (or range, interval) quantifier.

Although ICU regex allows using bare POSIX character classes (like [:digit:]), I suggest using them inside bracket expressions to make them more portable (i.e. [[:digit:]]).
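To illustrate the {1,5} variant and the portability point, here is a small sketch that swaps in base R's grepl() for str_detect(), since the bracketed classes also work in base R's regex engine. The six-digit address is a hypothetical value added only to show the upper limit, and the Y/N labels follow the question's Method 3.

```r
# Three addresses from the question plus one hypothetical six-digit case
addresses <- c('444+MLK+Street', '5th+MLK+Avenue',
               '55555th+MLK+Avenue', '666666th+MLK+Avenue')

# Start of string, 1 to 5 digits, then a letter
flagged <- grepl('^[[:digit:]]{1,5}[[:alpha:]]', addresses)

# Same Y/N labels as in the question's Method 3
summary_flag <- ifelse(flagged, "Y", "N")

print(flagged)
# FALSE TRUE TRUE FALSE
```

The six-digit prefix is not flagged because the pattern is anchored at the start and allows at most five digits before the letter; use + instead of {1,5} if any number of digits should count.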