Need Help Improving Regular Expression Code in R

Posted 2019-08-03 19:32

I'm working on address data and using regular expressions to detect errors. Although my approach works, it is far from efficient.

First, let's create a dataset.

library(stringr)  # provides str_detect()
library(dplyr)    # provides mutate() and %>%

try_detection <- data.frame(address = c('444+MLK+Street', 
                                    '444+3rd+Avenue',
                                    '5th+MLK+Avenue', 
                                    '55th+MLK+Avenue', 
                                    '555th+MLK+Avenue',
                                    '5555th+MLK+Avenue',
                                    '55555th+MLK+Avenue'),
                        stringsAsFactors = FALSE)

Only the first 2 observations are correct, as they start with an address number. The goal is to flag the first 2 observations as FALSE and the rest as TRUE.

I can see that the incorrect pattern is one or more digits immediately followed by a letter. So, here is what I tried.

Method 1

try_detection$summary <- str_detect(try_detection$address, '^[:digit:]{1}[:alpha:]')

As a result, only the 3rd observation gets flagged. So I thought I could simply use '|' and change the number in {} for each alternative.

Method 2

try_detection$summary <- str_detect(try_detection$address, 
                               '^[:digit:]{1}[:alpha:] | 
                                ^[:digit:]{2}[:alpha:] | 
                                ^[:digit:]{3}[:alpha:] | 
                                ^[:digit:]{4}[:alpha:] | 
                                ^[:digit:]{5}[:alpha:]')

But every observation gets flagged as FALSE.

Method 3

So, this is what I ended up using.

try_detection$detect1 <- str_detect(try_detection$address, '^[:digit:]{1}[:alpha:]')
try_detection$detect2 <- str_detect(try_detection$address, '^[:digit:]{2}[:alpha:]')
try_detection$detect3 <- str_detect(try_detection$address, '^[:digit:]{3}[:alpha:]')
try_detection$detect4 <- str_detect(try_detection$address, '^[:digit:]{4}[:alpha:]')
try_detection$detect5 <- str_detect(try_detection$address, '^[:digit:]{5}[:alpha:]')

try_detection <- try_detection %>% mutate(summary = 
                                        ifelse(detect1 == TRUE | 
                                               detect2 == TRUE | 
                                               detect3 == TRUE | 
                                               detect4 == TRUE | 
                                               detect5 == TRUE, "Y", "N"))

Although this works and correctly flags the incorrect addresses, it is not efficient at all. Please advise on how I can do this more efficiently.

Tags: r regex
1 Answer
爷的心禁止访问
#2 · 2019-08-03 20:12

You may use

^[[:digit:]]+[[:alpha:]]

Or

^[0-9]+[[:alpha:]]


Details

  • ^ - start of string
  • [[:digit:]]+ / [0-9]+ - one or more digits (the + quantifier matches one or more occurrences)
  • [[:alpha:]] - a letter.
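
Applied to the data from the question, the pattern produces the desired flags in a single call. The sketch below uses base R's grepl() so that it runs without extra packages; this is a substitution for illustration, and stringr's str_detect() should accept the same pattern string.

```r
# Sample addresses from the question
try_detection <- data.frame(
  address = c('444+MLK+Street', '444+3rd+Avenue', '5th+MLK+Avenue',
              '55th+MLK+Avenue', '555th+MLK+Avenue', '5555th+MLK+Avenue',
              '55555th+MLK+Avenue'),
  stringsAsFactors = FALSE
)

# TRUE when the string starts with one or more digits
# that are immediately followed by a letter
try_detection$summary <- grepl('^[[:digit:]]+[[:alpha:]]', try_detection$address)

print(try_detection)
```

This replaces the five detect columns and the ifelse() step from Method 3 with one logical column.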

NOTE: if you only want to match strings that start with 1 to 5 digits followed by a letter, replace + with the {1,5} limiting (also called range or interval) quantifier.
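
A short sketch of the bounded variant follows. It also revisits Method 2 from the question: spaces and line breaks inside a pattern string are matched literally unless a free-spacing (comments) option is enabled, which is the most likely reason that attempt returned FALSE everywhere. The vector name addresses and the 6-digit entry are illustrative additions, and base R's grepl() stands in for str_detect().

```r
# Illustrative input; the last entry (6 leading digits) is not from the question
addresses <- c('444+MLK+Street', '5th+MLK+Avenue', '55555th+MLK+Avenue',
               '666666th+MLK+Avenue')

# {1,5}: between one and five leading digits, then a letter
bounded <- grepl('^[[:digit:]]{1,5}[[:alpha:]]', addresses)

# The alternation from Method 2 with the stray whitespace removed,
# built programmatically instead of typed out five times
alternation <- paste0('^[[:digit:]]{', 1:5, '}[[:alpha:]]', collapse = '|')
same <- grepl(alternation, addresses)

print(data.frame(addresses, bounded, same))
```

Both patterns flag the same rows; the {1,5} form is simply the compact way to write the five-way alternation.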

Although ICU regex allows using bare POSIX character classes (like [:digit:]), I suggest using them inside bracket expressions to make them more portable (i.e. [[:digit:]]).
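
To illustrate the portability point: base R's default regex engine does not treat a bare [:digit:] as a character class. It reads it as an ordinary bracket expression listing the characters ':', 'd', 'i', 'g' and 't'. A minimal sketch, using base R's grepl() and made-up input strings:

```r
# Made-up inputs: one starts with a digit, one starts with the letter 'd'
x <- c('5th+MLK+Avenue', 'dog+Street')

# Bare form: a bracket list of the characters : d i g t, not "any digit"
bare <- grepl('^[:digit:]', x)

# Bracketed form: the POSIX digit class, as intended
bracketed <- grepl('^[[:digit:]]', x)

print(data.frame(x, bare, bracketed))
```

The bracketed form behaves the same way in both engines, so it is the safer habit when code may move between stringr and base R functions.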
