Regular expression not working in R but works on w

Posted 2019-08-24 19:44

Question:

I have a regex which works on the regular expression website but doesn't work when I copy it in R. Below is the code to recreate my data frame:

text <- data.frame(page = c(1,1,2,3), sen = c(1,2,1,1),
                   text = c("Dear Mr case 1",
                            "the value of my property is £500,000.00 and it was built in 1980", 
                            "The protected percentage is 0% for 2 years",
                            "The interest rate is fixed for 2 years at 4.8%"))

The regex works on the website: https://regex101.com/r/OcVN5r/2

Below is the R code I have tried so far; neither call works.

library(stringr)
patt = "dear\\s+(mr|mrs|miss|ms)\\b[^£]+(£[\\d,.]+)(?:\\D|\\d(?![\\d.]*%))+([\\d.]+%)(?:\\D|\\d(?![\\d.]*%))+([\\d.]+%)"
str_extract(text, patt)
grepl(pattern = patt, x = text)

I'm getting an error saying the regex is wrong, but it works on the website, and I'm not sure how to get it to work in R. Basically, I am trying to extract pieces of information from the text. From the data frame above, I need to extract the following:

1: Gender of the person. In this case it would be Male (looking at Mr)

2: The number that represents the property value. In this case it would be £500,000.00.

3: The protected percentage value, which in our case would be 0%.

4: The interest rate value, which in our case is 4.8%.

Answer 1:

I think the issue is that your regex doesn't offer alternative ("OR") matches. See the version below, based on your list:

library(stringi)
rgx <- "(?<=dear\\s?)(m(r(s)?|s|iss))|\\p{S}([0-9]\\S+)|([0-9]+)((\\.[0-9]{1,})?)\\%"
# unlist() is called directly, so the %>% pipe (and magrittr) is not needed
unlist(stri_extract_all_regex(
   text$text, rgx, opts_regex = stri_opts_regex(case_insensitive = TRUE)
))

Which gives

[1] "Mr"          "£500,000.00" "0%"          "4.8%"

The pattern says:

  • "(?<=dear\\s?)(m(r(s)?|s|iss))" = match mr, ms, mrs or miss when the word dear appears directly before it, but don't include dear or the space after it in the match
  • | = OR
  • "\\p{S}([0-9]\\S+)" = match a symbol (see ?stringi-search-charclass) followed by a digit and then everything up to the next white space. The match must start with a symbol
  • | = OR
  • "([0-9]+)((\\.[0-9]{1,})?)\\%" = match one or more digits, optionally followed by a decimal point and more digits, ending in a percent sign
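The combined pattern returns one unlabelled vector, so the caller has to know which element is which. As a rough sketch of the same three alternatives applied one at a time, here is a base R version that uses regexpr()/gregexpr() with perl = TRUE instead of stringi (my substitution, not what the answer uses). The names sentences, first_match and gender_lookup are made up for this illustration, and the title-to-gender mapping is only an assumption about what the question wants.

```r
# Stand-in for text$text from the question
sentences <- c(
  "Dear Mr case 1",
  "the value of my property is £500,000.00 and it was built in 1980",
  "The protected percentage is 0% for 2 years",
  "The interest rate is fixed for 2 years at 4.8%"
)

# First match of `pattern` in each element; elements without a match are dropped
first_match <- function(pattern, x) {
  regmatches(x, regexpr(pattern, x, perl = TRUE, ignore.case = TRUE))
}

# Title after "dear "; PCRE needs a fixed-length lookbehind, hence \\s rather than \\s?
title <- first_match("(?<=dear\\s)(mrs|mr|miss|ms)\\b", sentences)

# Pound sign followed by digits, commas and dots
property_value <- first_match("£[0-9][0-9,.]*", sentences)

# Every percentage, in order of appearance
percentages <- unlist(regmatches(
  sentences, gregexpr("[0-9]+(\\.[0-9]+)?%", sentences, perl = TRUE)
))

# Hypothetical lookup from title to the gender label the question asks for
gender_lookup <- c(mr = "Male", mrs = "Female", miss = "Female", ms = "Female")
gender <- unname(gender_lookup[tolower(title)])
```

Keeping the patterns separate costs a few more lines, but each result comes back under its own name, and a sentence that happens to contain a second percentage or symbol is easier to spot.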


Answer 2:

I think you can do this with the regexpr function.

For example:

text <- "Dear Mr case 1, the value of my property is £500,000.00 and it was built in 1980, The protected percentage is 13% for 2 years, The interest rate is fixed for 2 years at 4.8%"

# patt is the pattern from the question; perl = TRUE selects the PCRE engine
grps <- regexpr(pattern = patt, text = text, perl = TRUE, ignore.case = TRUE)

# Start and end of each capture group; the end is inclusive, hence the - 1
start_idx <- attr(grps, "capture.start")
end_idx   <- start_idx + attr(grps, "capture.length") - 1

substring(text = text, first = start_idx, last = end_idx)

This matches: [1] "Mr" "£500,000.00" "13%" "4.8%"
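The perl = TRUE argument in the regexpr call matters. As far as I know, base R's default regex engine (TRE) does not support lookaheads such as (?!...), which would explain the "regex is wrong" error from the grepl call in the question. A small sketch of the difference, using a cut-down piece of the question's pattern (the variable names are mine):

```r
x <- "The interest rate is fixed for 2 years at 4.8%"

# A digit that is NOT the start of a percentage (negative lookahead)
pat <- "\\d(?![\\d.]*%)"

# Default engine: the lookahead is rejected when the pattern is compiled
default_result <- tryCatch(
  suppressWarnings(grepl(pat, x)),
  error = function(e) "error"
)

# PCRE engine: the same pattern compiles and matches the "2" in "2 years"
perl_result <- grepl(pat, x, perl = TRUE)

print(default_result)
print(perl_result)
```

So adding perl = TRUE to the original grepl call, and passing the character column text$text rather than the whole data frame, should at least get rid of the error.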

From the manual:

regexpr returns an integer vector of the same length as text giving the starting position of the first match or -1 if there is none, with attribute "match.length", an integer vector giving the length of the matched text (or -1 for no match). The match positions and lengths are in characters unless useBytes = TRUE is used, when they are in bytes (as they are for an ASCII-only matching: in either case an attribute useBytes with value TRUE is set on the result). If named capture is used there are further attributes "capture.start", "capture.length" and "capture.names".

gregexpr returns a list of the same length as text each element of which is of the same form as the return value for regexpr, except that the starting positions of every (disjoint) match are given.

In your case, I think you need to paste the lines together first, because the pattern spans several rows of the data frame:

full_line <- paste(text[, "text"], collapse = " ")

Then apply regexpr to full_line.
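Putting the two steps together, here is a minimal end-to-end sketch. It swaps regexpr for regexec() plus regmatches(), which hand back the capture groups directly, and it assumes an R version whose regexec() accepts perl = TRUE (3.3.0 or later, as far as I know). The field names are invented for this example, and sentences stands in for text$text.

```r
# Stand-in for text$text from the question
sentences <- c(
  "Dear Mr case 1",
  "the value of my property is £500,000.00 and it was built in 1980",
  "The protected percentage is 0% for 2 years",
  "The interest rate is fixed for 2 years at 4.8%"
)

# Pattern from the question, unchanged
patt <- "dear\\s+(mr|mrs|miss|ms)\\b[^£]+(£[\\d,.]+)(?:\\D|\\d(?![\\d.]*%))+([\\d.]+%)(?:\\D|\\d(?![\\d.]*%))+([\\d.]+%)"

# One string, so the pattern can span what were separate rows
full_line <- paste(sentences, collapse = " ")

# regexec() returns the position of the whole match plus each capture group
m <- regexec(patt, full_line, perl = TRUE, ignore.case = TRUE)
parts <- regmatches(full_line, m)[[1]]

# Drop the whole match (element 1) and label the four groups
fields <- setNames(
  parts[-1],
  c("title", "property_value", "protected_pct", "interest_rate")
)
print(fields)
```

Note that this relies on the four pieces appearing in exactly this order in the pasted string; if the rows can arrive in a different order, separate patterns per field are probably safer.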