I have a regex which works on the regular expression website but doesn't work when I copy it in R. Below is the code to recreate my data frame:
text <- data.frame(page = c(1,1,2,3), sen = c(1,2,1,1),
text = c("Dear Mr case 1",
"the value of my property is £500,000.00 and it was built in 1980",
"The protected percentage is 0% for 2 years",
"The interest rate is fixed for 2 years at 4.8%"))
regex working on website: https://regex101.com/r/OcVN5r/2
Below is the R code I have tried so far; neither call works.
library(stringr)
patt = "dear\\s+(mr|mrs|miss|ms)\\b[^£]+(£[\\d,.]+)(?:\\D|\\d(?![\\d.]*%))+([\\d.]+%)(?:\\D|\\d(?![\\d.]*%))+([\\d.]+%)"
str_extract(text, patt)
grepl(pattern = patt, x = text)
I'm getting an error saying the regex is wrong, but it works on the website. I'm not sure how to get it to work in R.
Basically I am trying to extract pieces of information from the text. Below are the details:
From the above dataframe, I need to extract the following:
1: Gender of the person. In this case it would be Male (looking at Mr).
2: The number that represents the property value. In this case it would be £500,000.00.
3: The protected percentage value, which in our case would be 0%.
4: The interest rate value, which in our case is 4.8%.
I think the issue is that your regex doesn't offer alternative ("OR") matches. See below, based on your bullet list:
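library(stringi)
# Three alternatives joined by "|": title after "dear", symbol-prefixed amount, percentage
rgx <- "(?<=dear\\s?)(m(r(s)?|s|iss))|\\p{S}([0-9]\\S+)|([0-9]+)((\\.[0-9]{1,})?)\\%"
# unlist() flattens the per-sentence list of matches; calling it directly avoids needing a pipe package
unlist(stri_extract_all_regex(
text$text, rgx, opts_regex = stri_opts_regex(case_insensitive = TRUE)
))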
Which gives
[1] "Mr" "£500,000.00" "0%" "4.8%"
The pattern says:
"(?<=dear\\s?)(m(r(s)?|s|iss))" = match mr, mrs, ms or miss when it is preceded by the word dear; because of the lookbehind, dear and the optional space after it are not part of the match
| = OR
"\\p{S}([0-9]\\S+)" = match a symbol (see ?stringi-search-charclass) followed by a digit and then everything up to the next whitespace; the symbol at the beginning is required
| = OR
"([0-9]+)((\\.[0-9]{1,})?)\\%" = match one or more digits, optionally followed by a decimal point and more digits, ending in a percent sign
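To see what each alternative contributes, the three branches can also be run on their own. This is only a sketch: the names sentences, ci, title, amount and percents are made up for illustration, and the pound sign is written as "\u00a3" so the script does not depend on the file encoding.

```r
library(stringi)

# The four sentences from the question; "\u00a3" is the pound sign
sentences <- c(
  "Dear Mr case 1",
  "the value of my property is \u00a3500,000.00 and it was built in 1980",
  "The protected percentage is 0% for 2 years",
  "The interest rate is fixed for 2 years at 4.8%"
)

ci <- stri_opts_regex(case_insensitive = TRUE)

# Branch 1: the title, only when "dear" (plus an optional space) comes before it
title <- stri_extract_first_regex(
  sentences[1], "(?<=dear\\s?)(m(r(s)?|s|iss))", opts_regex = ci
)

# Branch 2: a symbol such as the pound sign, then a digit and the rest of the token
amount <- stri_extract_first_regex(sentences[2], "\\p{S}([0-9]\\S+)")

# Branch 3: digits, an optional decimal part, then a percent sign
percents <- stri_extract_first_regex(sentences[3:4], "([0-9]+)((\\.[0-9]{1,})?)%")
```

Running the branches separately like this makes it easier to tell which one is responsible when the combined pattern returns something unexpected.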
I think you can do this with the regexpr function.
For example:
# patt is the pattern from the question
text <- "Dear Mr case 1, the value of my property is £500,000.00 and it was built in 1980, The protected percentage is 13% for 2 years, The interest rate is fixed for 2 years at 4.8%"
grps <- regexpr(pattern = patt, text = text, perl = TRUE, ignore.case = TRUE)
start_idx <- attr(grps, "capture.start")
# substring() treats last as inclusive, so subtract 1 to avoid picking up the character after each group
end_idx <- start_idx + attr(grps, "capture.length") - 1
substring(text = text, first = start_idx, last = end_idx)
This matches: [1] "Mr" "£500,000.00" "13%" "4.8%"
From the manual:
regexpr returns an integer vector of the same length as text giving the starting position of the first match or -1 if there is
none, with attribute "match.length", an integer vector giving the
length of the matched text (or -1 for no match). The match positions
and lengths are in characters unless useBytes = TRUE is used, when
they are in bytes (as they are for an ASCII-only matching: in either
case an attribute useBytes with value TRUE is set on the result). If
named capture is used there are further attributes "capture.start",
"capture.length" and "capture.names".
gregexpr returns a list of the same length as text each element of
which is of the same form as the return value for regexpr, except that
the starting positions of every (disjoint) match are given.
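As a rough illustration of the attributes described in that excerpt, here is a small base R sketch on a made-up string. The strings x and y, the pattern and all variable names are assumptions for the example, not part of the original question.

```r
x <- "rate 4.8% fixed"

# perl = TRUE is needed for regexpr() to report the capture attributes
m <- regexpr("(\\d+)\\.(\\d+)%", x, perl = TRUE)

match_start  <- as.integer(m)             # where the whole match begins
match_length <- attr(m, "match.length")   # how many characters it spans
cap_start    <- attr(m, "capture.start")  # one column per capture group
cap_length   <- attr(m, "capture.length")

# Turn start/length pairs into the captured text (last is inclusive, hence - 1)
groups <- as.vector(substring(x, cap_start, cap_start + cap_length - 1))

# gregexpr() finds every match; regmatches() pulls the matched text out
y <- "0% for 2 years at 4.8%"
all_percents <- regmatches(y, gregexpr("[0-9.]+%", y))[[1]]
```

Here regexpr stops at the first match, while gregexpr paired with regmatches returns every percentage in the string.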
In your case I think you need to paste the lines together first:
full_line <- paste(text[, "text"], collapse = " ")
Then apply regexpr to full_line.
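Put together, the whole approach might look like the sketch below. It assumes the data frame from the question (renamed df here so it does not clash with the text column), writes the pound sign as "\u00a3" so the script does not depend on the file encoding, and ends with a hypothetical gender_lookup table: only the Mr to Male mapping comes from the question, the other entries are assumptions.

```r
# Data frame from the question, renamed to df; "\u00a3" is the pound sign
df <- data.frame(
  page = c(1, 1, 2, 3), sen = c(1, 2, 1, 1),
  text = c("Dear Mr case 1",
           "the value of my property is \u00a3500,000.00 and it was built in 1980",
           "The protected percentage is 0% for 2 years",
           "The interest rate is fixed for 2 years at 4.8%"),
  stringsAsFactors = FALSE
)

# Pattern from the question, with the pound sign written as an escape
patt <- paste0(
  "dear\\s+(mr|mrs|miss|ms)\\b[^\u00a3]+(\u00a3[\\d,.]+)",
  "(?:\\D|\\d(?![\\d.]*%))+([\\d.]+%)",
  "(?:\\D|\\d(?![\\d.]*%))+([\\d.]+%)"
)

# Join the sentences so one match can span all of them
full_line <- paste(df[, "text"], collapse = " ")
m <- regexpr(patt, full_line, perl = TRUE, ignore.case = TRUE)

# Capture positions to text; last is inclusive, hence the - 1
starts <- attr(m, "capture.start")
ends   <- starts + attr(m, "capture.length") - 1
fields <- as.vector(substring(full_line, starts, ends))

# Hypothetical lookup from title to gender; only "Mr" -> "Male" is from the question
gender_lookup <- c(mr = "Male", mrs = "Female", miss = "Female", ms = "Female")
gender <- unname(gender_lookup[tolower(fields[1])])
```

The fields vector then holds the title, property value, protected percentage and interest rate in that order, and gender holds the value looked up from the title.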