I guess this is a common problem, and I found quite a lot of webpages, including some from SO, but I failed to understand how to implement it.
I am new to regex, and I'd like to use it in R to extract the first few words from a sentence.
For example, if my sentence is
z = "I love stack overflow it is such a cool site"
I'd like my output to be (if I need the first four words)
[1] "I love stack overflow"
or (if I need the last four words)
[1] "such a cool site"
Of course, the following works:
paste(strsplit(z," ")[[1]][1:4],collapse=" ")
paste(strsplit(z," ")[[1]][7:10],collapse=" ")
but I'd like to try a regex solution for performance reasons, as I need to deal with very large files (and also for the sake of learning about it).
I looked at several links, including
Regex to extract first 3 words from a string and
http://osherove.com/blog/2005/1/7/using-regex-to-return-the-first-n-words-in-a-string.html
So I tried things like
gsub("^((?:\S+\s+){2}\S+).*",z,perl=TRUE)
Error: '\S' is an unrecognized escape in character string starting ""^((?:\S"
I tried other things, but they usually returned either the whole string or the empty string.
Another problem with strsplit is that it returns a list. It looks like the [[]] operator may be slowing things down a bit (??) when dealing with large files and using the apply functions.
It looks like the syntax used in R is somewhat different?
Thanks!
You've already accepted an answer, but I'm going to share this as a means of helping you understand a little more about regex in R, since you were actually very close to getting the answer on your own.
There are two problems with your gsub approach:
- You used single backslashes (\). R requires you to escape those since they are special characters in string literals. You escape them by adding another backslash (\\). If you do nchar("\\"), you'll see that it returns 1.
- You didn't specify what the replacement should be. Here, we don't want to replace anything, but we want to capture a specific part of the string. You capture groups in parentheses (...), and then you can refer to them by the number of the group. Here, we have just one group, so we refer to it as "\\1".
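Both points can be checked directly at the console. This is only a small illustration using the z from the question; the one-word pattern is made up for the demonstration:

```r
z <- "I love stack overflow it is such a cool site"

# A backslash inside an R string literal is written as "\\",
# but the string itself holds a single character.
nchar("\\")          # 1
writeLines("\\S+")   # prints \S+ (what the regex engine receives)

# "\\1" in the replacement refers to the first capturing group.
# Illustrative pattern: capture the first word, match the rest, keep the group.
first_word <- sub("^(\\S+).*", "\\1", z, perl = TRUE)
first_word
# [1] "I"
```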
You should have tried something like:
sub("^((?:\\S+\\s+){2}\\S+).*", "\\1", z, perl = TRUE)
# [1] "I love stack"
This is essentially saying:
- Work from the start of the contents of "z".
- Start creating group 1.
- Find non-whitespace (like a word) followed by whitespace (\S+\s+) two times ({2}) and then the next set of non-whitespaces (\S+). This will get us 3 words, without also getting the whitespace after the third word. Thus, if you wanted a different number of words, change the {2} to be one less than the number you are actually after.
- End group 1 there.
- Match whatever is left in the string (.*), so that the whole string is what gets replaced.
- Then, just return the contents of group 1 (\1) from "z".
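The "one less than the number you are after" rule can be wrapped in a small function. This is a minimal sketch; first_n_words is a hypothetical helper name, not something from base R, and it assumes the input has no leading whitespace:

```r
# Hypothetical helper: keep the first n whitespace-separated words.
first_n_words <- function(x, n) {
  n <- as.integer(n)
  stopifnot(n >= 1L)
  # (n - 1) repetitions of "word + whitespace", then one final word
  pattern <- sprintf("^((?:\\S+\\s+){%d}\\S+).*", n - 1L)
  sub(pattern, "\\1", x, perl = TRUE)
}

z <- "I love stack overflow it is such a cool site"
first_n_words(z, 4)
# [1] "I love stack overflow"
```

Because sub() is vectorized, the same call works on a whole character vector and returns a character vector rather than a list. A string with fewer than n words does not match the pattern and is returned unchanged.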
To get the last three words, just switch the position of the capturing group and put it at the end of the pattern to match.
sub("^.*\\s+((?:\\S+\\s+){2}\\S+)$", "\\1", z, perl = TRUE)
# [1] "a cool site"
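The same idea can be parameterized for the end of the string. Again a sketch only; last_n_words is a hypothetical helper name, and it assumes there is no trailing whitespace:

```r
# Hypothetical helper: keep the last n whitespace-separated words.
last_n_words <- function(x, n) {
  n <- as.integer(n)
  stopifnot(n >= 1L)
  # greedy ".*" eats the front; the group must then reach the end of the string
  pattern <- sprintf("^.*\\s+((?:\\S+\\s+){%d}\\S+)$", n - 1L)
  sub(pattern, "\\1", x, perl = TRUE)
}

z <- "I love stack overflow it is such a cool site"
last_n_words(z, 4)
# [1] "such a cool site"
```

If the string has n words or fewer, the pattern does not match and the input comes back unchanged.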
To get the first four words with stringr (here x is the input string, e.g. z from the question):
library(stringr)
str_extract(x, "^\\s*(?:\\S+\\s+){3}\\S+")
To get the last four:
str_extract(x, "(?:\\S+\\s+){3}\\S+(?=\\s*$)")
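If stringr is not available, the same two patterns appear to work with base R's regexpr() and regmatches(), as long as perl = TRUE is set for the lookahead. A sketch, using z from the question as the input:

```r
z <- "I love stack overflow it is such a cool site"

# First four words: extract the matched text instead of replacing around it.
first4 <- regmatches(z, regexpr("^\\s*(?:\\S+\\s+){3}\\S+", z, perl = TRUE))
first4
# [1] "I love stack overflow"

# Last four words: the lookahead (?=\\s*$) requires the match to end the string.
last4 <- regmatches(z, regexpr("(?:\\S+\\s+){3}\\S+(?=\\s*$)", z, perl = TRUE))
last4
# [1] "such a cool site"
```

One difference to keep in mind: regmatches() drops elements that do not match, so on a vector the result can be shorter than the input, whereas str_extract() returns NA for those elements.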