Question:

I want to use a regex to extract all URLs from text in a data frame into a new column. I have some older code that I used to extract keywords, so I'm looking to adapt it for a regex. I want to save the regex as a string variable and apply it here:

data$ContentURL <- apply(sapply(regex, grepl, data$Content, fixed=FALSE), 1, function(x) paste(selection[x], collapse=','))

It seems that fixed=FALSE should tell grepl that it's a regular expression, but R doesn't like how I am trying to save the regex:

regex <- "http.*?1-\\d+,\\d+"

My data is organized in a data frame like this:

data <- read.table(text='"Content"     "date"   
 1     "a house a home https://www.foo.com"     "12/31/2013"
 2     "cabin ideas https://www.example.com in the woods"     "5/4/2013"
 3     "motel is a hotel"   "1/4/2013"', header=TRUE)

And would hopefully look like:

                                           Content       date              ContentURL
1               a house a home https://www.foo.com 12/31/2013     https://www.foo.com
2 cabin ideas https://www.example.com in the woods   5/4/2013 https://www.example.com
3                                 motel is a hotel   1/4/2013

Answer 1:

Hadleyverse solution (stringr package) with a decent URL pattern:

library(stringr)

url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"

data$ContentURL <- str_extract(data$Content, url_pattern)

data

##                                            Content       date              ContentURL
## 1               a house a home https://www.foo.com 12/31/2013     https://www.foo.com
## 2 cabin ideas https://www.example.com in the woods   5/4/2013 https://www.example.com
## 3                                 motel is a hotel   1/4/2013                    <NA>

You can use str_extract_all if there are multiple URLs in Content, but it returns a list rather than a character vector, so that will involve some extra processing on your end afterwards.
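
As a rough sketch of that extra processing, the list returned by str_extract_all can be collapsed into one comma-separated string per row. The sample vector `content`, the simplified `url_pattern`, and the name `content_url` are assumptions made for illustration; they are not part of the original answer.

```r
library(stringr)

# Hypothetical sample data: the first row contains two URLs
content <- c("see https://www.foo.com and https://www.bar.com",
             "cabin ideas https://www.example.com in the woods",
             "motel is a hotel")

# Simplified pattern for illustration only: "http" or "https",
# then "://", then any run of non-whitespace characters
url_pattern <- "https?://[^\\s]+"

# str_extract_all returns a list with one character vector per input string
matches <- str_extract_all(content, url_pattern)

# Collapse each vector into a single string; use NA when nothing matched
content_url <- vapply(matches, function(m) {
  if (length(m) == 0) NA_character_ else paste(m, collapse = ",")
}, character(1))

content_url
```

The result has one element per input row, so it can be assigned directly to a new data frame column such as data$ContentURL.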

Answer 2:

Here's one approach using the qdapRegex library:

library(qdapRegex)
data[["url"]] <- unlist(rm_url(data[["Content"]], extract=TRUE))
data

##                                            Content       date                     url
## 1               a house a home https://www.foo.com 12/31/2013     https://www.foo.com
## 2 cabin ideas https://www.example.com in the woods   5/4/2013 https://www.example.com
## 3                                 motel is a hotel   1/4/2013                    <NA>

To see the regular expression used by the function (qdapRegex aims to help analyze and educate about regexes), you can use the grab function with the function name prefixed with @:

grab("@rm_url")

## [1] "(http[^ ]*)|(ftp[^ ]*)|(www\\.[^ ]*)"

grepl gives you a logical output: yes, this string contains a match, or no, it does not. grep gives you the indexes or the values, but the values are the whole string, not the substring you want.

So to pass this regex along to base R or the stringi package (qdapRegex wraps stringi for extraction) you could do:

regmatches(data[["Content"]], gregexpr(grab("@rm_url"), data[["Content"]], perl = TRUE))

library(stringi)
stri_extract(data[["Content"]], regex=grab("@rm_url"))

I'm sure there's a stringr approach too but am not familiar with the package.
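
To make the difference between grepl, grep, and regmatches concrete, here is a small base R sketch. It assumes a hypothetical `content` vector that mirrors the question's data and writes the pattern printed by grab("@rm_url") out literally, so qdapRegex is not needed to run it.

```r
# Hypothetical sample text mirroring the question's data
content <- c("a house a home https://www.foo.com",
             "cabin ideas https://www.example.com in the woods",
             "motel is a hotel")

# The pattern shown above for rm_url, written out as a plain string
url_regex <- "(http[^ ]*)|(ftp[^ ]*)|(www\\.[^ ]*)"

has_url   <- grepl(url_regex, content)               # logical: does each string match?
which_url <- grep(url_regex, content)                # indices of the matching strings
whole     <- grep(url_regex, content, value = TRUE)  # the whole matching strings

# regmatches + gregexpr returns only the matched substrings,
# as a list with one character vector per input string
urls <- regmatches(content, gregexpr(url_regex, content, perl = TRUE))

has_url
which_url
urls
```

Only the last call gives the substrings themselves; rows without a match come back as an empty character vector, which still needs converting to NA before it goes into a column.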

Answer 3:

Split on space then find "http":

data$ContentURL <- unlist(sapply(strsplit(as.character(data$Content), split = " "),
                                 function(i){
                                   x <- i[ grepl("http", i)]
                                   if(length(x) == 0) x <- NA
                                   x
                                 }))


data
#                                            Content       date              ContentURL
# 1               a house a home https://www.foo.com 12/31/2013     https://www.foo.com
# 2 cabin ideas https://www.example.com in the woods   5/4/2013 https://www.example.com
# 3                                 motel is a hotel   1/4/2013                    <NA>
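
The split-and-filter code above works when each row holds at most one token containing "http"; with two or more, unlist would return more values than there are rows and the column assignment would fail. A possible variant, sketched under the assumption that multiple URLs should be joined with commas, uses vapply so that every row yields exactly one string. The `content` vector and the stricter "^https?://" test are illustrative choices, not part of the original answer.

```r
# Hypothetical sample data: the first row contains two URLs
content <- c("see https://www.foo.com and https://www.bar.com",
             "cabin ideas https://www.example.com in the woods",
             "motel is a hotel")

# Split each string on single spaces
tokens <- strsplit(content, split = " ", fixed = TRUE)

# Keep tokens that start with http:// or https://, then collapse them;
# vapply guarantees exactly one character value per row
content_url <- vapply(tokens, function(tok) {
  hits <- tok[grepl("^https?://", tok)]
  if (length(hits) == 0) NA_character_ else paste(hits, collapse = ",")
}, character(1))

content_url
```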