I want to use a regex to extract all URLs from text in a data frame into a new column. I have some older code that I used to extract keywords, so I'm looking to adapt it for a regex. I want to save the regex as a string variable and apply it here:
data$ContentURL <- apply(sapply(regex, grepl, data$Content, fixed=FALSE), 1, function(x) paste(selection[x], collapse=','))
It seems that fixed=FALSE should tell grepl that it's a regular expression, but R doesn't like how I am trying to save the regex as:
regex <- "http.*?1-\\d+,\\d+"
My data is organized in a data frame like this:
data <- read.table(text='"Content" "date"
1 "a house a home https://www.foo.com" "12/31/2013"
2 "cabin ideas https://www.example.com in the woods" "5/4/2013"
3 "motel is a hotel" "1/4/2013"', header=TRUE)
And would hopefully look like:
  Content                                          date       ContentURL
1 a house a home https://www.foo.com               12/31/2013 https://www.foo.com
2 cabin ideas https://www.example.com in the woods 5/4/2013   https://www.example.com
3 motel is a hotel                                 1/4/2013
Hadleyverse solution (stringr package) with a decent URL pattern:
library(stringr)
url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
data$ContentURL <- str_extract(data$Content, url_pattern)
data
##   Content                                          date       ContentURL
## 1 a house a home https://www.foo.com               12/31/2013 https://www.foo.com
## 2 cabin ideas https://www.example.com in the woods 5/4/2013   https://www.example.com
## 3 motel is a hotel                                 1/4/2013   <NA>
You can use str_extract_all if there are multiple URLs in Content, but that will involve some extra processing on your end afterwards.
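As a rough sketch of that extra processing: str_extract_all returns a list with one character vector per row, so each vector can be collapsed into a single comma-separated string. The data frame df and the simplified pattern simple_url below are made up for illustration and are not part of the original answer; the fuller url_pattern above could be substituted.

```r
library(stringr)

# Hypothetical sample data with more than one URL in a single row
df <- data.frame(
  Content = c("see https://www.foo.com and https://www.example.com",
              "motel is a hotel"),
  stringsAsFactors = FALSE
)

# Simplified pattern (an assumption for this sketch): scheme plus non-whitespace
simple_url <- "https?://\\S+"

# One character vector of matches per row; character(0) when nothing matches
matches <- str_extract_all(df$Content, simple_url)

# Collapse each vector into one string; rows without a URL become ""
df$ContentURL <- vapply(matches, paste, character(1), collapse = ",")
df$ContentURL
```

Rows without a match end up as an empty string rather than NA, which happens to match the layout asked for in the question.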
Here's one approach using the qdapRegex library:
library(qdapRegex)
data[["url"]] <- unlist(rm_url(data[["Content"]], extract=TRUE))
data
##   Content                                          date       url
## 1 a house a home https://www.foo.com               12/31/2013 https://www.foo.com
## 2 cabin ideas https://www.example.com in the woods 5/4/2013   https://www.example.com
## 3 motel is a hotel                                 1/4/2013   <NA>
To see the regular expression used by the function (as qdapRegex aims to help analyze and educate about regexes), you can use the grab function with the function name prefixed with @:
grab("@rm_url")
## [1] "(http[^ ]*)|(ftp[^ ]*)|(www\\.[^ ]*)"
grepl gives you a logical output: TRUE if the string contains a match, FALSE if it does not. grep gives you either the indexes of the matching elements or their values, but the values are the whole strings, not the substrings you want.
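The difference is easy to see side by side in base R. This is only a small illustration: the vector x and the simplified pattern are made up for the sketch, and regmatches with regexpr is shown as one base way to get the matched substring itself.

```r
# Hypothetical input and a simplified pattern, both assumptions for this sketch
x <- c("a house a home https://www.foo.com", "motel is a hotel")
pattern <- "https?://[^ ]+"

# grepl: one logical per element
has_url <- grepl(pattern, x)                  # TRUE FALSE

# grep: indexes of the matching elements
idx <- grep(pattern, x)                       # 1

# grep with value = TRUE: the whole matching string, not just the URL
whole <- grep(pattern, x, value = TRUE)

# regmatches + regexpr: only the matched substring, for elements that matched
url_only <- regmatches(x, regexpr(pattern, x))  # "https://www.foo.com"
```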
So to pass this regex along to base R or the stringi package (qdapRegex wraps stringi for extraction), you could do:
regmatches(data[["Content"]], gregexpr(grab("@rm_url"), data[["Content"]], perl = TRUE))
library(stringi)
stri_extract(data[["Content"]], regex=grab("@rm_url"))
I'm sure there's a stringr approach too but am not familiar with the package.
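For a base-only route, the list that regmatches returns can be turned into a column. The following is a minimal sketch, assuming the pattern string printed by grab("@rm_url") above is pasted in directly (so qdapRegex is not needed) and that several URLs in one row should be joined with commas; the helper name collapse_or_na is hypothetical.

```r
# Sample data rebuilt from the question
data <- data.frame(
  Content = c("a house a home https://www.foo.com",
              "cabin ideas https://www.example.com in the woods",
              "motel is a hotel"),
  date = c("12/31/2013", "5/4/2013", "1/4/2013"),
  stringsAsFactors = FALSE
)

# Pattern copied from the grab("@rm_url") output shown above
pattern <- "(http[^ ]*)|(ftp[^ ]*)|(www\\.[^ ]*)"

# gregexpr finds every match; regmatches returns one character vector per row
found <- regmatches(data$Content, gregexpr(pattern, data$Content, perl = TRUE))

# Hypothetical helper: join multiple matches, or return NA when there are none
collapse_or_na <- function(x) {
  if (length(x) == 0) NA_character_ else paste(x, collapse = ",")
}

data$ContentURL <- vapply(found, collapse_or_na, character(1))
data$ContentURL
```

Using vapply rather than unlist keeps the result at exactly one value per row, even when a row has no URL or more than one.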
Split on space then find "http":
data$ContentURL <- unlist(sapply(strsplit(as.character(data$Content), split = " "),
                                 function(i) {
                                   # keep only the tokens that contain "http"
                                   x <- i[grepl("http", i)]
                                   # use NA when a row has no URL
                                   if (length(x) == 0) x <- NA
                                   x
                                 }))
data
#   Content                                          date       ContentURL
# 1 a house a home https://www.foo.com               12/31/2013 https://www.foo.com
# 2 cabin ideas https://www.example.com in the woods 5/4/2013   https://www.example.com
# 3 motel is a hotel                                 1/4/2013   <NA>