Since I started using R< not long ago, I've found this site very useful in helping me build my scripts. I have yet again came across a challenge for which I can't seem to find an answer anywhere. Here is my problem: In my data I have a column which contains a different URL in each row. In each of those URL's there is a particular piece of information I want to extract. Currently I do it in excel because I've been told it's impossible to do in R and that no function exists to do it.
The URL will look like this example format and it will be found in the "source" column
http://www.googleclick.com?utm_source=ADX&ID56789
http://www.googleclick.com?utm_source=ADW&ID56009
http://www.googleclick.com?utm_source=ADWords&ID53389
The part of the URL that is of importance to me is the "utm_source=ADX" bit .My data looks something like this:
User / Source
1 / http://www.googleclick.com?utm_source=ADX&ID56789
2 / http://www.googleclick.com?utm_source=ADW&ID56009
3 / http://www.googleclick.com?utm_source=ADWords&ID53389
What I need to do is to capture the utm_source from the URL and transpose the information into a different column, example below:
User / Source / utm_source
1 / googleclick / ADX&ID56789
2 / googleclick / ADW&ID56009
3 / googleclick / ADWords&ID53389
So in essence I need R to search in the entire dataframe for the value "utm_source=" and once it has found them, I want it to transpose the "utm_source=" value into a column name and to copy all the information that comes after "=" in a that column for each individual row. I know that "grep" is a function that locates a specific piece of information in the datafreme , for example data <- total[grepl("utm_source", total$Source), ]. This will give me all the rows that contain the word "utm_source" but what I need is the information that comes after " utm_source". Usually my data can have as many as 500.000 rows. At the moment I use the excel function "text to columns" for this, and I basically split the URL's into little bits and keep the columns that I need, but this can be a very messy and lengthy process.
Is there a way to modify the grepl function to meet the criteria I need?
@Andrie's answer does the trick. Here's another way using using
regmatches
andgregexpr
that might be useful.Nothing is impossible.
First, use
strsplit
Then find a red-hot poker and stick in the eye of your so-called advisor.
EDIT:
As suggested by Paul Hiemstra, you can also use a regular expression directly: