Searching for unique values in dataframe and creat

2020-02-14 08:03发布

问题:

Since I started using R< not long ago, I've found this site very useful in helping me build my scripts. I have yet again came across a challenge for which I can't seem to find an answer anywhere. Here is my problem: In my data I have a column which contains a different URL in each row. In each of those URL's there is a particular piece of information I want to extract. Currently I do it in excel because I've been told it's impossible to do in R and that no function exists to do it.

The URL will look like this example format and it will be found in the "source" column

http://www.googleclick.com?utm_source=ADX&ID56789

http://www.googleclick.com?utm_source=ADW&ID56009

http://www.googleclick.com?utm_source=ADWords&ID53389

The part of the URL that is of importance to me is the "utm_source=ADX" bit .My data looks something like this:

User / Source

1 / http://www.googleclick.com?utm_source=ADX&ID56789

2 / http://www.googleclick.com?utm_source=ADW&ID56009

3 / http://www.googleclick.com?utm_source=ADWords&ID53389

What I need to do is to capture the utm_source from the URL and transpose the information into a different column, example below:

User / Source / utm_source

1 / googleclick / ADX&ID56789

2 / googleclick / ADW&ID56009

3 / googleclick / ADWords&ID53389

So in essence I need R to search in the entire dataframe for the value "utm_source=" and once it has found them, I want it to transpose the "utm_source=" value into a column name and to copy all the information that comes after "=" in a that column for each individual row. I know that "grep" is a function that locates a specific piece of information in the datafreme , for example data <- total[grepl("utm_source", total$Source), ]. This will give me all the rows that contain the word "utm_source" but what I need is the information that comes after " utm_source". Usually my data can have as many as 500.000 rows. At the moment I use the excel function "text to columns" for this, and I basically split the URL's into little bits and keep the columns that I need, but this can be a very messy and lengthy process.

Is there a way to modify the grepl function to meet the criteria I need?

回答1:

Nothing is impossible.

x <- read.csv(text="
User, Source
1, http://www.googleclick.com?utm_source=ADX&ID56789
2, http://www.googleclick.com?utm_source=ADW&ID56009
3, http://www.googleclick.com?utm_source=ADWords&ID53389
", header=TRUE, stringsAsFactors=FALSE)

First, use strsplit

strsplit(x$Source, split="\\?utm_source=")
[[1]]
[1] " http://www.googleclick.com" "ADX&ID56789"                

[[2]]
[1] " http://www.googleclick.com" "ADW&ID56009"                

[[3]]
[1] " http://www.googleclick.com" "ADWords&ID53389" 

Then find a red-hot poker and stick in the eye of your so-called advisor.


EDIT:

As suggested by Paul Hiemstra, you can also use a regular expression directly:

gsub(".*\\?utm_source=", "", x$Source)
[1] "ADX&ID56789"     "ADW&ID56009"     "ADWords&ID53389"


回答2:

@Andrie's answer does the trick. Here's another way using using regmatches and gregexpr that might be useful.

d <- read.table(text="User URL
1 http://www.googleclick.com?utm_source=ADX&ID56789
2 http://www.googleclick.com?utm_source=ADW&ID56009
3 http://www.googleclick.com?utm_source=ADWords&ID53389", header=TRUE)

domain.pat <- '((?<=www.)([[:alnum:]_]+))'
source.pat <- '((?<=utm_source=)([[:alnum:]&]+))' # exclude the '&' here to only grab up to the '&'
all.matches <- gregexpr(paste(domain.pat, source.pat, sep='|'), d$URL, perl=TRUE)
all.substrings <- regmatches(d$URL, all.matches)
do.call(rbind, all.substrings)

#      [,1]          [,2]             
# [1,] "googleclick" "ADX&ID56789"    
# [2,] "googleclick" "ADW&ID56009"    
# [3,] "googleclick" "ADWords&ID53389"