Domain name regex

Posted 2020-04-17 02:24

Question:

I'm trying to extract the domain name from a URL. For example:

x <- "https://stackoverflow.com/questions/ask"

to: stackoverflow.com

I found the following regex in this question: "regex match main domain name".

regex <- "([0-9A-Za-z]{2,}\\[0-9A-Za-z]{2,3}\\[0-9A-Za-z]{2,3}|[0-9A-Za-z]{2,}\\[0-9A-Za-z]{2,3})$"

But R doesn't seem to understand it when I try to use str_extract from the stringr package.

x2 <- str_extract(x, regex)

Answer 1:

Why not use parseURI from the XML package? It breaks a URL into its different elements.

x <- "http://stackoverflow.com/questions/ask"
library(XML)
parseURI(x)$server
# [1] "stackoverflow.com"
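
If pulling in the XML package is not an option, the same host extraction can be approximated in base R. This is only a sketch: extract_host is a hypothetical helper name, and it assumes the input has a scheme ("http://", "https://", ...) and no "user@" credentials. If the pattern does not match, sub() returns the input unchanged.

```r
# Capture everything between "scheme://" and the next "/", ":", "?" or "#".
# Assumption: the URL starts with a scheme and contains no userinfo part.
extract_host <- function(url) {
  sub("^[A-Za-z][A-Za-z0-9+.-]*://([^/:?#]+).*$", "\\1", url)
}

x <- "https://stackoverflow.com/questions/ask"
extract_host(x)
# [1] "stackoverflow.com"
```

Like parseURI, this returns the full host, so "www.google.com" keeps its "www." prefix.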


Answer 2:

TLD extraction is not as simple as you might think. There's a nice list of what are deemed "public TLDs", i.e. what are effectively true top-level domains. I work with these every day (mining domains for cybersecurity).

We've got a tldextract R package (more info here) that does a great job parsing these for further data mining. You can use parse_url from httr to extract the hostname component, then run our tldextract function over it:

library(httr)
library(rvest)
library(tldextract)

# get some URLs - I encourage you to bump up "10" to "100" or more to see how
# tldextract deals with "public TLDs"
# (read_html() replaces the html() function that older rvest versions provided)
pg <- read_html("http://httparchive.org/urls.php?start=1&end=10")

# clean up the <pre> output and make it a character list
urls <- pg %>% html_nodes("pre") %>% html_text() %>% strsplit("\n") %>% unlist
urls <- urls[urls != ""] # that site has a blank first line we don't need

# extract the hostname part
urls <- as.character(unlist(sapply(lapply(urls, parse_url), "[", "hostname")))
urls

##  [1] "www.google.com"    "www.facebook.com"  "www.youtube.com"  
##  [4] "www.yahoo.com"     "www.baidu.com"     "www.wikipedia.org"
##  [7] "www.amazon.com"    "www.twitter.com"   "www.qq.com"       
## [10] "www.taobao.com"

# extract the TLDs
tlds <- tldextract(urls)
tlds

##                 host subdomain    domain tld
## 1     www.google.com       www    google com
## 2   www.facebook.com       www  facebook com
## 3    www.youtube.com       www   youtube com
## 4      www.yahoo.com       www     yahoo com
## 5      www.baidu.com       www     baidu com
## 6  www.wikipedia.org       www wikipedia org
## 7     www.amazon.com       www    amazon com
## 8    www.twitter.com       www   twitter com
## 9         www.qq.com       www        qq com
## 10    www.taobao.com       www    taobao com

# piece what we need together
sprintf("%s.%s", tlds$domain, tlds$tld)

##  [1] "google.com"    "facebook.com"  "youtube.com"   "yahoo.com"    
##  [5] "baidu.com"     "wikipedia.org" "amazon.com"    "twitter.com"  
##  [9] "qq.com"        "taobao.com"
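
To see why a suffix list matters, here is a rough base-R sketch of the idea, not the tldextract implementation. Both the registered_domain helper and the tiny suffixes vector are made up for illustration; real code would load the full list of public suffixes.

```r
# Hypothetical, heavily truncated suffix list (illustration only).
suffixes <- c("com", "org", "uk", "co.uk")

# Return the matched public suffix plus the one label to its left.
registered_domain <- function(host, suffixes) {
  labels <- strsplit(tolower(host), ".", fixed = TRUE)[[1]]
  n <- length(labels)
  # Start at the second label so the longest candidate suffix is tried first
  # and at least one label is always left over for the domain itself.
  for (i in seq_len(n)[-1]) {
    candidate <- paste(labels[i:n], collapse = ".")
    if (candidate %in% suffixes) {
      return(paste(labels[(i - 1):n], collapse = "."))
    }
  }
  NA_character_  # no known suffix found
}

hosts <- c("www.google.com", "www.bbc.co.uk")
vapply(hosts, registered_domain, character(1),
       suffixes = suffixes, USE.NAMES = FALSE)
# [1] "google.com" "bbc.co.uk"
```

A rule such as "keep the last two labels" would return "co.uk" for the second host, which is the kind of mistake the suffix list prevents.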


Answer 3:

Hi, this regex captures the domain name (without the TLD) in its first group:

.*(?:\.|\/)(.*)\..*

For example, given:

http://www.stackoverflow.com

the result is:

stackoverflow
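
In R, the pattern above could be applied with base sub() roughly as follows. This is a sketch: extract_name is a hypothetical helper name, backslashes are doubled because the pattern sits inside an R string, and perl = TRUE is passed as an assumption to make sure the non-capturing group (?:...) is handled.

```r
# Replace the whole URL with the first capture group of the pattern.
extract_name <- function(url) {
  sub(".*(?:\\.|/)(.*)\\..*", "\\1", url, perl = TRUE)
}

extract_name("http://www.stackoverflow.com")
# [1] "stackoverflow"
extract_name("https://stackoverflow.com/questions/ask")
# [1] "stackoverflow"

# Limitation: the group is whatever sits before the last dot, so multi-part
# suffixes (and paths that contain a dot) give the wrong label.
extract_name("http://www.bbc.co.uk")
# [1] "co"
```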



Tags: xml regex r xpath