Remove subdomain from string in Ruby

I'm looping over a series of URLs and want to clean them up. I have the following code:

# Parse url to remove http, path and check format
o_url = URI.parse(node.attributes['href'])

# Remove www
new_url = o_url.host.gsub('www.', '').strip

How can I extend this to remove the subdomains that exist in some URLs?

Tags: ruby url dns subdomain uri

8 answers

Evening l夕情丶

Reply #2 · 2019-01-24 12:30

This is a tricky issue. Some top-level domains do not accept registrations at the second level.

Compare example.com and example.co.uk. If you simply stripped everything except the last two labels, you would end up with example.com and co.uk, and the latter can never be the intention.

Firefox solves this by filtering by effective top-level domain, and they maintain a list of all these domains. More information at publicsuffix.org.

You can use this list to filter out everything except the domain right next to the effective TLD. I don't know of any Ruby library that does this, but it would be a great idea to release one!

Update: there are C, Perl and PHP libraries that do this. Given the C version, you could create a Ruby extension. Alternatively, you could port the code to Ruby.
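To make the idea concrete, here is a minimal Ruby sketch of "keep only the label next to the effective TLD". It assumes a tiny hand-picked sample of suffixes; the real list at publicsuffix.org is far longer and also contains wildcard and exception rules that this sketch ignores. The names SAMPLE_SUFFIXES and registrable_domain are made up for illustration.

```ruby
require 'set'

# Hypothetical, hand-picked sample of effective TLDs (assumption: the real
# Public Suffix List is much larger and has wildcard/exception rules).
SAMPLE_SUFFIXES = Set.new(%w[com org uk co.uk org.uk us oh.us k12.oh.us]).freeze

# Returns the longest matching suffix plus the one label to its left,
# or nil when the host is itself a suffix or no sample suffix matches.
def registrable_domain(host)
  labels = host.downcase.split('.')
  return nil if SAMPLE_SUFFIXES.include?(labels.join('.'))

  # Drop one leading label at a time, so the longest suffix is tried first.
  (1...labels.length).each do |i|
    suffix = labels[i..-1].join('.')
    return labels[(i - 1)..-1].join('.') if SAMPLE_SUFFIXES.include?(suffix)
  end
  nil
end

puts registrable_domain('www.example.co.uk')
puts registrable_domain('mylocalschool.k12.oh.us')
```

Trying the longest candidate suffix first is what keeps example.co.uk from collapsing to co.uk.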


不美不萌又怎样

Reply #3 · 2019-01-24 12:37

I've wrestled with this a lot in writing various and sundry crawlers and scrapers over the years. My favorite gem for solving this is FuzzyUrl by Pete Gamache (https://github.com/gamache/fuzzyurl). It's available for Ruby, JavaScript and Elixir.


Bombasti

Reply #4 · 2019-01-24 12:41

Why not just strip the .com or .co.uk and then split on '.' and get the last element?

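# sub needs a replacement string; strip the suffix, take the last label, then re-append the suffix captured in $1
some_url.host.sub(/(\.co\.uk|\.[^.]*)$/, '').split('.')[-1] + $1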

Have to say it feels hacky. Are there any other domains like .co.uk?


你好瞎i

Reply #5 · 2019-01-24 12:42

The regular expression you'll need here can be a bit tricky, because hostnames can be arbitrarily complex: you could have multiple subdomains (e.g. foo.bar.baz.com), or the top-level domain (TLD) can have multiple parts (e.g. www.baz.co.uk).

Ready for a complex regular expression? :)

re = /^(?:(?>[a-z0-9-]*\.)+?|)([a-z0-9-]+\.(?>[a-z]*(?>\.[a-z]{2})?))$/i
new_url = o_url.host.gsub(re, '\1').strip

Let's break this into two sections. ^(?:(?>[a-z0-9-]*\.)+?|) collects the subdomains by matching one or more groups of characters followed by a dot. The +? quantifier is lazy, so it takes as few groups as it can, and backtracking extends it until the rest of the hostname fits the second section. The empty alternation is needed in the case of no subdomain (such as foo.com). ([a-z0-9-]+\.(?>[a-z]*(?>\.[a-z]{2})?))$ collects the actual hostname and the TLD. It allows either a one-part TLD (like .info, .com or .museum) or a two-part TLD whose second part is two characters long (like .oh.us or .org.uk).

I tested this expression on the following samples:

foo.com => foo.com
www.foo.com => foo.com
bar.foo.com => foo.com
www.foo.ca => foo.ca
www.foo.co.uk => foo.co.uk
a.b.c.d.e.foo.com => foo.com
a.b.c.d.e.foo.co.uk => foo.co.uk

Note that this regex will not properly match hostnames that have more than two "parts" to the TLD!
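For anyone who wants to try the pattern, here is a small sketch that wraps it in a helper and runs it over a few hosts. The regex is copied unchanged from above; the helper name strip_subdomains and the constant name RE are made up for illustration, and mylocalschool.k12.oh.us is included only as an example of a three-part suffix.

```ruby
# The pattern from this answer, unchanged.
RE = /^(?:(?>[a-z0-9-]*\.)+?|)([a-z0-9-]+\.(?>[a-z]*(?>\.[a-z]{2})?))$/i

# Hypothetical helper: reduce a host to the part captured by group 1.
def strip_subdomains(host)
  host.gsub(RE, '\1').strip
end

%w[foo.com www.foo.com a.b.c.d.e.foo.co.uk mylocalschool.k12.oh.us foo.co.uk].each do |host|
  puts "#{host} => #{strip_subdomains(host)}"
end
```

The last two hosts show where the pattern falls short. The three-part suffix reduces to k12.oh.us and loses the registered name. A bare foo.co.uk, with no subdomain in front, reduces to co.uk, because the lazy subdomain group takes "foo." before the empty alternative is tried. A list-based approach avoids both cases.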


在下西门庆

Reply #6 · 2019-01-24 12:42

Detecting the subdomain of a URL is non-trivial to do in a general sense - it's easy if you just consider the basic ones, but once you get into international territory this becomes tricky.

Edit: Consider stuff like http://mylocalschool.k12.oh.us et al.


我欲成王，谁敢阻挡

Reply #7 · 2019-01-24 12:45

For posterity, here's an update from Oct 2014:

I was looking for a more up-to-date dependency to rely on and found the public_suffix gem (available on RubyGems and GitHub). It's actively maintained and handles the top-level domain and nested-subdomain issues by maintaining a list of the known public suffixes.

In combination with URI.parse for stripping protocol and paths, it works really well:

❯❯❯ 2.1.2 ❯ PublicSuffix.parse(URI.parse('https://subdomain.google.co.uk/path/on/path').host).domain
=> "google.co.uk"

