I'm looping over a series of URLs and want to clean them up. I have the following code:
# Parse url to remove http, path and check format
o_url = URI.parse(node.attributes['href'])
# Remove www
new_url = o_url.host.gsub('www.', '').strip
How can I extend this to remove the subdomains that exist in some URLs?
Something like:
You still need to add all (root) domains you consider root domain. So '.uk' might be the root domain, but you probably want to keep the host just before the '.co.uk' part.
I just wrote a library to do this called Domainatrix. You can find it here: http://github.com/pauldix/domainatrix