Extract main domain name from a given url

2019-01-24 08:10发布

问题:

I used the following to extract the domain from a url: (They are test cases)

String regex = "^(ww[a-zA-Z0-9-]{0,}\\.)";
ArrayList<String> cases = new ArrayList<String>();
cases.add("www.google.com");
cases.add("ww.socialrating.it");
cases.add("www-01.hopperspot.com");
cases.add("wwwsupernatural-brasil.blogspot.com");
cases.add("xtop10.net");
cases.add("zoyanailpolish.blogspot.com");

for (String t : cases) {  
    String res = t.replaceAll(regex, "");  
}

I can get the following results:

google.com
hopperspot.com
socialrating.it
blogspot.com
xtop10.net
zoyanailpolish.blogspot.com

The first four cases are good. The last one is not good. What I want is: blogspot.com for the last one, but it gives zoyanailpolish.blogspot.com. What am I doing wrong?

回答1:

As suggested by BalusC and others the most practical solution would be to get a list of TLDs (see this list), save them to a file, load them and then determine what TLD is being used by a given url String. From there on you could constitute the main domain name as follows:

    String url = "zoyanailpolish.blogspot.com";

    String tld = findTLD( url ); // To be implemented. Add to helper class ?

    url = url.replace( "." + tld,"");  

    int pos = url.lastIndexOf('.');

    String mainDomain = "";

    if (pos > 0 && pos < url.length() - 1) {
        mainDomain = url.substring(pos + 1) + "." + tld;
    }
    // else: Main domain name comes out empty

The implementation details are left up to you.



回答2:

Using Guava library, we can easily get domain name:

InternetDomainName.from(tld).topPrivateDomain()

Refer API link for more details

https://google.github.io/guava/releases/14.0/api/docs/

http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/net/InternetDomainName.html



回答3:

Obtain the host through REGEX is pretty complicated or impossible because TLD's don't obey to simple rules but are provided by ICANN and change in time.

You should use instead the functionality provided by JAVA library like this:

URL myUrl = new URL(urlString);
myUrl.getHost();


回答4:

This is 2013 and solution I found is straight forward:

System.out.println(InternetDomainName.fromLenient(uriHost).topPrivateDomain().name());


回答5:

It is much simpler:

  try {
        String domainName = new URL("http://www.zoyanailpolish.blogspot.com/some/long/link").getHost();

        String[] levels = domainName.split("\\.");
        if (levels.length > 1)
        {
            domainName = levels[levels.length - 2] + "." + levels[levels.length - 1];
        }

        // now value of domainName variable is blogspot.com
    } catch (Exception e) {}


回答6:

The reason why your are seeing zoyanailpolish.blogspot.com is that your regex finds only strings that start with a 'ww'. What you are asking is that in addition to removing all strings that start with a 'ww' , it should also work for a string starting with 'zoyanailpolish' (?). In that case , use the regex String regex = "^((ww|z|a)[a-zA-Z0-9-]{0,}\\.)"; This will remove any word that starts with a 'ww' or 'z' or 'a'. Customize it based on what you need exactly.