I need to go through a large list of URL strings and extract the domain name from each.
For example:
http://www.stackoverflow.com/questions would extract www.stackoverflow.com
I originally was using new URL(theUrlString).getHost()
but the URL object initialization adds a lot of time to the process and seems unneeded.
Is there a faster method to extract the host name that would be as reliable?
Thanks
Edit: My mistake, yes, the www. would be included in the domain name in the example above. Also, these URLs may be http or https.
I wrote a method (see below) which extracts a URL's domain name using simple String matching. What it actually does is extract the bit between the first "://" (or index 0 if there's no "://" contained) and the first subsequent "/" (or index String.length() if there's no subsequent "/"). The remaining, preceding "www(_)*." bit is chopped off. I'm sure there'll be cases where this won't be good enough, but it should be good enough in most cases!
I read here that the java.net.URI class could do this (and was preferred to the java.net.URL class), but I encountered problems with the URI class. Notably, URI.getHost() gives a null value if the URL does not include the scheme, i.e. the "http(s)" bit.
You could try to use regular expressions.
http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html
Here is a question about extracting domain name with regular expressions in Java:
Regular expression to retrieve domain.tld
You could write a regexp? http:// is always the same, and then match everything until you get the first '/'.
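A minimal sketch of that regular-expression idea, assuming the scheme is optional and may be written in any case. The class name RegexHostExtractor and the method hostOf are hypothetical, and the pattern is only one possible choice. Note that it captures the whole authority part, so a user name or port number would still be included in the result.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not taken from any answer in this thread.
final class RegexHostExtractor {
    // Optional scheme followed by "://", then capture everything up to the
    // first '/', '?' or '#'. Compiled once, because compiling a pattern per
    // URL would dominate the running time on a large list.
    private static final Pattern AUTHORITY = Pattern.compile(
            "^(?:[a-z][a-z0-9+.-]*://)?([^/?#]*)", Pattern.CASE_INSENSITIVE);

    static String hostOf(String url) {
        Matcher m = AUTHORITY.matcher(url);
        // Every part of the pattern is optional, so find() succeeds on any
        // input; the null branch is purely defensive.
        return m.find() ? m.group(1) : null;
    }
}
```

Calling RegexHostExtractor.hostOf("http://www.stackoverflow.com/questions") returns "www.stackoverflow.com". Whether this is actually faster than new URL(...).getHost() would need to be measured on the real data.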
You want to be rather careful with implementing a "fast" way of unpicking URLs. There is a lot of potential variability in URLs that could cause a "fast" method to fail. For example:
The scheme (protocol) part can be written in any combination of upper and lower case letters; e.g. "http", "Http" and "HTTP" are equivalent.
The authority part can optionally include a user name and/or a port number, as in "http://you@example.com:8080/index.html".
Since DNS is case insensitive, the hostname part of a URL is also (effectively) case insensitive.
It is legal (though highly irregular) to %-encode unreserved characters in the scheme or authority components of a URL. You need to take this into account when matching (or stripping) the scheme, or when interpreting the hostname. A hostname with %-encoded characters is defined to be equivalent to one with the %-encoded sequences decoded.
Now, if you have total control of the process that generates the URLs you are stripping, you can probably ignore these niceties. But if they are harvested from documents or web pages, or entered by humans, you would be well advised to consider what might happen if your code encounters an "unusual" URL.
If your concern is the time taken to construct URL objects, consider using URI objects instead. Among other good things, URI objects don't attempt a DNS lookup of the hostname part.
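As a rough illustration of the URI suggestion, here is a minimal sketch using java.net.URI. The class name UriHostExtractor and the method hostOf are hypothetical; returning null for unparseable input and lower-casing the host are assumptions made for this example, not requirements from the question.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper, not taken from any answer in this thread.
final class UriHostExtractor {
    static String hostOf(String url) {
        try {
            URI uri = new URI(url.trim());
            // getHost() drops any user name and port number. It returns null
            // when the string has no scheme, because the text is then parsed
            // as a relative path rather than as an authority.
            String host = uri.getHost();
            // Host names are case insensitive, so normalise to lower case.
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            // Assumption: malformed input is reported as null, not rethrown.
            return null;
        }
    }
}
```

With this sketch, UriHostExtractor.hostOf("http://you@example.com:8080/index.html") returns "example.com", while a scheme-less string such as "www.example.com/path" yields null, so callers need a fallback for that case.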
Assuming that they're all well-formed URLs, but you don't know whether they'll be http://, https://, etc.
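The code that accompanied this answer is not reproduced here. Under the same assumption of well-formed input, a plain substring approach like the one described earlier in the thread might look roughly like the sketch below. The class name SubstringHostExtractor and the method hostOf are hypothetical, and the sketch deliberately ignores user names, port numbers and %-encoding.

```java
// Hypothetical helper, not the original answer's code.
final class SubstringHostExtractor {
    static String hostOf(String url) {
        // Start after the first "://", or at index 0 if there is none.
        int schemeEnd = url.indexOf("://");
        int start = (schemeEnd < 0) ? 0 : schemeEnd + 3;
        // Stop at the first subsequent '/', or at the end of the string.
        int end = url.indexOf('/', start);
        if (end < 0) {
            end = url.length();
        }
        return url.substring(start, end);
    }
}
```

For "http://www.stackoverflow.com/questions" this returns "www.stackoverflow.com". One known weakness: a scheme-less string that contains "://" later on, for instance inside a query parameter, would be cut at the wrong place.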
Try the getDomainFromUrl() method in that class.