Get just the domain name from a URL?

Posted 2020-01-25 07:25

Question:

I am trying to extract just the domain name from a URL string, and I almost have it. My first thought was to use a regex, but I decided to use the Uri class instead. This is the string:

http://www.google.com/url?sa=t&source=web&ct=res&cd=1&ved=0CAgQFjAA&url=http://www.test.com/&rct=j&q=test&ei=G2phS-HdJJWTjAfckvHJDA&usg=AFQjCNFSEAztaqtkaIvEzxmRm2uOARn1kQ

I need to convert the above to two forms, both without the www: google.com and google.

I did the following

Uri test = new Uri(referrer);
log.Info("Domain part : " + test.Host);

This returns www.google.com. I would like to return the two forms mentioned above, if possible:

google.com and google

Is this possible with the Uri class?

Answer 1:

Yes, it is possible. Use:

Uri.GetLeftPart( UriPartial.Authority )
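
As a minimal sketch of what this call yields compared with Uri.Host: GetLeftPart(UriPartial.Authority) keeps the scheme, so on its own it does not produce a bare domain. The class and method names below are hypothetical and exist only for illustration.

```csharp
using System;

// Hypothetical helper class, not part of the original answer.
public static class UriParts
{
    // Returns the scheme plus authority, e.g. "http://www.google.com".
    public static string GetAuthority(string url)
    {
        return new Uri(url).GetLeftPart(UriPartial.Authority);
    }

    // Returns only the host name, e.g. "www.google.com".
    public static string GetHost(string url)
    {
        return new Uri(url).Host;
    }
}
```

For "http://www.google.com/url?sa=t", GetAuthority returns "http://www.google.com" and GetHost returns "www.google.com", so neither removes the www by itself.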


Answer 2:

@Dewfy: flaw is that your method returns "uk" for "www.test.co.uk" but the domain here is clearly "test.co.uk".

@naivists: flaw is that your method returns "beta.microsoft.com" for "www.beta.microsoft.com" but the domain here is clearly "microsoft.com"

I needed the same thing, so I wrote a class that you can copy and paste into your solution. It uses a hard-coded string array of TLDs. http://pastebin.com/raw.php?i=VY3DCNhp

Console.WriteLine(GetDomain.GetDomainFromUrl("http://www.beta.microsoft.com/path/page.htm"));

outputs microsoft.com

and

Console.WriteLine(GetDomain.GetDomainFromUrl("http://www.beta.microsoft.co.uk/path/page.htm"));

outputs microsoft.co.uk



Answer 3:

google.com is not guaranteed to be the same as www.google.com (well, for this example it technically is, but may be otherwise).

Maybe what you actually need is to remove the top-level domain and the "www" subdomain? Then just Split('.') and take the part before the last part!
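
A minimal sketch of that idea, assuming the input is already a host name such as the value of Uri.Host. The class and method names are hypothetical.

```csharp
using System;

// Hypothetical helper illustrating the split-and-pick approach.
public static class NaiveDomain
{
    // Returns the label immediately before the last label of the host.
    // Hosts with a single label (e.g. "localhost") are returned unchanged.
    public static string LabelBeforeLast(string host)
    {
        string[] parts = host.Split('.');
        return parts.Length >= 2 ? parts[parts.Length - 2] : host;
    }
}
```

This gives "google" for "www.google.com", but "co" for "www.test.co.uk", so it only works when the top-level domain is a single label.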



Answer 4:

Below is some code that will give just the SLD plus gTLD or ccTLD extension (note the exception below). I do not care about DNS.

The theory is as follows:

  • Anything under 3 tokens stays as is, e.g. "localhost" or "domain.com". Otherwise, the last token must be a gTLD or ccTLD extension.
  • The penultimate token is considered part of the extension if its length is < 3 OR if it is included in a list of exceptions.
  • Finally, the token before that one is considered the SLD. Anything before that is considered a subdomain or a host qualifier, e.g. www.

As for the code, short & sweet:

// Requires: using System; using System.Linq; (for Contains on the array)
private static string GetDomainName(string url)
{
    string domain = new Uri(url).DnsSafeHost.ToLower();
    var tokens = domain.Split('.');
    if (tokens.Length > 2)
    {
        //Add only second level exceptions to the < 3 rule here
        string[] exceptions = { "info", "firm", "name", "com", "biz", "gen", "ltd", "web", "net", "pro", "org" }; 
        var validTokens = 2 + ((tokens[tokens.Length - 2].Length < 3 || exceptions.Contains(tokens[tokens.Length - 2])) ? 1 : 0);
        domain = string.Join(".", tokens, tokens.Length - validTokens, validTokens);
    }
    return domain;
}

The obvious exception is that this will not deal with 2-letter domain names. So if you're lucky enough to own ab.com you'll need to adapt the code slightly. For us mere mortals this code will cover just about every gTLD and ccTLD, minus a few very exotic ones.



Answer 5:

I tried pretty much every approach but all of them fell short of the desired result. So here is my approach adjusted from servermanfail.

The TLD file is available at https://publicsuffix.org/list/. I took the file from https://publicsuffix.org/list/effective_tld_names.dat, parse it, and search it for the TLDs. If new TLDs are published, just download the latest file.

Have fun.

using System;
using System.Collections.Generic;
using System.IO;

namespace SearchWebsite
{
internal class NetDomain
{
    static public string GetDomainFromUrl(string Url)
    {
        return GetDomainFromUrl(new Uri(Url));
    }

    static public string GetDomainFromUrl(string Url, bool Strict)
    {
        return GetDomainFromUrl(new Uri(Url), Strict);
    }

    static public string GetDomainFromUrl(Uri Url)
    {
        return GetDomainFromUrl(Url, false);
    }

    static public string GetDomainFromUrl(Uri Url, bool Strict)
    {
        initializeTLD();
        if (Url == null) return null;
        var dotBits = Url.Host.Split('.');
        if (dotBits.Length == 1) return Url.Host; //eg http://localhost/blah.php = "localhost"
        if (dotBits.Length == 2) return Url.Host; //eg http://blah.co/blah.php = "blah.co"
        string bestMatch = "";
        foreach (var tld in DOMAINS)
        {
            //match on a label boundary, so that "om" does not match a host ending in ".com"
            //and a host that is itself a listed suffix does not match its own full name
            if (Url.Host.EndsWith("." + tld, StringComparison.InvariantCultureIgnoreCase))
            {
                if (tld.Length > bestMatch.Length) bestMatch = tld;
            }
        }
        if (string.IsNullOrEmpty(bestMatch))
            return Url.Host; //eg http://domain.com/blah = "domain.com"

        //add the domain name onto tld
        string[] bestBits = bestMatch.Split('.');
        string[] inputBits = Url.Host.Split('.');
        int getLastBits = bestBits.Length + 1;
        bestMatch = "";
        for (int c = inputBits.Length - getLastBits; c < inputBits.Length; c++)
        {
            if (bestMatch.Length > 0) bestMatch += ".";
            bestMatch += inputBits[c];
        }
        return bestMatch;
    }


    static private void initializeTLD()
    {
        if (DOMAINS.Count > 0) return;

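        //the using block closes the file even if reading throws
        using (StreamReader reader = File.OpenText("effective_tld_names.dat"))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                line = line.Trim();
                //skip blank lines and comments; wildcard ("*.") and exception ("!") rules
                //are stored as-is and therefore never match a host
                if (!string.IsNullOrEmpty(line) && !line.StartsWith("//"))
                {
                    DOMAINS.Add(line);
                }
            }
        }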
    }


    // This file was taken from https://publicsuffix.org/list/effective_tld_names.dat

    static public List<String> DOMAINS = new List<String>();
}

}



Answer 6:

I think you are displaying a misunderstanding of what constitutes a "domain name" - there is no such thing as a "pure domain name" in common usage - this is something you will need to define if you want consistent results.
Do you just want to strip off the "www" part? And then have another version which strips off the top level domain (eg. strip off the ".com" or the ".co.uk" etc parts?) Another answer mentions split(".") - you will need to use something like this if you want to exclude specific parts of the hostname manually, there's nothing within the .NET framework to meet your requirements exactly - you'll need to implement these things yourself.



Answer 7:

See Rick Strahl's recent blog post for a C#- and .NET-centric reference:

Making Sense of ASP.NET paths



Answer 8:

Yes, I've posted the solution here: http://pastebin.com/raw.php?i=raxNQkCF

If you want to remove the extension just add

if (url.IndexOf(".") > -1) { url = url.Substring(0, url.IndexOf(".")); }



Answer 9:

Uri's Host property returns the full host name (www.google.com), including a label (www) and a top-level domain (com), but often you want to extract the middle bit. I simply do this:

Uri uri;
bool result = Uri.TryCreate(returnUri, UriKind.Absolute, out uri);
if (result == false)
    return false;

//if you are sure it's not "localhost"
string[] domainParts = uri.Host.Split('.');
string topLevel = domainParts[domainParts.Length - 1];
string hostBody = domainParts[domainParts.Length - 2];
string label = domainParts[domainParts.Length - 3];

But you do need to check domainParts.Length, as the given URI often has a host like "google.com" with only two parts.
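
A minimal sketch of the same split with the length check added. The class, method, and parameter names are hypothetical, and the sketch does not special-case IP addresses or multi-label suffixes such as co.uk.

```csharp
using System;

// Hypothetical helper: splits a URL's host into its last three labels.
public static class HostLabels
{
    // Returns false when the URL is not absolute or the host has fewer than
    // two labels (e.g. "localhost"). label is null when the host has exactly
    // two labels (e.g. "google.com").
    public static bool TrySplitHost(string url, out string label,
                                    out string hostBody, out string topLevel)
    {
        label = null;
        hostBody = null;
        topLevel = null;

        Uri uri;
        if (!Uri.TryCreate(url, UriKind.Absolute, out uri))
            return false;

        string[] parts = uri.Host.Split('.');
        if (parts.Length < 2)
            return false;

        topLevel = parts[parts.Length - 1];
        hostBody = parts[parts.Length - 2];
        if (parts.Length >= 3)
            label = parts[parts.Length - 3];
        return true;
    }
}
```

With "http://www.google.com/x" this yields label "www", hostBody "google", and topLevel "com"; with "http://google.com" the label stays null instead of throwing.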



Answer 10:

Use Nager.PublicSuffix

install-package Nager.PublicSuffix

var domainParser = new DomainParser(new WebTldRuleProvider());

var domainName = domainParser.Get("sub.test.co.uk");
//domainName.Domain = "test";
//domainName.Hostname = "sub.test.co.uk";
//domainName.RegistrableDomain = "test.co.uk";
//domainName.SubDomain = "sub";
//domainName.TLD = "co.uk";


Answer 11:

I came up with the solution below (using LINQ):

    public string MainDomainFromHost(string host)
    {
        string[] parts = host.Split('.');
        if (parts.Length <= 2)
            return host; // host is probably already a main domain
        if (parts[parts.Length - 1].All(char.IsNumber))
            return host; // host is probably an IPv4 address
        if (parts[parts.Length - 1].Length == 2 && parts[parts.Length - 2].Length == 2)
            return string.Join(".", parts.TakeLast(3)); // this is the case for co.uk, co.in, etc...
        return string.Join(".", parts.TakeLast(2)); // all others, take only the last 2
    }


Answer 12:

Because of the numerous variations in domain names and the non-existence of any real authoritative list of what constitutes a "pure domain name" as you describe, I've just resorted to using Uri.Host in the past. To avoid cases where www.google.com and google.com show up as two different domains, I've often resorted to stripping the www. from all domains that contain it, since it's almost guaranteed (ALMOST) to point to the same site. It's really the only simple way to do it without risking losing some data.
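
A minimal sketch of that normalization, assuming only a leading "www." label should be removed and everything else in Uri.Host kept. The class and method names are hypothetical.

```csharp
using System;

// Hypothetical helper: returns Uri.Host with a leading "www." removed.
public static class HostNormalizer
{
    public static string StripWww(string url)
    {
        string host = new Uri(url).Host;
        const string prefix = "www.";

        // Only strip a complete leading "www." label, so a host such as
        // "wwwexample.com" is left untouched.
        return host.StartsWith(prefix, StringComparison.OrdinalIgnoreCase)
            ? host.Substring(prefix.Length)
            : host;
    }
}
```

Both "http://www.google.com/search" and "http://google.com/search" then map to "google.com", while other subdomains are preserved.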



Answer 13:

string domain = new Uri(HttpContext.Current.Request.Url.AbsoluteUri).GetLeftPart(UriPartial.Authority);


Tags: c# uri