What are the valid characters that can show up in

Posted 2020-07-03 06:56

Question:

I'm writing some code that processes URLs, and I want to make sure I'm not leaving some strange case out...

Are there any valid characters for a host other than: A-Z, 0-9, "-" and "."?

(This includes anything that can be in subdomains, etc. Essentially, anything between :// and the first /)

Thanks!

Answer 1:

Please see Restrictions on valid host names:

Hostnames are composed of a series of labels concatenated with dots, as are all domain names. For example, "en.wikipedia.org" is a hostname. Each label must be between 1 and 63 characters long, and the entire hostname has a maximum of 255 characters.

RFCs mandate that a hostname's labels may contain only the ASCII letters 'a' through 'z' (case-insensitive), the digits '0' through '9', and the hyphen. Hostname labels cannot begin or end with a hyphen. No other symbols, punctuation characters, or blank spaces are permitted.
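
The rules quoted above can be sketched as a small check. This is a minimal illustration, not a reference implementation: the function name `is_valid_hostname` is made up for this example, the 255-character limit is taken from the quoted text, and tolerating a single trailing dot is an assumption.

```python
import re

# One label: 1 to 63 characters, ASCII letters, digits, or hyphen,
# and no hyphen at the start or the end.
_LABEL_RE = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def is_valid_hostname(hostname: str) -> bool:
    """Return True if hostname follows the label rules quoted above."""
    if hostname.endswith("."):
        # Assumption: accept one trailing dot (fully qualified form).
        hostname = hostname[:-1]
    # Overall limit as stated in the quoted text.
    if not hostname or len(hostname) > 255:
        return False
    # Every dot-separated label must match the label pattern in full.
    return all(_LABEL_RE.fullmatch(label) for label in hostname.split("."))


if __name__ == "__main__":
    for candidate in ("en.wikipedia.org", "-one.two.com", "under_score.example"):
        print(candidate, is_valid_hostname(candidate))
```

The regular expression is matched case-insensitively by listing both letter ranges, which mirrors the "case-insensitive" wording in the quote.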



Answer 2:

No, that is all that is allowed.

Here is a reference if you would like to read more: http://www.ietf.org/rfc/rfc1034.txt



Answer 3:

It depends on the level at which you do the validation (before or after the URL escaping). If you are validating raw user input, the host can go well beyond ASCII and include large parts of Unicode.

See http://en.wikipedia.org/wiki/Internationalized_domain_name

If you validate after all the escaping and the Punycode conversion have been done, there is little point in checking characters, since the result is already guaranteed by the old RFC to contain only valid characters.
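
To illustrate the "before versus after" distinction, here is a short sketch that converts a Unicode host to its ASCII (Punycode) form using Python's built-in `idna` codec. The helper name `to_ascii_host` is hypothetical, and note that this codec implements the older IDNA 2003 rules, so it may differ from what current browsers or registries accept.

```python
def to_ascii_host(host: str) -> str:
    """Convert a possibly non-ASCII host to its ASCII (Punycode) form.

    Uses the standard library "idna" codec (IDNA 2003 rules).
    Raises UnicodeError if a label cannot be converted.
    """
    return host.encode("idna").decode("ascii")


if __name__ == "__main__":
    # A non-ASCII label becomes an "xn--" label; ASCII labels pass through.
    print(to_ascii_host("bücher.example"))  # xn--bcher-kva.example
```

After this conversion, the result can be checked with the plain ASCII letter, digit, and hyphen rules.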



Answer 4:

Keep in mind that besides the hostname rules of the Internet, DNS systems are free to create any names that they like. DNS servers could accept and reply to 8-bit binary requests: the DNS wire protocol does not forbid it.

This means that for internal LAN URLs you may have different rules, such as the underscore appearing in a host name.
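
One way to handle this in code is to make the stricter rule optional. The sketch below is only an illustration: the function names and the `allow_underscore` flag are invented for this example, and it covers only the underscore case mentioned above, not arbitrary 8-bit names.

```python
import re


def make_label_pattern(allow_underscore: bool = False) -> re.Pattern:
    """Build a label pattern; optionally permit '_' for internal names."""
    extra = "_" if allow_underscore else ""
    # 1 to 63 characters, no hyphen at the start or the end of the label.
    return re.compile(rf"(?!-)[A-Za-z0-9{extra}-]{{1,63}}(?<!-)")


def is_valid_host(hostname: str, allow_underscore: bool = False) -> bool:
    """Check each dot-separated label against the chosen pattern."""
    pattern = make_label_pattern(allow_underscore)
    labels = hostname.split(".")
    return len(hostname) <= 255 and all(pattern.fullmatch(l) for l in labels)


if __name__ == "__main__":
    print(is_valid_host("my_server.lan"))                         # strict mode
    print(is_valid_host("my_server.lan", allow_underscore=True))  # lenient mode
```

Keeping the strict behavior as the default means public-facing input is still held to the Internet hostname rules unless the caller opts in.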



Answer 5:

If you want to write URL-parsing code that perfectly matches the official W3C spec, see the document at www.w3.org/TR/url-1/. See section 3 (Hosts) for specific information on hosts in URLs.



Answer 6:

A valid URL host consists of ASCII letters, digits, the dot ( . ) and the hyphen ( - ). The whole host has a maximum length of 255 characters and is made up of dot-separated labels, each with a maximum length of 63 characters. The hyphen can separate alphanumeric sequences, e.g. one-two.net, but cannot appear at the beginning or end of a dot-separated label: -one.two.com, one.two.com- and one-.two.com are all invalid hosts.

See https://tools.ietf.org/html/rfc1123#page-79 and Assumptions part 1 of https://tools.ietf.org/html/rfc952

Here is also a link to an online regex tool for validating a URL host, which worked as of 5/28/2019: https://www.regextester.com/23

Finally, per https://tools.ietf.org/html/rfc1123#page-13, when validating a host you should check it syntactically for a dotted-decimal number before looking it up in the DNS.
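
A minimal sketch of that dotted-decimal check, using the standard library `ipaddress` module. The helper name `is_dotted_decimal` is made up for this example, and it only covers IPv4 dotted-decimal form as accepted by `ipaddress.IPv4Address`.

```python
import ipaddress


def is_dotted_decimal(host: str) -> bool:
    """Return True if host parses as an IPv4 dotted-decimal address."""
    try:
        ipaddress.IPv4Address(host)
    except ValueError:
        # AddressValueError is a subclass of ValueError.
        return False
    return True


if __name__ == "__main__":
    for candidate in ("192.168.0.1", "en.wikipedia.org", "256.1.1.1"):
        kind = "IP literal" if is_dotted_decimal(candidate) else "not an IP literal"
        print(candidate, "->", kind)
```

Running this check first lets the code treat IP literals separately and skip the DNS lookup for them.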



Tags: url host