I'm writing some code that processes URLs, and I want to make sure I'm not leaving some strange case out...
Are there any valid characters for a host other than: A-Z, 0-9, "-" and "."?
(This includes anything that can be in subdomains, etc. Essentially, anything between :// and the first /)
Thanks!
Please see Restrictions on valid host names:
Hostnames are composed of series of
labels concatenated with dots, as are
all domain names. For example,
"en.wikipedia.org" is a hostname. Each
label must be between 1 and 63
characters long, and the entire
hostname has a maximum of 255
characters.
RFCs mandate that a hostname's labels
may contain only the ASCII letters 'a'
through 'z' (case-insensitive), the
digits '0' through '9', and the
hyphen. Hostname labels cannot begin
or end with a hyphen. No other
symbols, punctuation characters, or
blank spaces are permitted.
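
The rules quoted above can be sketched as a small check. This is only a minimal sketch: `is_valid_hostname` is a hypothetical helper name, the 255-character limit is the figure from the quoted text, and accepting a single trailing dot is an assumption of mine rather than something the quote states.

```python
import re

# One label: 1 to 63 characters, ASCII letters, digits or hyphen,
# with no hyphen at the start or at the end.
_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")

MAX_HOSTNAME_LENGTH = 255  # figure taken from the quoted text


def is_valid_hostname(host: str) -> bool:
    """Hypothetical helper: check a host against the quoted label rules."""
    if host.endswith("."):
        host = host[:-1]  # assumption: tolerate one trailing dot
    if not host or len(host) > MAX_HOSTNAME_LENGTH:
        return False
    # Every dot-separated label must match in full; empty labels fail.
    return all(_LABEL.fullmatch(label) for label in host.split("."))


if __name__ == "__main__":
    for candidate in ("en.wikipedia.org", "-one.two.com", "one-.two.com"):
        print(candidate, is_valid_hostname(candidate))
```

This only covers ASCII host names; IP literals and internationalized names would need separate handling.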
No, that is all that is allowed.
Here is a reference if you would like to read more:
http://www.ietf.org/rfc/rfc1034.txt
It depends on the level at which you do the validation (before or after the URL escaping).
If you try to validate user input, then it can go way beyond ASCII (with big chunks of Unicode).
See http://en.wikipedia.org/wiki/Internationalized_domain_name
If you try to validate after all the escaping and the "punycode" conversion is done, there is little point in validating, since the result is already guaranteed to contain only the characters allowed by the old RFC.
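
To illustrate the "before or after" distinction, here is a minimal sketch that converts a Unicode host to its ASCII (punycode) form using Python's built-in `idna` codec. The helper name `to_ascii_host` is hypothetical, and note that this built-in codec implements the older IDNA 2003 rules, so it may not match what current browsers do for every name.

```python
def to_ascii_host(host: str) -> str:
    """Hypothetical helper: return the ASCII (punycode) form of a host.

    Raises UnicodeError if a label cannot be converted, for example
    when it is empty or too long.
    """
    return host.encode("idna").decode("ascii")


if __name__ == "__main__":
    # Non-ASCII labels come back with the "xn--" prefix;
    # labels that are already ASCII pass through unchanged.
    print(to_ascii_host("bücher.example"))
    print(to_ascii_host("en.wikipedia.org"))
```

Validation with the ASCII-only rules would then be applied to the converted string, not to the raw user input.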
Keep in mind that besides the hostname rules of the Internet, DNS systems are free to create any names that they like. DNS servers could accept and reply to 8-bit binary requests: the DNS wire protocol does not forbid it.
This means that for internal LAN URLs you may have different rules, such as the underscore appearing in a host name.
If you want to write URL-parsing code that perfectly matches the official W3C spec, see the document at www.w3.org/TR/url-1/ . See section 3 (Hosts) for specific information on hosts in URLs.
A valid URL host consists of ASCII letters, digits, the dot ( . ) and the hyphen ( - ). The maximum total length is 255, and the dot-separated labels have a maximum length of 63 each. The hyphen can separate alphanumeric sequences, e.g. one-two.net, but cannot appear at the beginning or end of a dot-separated label: -one.two.com, one.two.com- and one-.two.com are all invalid hosts.
See https://tools.ietf.org/html/rfc1123#page-79 and Assumptions part 1 of https://tools.ietf.org/html/rfc952
Also, here is a link to an online regex tool for validating a URL host, which worked as of 5/28/2019: https://www.regextester.com/23
Also, per https://tools.ietf.org/html/rfc1123#page-13, when validating a host you should check it syntactically for a dotted-decimal number (an IP address) before looking it up in the DNS.
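
The dotted-decimal check can be sketched with Python's standard `ipaddress` module. The helper name `is_ipv4_literal` is hypothetical, and this is just one possible way to do the syntactic test, not something prescribed by the RFC.

```python
import ipaddress


def is_ipv4_literal(host: str) -> bool:
    """Hypothetical helper: True if host parses as a dotted-decimal IPv4 address."""
    try:
        ipaddress.IPv4Address(host)
    except ValueError:
        # AddressValueError is a subclass of ValueError.
        return False
    return True


if __name__ == "__main__":
    for candidate in ("192.0.2.1", "en.wikipedia.org", "256.1.1.1"):
        print(candidate, is_ipv4_literal(candidate))
```

A host that passes this test can be treated as an address directly; anything else goes on to the host name checks and the DNS lookup.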