Over the years I have slowly developed a regular expression that validates MOST email addresses correctly, assuming they don't use an IP address as the server part.
I use it in several PHP programs, and it works most of the time. However, from time to time I get contacted by someone that is having trouble with a site that uses it, and I end up having to make some adjustment (most recently I realized that I wasn't allowing 4-character TLDs).
What is the best regular expression you have or have seen for validating emails?
I've seen several solutions that use functions that use several shorter expressions, but I'd rather have one long complex expression in a simple function instead of several short expression in a more complex function.
Quick answer
Use the following regex for input validation:
Addresses matched by this regex:
The second constraint is a restriction on RFC 5321/5322.
Elaborate answer
Using a regular expression that recognizes email addresses could be useful in various situations: for example to scan for email addresses in a document, to validate user input, or as an integrity constraint on a data repository.
It should however be noted that if you want to find out if the address actually refers to an existing mailbox, there's no substitute for sending a message to the address. If you only want to check if an address is grammatically correct then you could use a regular expression, but note that
""@[]
is a grammatically correct email address that certainly doesn't refer to an existing mailbox.The syntax of email addresses has been defined in various RFCs, most notably RFC 822 and RFC 5322. RFC 822 should be seen as the "original" standard and RFC 5322 as the latest standard. The syntax defined in RFC 822 is the most lenient and subsequent standards have restricted the syntax further and further, where newer systems or services should recognize obsolete syntax, but never produce it.
In this answer I’ll take “email address” to mean
addr-spec
as defined in the RFCs (i.e.jdoe@example.org
, but not"John Doe"<jdoe@example.org>
, norsome-group:jdoe@example.org,mrx@exampel.org;
).There's one problem with translating the RFC syntaxes into regexes: the syntaxes are not regular! This is because they allow for optional comments in email addresses that can be infinitely nested, while infinite nesting can't be described by a regular expression. To scan for or validate addresses containing comments you need a parser or more powerful expressions. (Note that languages like Perl have constructs to describe context free grammars in a regex-like way.) In this answer I'll disregard comments and only consider proper regular expressions.
The RFCs define syntaxes for email messages, not for email addresses as such. Addresses may appear in various header fields and this is where they are primarily defined. When they appear in header fields addresses may contain (between lexical tokens) whitespace, comments and even linebreaks. Semantically this has no significance however. By removing this whitespace, etc. from an address you get a semantically equivalent canonical representation. Thus, the canonical representation of
first. last (comment) @ [3.5.7.9]
isfirst.last@[3.5.7.9]
.Different syntaxes should be used for different purposes. If you want to scan for email addresses in a (possibly very old) document it may be a good idea to use the syntax as defined in RFC 822. On the other hand, if you want to validate user input you may want to use the syntax as defined in RFC 5322, probably only accepting canonical representations. You should decide which syntax applies to your specific case.
I use POSIX "extended" regular expressions in this answer, assuming an ASCII compatible character set.
RFC 822
I arrived at the following regular expression. I invite everyone to try and break it. If you find any false positives or false negatives, please post them in a comment and I'll try to fix the expression as soon as possible.
I believe it's fully complient with RFC 822 including the errata. It only recognizes email addresses in their canonical form. For a regex that recognizes (folding) whitespace see the derivation below.
The derivation shows how I arrived at the expression. I list all the relevant grammar rules from the RFC exactly as they appear, followed by the corresponding regex. Where an erratum has been published I give a separate expression for the corrected grammar rule (marked "erratum") and use the updated version as a subexpression in subsequent regular expressions.
As stated in paragraph 3.1.4. of RFC 822 optional linear white space may be inserted between lexical tokens. Where applicable I've expanded the expressions to accommodate this rule and marked the result with "opt-lwsp".
RFC 5322
I arrived at the following regular expression. I invite everyone to try and break it. If you find any false positives or false negatives, please post them in a comment and I'll try to fix the expression as soon as possible.
I believe it's fully complient with RFC 5322 including the errata. It only recognizes email addresses in their canonical form. For a regex that recognizes (folding) whitespace see the derivation below.
The derivation shows how I arrived at the expression. I list all the relevant grammar rules from the RFC exactly as they appear, followed by the corresponding regex. For rules that include semantically irrelevant (folding) whitespace, I give a separate regex marked "(normalized)" that doesn't accept this whitespace.
I ignored all the "obs-" rules from the RFC. This means that the regexes only match email addresses that are strictly RFC 5322 compliant. If you have to match "old" addresses (as the looser grammar including the "obs-" rules does), you can use one of the RFC 822 regexes from the previous paragraph.
Note that some sources (notably w3c) claim that RFC 5322 is too strict on the local part (i.e. the part before the @-sign). This is because "..", "a..b" and "a." are not valid dot-atoms, while they may be used as mailbox names. The RFC, however, does allow for local parts like these, except that they have to be quoted. So instead of
a..b@example.net
you should write"a..b"@example.net
, which is semantically equivalent.Further restrictions
SMTP (as defined in RFC 5321) further restricts the set of valid email addresses (or actually: mailbox names). It seems reasonable to impose this stricter grammar, so that the matched email address can actually be used to send an email.
RFC 5321 basically leaves alone the "local" part (i.e. the part before the @-sign), but is stricter on the domain part (i.e. the part after the @-sign). It allows only host names in place of dot-atoms and address literals in place of domain literals.
The grammar presented in RFC 5321 is too lenient when it comes to both host names and IP addresses. I took the liberty of "correcting" the rules in question, using this draft and RFC 1034 as guidelines. Here's the resulting regex.
Note that depending on the use case you may not want to allow for a "General-address-literal" in your regex. Also note that I used a negative lookahead
(?!IPv6:)
in the final regex to prevent the "General-address-literal" part to match malformed IPv6 addresses. Some regex processors don't support negative lookahead. Remove the substring|(?!IPv6:)[0-9A-Za-z-]*[0-9A-Za-z]:[!-Z^-~]+
from the regex if you want to take the whole "General-address-literal" part out.Here's the derivation:
User input validation
A common use case is user input validation, for example on an html form. In that case it's usually reasonable to preclude address-literals and to require at least two labels in the hostname. Taking the improved RFC 5321 regex from the previous section as a basis, the resulting expression would be:
([-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*|"([]!#-[^-~ \t]|(\\[\t -~]))+")@[0-9A-Za-z]([0-9A-Za-z-]{0,61}[0-9A-Za-z])?(\.[0-9A-Za-z]([0-9A-Za-z-]{0,61}[0-9A-Za-z])?)+
I do not recommend restricting the local part further, e.g. by precluding quoted strings, since we don't know what kind of mailbox names some hosts allow (like
"a..b"@example.net
or even"a b"@example.net
).I also do not recommend explicitly validating against a list of literal top-level domains or even imposing length-constraints (remember how ".museum" invalidated
[a-z]{2,4}
), but if you must:([-!#-'*+/-9=?A-Z^-~]+(\.[-!#-'*+/-9=?A-Z^-~]+)*|"([]!#-[^-~ \t]|(\\[\t -~]))+")@([0-9A-Za-z]([0-9A-Za-z-]{0,61}[0-9A-Za-z])?\.)*(net|org|com|info|
etc...)
Make sure to keep your regex up-to-date if you decide to go down the path of explicit top-level domain validation.
Further considerations
When only accepting host names in the domain part (after the @-sign), the regexes above accept only labels with at most 63 characters, as they should. However, they don't enforce the fact that the entire host name must be at most 253 characters long (including the dots). Although this constraint is strictly speaking still regular, it's not feasible to make a regex that incorporates this rule.
Another consideration, especially when using the regexes for input validation, is feedback to the user. If a user enters an incorrect address, it would be nice to give a little more feedback than a simple "syntactically incorrect address". With "vanilla" regexes this is not possible.
These two considerations could be addressed by parsing the address. The extra length constraint on host names could in some cases also be addressed by using an extra regex that checks it, and matching the address against both expressions.
None of the regexes in this answer are optimized for performance. If performance is an issue, you should see if (and how) the regex of your choice can be optimized.
There are plenty examples of this out on the net (and I think even one that fully validates the RFC - but it's tens/hundreds of lines long if memory serves). People tend to get carried away validating this sort of thing. Why not just check it has an @ and at least one . and meets some simple minimum length. It's trivial to enter a fake email and still match any valid regex anyway. I would guess that false positives are better than false negatives.
RFC 5322 standard:
Allows dot-atom local-part, quoted-string local-part, obsolete (mixed dot-atom and quoted-string) local-part, domain name domain, (IPv4, IPv6, and IPv4-mapped IPv6 address) domain literal domain, and (nested) CFWS.
RFC 5321 standard:
Allows dot-atom local-part, quoted-string local-part, domain name domain, and (IPv4, IPv6, and IPv4-mapped IPv6 address) domain literal domain.
Basic:
Allows dot-atom local-part and domain name domain (requiring at least two domain name labels with the TLD limited to 2-6 alphabetic characters).
Not to mention that non-Latin (Chinese, Arabic, Greek, Hebrew, Cyrillic and so on) domain names are to be allowed in the near future. Everyone has to change the email regex used, because those characters are surely not to be covered by
[a-z]/i
nor\w
. They will all fail.After all, the best way to validate the email address is still to actually send an email to the address in question to validate the address. If the email address is part of user authentication (register/login/etc), then you can perfectly combine it with the user activation system. I.e. send an email with a link with an unique activation key to the specified email address and only allow login when the user has activated the newly created account using the link in the email.
If the purpose of the regex is just to quickly inform the user in the UI that the specified email address doesn't look like in the right format, best is still to check if it matches basically the following regex:
Simple as that. Why on earth would you care about the characters used in the name and domain? It's the client's responsibility to enter a valid email address, not the server's. Even when the client enters a syntactically valid email address like
aa@bb.cc
, this does not guarantee that it's a legit email address. No one regex can cover that.It all depends on how accurate you want to be. For my purposes, where I'm just trying to keep out things like
bob @ aol.com
(spaces in emails) orsteve
(no domain at all) ormary@aolcom
(no period before .com), I useSure, it will match things that aren't valid email addresses, but it's a matter of playing the 90/10 rule.
According to official standard RFC 2822 valid email regex is
if you want to use it in Java its really very easy