There are a number of email regexp questions popping up here, and I'm honestly baffled why people are using these insanely convoluted matching expressions rather than a very simple parser. Such a parser would split the email address into the name and domain tokens, then validate the name against the characters allowed for names (there's no further check that can be done on this portion) and the domain against the characters allowed for domains. I suppose you could also add checking against all the world's TLDs, and then another level of checking for second-level domains in countries that use them (e.g., co.uk).
The real problem is that the TLDs and SLDs keep changing (contrary to popular belief), so if you plan on doing all this high-level checking, you have to update the regexp whenever the root zone changes.
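To make the idea concrete, here is a minimal sketch of such a split-and-check parser. The character sets and the function names (`split_address`, `looks_valid`) are assumptions chosen for illustration. The sketch only checks which characters appear, not how they are arranged, and it does not handle quoted strings or comments.

```python
import string

# Assumed character sets for illustration only; the full address grammar
# also allows quoted strings and comments, which this sketch rejects.
LOCAL_CHARS = set(string.ascii_letters + string.digits + "!#$%&'*+-/=?^_`{|}~.")
DOMAIN_CHARS = set(string.ascii_letters + string.digits + "-.")


def split_address(address):
    """Split on the last '@'; return (local, domain) or None if either is empty."""
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        return None
    return local, domain


def looks_valid(address):
    """Check that both tokens use only the characters assumed above."""
    parts = split_address(address)
    if parts is None:
        return False
    local, domain = parts
    return set(local) <= LOCAL_CHARS and set(domain) <= DOMAIN_CHARS
```

Splitting on the last `@` rather than the first is a deliberate choice here, since the domain token cannot contain one.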
Why not have a module that simply validates domains, which pulls from a database, or flat file, and optionally checks DNS for matching records?
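Roughly what I have in mind, as a sketch: the flat file is assumed to hold one TLD per line with `#` comment lines, and `load_tlds`, `domain_has_known_tld`, and the sample data are hypothetical names made up for this example. The optional DNS lookup is left out.

```python
import io


def load_tlds(lines):
    """Read one TLD per line, skipping blank lines and '#' comments."""
    tlds = set()
    for line in lines:
        entry = line.strip().lower()
        if entry and not entry.startswith("#"):
            tlds.add(entry)
    return tlds


def domain_has_known_tld(domain, tlds):
    """True if the domain has at least two non-empty labels and a listed TLD."""
    labels = domain.lower().rstrip(".").split(".")
    return len(labels) >= 2 and all(labels) and labels[-1] in tlds


# Hypothetical sample data standing in for a real, regularly updated file.
SAMPLE_FILE = io.StringIO("# sample list\ncom\norg\nuk\n")
KNOWN_TLDS = load_tlds(SAMPLE_FILE)
```

Because the list lives in data rather than in the pattern, updating it means replacing a file, not rewriting an expression.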
I'm being serious here: why is everyone so keen on inventing the perfect regexp for this? It doesn't seem to be a suitable solution to the problem...
Convince me that it's not only possible to do in regexp (and satisfy everyone) but that it's a better solution than a custom parser/validator.
-Adam
Regexes that catch most (but not all) common errors are relatively easy to set up and deploy. It takes longer to write a custom parser.
I don't believe correct email validation can be done with a single regular expression (now there's a challenge!). One of the issues is that comments can be nested to an arbitrary depth in both the local part and the domain.
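To illustrate why nesting is awkward for a classical regular expression but easy procedurally, here is a small sketch that tracks comment depth with a counter. The function name `comments_balanced` is made up for this example, and the sketch assumes parentheses only ever delimit comments, so it ignores quoted strings where they would be literal.

```python
def comments_balanced(text):
    """Return True if parentheses nest properly, honouring backslash escapes."""
    depth = 0
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # closing parenthesis with no open comment
        i += 1
    return depth == 0
```

The counter can grow without bound, which is exactly the part a pattern without recursion support cannot express.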
If you want to validate an address against RFCs 5322 and 5321 (the current standards) then you'll need a procedural function to do so.
Fortunately, this is a commodity problem. Everybody wants the same result: RFC compliance. There's no need for anybody to write this code ever again once it's been solved by an open source function.
Check out some of the alternatives here: http://www.dominicsayers.com/isemail/
If you know of another function that I can add to the head-to-head, let me know.
They do it because they see "I want to test whether this text matches the spec" and immediately think "I know, I'll use a regex!" without fully understanding the complexity of the spec or the limitations of regexes. Regexes are a wonderful, powerful tool for handling a wide variety of text-matching tasks, but they are not the perfect tool for every such task and it seems that many people who use them lose sight of that fact.
It is not true that no further check can be done on the name portion. For example, "ben..doom@gmail.com" contains only valid characters in the name section, but is not valid.
In languages that do not have libraries for email validation, I generally use regex because
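The dot rule is a structural check rather than a character check. A minimal sketch, assuming an unquoted name section (a quoted one follows different rules) and using a made-up function name:

```python
def dots_ok(local):
    """Unquoted name section: no leading, trailing, or consecutive dots."""
    # Splitting on '.' yields an empty piece wherever a dot is misplaced.
    return all(local.split("."))
```

With this, `dots_ok("ben..doom")` is false because the split produces an empty piece between the two dots.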
I'm sure many built-in libraries do use your approach, and if you want to cover all the possibilities, it does get ridiculous. However, so does your parser. The formal spec for email addresses is absurdly complex. So, we use a regex that gets close enough.
People do it because in most languages it is way easier to write a regexp than to write and use a parser in your code (or so it seems, at least).
If you decide to eschew regexes, you will have to either write parsers by hand or resort to external tools (like yacc) for lexer/parser generation. This is way more complex than a single-line regex match.
One needs a library that makes it easy to write parsers directly in language X (where 'X' is C, C++, C#, Java) to be able to build custom parsers with the same ease as regular expression matchers.
Such libraries originated in functional land (Haskell and ML), but nowadays "parser combinator libraries" exist for Java, C++, C#, Scala and other mainstream languages.
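To give a feel for the style, here is a hand-rolled sketch of the combinator idea in Python rather than any particular library. Every name in it (`char_in`, `many1`, `seq`, `sep_by1`, `matches`) is made up for the example, and the grammar is deliberately simplified: dot-separated atoms on both sides of an `@`, with an assumed character set. Each parser takes the text and a position and returns the new position, or `None` on failure.

```python
import string


def char_in(allowed):
    """Parser that consumes one character from `allowed`."""
    def parse(text, pos):
        if pos < len(text) and text[pos] in allowed:
            return pos + 1
        return None
    return parse


def many1(parser):
    """Apply `parser` one or more times, as far as it will go."""
    def parse(text, pos):
        end = parser(text, pos)
        if end is None:
            return None
        while True:
            nxt = parser(text, end)
            if nxt is None:
                return end
            end = nxt
    return parse


def seq(*parsers):
    """Apply each parser in order; fail if any of them fails."""
    def parse(text, pos):
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse


def sep_by1(item, separator):
    """One or more `item`s with `separator` between them."""
    tail = seq(separator, item)
    def parse(text, pos):
        end = item(text, pos)
        if end is None:
            return None
        while True:
            nxt = tail(text, end)
            if nxt is None:
                return end
            end = nxt
    return parse


# Simplified grammar (an assumption, not the full address syntax).
ATOM = many1(char_in(string.ascii_letters + string.digits + "-_+"))
DOT_ATOM = sep_by1(ATOM, char_in("."))
ADDRESS = seq(DOT_ATOM, char_in("@"), DOT_ATOM)


def matches(text):
    """True if ADDRESS consumes the whole input."""
    return ADDRESS(text, 0) == len(text)
```

The grammar reads almost like the three definitions at the bottom, which is the appeal: small parsers compose into bigger ones in ordinary code, with no external generator step.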
Using regular expressions for this is not a good idea, as has been demonstrated at length in those other posts.
I suppose people keep doing it because they don't know any better or don't care.
Will a parser be any better? Maybe, maybe not.
I maintain that sending a verification e-mail is the best way to validate it. If you want to check anything from JavaScript, then check that it has an '@' sign in there and something before and after it. If you go any stricter than that, you risk running up against some syntax you didn't know about and your validator will become overly restrictive.
Also, be careful with that TLD validation scheme of yours. You might find that you are assuming too much about what is allowed in a TLD.
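That loose check is only a few lines. Here is a sketch in Python (the function name is made up, and the same logic ports directly to JavaScript):

```python
def has_plausible_shape(address):
    """True if there is an '@' with at least one character on each side."""
    local, sep, domain = address.rpartition("@")
    return bool(sep and local and domain)
```

Anything that passes still needs the verification e-mail. This only filters out input that could not possibly be an address.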