Email Id validation according to RFC5322 and https

Posted 2019-08-25 18:18

Question:

Validating E-mail Ids according to RFC5322 and following

https://en.wikipedia.org/wiki/Email_address

Below is sample code that uses Java and a regular expression to validate e-mail Ids.

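import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public void checkValid() {
    List<String> emails = new ArrayList<>();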
    //Valid Email Ids
    emails.add("simple@example.com");
    emails.add("very.common@example.com");                   
    emails.add("disposable.style.email.with+symbol@example.com");
    emails.add("other.email-with-hyphen@example.com");
    emails.add("fully-qualified-domain@example.com");
    emails.add("user.name+tag+sorting@example.com");
    emails.add("fully-qualified-domain@example.com");
    emails.add("x@example.com");
    emails.add("carlosd'intino@arnet.com.ar");
    emails.add("example-indeed@strange-example.com");
    emails.add("admin@mailserver1");
    emails.add("example@s.example");
    emails.add("\" \"@example.org");
    emails.add("\"john..doe\"@example.org");

    //Invalid emails Ids
    emails.add("Abc.example.com");
    emails.add("A@b@c@example.com");
    emails.add("a\"b(c)d,e:f;g<h>i[j\\k]l@example.com");
    emails.add("just\"not\"right@example.com");
    emails.add("this is\"not\\allowed@example.com");
    emails.add("this\\ still\"not\\allowed@example.com");
    emails.add("1234567890123456789012345678901234567890123456789012345678901234+x@example.com");
    emails.add("john..doe@example.com");
    emails.add("john.doe@example..com");

    String regex = "^[a-zA-Z0-9_!#$%&'*+/=? \\\"`{|}~^.-]+@[a-zA-Z0-9.-]+$";

    Pattern pattern = Pattern.compile(regex);
    int i=0;
    for(String email : emails){
        Matcher matcher = pattern.matcher(email);
        System.out.println(++i +"."+email +" : "+ matcher.matches());
    }
}

Actual Output:

   1.simple@example.com : true
   2.very.common@example.com : true
   3.disposable.style.email.with+symbol@example.com : true
   4.other.email-with-hyphen@example.com : true
   5.fully-qualified-domain@example.com : true
   6.user.name+tag+sorting@example.com : true
   7.fully-qualified-domain@example.com : true
   8.x@example.com : true
   9.carlosd'intino@arnet.com.ar : true
   10.example-indeed@strange-example.com : true
   11.admin@mailserver1 : true
   12.example@s.example : true
   13." "@example.org : true
   14."john..doe"@example.org : true
   15.Abc.example.com : false
   16.A@b@c@example.com : false
   17.a"b(c)d,e:f;g<h>i[j\k]l@example.com : false
   18.just"not"right@example.com : true
   19.this is"not\allowed@example.com : false
   20.this\ still"not\allowed@example.com : false
   21.1234567890123456789012345678901234567890123456789012345678901234+x@example.com : true
   22.john..doe@example.com : true
   23.john.doe@example..com : true

Expected Output:

1.simple@example.com : true
2.very.common@example.com : true
3.disposable.style.email.with+symbol@example.com : true
4.other.email-with-hyphen@example.com : true
5.fully-qualified-domain@example.com : true
6.user.name+tag+sorting@example.com : true
7.fully-qualified-domain@example.com : true
8.x@example.com : true
9.carlosd'intino@arnet.com.ar : true
10.example-indeed@strange-example.com : true
11.admin@mailserver1 : true
12.example@s.example : true
13." "@example.org : true
14."john..doe"@example.org : true
15.Abc.example.com : false
16.A@b@c@example.com : false
17.a"b(c)d,e:f;g<h>i[j\k]l@example.com : false
18.just"not"right@example.com : false
19.this is"not\allowed@example.com : false
20.this\ still"not\allowed@example.com : false
21.1234567890123456789012345678901234567890123456789012345678901234+x@example.com : false
22.john..doe@example.com : false
23.john.doe@example..com : false

How can I change my regular expression so that it rejects the following email ids?

1234567890123456789012345678901234567890123456789012345678901234+x@example.com
john..doe@example.com
john.doe@example..com 
just"not"right@example.com

Below are the criteria for regular expression:

Local-part

The local-part of the email address may use any of these ASCII characters:

  1. uppercase and lowercase Latin letters A to Z and a to z;
  2. digits 0 to 9;
  3. special characters !#$%&'*+-/=?^_`{|}~
  4. dot ., provided that it is not the first or last character unless quoted, and provided also that it does not appear consecutively unless quoted (e.g. John..Doe@example.com is not allowed but "John..Doe"@example.com is allowed);
  5. space and "(),:;<>@[\] characters are allowed with restrictions (they are only allowed inside a quoted string, as described in the paragraph below, and in addition, a backslash or double-quote must be preceded by a backslash); comments are allowed with parentheses at either end of the local-part; e.g. john.smith(comment)@example.com and (comment)john.smith@example.com are both equivalent to john.smith@example.com.

Domain

  1. uppercase and lowercase Latin letters A to Z and a to z;
  2. digits 0 to 9, provided that top-level domain names are not all-numeric;
  3. hyphen -, provided that it is not the first or last character. Comments are allowed in the domain as well as in the local-part; for example, john.smith@(comment)example.com and john.smith@example.com(comment) are equivalent to john.smith@example.com.
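
For reference, here is a minimal sketch of the rules above written as plain Java checks instead of one regex. The class and method names are hypothetical. The length limits (64 for the local part, 255 for the domain, 63 per label) are not in the list above; they are assumptions taken from the linked article. Quoted local parts and comments are deliberately not handled, so addresses such as "john..doe"@example.org are rejected by this sketch.

```java
import java.util.regex.Pattern;

// Hypothetical helper: plain-Java checks for the unquoted local-part
// and domain rules listed above. Not a full RFC 5322 parser.
class EmailRuleSketch {
    // Letters, digits, the listed special characters, and dot.
    private static final Pattern LOCAL_CHARS =
            Pattern.compile("[A-Za-z0-9!#$%&'*+/=?^_`{|}~.-]+");
    // One label: letters, digits, hyphen; no hyphen first or last; 63 chars max (assumed limit).
    private static final Pattern LABEL =
            Pattern.compile("[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?");

    static boolean isValidUnquotedLocalPart(String local) {
        return !local.isEmpty()
                && local.length() <= 64              // assumed limit, see lead-in
                && LOCAL_CHARS.matcher(local).matches()
                && !local.startsWith(".")            // dot not first
                && !local.endsWith(".")              // dot not last
                && !local.contains("..");            // no consecutive dots
    }

    static boolean isValidDomain(String domain) {
        if (domain.isEmpty() || domain.length() > 255) {
            return false;
        }
        // Limit -1 keeps empty labels, so "example..com" is rejected below.
        String[] labels = domain.split("\\.", -1);
        for (String label : labels) {
            if (!LABEL.matcher(label).matches()) {
                return false;
            }
        }
        // Top-level domain must not be all-numeric.
        return !labels[labels.length - 1].matches("[0-9]+");
    }

    static boolean isValid(String email) {
        int at = email.lastIndexOf('@');
        if (at < 1) {
            return false; // no '@' at all, or empty local part
        }
        return isValidUnquotedLocalPart(email.substring(0, at))
                && isValidDomain(email.substring(at + 1));
    }

    public static void main(String[] args) {
        String[] samples = {
            "simple@example.com", "john..doe@example.com",
            "john.doe@example..com", "just\"not\"right@example.com"
        };
        for (String s : samples) {
            System.out.println(s + " : " + isValid(s));
        }
    }
}
```

Splitting the checks this way makes each rule visible on its own line, which is harder to achieve inside a single character class like the one in my regex.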

Answer 1:

You could validate against RFC 5322 like this
(modified from a reference regex)

"(?im)^(?=.{1,64}@)(?:(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"@)|((?:[0-9a-z](?:\\.(?!\\.)|[-!#\\$%&'\\*\\+/=\\?\\^`\\{\\}\\|~\\w])*)?[0-9a-z]@))(?=.{1,255}$)(?:(\\[(?:\\d{1,3}\\.){3}\\d{1,3}\\])|((?:(?=.{1,63}\\.)[0-9a-z][-\\w]*[0-9a-z]*\\.)+[a-z0-9][\\-a-z0-9]{0,22}[a-z0-9])|((?=.{1,63}$)[0-9a-z][-\\w]*))$"  

https://regex101.com/r/ObS3QZ/1

 # (?im)^(?=.{1,64}@)(?:("[^"\\]*(?:\\.[^"\\]*)*"@)|((?:[0-9a-z](?:\.(?!\.)|[-!#\$%&'\*\+/=\?\^`\{\}\|~\w])*)?[0-9a-z]@))(?=.{1,255}$)(?:(\[(?:\d{1,3}\.){3}\d{1,3}\])|((?:(?=.{1,63}\.)[0-9a-z][-\w]*[0-9a-z]*\.)+[a-z0-9][\-a-z0-9]{0,22}[a-z0-9])|((?=.{1,63}$)[0-9a-z][-\w]*))$

 # Note - remove all comments '(comments)' before running this regex
 # Find  \([^)]*\)  replace with nothing

 (?im)                                     # Case insensitive, multiline
 ^                                         # BOS

                                           # Local part
 (?= .{1,64} @ )                           # 64 max chars
 (?:
      (                                         # (1 start), Quoted
           " [^"\\]* 
           (?: \\ . [^"\\]* )*
           "
           @
      )                                         # (1 end)
   |                                          # or, 
      (                                         # (2 start), Non-quoted
           (?:
                [0-9a-z] 
                (?:
                     \.
                     (?! \. )
                  |                                          # or, 
                     [-!#\$%&'\*\+/=\?\^`\{\}\|~\w] 
                )*
           )?
           [0-9a-z] 
           @
      )                                         # (2 end)
 )
                                           # Domain part
 (?= .{1,255} $ )                          # 255 max chars
 (?:
      (                                         # (3 start), IP
           \[
           (?: \d{1,3} \. ){3}
           \d{1,3} \]
      )                                         # (3 end)
   |                                          # or,   
      (                                         # (4 start), Others
           (?:                                       # Labels (63 max chars each)
                (?= .{1,63} \. )
                [0-9a-z] [-\w]* [0-9a-z]* 
                \.
           )+
           [a-z0-9] [\-a-z0-9]{0,22} [a-z0-9] 
      )                                         # (4 end)
   |                                          # or,
      (                                         # (5 start), Localdomain
           (?= .{1,63} $ )
           [0-9a-z] [-\w]* 
      )                                         # (5 end)
 )
 $                                         # EOS
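
As a rough sketch of how the pieces above fit together in Java: the pattern below is the Java string literal given at the top of this answer, split in two only for line length, and the comment-stripping step is the find/replace from the note. The class and method names are hypothetical. Note that the simple comment regex also strips parenthesized text inside a quoted local part, which may or may not be what you want.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper around the regex from this answer.
class Rfc5322RegexDemo {
    // Same pattern as the Java string literal above, split in two for readability.
    private static final Pattern EMAIL = Pattern.compile(
        "(?im)^(?=.{1,64}@)(?:(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"@)|((?:[0-9a-z](?:\\.(?!\\.)|[-!#\\$%&'\\*\\+/=\\?\\^`\\{\\}\\|~\\w])*)?[0-9a-z]@))"
        + "(?=.{1,255}$)(?:(\\[(?:\\d{1,3}\\.){3}\\d{1,3}\\])|((?:(?=.{1,63}\\.)[0-9a-z][-\\w]*[0-9a-z]*\\.)+[a-z0-9][\\-a-z0-9]{0,22}[a-z0-9])|((?=.{1,63}$)[0-9a-z][-\\w]*))$");

    // Find \([^)]*\) and replace with nothing, as described in the note.
    private static final Pattern COMMENT = Pattern.compile("\\([^)]*\\)");

    static boolean isValid(String email) {
        String withoutComments = COMMENT.matcher(email).replaceAll("");
        return EMAIL.matcher(withoutComments).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "simple@example.com",
            "john.smith(comment)@example.com",
            "john..doe@example.com",
            "john.doe@example..com",
            "just\"not\"right@example.com"
        };
        int i = 0;
        for (String s : samples) {
            System.out.println(++i + "." + s + " : " + isValid(s));
        }
    }
}
```

Compiling the pattern once in a static field avoids recompiling it for every address in a loop like the one in the question.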

How can I make sudhansu_@gmail.com a valid email ID? – Mihir Feb 7 at 9:34

I think the spec wants the local part either to be enclosed in quotes
or to begin and end with [0-9a-z].

But to get around the latter and make sudhansu_@gmail.com valid, just
replace group 2 with this:

      (                             # (2 start), Non-quoted
           [0-9a-z] 
           (?:
                \.
                (?! \. )
             |                              # or, 
                [-!#\$%&'\*\+/=\?\^`\{\}\|~\w] 
           )*
           @

      )                             # (2 end)

New regex

"(?im)^(?=.{1,64}@)(?:(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"@)|([0-9a-z](?:\\.(?!\\.)|[-!#\\$%&'\\*\\+/=\\?\\^`\\{\\}\\|~\\w])*@))(?=.{1,255}$)(?:(\\[(?:\\d{1,3}\\.){3}\\d{1,3}\\])|((?:(?=.{1,63}\\.)[0-9a-z][-\\w]*[0-9a-z]*\\.)+[a-z0-9][\\-a-z0-9]{0,22}[a-z0-9])|((?=.{1,63}$)[0-9a-z][-\\w]*))$"

New demo

https://regex101.com/r/ObS3QZ/5
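
A quick Java check of the new regex, as a sketch (the class and method names are hypothetical; the pattern is the "New regex" string literal above, split in two only for line length). One side effect worth checking against your requirements: because the local part no longer has to end with [0-9a-z], an address with a trailing dot such as john.@example.com now matches as well.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper around the relaxed regex from this answer.
class RelaxedLocalPartDemo {
    // Same pattern as the "New regex" literal above, split in two for readability.
    private static final Pattern EMAIL = Pattern.compile(
        "(?im)^(?=.{1,64}@)(?:(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"@)|([0-9a-z](?:\\.(?!\\.)|[-!#\\$%&'\\*\\+/=\\?\\^`\\{\\}\\|~\\w])*@))"
        + "(?=.{1,255}$)(?:(\\[(?:\\d{1,3}\\.){3}\\d{1,3}\\])|((?:(?=.{1,63}\\.)[0-9a-z][-\\w]*[0-9a-z]*\\.)+[a-z0-9][\\-a-z0-9]{0,22}[a-z0-9])|((?=.{1,63}$)[0-9a-z][-\\w]*))$");

    static boolean isValid(String email) {
        return EMAIL.matcher(email).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "sudhansu_@gmail.com",     // the address from the comment
            "john.@example.com",       // trailing dot, also accepted now
            "john..doe@example.com"    // consecutive dots, still rejected
        };
        for (String s : samples) {
            System.out.println(s + " : " + isValid(s));
        }
    }
}
```

If the trailing dot must stay invalid, one option is a separate check on the local part before applying the regex.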



Answer 2:

It's not the question you asked, but why re-invent the wheel?

Apache Commons Validator already has a class that covers this.

org.apache.commons.validator.routines.EmailValidator.getInstance().isValid(email)

This way you aren't responsible for keeping up to date with changing email format standards.



Answer 3:

A regular expression is the most difficult and error-prone way to validate email addresses. If you are using an implementation of javax.mail to send the emails, then the simplest way to determine whether an address will work is to use the provided parser: whether or not the address is compliant, if the library cannot use it, nothing else matters.

import javax.mail.internet.InternetAddress;

public static boolean validateEmail(String address) {
    try {
        // if this fails, the mail library can't send emails to this address
        InternetAddress ia = new InternetAddress(address, true);
        // accept only a single (non-group) address that does not start with routing information
        return !ia.isGroup() && ia.getAddress().charAt(0) != '@';
    }
    catch (Throwable t) {
        return false;
    }
}

Invoking the constructor with false as the second argument allows addresses without an @domain part, which is why true is passed here. We also don't want routing information (a feature practically designed for fraud through server bouncing), but the checkAddress function invoked internally is private, so we can't just call checkAddress(addr,false,true). Instead, we have to check the first character of the validated address.

Now what you may notice here is that this validation method is actually compliant with RFC 2822 rather than RFC 5322. The reason is that unless you are implementing your own SMTP sender library, you are using one that depends on this parser, and if you have an address that is 5322-valid but 2822-invalid, then your 5322 validation is rendered useless. But if you are implementing your own RFC 5322 SMTP library, then you should learn from the existing ones and write a parser function rather than a regular expression.