How do I extract a list of email/mailbox strings w

Posted 2019-09-05 04:13

Question:

Given some arbitrary text, I'd like to extract all email addresses and 'mailbox specifiers' (e.g. "Fred Smith" <fred@me.com>). I looked at NSDataDetector, but it does not handle email addresses.

Answer 1:

The way to approach this is to get a really good algorithm that can detect as many valid addresses as possible, and reject improper ones. Probably the best solution would be a parser constructed using lex and yacc, but reasonable solutions exist using regular expressions.

See this site for both a list of tested regular expressions as well as a more in-depth discussion of the problem and possible solutions.

The regular expressions shown on the above site are formatted for PHP: they have leading and trailing '/' delimiters, plus trailing 'flags' indicating case-insensitivity and so on (see this site for more info). Strip these off before using an expression in an Objective-C project, and express the flags as NSRegularExpression options instead. Also strip any anchors ('^' and '$'), since we want to find multiple addresses within the text, not match a single one.
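As a rough sketch of that clean-up, assuming the expression uses '/' as its delimiter and is anchored only at its outermost ends, something like the following could work. The helper name BarePatternFromPHPRegex is made up for illustration, and the flags are simply discarded.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: turns a PHP-style literal such as /^...$/i into a
// bare pattern. Assumes '/' delimiters and only outermost anchors.
static NSString *BarePatternFromPHPRegex(NSString *phpRegex) {
    NSString *pattern = [phpRegex stringByTrimmingCharactersInSet:
                         [NSCharacterSet whitespaceAndNewlineCharacterSet]];

    // Drop the leading '/' and everything from the last '/' on (the flags).
    if ([pattern hasPrefix:@"/"]) {
        NSRange lastSlash = [pattern rangeOfString:@"/" options:NSBackwardsSearch];
        if (lastSlash.location > 0) {
            pattern = [pattern substringWithRange:NSMakeRange(1, lastSlash.location - 1)];
        }
    }

    // Drop a leading '^' and a trailing '$' (unless the '$' is escaped as \$).
    if ([pattern hasPrefix:@"^"]) {
        pattern = [pattern substringFromIndex:1];
    }
    if ([pattern hasSuffix:@"$"] && ![pattern hasSuffix:@"\\$"]) {
        pattern = [pattern substringToIndex:pattern.length - 1];
    }
    return pattern;
}
```

Anchors buried inside alternations are not touched by this sketch, so check the result by eye before relying on it.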

NSRegularExpression is the class to use here. What I've found helpful is to store the regular expression in a file in my project, so that you don't need to worry about escaping all the backslashes and quotes. The code then reads the expression into a string, and creates the object as follows:

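// self.regex is assumed to hold the base name of a .txt resource containing the pattern
NSString *fullPath = [[NSBundle mainBundle] pathForResource:self.regex ofType:@"txt"];
NSError *error = nil;
NSString *pattern = [NSString stringWithContentsOfFile:fullPath encoding:NSUTF8StringEncoding error:&error];
assert(pattern); // on failure, error describes why the file could not be read
// Drop the trailing line break most editors append; it would otherwise become part of the pattern
pattern = [pattern stringByTrimmingCharactersInSet:[NSCharacterSet newlineCharacterSet]];
// reg is an NSRegularExpression declared elsewhere; some patterns may not need NSRegularExpressionCaseInsensitive
reg = [NSRegularExpression regularExpressionWithPattern:pattern options:NSRegularExpressionCaseInsensitive error:&error];
assert(reg); // check the return value, not error, to detect failure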

Once you have an initialized expression, use it to get a list of matches. Each match is an NSTextCheckingResult whose range property locates one address in the string:

NSArray *ret = [reg matchesInString:str options:0 range:NSMakeRange(0, [str length])];
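To turn those matches into actual strings, you can walk the array and cut each range out of the original text. This is only a sketch: the function name AddressesInString is made up, and reg is whatever expression you created above.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the substring for every match of reg in str.
static NSArray<NSString *> *AddressesInString(NSString *str, NSRegularExpression *reg) {
    NSMutableArray<NSString *> *addresses = [NSMutableArray array];
    NSArray<NSTextCheckingResult *> *matches =
        [reg matchesInString:str options:0 range:NSMakeRange(0, str.length)];
    for (NSTextCheckingResult *match in matches) {
        // match.range is the range of the whole match, i.e. one address.
        [addresses addObject:[str substringWithRange:match.range]];
    }
    return addresses;
}
```

Keep the ranges around if you also want the names, since finding a name means looking at the characters on either side of each address.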

However, every email address contains an '@', so it's probably worthwhile to verify that the text has at least one before running the expression at all. Also, since the text might contain line feeds and/or carriage returns, you might want to strip those first. It's probably better to remove them completely rather than replace them with spaces, as some mail program might have split a line at some interior point of an address.
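Both pre-processing steps fit in a few lines. In this sketch the function name PreparedTextForAddressScan is made up, and returning nil to mean "nothing worth scanning" is just one possible convention.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns nil when the text cannot contain an address,
// otherwise the text with every line break removed.
static NSString *PreparedTextForAddressScan(NSString *text) {
    if ([text rangeOfString:@"@"].location == NSNotFound) {
        return nil;   // no '@', so no address to find
    }
    // Split on line-break characters and glue the pieces back together.
    NSArray<NSString *> *pieces =
        [text componentsSeparatedByCharactersInSet:[NSCharacterSet newlineCharacterSet]];
    return [pieces componentsJoinedByString:@""];
}
```

Note the trade-off: removing a break repairs an address that was split mid-way, but it also glues together whatever words happened to sit on either side of a normal line ending.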

Once you have a list of the address ranges, the job is mostly done if all you wanted was the addresses. However, addresses are often presented in 'mailbox specifier' format, where a name precedes the address and the address is wrapped in '<' and '>'. This format is covered in RFC 5322, section 3.4.

To recover the name from a 'mailbox specifier', check whether the address is wrapped in '<' and '>'. If so, skip backwards over any white space before the '<' and take the string that precedes it. Most names are wrapped in double quotes (common practice), with a backslash escape to include special characters (like '"'), but a name can also be a bare string of unquoted words.
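Here is one way the name lookup might be sketched, given the text and the range of one matched address. The function name DisplayNameBeforeAddress is made up. For unquoted names the code assumes the name starts after the previous ',', ';', ':', '>' or line break, which is a guess on my part and not something the RFC guarantees for arbitrary text. An escaped backslash directly before a quote is not handled.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the name in front of "<address>", or nil when
// the address is not wrapped in angle brackets or no name can be found.
static NSString *DisplayNameBeforeAddress(NSString *str, NSRange addr) {
    NSUInteger end = NSMaxRange(addr);
    if (addr.location == 0 || end >= str.length ||
        [str characterAtIndex:addr.location - 1] != '<' ||
        [str characterAtIndex:end] != '>') {
        return nil;
    }

    // Skip white space between the name and '<'.
    NSCharacterSet *space = [NSCharacterSet whitespaceCharacterSet];
    NSUInteger nameEnd = addr.location - 1;   // index of '<', one past the name
    while (nameEnd > 0 && [space characterIsMember:[str characterAtIndex:nameEnd - 1]]) {
        nameEnd--;
    }
    if (nameEnd == 0) return nil;

    if ([str characterAtIndex:nameEnd - 1] == '"') {
        // Quoted name: walk back to the opening quote, skipping \" escapes.
        for (NSUInteger j = nameEnd - 1; j > 0; j--) {
            NSUInteger k = j - 1;
            BOOL escaped = (k > 0 && [str characterAtIndex:k - 1] == '\\');
            if ([str characterAtIndex:k] == '"' && !escaped) {
                NSString *quoted = [str substringWithRange:NSMakeRange(k + 1, nameEnd - k - 2)];
                // Undo backslash escapes: \" becomes ", \\ becomes \.
                return [quoted stringByReplacingOccurrencesOfString:@"\\\\(.)"
                                                         withString:@"$1"
                                                            options:NSRegularExpressionSearch
                                                              range:NSMakeRange(0, quoted.length)];
            }
        }
        return nil;   // no opening quote found
    }

    // Bare name (assumption): starts after the previous separator or line break.
    NSCharacterSet *stops = [NSCharacterSet characterSetWithCharactersInString:@",;:>\r\n"];
    NSRange stop = [str rangeOfCharacterFromSet:stops
                                        options:NSBackwardsSearch
                                          range:NSMakeRange(0, nameEnd)];
    NSUInteger start = (stop.location == NSNotFound) ? 0 : NSMaxRange(stop);
    NSString *name = [[str substringWithRange:NSMakeRange(start, nameEnd - start)]
                      stringByTrimmingCharactersInSet:space];
    return name.length > 0 ? name : nil;
}
```

Call it once per match range returned by the regular expression; a nil result just means you have a plain address with no name attached.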

This same technique can be used for real-time validation, say to enable a submit button once a text string becomes a valid email address. In this case keep the '^' and '$' anchors so the expression must match the entire string, evaluate the string on each user change, and enable or disable the submit button accordingly.
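For validation the expression has to cover the whole string, not just some part of it. A minimal sketch, assuming you start from an unanchored pattern and wrap it yourself (both function names here are made up):

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: wraps an unanchored pattern so that it can only match
// an entire string (\A = start of input, \z = end of input).
static NSRegularExpression *WholeStringRegex(NSString *barePattern) {
    NSString *anchored = [NSString stringWithFormat:@"\\A(?:%@)\\z", barePattern];
    return [NSRegularExpression regularExpressionWithPattern:anchored
                                                     options:NSRegularExpressionCaseInsensitive
                                                       error:NULL];
}

// Hypothetical helper: YES when candidate as a whole matches wholeRegex.
static BOOL IsValidAddress(NSString *candidate, NSRegularExpression *wholeRegex) {
    NSRange whole = NSMakeRange(0, candidate.length);
    return [wholeRegex numberOfMatchesInString:candidate options:0 range:whole] == 1;
}
```

Create the wrapped expression once, then call IsValidAddress from whatever handler fires when the text changes and assign the result to the button's enabled property.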

If all this seems like a lot of work to code, you can grab an open source project on GitHub.

EDIT1: for a faster, but less rigorous, method see the comment by CodaFi.

EDIT2: it appears the content of a "mailto:" URL can be quite complex. The GitHub project handles only the simplest form and does not URL-decode the address. This will be addressed in a future update.

EDIT3: the project was updated to fully handle "mailto:" objects, and returns to, cc, bcc, subject, and body, all URL-decoded.
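If you only need the basics and don't want to pull in the project, NSURLComponents can do most of the decoding for you. This is a sketch and not how the GitHub project does it: the function name FieldsFromMailtoString and the dictionary layout are made up, and when a field appears twice the later value simply overwrites the earlier one.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the decoded fields of a "mailto:" URL string,
// keyed by lowercase field name ("to", "cc", "bcc", "subject", "body", ...).
static NSDictionary<NSString *, NSString *> *FieldsFromMailtoString(NSString *urlString) {
    NSURLComponents *parts = [NSURLComponents componentsWithString:urlString];
    if (![parts.scheme.lowercaseString isEqualToString:@"mailto"]) {
        return nil;   // not a mailto: URL (or not parseable at all)
    }

    NSMutableDictionary<NSString *, NSString *> *fields = [NSMutableDictionary dictionary];
    // The part after "mailto:" is the (percent-decoded) recipient list.
    if (parts.path.length > 0) {
        fields[@"to"] = parts.path;
    }
    // Query items come back percent-decoded as well.
    for (NSURLQueryItem *item in parts.queryItems) {
        if (item.value != nil) {
            fields[item.name.lowercaseString] = item.value;
        }
    }
    return fields;
}
```

The "to" value may still hold several comma-separated addresses, so run it through the same address extraction as any other text.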