I've got a webhook posting to a form on my web application and I need to parse out the email header addresses.
Here is the source text:
Thread-Topic: test subject
Thread-Index: AcwE4mK6Jj19Hgi0SV6yYKvj2/HJbw==
From: "Lastname, Firstname" <firstname_lastname@domain.com>
To: <testto@domain.com>, testto1@domain.com, testto2@domain.com
Cc: <testcc@domain.com>, test3@domain.com
X-OriginalArrivalTime: 27 Apr 2011 13:52:46.0235 (UTC) FILETIME=[635226B0:01CC04E2]
I'm looking to pull out the following:
<testto@domain.com>, testto1@domain.com, testto2@domain.com
I've been struggling with regex all day without any luck.
There's a breakdown of validating emails with regex here, which references a more practical implementation of RFC 2822 with:
It also looks like you only want the email addresses out of the "To" field, and you've got the <> to worry about as well, so something like the following would likely work:
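A minimal sketch of that idea in Python (an assumption, since the question does not name a language). The address pattern is a deliberately simplified matcher, not the full RFC 2822 expression, and the names `HEADER`, `TO_LINE`, `ADDRESS` and `to_addresses` are made up for illustration. It first isolates the `To:` line, then pulls out the addresses with or without the angle brackets:

```python
import re

HEADER = """Thread-Topic: test subject
Thread-Index: AcwE4mK6Jj19Hgi0SV6yYKvj2/HJbw==
From: "Lastname, Firstname" <firstname_lastname@domain.com>
To: <testto@domain.com>, testto1@domain.com, testto2@domain.com
Cc: <testcc@domain.com>, test3@domain.com
X-OriginalArrivalTime: 27 Apr 2011 13:52:46.0235 (UTC) FILETIME=[635226B0:01CC04E2]"""

# Capture everything after "To:" up to the end of that line.
# Simplification: folded (multi-line) header values are not handled.
TO_LINE = re.compile(r"^To:[ \t]*(.*)$", re.MULTILINE | re.IGNORECASE)

# Simplified address pattern (assumption): local part, "@", dotted domain.
# It does not cover quoted local parts, comments, or other RFC 2822 corner cases.
ADDRESS = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def to_addresses(header):
    """Return the bare addresses found on the To: line, or [] if there is none."""
    match = TO_LINE.search(header)
    if match is None:
        return []
    return ADDRESS.findall(match.group(1))


print(to_addresses(HEADER))
```

Anchoring on `^To:` keeps the `From:` and `Cc:` addresses out of the result, which a pattern run over the whole header would not.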
Again, as others have mentioned, you might not want to do this. But if you want a regex that will turn that input into
<testto@domain.com>, testto1@domain.com, testto2@domain.com
that'll do it.
Contrary to some of the posts here, I have to agree with mmutz: you cannot parse emails with a regex. See this article:
http://tools.ietf.org/html/rfc2822#section-3.4.1
The idea of "locally interpreted" means that only the receiving server is expected to be able to parse the local part of the address (the portion before the @).
If I were going to try and solve this I would find the "To" line contents, break it apart and attempt to parse each segment with System.Net.Mail.MailAddress.
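A rough Python analogue of that approach, offered as a sketch rather than the .NET code this answer describes. The standard library's `email` package stands in for `System.Net.Mail.MailAddress`, and the names `SAMPLE` and `parse_to_field` are assumptions for illustration. `message_from_string` locates the `To` header and `getaddresses` breaks it into (display name, address) pairs, so no hand-written regex is involved:

```python
from email import message_from_string
from email.utils import getaddresses

SAMPLE = (
    'From: "Lastname, Firstname" <firstname_lastname@domain.com>\n'
    "To: <testto@domain.com>, testto1@domain.com, testto2@domain.com\n"
    "Cc: <testcc@domain.com>, test3@domain.com\n"
)


def parse_to_field(raw_headers):
    """Return (display_name, address) pairs taken from the To header."""
    message = message_from_string(raw_headers)
    # get_all returns the fallback (an empty list here) when the header is absent.
    to_values = message.get_all("To", [])
    return getaddresses(to_values)


for name, address in parse_to_field(SAMPLE):
    print(repr(name), address)
```

Handing the split to an address parser also avoids a trap with manual splitting: breaking a value such as `"Lastname, Firstname" <firstname_lastname@domain.com>` on commas would cut the display name in half.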
Output from the above program:
You cannot use regular expressions to parse RFC 2822 mail, because its grammar contains a recursive production. Off the top of my head, it was the one for comments, which can nest:
(a (nested) comment)
That makes the grammar non-regular. Regular expressions (as the name suggests) can only parse regular grammars.
See also "RegEx match open tags except XHTML self-contained tags" for more information.
The RFC 2822-compliant email regex is:
Just run it over your text and you'll get the email addresses.
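To illustrate what "run it over your text" looks like, here is a short Python sketch. The pattern is a simplified stand-in (an assumption), not the RFC 2822 regex this answer refers to, and `SIMPLE_EMAIL`, `TEXT` and `find_all_addresses` are illustrative names:

```python
import re

# Simplified stand-in pattern (assumption); it ignores quoted local parts,
# comments, and other forms the RFC allows.
SIMPLE_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

TEXT = (
    'From: "Lastname, Firstname" <firstname_lastname@domain.com>\n'
    "To: <testto@domain.com>, testto1@domain.com\n"
    "Cc: test3@domain.com\n"
)


def find_all_addresses(text):
    """Return every substring of text that matches the simplified pattern."""
    return SIMPLE_EMAIL.findall(text)


print(find_all_addresses(TEXT))
```

Scanning the whole header this way returns the From and Cc addresses as well, so the text would still need to be narrowed to the To line first if only those recipients are wanted.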
Of course, there's always the option of not using regex where regex isn't the best option. But up to you!
As Blindy suggests, sometimes you can just parse it out the old-fashioned way.
If you prefer to do that, here is a quick approach assuming the email header text is called 'header':
I may be off by a byte on the subtraction but you can very easily test and modify this. Of course you will also have to be certain you always will have a Cc: row in your header or this won't work.
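A sketch of that old-fashioned slicing in Python, under the same assumptions the answer states plus two more: the `Cc:` line directly follows the `To:` line, and `To:` is not the very first line of the header (the markers include the preceding newline so that other headers containing those letters are not matched). The function name `slice_to_field` is made up for illustration:

```python
def slice_to_field(header):
    """Return the text between the To: and Cc: markers, or None if either is missing."""
    start_marker = "\nTo:"
    end_marker = "\nCc:"

    start = header.find(start_marker)
    if start == -1:
        return None
    # Skip past the marker itself; using len() avoids off-by-one arithmetic.
    start += len(start_marker)

    end = header.find(end_marker, start)
    if end == -1:
        return None
    return header[start:end].strip()


header = (
    "Thread-Topic: test subject\n"
    "To: <testto@domain.com>, testto1@domain.com, testto2@domain.com\n"
    "Cc: <testcc@domain.com>, test3@domain.com\n"
)
print(slice_to_field(header))
```

Returning `None` when either marker is missing makes the "no Cc: row" case explicit instead of silently slicing the wrong span.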