I need to parse email files with regex in C#. That is, I need to parse a file that contains several emails and split each one into its constituents, e.g. From, To, Bcc, etc.
The regex I am using for email addresses is
"\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*"
The problem I am having is that the To, Cc and Bcc fields sometimes contain more than one address and span more than one line:
To: Me meagain <me@me.com>,
Me1 meagain <me1@me.com>,Me3 meagain <me1@me.com>
Also, which regex will match the message?
http://www.codeproject.com/KB/office/reading_an_outlook_msg.aspx
The above tutorial will give you a decent idea of how to read *.msg files from the file system. If you consider using the System.Net.Mail.MailMessage object you can get all info such as:
senders, recipients, attachments, HTML email template, text email template, etc.
Thanks,
I created an API called SigParser which does this for you. It breaks reply chain emails into their parts and handles these sorts of problems where header lines are split. You get an array of the email response bodies, along with the recipients of each section of the email if that data was in the reply chain header.
Parsing an email message with regular expressions is a terrible idea. You might be able to parse the constituent parts with regular expressions, but finding the constituent parts with regular expressions is going to give you fits.
The normal case, of course, is pretty easy. But then you run across something like a message that has an embedded message within it. That is, the content includes a full email message with From:, To:, Bcc:, etc. And your naive regex parser thinks, "Oh, boy! I found a new message!"
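To make the embedded-message problem concrete, here is a minimal sketch written as a C# 9+ top-level program. The sample message and the helper names `CountByRegex` and `CountInHeaderSection` are made up for illustration. It shows a naive "every From: line starts a message" regex seeing two messages where there is only one, while a check limited to the header section (everything before the first blank line) sees one.

```csharp
using System;
using System.Text.RegularExpressions;

// Naive approach: treat every line that starts with "From:" as a new message.
static int CountByRegex(string text) =>
    Regex.Matches(text, @"^From:", RegexOptions.Multiline).Count;

// Structural approach: the header section ends at the first blank line,
// so only look for "From:" before that point.
static int CountInHeaderSection(string text)
{
    int end = text.IndexOf("\r\n\r\n", StringComparison.Ordinal);
    string headerSection = end >= 0 ? text.Substring(0, end) : text;
    return Regex.Matches(headerSection, @"^From:", RegexOptions.Multiline).Count;
}

// Hypothetical sample: ONE message whose body quotes another message's headers.
string raw =
    "From: alice@example.com\r\n" +
    "To: bob@example.com\r\n" +
    "Subject: Fwd: lunch\r\n" +
    "\r\n" +
    "Here is what Carol sent me:\r\n" +
    "\r\n" +
    "From: carol@example.com\r\n" +
    "To: alice@example.com\r\n" +
    "Subject: lunch\r\n";

Console.WriteLine(CountByRegex(raw));          // prints 2
Console.WriteLine(CountInHeaderSection(raw));  // prints 1
```

Even the second function is only a partial fix: it assumes you already know where one message ends and the next begins, which is exactly the part a real parser has to get right.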
You're better off reading and understanding the Internet Message Format and writing a real parser, or using something already written like OpenPop.NET.
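For the specific multi-line To/Cc/Bcc problem in the question, the Internet Message Format says that a header line may be "folded": a line break followed by a space or tab continues the previous line. The sketch below (a C# 9+ top-level program) unfolds the header block first and then hands the address list to `System.Net.Mail.MailAddressCollection`, whose `Add(string)` overload accepts a comma-separated list. The helper names `Unfold` and `TryGetHeader` are hypothetical, and the sample assumes the continuation line starts with whitespace, as the format requires. It is not a full message parser.

```csharp
using System;
using System.Net.Mail;
using System.Text.RegularExpressions;

// Unfold: remove any line break that is followed by a space or tab,
// so each header ends up on a single line.
static string Unfold(string headerBlock) =>
    Regex.Replace(headerBlock, @"\r?\n(?=[ \t])", "");

// Find the first header with the given name (case-insensitive) in an unfolded block.
static bool TryGetHeader(string unfolded, string name, out string value)
{
    foreach (string line in unfolded.Split('\n'))
    {
        int colon = line.IndexOf(':');
        if (colon > 0 &&
            line.Substring(0, colon).Trim().Equals(name, StringComparison.OrdinalIgnoreCase))
        {
            value = line.Substring(colon + 1).Trim();
            return true;
        }
    }
    value = "";
    return false;
}

// Sample header block based on the question, with a folded To: header.
string headerBlock =
    "From: Someone <someone@example.com>\r\n" +
    "To: Me meagain <me@me.com>,\r\n" +
    " Me1 meagain <me1@me.com>,Me3 meagain <me1@me.com>\r\n" +
    "Subject: test\r\n";

var recipients = new MailAddressCollection();
if (TryGetHeader(Unfold(headerBlock), "To", out string toValue))
{
    // Add(string) parses a comma-separated list of addresses.
    recipients.Add(toValue);
}

foreach (MailAddress address in recipients)
{
    Console.WriteLine(address.Address);
}
```

The same two helpers work for Cc and Bcc by changing the header name. Letting `MailAddressCollection` do the address parsing avoids splitting on commas by hand, which breaks as soon as a quoted display name contains a comma.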
Also, check out the suggestions in Reading Email using Pop3 in C# and https://stackoverflow.com/questions/26606/free-pop3-net-library, among others.
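A good example of the difficulty you'll face is that your regular expression for matching email addresses is inadequate. According to section 3.2.4 of RFC2822 (linked above), the following characters are allowed in the "local-part" of the email address:
A-Z a-z 0-9 ! # $ % & ' * + - / = ? ^ _ ` { | } ~
A "." is also allowed between runs of those characters, and a quoted local-part can contain even more.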
The domain name can contain any ASCII except whitespace and the "\" character, and has to meet some format requirements. Then there's the "obsolete" stuff that, although deprecated, is still in use. And that's just in parsing email addresses. If you look at the stuff that can be included in the other fields, I think you'll agree that trying to parse it with regular expressions is going to be frustrating at best.
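As a quick check of the point about the local-part, this sketch (a C# 9+ top-level program) anchors the regex from the question with `^` and `$` so it has to match the whole string, then tries it on a few addresses. The sample addresses and the helper name `MatchesQuestionPattern` are made up for illustration. The last three samples use local-part forms that the RFC permits.

```csharp
using System;
using System.Text.RegularExpressions;

// The pattern from the question, anchored so it must match the entire string.
static bool MatchesQuestionPattern(string candidate) =>
    Regex.IsMatch(candidate, @"^\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$");

string[] samples =
{
    "me@me.com",                   // plain address
    "first.last@example.com",      // dot in the local part
    "o'brien@example.com",         // apostrophe in the local part
    "user!tag@example.com",        // "!" in the local part
    "\"john smith\"@example.com",  // quoted-string local part
};

foreach (string sample in samples)
{
    Console.WriteLine($"{sample,-28} {MatchesQuestionPattern(sample)}");
}
```

The first two samples match and the last three do not. Without the anchors the pattern would instead match a fragment such as `brien@example.com` inside the third sample, which is arguably worse because it silently returns the wrong address.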