I'm currently experimenting with catching, parsing and sorting bounced emails. I have the basics set up and they do what I want. The problem is that there seems to be no standard for the messages returned in a bounced email.
For example, some servers return the error code as specified by RFC 1893, and nine times out of ten I can pick that up with a simple regex. But sometimes servers just say that the email has bounced, with either no reason given or a reason worded entirely differently from any standard.
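For anyone reading along, here is a rough sketch of the kind of regex meant above. It assumes the code appears in the class.subject.detail form that RFC 1893 describes (for example 5.1.1). The names STATUS_RE and find_status are made up for illustration, and the lookarounds are just my attempt to avoid matching fragments of IP addresses.

```python
import re

# Enhanced status code: class (2, 4 or 5), subject, detail, e.g. "5.1.1".
# The lookbehind/lookahead reject pieces of longer dotted numbers such as IPs.
STATUS_RE = re.compile(r"(?<![\d.])([245])\.(\d{1,3})\.(\d{1,3})(?!\.?\d)")

# Class and subject meanings as listed in RFC 1893.
CLASSES = {"2": "success", "4": "temporary failure", "5": "permanent failure"}
SUBJECTS = {
    "0": "other or undefined",
    "1": "addressing",
    "2": "mailbox",
    "3": "mail system",
    "4": "network and routing",
    "5": "mail delivery protocol",
    "6": "message content or media",
    "7": "security or policy",
}


def find_status(text):
    """Return details of the first status code found in text, or None."""
    match = STATUS_RE.search(text)
    if match is None:
        return None
    cls, subject, _detail = match.groups()
    return {
        "code": match.group(0),
        "class": CLASSES[cls],
        "subject": SUBJECTS.get(subject, "unknown"),
    }
```

A None result is exactly the "no reason given" case, so whatever calls this still needs a fallback.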
So I guess my question is, has anyone got any solution to this? I don't want to be searching for a billion and one possible strings in the email returned to be honest. Yet it would be nice to not have to resort to 'reason unknown' or something similar.
Has anyone else had any luck with this or ideas?
Cheers
You could set up a system that lets an operator review unrecognised messages, select the telling strings, and categorize from there. Over time, you could hope to get that 1 in 10 down to 1 in 100 or 1 in 1,000. There will always be more corner cases, however.
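A minimal sketch of how that could look, assuming the operator's selected strings are kept as plain (phrase, category) pairs. The phrases, the category names and the review_queue list are invented for illustration, not taken from any real server.

```python
# Operator-maintained rules: (lower-case phrase, category). Hypothetical examples.
rules = [
    ("mailbox full", "mailbox_full"),
    ("user unknown", "no_such_user"),
]

# Bounce bodies that no rule matched, waiting for an operator to look at them.
review_queue = []


def categorize(body, rules, review_queue):
    """Return the category of the first matching phrase, else queue for review."""
    lowered = body.lower()
    for phrase, category in rules:
        if phrase in lowered:
            return category
    review_queue.append(body)
    return None


def add_rule(rules, phrase, category):
    """Record a string the operator picked out of a reviewed message."""
    rules.append((phrase.lower(), category))
```

Each time the operator adds a rule, the same wording is categorized automatically from then on, which is how the unknown share shrinks.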
Also not a definitive answer, but in a similar spirit to Kyle's response, you could use a Bayesian/token-based spam filter to "learn" about bounce messages and then automatically route them to whatever you want to handle the bounced mail.
In other words, you have an account where you train spamassassin or spamprobe or whatever that a bunch of different bounce messages (and only bounce messages) are "junk", then let that spam system be a second line of filtering after whatever you've developed.
So, let's say your solution, the first filter, finds 90% of bounced messages. You have your system do whatever it normally does with bounces, then save them to a bounce-messages mailbox, which is periodically scanned by spamassassin/spamprobe to learn those messages as "junk".
You then also run spamassassin or spamprobe or whatever as a second filter (on anything yours doesn't flag as a bounce) to do its own estimation of bounced-ness, and whatever it considers "junk" (because you've trained it to think bounce = junk), you also route to your program etc.
Still requires a little bit of manual review, but in theory it should get better and better over time as you rely on the spam system's learning to account for the edge cases.
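To make the two-stage idea concrete, here is a toy sketch that swaps in a tiny hand-rolled naive Bayes classifier for spamassassin/spamprobe. It is not how either tool works internally. TinyBayes, route, the labels and the 0.9 threshold are all made up, and it assumes you also feed it some ordinary non-bounce mail so it has something to compare against. For brevity it learns from each first-filter hit straight away instead of scanning a mailbox periodically.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokens(text):
    """Split text into lower-case word tokens."""
    return TOKEN_RE.findall(text.lower())


class TinyBayes:
    """Two-class naive Bayes over word tokens with add-one smoothing."""

    def __init__(self):
        self.counts = {"bounce": Counter(), "other": Counter()}
        self.docs = {"bounce": 0, "other": 0}

    def learn(self, text, label):
        self.counts[label].update(tokens(text))
        self.docs[label] += 1

    def bounce_probability(self, text):
        vocab = set(self.counts["bounce"]) | set(self.counts["other"])
        total_docs = sum(self.docs.values())
        log_scores = {}
        for label, counter in self.counts.items():
            total_tokens = sum(counter.values())
            # Smoothed class prior, then one smoothed likelihood per token.
            score = math.log((self.docs[label] + 1) / (total_docs + 2))
            for tok in tokens(text):
                score += math.log((counter[tok] + 1) / (total_tokens + len(vocab) + 1))
            log_scores[label] = score
        # Convert log scores to a probability without underflow.
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        return weights["bounce"] / sum(weights.values())


def route(message, first_filter, model, threshold=0.9):
    """First filter decides if it can; otherwise fall back to the learned model."""
    if first_filter(message):
        model.learn(message, "bounce")  # confirmed bounces become training data
        return "bounce"
    if model.bounce_probability(message) >= threshold:
        return "probable-bounce"  # a candidate for the occasional manual review
    return "inbox"
```

Here first_filter is whatever you already have (the regex check, say), passed in as a function that returns True for a recognised bounce.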
We are facing the same problem and have not found a "perfect" solution either. I think you could either
- use a service provider (with a proper mail API), which would let you "outsource" the problem and give you a high detection rate, or
- use a simple filter to catch at least (say) 80% of the bounces. In our setup, this was enough to keep our database in a reasonable state.
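As a rough idea of what such a simple filter might look like (not our actual setup), the sketch below uses Python's standard email module and checks a few common signs: a multipart/report message with report-type=delivery-status, a MAILER-DAEMON or postmaster sender, or a telltale subject. The subject hints are guesses, and how many bounces this catches will depend on your mail.

```python
import email

# Hypothetical subject fragments; extend these from the bounces you actually see.
SUBJECT_HINTS = ("undeliverable", "delivery status notification", "returned mail")


def looks_like_bounce(raw_message):
    """Cheap header-based guess at whether a raw message is a bounce."""
    msg = email.message_from_string(raw_message)

    # Standard delivery reports announce themselves in the Content-Type header.
    report_type = msg.get_param("report-type", "")
    if (
        msg.get_content_type() == "multipart/report"
        and isinstance(report_type, str)
        and report_type.lower() == "delivery-status"
    ):
        return True

    # Many servers send bounces from a daemon or postmaster address.
    sender = str(msg.get("From", "")).lower()
    if "mailer-daemon" in sender or "postmaster" in sender:
        return True

    subject = str(msg.get("Subject", "")).lower()
    return any(hint in subject for hint in SUBJECT_HINTS)
```

Anything this returns False for is either ordinary mail or one of the odd bounces discussed above.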