The Ruby gem rmail has methods to parse a mailbox file on local disk. Unfortunately, this gem is broken in Ruby 2.0.0. It might not get fixed, because people are migrating to the gem mail.
The mail gem has the method Mail.read('filename.txt'), but that parses only the first message in a mailbox.
That gem, and the built-in Net::IMAP, have flooded the net with tutorials on accessing mailboxes through IMAP.
So, is there still a way to parse a plain old file, without IMAP?
As the lone Rubyist in my group, I'd rather not embarrass myself by resorting to http://docs.python.org/2/library/mailbox.html.
Or, worse yet, PHP's imap_open('/var/mail/www-data', ...) -- if only Net::IMAP.new accepted filenames like that.
The good news is that the mbox format is really dead simple, though its simplicity is why it was eventually replaced. Parsing a large mailbox file to extract a single message is not especially efficient, because the whole file has to be scanned.
If you can split apart the mailbox file into separate strings, you can pass these strings to the Mail library for parsing.
An example starting point:
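require 'mail'

# Parse one raw message string with the mail gem.
def parse_message(message)
  mail = Mail.new(message)
  do_other_stuff!(mail) # placeholder: replace with your own processing
end

message = nil
while (line = STDIN.gets)
  if line.start_with?('From ')
    # A "From " separator line ends the previous message, if there was one.
    parse_message(message) if message
    message = +''
  elsif message
    # Undo the mbox quoting: ">From " in a body means a literal "From ".
    message << line.sub(/\A>From /, 'From ')
  end
end
# The last message has no separator after it, so flush it here.
parse_message(message) if message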
The key is that each message starts with "From ", where the trailing space is what matters. Header fields are written as From: (with a colon), and any line that starts with ">From" is to be treated as actually being "From". It's things like this that make this encoding method really inadequate, but if Maildir isn't an option, this is what you've got to do.
You can use tmail to parse mailbox files. It was replaced by mail, but I can't find a class in mail that takes over this job, so you might want to stick with tmail.
EDIT: as @tadman pointed out, tmail is not expected to work with Ruby 1.9. However, you can port this class (and put it on GitHub for everyone else to use :-) )
The mbox format is about as simple as you can get. It's simply the concatenation of all the messages, separated by a blank line. The first line of each message starts with the five characters "From " (including the trailing space); when messages are added to the file, any line in a message which starts with "From " has a > prefixed, so you can reliably use the fact that a line starts with "From " as an indicator that it is the start of a message.
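The rules above can be sketched as a small splitter that uses only the standard library. This is a minimal sketch, not a complete mbox reader: split_mbox is a hypothetical helper name, and it assumes LF line endings and the plain ">From " quoting described here (not the Content-Length variant).

```ruby
# Split the text of an mbox file into an array of raw message strings.
# Assumption: messages are delimited by lines starting with "From " and
# body lines that began with "From " were quoted as ">From " on write.
def split_mbox(text)
  messages = []
  current = nil
  text.each_line do |line|
    if line.start_with?('From ')
      # The separator line itself is not part of the message.
      messages << current if current
      current = +''
    elsif current
      # Undo the quoting so the body reads as it was originally sent.
      current << line.sub(/\A>From /, 'From ')
    end
  end
  messages << current if current
  # Drop the blank line that separates one message from the next.
  messages.map { |m| m.sub(/\n\n\z/, "\n") }
end
```

Each returned string can then be handed to whatever parser you prefer, for example Mail.new from the mail gem.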
Of course, since this is an old format and it was never standardized, there are a number of variants. One variant uses the Content-Length header to determine the length of a message, and some implementations of this variant fail to insert the '>'. However, I think this is rare in practice.
A big problem with mbox format is that the file needs to be modified in place by mail agents; consequently, every implementation has some locking procedure. Of course, there is no standardization there, so you need to watch out for other processes modifying the mailbox while you are reading it. In practice, many mail systems solved this problem by using maildir format instead, in which a mailbox is actually a directory and every message is a single file.
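For comparison, reading a Maildir needs no splitting at all. The following is a rough sketch under a few assumptions: each_maildir_message is a hypothetical helper name, the directory follows the usual Maildir layout with delivered messages in the new and cur subdirectories, and files are returned in filename order, which is not necessarily delivery order.

```ruby
# Yield the raw text of every message in a Maildir, one file per message.
# Assumption: delivered messages live in the "new" and "cur" subdirectories;
# "tmp" holds deliveries still in progress and is deliberately skipped.
def each_maildir_message(maildir)
  return enum_for(:each_maildir_message, maildir) unless block_given?

  %w[new cur].each do |subdir|
    Dir.glob(File.join(maildir, subdir, '*')).sort.each do |path|
      next unless File.file?(path)
      yield File.binread(path)
    end
  end
end
```

Because every message is its own file, there is no "From " separator to look for and no >From quoting to undo.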
Other things you might want to do include MIME decoding, but you should be able to find utilities which do that.
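As a rough illustration of what such a utility does for one part of MIME, the sketch below reverses the two common Content-Transfer-Encoding values with Ruby's built-in String#unpack1. decode_body is a hypothetical helper name, and this covers only transfer encodings, not multipart structure, character sets, or encoded headers; a full MIME library handles those for you.

```ruby
# Decode a message body according to its Content-Transfer-Encoding value.
# Assumption: the caller has already extracted the body and the header value.
def decode_body(body, transfer_encoding)
  case transfer_encoding.to_s.strip.downcase
  when 'base64'           then body.unpack1('m') # base64
  when 'quoted-printable' then body.unpack1('M') # quoted-printable
  else body                                      # 7bit, 8bit, binary: as-is
  end
end
```

The decoded string comes back as raw bytes, so you may still need force_encoding with the charset named in the Content-Type header.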