Reading mailboxes effectively

Gustaf · Nov 3, 2005

I'm converting a program I once wrote in Python to C#, and while I'm at
it, I want to make some performance improvements. The program takes a set
of mailbox files (in readable formats, like Eudora and Thunderbird) and
extracts every message that is to/from a particular penpal (identified
by one or more email addresses). Messages in a mailbox file are
typically separated by lines such as these:

From ???@??? Fri Feb 06 00:26:08 2004
From - Mon Feb 21 22:33:59 2005

I need to make this really fast and efficient. Most mailboxes are about
5 MB, but some are 50 MB or more, and I may need to process up to 200
files together. The overall flow would be something like this in
pseudo-code:

for each mailbox
{
    while (more messages)
    {
        message = get next message with a matching email
    }
}

It's this "get next message with a matching email" method I'm not sure
how to construct, but I thought that would be the way to do it, since
creating objects of ALL messages in a mailbox would be too heavy.

I've never worked much with the FileStream class, or with buffers and the
Seek() method, but I guess that is what I need here. Ideally, I'd like
to avoid keeping these large files in memory.

Please advise me on which methods and strategy to use to get this to run
as fast as possible.

Many thanks in advance,

Gustaf

Nicholas Paldino [.NET/C# MVP] · Nov 3, 2005

Gustaf,

I think you are on the right track. You would want to use a FileStream
in this case (maybe even a StreamReader), and process the file in chunks.
It looks like the lines in the file are terminated with CRLF (or at least
LF). If this is the case, you could read the file line by line, which uses
far less memory than reading it all at once.

You can then parse each line to see if it has the information you
desire, and then properly dispose of the FileStream once you have made
your determination.
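
To make the line-by-line idea concrete, here is a minimal sketch of a "get next message with a matching email" routine. It is only one way to do it, and it rests on assumptions that are not from the thread: the names MailboxScanner, MatchingMessages and MatchingMessagesInFile are made up for illustration; a message is assumed to start at any line beginning with "From " (with the trailing space), as in the two sample separators; and a message counts as a match if any of its header lines (everything before the first blank line) contains one of the addresses. That last rule is deliberately crude, since it also matches headers other than To/From.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper for illustration; not from the original posts.
public static class MailboxScanner
{
    // Assumption: every message begins with a line starting with "From ".
    // A header such as "From: someone" does not qualify (colon, no space).
    private static bool IsSeparator(string line)
    {
        return line.StartsWith("From ", StringComparison.Ordinal);
    }

    // Case-insensitive substring test against each address.
    private static bool ContainsAny(string line, IEnumerable<string> addresses)
    {
        foreach (string address in addresses)
        {
            if (line.IndexOf(address, StringComparison.OrdinalIgnoreCase) >= 0)
                return true;
        }
        return false;
    }

    // Lazily yields each message whose header lines mention an address.
    // Only the message currently being read is held in memory.
    public static IEnumerable<string> MatchingMessages(
        TextReader reader, IEnumerable<string> addresses)
    {
        StringBuilder message = new StringBuilder();
        bool matches = false;
        bool inHeaders = false;
        string line;

        while ((line = reader.ReadLine()) != null)
        {
            if (IsSeparator(line))
            {
                // A new message starts: hand back the previous one if it matched.
                if (matches) yield return message.ToString();
                message.Length = 0;
                matches = false;
                inHeaders = true;
            }
            else if (line.Length == 0)
            {
                // The first blank line ends the header block.
                inHeaders = false;
            }

            message.Append(line).Append('\n');
            if (inHeaders && !matches) matches = ContainsAny(line, addresses);
        }

        // Do not forget the last message in the file.
        if (matches) yield return message.ToString();
    }

    // Opens one mailbox file; the using block disposes the reader (and the
    // underlying file) when enumeration finishes or is abandoned.
    public static IEnumerable<string> MatchingMessagesInFile(
        string path, IEnumerable<string> addresses)
    {
        using (StreamReader reader = new StreamReader(path))
        {
            foreach (string message in MatchingMessages(reader, addresses))
                yield return message;
        }
    }
}
```

The outer "for each mailbox" loop then becomes a foreach over the file names with an inner foreach over MatchingMessagesInFile(path, addresses). Two caveats: a body line that happens to start with "From " would be mistaken for a separator unless the mail client escaped it, and StreamReader defaults to UTF-8, so mailboxes in another encoding would need the encoding passed to the constructor.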

Hope this helps.
