How do we best parse incoming email messages?

  • Thread starter: Li-fan Chen

Li-fan Chen

Hi,

We find ourselves in the unenviable position of creating an email
reader. May I ask how we should best parse incoming messages? Ideally we
would point the parser at an email stored on a POP3 server, grab the
email body's byte stream, and get back an array of AttachmentFile
collections (filename, size, MIME type), as well as an HTMLBody and a
TextBody*.

Both HTMLBody and TextBody would be filled if it's a multipart/alternative.

Only HTMLBody would be filled if it's HTML only.

Only TextBody would be filled if it's a text/plain message.

Of course it's never this simple, with text encodings to deal with.
Ideally everything comes back as the Unicode String objects .NET
programmers are familiar with.
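
As a rough illustration of the result shape described above, here is a minimal sketch in Python (not .NET) using the standard library's `email` package. The names `AttachmentFile`, `ParsedMessage`, and `parse_message` are hypothetical, chosen to mirror the post. It assumes the raw message bytes have already been fetched from the POP3 server, and real-world mail will need more defensive handling (unknown charsets, inline images, nested messages).

```python
from dataclasses import dataclass, field
from email import message_from_bytes, policy
from typing import List, Optional


@dataclass
class AttachmentFile:  # hypothetical name, mirrors the post
    filename: str
    size: int          # size of the decoded payload in bytes
    mime_type: str


@dataclass
class ParsedMessage:  # hypothetical container for the parse result
    text_body: Optional[str] = None
    html_body: Optional[str] = None
    attachments: List[AttachmentFile] = field(default_factory=list)


def parse_message(raw: bytes) -> ParsedMessage:
    """Split one raw RFC 822/MIME message into bodies and attachments."""
    # policy.default yields EmailMessage objects whose get_content()
    # decodes text parts to str using the part's declared charset.
    msg = message_from_bytes(raw, policy=policy.default)
    result = ParsedMessage()
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no content of their own
        ctype = part.get_content_type()
        if part.get_content_disposition() == "attachment" or part.get_filename():
            payload = part.get_payload(decode=True) or b""
            result.attachments.append(
                AttachmentFile(part.get_filename() or "", len(payload), ctype)
            )
        elif ctype == "text/plain" and result.text_body is None:
            result.text_body = part.get_content()
        elif ctype == "text/html" and result.html_body is None:
            result.html_body = part.get_content()
    return result
```

With this sketch, a multipart/alternative message fills both bodies, an HTML-only message leaves `text_body` as `None`, and a text/plain message leaves `html_body` as `None`.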

We are wondering whether we should be studying webmail solutions in the
PHP space to see how they parse it, or whether you know of a truly
commercial COM/.NET component.

We are working with an email deployment partner at the moment and we are
having a lot of trouble parsing emails. Any help would be greatly
appreciated.

Also, bounces and error messages: what's a good component that will catch
all signatures (for most email servers)? It's a whole different (perhaps
bigger) can of worms.
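
For bounces that follow the delivery status notification format (RFC 3464, a `multipart/report` message with a `message/delivery-status` part), the per-recipient fields can be read without matching server-specific text. The following is only a sketch in Python; `extract_dsn_status` is a hypothetical name, and it assumes a standards-compliant bounce. Bounces that do not use this format would still need signature matching.

```python
from email import message_from_bytes, policy
from typing import Dict, List


def extract_dsn_status(raw: bytes) -> List[Dict[str, str]]:
    """Return per-recipient fields from an RFC 3464 bounce, else []."""
    msg = message_from_bytes(raw, policy=policy.default)
    if msg.get_content_type() != "multipart/report":
        return []
    if str(msg.get_param("report-type", "")).lower() != "delivery-status":
        return []
    results = []
    for part in msg.walk():
        if part.get_content_type() != "message/delivery-status":
            continue
        payload = part.get_payload()
        if not isinstance(payload, list):
            continue  # parser could not split the header blocks
        # The parser exposes each header block as a nested message;
        # per-recipient blocks are the ones with Final-Recipient.
        for block in payload:
            recipient = block.get("Final-Recipient")
            if recipient is None:
                continue
            results.append({
                "recipient": str(recipient).strip(),
                "action": str(block.get("Action", "")).strip(),
                "status": str(block.get("Status", "")).strip(),
            })
    return results
```

A `Status` value beginning with 5 indicates a permanent failure and one beginning with 4 a temporary one, which is often enough to decide whether to retry or drop an address.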

Any advice would be greatly appreciated, thank you ahead of time!

Best regards,
-- Li-fan Chen
 
Sounds to me like you need a POP3 email message MIME parser. I've seen
several open-source implementations of POP3 "applications" out there; all you
need to do is tune up those Google or MSN search key phrases.
Peter
 
Li-fan Chen said:
We find ourselves in the unenviable position of creating an email
reader. May I ask how we should best parse incoming messages? Ideally we
would point the parser at an email stored on a POP3 server, grab the
email body's byte stream, and get back an array of AttachmentFile
collections (filename, size, MIME type), as well as an HTMLBody and a
TextBody*.

I hate POP3! It seems bad to alienate your non-POP3 audience.

I wrote an email archiver and reader. To get the emails, (1) I scanned all
available MAPI messages. This covers all the messages in Outlook
that are in a local PST, all the ones that are available through
Exchange, and all the ones that are available offline. I told the user
about the ones that weren't available offline. (2) I scanned through
available Outlook Express messages. This covers all the ones that were
available offline. (3) I allowed power users to download email folders
in Berkeley mail format, the Unix standard (basically just concatenated
RFC 822 or whatever messages).
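
To illustrate option (3): a file of concatenated messages in this Unix format (commonly called mbox) can be walked with very little code. This is a sketch in Python rather than the C++ the archiver was written in; `iter_mbox_headers` is a hypothetical helper, and it assumes each message in the file starts with the usual "From " separator line.

```python
import mailbox
from typing import Iterator, Tuple


def iter_mbox_headers(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (From, Subject) for each message in an mbox file."""
    # create=False makes a missing file an error instead of
    # silently creating an empty mailbox.
    box = mailbox.mbox(path, create=False)
    try:
        for msg in box:
            yield msg.get("From", ""), msg.get("Subject", "")
    finally:
        box.close()
```

Each yielded message is an ordinary parsed email object, so the same MIME handling used for POP3 downloads should apply to it unchanged.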

I read up on the RFC specs about multipart/alternative &c.
Unfortunately MAPI doesn't expose the MIME structure of its emails.
Unfortunately it wraps HTML up inside an RTF. Fortunately I was able
to read the RTF and extract out the original HTML.

Some of my code is available online. It's in C++ only, but it may give
you ideas. (C++ let me write very efficient finite-state machines to
parse the email structure, letting me chew through several megabytes
of email per second.)

http://www.wischik.com/lu/programmer/mapi_utils.html
http://www.wischik.com/lu/programmer/dbx_utils.html
 