Regex help with large strings

Mark

Hi,

I've seen some postings on this but not exactly relating to this
posting. I'm reading in a large mail message as a string. In the
string is an xml attachment that I need to parse out and remove from
the message once processed. I have to do this as a string and not
using any CDO libraries. My problem is that there's normally a large
pdf in the file so when I read the file in it's massive and I don't
know if the XML is at the start, middle or end of the string. My regex
is as follows:

Regex rXMLPart = new Regex(
@"(?<Start>.*)(?<Middle>Content-Type:[^.*?]text\/xml.*?finaldistributeinformation.*?\<\/distributionList\>)(?<End>.*)",
RegexOptions.IgnoreCase |
RegexOptions.Singleline |
RegexOptions.IgnorePatternWhitespace);

and a sample of the string is:
-----------------------
Message-ID: <[email protected]>
From: "Test" <[email protected]>
To: <>
Subject: This is a test subject
Date: Thu, 2 Sep 2004 16:58:12 +0100
MIME-Version: 1.0
Content-Type: multipart/mixed;
boundary="----=_NextPart_000_0005_01C4910E.083D9600"
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.2800.1409
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1409

This is a multi-part message in MIME format.

------=_NextPart_000_0005_01C4910E.083D9600
Content-Type: multipart/alternative;
boundary="----=_NextPart_001_0006_01C4910E.083D9600"


------=_NextPart_001_0006_01C4910E.083D9600
Content-Type: text/plain;
charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

This is some body text.


-mark.
------=_NextPart_001_0006_01C4910E.083D9600
Content-Type: text/html;
charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=3DContent-Type content=3D"text/html; =
charset=3Diso-8859-1">
<META content=3D"MSHTML 6.00.2800.1458" name=3DGENERATOR>
<STYLE></STYLE>
</HEAD>
<BODY bgColor=3D#ffffff>
<DIV><FONT face=3DArial size=3D2>This is some body text.</FONT></DIV>
<DIV><FONT face=3DArial size=3D2></FONT>&nbsp;</DIV>
<DIV><FONT face=3DArial size=3D2>***</FONT></DIV></BODY></HTML>

------=_NextPart_000_0005_01C4910E.083D9600
Content-Type: text/xml;
name="DO_NOT_DELETE_EMAIL_ATTACHMENT.XML"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
filename="DO_NOT_DELETE_EMAIL_ATTACHMENT.XML"

<?xml version="1.0" encoding="UTF-8"?>
<distributionList>
</distributionList>


------=_NextPart_000_0005_01C4910E.083D9600
Content-Type: application/pdf;
name="Reader.pdf"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
filename="Reader.pdf"

JVBERi0xLjUNJeLjz9MNCjkxOTUgMCBvYmo8PC9IWzQzMzk2IDM5MzJdL0xpbmVhcml6ZWQgMS9F
IDEyMjMzNi9MIDE1NTUzMDcvTiAxNzkvTyA5MTk5L1QgMTM3MTM2Mz4+DWVuZG9iag0gICAgICAg
IA14cmVmDTkxOTUgMzYNMDAwMDAwMDAxNiAwMDAwMCBuDQowMDAwMDQ3Njc5IDAwMDAwIG4NCjAw
MDAwNDMzOTYgMDAwMDAgbg0KMDAwMDA0NzkzNSAwMDAwMCBuDQowMDAwMDQ3OTk5IDAwMDAwIG4N
CjAwMDAwNDgyNzYgMDAwMDAgbg0KMDAwMDA0ODMyNyAwMDAwMCBuDQowMDAwMDQ4NjMwIDAwMDAw
IG4NCjAwMDAwNTM0ODAgMDAwMDAgbg0KMDAwMDA1MzUxNiAwMDAwMCBuDQowMDAwMDUzOTUyIDAw
.......
------------------------

I've cut the string short but that is the gist of it. If I run the regex
against this sample string it all works fine, but when the string is really
large (with the rest of the pdf in) the match hangs:

Match mXMLPersonalisation = rXMLPart.Match(data);

Could anyone suggest a better way to do this? I need to get the first
part and the last part and join them, thus removing the XML part.
I also need to work on the XML to create the new messages.

i.e.

string sStartPartOfEmailMessage =
mXMLPersonalisation.Groups["Start"].ToString();
string sXMLPartOfMessage =
mXMLPersonalisation.Groups["Middle"].ToString();
string sEndPartOfEmailMessage =
mXMLPersonalisation.Groups["End"].ToString();

SendXMLEmail(sStartPartOfEmailMessage, sXMLPartOfMessage,
sEndPartOfEmailMessage);

Any help would be much appreciated.

-mark.
 
Niki Estner

Mark said:
Regex rXMLPart = new Regex(
@"(?<Start>.*)(?<Middle>Content-Type:[^.*?]text\/xml.*?finaldistributeinformation.*?\<\/distributionList\>)(?<End>.*)",
RegexOptions.IgnoreCase |
RegexOptions.Singleline |
RegexOptions.IgnorePatternWhitespace);

I haven't done any performance tests with that regex, but I'm quite sure it
will take years if it can *not* find a match on a long string. Here are a
few suggestions:

- Add start/end anchors like these:
@"^(?<Start>.*)(?<Middle>Content-Type:[^.*?]text\/xml.*?finaldistributeinformation.*?\<\/distributionList\>)(?<End>.*)$"
That way the .* expression at the beginning doesn't have to be retried from
every starting point in the string.
- Couldn't you use Regex.Replace on a pattern like this:
@"(?<Middle>Content-Type:[^.*?]text\/xml.*?finaldistributeinformation.*?\<\/distributionList\>)"
The way regexes work, this should be a lot faster. If you need complex
processing on the string that can't be done with capturing parentheses, you
could use a MatchEvaluator.
- Finally:
@"Content-Type:[^.*?]text\/xml"
Are you sure about this character class? As written, it matches exactly one
character that is not a ".", "*" or "?". I'd have expected something like
"\s*" instead of "[^.*?]".
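
A minimal sketch of the second suggestion, assuming a reasonably recent C# compiler: match only the XML part, then cut the start and end out of the original string with Match.Index and Match.Length, so the pattern needs no leading or trailing .* group. The names XmlPartSplitter, TrySplit and RemoveXmlPart are made up for illustration. The pattern uses \s* in place of [^.*?], which assumes only whitespace sits between "Content-Type:" and "text/xml", and it drops the backslashes before / and <, which .NET does not require.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original code.
static class XmlPartSplitter
{
    // Matches only the XML MIME part. Singleline lets '.' cross line breaks.
    static readonly Regex XmlPart = new Regex(
        @"Content-Type:\s*text/xml.*?finaldistributeinformation.*?</distributionList>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Splits data into the text before the XML part, the XML part itself,
    // and the text after it. Returns false if no XML part is found.
    public static bool TrySplit(string data,
                                out string start, out string middle, out string end)
    {
        start = middle = end = null;

        Match m = XmlPart.Match(data);
        if (!m.Success)
            return false;

        start = data.Substring(0, m.Index);          // everything before the match
        middle = m.Value;                            // the matched XML part
        end = data.Substring(m.Index + m.Length);    // everything after the match
        return true;
    }

    // Alternative using Regex.Replace with a MatchEvaluator: removes the first
    // XML part and passes its text to the caller-supplied callback.
    public static string RemoveXmlPart(string data, Action<string> onXml)
    {
        return XmlPart.Replace(
            data,
            m => { onXml(m.Value); return string.Empty; },
            1); // replace at most one match
    }
}
```

The three out values from TrySplit could then be passed to SendXMLEmail in place of the Start, Middle and End groups.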

Niki