Regex woes

Karch · Mar 18, 2008

I have these two methods that are chewing up a ton of CPU time in my
application. Does anyone have any suggestions on how to optimize them or
rewrite them without Regex? The most time-consuming operation by a long-shot
is the regex.Replace. Basically the only purpose of it is to remove spaces
between opening/closing tags and the element name. Surely there is a better
way.

private string FixupJavascript(string htmlCode)
{
string result = FixupHTML(htmlCode);
result = result.Replace("\\", "\\\\");
result = result.Replace("\"", "\\\"");
Regex regex = new Regex("<\\s*script", RegexOptions.IgnoreCase);
result = regex.Replace(result, "<scr\" + \"ipt");
regex = new Regex("<\\s*\\/script>", RegexOptions.IgnoreCase);
result = regex.Replace(result, "</scr\" + \"ipt>");
result = result.Replace(Environment.NewLine,
string.Empty).Replace("\t", string.Empty).Replace("\n",
string.Empty).Trim();

return result.Trim();
}

private string FixupHTML(string htmlCode)
{
string result = htmlCode;
Regex regex = new Regex("<\\s*\\/*html[^>]*>",
RegexOptions.IgnoreCase);
result = regex.Replace(result, string.Empty);
regex = new Regex("<\\s*head.*>.*<\\s*\\/head[^>]*>",
RegexOptions.IgnoreCase);
result = regex.Replace(result, string.Empty);
regex = new Regex("<\\s*\\/*body[^>]*>",
RegexOptions.IgnoreCase);
result = regex.Replace(result, string.Empty);
regex = new Regex("[^:]\\/\\/.*", RegexOptions.IgnoreCase);
result = regex.Replace(result, string.Empty);
return result;
}

Jesse Houwing · Mar 19, 2008

Hello Karch" news.microsoft.com,

I have these two methods that are chewing up a ton of CPU time in my
application. Does anyone have any suggestions on how to optimize them
or rewrite them without Regex? The most time-consuming operation by a
long-shot is the regex.Replace. Basically the only purpose of it is to
remove spaces between opening/closing tags and the element name.
Surely there is a better way.

Let's try to find out what exactly is going on here first:

body and html are removed
head tag and contents are removed
//.* is removed except when there's a : in front of it (looks like an atempt
to remove singleline comments)
\ is repaced with \\
" is replaced with \"
script tag is cleaned
then script is rewritten as scr"+"ipt
Environment.Newline and tab and \n are replaced with nothing
The whole string is trimmed twice

I might have missed something, but the code in question is pretty hard to
read.

While doing this a lot of Regex Objects are created (and then left to die)
and a lot of new strings are allocated.

It shouldn't be too hard to clean this up a bit... Especially remove some
of the duplicate parts.

first trick to use when using a regex over and over again is to compile it
once and store it outside of your method. The easiest way is to save them
to a static readonly Regex instance like this:

private static readonly Regex scriptRewrite = new Regex("<script>", RegexOptions.Compiled
| RegexOptions.IgnoreCase);

That way it is created, parsed and compiled only once. You can then use the
optimised form.

Secondly, a lot of regexes do the same thing, for a slightly different tag.
You could parobably combine this to a single regex to improve performance

for example <\s*/?hhtml and \s*/?body and so forth can be combined. Also
try using the verbatim string notation to improve readability like this:

private static readonly Regex removeOuterTags= new Regex(@"<[\s/]*(?:body|html)[^>]*>",
RegexOptions.Compiled | RegexOptions.IgnoreCase);

<script> and </script> are replaced seperately, while this can be done in
one pass without trouble:

private static readonly Regex splitScriptTags= new Regex(@"<[\s]*(/)?\s*script([^>]*)>",
RegexOptions.Compiled | RegexOptions.IgnoreCase);
splitScriptTags.Replace(input, "<$1scr\"+\"ipt$2>");

using 3 chained replace calls to remove tabs, newlines and Environment.Newlines
(\r\n usually) can be rewritten as one Regex which might be faster... Or
you could use a stringbuilder and loop yourself. test different scenario's
here...

the regex would be very simple:
private static readonly Regex removeUnwantedWhitespace = new Regex(@"[\t\r\n]+",
RegexOptions.Compiled);
removeUnwantedWhitespace.Replace(input, "");

possibly combine this with the //.*, but I'm not sure it will be faster.
My guess is that two passes will be faster than one combined pass:

private static readonly Regex removeUnwantedWhitespaceAndComments = new Regex(@"([\t\r\n]+|(?<!

//.*$)",
RegexOptions.Compiled);
removeUnwantedWhitespaceAndComments.Replace(input, "");

At least fix the regex to remove comments, as it currently removes one character
too much (the character in front of the // which is probably usually a space,
but could just as well be a ;.
private static readonly Regex removeUnwantedComments = new Regex(@"(?<!

.*$",
RegexOptions.Compiled);
removeUnwantedComments.Replace(input, "");

Result is trimmed twice in FixupJavascript, so that's an easy save (Wouldn't
be that costly probably, but still).

You might also want to investigate if instead of removing the head (+content),
html and body tag, it isn't less expensive to try to retain the contents
of the body tag instead. That would remove a large bit of content to filter
as well, so do this early in the cleanup process:

private static readonly Regex retainBody = new Regex(@"<\s*body[^>]*>(.*)<[\s/]*body",
RegexOptions.Compiled | RegexOptions.IgnoreCase);
retainBody.Replace(input, "$1");

You're already using string.Replace when it's optimal to do so. So that's
good.

Some alternative solutions you could try...

Appart from all these possible optimizations, my primary question to you
is... why? What got you into the position to need to do this. Is there an
easier way. Can you generate the HTML differently in the first place? Can
you optimise it once and store the result instead of the original? Can you
prevent having to Fixup the HTML at all? Please elaborate more on the original
issue which resulted in the creation of this code... Also try to see if there's
an existing framework function (possibly in the Ajax framework) which already
does most of this stuff. It looks like it could be a standard format as Javascript
string function... It might help to attach some examples of your input and
the required output. That way we can test our assumptions.

Kind regards,

Jesse Houwing

private string FixupJavascript(string htmlCode)
{
string result = FixupHTML(htmlCode);
result = result.Replace("\\", "\\\\");
result = result.Replace("\"", "\\\"");
Regex regex = new Regex("<\\s*script",
RegexOptions.IgnoreCase);
result = regex.Replace(result, "<scr\" + \"ipt");
regex = new Regex("<\\s*\\/script>",
RegexOptions.IgnoreCase);
result = regex.Replace(result, "</scr\" + \"ipt>");
result = result.Replace(Environment.NewLine,
string.Empty).Replace("\t", string.Empty).Replace("\n",
string.Empty).Trim();

return result.Trim();
}
private string FixupHTML(string htmlCode)
{
string result = htmlCode;
Regex regex = new Regex("<\\s*\\/*html[^>]*>",
RegexOptions.IgnoreCase);
result = regex.Replace(result, string.Empty);
regex = new Regex("<\\s*head.*>.*<\\s*\\/head[^>]*>",
RegexOptions.IgnoreCase);
result = regex.Replace(result, string.Empty);
regex = new Regex("<\\s*\\/*body[^>]*>",
RegexOptions.IgnoreCase);
result = regex.Replace(result, string.Empty);
regex = new Regex("[^:]\\/\\/.*",
RegexOptions.IgnoreCase);
result = regex.Replace(result, string.Empty);
return result;
}

Jesse Houwing · Mar 19, 2008

Hello Jesse,

Some alternative solutions you could try...

I was going to type something here, but somehow I forgot

- Try the HTML Agility Pack (codeplex.com)
- Try an SGML parser
- Use a Xslt instead (but my guess is your html isn't standard xml enough
yet)
- Use a stringbuilder instead of different strings and remove the characters
by parsing the buffer contents

Karch · Mar 19, 2008

Thanks - I implemented most of your suggestions and was able to trim about
60% of the execution time. However, I have one question about your examples:

You say: "You might also want to investigate if instead of removing the
head (+content), html and body tag, it isn't less expensive to try to retain
the contents of the body tag instead. That would remove a large bit of
content to filter as well, so do this early in the cleanup process:" This
doesn't seem to work quite as expected. I thought it would remove everything
except the contents of the body, but it actually leaves the outer tag - at
least in one case, where I do not have a head tag, the html tag is left. Did
I misunderstand you? In my case the body could be quite large, so I expect
you are correct that it would be better to use this method. How to also
remove the outer tags in this one regex?

private static readonly Regex retainBody = new
Regex(@"<\s*body[^>]*>(.*)<[\s/]*body", RegexOptions.Compiled |
RegexOptions.IgnoreCase);
retainBody.Replace(input, "$1");

To answer your question, the reason for this mess of code in the first place
is to take customer entered code and convert it. I really have no control
over the generation. The original code has been in place for a while, but
like most people would, I thought there had to be a better way. Those
frameworks sounds interesting, but, in this case, I would rather not
introduce any other libraries just to fix these two functions - this is in a
high perf situation and I would rather just optimize what is there.

Jesse Houwing said:
Hello Karch" news.microsoft.com,

I have these two methods that are chewing up a ton of CPU time in my
application. Does anyone have any suggestions on how to optimize them
or rewrite them without Regex? The most time-consuming operation by a
long-shot is the regex.Replace. Basically the only purpose of it is to
remove spaces between opening/closing tags and the element name.
Surely there is a better way.

Click to expand...

Let's try to find out what exactly is going on here first:

body and html are removed
head tag and contents are removed
//.* is removed except when there's a : in front of it (looks like an
atempt to remove singleline comments)
\ is repaced with \\
" is replaced with \"
script tag is cleaned
then script is rewritten as scr"+"ipt
Environment.Newline and tab and \n are replaced with nothing
The whole string is trimmed twice

I might have missed something, but the code in question is pretty hard to
read.

While doing this a lot of Regex Objects are created (and then left to die)
and a lot of new strings are allocated.

It shouldn't be too hard to clean this up a bit... Especially remove some
of the duplicate parts.

first trick to use when using a regex over and over again is to compile it
once and store it outside of your method. The easiest way is to save them
to a static readonly Regex instance like this:

private static readonly Regex scriptRewrite = new Regex("<script>",
RegexOptions.Compiled | RegexOptions.IgnoreCase);

That way it is created, parsed and compiled only once. You can then use
the optimised form.

Secondly, a lot of regexes do the same thing, for a slightly different
tag. You could parobably combine this to a single regex to improve
performance

for example <\s*/?hhtml and \s*/?body and so forth can be combined. Also
try using the verbatim string notation to improve readability like this:

private static readonly Regex removeOuterTags= new
Regex(@"<[\s/]*(?:body|html)[^>]*>", RegexOptions.Compiled |
RegexOptions.IgnoreCase);

<script> and </script> are replaced seperately, while this can be done in
one pass without trouble:

private static readonly Regex splitScriptTags= new
Regex(@"<[\s]*(/)?\s*script([^>]*)>", RegexOptions.Compiled |
RegexOptions.IgnoreCase);
splitScriptTags.Replace(input, "<$1scr\"+\"ipt$2>");

using 3 chained replace calls to remove tabs, newlines and
Environment.Newlines (\r\n usually) can be rewritten as one Regex which
might be faster... Or you could use a stringbuilder and loop yourself.
test different scenario's here...

the regex would be very simple:
private static readonly Regex removeUnwantedWhitespace = new
Regex(@"[\t\r\n]+", RegexOptions.Compiled);
removeUnwantedWhitespace.Replace(input, "");

possibly combine this with the //.*, but I'm not sure it will be faster.
My guess is that two passes will be faster than one combined pass:

private static readonly Regex removeUnwantedWhitespaceAndComments = new
Regex(@"([\t\r\n]+|(?<!//.*$)", RegexOptions.Compiled);
removeUnwantedWhitespaceAndComments.Replace(input, "");

At least fix the regex to remove comments, as it currently removes one
character too much (the character in front of the // which is probably
usually a space, but could just as well be a ;.
private static readonly Regex removeUnwantedComments = new
Regex(@"(?<!.*$", RegexOptions.Compiled);
removeUnwantedComments.Replace(input, "");

Result is trimmed twice in FixupJavascript, so that's an easy save
(Wouldn't be that costly probably, but still).

You might also want to investigate if instead of removing the head
(+content), html and body tag, it isn't less expensive to try to retain
the contents of the body tag instead. That would remove a large bit of
content to filter as well, so do this early in the cleanup process:

private static readonly Regex retainBody = new
Regex(@"<\s*body[^>]*>(.*)<[\s/]*body", RegexOptions.Compiled |
RegexOptions.IgnoreCase);
retainBody.Replace(input, "$1");

You're already using string.Replace when it's optimal to do so. So that's
good.

Some alternative solutions you could try...
Appart from all these possible optimizations, my primary question to you
is... why? What got you into the position to need to do this. Is there an
easier way. Can you generate the HTML differently in the first place? Can
you optimise it once and store the result instead of the original? Can you
prevent having to Fixup the HTML at all? Please elaborate more on the
original issue which resulted in the creation of this code... Also try to
see if there's an existing framework function (possibly in the Ajax
framework) which already does most of this stuff. It looks like it could
be a standard format as Javascript string function... It might help to
attach some examples of your input and the required output. That way we
can test our assumptions.

Kind regards,

Jesse Houwing

private string FixupJavascript(string htmlCode)
{
string result = FixupHTML(htmlCode);
result = result.Replace("\\", "\\\\");
result = result.Replace("\"", "\\\"");
Regex regex = new Regex("<\\s*script",
RegexOptions.IgnoreCase);
result = regex.Replace(result, "<scr\" + \"ipt");
regex = new Regex("<\\s*\\/script>",
RegexOptions.IgnoreCase);
result = regex.Replace(result, "</scr\" + \"ipt>");
result = result.Replace(Environment.NewLine,
string.Empty).Replace("\t", string.Empty).Replace("\n",
string.Empty).Trim();

return result.Trim();
}
private string FixupHTML(string htmlCode)
{
string result = htmlCode;
Regex regex = new Regex("<\\s*\\/*html[^>]*>",
RegexOptions.IgnoreCase);
result = regex.Replace(result, string.Empty);
regex = new Regex("<\\s*head.*>.*<\\s*\\/head[^>]*>",
RegexOptions.IgnoreCase);
result = regex.Replace(result, string.Empty);
regex = new Regex("<\\s*\\/*body[^>]*>",
RegexOptions.IgnoreCase);
result = regex.Replace(result, string.Empty);
regex = new Regex("[^:]\\/\\/.*",
RegexOptions.IgnoreCase);
result = regex.Replace(result, string.Empty);
return result;
}

Click to expand...

Karch · Mar 19, 2008

Thanks, I'll look into these, but I suspect better, smarter usage of the
Regex is going to be best for me. For (1) and (2), I hate to introduce
another library in this case; for (3), you are correct - not standard
enough; for (4) the strings can be to large; I think optimal regex will be
better and cleaner from a code standpoint if I can squeeze the performance
out of them.

Jesse Houwing · Mar 19, 2008

Hello Karch,

Ahh I seemed to have forgotten the SingleLine RegexOption here:

private static readonly Regex retainBody = new
Regex(@"<\s*body[^>]*>(.*)<[\s/]*body", RegexOptions.Compiled |
RegexOptions.IgnoreCase | RegexOptions.SingleLine);
retainBody.Replace(input, "$1");

That does it for me. That just grabs everythign from between the body tags.
It does require you to have a body tag though...

Jesse

Thanks - I implemented most of your suggestions and was able to trim
about 60% of the execution time. However, I have one question about
your examples:

You say: "You might also want to investigate if instead of removing
the head (+content), html and body tag, it isn't less expensive to try
to retain the contents of the body tag instead. That would remove a
large bit of content to filter as well, so do this early in the
cleanup process:" This doesn't seem to work quite as expected. I
thought it would remove everything except the contents of the body,
but it actually leaves the outer tag - at least in one case, where I
do not have a head tag, the html tag is left. Did I misunderstand you?
In my case the body could be quite large, so I expect you are correct
that it would be better to use this method. How to also remove the
outer tags in this one regex?

private static readonly Regex retainBody = new
Regex(@"<\s*body[^>]*>(.*)<[\s/]*body", RegexOptions.Compiled |
RegexOptions.IgnoreCase);
retainBody.Replace(input, "$1");
To answer your question, the reason for this mess of code in the first
place is to take customer entered code and convert it. I really have
no control over the generation. The original code has been in place
for a while, but like most people would, I thought there had to be a
better way. Those frameworks sounds interesting, but, in this case, I
would rather not introduce any other libraries just to fix these two
functions - this is in a high perf situation and I would rather just
optimize what is there.

Hello Karch" news.microsoft.com,

I have these two methods that are chewing up a ton of CPU time in my
application. Does anyone have any suggestions on how to optimize
them or rewrite them without Regex? The most time-consuming
operation by a long-shot is the regex.Replace. Basically the only
purpose of it is to remove spaces between opening/closing tags and
the element name. Surely there is a better way.

Click to expand...

Let's try to find out what exactly is going on here first:

body and html are removed
head tag and contents are removed
//.* is removed except when there's a : in front of it (looks like an
atempt to remove singleline comments)
\ is repaced with \\
" is replaced with \"
script tag is cleaned
then script is rewritten as scr"+"ipt
Environment.Newline and tab and \n are replaced with nothing
The whole string is trimmed twice
I might have missed something, but the code in question is pretty
hard to read.

While doing this a lot of Regex Objects are created (and then left to
die) and a lot of new strings are allocated.

It shouldn't be too hard to clean this up a bit... Especially remove
some of the duplicate parts.

first trick to use when using a regex over and over again is to
compile it once and store it outside of your method. The easiest way
is to save them to a static readonly Regex instance like this:

private static readonly Regex scriptRewrite = new Regex("<script>",
RegexOptions.Compiled | RegexOptions.IgnoreCase);

That way it is created, parsed and compiled only once. You can then
use the optimised form.

Secondly, a lot of regexes do the same thing, for a slightly
different tag. You could parobably combine this to a single regex to
improve performance

for example <\s*/?hhtml and \s*/?body and so forth can be combined.
Also try using the verbatim string notation to improve readability
like this:

private static readonly Regex removeOuterTags= new
Regex(@"<[\s/]*(?:body|html)[^>]*>", RegexOptions.Compiled |
RegexOptions.IgnoreCase);

<script> and </script> are replaced seperately, while this can be
done in one pass without trouble:

private static readonly Regex splitScriptTags= new
Regex(@"<[\s]*(/)?\s*script([^>]*)>", RegexOptions.Compiled |
RegexOptions.IgnoreCase);
splitScriptTags.Replace(input, "<$1scr\"+\"ipt$2>");
using 3 chained replace calls to remove tabs, newlines and
Environment.Newlines (\r\n usually) can be rewritten as one Regex
which might be faster... Or you could use a stringbuilder and loop
yourself. test different scenario's here...

the regex would be very simple:
private static readonly Regex removeUnwantedWhitespace = new
Regex(@"[\t\r\n]+", RegexOptions.Compiled);
removeUnwantedWhitespace.Replace(input, "");
possibly combine this with the //.*, but I'm not sure it will be
faster. My guess is that two passes will be faster than one combined
pass:

private static readonly Regex removeUnwantedWhitespaceAndComments =
new Regex(@"([\t\r\n]+|(?<!//.*$)", RegexOptions.Compiled);
removeUnwantedWhitespaceAndComments.Replace(input, "");

At least fix the regex to remove comments, as it currently removes
one
character too much (the character in front of the // which is
probably
usually a space, but could just as well be a ;.
private static readonly Regex removeUnwantedComments = new
Regex(@"(?<!.*$", RegexOptions.Compiled);
removeUnwantedComments.Replace(input, "");
Result is trimmed twice in FixupJavascript, so that's an easy save
(Wouldn't be that costly probably, but still).

You might also want to investigate if instead of removing the head
(+content), html and body tag, it isn't less expensive to try to
retain the contents of the body tag instead. That would remove a
large bit of content to filter as well, so do this early in the
cleanup process:

private static readonly Regex retainBody = new
Regex(@"<\s*body[^>]*>(.*)<[\s/]*body", RegexOptions.Compiled |
RegexOptions.IgnoreCase);
retainBody.Replace(input, "$1");
You're already using string.Replace when it's optimal to do so. So
that's good.

Some alternative solutions you could try...
Appart from all these possible optimizations, my primary question to
you
is... why? What got you into the position to need to do this. Is
there an
easier way. Can you generate the HTML differently in the first place?
Can
you optimise it once and store the result instead of the original?
Can you
prevent having to Fixup the HTML at all? Please elaborate more on the
original issue which resulted in the creation of this code... Also
try to
see if there's an existing framework function (possibly in the Ajax
framework) which already does most of this stuff. It looks like it
could
be a standard format as Javascript string function... It might help
to
attach some examples of your input and the required output. That way
we
can test our assumptions.
Kind regards,

Jesse Houwing

private string FixupJavascript(string htmlCode)
{
string result = FixupHTML(htmlCode);
result = result.Replace("\\", "\\\\");
result = result.Replace("\"", "\\\"");
Regex regex = new Regex("<\\s*script",
RegexOptions.IgnoreCase);
result = regex.Replace(result, "<scr\" + \"ipt");
regex = new Regex("<\\s*\\/script>",
RegexOptions.IgnoreCase);
result = regex.Replace(result, "</scr\" + \"ipt>");
result = result.Replace(Environment.NewLine,
string.Empty).Replace("\t", string.Empty).Replace("\n",
string.Empty).Trim();
return result.Trim();
}
private string FixupHTML(string htmlCode)
{
string result = htmlCode;
Regex regex = new Regex("<\\s*\\/*html[^>]*>",
RegexOptions.IgnoreCase);
result = regex.Replace(result, string.Empty);
regex = new Regex("<\\s*head.*>.*<\\s*\\/head[^>]*>",
RegexOptions.IgnoreCase);
result = regex.Replace(result, string.Empty);
regex = new Regex("<\\s*\\/*body[^>]*>",
RegexOptions.IgnoreCase);
result = regex.Replace(result, string.Empty);
regex = new Regex("[^:]\\/\\/.*",
RegexOptions.IgnoreCase);
result = regex.Replace(result, string.Empty);
return result;
}

Click to expand...

Click to expand...

Karch · Apr 1, 2008

If you run this:

string result = "<html><head></head><body>The body</body></html>";
result = retainBody.Replace(result, "$1");

With the following Regex:

private static readonly Regex retainBody = new
Regex(@"<\s*body[^>]*>(.*)<[\s/]*body[^>]*>", RegexOptions.Compiled |
RegexOptions.IgnoreCase | RegexOptions.Singleline);

You get this as the return:

<html><head></head>The body</html>

I want this:

The body

Jesse Houwing said:
Hello Karch,

Ahh I seemed to have forgotten the SingleLine RegexOption here:

private static readonly Regex retainBody = new
Regex(@"<\s*body[^>]*>(.*)<[\s/]*body", RegexOptions.Compiled |
RegexOptions.IgnoreCase | RegexOptions.SingleLine);
retainBody.Replace(input, "$1");

That does it for me. That just grabs everythign from between the body
tags. It does require you to have a body tag though...

Jesse

Thanks - I implemented most of your suggestions and was able to trim
about 60% of the execution time. However, I have one question about
your examples:

You say: "You might also want to investigate if instead of removing
the head (+content), html and body tag, it isn't less expensive to try
to retain the contents of the body tag instead. That would remove a
large bit of content to filter as well, so do this early in the
cleanup process:" This doesn't seem to work quite as expected. I
thought it would remove everything except the contents of the body,
but it actually leaves the outer tag - at least in one case, where I
do not have a head tag, the html tag is left. Did I misunderstand you?
In my case the body could be quite large, so I expect you are correct
that it would be better to use this method. How to also remove the
outer tags in this one regex?

private static readonly Regex retainBody = new
Regex(@"<\s*body[^>]*>(.*)<[\s/]*body", RegexOptions.Compiled |
RegexOptions.IgnoreCase);
retainBody.Replace(input, "$1");
To answer your question, the reason for this mess of code in the first
place is to take customer entered code and convert it. I really have
no control over the generation. The original code has been in place
for a while, but like most people would, I thought there had to be a
better way. Those frameworks sounds interesting, but, in this case, I
would rather not introduce any other libraries just to fix these two
functions - this is in a high perf situation and I would rather just
optimize what is there.

Hello Karch" news.microsoft.com,

I have these two methods that are chewing up a ton of CPU time in my
application. Does anyone have any suggestions on how to optimize
them or rewrite them without Regex? The most time-consuming
operation by a long-shot is the regex.Replace. Basically the only
purpose of it is to remove spaces between opening/closing tags and
the element name. Surely there is a better way.

Let's try to find out what exactly is going on here first:

body and html are removed
head tag and contents are removed
//.* is removed except when there's a : in front of it (looks like an
atempt to remove singleline comments)
\ is repaced with \\
" is replaced with \"
script tag is cleaned
then script is rewritten as scr"+"ipt
Environment.Newline and tab and \n are replaced with nothing
The whole string is trimmed twice
I might have missed something, but the code in question is pretty
hard to read.

While doing this a lot of Regex Objects are created (and then left to
die) and a lot of new strings are allocated.

It shouldn't be too hard to clean this up a bit... Especially remove
some of the duplicate parts.

first trick to use when using a regex over and over again is to
compile it once and store it outside of your method. The easiest way
is to save them to a static readonly Regex instance like this:

private static readonly Regex scriptRewrite = new Regex("<script>",
RegexOptions.Compiled | RegexOptions.IgnoreCase);

That way it is created, parsed and compiled only once. You can then
use the optimised form.

Secondly, a lot of regexes do the same thing, for a slightly
different tag. You could parobably combine this to a single regex to
improve performance

for example <\s*/?hhtml and \s*/?body and so forth can be combined.
Also try using the verbatim string notation to improve readability
like this:

private static readonly Regex removeOuterTags= new
Regex(@"<[\s/]*(?:body|html)[^>]*>", RegexOptions.Compiled |
RegexOptions.IgnoreCase);

<script> and </script> are replaced seperately, while this can be
done in one pass without trouble:

private static readonly Regex splitScriptTags= new
Regex(@"<[\s]*(/)?\s*script([^>]*)>", RegexOptions.Compiled |
RegexOptions.IgnoreCase);
splitScriptTags.Replace(input, "<$1scr\"+\"ipt$2>");
using 3 chained replace calls to remove tabs, newlines and
Environment.Newlines (\r\n usually) can be rewritten as one Regex
which might be faster... Or you could use a stringbuilder and loop
yourself. test different scenario's here...

the regex would be very simple:
private static readonly Regex removeUnwantedWhitespace = new
Regex(@"[\t\r\n]+", RegexOptions.Compiled);
removeUnwantedWhitespace.Replace(input, "");
possibly combine this with the //.*, but I'm not sure it will be
faster. My guess is that two passes will be faster than one combined
pass:

private static readonly Regex removeUnwantedWhitespaceAndComments =
new Regex(@"([\t\r\n]+|(?<!//.*$)", RegexOptions.Compiled);
removeUnwantedWhitespaceAndComments.Replace(input, "");

At least fix the regex to remove comments, as it currently removes
one
character too much (the character in front of the // which is
probably
usually a space, but could just as well be a ;.
private static readonly Regex removeUnwantedComments = new
Regex(@"(?<!.*$", RegexOptions.Compiled);
removeUnwantedComments.Replace(input, "");
Result is trimmed twice in FixupJavascript, so that's an easy save
(Wouldn't be that costly probably, but still).

You might also want to investigate if instead of removing the head
(+content), html and body tag, it isn't less expensive to try to
retain the contents of the body tag instead. That would remove a
large bit of content to filter as well, so do this early in the
cleanup process:

private static readonly Regex retainBody = new
Regex(@"<\s*body[^>]*>(.*)<[\s/]*body", RegexOptions.Compiled |
RegexOptions.IgnoreCase);
retainBody.Replace(input, "$1");
You're already using string.Replace when it's optimal to do so. So
that's good.

Some alternative solutions you could try...
Appart from all these possible optimizations, my primary question to
you
is... why? What got you into the position to need to do this. Is
there an
easier way. Can you generate the HTML differently in the first place?
Can
you optimise it once and store the result instead of the original?
Can you
prevent having to Fixup the HTML at all? Please elaborate more on the
original issue which resulted in the creation of this code... Also
try to
see if there's an existing framework function (possibly in the Ajax
framework) which already does most of this stuff. It looks like it
could
be a standard format as Javascript string function... It might help
to
attach some examples of your input and the required output. That way
we
can test our assumptions.
Kind regards,

Jesse Houwing

private string FixupJavascript(string htmlCode)
{
string result = FixupHTML(htmlCode);
result = result.Replace("\\", "\\\\");
result = result.Replace("\"", "\\\"");
Regex regex = new Regex("<\\s*script",
RegexOptions.IgnoreCase);
result = regex.Replace(result, "<scr\" + \"ipt");
regex = new Regex("<\\s*\\/script>",
RegexOptions.IgnoreCase);
result = regex.Replace(result, "</scr\" + \"ipt>");
result = result.Replace(Environment.NewLine,
string.Empty).Replace("\t", string.Empty).Replace("\n",
string.Empty).Trim();
return result.Trim();
}
private string FixupHTML(string htmlCode)
{
string result = htmlCode;
Regex regex = new Regex("<\\s*\\/*html[^>]*>",
RegexOptions.IgnoreCase);
result = regex.Replace(result, string.Empty);
regex = new Regex("<\\s*head.*>.*<\\s*\\/head[^>]*>",
RegexOptions.IgnoreCase);
result = regex.Replace(result, string.Empty);
regex = new Regex("<\\s*\\/*body[^>]*>",
RegexOptions.IgnoreCase);
result = regex.Replace(result, string.Empty);
regex = new Regex("[^:]\\/\\/.*",
RegexOptions.IgnoreCase);
result = regex.Replace(result, string.Empty);
return result;
}

Click to expand...

Click to expand...

Jesse Houwing · Apr 4, 2008

Hello Karch" news.microsoft.com,

If you run this:

string result = "<html><head></head><body>The body</body></html>";
result = retainBody.Replace(result, "$1");

With the following Regex:

private static readonly Regex retainBody = new
Regex(@"<\s*body[^>]*>(.*)<[\s/]*body[^>]*>", RegexOptions.Compiled |
RegexOptions.IgnoreCase | RegexOptions.Singleline);

You get this as the return:

<html><head></head>The body</html>

I want this:

The body

I must have been sleeping... Sorry...

Match m =retainBody.Match(result);
if (m.Success)
{
result = m.Groups[1].Value;
}

will probably work better :S

Jesse

Hello Karch,

Ahh I seemed to have forgotten the SingleLine RegexOption here:

private static readonly Regex retainBody = new
Regex(@"<\s*body[^>]*>(.*)<[\s/]*body", RegexOptions.Compiled |
RegexOptions.IgnoreCase | RegexOptions.SingleLine);
retainBody.Replace(input, "$1");

That does it for me. That just grabs everythign from between the body
tags. It does require you to have a body tag though...

Jesse

Thanks - I implemented most of your suggestions and was able to trim
about 60% of the execution time. However, I have one question about
your examples:

You say: "You might also want to investigate if instead of removing
the head (+content), html and body tag, it isn't less expensive to
try to retain the contents of the body tag instead. That would
remove a large bit of content to filter as well, so do this early in
the cleanup process:" This doesn't seem to work quite as expected. I
thought it would remove everything except the contents of the body,
but it actually leaves the outer tag - at least in one case, where I
do not have a head tag, the html tag is left. Did I misunderstand
you? In my case the body could be quite large, so I expect you are
correct that it would be better to use this method. How to also
remove the outer tags in this one regex?

private static readonly Regex retainBody = new
Regex(@"<\s*body[^>]*>(.*)<[\s/]*body", RegexOptions.Compiled |
RegexOptions.IgnoreCase);
retainBody.Replace(input, "$1");
To answer your question, the reason for this mess of code in the
first
place is to take customer entered code and convert it. I really have
no control over the generation. The original code has been in place
for a while, but like most people would, I thought there had to be a
better way. Those frameworks sounds interesting, but, in this case,
I
would rather not introduce any other libraries just to fix these two
functions - this is in a high perf situation and I would rather just
optimize what is there.

Hello Karch" news.microsoft.com,

I have these two methods that are chewing up a ton of CPU time in
my application. Does anyone have any suggestions on how to
optimize them or rewrite them without Regex? The most
time-consuming operation by a long-shot is the regex.Replace.
Basically the only purpose of it is to remove spaces between
opening/closing tags and the element name. Surely there is a
better way.

Let's try to find out what exactly is going on here first:

body and html are removed
head tag and contents are removed
//.* is removed except when there's a : in front of it (looks like
an
atempt to remove singleline comments)
\ is repaced with \\
" is replaced with \"
script tag is cleaned
then script is rewritten as scr"+"ipt
Environment.Newline and tab and \n are replaced with nothing
The whole string is trimmed twice
I might have missed something, but the code in question is pretty
hard to read.
While doing this a lot of Regex Objects are created (and then left
to die) and a lot of new strings are allocated.

It shouldn't be too hard to clean this up a bit... Especially
remove some of the duplicate parts.

first trick to use when using a regex over and over again is to
compile it once and store it outside of your method. The easiest
way is to save them to a static readonly Regex instance like this:

private static readonly Regex scriptRewrite = new Regex("<script>",
RegexOptions.Compiled | RegexOptions.IgnoreCase);

That way it is created, parsed and compiled only once. You can then
use the optimised form.

Secondly, a lot of regexes do the same thing, for a slightly
different tag. You could parobably combine this to a single regex
to improve performance

for example <\s*/?hhtml and \s*/?body and so forth can be combined.
Also try using the verbatim string notation to improve readability
like this:

private static readonly Regex removeOuterTags= new
Regex(@"<[\s/]*(?:body|html)[^>]*>", RegexOptions.Compiled |
RegexOptions.IgnoreCase);

<script> and </script> are replaced seperately, while this can be
done in one pass without trouble:

private static readonly Regex splitScriptTags= new
Regex(@"<[\s]*(/)?\s*script([^>]*)>", RegexOptions.Compiled |
RegexOptions.IgnoreCase);
splitScriptTags.Replace(input, "<$1scr\"+\"ipt$2>");
using 3 chained replace calls to remove tabs, newlines and
Environment.Newlines (\r\n usually) can be rewritten as one Regex
which might be faster... Or you could use a stringbuilder and loop
yourself. test different scenario's here...
the regex would be very simple:
private static readonly Regex removeUnwantedWhitespace = new
Regex(@"[\t\r\n]+", RegexOptions.Compiled);
removeUnwantedWhitespace.Replace(input, "");
possibly combine this with the //.*, but I'm not sure it will be
faster. My guess is that two passes will be faster than one
combined
pass:
private static readonly Regex removeUnwantedWhitespaceAndComments =
new Regex(@"([\t\r\n]+|(?<!//.*$)", RegexOptions.Compiled);
removeUnwantedWhitespaceAndComments.Replace(input, "");

At least fix the regex to remove comments, as it currently removes
one
character too much (the character in front of the // which is
probably
usually a space, but could just as well be a ;.
private static readonly Regex removeUnwantedComments = new
Regex(@"(?<!.*$", RegexOptions.Compiled);
removeUnwantedComments.Replace(input, "");
Result is trimmed twice in FixupJavascript, so that's an easy save
(Wouldn't be that costly probably, but still).
You might also want to investigate if instead of removing the head
(+content), html and body tag, it isn't less expensive to try to
retain the contents of the body tag instead. That would remove a
large bit of content to filter as well, so do this early in the
cleanup process:

private static readonly Regex retainBody = new
Regex(@"<\s*body[^>]*>(.*)<[\s/]*body", RegexOptions.Compiled |
RegexOptions.IgnoreCase);
retainBody.Replace(input, "$1");
You're already using string.Replace when it's optimal to do so. So
that's good.
Some alternative solutions you could try...
Appart from all these possible optimizations, my primary question
to
you
is... why? What got you into the position to need to do this. Is
there an
easier way. Can you generate the HTML differently in the first
place?
Can
you optimise it once and store the result instead of the original?
Can you
prevent having to Fixup the HTML at all? Please elaborate more on
the
original issue which resulted in the creation of this code... Also
try to
see if there's an existing framework function (possibly in the Ajax
framework) which already does most of this stuff. It looks like it
could
be a standard format as Javascript string function... It might help
to
attach some examples of your input and the required output. That
way
we
can test our assumptions.
Kind regards,
Jesse Houwing

private string FixupJavascript(string htmlCode)
{
string result = FixupHTML(htmlCode);
result = result.Replace("\\", "\\\\");
result = result.Replace("\"", "\\\"");
Regex regex = new Regex("<\\s*script",
RegexOptions.IgnoreCase);
result = regex.Replace(result, "<scr\" + \"ipt");
regex = new Regex("<\\s*\\/script>",
RegexOptions.IgnoreCase);
result = regex.Replace(result, "</scr\" + \"ipt>");
result = result.Replace(Environment.NewLine,
string.Empty).Replace("\t", string.Empty).Replace("\n",
string.Empty).Trim();
return result.Trim();
}
private string FixupHTML(string htmlCode)
{
string result = htmlCode;
Regex regex = new Regex("<\\s*\\/*html[^>]*>",
RegexOptions.IgnoreCase);
result = regex.Replace(result, string.Empty);
regex = new Regex("<\\s*head.*>.*<\\s*\\/head[^>]*>",
RegexOptions.IgnoreCase);
result = regex.Replace(result, string.Empty);
regex = new Regex("<\\s*\\/*body[^>]*>",
RegexOptions.IgnoreCase);
result = regex.Replace(result, string.Empty);
regex = new Regex("[^:]\\/\\/.*",
RegexOptions.IgnoreCase);
result = regex.Replace(result, string.Empty);
return result;
}

Click to expand...

Click to expand...

Karch · Apr 5, 2008

Yeah, that's what I ended up going with. Thanks!

Jesse Houwing said:
Hello Karch" news.microsoft.com,

If you run this:

string result = "<html><head></head><body>The body</body></html>";
result = retainBody.Replace(result, "$1");

With the following Regex:

private static readonly Regex retainBody = new
Regex(@"<\s*body[^>]*>(.*)<[\s/]*body[^>]*>", RegexOptions.Compiled |
RegexOptions.IgnoreCase | RegexOptions.Singleline);

You get this as the return:

<html><head></head>The body</html>

I want this:

The body

Click to expand...

I must have been sleeping... Sorry...

Match m =retainBody.Match(result);
if (m.Success)
{
result = m.Groups[1].Value;
}

will probably work better :S

Jesse

Hello Karch,

Ahh I seemed to have forgotten the SingleLine RegexOption here:

private static readonly Regex retainBody = new
Regex(@"<\s*body[^>]*>(.*)<[\s/]*body", RegexOptions.Compiled |
RegexOptions.IgnoreCase | RegexOptions.SingleLine);
retainBody.Replace(input, "$1");

That does it for me. That just grabs everythign from between the body
tags. It does require you to have a body tag though...

Jesse

Thanks - I implemented most of your suggestions and was able to trim
about 60% of the execution time. However, I have one question about
your examples:

You say: "You might also want to investigate if instead of removing
the head (+content), html and body tag, it isn't less expensive to
try to retain the contents of the body tag instead. That would
remove a large bit of content to filter as well, so do this early in
the cleanup process:" This doesn't seem to work quite as expected. I
thought it would remove everything except the contents of the body,
but it actually leaves the outer tag - at least in one case, where I
do not have a head tag, the html tag is left. Did I misunderstand
you? In my case the body could be quite large, so I expect you are
correct that it would be better to use this method. How to also
remove the outer tags in this one regex?

private static readonly Regex retainBody = new
Regex(@"<\s*body[^>]*>(.*)<[\s/]*body", RegexOptions.Compiled |
RegexOptions.IgnoreCase);
retainBody.Replace(input, "$1");
To answer your question, the reason for this mess of code in the
first
place is to take customer entered code and convert it. I really have
no control over the generation. The original code has been in place
for a while, but like most people would, I thought there had to be a
better way. Those frameworks sounds interesting, but, in this case,
I
would rather not introduce any other libraries just to fix these two
functions - this is in a high perf situation and I would rather just
optimize what is there.

Hello Karch" news.microsoft.com,

I have these two methods that are chewing up a ton of CPU time in
my application. Does anyone have any suggestions on how to
optimize them or rewrite them without Regex? The most
time-consuming operation by a long-shot is the regex.Replace.
Basically the only purpose of it is to remove spaces between
opening/closing tags and the element name. Surely there is a
better way.

Let's try to find out what exactly is going on here first:

body and html are removed
head tag and contents are removed
//.* is removed except when there's a : in front of it (looks like
an
atempt to remove singleline comments)
\ is repaced with \\
" is replaced with \"
script tag is cleaned
then script is rewritten as scr"+"ipt
Environment.Newline and tab and \n are replaced with nothing
The whole string is trimmed twice
I might have missed something, but the code in question is pretty
hard to read.
While doing this a lot of Regex Objects are created (and then left
to die) and a lot of new strings are allocated.

It shouldn't be too hard to clean this up a bit... Especially
remove some of the duplicate parts.

first trick to use when using a regex over and over again is to
compile it once and store it outside of your method. The easiest
way is to save them to a static readonly Regex instance like this:

private static readonly Regex scriptRewrite = new Regex("<script>",
RegexOptions.Compiled | RegexOptions.IgnoreCase);

That way it is created, parsed and compiled only once. You can then
use the optimised form.

Secondly, a lot of regexes do the same thing, for a slightly
different tag. You could parobably combine this to a single regex
to improve performance

for example <\s*/?hhtml and \s*/?body and so forth can be combined.
Also try using the verbatim string notation to improve readability
like this:

private static readonly Regex removeOuterTags= new
Regex(@"<[\s/]*(?:body|html)[^>]*>", RegexOptions.Compiled |
RegexOptions.IgnoreCase);

<script> and </script> are replaced seperately, while this can be
done in one pass without trouble:

private static readonly Regex splitScriptTags= new
Regex(@"<[\s]*(/)?\s*script([^>]*)>", RegexOptions.Compiled |
RegexOptions.IgnoreCase);
splitScriptTags.Replace(input, "<$1scr\"+\"ipt$2>");
using 3 chained replace calls to remove tabs, newlines and
Environment.Newlines (\r\n usually) can be rewritten as one Regex
which might be faster... Or you could use a stringbuilder and loop
yourself. test different scenario's here...
the regex would be very simple:
private static readonly Regex removeUnwantedWhitespace = new
Regex(@"[\t\r\n]+", RegexOptions.Compiled);
removeUnwantedWhitespace.Replace(input, "");
possibly combine this with the //.*, but I'm not sure it will be
faster. My guess is that two passes will be faster than one
combined
pass:
private static readonly Regex removeUnwantedWhitespaceAndComments =
new Regex(@"([\t\r\n]+|(?<!//.*$)", RegexOptions.Compiled);
removeUnwantedWhitespaceAndComments.Replace(input, "");

At least fix the regex to remove comments, as it currently removes
one
character too much (the character in front of the // which is
probably
usually a space, but could just as well be a ;.
private static readonly Regex removeUnwantedComments = new
Regex(@"(?<!.*$", RegexOptions.Compiled);
removeUnwantedComments.Replace(input, "");
Result is trimmed twice in FixupJavascript, so that's an easy save
(Wouldn't be that costly probably, but still).
You might also want to investigate if instead of removing the head
(+content), html and body tag, it isn't less expensive to try to
retain the contents of the body tag instead. That would remove a
large bit of content to filter as well, so do this early in the
cleanup process:

private static readonly Regex retainBody = new
Regex(@"<\s*body[^>]*>(.*)<[\s/]*body", RegexOptions.Compiled |
RegexOptions.IgnoreCase);
retainBody.Replace(input, "$1");
You're already using string.Replace when it's optimal to do so. So
that's good.
Some alternative solutions you could try...
Appart from all these possible optimizations, my primary question
to
you
is... why? What got you into the position to need to do this. Is
there an
easier way. Can you generate the HTML differently in the first
place?
Can
you optimise it once and store the result instead of the original?
Can you
prevent having to Fixup the HTML at all? Please elaborate more on
the
original issue which resulted in the creation of this code... Also
try to
see if there's an existing framework function (possibly in the Ajax
framework) which already does most of this stuff. It looks like it
could
be a standard format as Javascript string function... It might help
to
attach some examples of your input and the required output. That
way
we
can test our assumptions.
Kind regards,
Jesse Houwing

private string FixupJavascript(string htmlCode)
{
string result = FixupHTML(htmlCode);
result = result.Replace("\\", "\\\\");
result = result.Replace("\"", "\\\"");
Regex regex = new Regex("<\\s*script",
RegexOptions.IgnoreCase);
result = regex.Replace(result, "<scr\" + \"ipt");
regex = new Regex("<\\s*\\/script>",
RegexOptions.IgnoreCase);
result = regex.Replace(result, "</scr\" + \"ipt>");
result = result.Replace(Environment.NewLine,
string.Empty).Replace("\t", string.Empty).Replace("\n",
string.Empty).Trim();
return result.Trim();
}
private string FixupHTML(string htmlCode)
{
string result = htmlCode;
Regex regex = new Regex("<\\s*\\/*html[^>]*>",
RegexOptions.IgnoreCase);
result = regex.Replace(result, string.Empty);
regex = new Regex("<\\s*head.*>.*<\\s*\\/head[^>]*>",
RegexOptions.IgnoreCase);
result = regex.Replace(result, string.Empty);
regex = new Regex("<\\s*\\/*body[^>]*>",
RegexOptions.IgnoreCase);
result = regex.Replace(result, string.Empty);
regex = new Regex("[^:]\\/\\/.*",
RegexOptions.IgnoreCase);
result = regex.Replace(result, string.Empty);
return result;
}

Click to expand...

Click to expand...

Regex to retain only the HTML body	1	Apr 1, 2008
Regex performance issue	6	Oct 10, 2008
Regex help needed	1	Apr 4, 2010
Regex - Matching URLS	2	Dec 10, 2007
more regex question how to avoid capturing leading empty lines	2	Aug 9, 2007
Regex Issues - Finding Qualified URLS	2	Dec 11, 2007
Regex: Capturing HTML	1	Oct 11, 2005
Regex	1	Nov 22, 2005

Regex woes

Karch

Jesse Houwing

Jesse Houwing

Karch

Karch

Jesse Houwing

Karch

Jesse Houwing

Karch

Ask a Question

Similar Threads