RegEx : Match and replace term within HTML tags


mike c

I have a search app that searches local HTML files for a specified
term. I then display the pages that contain the term.

I would like to highlight the search term within the HTML when it is
viewed.

I have the following regular expression code:

string searchTerm = "(?<STARTTAG>(<[^>]*>.*))(?<MATCHTERM>(" +
lastSearchTerm + "))(?<ENDTAG>(.*<[^>]*>))";

string replaceString = "${STARTTAG}<span style=\"background-color:#FFFFCC\">" +
"${MATCHTERM}</span>${ENDTAG}";

Regex.Replace(htmlBody, searchTerm, replaceString,
RegexOptions.IgnoreCase);

I am trying to match the search term within HTML tags. i.e.

<htmltag>searchterm</htmltag>

and then replace the search term with a span tag to color it, like so:

<htmltag><span style="background-color:#FFFFCC">searchterm</span></htmltag>

This works, but inconsistently, and without a discernible pattern
when it fails.

So, does anyone see anything obviously wrong with my Regular
Expressions? I am pretty new to regular expressions, although I
usually know enough to get stuff done.

mike c
 

BMermuys

Hi,
my comments are inline.

mike c said:
I have a search app that searches local HTML files for a specified
term. I then display the pages that contain the term.

I would like to highlight the search term within the HTML when it is
viewed.

I have the following regular expression code:

string searchTerm = "(?<STARTTAG>(<[^>]*>.*))(?<MATCHTERM>(" +
lastSearchTerm + "))(?<ENDTAG>(.*<[^>]*>))";

string replaceString = "${STARTTAG}<span style=\"background-color:#FFFFCC\">" +
"${MATCHTERM}</span>${ENDTAG}";

Regex.Replace(htmlBody, searchTerm, replaceString,
RegexOptions.IgnoreCase);

I am trying to match the search term within HTML tags. i.e.

<htmltag>searchterm</htmltag>

Because the .* in ENDTAG is greedy, the match runs on to the last tag on the
line (the .* in STARTTAG is greedy too). Even if you replace it with .*?
(non-greedy), some problems remain:

<h1> searchterm <b> searchterm </b> </h1>
<h1> searchterm <br> searchterm </h1>
<h1> searchterm searchterm </h1>

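In each of these cases only one occurrence of searchterm is replaced, because
a single match consumes the surrounding text, including the other occurrence.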
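
To see this for yourself, here is a minimal sketch that runs the original pattern and a non-greedy variant against the third sample line. The fixed term "searchterm" and the [ ] brackets (standing in for the span) are my own simplifications, not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

// "searchterm" stands in for lastSearchTerm; the [ ] brackets stand in for the <span>.
string term = "searchterm";
string replacement = "${STARTTAG}[${MATCHTERM}]${ENDTAG}";

// The pattern from the original post, and a variant with a non-greedy ENDTAG.
string greedy = "(?<STARTTAG>(<[^>]*>.*))(?<MATCHTERM>(" + term + "))(?<ENDTAG>(.*<[^>]*>))";
string lazy = "(?<STARTTAG>(<[^>]*>.*))(?<MATCHTERM>(" + term + "))(?<ENDTAG>(.*?<[^>]*>))";

string html = "<h1> searchterm searchterm </h1>";
string greedyResult = Regex.Replace(html, greedy, replacement, RegexOptions.IgnoreCase);
string lazyResult = Regex.Replace(html, lazy, replacement, RegexOptions.IgnoreCase);

Console.WriteLine(greedyResult); // <h1> searchterm [searchterm] </h1>
Console.WriteLine(lazyResult);   // <h1> searchterm [searchterm] </h1>
```

Both variants mark only the second occurrence: the greedy .* in STARTTAG swallows the first one, and the whole line is used up by a single match.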


If the HTML is valid, a word lies outside a tag when the next angle bracket
after it is a < rather than a >. Expressed as a positive lookahead, this
becomes:

string searchTerm = lastSearchTerm + "(?=[^>]*<)";

string replaceString = "<span style=\"background-color:#FFFFCC\">"+
lastSearchTerm + "</span>";

It may still go wrong inside <title> and <script> elements, where the term is
also followed by a < but a span does not belong.
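
Put together, the lookahead approach might look roughly like the sketch below. The Highlight helper is a hypothetical name of mine, and I have made two changes that are my own assumptions about what you want: Regex.Escape so that a term containing regex metacharacters is matched literally, and "$&" in the replacement so that the matched text keeps its original casing under IgnoreCase. Note also that Regex.Replace returns the new string rather than changing its argument, so the result has to be assigned.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not from the thread: wraps every occurrence of `term`
// that lies outside a tag in a highlighting <span>.
static string Highlight(string html, string term)
{
    // Regex.Escape keeps characters such as '.' or '+' in the term literal.
    string pattern = Regex.Escape(term) + "(?=[^>]*<)";
    // "$&" re-inserts the matched text, so the original casing is preserved.
    string replacement = "<span style=\"background-color:#FFFFCC\">$&</span>";
    return Regex.Replace(html, pattern, replacement, RegexOptions.IgnoreCase);
}

// Both occurrences are wrapped, each with its own casing.
Console.WriteLine(Highlight("<h1> searchterm <b> SearchTerm </b> </h1>", "searchterm"));

// The term inside the attribute is followed by '>' before any '<', so it is left alone.
Console.WriteLine(Highlight("<a href=\"searchterm.html\">link</a>", "searchterm"));
```

This still assumes valid HTML and does nothing special for <title> or <script> content.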

hth,
greetings
 
