More RegEx Questions

Shawn B. · Apr 16, 2007

Greetings,

Let's say I have the following expression:

(<A|ABBR|ADDRESS|APPLET(\s){1,}(.*?)>(.*?)</A|ABBR|ADDRESS|APPLET)

It is meant to match any HTML element that opens with one of the tag names
listed above (the list is shortened for brevity), along with its closing tag.

Assuming that I have a list of possible opening tags, how can I tell the
RegEx to match only the closing tag with the same name as the opening tag it
first matched, such that:

<A ...>...</A> will be matched as per the above example but not
<ADDRESS>...</A> but then, <ADDRESS>...<A>...</A>...</ADDRESS> will be
matched?

Do I have to input and maintain each of the hundreds of HTML tags
separately?

Thanks,
Shawn

Hans Kesting · Apr 17, 2007

<(A|ABBR|ADDRESS|APPLET)((\s)+(.*?))?>(.*?)</\1>

The \1 refers to the first captured group (the first () pair, which holds
the tag names), so the closing tag must repeat the name the opening tag matched.
Note that I also made the attribute list optional.
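
A minimal sketch of the backreference at work, for anyone who wants to try it. It assumes Python's re module purely for illustration (the thread does not name an engine), and the sample strings are made up:

```python
import re

# Hans's pattern: \1 must repeat whatever tag name group 1 captured.
TAG = re.compile(r"<(A|ABBR|ADDRESS|APPLET)((\s)+(.*?))?>(.*?)</\1>")

def first_match(html):
    """Return the text of the first match, or None if nothing matches."""
    m = TAG.search(html)
    return m.group(0) if m else None

# Sample inputs are hypothetical, made up for this illustration.
same = first_match('<A href="x">link</A>')
mixed = first_match('<ADDRESS>street</A>')
nested = first_match('<ADDRESS>see <A href="y">here</A> now</ADDRESS>')

print(same)    # the A element: opening and closing names agree
print(mixed)   # None: </A> cannot close <ADDRESS>
print(nested)  # the whole ADDRESS element, inner <A> included
```

The mismatched pair is rejected because \1 holds "ADDRESS" at the point where only "</A>" is available.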

Hans Kesting

Kevin Spencer · Apr 17, 2007

OK, I love regular expressions, so I fiddled with this (really difficult)
problem a bit. First, a more succinct and extensible version of Hans'
solution:

(?i)(?s)<(\w+)([^>]*)>(.*?)</\1>

The first two items are inline flags for "case-insensitive" and "dot
matches newline." After that, I substituted "\w+" for the list of tag names,
since all HTML tag names consist only of word characters. This will capture
ANY tag.
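
A small sketch of the generic pattern, again assuming Python's re module for illustration. The re.IGNORECASE and re.DOTALL flags stand in for (?i) and (?s), and the sample string is invented:

```python
import re

# re.IGNORECASE and re.DOTALL play the roles of (?i) and (?s).
ANY_TAG = re.compile(r"<(\w+)([^>]*)>(.*?)</\1>", re.IGNORECASE | re.DOTALL)

# Hypothetical sample: mixed-case tag names and content spanning two lines.
html = '<Em class="x">multi\nline</eM> and <custom>new tag</custom>'

# Each tuple is (tag name, raw attribute text, element content).
found = [(m.group(1), m.group(2), m.group(3)) for m in ANY_TAG.finditer(html)]
for name, attrs, body in found:
    print(repr(name), repr(attrs), repr(body))
```

No tag list is needed: the name is whatever \w+ happened to capture, and the backreference is compared without regard to case.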

However, this does not address the (common) problem of nested tags. Consider
the following HTML, which you can use to test all of these:

<table id="outer">
<tr>
<td>Outer</td>
<td>Outer</td>
<td>Outer</td>
</tr>
<tr>
<td>Outer</td>
<td>
<table id="inner">
<tr>
<td>Inner</td>
<td>Inner</td>
</tr>
<tr>
<td>Inner</td>
<td>Inner</td>
</tr>
</table>
</td>
<td>Outer</td>
</tr>
<tr>
<td>Outer</td>
<td>Outer</td>
<td>Outer</td>
</tr>
</table>
<table><tr><td></td></tr></table>

The regular expression above will capture from the first <table> tag to the
first </table> tag, and those are tags from 2 different tables, the outer
and the inner.
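
The mismatch is easy to reproduce. This sketch assumes Python's re module and uses an abbreviated version of the nested-table sample:

```python
import re

ANY_TAG = re.compile(r"<(\w+)([^>]*)>(.*?)</\1>", re.IGNORECASE | re.DOTALL)

# Abbreviated version of the nested-table sample.
html = (
    '<table id="outer"><tr><td>Outer</td><td>'
    '<table id="inner"><tr><td>Inner</td></tr></table>'
    '</td></tr></table>'
)

m = ANY_TAG.search(html)
matched = m.group(0)

# The lazy (.*?) stops at the first </table>, which closes the inner table.
print(matched.endswith('<td>Inner</td></tr></table>'))
# So the match starts at the outer table but is cut short.
print(matched == html)
```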

So, here's a solution that ensures that nested elements are captured:

(?i)(?s)<(\w+)[^>]*>(?(?=</\1)</\1>|.+</\1>)

The problem with this one is that it doesn't necessarily stop at the end of
the first element: because ".+" is greedy, it runs to the last closing tag
with the same name as the first element. In the example, it will capture the
entire text as a single match.
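
The over-matching can be seen with two sibling tables. Note the substitution in this sketch: Python's re module, assumed here for illustration, does not accept a lookahead as the condition in (?(...)yes|no), so plain alternation stands in for the conditional. It is an approximation, not the same expression:

```python
import re

# Approximation: alternation replaces the lookahead conditional,
# which Python's re does not support.
GREEDY = re.compile(r"<(\w+)[^>]*>(?:</\1>|.+</\1>)", re.IGNORECASE | re.DOTALL)

# Two sibling tables (hypothetical sample), not nested in each other.
html = '<table><tr><td>1</td></tr></table><p>gap</p><table><tr><td>2</td></tr></table>'

matches = [m.group(0) for m in GREEDY.finditer(html)]
print(len(matches))        # one match instead of two tables
print(matches[0] == html)  # it spans both tables and the text between them
```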

This solution captures only inner-most nested tags:

(?i)(?s)<(\w+)[^>]*>[^>]*</\1>

The problem with this one is that it doesn't capture any tags containing
nested tags.
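
A quick sketch of what "inner-most only" means in practice, assuming Python's re module and an invented sample row:

```python
import re

INNERMOST = re.compile(r"<(\w+)[^>]*>[^>]*</\1>", re.IGNORECASE | re.DOTALL)

# Hypothetical sample: one plain cell and one cell wrapping a <b> element.
html = '<tr><td>Outer</td><td><b>Inner</b></td></tr>'

leaves = [m.group(0) for m in INNERMOST.finditer(html)]
print(leaves)
```

The second td and the tr are skipped because their content contains a ">" from a nested tag, which [^>]* cannot cross.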

So, depending upon the requirements, it would probably be necessary to
combine regular expressions with processing code, to do something like the
following:

1. Find innermost nested tags.
2. Remove the matches.
3. Repeat (recursively) until no matches are found.
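
The three steps above could be sketched roughly as follows. This assumes Python's re module, uses a loop rather than recursion, and the function name peel_elements is made up for the illustration. Because matches are removed on each pass, parent elements are recorded with their children already stripped out:

```python
import re

INNERMOST = re.compile(r"<(\w+)[^>]*>[^>]*</\1>", re.IGNORECASE | re.DOTALL)

def peel_elements(html):
    """Collect elements level by level, innermost first.

    Each pass records the innermost matches, then removes them from the
    text so that their parents become innermost on the next pass.
    Returns the list of levels and whatever text is left over.
    """
    levels = []
    while True:
        found = [m.group(0) for m in INNERMOST.finditer(html)]
        if not found:
            break
        levels.append(found)
        html = INNERMOST.sub("", html)
    return levels, html

# Hypothetical sample input.
levels, rest = peel_elements('<table><tr><td>a</td><td><b>x</b></td></tr></table>')
for depth, elements in enumerate(levels):
    print(depth, elements)
```

If the original nesting has to be preserved, the processing code would need to record match positions or leave a placeholder instead of deleting each match outright.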

--
HTH,

Kevin Spencer
Microsoft MVP

Printing Components, Email Components,
FTP Client Classes, Enhanced Data Controls, much more.
DSI PrintManager, Miradyne Component Libraries:
http://www.miradyne.net
