I handled the case where you can have nested tags, using a backtracking regex with balancing groups:
public void showTDContent(string content) {
    // Needs System.Text.RegularExpressions (and System.Windows.Forms for MessageBox).
    // The pattern is one logical line; it is concatenated here only to keep lines short.
    Regex regex = new Regex(
        @"<td\s+class=\""someclass\"">(?<tdcontent>.*?((?=<td)|(?=</td))"
        + @"(((?<Open><td.*?>).*?((?=<td)|(?=</td)))+"
        + @"((?<Close-Open></td>).*?((?=<td)|(?=</td)))+)*"
        + @"(?(Open)(?!)))</td>");
    MatchCollection matches = regex.Matches(content);
    foreach (Match match in matches) {
        string sMatch = match.Groups["tdcontent"].Value;
        MessageBox.Show(sMatch);
        showTDContent(sMatch); // Try to find nested tags
    }
}
Here is an example:
showTDContent(
    @"<table><tr><td class=""someclass""><table><tr><td>someText</td>"
    + @"<td class=""someclass"">otherText</td></tr></table></td></tr>"
    + @"<tr><td class=""someclass"">thirdText</td></tr></table>");
It shows three strings, in this order:
1) <table><tr><td>someText</td><td class="someclass">otherText</td></tr></table>
2) otherText
3) thirdText
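If you want to check those three strings without a message box, here is a minimal sketch that collects them into a list instead. The class name TdExtractor and the method Collect are hypothetical names I made up for the illustration; the pattern is the one from the method above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TdExtractor
{
    // Same pattern as in the post, concatenated only to keep lines short.
    private static readonly Regex TdRegex = new Regex(
        @"<td\s+class=\""someclass\"">(?<tdcontent>.*?((?=<td)|(?=</td))"
        + @"(((?<Open><td.*?>).*?((?=<td)|(?=</td)))+"
        + @"((?<Close-Open></td>).*?((?=<td)|(?=</td)))+)*"
        + @"(?(Open)(?!)))</td>");

    // Hypothetical helper: records each match, then recurses into it
    // to find nested <td class="someclass"> cells.
    public static void Collect(string content, List<string> results)
    {
        foreach (Match match in TdRegex.Matches(content))
        {
            string inner = match.Groups["tdcontent"].Value;
            results.Add(inner);
            Collect(inner, results);
        }
    }
}
```

Calling TdExtractor.Collect(html, list) on the example string fills the list in the same order as the message boxes appear.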
If you don't have nested tags like in the previous example, you can keep the SAME
regular expression, but you don't need the recursion:
public void showTDContent(string content) {
    // Same pattern as the recursive version, concatenated only to keep lines short.
    Regex regex = new Regex(
        @"<td\s+class=\""someclass\"">(?<tdcontent>.*?((?=<td)|(?=</td))"
        + @"(((?<Open><td.*?>).*?((?=<td)|(?=</td)))+"
        + @"((?<Close-Open></td>).*?((?=<td)|(?=</td)))+)*"
        + @"(?(Open)(?!)))</td>");
    MatchCollection matches = regex.Matches(content);
    foreach (Match match in matches) {
        MessageBox.Show(match.Groups["tdcontent"].Value);
    }
}
To understand the regular expression, have a look at
http://blogs.msdn.com/bclteam/archive/2005/03/15/396452.aspx
It explains how balanced matching works, using this pattern:
<
[^<>]*
(
    (
        (?<Open><)
        [^<>]*
    )+
    (
        (?<Close-Open>>)
        [^<>]*
    )+
)*
(?(Open)(?!))
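To see the balancing groups at work on their own, here is a small sketch. Note the assumptions: I anchored the pattern with ^ and $ and closed it with a trailing > so that the whole input has to be one balanced <...> block, and BalancedDemo and IsBalanced are hypothetical names for the illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BalancedDemo
{
    // Each "<" pushes a capture onto the Open stack, each ">" pops one.
    // (?(Open)(?!)) forces a failure if anything is left on the stack.
    // The ^, $ and trailing > are additions made for this demo.
    private static readonly Regex Balanced = new Regex(
        @"^<[^<>]*(((?<Open><)[^<>]*)+((?<Close-Open>>)[^<>]*)+)*(?(Open)(?!))>$");

    public static bool IsBalanced(string s) => Balanced.IsMatch(s);
}
```

With this, IsBalanced("<a<b>c>") is true, while IsBalanced("<a<b>") is false because one < is never closed.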
My regex is nearly the same:
<td\s+class=\"someclass\">(?<tdcontent>
    .*?((?=<td)|(?=</td))
    (
        (
            (?<Open><td.*?>)
            .*?((?=<td)|(?=</td))
        )+
        (
            (?<Close-Open></td>)
            .*?((?=<td)|(?=</td))
        )+
    )*
    (?(Open)(?!))
)</td>
In fact, [^<>]* is replaced by .*?((?=<td)|(?=</td)), which means "any text up
to the next opening or closing TD tag", the < and > delimiters are replaced by
<td.*?> and </td>, and everything else is exactly the same.
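To make the replacement piece concrete, here is a tiny sketch that runs .*?((?=<td)|(?=</td)) on its own. LookaheadDemo and TextBeforeNextTd are hypothetical names, used only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    // Lazily consumes characters until the next position where
    // "<td" or "</td" begins; the lookaheads consume nothing themselves.
    private static readonly Regex UpToTd = new Regex(@".*?((?=<td)|(?=</td))");

    // Returns the text in front of the next TD tag,
    // or an empty string if the input contains no TD tag.
    public static string TextBeforeNextTd(string s) => UpToTd.Match(s).Value;
}
```

For instance, TextBeforeNextTd("someText</td>") gives someText, and other tags such as <b> are skipped over like ordinary text.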
Hope this helps,
Ludovic SOEUR.