regular expression help needed

  • Thread starter: henrik

henrik

Hi

I have a regex question. I want to find all content of a <td
class="someclass"> tag. This means the expression should include all other
tags included between <td class="someclass"> and </td>.

Please help

Regards

Henrik
 
Hi Henrik

I guess something like /^<[^>]+>(.*)<[^>]+>$/ should do the trick,
although I'm not really sure about that greedy matching... (i.e. will
$1 include *everything* after the first tag, or only up to the closing
tag?)
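
To answer the greedy question concretely, here is a minimal C# sketch (the class and method names are made up for illustration). Because the pattern is anchored with ^ and $ and (.*) is greedy, the capture runs from the end of the first tag to the start of the last tag in the whole string, not just to the first closing tag:

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyDemo
{
    // Markus's pattern as a C# verbatim string (delimiters / / dropped).
    public static string CaptureInner(string input)
    {
        // Groups[1].Value is an empty string when the match fails.
        return Regex.Match(input, @"^<[^>]+>(.*)<[^>]+>$").Groups[1].Value;
    }

    public static void Main()
    {
        // The greedy (.*) stops only at the LAST tag of the input.
        Console.WriteLine(CaptureInner("<td class=\"someclass\">a</td><td>b</td>"));
        // prints: a</td><td>b
    }
}
```

So it only does the job when the input string is exactly one element.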


hth,
Markus
 
It depends on what you are really looking for.

If your document looks like this: <tr><td class="myClass">Some text</td></tr>
then regex=new Regex(@"<td\s+class=""myClass""\s*>(.*?)</td>");
gives you the result: "Some text"

If your document looks like this: <tr><td class="myClass"> <table><tr><td class="anotherClass">Other text</td></tr></table> </td></tr>
then the same regex=new Regex(@"<td\s+class=""myClass""\s*>(.*?)</td>");
gives you the result: " <table><tr><td class="anotherClass">Other text"
which is not what you are looking for.

If you remove the question mark from the regex (making the match greedy), it
works for the two previous examples but not for the following one:
<tr><td class="myClass">Some text</td></tr><tr><td class="anotherClass">Other text</td></tr>
That gives you "Some text</td></tr><tr><td class="anotherClass">Other text"
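
The three cases can be checked side by side with a small console sketch. The class, field and variable names are illustrative only, and the pattern uses \s* before the closing > so that a tag with no trailing whitespace still matches:

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyVsGreedy
{
    // Same pattern twice: lazy (.*?) versus greedy (.*).
    public const string Lazy   = @"<td\s+class=""myClass""\s*>(.*?)</td>";
    public const string Greedy = @"<td\s+class=""myClass""\s*>(.*)</td>";

    public static string Capture(string pattern, string html)
    {
        return Regex.Match(html, pattern).Groups[1].Value;
    }

    public static void Main()
    {
        string simple   = @"<tr><td class=""myClass"">Some text</td></tr>";
        string nested   = @"<tr><td class=""myClass""> <table><tr><td class=""anotherClass"">Other text</td></tr></table> </td></tr>";
        string siblings = @"<tr><td class=""myClass"">Some text</td></tr><tr><td class=""anotherClass"">Other text</td></tr>";

        foreach (string html in new[] { simple, nested, siblings })
        {
            Console.WriteLine("lazy:   [" + Capture(Lazy, html) + "]");
            Console.WriteLine("greedy: [" + Capture(Greedy, html) + "]");
        }
    }
}
```

Lazy is wrong for the nested document, greedy is wrong for the sibling cells, so neither quantifier alone handles both.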

My question is: do you have nested <td> tags? In that case, you have to
use balancing groups (balanced matching).
Could you give us the most complicated example of a file you have?

Ludovic SOEUR.
 
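I worked out the case where you can have nested tags, using balancing groups:

// Requires: using System.Text.RegularExpressions; using System.Windows.Forms;
public void showTDContent(string content) {
    // <Open> pushes on every nested <td ...>, <Close-Open> pops on every </td>;
    // (?(Open)(?!)) rejects the match while any nested <td> is still unclosed.
    Regex regex = new Regex(
        @"<td class=""someclass"">(?<tdcontent>.*?((?=<td)|(?=</td))" +
        @"(((?<Open><td.*?>).*?((?=<td)|(?=</td)))+" +
        @"((?<Close-Open></td>).*?((?=<td)|(?=</td)))+)*" +
        @"(?(Open)(?!)))</td>");
    MatchCollection matches = regex.Matches(content);
    foreach (Match match in matches) {
        string sMatch = match.Groups["tdcontent"].Value;
        MessageBox.Show(sMatch);
        showTDContent(sMatch); // Recurse to find matching cells nested inside this one
    }
}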

Here is an example:
showTDContent(@"<table><tr><td class=""someclass""><table><tr><td>someText</td><td class=""someclass"">otherText</td></tr></table></td></tr><tr><td class=""someclass"">thirdText</td></tr></table>");

It returns three strings:
1) <table><tr><td>someText</td><td class="someclass">otherText</td></tr></table>
2) otherText
3) thirdText

If you don't have matching cells nested inside one another as in the example
above, you can keep the SAME regular expression, but you don't need the recursion:
public void showTDContent(string content) {
    Regex regex = new Regex(
        @"<td class=""someclass"">(?<tdcontent>.*?((?=<td)|(?=</td))" +
        @"(((?<Open><td.*?>).*?((?=<td)|(?=</td)))+" +
        @"((?<Close-Open></td>).*?((?=<td)|(?=</td)))+)*" +
        @"(?(Open)(?!)))</td>");
    MatchCollection matches = regex.Matches(content);
    foreach (Match match in matches) {
        // Each top-level matching cell is reported once; nothing inside it is searched.
        MessageBox.Show(match.Groups["tdcontent"].Value);
    }
}

To understand the regular expression, have a look at
http://blogs.msdn.com/bclteam/archive/2005/03/15/396452.aspx. It explains
how balanced matching works, using this pattern:
<
[^<>]*
(
  (
    (?<Open><)
    [^<>]*
  )+
  (
    (?<Close-Open>>)
    [^<>]*
  )+
)*
(?(Open)(?!))
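
To see the balancing idea in isolation, here is a small sketch that wraps the pattern above in a test for balanced angle brackets. Note the assumptions: the ^ anchor, the final > and the $ anchor are my additions to turn the fragment into a complete check, and the class and method names are made up. RegexOptions.IgnorePatternWhitespace lets the pattern keep its multi-line layout and # comments.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BalancedBrackets
{
    // Assumption: "^", the trailing ">" and "$" are added here to make a whole-string test.
    static readonly Regex Balanced = new Regex(@"
        ^<
        [^<>]*
        (
            ( (?<Open><)        [^<>]* )+   # push one entry per '<'
            ( (?<Close-Open>>)  [^<>]* )+   # pop one entry per '>'
        )*
        (?(Open)(?!))                       # fail if any '<' is still unclosed
        >$", RegexOptions.IgnorePatternWhitespace);

    public static bool IsBalanced(string s)
    {
        return Balanced.IsMatch(s);
    }

    public static void Main()
    {
        Console.WriteLine(IsBalanced("<a<b>c>"));  // True
        Console.WriteLine(IsBalanced("<a<b>c"));   // False
    }
}
```

The pop group (?<Close-Open>...) fails when the Open stack is empty, which is what stops the pattern from consuming more closers than openers.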


My regex is nearly the same:
<td\s+class=\"someclass\">(?<tdcontent>
  .*?((?=<td)|(?=</td))
  (
    (
      (?<Open><td.*?>)
      .*?((?=<td)|(?=</td))
    )+
    (
      (?<Close-Open></td>)
      .*?((?=<td)|(?=</td))
    )+
  )*
  (?(Open)(?!))
)</td>

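In fact, the only differences are that the delimiters < and > become
<td.*?> and </td>, and that [^<>]* is replaced by .*?((?=<td)|(?=</td)),
which lazily consumes text up to the next opening or closing TD tag.
Everything else is exactly the same.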
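
For anyone who wants to try this without a WinForms project, here is a console-friendly sketch of the same idea. It is a variation rather than the exact code from the earlier post: the names NestedTdDemo and FindCells are made up, results are collected in a list instead of shown in message boxes, and RegexOptions.Singleline is my addition so that . also crosses line breaks.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NestedTdDemo
{
    static readonly Regex Cell = new Regex(@"
        <td\s+class=""someclass"">
        (?<tdcontent>
            .*?((?=<td)|(?=</td))                                  # text up to the next td tag
            (
                ( (?<Open><td.*?>)      .*?((?=<td)|(?=</td)) )+   # push per nested opening td
                ( (?<Close-Open></td>)  .*?((?=<td)|(?=</td)) )+   # pop per nested closing td
            )*
            (?(Open)(?!))                                          # all nested cells closed?
        )
        </td>", RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline);

    // Returns the content of every matching cell, outer cells before the cells nested in them.
    public static List<string> FindCells(string html)
    {
        var found = new List<string>();
        foreach (Match m in Cell.Matches(html))
        {
            string inner = m.Groups["tdcontent"].Value;
            found.Add(inner);
            found.AddRange(FindCells(inner)); // look for matching cells inside this one
        }
        return found;
    }

    public static void Main()
    {
        string html = @"<table><tr><td class=""someclass""><table><tr><td>someText</td><td class=""someclass"">otherText</td></tr></table></td></tr><tr><td class=""someclass"">thirdText</td></tr></table>";
        foreach (string cell in FindCells(html))
        {
            Console.WriteLine(cell);
        }
    }
}
```

For anything beyond a quick extraction like this, an HTML parser is usually more robust than a regex, since real markup can contain attributes in any order, comments and unclosed tags.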


Hope this helps,

Ludovic SOEUR.
 
Hi guys

Thank you for your help!

I solved it with <td[\ \s]class="myClass">([\s\S]*?)</td>. I do not have
nested tables, so this did the trick. Very close to some of your
suggestions.
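
For reference, a minimal sketch of that pattern in C# (the class and method names are illustrative only). Using [\s\S] instead of . means the capture can span line breaks without needing RegexOptions.Singleline:

```csharp
using System;
using System.Text.RegularExpressions;

public static class FlatTdDemo
{
    // The pattern from the post above, as a C# verbatim string.
    public const string Pattern = @"<td[\ \s]class=""myClass"">([\s\S]*?)</td>";

    public static string FirstCell(string html)
    {
        // Lazy [\s\S]*? stops at the first </td>, which is fine without nested cells.
        return Regex.Match(html, Pattern).Groups[1].Value;
    }

    public static void Main()
    {
        string html = "<tr><td class=\"myClass\">line one\n<b>line two</b></td></tr>";
        Console.WriteLine(FirstCell(html));
    }
}
```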

:o)

Regards,

Henrik
 