Regex: Capturing HTML

Guest

I am trying to strip the outermost HTML tag by capturing it with a regex
and then using the string Replace function to replace it with an empty
string. When I step through the code, the match returns the entire input
string, although testing the same pattern in The Regulator returns just what I want.

What am I doing wrong here?
***********************************************
Regex regX;
RegexOptions options = (RegexOptions.Multiline | RegexOptions.IgnoreCase);
Match rMatch;
string sX,sTag;
string regexOpening = "(?:^)(<[a-zA-Z]*\\s*[a-zA-Z0-9=\'\" ]*>).*$";
string regexClosing = "(<\\s*/[a-zA-Z0-9]*\\s*>)\\s*$\r\n";
//code removed for clarity: return a datareader here...
{
while(r.Read()){
sX = r["HeaderHTML"].ToString().Trim();
regX = new Regex(regexOpening,options);
rMatch= regX.Match(sX);
sTag = rMatch.Value.ToString(); //this returns the entire string!!
sX = sX.Replace(sTag,"");


Some sample input:
1. <TH colspan=2 align="left"><IMG src="Images/KCbanner_header.jpg"
width="800" height="90"></TH>
 
Kevin Spencer

Here's an easy one for you:

(?i)(?s)(?:<html>)(.*)(?=</html>)

This matches both the <html> tag (case-insensitively) and the body, up to
the </html> tag. The body (everything after the <html> tag) is in Group 1.

string HtmlDocument = "..."; // whatever it is
Regex r = new Regex(@"(?i)(?s)(?:<html>)(.*)(?=</html>)");
return r.Match(HtmlDocument).Groups[1].Value;
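
The same Group 1 idea can be applied to the pattern from the original post. There, Match.Value returns the whole match, including the text consumed by the trailing ".*$", which is why the entire string comes back; Groups[1].Value holds only the captured tag. The following is a minimal sketch, not a definitive fix: the class and method names are hypothetical, and StripOuterTag uses a simplified pattern that assumes no ">" appears inside an attribute value and does not check that the closing tag matches the opening one.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OuterTagDemo
{
    // The opening-tag pattern from the original post, unchanged.
    private const string OpeningPattern =
        "(?:^)(<[a-zA-Z]*\\s*[a-zA-Z0-9=\'\" ]*>).*$";

    // Hypothetical helper: returns only the captured opening tag.
    public static string CaptureOpeningTag(string html)
    {
        Match m = Regex.Match(html, OpeningPattern,
            RegexOptions.Multiline | RegexOptions.IgnoreCase);
        // m.Value is the whole match, including what ".*$" consumed;
        // m.Groups[1].Value is only the text inside the capturing parentheses.
        return m.Success ? m.Groups[1].Value : string.Empty;
    }

    // Hypothetical helper: removes a tag at the very start of the input
    // and a closing tag at the very end (assumed simplified patterns).
    public static string StripOuterTag(string html)
    {
        string trimmed = html.Trim();
        string result = Regex.Replace(trimmed, @"\A<[a-zA-Z][^>]*>", "");
        return Regex.Replace(result, @"<\s*/\s*[a-zA-Z][a-zA-Z0-9]*\s*>\z", "");
    }
}
```

Anchoring with \A and \z avoids the string Replace call entirely, so an identical tag that appears again inside the content is left alone.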


--
HTH,

Kevin Spencer
Microsoft MVP
.Net Developer
Ambiguity has a certain quality to it.
 
