Regex: Capturing HTML

Guest · Oct 11, 2005

I am trying to strip the outermost HTML tag by capturing it with a regex
and then using the string Replace function to replace it with an empty
string. While stepping through the code, I see that the Regex match returns
the entire input string, although testing the same pattern in The Regulator
returns just what I want.

What am I doing wrong here?
***********************************************
Regex regX;
RegexOptions options = (RegexOptions.Multiline | RegexOptions.IgnoreCase);
Match rMatch;
string sX, sTag;
string regexOpening = "(?:^)(<[a-zA-Z]*\\s*[a-zA-Z0-9=\'\" ]*>).*$";
string regexClosing = "(<\\s*/[a-zA-Z0-9]*\\s*>)\\s*$\r\n";
//code removed for clarity: return a datareader here...
{
    while (r.Read())
    {
        sX = r["HeaderHTML"].ToString().Trim();
        regX = new Regex(regexOpening, options);
        rMatch = regX.Match(sX);
        sTag = rMatch.Value.ToString(); //this returns the entire string!!
        sX = sX.Replace(sTag, "");
    }
}

Some sample input:
1. <TH colspan=2 align="left"><IMG src="Images/KCbanner_header.jpg"
width="800" height="90"></TH>

Kevin Spencer · Oct 11, 2005

Here's an easy one for you:

(?i)(?s)(?:<html>)(.*)(?=</html>)

This matches both the <html> tag (case-insensitively) and the body, up to
but not including the </html> tag. The body (everything after the <html>
tag) is in Group 1.

string HtmlDocument = "..."; // whatever it is
Regex r = new Regex(@"(?i)(?s)(?:<html>)(.*)(?=</html>)");
return r.Match(HtmlDocument).Groups[1].Value;
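
The same Groups[1] idea seems to explain the original problem: Match.Value is the text of the whole match, and since the opening pattern ends in .*$ the whole match runs to the end of the line, while the tag itself sits in Group 1. Below is a minimal sketch of that, using the pattern and sample input from the question on a single line. The class and method names (OpeningTagDemo, StripOpeningTag) are made up for illustration, and removing by index instead of calling Replace is my own choice, not something from the original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OpeningTagDemo
{
    // Pattern copied from the question; Group 1 is the opening tag,
    // the trailing .*$ consumes the rest of the line.
    const string RegexOpening = "(?:^)(<[a-zA-Z]*\\s*[a-zA-Z0-9=\'\" ]*>).*$";

    // Hypothetical helper: removes only the captured opening tag.
    public static string StripOpeningTag(string input)
    {
        Match m = Regex.Match(input, RegexOpening,
            RegexOptions.Multiline | RegexOptions.IgnoreCase);
        if (!m.Success)
            return input; // nothing to strip

        Group tag = m.Groups[1];
        // Remove by position so that only this one occurrence is dropped.
        return input.Remove(tag.Index, tag.Length);
    }

    public static void Main()
    {
        string input = "<TH colspan=2 align=\"left\"><IMG src=\"Images/KCbanner_header.jpg\" width=\"800\" height=\"90\"></TH>";

        Match m = Regex.Match(input, RegexOpening,
            RegexOptions.Multiline | RegexOptions.IgnoreCase);

        Console.WriteLine(m.Value == input);   // True: Value is the whole match
        Console.WriteLine(m.Groups[1].Value);  // <TH colspan=2 align="left">
        Console.WriteLine(StripOpeningTag(input));
    }
}
```

With sTag taken from rMatch.Groups[1].Value instead of rMatch.Value, the existing Replace call should then remove just the tag, though Replace removes every occurrence of that text, not only the first.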

--
HTH,

Kevin Spencer
Microsoft MVP
.Net Developer
Ambiguity has a certain quality to it.
