Removing markup from html string

keith · Oct 24, 2005

Hi,

I'm using WebClient to retrieve the contents of a particular page. I
would like to get a string containing only the page's text and no html
markup.

How can I do this? Is there a class to take care of this?

Many thanks!

Keith

Guest · Oct 24, 2005

Regex

Chris R. Timmons · Oct 25, 2005

Keith,

Here's a method that does that:

// Requires: using System.Text.RegularExpressions;

/// <summary>
/// Given a string containing HTML/XML/SGML tags, this method strips
/// out all of the tags and returns the remaining text.
/// </summary>
/// <param name="html">
/// The HTML/XML/SGML to search.
/// </param>
/// <returns>
/// The <c>html</c> text stripped of all tags. If <c>html</c> is null, empty,
/// or only whitespace, then <c>html</c> is returned unchanged.
/// </returns>
public static string GetTextStrippedOfHtml(string html)
{
    if ((html == null) || (html.Trim().Length == 0))
        return html;

    // IgnorePatternWhitespace lets the pattern span several lines and carry # comments.
    return Regex.Replace(html, @"
        <       # Tag's opening less-than sign.
        [^>]+?  # One or more characters that aren't a tag's closing greater-than sign (non-greedy).
        >       # Tag's closing greater-than sign.",
        string.Empty,
        RegexOptions.Singleline |
        RegexOptions.IgnoreCase |
        RegexOptions.IgnorePatternWhitespace);
}
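
For anyone wiring this up to WebClient as in the original question, here is a minimal sketch of how the pieces might fit together. The class name HtmlTextSketch and the helpers ToPlainText and DownloadPlainText are made up for illustration and are not part of the method above. It assumes a runtime where System.Net.WebUtility is available (.NET Framework 4.0 or later), and it adds an entity-decoding step that the method above does not do.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlTextSketch
{
    // Same tag-stripping idea as the method above, written compactly:
    // "<", one or more non-">" characters, then ">".
    private static readonly Regex TagPattern =
        new Regex(@"<[^>]+?>", RegexOptions.Singleline);

    // Hypothetical helper: removes tags, then decodes entities such as &amp;.
    public static string ToPlainText(string html)
    {
        if (string.IsNullOrEmpty(html))
            return html;

        string withoutTags = TagPattern.Replace(html, string.Empty);
        return WebUtility.HtmlDecode(withoutTags);
    }

    // Hypothetical helper: downloads a page with WebClient and returns its text.
    // The URL is supplied by the caller; nothing here is specific to any site.
    public static string DownloadPlainText(string url)
    {
        using (WebClient client = new WebClient())
        {
            return ToPlainText(client.DownloadString(url));
        }
    }
}
```

Calling HtmlTextSketch.ToPlainText("<p>Tom &amp; Jerry</p>") gives "Tom & Jerry". Keep in mind that a tag-stripping regex only removes the tags themselves: the text inside script and style elements is left in the result, so a real page may need those blocks removed first.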

keith · Oct 25, 2005

Thanks guys!

Chris, great code, works like a charm.

Keith
