Removing markup from html string

  • Thread starter: keith

Hi,

I'm using WebClient to retrieve the contents of a particular page. I
would like to get a string containing only the page's text and no html
markup.

How can I do this? Is there a class to take care of this?

Many thanks!

Keith
 
Keith,

Here's a method that does that:


// Requires: using System.Text.RegularExpressions;

/// <summary>
/// Given a string containing HTML/XML/SGML tags, this method strips
/// out all of the tags and returns the remaining text.
/// </summary>
/// <param name="html">
/// The HTML/XML/SGML to search.
/// </param>
/// <returns>
/// The <c>html</c> text stripped of all tags. If <c>html</c> is null, empty,
/// or only whitespace, then <c>html</c> is returned unchanged.
/// </returns>
public static string GetTextStrippedOfHtml(string html)
{
    // Nothing to strip from a null or blank string.
    if ((html == null) || (html.Trim().Length == 0))
        return html;

    // IgnorePatternWhitespace lets the pattern span lines and carry # comments.
    return Regex.Replace(html, @"
        <       # Tag's opening less-than sign.
        [^>]+?  # One or more characters that aren't a tag's closing greater-than sign (non-greedy).
        >       # Tag's closing greater-than sign.",
        string.Empty,
        RegexOptions.Singleline |
        RegexOptions.IgnoreCase |
        RegexOptions.IgnorePatternWhitespace);
}
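
To tie this back to the WebClient part of the question, here is a rough sketch of how the download and the stripping might fit together. The class HtmlTextExample and the methods ToPlainText and DownloadPlainText are hypothetical names made up for this example, not part of the framework. Two things go beyond the method above and are assumptions about what you probably want: the contents of script and style elements are dropped (otherwise the JavaScript and CSS end up in the text), and entities such as &amp; are decoded with WebUtility.HtmlDecode. A regex is not a real HTML parser, so expect odd pages to trip it up.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlTextExample
{
    // Hypothetical helper: strips tags, drops script/style contents,
    // decodes entities, and collapses whitespace.
    public static string ToPlainText(string html)
    {
        if (string.IsNullOrWhiteSpace(html))
            return html;

        // Remove <script> and <style> elements together with their contents.
        string text = Regex.Replace(
            html,
            @"<(script|style)\b[^>]*>.*?</\1\s*>",
            string.Empty,
            RegexOptions.Singleline | RegexOptions.IgnoreCase);

        // Remove every remaining tag.
        text = Regex.Replace(text, @"<[^>]+>", string.Empty, RegexOptions.Singleline);

        // Turn entities such as &amp; back into plain characters.
        text = WebUtility.HtmlDecode(text);

        // Collapse the runs of whitespace left behind by the removed markup.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }

    // Hypothetical helper: downloads a page with WebClient and returns its text.
    public static string DownloadPlainText(string url)
    {
        using (var client = new WebClient())
        {
            return ToPlainText(client.DownloadString(url));
        }
    }

    public static void Main()
    {
        string sample =
            "<html><head><style>p { color: red; }</style></head>" +
            "<body><p>Fish &amp; chips</p><script>var x = 1;</script></body></html>";

        Console.WriteLine(ToPlainText(sample)); // prints: Fish & chips
    }
}
```

For a live page you would call HtmlTextExample.DownloadPlainText with the page's address instead of passing a literal string.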
 
