Extract plain text out of HTML page

Guogang · Oct 24, 2003

Hi,

I need to extract plain text from an HTML page (i.e. without images, HTML
formatting, ...).

Is there a C# class or function that can help me with this?

Thanks,
Guogang

Dmitriy Lapshin [C# / .NET MVP] · Oct 24, 2003

Hi,

The dumbest solution would probably be to use a regular expression to cut
out any construct of the form "<...>" from the HTML file. You could use the
following RegExp:

<[^>]+>

to replace all tags with empty strings.

Chris R. Timmons · Oct 24, 2003

Guogang,

A regular expression can be used to strip out all HTML tags:

using System.Text.RegularExpressions;
...
string plainText = Regex.Replace(htmlText, "<[^>]+?>", "");

Hope this helps.

Chris.
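
A slightly fuller sketch of the same regex idea, for readers who also want script/style contents removed and entities such as &amp; decoded. The class name HtmlText and method name Strip are made up for illustration, and WebUtility.HtmlDecode assumes a newer framework (.NET 4.0 or later) than this thread dates from. Regular expressions do not truly parse HTML, so treat this as a rough approximation rather than a robust converter.

```csharp
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the .NET class library.
static class HtmlText
{
    public static string Strip(string html)
    {
        // Drop <script> and <style> elements together with their contents.
        string s = Regex.Replace(html, @"<(script|style)\b[^>]*>.*?</\1\s*>", "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Drop HTML comments.
        s = Regex.Replace(s, @"<!--.*?-->", "", RegexOptions.Singleline);

        // Replace every remaining tag with a space so that text from
        // adjacent elements does not run together.
        s = Regex.Replace(s, @"<[^>]+>", " ");

        // Turn entities such as &amp; or &lt; back into characters.
        s = WebUtility.HtmlDecode(s);

        // Collapse runs of whitespace into single spaces.
        return Regex.Replace(s, @"\s+", " ").Trim();
    }
}
```

Replacing tags with a space rather than an empty string is a trade-off: it keeps the text of neighbouring paragraphs apart, but it also splits a word that has inline markup in the middle of it.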

Morten Wennevik · Oct 24, 2003

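I'm not familiar with the Regex class, which may be more suitable, but you
could strip HTML tags with something like this (htmlPage is assumed to be a
string field holding the page):

int i;

// Keep going until no "<" is left in htmlPage.
while((i = htmlPage.IndexOf("<")) != -1)
{
strip(i);
}

// Removes the tag that starts at index i.
private void strip(int i)
{
int a = htmlPage.IndexOf("<", i + 1);
int b = htmlPage.IndexOf(">", i);
if(a != -1 && a < b) // another "<" before the closing ">", so strip the inner one first
{
strip(a);
b = htmlPage.IndexOf(">", i);
}
if(b == -1) // unterminated tag: cut to the end of the page
b = htmlPage.Length - 1;
htmlPage = htmlPage.Remove(i, b - i + 1); // strip away everything from i to b
}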

Girish Bharadwaj · Oct 24, 2003

You can try writing a simple XSL stylesheet that transforms HTML to text. Of
course, for this to work, you need to make sure that the HTML is
well-formed. Otherwise, use the other suggestions.
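
A minimal sketch of the XSL approach, under several assumptions: the input is well-formed XML with no DOCTYPE, no HTML-only entities such as &nbsp;, and no XHTML namespace on its elements. The class name XslTextExtractor and the stylesheet are illustrative only. XslCompiledTransform is a later addition to the framework; the older XslTransform class would be the counterpart at the time of this thread.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

// Hypothetical helper, not part of the .NET class library.
static class XslTextExtractor
{
    // With method="text", XSLT's built-in template rules already copy every
    // text node to the output; the only explicit rule needed is one that
    // suppresses the contents of script and style elements.
    const string Stylesheet =
@"<xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
  <xsl:output method=""text""/>
  <xsl:template match=""script|style""/>
</xsl:stylesheet>";

    public static string ToPlainText(string wellFormedHtml)
    {
        var xslt = new XslCompiledTransform();
        using (var xsltReader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            xslt.Load(xsltReader);
        }

        // XmlReader throws an XmlException if the markup is not well-formed.
        using (var input = XmlReader.Create(new StringReader(wellFormedHtml)))
        using (var output = new StringWriter())
        {
            xslt.Transform(input, null, output);
            return output.ToString();
        }
    }
}
```

Most pages found in the wild are not well-formed, so in practice this tends to be useful only for markup you generate or clean up yourself.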