Extract plain text out of HTML page

Guogang

Hi,

I need to extract plain text from an HTML page (i.e. without images, HTML
formatting, ...)

Is there some C# class/function that can help me with this?

Thanks,
Guogang
 
Dmitriy Lapshin [C# / .NET MVP]

Hi,

The dumbest solution would probably be to employ a regular expression to cut
out any construct of the form "<...>" from the HTML file. You could use the
following RegExp:

<[^>]+>

to replace all tags with empty strings.
 
Chris R. Timmons

Guogang,

A regular expression can be used to strip out all HTML tags:

using System.Text.RegularExpressions;
...
string plainText = Regex.Replace(htmlText, "<[^>]+?>", "");
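
The one-line replacement above leaves two things behind: the contents of script and style elements, and entities such as &amp;amp;. The following is only a sketch of one way to handle both, not a full HTML parser. The class name HtmlText and the method name ToPlainText are made up for illustration, and WebUtility.HtmlDecode assumes .NET 4.0 or later.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Hypothetical helper, not part of the .NET class library.
    public static string ToPlainText(string html)
    {
        // Remove <script> and <style> elements together with their contents.
        string text = Regex.Replace(
            html,
            @"<(script|style)\b[^>]*>.*?</\1\s*>",
            " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Remove the remaining tags with the same pattern as above.
        // A space is used instead of "" so adjacent blocks do not run together.
        text = Regex.Replace(text, "<[^>]+?>", " ");

        // Decode entities only after the tags are gone, so that "&lt;b&gt;"
        // ends up as the literal text "<b>" instead of being stripped.
        text = WebUtility.HtmlDecode(text);

        // Collapse runs of whitespace into single spaces.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

Replacing tags with a space keeps paragraphs apart but also splits words that contain inline tags ("<b>bo</b>ld" becomes "bo ld"); use "" as in the original line if that matters more.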

Hope this helps.

Chris.
 
Morten Wennevik

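I'm not familiar with the Regex class, which may be more suitable, but you
could strip HTML tags with something like this:

int i;

// Keep removing tags until no '<' is left in htmlPage (a string field).
while((i = htmlPage.IndexOf('<')) != -1)
{
    htmlPage = strip(htmlPage, i);
}

// Removes one tag from text, where i is the position of a '<'.
private static string strip(string text, int i)
{
    int a = text.IndexOf('<', i + 1); // next '<' after the current one
    int b = text.IndexOf('>', i);     // first '>' after the current '<'
    if(b == -1) // unterminated tag, so drop the rest of the text
        return text.Remove(i);
    if(a != -1 && a < b) // nested tag, so strip the inner one first
        return strip(text, a);
    return text.Remove(i, b - i + 1); // strip away everything from i to b
}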
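
The same index-based idea can also be done in a single pass, which avoids rebuilding the string once per tag. This is a sketch under the same assumptions as the code above (every '<' opens a tag, nesting is allowed); TagScanner and StripTags are hypothetical names.

```csharp
using System.Text;

public static class TagScanner
{
    // Hypothetical helper: copies every character that lies outside "<...>".
    public static string StripTags(string html)
    {
        var result = new StringBuilder(html.Length);
        int depth = 0; // number of '<' seen that are still waiting for a '>'

        foreach (char c in html)
        {
            if (c == '<')
                depth++;               // entering a (possibly nested) tag
            else if (c == '>' && depth > 0)
                depth--;               // leaving a tag
            else if (depth == 0)
                result.Append(c);      // plain text, keep it
        }
        return result.ToString();
    }
}
```

As with the loop above, an unterminated '<' causes the rest of the input to be dropped, so a stray "a < b" in the text is not handled gracefully.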
 
Girish Bharadwaj

You can try writing a simple XSL stylesheet that transforms the HTML to text.
Of course, for this to work, you need to make sure that the HTML is
well-formed. Otherwise, use the other suggestions.
 
