Counting string tokens precisely in an HTML document

Guest

Hi there,

I have an HTML file like this:
----------------------------------------
<body>
<h1>Home Page</h1>
<p>
Welcome<br>To<br>

<br> My Home Page
</p>
--------------------------------------------

I want to know the exact number of string tokens.

It should discard the new lines (but completely discarding them would merge
the words separated by new lines), collapse runs of extra white space,
etc.

Any help would be appreciated.

I wrote this function initially, but it only works when words are separated by a single space.

public int count_body(string s)
{
    char[] sp = {' '};
    int count = s.Split(sp).Length;
    return count;
}

Thanks!!
 
kman,

You are better off using an HTML parser for something like this. You
can use MSHTML (Microsoft's HTML parser) through COM interop and then read
the innerText property to get just the text of the document. That text is
easy to split into tokens: the words should come out properly separated, even
where only BR tags sit between them, because the tags themselves do not
appear in innerText.
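
Once you have the plain text, the counting part could look roughly like the sketch below. It relies on String.Split treating a null separator array as "split on any white space", combined with StringSplitOptions.RemoveEmptyEntries so that repeated spaces and blank lines do not produce empty tokens. The names TokenCounter, StripTags and CountTokens are made up for illustration. StripTags is only a crude regex stand-in for a real parser's innerText, so the sample runs without MSHTML; it does not decode entities such as &nbsp; or skip script and style content.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TokenCounter
{
    // Hypothetical helper: a crude substitute for innerText.
    // Each tag is replaced by a space so that words separated only by
    // a tag such as <br> do not merge into one token.
    public static string StripTags(string html)
    {
        return Regex.Replace(html, "<[^>]*>", " ");
    }

    // Counts white-space-separated tokens.
    // A null separator array makes Split use any white-space character
    // (spaces, tabs, new lines) as a delimiter; RemoveEmptyEntries drops
    // the empty strings that consecutive delimiters would otherwise yield.
    public static int CountTokens(string text)
    {
        return text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries).Length;
    }
}
```

With the sample page from the question, TokenCounter.CountTokens(TokenCounter.StripTags(html)) gives 7 (Home, Page, Welcome, To, My, Home, Page). If you go the MSHTML route, pass the innerText string to CountTokens directly and drop StripTags.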

Hope this helps.
 