Counting string tokens precisely in an HTML document

  • Thread starter: Guest

Guest

Hi there,

I have an HTML file like this:
----------------------------------------
<body>
<h1>Home Page</h1>
<p>
Welcome<br>To<br>

<br> My Home Page
</p>
--------------------------------------------

I want to know the exact number of string tokens in it.

The count should ignore newlines (though simply stripping them would merge
words that are separated only by a newline), collapse runs of extra white
space, and so on.

Any help would be appreciated.

I initially wrote this function, but it only works when words are separated by a single space.

public int count_body(string s)
{
    char[] sp = {' '};
    int count = s.Split(sp).Length;
    return count;
}

Thanks!!
 
kman,

You are better off using an HTML parser for something like this. You
can use MSHTML (Microsoft's HTML parser) through COM interop and then read
the innerText property to get just the text of the document. That text is
easy to split into tokens: the BR tags do not appear in innerText, but the
words they separate should still come out properly separated.
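
Once you have the plain text, counting the tokens could look roughly like the sketch below. It assumes the text has already been extracted (for example from innerText); the TokenCounter class and CountTokens method are hypothetical names used only for illustration. Passing a null separator array makes Split treat every white-space character (spaces, tabs, newlines) as a delimiter, and RemoveEmptyEntries drops the empty strings that runs of white space would otherwise produce.

```csharp
using System;

// Hypothetical helper, not part of MSHTML or the original post.
public static class TokenCounter
{
    // Counts white-space-separated tokens in already-extracted plain text.
    public static int CountTokens(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return 0;
        }

        // A null separator array means "split on any white-space character".
        // RemoveEmptyEntries discards the empty strings produced by
        // consecutive spaces or blank lines.
        string[] tokens = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return tokens.Length;
    }
}
```

For instance, TokenCounter.CountTokens("Home Page\r\nWelcome\r\nTo\r\n\r\n My Home Page") returns 7, whereas splitting on a single ' ' would miscount because of the newlines and the extra space.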

Hope this helps.
 
