reading text from .htm file

mp

I just made a quickie app to browse a folder(s) of .htm files
I put a webBrowser control on a form and a previous and next button, and a
save button.

my aim is to quickly browse through the files, and if i find a page that has
useful info on it i want to save the "text information" to a text
file (ASCII); that's what the "save" button will be for.

I found the property webBrowser1.DocumentText, which returns the html text of
that page as a string.
this string includes all the html tags etc., not just the written "text"
showing up on the page.

what i'm wondering is do i need to write a routine to parse the html tags to
detect what is the actual "text" showing up on the page (as opposed to
attributes, formatting, etc)...or is there some kind of existing object that
can parse that text and return the actual "words" which appear on the page?

in other words if i look at a page in the browser, i could manually
copy/paste the "text" that shows up on the page.
is there a built-in way to do that programmatically or do i need to create my
own html parsing routine?

thanks
mark
 
Arne Vajhøj

If you need to parse all types of pages and not some simple
standardized pages, then writing your own parser will be a lot
of work.

Many people have very bad experiences with the embedded browser
component.

Most people seem to like:
http://htmlagilitypack.codeplex.com/

Arne
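
As a rough illustration of the Html Agility Pack route suggested above, here is a minimal sketch that turns an HTML string (for example the value of webBrowser1.DocumentText) into plain text. The class name HtmlTextExtractor and the method ExtractText are hypothetical names made up for this example, and the code assumes the third-party HtmlAgilityPack package is referenced by the project. It is one possible approach, not necessarily what the posters had in mind.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // third-party package, assumed to be referenced

public static class HtmlTextExtractor
{
    // Hypothetical helper: returns the text content of an HTML string
    // with tags, scripts, styles and comments removed.
    public static string ExtractText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Collect nodes whose content is never shown as page text.
        // ToList() takes a snapshot so the tree can be modified safely.
        var hidden = doc.DocumentNode
            .Descendants()
            .Where(n => n.Name == "script"
                     || n.Name == "style"
                     || n.NodeType == HtmlNodeType.Comment)
            .ToList();

        foreach (var node in hidden)
        {
            node.Remove();
        }

        // InnerText concatenates the remaining text nodes;
        // DeEntitize converts entities such as &amp; back to characters.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText).Trim();
    }
}
```

Note that this keeps whitespace roughly as it appears in the markup, so text from adjacent elements can run together; collapsing or inserting line breaks would be an extra step.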
 
mp

Peter Duniho said:
[...]
what i'm wondering is do i need to write a routine to parse the html tags to
detect what is the actual "text" showing up on the page (as opposed to
attributes, formatting, etc)...or is there some kind of existing object that
can parse that text and return the actual "words" which appear on the page?
[...]
I would say that in general, you do not want to waste your time writing
your own HTML parser. If the HtmlDocument doesn't provide sufficient
capability for your needs, I would look at third-party libraries, such as
the Html Agility Pack.

Pete

thanks i'll check that out
mark
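
For the built-in route mentioned in the quoted reply, the WebBrowser control exposes the loaded page as an HtmlDocument, and the InnerText of its Body gives the text without the tags. The sketch below shows how a "save" button handler might use it. PageTextSaver, SaveVisibleText and WriteAscii are hypothetical names for this example, and whether InnerText is good enough for a given set of pages is something to check by trying it.

```csharp
using System;
using System.IO;
using System.Text;
using System.Windows.Forms;

public static class PageTextSaver
{
    // Hypothetical helper, intended to be called from the save button's
    // Click handler, e.g. SaveVisibleText(webBrowser1, @"C:\temp\page.txt").
    public static void SaveVisibleText(WebBrowser browser, string path)
    {
        // Document is null until a page has loaded, and Body can be
        // null for some pages, so check both before reading the text.
        HtmlDocument doc = browser.Document;
        if (doc == null || doc.Body == null)
        {
            return;
        }

        // InnerText returns the element's text content without markup.
        string text = doc.Body.InnerText ?? string.Empty;
        WriteAscii(text, path);
    }

    // Writes the text as ASCII. Characters outside the ASCII range
    // are replaced with '?' by Encoding.ASCII.
    public static void WriteAscii(string text, string path)
    {
        File.WriteAllText(path, text, Encoding.ASCII);
    }
}
```

If the pages contain accented or other non-ASCII characters, Encoding.UTF8 would preserve them where Encoding.ASCII does not.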
 
mp

Arne Vajhøj said:
I just made a quickie app to browse a folder(s) of .htm files
I put a webBrowser control on a form and a previous and next button, and a
save button.
[..]
If you need to parse all types of pages and not some simple
standardized pages, then writing your own parser will be a lot
of work.

Many people have very bad experiences with the embedded browser
component.

Most people seem to like:
http://htmlagilitypack.codeplex.com/

Arne

thanks, i'll check that out
mark