Find text within HTML file

  • Thread starter: Piotrekk

Piotrekk

Hi

Given a keyword, I need to search an HTML file for it, ignoring all
the tags and checking only the plain text.
Is there an easy way to do this in C#?

Thanks
PK
 
This will not do what I asked for.
This method only opens the file and reads its text. I need to find text between
the HTML tags: the text visible to the user opening the page.
 
Hi Alex,

Hmm, yeah, sorry. The simplest way is to match a Regex like "search_string(?=[^>]*<)".
Anything beyond that depends on the properties of the HTML (whether it is valid,
which tags should be ignored, and so on).

Regards, Alex
[TechBlog] http://devkids.blogspot.com
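
The lookahead pattern suggested above can be tried out with a few lines of C#. This is only a minimal sketch: the class name VisibleTextSearch, the method name ContainsOutsideTags, and the case-insensitive matching are assumptions added for illustration, not part of the original suggestion.

```csharp
using System;
using System.Text.RegularExpressions;

static class VisibleTextSearch
{
    // Returns true if the keyword occurs at a position where the next
    // angle bracket is '<', i.e. the match does not sit inside a tag.
    public static bool ContainsOutsideTags(string html, string keyword)
    {
        // Regex.Escape keeps characters such as '.' or '+' in the keyword literal.
        string pattern = Regex.Escape(keyword) + @"(?=[^>]*<)";
        // IgnoreCase is an assumption; drop it for exact-case matching.
        return Regex.IsMatch(html, pattern, RegexOptions.IgnoreCase);
    }
}
```

Note that the lookahead requires a '<' somewhere after the keyword, so text that follows the last tag in the file is not matched, and the contents of script or style elements still count as "text". Whether that matters depends on the HTML being searched.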


Hi Piotrekk,

For example,
System.IO.File.ReadAllText(@"C:\text.txt").Contains("something")

Regards, Alex
[TechBlog] http://devkids.blogspot.com
 
Piotrekk,

I would use the MSHTML.HTMLDocument class through COM interop to load the
file from disk (you can navigate to it directly). Once you have that, get the
IHTMLElement implementation for the body element through the body property
on the document. You can then read the innerText property to get the text of
the document (without tags).
 
You could use a Regex.Replace statement with the correct regular expression to
"clean" all the HTML tags from the text string of the HTML page, but that
might not even be necessary, since it is unlikely your keyword will be found
in HTML tag names or attributes.
Have you tried just:
int foundPosition = myHtmlString.IndexOf(keyWord) ... ?
This will return the first position of the keyword, or -1 if not found.
-- Peter
Recursion: see Recursion
site: http://www.eggheadcafe.com
unBlog: http://petesbloggerama.blogspot.com
BlogMetaFinder: http://www.blogmetafinder.com
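
The Regex.Replace plus IndexOf idea above might look roughly like the following. It is a sketch under stated assumptions: the tag pattern <[^>]+> is one common choice rather than the "correct" expression the post alludes to, and the names TagStripper, StripTags and FindInText, the entity decoding, and the case-insensitive comparison are all additions made for illustration.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

static class TagStripper
{
    // Removes anything that looks like a tag and decodes HTML entities.
    public static string StripTags(string html)
    {
        // Replace each tag with a space so adjacent words do not run together.
        string noTags = Regex.Replace(html, @"<[^>]+>", " ");
        // Decode entities such as &amp; so they compare equal to plain text.
        return WebUtility.HtmlDecode(noTags);
    }

    // Returns the position of the keyword in the stripped text, or -1.
    public static int FindInText(string html, string keyword)
    {
        return StripTags(html).IndexOf(keyword, StringComparison.OrdinalIgnoreCase);
    }
}
```

The position returned refers to the stripped text, not to the original HTML string. The simple tag pattern also leaves the contents of script and style elements in place, so those would need separate handling if they should be ignored.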
 
Note: if you also need to search for keywords in ALT text, or for a
title (which is outside the body tag) make sure you adapt your Regex/
search strategy accordingly.
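
As a rough illustration of that note, the title and ALT text could be collected separately and searched alongside the body text. The sketch below is an assumption-laden example: the names ExtraTextSources and TitleAndAltText are hypothetical, and the patterns only handle double-quoted alt attributes and a single title element.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ExtraTextSources
{
    // Collects the <title> content and any double-quoted alt attribute values.
    public static List<string> TitleAndAltText(string html)
    {
        var results = new List<string>();

        // Singleline lets '.' span line breaks inside the title element.
        Match title = Regex.Match(html, @"<title[^>]*>(.*?)</title>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        if (title.Success)
            results.Add(title.Groups[1].Value.Trim());

        // Only look for alt="..." inside a tag, after some whitespace.
        foreach (Match m in Regex.Matches(html,
            @"<[^>]*?\salt\s*=\s*""([^""]*)""", RegexOptions.IgnoreCase))
        {
            results.Add(m.Groups[1].Value);
        }
        return results;
    }
}
```

Each returned string can then be checked with IndexOf in the same way as the body text.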
 