c#.NET get text between body tags of an html file

 
 
rhitam
      5th May 2009
Hi all,

I am trying to read an HTML file and retrieve only the text between
the body tags of that file. For reading a string between two
strings, I already have a function:

http://www.mycsharpcorner.com/Post.aspx?postID=15

But the problem is that the body tag might have some attributes. In
that case I don't know how to exclude them and get only the text
between the tags, i.e. something like this:

<body style="margin:0;padding:0">
...
..
..
..
</body>

Any ideas?

Regards,
Rhitam
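
[Editor's sketch] One way to handle a body tag that carries attributes is a regular expression that skips everything up to the closing `>` of the opening tag. This is only a minimal sketch, not the function from the linked post: the names `BodyExtractor` and `GetBodyInnerHtml` are hypothetical, and the pattern assumes no attribute value contains a literal `>` and that the file has a single body element. For arbitrary real-world HTML a proper parser is more robust.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library mentioned in the thread.
public static class BodyExtractor
{
    // "<body", then any attributes up to ">", then a lazy capture of
    // everything up to the first "</body>". Singleline lets "." match newlines.
    private static readonly Regex BodyPattern = new Regex(
        @"<body\b[^>]*>(?<inner>.*?)</body\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the markup between the body tags, or null if no body element is found.
    public static string GetBodyInnerHtml(string html)
    {
        Match m = BodyPattern.Match(html);
        return m.Success ? m.Groups["inner"].Value : null;
    }
}
```

Called on the sample above, `BodyExtractor.GetBodyInnerHtml(html)` would return whatever sits between `<body style="margin:0;padding:0">` and `</body>`, with the attributes excluded.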

 
Cor Ligthert[MVP]
      5th May 2009
Be aware that what you ask is almost impossible, because there is usually
not only text between the body tags, but also images, Flash, JavaScript, etc.

But to get the things between the body tags you need MSHTML (the namespace
around the DOM); how you use it depends on how you retrieve the page.

Cor

 
rhitam
      5th May 2009
On May 5, 3:37 pm, "Cor Ligthert[MVP]" <Notmyfirstn...@planet.nl>
wrote:
> Be aware that what you ask is almost impossible, because there is usually
> not only text between the body tags, but also images, Flash, JavaScript, etc.
>
> But to get the things between the body tags you need MSHTML (the namespace
> around the DOM); how you use it depends on how you retrieve the page.
>
> Cor


All the HTML pages I need to parse are already located on the same
machine as the server. Actually, I am trying to create a Word document
using XML and an XSL transform with C#. That part is done, and only
to provide a set of offline content, I have to append the HTML
contents of a set of web pages at the end of the document. Do you still
think MSHTML is the only way? I tried searching for htmlcontainerclass
but could not find any useful code sample. Maybe someone could provide
a code sample? I will be using the C# code in a DLL which will be
called from classic ASP.

-Rhitam

 
Cor Ligthert[MVP]
      5th May 2009
http://msdn.microsoft.com/en-us/library/bb498651(VS.85).aspx

Have a look at this one; the document is an MSHTML document:
http://msdn.microsoft.com/en-us/libr....document.aspx


Cor

 
rhitam
      5th May 2009
On May 5, 4:39 pm, "Cor Ligthert[MVP]" <Notmyfirstn...@planet.nl>
wrote:
> http://msdn.microsoft.com/en-us/library/bb498651(VS.85).aspx
>
> Have a look at this one, the document is a mshtml document
> http://msdn.microsoft.com/en-us/library/system.windows.forms.webbrows...


That was helpful, but I am still a little stuck. I wrote the
following code in a simple C# .NET console application using Visual C#
2005 Express Edition:


StreamReader TopLinkStream = new StreamReader(FilePath);
string TopLinkHtml = TopLinkStream.ReadToEnd();
IHTMLDocument2 doc = new HTMLDocumentClass();
doc.write(new object[] { TopLinkHtml }); // -- throws error here
HTMLDocumentClass domdoc = (HTMLDocumentClass)doc;
string BodyElem = domdoc.body.innerHTML;


The debugger stops with an error at the line indicated, i.e.


doc.write(new object[] { TopLinkHtml });


At this point IE shows an error saying 'Object expected'. If I just
click 'No', debugging proceeds, and the inner HTML is
loaded correctly into the 'doc' variable. How do I avoid that error?


Regards,

Rhitam
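
[Editor's sketch] 'Object expected' is a script runtime error message, so one plausible cause, which the thread does not confirm, is that script blocks in the page run when the markup is written into the MSHTML document outside a browser. Under that assumption, removing the script elements from the string before calling `doc.write` may avoid the dialog. The names `ScriptStripper` and `RemoveScripts` below are hypothetical, and the pattern only covers `<script>...</script>` blocks; inline handlers such as `onload="..."` are left untouched.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of MSHTML or any library mentioned in the thread.
public static class ScriptStripper
{
    // Matches a whole script element, including its attributes and content.
    // Singleline lets "." match newlines inside multi-line scripts.
    private static readonly Regex ScriptBlock = new Regex(
        @"<script\b[^>]*>.*?</script\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the markup with every <script>...</script> block removed.
    public static string RemoveScripts(string html)
    {
        return ScriptBlock.Replace(html, string.Empty);
    }
}
```

With this in place, the failing line in the post would become `doc.write(new object[] { ScriptStripper.RemoveScripts(TopLinkHtml) });`.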







 