Parse HTML DOM document in console application

  • Thread starter: John Williams

John Williams

How do I load an HTML page (via URL) and parse the DOM in a Console
Application?

I've successfully done all this in a Windows Application by using the
WebBrowser control, calling the Navigate method on the specified URL, and
then, within the DocumentComplete event, parsing the HTML page using
mshtml.HTMLDocument.

I'm writing it as a console app because I don't need to display the HTML,
just search for a specific tag and retrieve an href value from it.

Thanks for any help on this.
 
John Williams said:
How do I load an HTML page (via URL) and parse the DOM in a Console
Application?

I found the following thread (note the * at the end is part of the URL)

http://groups.google.com/groups?hl=...roup%3Dmicrosoft.public.dotnet.languages.vb.*

but was unable to make the solution by Charles Law work on my machine (I
have defined the IPersistStreamInit interface). In my code readyState is
always "loading", so it loops indefinitely at:

    Do Until objDocument.readyState = "complete"
        Application.DoEvents()
    Loop
 
Charles Law said:
Hi John

I have made a simple console app that demonstrates loading HTML from a URL,
based on the thread you found below. It works on my machine, but gives an
unrelated error about being unable to set focus. Just ignore the error and
it will continue normally.

Let me know if you have problems getting the zip file and I will mail it
instead.

HTH

Charles, thanks for your reply and the sample code. Your code works fine
when run in the VS IDE. However, when run from a command window, it sits in
the loop:

    Do Until objDocument.readyState = "complete"
        Application.DoEvents()
    Loop

because readyState is "loading", then "uninitialized", never "complete". If
I comment out Application.DoEvents(), readyState stays "loading". I don't
understand this!

Thanks.
 
Hi John

Unfortunately I don't get the same problem. I opened a command window and
ran the executable. I have ZoneAlarm running, so it warned me that the
application was trying to access the internet. I allowed it to continue and
then I got an error about setting focus (as I mentioned). I clicked on No
and the command window filled with the HTML.

I am running XP Pro with SP2, and .NET Framework 1.1 SP1. I also have IE6
installed. What are you running with?

Charles
 
Just start with a Windows app, delete the code that the wizard generates,
and replace it with the code you would normally get from the console
wizard. I don't think you will save anything by not using a window; the
.NET overhead is there whether you create windows or not, I think.
 
Hi Charles,

After more investigation, my Debug version works fine from a command window.
It's my Release version which sits in the loop, which probably means
something isn't being initialised. I then found this:

http://www.google.com/groups?hl=zh-cn&lr=&[email protected]

which says:
<quote>
I then checked the ReadyState property in a loop, and it was
returning 1 ("loading") all the time.

I tracked the problem down to my CoInitialize() call. The plain old
CoInitialize(NULL) didn't work but when I replaced it with the following,
everything started working fine:

CoInitializeEx(NULL,COINIT_MULTITHREADED);
</quote>

Do you know how to implement or call CoInitializeEx in a VB.NET program, if
that is in fact what I need?

Thanks.
 
Hi John

Yes, I see what you mean. I have modified the application slightly so it now
works in a release build outside the IDE. I have removed the DoEvents call
because it requires the Windows Forms assembly. HTML documents are loaded
asynchronously (on another thread), so all we really need is to set the
apartment to multithreaded and then sleep in the loop while we wait for the
document to load.

HTH

Charles
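
For readers without the attachment, the approach described above (multithreaded apartment plus a sleeping wait loop) might look roughly like the sketch below. This is not Charles's code. WaitForComplete, its timeout parameter, and the stand-in state function are hypothetical names made up for illustration, and the sketch assumes a compiler recent enough for lambdas (VB 9 or later). In a real program the delegate would read objDocument.readyState instead of the simulated value.

```vbnet
Imports System
Imports System.Diagnostics
Imports System.Threading

Module WaitSketch

    ' Polls getReadyState until it returns "complete" or the timeout expires.
    ' Returns True if "complete" was seen, False if the timeout elapsed first.
    Function WaitForComplete(ByVal getReadyState As Func(Of String), _
                             ByVal timeoutMs As Integer) As Boolean
        Dim timer As Stopwatch = Stopwatch.StartNew()
        Do Until getReadyState() = "complete"
            If timer.ElapsedMilliseconds > timeoutMs Then Return False
            ' Sleep instead of calling Application.DoEvents(), so no
            ' reference to System.Windows.Forms is needed.
            Thread.Sleep(100)
        Loop
        Return True
    End Function

    ' MTAThread places the main thread in the multithreaded COM apartment.
    <MTAThread()> _
    Sub Main()
        ' Stand-in for objDocument.readyState (assumption for this sketch):
        ' reports "loading" for about one second, then "complete".
        Dim started As Stopwatch = Stopwatch.StartNew()
        Dim fakeState As Func(Of String) = _
            Function() If(started.ElapsedMilliseconds < 1000, "loading", "complete")

        If WaitForComplete(fakeState, 5000) Then
            Console.WriteLine("Document loaded")
        Else
            Console.WriteLine("Timed out waiting for the document")
        End If
    End Sub

End Module
```

The MTAThread attribute on Main is the usual VB.NET way to get the effect of CoInitializeEx(NULL, COINIT_MULTITHREADED) asked about earlier in the thread, and the timeout guards against the endless loop John ran into.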
 
Thank you, Charles, that works perfectly now :)

I've come up with another version which uses HttpWebRequest/HttpWebResponse,
which has the advantage of providing a Timeout property, though a timeout
would be easy to implement in your version. I'm not sure of the pros and
cons of either method but it was an interesting exercise!

Thanks again for replying and helping out.
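
John's HttpWebRequest version was not posted, but a minimal sketch of that idea could look like the following. The module and function names, the example URL, and the regular expression are all assumptions made for illustration. A regex is only a rough substitute for real DOM parsing and will miss unquoted or unusual attributes. The sketch targets the .NET Framework with VB 2005 or later (for Using blocks).

```vbnet
Imports System
Imports System.IO
Imports System.Net
Imports System.Text.RegularExpressions

Module FetchSketch

    ' Downloads the page at url and returns its body as a string.
    ' GetResponse throws a WebException if no response arrives within timeoutMs.
    Function DownloadHtml(ByVal url As String, ByVal timeoutMs As Integer) As String
        Dim request As HttpWebRequest = DirectCast(WebRequest.Create(url), HttpWebRequest)
        request.Timeout = timeoutMs
        Using response As WebResponse = request.GetResponse()
            ' StreamReader defaults to UTF-8; other encodings need extra handling.
            Using reader As New StreamReader(response.GetResponseStream())
                Return reader.ReadToEnd()
            End Using
        End Using
    End Function

    ' Returns the quoted href value of the first <a> tag, or Nothing if none matches.
    Function FirstHref(ByVal html As String) As String
        Dim pattern As String = "<a\s[^>]*href\s*=\s*[""']([^""']*)[""']"
        Dim m As Match = Regex.Match(html, pattern, RegexOptions.IgnoreCase)
        If m.Success Then Return m.Groups(1).Value
        Return Nothing
    End Function

    Sub Main()
        ' Hypothetical URL; replace with the page you want to inspect.
        Dim html As String = DownloadHtml("http://example.com/", 10000)
        Console.WriteLine(FirstHref(html))
    End Sub

End Module
```

Compared with the mshtml approach, this avoids COM and apartment issues entirely, but it only sees the raw HTML as served, not a parsed DOM.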
 