HTML Page Scraping

  • Thread starter: james

james

Hi Guys,

I want to write an app in C# that signs in to a website and grabs some
information. Do you know how complicated that can get (with security
tokens, etc)?

Thanks,
James
 
james said:
Hi Guys,

I want to write an app in C# that signs in to a website and grabs some
information. Do you know how complicated that can get (with security
tokens, etc)?

Grotesquely complicated.

You want to use an embedded web browser control to do all of the HTML
handling and have your code "scrape" values by finding the appropriate
elements in the DOM exposed by the web browser.

-cd
 
James,

It can get pretty difficult, depending on the technologies that are
utilized in the page. For example, authentication can be in the form of
HTTP authentication, or forms-based, or maybe it is done through an AJAX
call. Needless to say, you are almost definitely going to have to
specialize your code depending on the site and the security it uses.
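
For the plain HTTP authentication case, a minimal sketch might look like the following. The class and method names are made up for illustration, and it assumes the server issues a standard HTTP authentication challenge; forms-based and AJAX logins need a different approach.

```csharp
using System;
using System.Net;

static class HttpAuthSketch
{
    // Hypothetical helper: builds a request that can answer an HTTP
    // authentication challenge with the given user name and password.
    public static HttpWebRequest CreateRequest(string url, string user, string password)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);

        // The credentials are only sent when the server asks for them,
        // unless PreAuthenticate is also set.
        request.Credentials = new NetworkCredential(user, password);
        return request;
    }
}
```

Nothing is sent until GetResponse() is called on the returned request.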
 
james said:
Hi Guys,

I want to write an app in C# that signs in to a website and grabs some
information. Do you know how complicated that can get (with security
tokens, etc)?

In addition to what Carl and Nicholas said, there is an additional
complication if you actually plan to distribute your app to the users.
Depending on the version of the browser they have, the rendering might
be just slightly off and all of a sudden the Regex that you used to
parse the constituent source is no longer working.
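
To illustrate how fragile a regex can be here, this sketch compares a pattern tied to one exact serialization of the markup with a slightly more tolerant one. The element id "balance" and both sample strings are invented for the example; a real page will differ.

```csharp
using System.Text.RegularExpressions;

static class RegexFragility
{
    // Matches only one exact spelling of the tag: lowercase, double quotes,
    // and no other attributes.
    static readonly Regex Strict =
        new Regex("<span id=\"balance\">(.*?)</span>");

    // Tolerates upper/lower case, single, double, or missing quotes, and
    // extra attributes before or after the id.
    static readonly Regex Tolerant = new Regex(
        @"<span\b[^>]*\bid\s*=\s*[""']?balance[""']?[^>]*>(.*?)</span>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the captured text, or null when the pattern does not match.
    public static string ExtractStrict(string html)
    {
        Match m = Strict.Match(html);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static string ExtractTolerant(string html)
    {
        Match m = Tolerant.Match(html);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

Given `<span id="balance">42.00</span>`, both methods return the value. Given `<SPAN class=x id=balance>42.00</SPAN>`, only ExtractTolerant does. Even the tolerant pattern is a patch, not a parser, so reading the DOM or using an HTML parser is usually the sturdier route.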
 
Carl,

Would an embedded browser be able to handle security tokens for me?

Also, do you know if I can access the DOM in C# or would I have to use
javascript or regex?

Thanks,
James
 
james said:
Carl,

Would an embedded browser be able to handle security tokens for me?

Maybe. It's a better place to start than from scratch, for sure.

james said:
Also, do you know if I can access the DOM in C# or would I have to use
javascript or regex?

You should be able to access it from C#. It's a COM/OLE Automation
component, so you should be able to access the whole API from C# via COM
interop. I believe that .NET 2.0+ provides a pre-built wrapper for you.

-cd
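
As a rough sketch of the embedded-browser approach, the following uses the Windows Forms WebBrowser control to fill in a login form and then read one element from the DOM. The URL and the element ids ("username", "password", "loginButton", "balance") are placeholders, not taken from any real site, and the flow assumes a simple form post followed by a page load.

```csharp
using System;
using System.Windows.Forms;

static class BrowserScrapeSketch
{
    [STAThread] // the WebBrowser control requires a single-threaded apartment
    static void Main()
    {
        var form = new Form();
        var browser = new WebBrowser();
        browser.Dock = DockStyle.Fill;
        browser.ScriptErrorsSuppressed = true;
        form.Controls.Add(browser);

        bool loginSubmitted = false;

        browser.DocumentCompleted += (sender, e) =>
        {
            // DocumentCompleted also fires for frames; wait for the top-level page.
            if (e.Url != browser.Url) return;

            HtmlDocument doc = browser.Document;
            if (!loginSubmitted)
            {
                // Placeholder ids: inspect the real page to find the right ones.
                HtmlElement user = doc.GetElementById("username");
                HtmlElement pass = doc.GetElementById("password");
                HtmlElement button = doc.GetElementById("loginButton");
                if (user == null || pass == null || button == null) return;

                user.SetAttribute("value", "my-user");
                pass.SetAttribute("value", "my-password");
                button.InvokeMember("click"); // the browser handles cookies and hidden fields
                loginSubmitted = true;
                return;
            }

            // After login, read the value straight from the DOM.
            HtmlElement balance = doc.GetElementById("balance");
            form.Text = balance != null ? balance.InnerText : "(element not found)";
        };

        browser.Navigate("https://example.com/login");
        Application.Run(form);
    }
}
```

Because the control submits the form itself, any hidden token fields on the page go along with the post without extra code, which is the main attraction of this route.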
 
James,

You will certainly need MSHTML. It is a very extensive library, and
therefore the information is hard to find on MSDN. Be aware that you have to
cast almost everything, often twice in a row, to get the right types as you
use it.

Be aware that referencing it can slow down your IDE.

Cor
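
The casting described above might look something like this sketch. It assumes a project reference to Microsoft.mshtml and an already loaded WebBrowser control; the helper name is hypothetical.

```csharp
using System.Windows.Forms;
using mshtml; // assumes a reference to the Microsoft.mshtml interop assembly

static class MshtmlCasts
{
    // Hypothetical helper: reads the value of an <input> element by id.
    public static string ReadInputValue(WebBrowser browser, string elementId)
    {
        if (browser.Document == null) return null;

        // First cast: DomDocument is typed as object, so cast it to the
        // COM interface that exposes getElementById.
        IHTMLDocument3 doc = (IHTMLDocument3)browser.Document.DomDocument;
        IHTMLElement element = doc.getElementById(elementId);

        // Second cast: the generic element interface has no "value" member,
        // so cast again to the interface for input elements.
        IHTMLInputElement input = element as IHTMLInputElement;
        return input == null ? null : input.value;
    }
}
```

For simple reads the managed HtmlDocument and HtmlElement wrappers are often enough, and MSHTML is only needed for members those wrappers do not expose.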
 
james said:
Hi Guys,

I want to write an app in C# that signs in to a website and grabs some
information. Do you know how complicated that can get (with security
tokens, etc)?

Thanks,
James

As some people stated here, it's site-dependent.
But here's a list of starting points:

  • IE Developer Toolbar: to inspect the HTML and the HTTP/POST requests
    between the website and the browser (it works on IE; for Firefox you
    have Firebug).
  • HttpWebRequest and HttpWebResponse: for using the HTTP protocol.
  • CookieContainer (used with the above): for cookie-based authentication.
  • CredentialCache: for NTLM-based authentication (perhaps it's used for
    other things, but this is what I have used it for so far).
  • HtmlAgilityPack: for parsing HTML. Perhaps there's a better component,
    so do your own research.

Hope it's helpful :-)
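
A minimal sketch of how those classes fit together for a forms-based login is shown below. The class and method names are invented for illustration, and the field names and URLs in the usage note are placeholders. HttpWebRequest matches the era of this thread; newer .NET versions steer you toward HttpClient instead.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

static class FormsLoginSketch
{
    // Builds an application/x-www-form-urlencoded body from name/value pairs.
    public static string BuildFormBody(params string[] namesAndValues)
    {
        if (namesAndValues.Length % 2 != 0)
            throw new ArgumentException("Expected name/value pairs.");

        var body = new StringBuilder();
        for (int i = 0; i < namesAndValues.Length; i += 2)
        {
            if (body.Length > 0) body.Append('&');
            body.Append(Uri.EscapeDataString(namesAndValues[i]));
            body.Append('=');
            body.Append(Uri.EscapeDataString(namesAndValues[i + 1]));
        }
        return body.ToString();
    }

    // Wraps credentials in a CredentialCache for NTLM-protected sites.
    public static ICredentials NtlmCredentials(string url, string user, string password, string domain)
    {
        var cache = new CredentialCache();
        cache.Add(new Uri(url), "NTLM", new NetworkCredential(user, password, domain));
        return cache;
    }

    // Sends a GET (formBody == null) or a form POST and returns the response text.
    // Passing the same CookieContainer to every call carries the session cookie along.
    public static string Send(string url, string formBody, CookieContainer cookies, ICredentials credentials)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.CookieContainer = cookies;
        if (credentials != null) request.Credentials = credentials;

        if (formBody != null)
        {
            byte[] payload = Encoding.UTF8.GetBytes(formBody);
            request.Method = "POST";
            request.ContentType = "application/x-www-form-urlencoded";
            request.ContentLength = payload.Length;
            using (Stream stream = request.GetRequestStream())
                stream.Write(payload, 0, payload.Length);
        }

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
            return reader.ReadToEnd();
    }
}
```

A typical flow would create one CookieContainer, call Send with the login URL and a body such as BuildFormBody("user", "me", "pass", "secret"), then call Send again with a null body to fetch the page you want, reusing the same container. If the login form carries a hidden token field, it has to be read from the login page first and included in the pairs.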
 
james said:
Hi Guys,

I want to write an app in C# that signs in to a website and grabs some
information. Do you know how complicated that can get (with security
tokens, etc)?

Thanks,
James

You can also try SWExplorerAutomation (SWEA) from http://webius.net.
SWEA records and replays automation scripts and generates VB.NET or C#
code.
 
