Seeking examples of screen scraping....

Jim · Jan 16, 2006

If I were to code the solution myself, I would agree.

Starting from scratch....it seems that the best way to get the data will be
to design a UI that (a) shows the web page from which you wish to gather the
data and (b) allows you to select a portion of the web page by simply
drawing a box around the intended elements.

Then you would need to identify the element in the HTML by name, position,
element type, or some text that is likely to occur in the element and can
act as a kind of tag. A combination of these identifiers would be most
helpful, but most data formatted for the web contains some type of header
(title) in the text that can be used as the identifier.
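
As a rough illustration of locating an element by a header in the text plus
its element type, here is a minimal sketch in Python (used only for brevity;
the thread itself targets .NET). It relies on the standard library's
html.parser. The class HeaderAnchoredFinder, the helper find_after_header,
and the sample page are all hypothetical names invented for this example.

```python
from html.parser import HTMLParser


class HeaderAnchoredFinder(HTMLParser):
    """Collect the text of the first <target_tag> element that follows a
    heading whose text contains header_text (case-insensitive)."""

    # Tags treated as "headers" for this sketch (an assumption).
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6", "th", "caption"}

    def __init__(self, header_text, target_tag):
        super().__init__()
        self.header_text = header_text.lower()
        self.target_tag = target_tag
        self.in_heading = False   # currently inside a heading tag
        self.header_seen = False  # the identifying header was found
        self.capturing = False    # currently inside the target element
        self.done = False         # first target element fully read
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag in self.HEADINGS:
            self.in_heading = True
        elif self.header_seen and tag == self.target_tag and not self.capturing:
            self.capturing = True

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self.in_heading = False
        elif self.capturing and tag == self.target_tag:
            self.capturing = False
            self.done = True

    def handle_data(self, data):
        if self.in_heading and self.header_text in data.lower():
            self.header_seen = True
        elif self.capturing:
            self.parts.append(data.strip())


def find_after_header(html, header_text, target_tag):
    """Return the text of the first target_tag after the matching header."""
    finder = HeaderAnchoredFinder(header_text, target_tag)
    finder.feed(html)
    return " ".join(part for part in finder.parts if part)


# Made-up sample page.
SAMPLE = """
<html><body>
<h2>Weather</h2><p>Sunny, 72F</p>
<h2>Closing Price</h2><p>Widget Corp: <b>41.25</b></p>
</body></html>
"""

print(find_after_header(SAMPLE, "closing price", "p"))
```

Combining the header text with the element type keeps the lookup working
when unrelated sections are added above or below the one you care about,
which a purely positional lookup would not survive.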

There was a software package that did something like this called...EyeOnWeb
(http://www.eyeonweb.com/screen.html). The website has a 2004 date....so I
am not sure about the continuation of this product. There is no mention of
a developer's product here, but I suspect it would be a welcomed addition to
a web developer's Visual Studio Toolbox.

Jim

alex_f_il · Jan 16, 2006

Look at SWExplorerAutomation (SWEA)
(http://home.comcast.net/~furmana/SWIEAutomation.htm). SWEA has
TableDataExtractor and XPathDataExtractor, which let you visually
define the data to be extracted. The TableDataExtractor extracts
tabular data from Web pages. If a Web page contains repeating
information patterns, then the data can be transformed into an
ADO.NET DataTable object. XPathDataExtractor lets you visually define
XPath expressions for the data extraction.
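
To make the general idea concrete (this is not SWEA's API, which is not shown
in this thread), here is a minimal sketch in Python that selects a table with
an XPath-style expression and turns its rows into record-like dictionaries,
roughly what a DataTable would give you. It uses the standard library's
xml.etree.ElementTree, which supports only a limited XPath subset and needs
well-formed markup, so real-world HTML would usually have to be tidied first.
The function table_to_records and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample markup.
XHTML = """
<html><body>
  <table id="quotes">
    <tr><th>Symbol</th><th>Price</th></tr>
    <tr><td>ABC</td><td>10.50</td></tr>
    <tr><td>XYZ</td><td>7.25</td></tr>
  </table>
</body></html>
"""


def table_to_records(xhtml, table_xpath):
    """Return the table's data rows as a list of dicts keyed by the
    text of the first row's cells."""
    root = ET.fromstring(xhtml)
    table = root.find(table_xpath)
    if table is None:
        return []
    rows = table.findall("tr")
    if not rows:
        return []
    # The first row supplies the column names.
    headers = ["".join(cell.itertext()).strip() for cell in rows[0]]
    records = []
    for row in rows[1:]:
        values = ["".join(cell.itertext()).strip() for cell in row]
        records.append(dict(zip(headers, values)))
    return records


records = table_to_records(XHTML, ".//table[@id='quotes']")
for record in records:
    print(record["Symbol"], record["Price"])
```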

Jim · Jan 16, 2006

The website says "Requires Microsoft .Net framework runtime 1.1." and I am
using 2.0 for this project.

But, it looks cool.

Cor Ligthert [MVP] · Jan 16, 2006

Registered user

Something like the HTMLDocumentClass type perhaps?
If so mshtml.dll is the place to look.

The sample I gave to Jim is about the HTMLDocumentClass and mshtml. Don't
you think it would be better next time to look at the given answer first
before you reply?

Cor

alex_f_il · Jan 16, 2006

SWExplorerAutomation will work with the .NET Framework 2.0 runtime. You
may only run into installation problems. To install the current version:

1. Unzip the downloaded exe file. Use MSI to install.

2. Update swdesigner.exe.config:

<startup>
<supportedRuntime version="v2.0.50727"/>
<supportedRuntime version="v1.1.4322"/>
</startup>

This week I will post a new release that installs on machines with only
.NET Framework 2.0.

Jim · Jan 16, 2006

Sweet! I'll poke it in the eye this afternoon.

Thanks!

Registered User · Jan 16, 2006

Registered user

The sample I gave to Jim is about the HTMLDocumentClass and mshtml. Don't
you think it would be better next time to look at the given answer first
before you reply?

A bit cantankerous eh? I was responding to the quoted follow-up.
Apparently the given answer was not sufficient hence Jim's subsequent
question.

regards
A.G.

Nick Malik [Microsoft] · Jan 16, 2006

Jim said:
This is an excellent starting point. Thank you for posting it.

What I am wondering is if there is a way to load the results into an
object that allows one to extract data as if it were a recordset. Have
you seen anything like that?

Hi Jim,

I have seen numerous controls in the third party space where you can load an
HTML page and then move through it as an object hierarchy.

The problem with HTML is that it is a text markup language. It is not
really useful for describing data as an object. Therefore tools that read
HTML (including the app you are writing) have to cope with this lack of
structure by using patterns to find the relevant sections of text.

It sounds like the sites you are visiting are updated daily. This nearly
always means that they are program-generated (ASP, PHP, etc). Using regular
expressions, and examples from a couple of days of pulling the page down,
you should be able to isolate the strings that never change from the data
that does. That information can help you to produce a regular expression
that will isolate the data you want.
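
A minimal sketch of that idea, in Python for brevity: compare saved copies of
the page from different days, keep the text that never changes on either side
of the data, and build a regular expression from it. The helper names
(constant_frame, build_pattern, extract), the anchor_len parameter, and the
sample snippets are all hypothetical, and the character-level comparison is
only one simple way to find the unchanging strings.

```python
import os
import re


def constant_frame(snapshots):
    """Return the (prefix, suffix) text shared by every snapshot."""
    prefix = os.path.commonprefix(snapshots)
    rests = [snap[len(prefix):] for snap in snapshots]
    # Common suffix = common prefix of the reversed remainders.
    suffix = os.path.commonprefix([rest[::-1] for rest in rests])[::-1]
    return prefix, suffix


def build_pattern(prefix, suffix, anchor_len=20):
    """Build a regex that captures whatever sits between the constant text."""
    if not prefix or not suffix:
        raise ValueError("snapshots share no constant text on both sides")
    # Only the text nearest the data is used as an anchor.
    left = re.escape(prefix[-anchor_len:])
    right = re.escape(suffix[:anchor_len])
    return re.compile(left + r"(.*?)" + right, re.DOTALL)


def extract(pattern, page):
    """Return the captured data, or None if the anchors are not found."""
    match = pattern.search(page)
    return match.group(1) if match else None


# Made-up snippets standing in for the same page saved on different days.
DAY1 = '<td class="hdr">Last trade</td><td><b>41.25</b></td>'
DAY2 = '<td class="hdr">Last trade</td><td><b>39.80</b></td>'
DAY3 = '<td class="hdr">Last trade</td><td><b>7.05</b></td>'

prefix, suffix = constant_frame([DAY1, DAY2])
pattern = build_pattern(prefix, suffix)
print(extract(pattern, DAY3))
```

If two samples happen to share leading or trailing characters in the data
itself (41.25 and 41.80, say), those characters end up in the anchors, so it
is worth comparing more than two days before trusting the pattern.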

I wrote a little app like this a couple of years ago that would pull the
Dilbert of the day down to my hard drive and set it up to be in my
screensaver. (See what happens when programmers get bored?)

--
--- Nick Malik [Microsoft]
MCSD, CFPS, Certified Scrummaster
http://blogs.msdn.com/nickmalik

Disclaimer: Opinions expressed in this forum are my own, and not
representative of my employer.
I do not answer questions on behalf of my employer. I'm just a
programmer helping programmers.
--

Jim · Jan 16, 2006

Nick Malik said:
Hi Jim,

I have seen numerous controls in the third party space where you can load
an HTML page and then move through it as an object hierarchy.

The problem with HTML is that it is a text markup language. It is not
really useful for describing data as an object. Therefore tools that read
HTML (including the app you are writing) have to cope with this lack of
structure by using patterns to find the relevant sections of text.

It sounds like the sites you are visiting are updated daily. This nearly
always means that they are program-generated (ASP, PHP, etc). Using
regular expressions, and examples from a couple of days of pulling the
page down, you should be able to isolate the strings that never change
from the data that does. That information can help you to produce a
regular expression that will isolate the data you want.

I wrote a little app like this a couple of years ago that would pull the
Dilbert of the day down to my hard drive and set it up to be in my
screensaver. (See what happens when programmers get bored?)

Excellent use of resources!

I have Dilbert as a page in my news (real news not newsgroups) group of
pages that I open first thing every morning.

Jim

Guest · Feb 20, 2006

Come on, the guy posted a reasonable question for help and some jerk said
GOOGLE IT.

Was that person trying to be helpful? NO.
It was a petty, passive aggressive flame.

People like that are a waste of time and dilute the quality of the newsgroups.
