What is the best way to process HTML Data?

ink

Hi all,

I am trying to pull some financial data off an HTML web page so that I
can store it in a database for sorting and filtering.
I have been thinking about this for some time and trying to find the best
way to do it, but I am just not experienced enough with this sort of thing to
make the best decision, so any advice would be great.

The data is in a number of different nested tables within the HTML, and on
a number of different pages, and each page is laid out differently.

The common factors are that each table is well formed and either has a heading
in the first row of the table or has a separate heading table just above
the data table.

This is the kind of thing that I was thinking about doing:

1. Use SGMLReader to tidy up the HTML file into XHTML.
2. Deserialize the XHTML into objects.
3. Read the data from the objects into the database.

Another idea is:

1. Use SGMLReader to tidy up the HTML file into XHTML.
2. Use XPath or some kind of regular expressions to pull the data from the
tables.
3. Store the data into the database.
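
For illustration, the XPath variant might look something like the sketch below. It is written in Python with the standard library's ElementTree (which supports only a subset of XPath) purely to keep it short; it assumes step 1 has already produced well-formed XHTML without a namespace. The sample markup, the `prices` id and the `extract_rows` helper are all made up for the example and are not taken from any real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already-tidied XHTML: a data table nested inside a layout table.
XHTML = """
<html><body>
  <table id="layout"><tr><td>
    <table id="prices">
      <tr><th>Ticker</th><th>Price</th></tr>
      <tr><td>ABC</td><td>101.5</td></tr>
      <tr><td>XYZ</td><td>47.25</td></tr>
    </table>
  </td></tr></table>
</body></html>
"""

def cell_texts(tr):
    """Return the stripped text of every cell (th or td) in one table row."""
    return ["".join(cell.itertext()).strip() for cell in tr]

def extract_rows(xhtml, table_xpath):
    """Locate one table by XPath and return its data rows as dicts,
    keyed by the headings found in the table's first row."""
    root = ET.fromstring(xhtml)
    table = root.find(table_xpath)
    if table is None:
        raise ValueError("no table matches " + table_xpath)
    trs = table.findall("./tr")  # direct rows only, so nested tables are ignored
    headers = cell_texts(trs[0])
    return [dict(zip(headers, cell_texts(tr))) for tr in trs[1:]]

rows = extract_rows(XHTML, ".//table[@id='prices']")
```

Each row comes back as a plain dictionary of strings, which is easy to hand to whatever does the database insert. If the tidied XHTML carries the XHTML namespace, the element names in the XPath would need to be namespace-qualified.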

Or maybe a combination of both.

1. Use SGMLReader to tidy up the HTML file into XHTML.
2. Use XPath or some kind of regular expressions to pull the data tables.
3. Deserialize the XHTML data tables into objects.
4. Read the data from the objects into the database.
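
A rough sketch of steps 3 and 4, again in Python only for brevity: rows that have already been pulled out of a table are turned into typed objects and then written to a database. The `Quote` class, the column names, the sample rows and the in-memory SQLite database are all assumptions made for the example.

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class Quote:
    """Hypothetical object for one row of a price table."""
    ticker: str
    price: float

def to_quote(row):
    """Convert one extracted row (a dict of strings) into a typed object."""
    return Quote(ticker=row["Ticker"], price=float(row["Price"]))

def store(conn, quotes):
    """Create the table if needed and insert or update one record per quote."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS quotes (ticker TEXT PRIMARY KEY, price REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO quotes (ticker, price) VALUES (?, ?)",
        [(q.ticker, q.price) for q in quotes],
    )
    conn.commit()

# Sample rows in the shape an extraction step might produce (made up).
raw_rows = [
    {"Ticker": "ABC", "Price": "101.5"},
    {"Ticker": "XYZ", "Price": "47.25"},
]

conn = sqlite3.connect(":memory:")
store(conn, [to_quote(r) for r in raw_rows])
sorted_rows = conn.execute(
    "SELECT ticker, price FROM quotes ORDER BY price"
).fetchall()
```

Doing the type conversion in one place (`to_quote`) means a badly formatted cell fails loudly there rather than ending up in the database as text.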

I am not sure which will be the simplest to write and maintain in the future
should things change on the HTML pages. I am new to both of these types of
development: I have done each of them at least once, but on a much smaller
scale, so I sort of know how they work but not what the potential pitfalls are
or whether it is possible to use these sorts of things for such complex HTML.
One good thing I can see about XPath is that I could store it in a config file,
and if the web page changed I could change where it read the data from without
too much work. I am not even sure that I would be able to deserialize such a
complex HTML model, or whether I can just deserialize the tables.
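
To make the config-file idea concrete, here is a minimal sketch, assuming the XPath expressions live in a JSON config with one entry per table. It covers both layouts described earlier: headings in the first row of the data table, or in a separate heading table. The config keys (`data`, `header`), the table ids and the sample page are invented for the example, and Python's ElementTree is used only because it is compact.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical config: one entry per table. "header" is optional and points
# at a separate heading table when the headings are not in the data table.
CONFIG_JSON = """
{
  "prices":    {"data": ".//table[@id='prices']"},
  "dividends": {"data": ".//table[@id='divdata']",
                "header": ".//table[@id='divhead']"}
}
"""

# Hypothetical, already-tidied page showing both layouts.
PAGE = """
<html><body>
  <table id="prices">
    <tr><th>Ticker</th><th>Price</th></tr>
    <tr><td>ABC</td><td>101.5</td></tr>
  </table>
  <table id="divhead"><tr><th>Ticker</th><th>Dividend</th></tr></table>
  <table id="divdata"><tr><td>ABC</td><td>3.2</td></tr></table>
</body></html>
"""

def cell_texts(tr):
    """Return the stripped text of every cell in one table row."""
    return ["".join(cell.itertext()).strip() for cell in tr]

def read_table(root, spec):
    """Read one table as a list of dicts, using the XPaths in its config entry."""
    data_rows = root.find(spec["data"]).findall("./tr")
    if "header" in spec:
        # Headings live in a separate table just above the data table.
        headers = cell_texts(root.find(spec["header"]).find("./tr"))
    else:
        # Headings are the first row of the data table itself.
        headers, data_rows = cell_texts(data_rows[0]), data_rows[1:]
    return [dict(zip(headers, cell_texts(tr))) for tr in data_rows]

config = json.loads(CONFIG_JSON)
root = ET.fromstring(PAGE)
tables = {name: read_table(root, spec) for name, spec in config.items()}
```

If a page is rearranged, only the XPath strings in the config would need to change, as long as the table still follows one of the two heading layouts.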

This is the kind of thing that I really want to get as close to correct the
first time as I can. So any ideas would be great. As you can see I am
struggling with a lot of new boy questions.

Thanks,
ink
 
Ashot Geodakov

There's a good reason financial sites lay out their pages differently - to
prevent their data from being stolen.

Don't bother writing software. Just pay them a modest subscription of around
$10 per month, and they'll send you spreadsheets with your morning coffee!
:)
 
ink

Really, I have no problem paying, but I can't find anyone selling UK data
to private investors.

They seem to think everyone is a large broker and has £500 a month to blow
on data.
 
