REQ: Simplest way to parse (read) HTML formatted data in via Excel VBA (or VB6)

S

Steve

I distributed some workbook apps in 2001 that are currently being used
for budgeting and forecasting work. A main feature of the apps is the
ability to pull current-year data from the proprietary accounting software
that the businesses use. The users were able to run a general ledger
trial balance, save it in plain text format, and my apps would parse the
data and import it into the workbooks.

Well, now their accounting software has been updated to export the
reports into html format, not plain text.

So, instead of seeing:

ACCOUNT-NO
DESCRIPTION BEGINNING BALANCE ---------TOTAL-------- NET CHANGE ENDING BALANCE BUDGET PRIOR-YEAR

010-30120-00 14,606.00- DEBIT .00 8,185.00- 22,791.00- 13,952- 15,549-
IND PREFERRED CREDIT 8,185.00

TRX-DATE DR-AMOUNT CR-AMOUNT RUNNING-BALANCE SOURCE REFERENCE
08/01/00 227.00 14,833.00- IJ01 FRIDAY
08/02/00 216.00 15,049.00- IJ02 SATURDAY
08/03/00 233.00 15,282.00- IJ03 SUNDAY


It now looks like this:


<table>
<TR><TD>010-30110-00 </TD><TD ALIGN=RIGHT> 14,606.00-</TD><TD ALIGN=RIGHT>DEBIT</TD><TD COLSPAN=2 ALIGN=RIGHT> .00 </TD><TD COLSPAN=2 ALIGN=RIGHT> 8,185.00-</TD><TD ALIGN=RIGHT> 22,791.00-</TD><TD ALIGN=RIGHT> 13,952-</TD><TD ALIGN=RIGHT>
15,549-</TD></TR>
<TR><TD>IND - FIT PACKAGE </TD><TD COLSPAN=2 ALIGN=RIGHT>CREDIT</TD><TD COLSPAN=2 ALIGN=RIGHT> 8,185.00</TD><TD COLSPAN=6 >&nbsp;</TD></TR>
<TR><TD COLSPAN=10 >&nbsp;</TD></TR>
<TR><TD ALIGN=RIGHT>TRX-DATE</TD><TD ALIGN=CENTER>DR-AMOUNT</TD><TD COLSPAN=2 ALIGN=CENTER>CR-AMOUNT</TD><TD COLSPAN=2 ALIGN=RIGHT> RUNNG-BAL </TD><TD ALIGN=CENTER>SOURCE</TD><TD COLSPAN=3 ALIGN=CENTER>REFERENCE</TD></TR>
<TR><TD ALIGN=RIGHT>08/01/03</TD><TD ALIGN=RIGHT> </TD><TD COLSPAN=2 ALIGN=RIGHT> 227.00</TD><TD COLSPAN=2 ALIGN=RIGHT> 14,833.00-</TD><TD ALIGN=CENTER>IJ01 </TD><TD COLSPAN=3 >Friday </TD></TR>
<TR><TD ALIGN=RIGHT>08/02/03</TD><TD ALIGN=RIGHT> </TD><TD COLSPAN=2 ALIGN=RIGHT> 216.00</TD><TD COLSPAN=2 ALIGN=RIGHT> 15,049.00-</TD><TD ALIGN=CENTER>IJ02 </TD><TD COLSPAN=3 >Saturday </TD></TR>
<TR><TD ALIGN=RIGHT>08/03/03</TD><TD ALIGN=RIGHT> </TD><TD COLSPAN=2 ALIGN=RIGHT> 233.00</TD><TD COLSPAN=2 ALIGN=RIGHT> 15,282.00-</TD><TD ALIGN=CENTER>IJ03 </TD><TD COLSPAN=3 >Sunday </TD></TR>
</table>


Any advice on what is the best way to parse data in HTML format?
Does VBA have any built-in support for this purpose?
The versions of Excel being used are 2000 and 2002 exclusively.

Should I bypass Excel altogether and write a VB6 app that converts
the html formatted report into the text format that Excel expects?
I'm thinking that might be the best bet, in case future software updates
to their accounting software make any html formatting changes.

In any case, I would still love to know if there is an easier way to
parse the html. It looks like all I really need to do is to be able to read
the table cell contents in a logical manner.

Thanks

Steve
 
T

Tom Ogilvy

Just open the html file in Excel using file open. then look at the result.
Most of it should be parsed into separate columns.

--
Regards,
Tom Ogilvy

Steve said:
I distributed some workbook apps in 2001 that are currently being used
for budgeting and forecasting work. A main feature of the apps is the
ability to pull current-year data from the proprietary accounting software
that the businesses use. The users were able to run a general ledger
trial balance, save it in plain text format, and my apps would parse the
data and import it into the workbooks.

Well, now their accounting software has been updated to export the
reports into html format, not plain text.

So, instead of seeing:

ACCOUNT-NO
DESCRIPTION BEGINNING BALANCE ---------TOTAL--------
NET CHANGE ENDING BALANCE BUDGET PRIOR-YEAR
010-30120-00 14,606.00- DEBIT .00
8,185.00- 22,791.00- 13,952- 15,549-
IND PREFERRED CREDIT 8,185.00

TRX-DATE DR-AMOUNT CR-AMOUNT
RUNNING-BALANCE SOURCE REFERENCE
08/01/00 227.00 14,833.00- IJ01 FRIDAY
08/02/00 216.00 15,049.00- IJ02 SATURDAY
08/03/00 233.00 15,282.00- IJ03 SUNDAY


It now looks like this:


<table>
<TR><TD>010-30110-00 </TD><TD ALIGN=RIGHT>
14,606.00-</TD><TD ALIGN=RIGHT>DEBIT</TD><TD COLSPAN=2 ALIGN=RIGHT>
..00 said:
15,549-</TD></TR>
<TR><TD>IND - FIT PACKAGE </TD><TD COLSPAN=2
ALIGN=RIGHT>CREDIT said:
<TR><TD COLSPAN=10 >&nbsp;</TD></TR>
<TR><TD ALIGN=RIGHT>TRX-DATE</TD><TD ALIGN=CENTER>DR-AMOUNT</TD><TD
COLSPAN=2 ALIGN=CENTER>CR-AMOUNT</TD><TD COLSPAN=2 ALIGN=RIGHT> RUNNG-BAL
<TR><TD ALIGN=RIGHT>08/01/03</TD><TD ALIGN=RIGHT> </TD><TD
COLSPAN=2 ALIGN=RIGHT> 227.00</TD><TD COLSPAN=2 ALIGN=RIGHT>
14 said:
<TR><TD ALIGN=RIGHT>08/02/03</TD><TD ALIGN=RIGHT> </TD><TD
COLSPAN=2 ALIGN=RIGHT> 216.00</TD><TD COLSPAN=2 ALIGN=RIGHT>
15 said:
<TR><TD ALIGN=RIGHT>08/03/03</TD><TD ALIGN=RIGHT> </TD><TD
COLSPAN=2 ALIGN=RIGHT> 233.00</TD><TD COLSPAN=2 ALIGN=RIGHT>
 
R

Rick Rothstein

Just open the html file in Excel using file open. then look at the
result.
Well, I have to avoid that kind of scenario (opening the import file in Excel directly).
The import files can be quite large (exceeding the 65,536 line limitation of Excel).
The pcs that the apps run on are slow and clunky too, so I much prefer to simply
open the import file for line input, and process the data line-by-line.

Thanks for the advice though Tom!

This suggestion is along the same lines as Tom's, so it might not be of any
help either, however...

You could create a separate VB "file conditioning" program (or bundle the
few lines that would make it up directly in your existing program) that
opens up a session of Word, loads the html file and then saves it back out
as plain text. I would think that would create a text file that would be
useable by your existing program's code.

Rick - MVP
 
A

Andrew Faust

Steve said:
I distributed some workbook apps in 2001 that are currently being used
for budgeting and forecasting work. A main feature of the apps is the
ability to pull current-year data from the
proprietary accounting software
that the businesses use. The users were able to run a general ledger
trial balance, save it in plain text format, and my apps would parse the
data and import it into the workbooks.

Well, now their accounting software has been updated to export the
reports into html format, not plain text.

A lot of accounting software these days will export to
XML
format. See if maybe that is an option. If so you can
use
MSXML to easily parse the XML.

If not, and you are stuck with HTML, the first thing I
would
do is find the <Table>...</Table> section and strip it
out.
Then I would scan through the remainder, and count the
number of rows, at the same time I would count the max
number of columns. This is easily done by counting the
<TR>
tags for rows, and each time you see a <TR> count the
number of <TD> or <TH> tags that show up before the
next </TR>. Only keep track of the max number of
columns.

Then redim an array as myArray(0 to numCols - 1, 0 to
numRows - 1)
Now read throught the data again, and start at row 0,
col 0. Each time
you hit a new </TD> or </TH> tag increment the col
pointer.
Each time you hit a </TR> tag increment the row
pointer. Read the
data between each tag in the corresponding
myArray(curCol, curRow).

There are probably easier ways to do this. But this is
just
the method that first came to mind.

Andrew Faust
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top