Parse HMTL source to fill in cells?

J

Jason

I'm a newbie with VBA, but want to automate a procedure we are doing.

How would I write a function to look at the following HTML source:

'<td><a href="/history/airport/KNYC/2008/7/2/DailyHistory.html">2</a></td>
' <td class="bl gb">
' 85
'</td>
' <td class="gb">
' 76
'</td>
' <td class="br gb">
' 67
'</td>

And put the 85 into cell B1, 76 into B2, 67 into B3?

Thanks!
 
R

Ron Rosenfeld

I'm a newbie with VBA, but want to automate a procedure we are doing.

How would I write a function to look at the following HTML source:

'<td><a href="/history/airport/KNYC/2008/7/2/DailyHistory.html">2</a></td>
' <td class="bl gb">
' 85
'</td>
' <td class="gb">
' 76
'</td>
' <td class="br gb">
' 67
'</td>

And put the 85 into cell B1, 76 into B2, 67 into B3?

Thanks!

In general, it is a simple problem if you define how you are identifying these
three numbers, but you don't give sufficient information.

I assume that the 85, 76 and 67 are merely examples of what could be in those
locations, otherwise you could just write those number into the respective
cells.

A few of the possibilities from examining your data:

Are the numbers identified by being the last three values in the source, and
are they all integers?

Are the values always on the 3rd, 6th and 9th line?

Will the values always be integers; are could they be decimal values (or even
fractions)?

Or is there something else special about these particular numbers to identify
them? For example, is what you really want in B1:B3 the numbers associated
with the different td classes "bl gb", "gb", and "br gb" ???

--ron
 
B

Bernie Deitrick

Jason,

Is your HTML source a document, or a webpage? There are all sorts of different ways to approach
this - depending on what the source actually is.

For Example, reading the document from a web page:

Dim myStr As String
Dim test As Variant
Dim Val As Variant
Dim IE As Object
Dim i As Integer
Dim j As Integer
j = 1

Set IE = CreateObject("InternetExplorer.Application")
IE.Navigate "http://somewebpage.shtml"
myStr = IE.Document.body.innerText

test = Split(myStr, "</td>")
For i = LBound(test) To UBound(test)
If test(i) <> "" Then
Val = Split(test(i), ">")
Cells(j, 2).Value = CDbl(Replace(Replace(Val(UBound(Val)), " ", ""), "'", ""))
j = j + 1
End If
Next i

HTH,
Bernie
MS Excel MVP
 
J

Jason

Sorry if I was too ambiguous with my original post...

This HTML source is from the following webpage on temperature data.
http://www.wunderground.com/history/airport/KNYC/2008/7/2/DailyHistory.html?MR=1

In this case, it is for NYC on July 2, 2008 (from the ...NYC/2008/7/2/... in
the address)
85 is the Max temperatue
76 is the mean
67 is the low temperature.

Essentially we want to setup a process to
1. look up the date in column A
2. go the website and parse the data for that date
3. enter the max in column B, low in column C, mean in column D
4. then repeat this process for the next date in the next row

Hopefully this helps.
 
N

Norie

Jason

Something like this perhaps?

Private Sub MaxMinMeanTemps()

Dim IE As Object
Dim doc As Object
Dim tbls As Object
Dim tbl As Object
Dim tblData As Object
Dim tblRow As Object
Dim rwCells As Object
Dim ws As Worksheet
Dim rng As Range
Dim strDate As String
Dim strURL As String
Dim I As Long

Set ws = Worksheets(1)

Set rng = ws.Range("A2")

Set IE = CreateObject("InternetExplorer.Application")

IE.Visible = True

While rng.Value <> ""

strDate = Year(rng.Value) & "/" & Month(rng.Value) & "/" &
Day(rng.Value)

strURL = "http://www.wunderground.com/history/airport/KNYC/" &
strDate & "/DailyHistory.html"

IE.Navigate strURL

Do While IE.Busy: DoEvents: Loop
Do While IE.ReadyState <> 4: DoEvents: Loop

Set doc = IE.Document

Set tbls = doc.getElementsByTagName("TABLE")

For Each tbl In tbls
If tbl.className = "dataTable tm10" Then
Set tblData = tbl
Exit For
End If
Next

For I = 2 To 4

Set tblRow = tblData.Rows(I)

Set rwCells = tblRow.Cells

rng.Offset(, I - 1) = rwCells(1).innertext

Next I

Set rng = rng.Offset(1)

Wend

IE.Quit: Set IE = Nothing

End Sub
 
R

Ron Rosenfeld

Sorry if I was too ambiguous with my original post...

This HTML source is from the following webpage on temperature data.
http://www.wunderground.com/history/airport/KNYC/2008/7/2/DailyHistory.html?MR=1

In this case, it is for NYC on July 2, 2008 (from the ...NYC/2008/7/2/... in
the address)
85 is the Max temperatue
76 is the mean
67 is the low temperature.

Essentially we want to setup a process to
1. look up the date in column A
2. go the website and parse the data for that date
3. enter the max in column B, low in column C, mean in column D
4. then repeat this process for the next date in the next row

Hopefully this helps.

Much more clear description of your requirements.

Perhaps the following Macro would help:

To enter this Macro (Sub), <alt-F11> opens the Visual Basic Editor.
Ensure your project is highlighted in the Project Explorer window.
Then, from the top menu, select Insert/Module and
paste the code below into the window that opens.

To use this Macro (Sub), <alt-F8> opens the macro dialog box. Select the macro
by name, and <RUN>.

As written, the entry cell is set to A2, and the results are in B2, C2 and D2.
This can be easily changed.

==========================================
Option Explicit
Sub gettemps()
Dim c As Range
Dim myStr As String
Dim test As Variant
Dim Val As Variant
Dim IE As Object
Const sURL1 As String = "http://www.wunderground.com/history/airport/KNYC/"
Dim sURLdate As String
Const sURL2 As String = "/DailyHistory.html?MR=1"

Set c = Range("A2")

sURLdate = Format(c.Value2, "yyyy/m/d")

Application.Cursor = xlWait

Set IE = CreateObject("InternetExplorer.Application")
IE.Navigate sURL1 & sURLdate & sURL2
While IE.ReadyState <> 4
DoEvents
Wend

myStr = IE.Document.body.innerhtml

c.Offset(0, 1).Value = RegexMid(myStr, "max")
c.Offset(0, 2).Value = RegexMid(myStr, "min")
c.Offset(0, 3).Value = RegexMid(myStr, "mean")

Application.Cursor = xlDefault
End Sub

Private Function RegexMid(s As String, sTempType As String) As String
Dim re As Object, mc As Object
Set re = CreateObject("vbscript.regexp")
re.IgnoreCase = True
re.MultiLine = True
re.Pattern = "\b" & sTempType & "\b\D+(\d+)"

If re.test(s) = True Then
Set mc = re.Execute(s)
RegexMid = mc(0).submatches(0)
End If
End Function
==================================
--ron
 
J

Jason

Norie,

Thanks, that did it. How can I get it to enter just the number, say "85"
instead of "85 <degree sign>F"? I know that is simple, but I'm still
learning.

I changed
Set rng = ws.Range("A2")
to something similar to Joel's post
RowCount = 1
Set rng = ws.Range("A" & RowCount)
RowCount = RowCount + 1
so that I can enter in a whole list of dates and have it go through and do
it for all the dates listed.

Of course, now my mind is going...

Any idea how I could have it look at the table, find the last date listed in
column A, enter the next date below it, and fill in all the dates up to
present day minus 1? Meaning from the last date until "yesterday". Of course
"yesterday" would depend on when it is run.

This way, we could run the script to fill in all our missing data
 
J

Jason

Ron,

Thanks, I like it alot.

If you look at my reply to Norie. I setup a way to repeat it for a list of
dates, but could not do that with yours...I am sure I missing something
besides just changing c=A2 to:

RowCount = 1
Set c = Range("A" & RowCount)
RowCount = RowCount + 1

Also, if you notice from my reply to Norie, I now want to add a way to fill
in column A

Thanks!

Jason
(e-mail address removed)
 
R

Ron Rosenfeld

Ron,

Thanks, I like it alot.

If you look at my reply to Norie. I setup a way to repeat it for a list of
dates, but could not do that with yours...I am sure I missing something
besides just changing c=A2 to:

RowCount = 1
Set c = Range("A" & RowCount)
RowCount = RowCount + 1

Also, if you notice from my reply to Norie, I now want to add a way to fill
in column A

Thanks!

Jason

It's relatively easy, but perhaps not intuitive, to extend my macro to handle a
range of cells:

=================================
Option Explicit
Sub gettemps()
Dim c As Range, rng As Range
Dim j As Long
Dim myStr As String
Dim test As Variant
Dim Val As Variant
Dim IE As Object
Const sURL1 As String = "http://www.wunderground.com/history/airport/KNYC/"
Dim sURLdate As String
Const sURL2 As String = "/DailyHistory.html?MR=1"

Application.Cursor = xlWait
Set IE = CreateObject("InternetExplorer.Application")

Set rng = Range("A2").CurrentRegion
Set rng = rng.Resize(rng.Rows.Count, 1)

For Each c In rng
sURLdate = Format(c.Value2, "yyyy/m/d")
IE.Navigate sURL1 & sURLdate & sURL2
While IE.ReadyState <> 4
DoEvents
Wend
myStr = IE.Document.body.innerhtml

c.Offset(0, 1).Value = RegexMid(myStr, "max")
c.Offset(0, 2).Value = RegexMid(myStr, "min")
c.Offset(0, 3).Value = RegexMid(myStr, "mean")
j = j + 1
Next c

Application.Cursor = xlDefault
End Sub

Private Function RegexMid(s As String, sTempType As String) As String
Dim re As Object, mc As Object
Set re = CreateObject("vbscript.regexp")
re.IgnoreCase = True
re.MultiLine = True
re.Pattern = "\b" & sTempType & "\b\D+(\d+)"

If re.test(s) = True Then
Set mc = re.Execute(s)
RegexMid = mc(0).submatches(0)
End If
End Function
=========================================

But this just reads in each page one at a time. Whether it is worthwhile to do
the coding to try to download the custom range of dates from the site, and then
parse out the appropriate data, depends on how often you'll be downloading
multiple dates, and how many dates each time.
--ron
 
J

Jason

It works great!

However, once I saw it run, I tried to run it in an existing spreadsheet
where the first cell would be A3838 instead of A2 and that did not go too
well...

Right now, I'm just trying to speed it up, which you are helping with
immensely. I'm sure the next step would be to put all this into a database
and have it run automatically daily.
 
N

Norie

Jason

I'm not quite sure what you are asking and it seems actually be more to do
with inserting
dates into a worksheet.

The code I posted assumes you have dates starting in row 2, column A.

It goes down the column, gets the data, inserts it in the relevant columns
(B,C and D) and then
moves on.

It stops when a blank cell is found.

So in theory if you have all the dates you want, in whatever order it should
work.

If you need help filling in dates there are various ways to do that but they
really depend on what you've currently
got and what you want to end up with.

Perhaps a topic for another thread.

As for just getting the number, shouldn't be a problem - I'll see if I can
do something.

Would you like it in Fahrenheit or Celsius?:)

PS If you just need the number for calculations then you could probably do
that without
code, a simple formula should suffice.
 
R

Ron Rosenfeld

It works great!

Good to hear! Thanks for the feedback.
However, once I saw it run, I tried to run it in an existing spreadsheet
where the first cell would be A3838 instead of A2 and that did not go too
well...

You really need to be specific. And you will probably find that by doing so,
and by clearly elaborating your problem, a solution will become apparent.

I'm guessing that the "did not go too well" problem perhaps has to do with
repetitive calls to the web page? If that is the case, then you probably are
not properly setting rng. But with limited information, it's hard to advise
further.
Right now, I'm just trying to speed it up, which you are helping with
immensely.

I suppose by "it", you mean the process of entering these dates.

I'm sure the next step would be to put all this into a database
and have it run automatically daily.

If you are going to run it every day, then calling the web page as I've done
should be adequate. If you were going to run it weekly or monthly, then a call
to the web page that downloads a customizable series of dates might be better.
--ron
 
J

Jason

Ron,

I changed
Range("A2").CurrentRegion
to
Range("A3838").CurrentRegion

and it started putting temperatures in B1, C1, D1, and then on down instead
of B3838, C3838, D3838 and then down.

My last request, is more of a wish, and not an urgent need, but I want to
setup a sheet with a list of city airport codes, for example:
Sheet name "Airports"
In "Airports" Cells B1-B25 would contain a list of 25 airport codes in a
specific order
A modified version of your existing script would then
-take the airport code in cell B1, likely NYC as we have already done
-insert it into the website address for sURL1
http://www.wunderground.com/history/airport/k
-run your script, inserting the temps in B*,C*,D*
-then read the next airport from "Airports" B2, likely LAX
-modify the sURL1 to
-http://www.wunderground.com/history/airport/kLAX
-insert the temperatures into E*,F*,G*

Essentially column A is the date, then next 3 columns are the Max, Min, Mean
for City 1, then the next 3 columns are the Max, Min, Mean for City 2 and so
on...

I think what we have now is already leaps and bounds ahead of the manual
copying and pasting. So this will definitely work for us, if we need more I
am sure I will post those needs under a new thread.

However, I would like to know in your script how you got it to look at the
website and pull the data...if you wouldn't mind adding a couple comments to
your code that would great. This way, I can hopefully learn a little instead
of just copying and pasting your work.

Thank you immensely,

Jason
 
R

Ron Rosenfeld

Ron,

I changed
Range("A2").CurrentRegion
to
Range("A3838").CurrentRegion

and it started putting temperatures in B1, C1, D1, and then on down instead
of B3838, C3838, D3838 and then down.

And what, exactly, is the range that results from that command? Take a look at
HELP for the CurrentRegion property and you may be able to figure out for
yourself what's going on and how to fix it. Or else you'll have to provide
more details of your setup.
My last request, is more of a wish, and not an urgent need, but I want to
setup a sheet with a list of city airport codes, for example:
Sheet name "Airports"
In "Airports" Cells B1-B25 would contain a list of 25 airport codes in a
specific order
A modified version of your existing script would then
-take the airport code in cell B1, likely NYC as we have already done
-insert it into the website address for sURL1
http://www.wunderground.com/history/airport/k
-run your script, inserting the temps in B*,C*,D*
-then read the next airport from "Airports" B2, likely LAX
-modify the sURL1 to
-http://www.wunderground.com/history/airport/kLAX
-insert the temperatures into E*,F*,G*

Essentially column A is the date, then next 3 columns are the Max, Min, Mean
for City 1, then the next 3 columns are the Max, Min, Mean for City 2 and so
on...

That should be easy to do. But perhaps that is something you'd rather work on?
I know I learn a lot by doing...

I think what we have now is already leaps and bounds ahead of the manual
copying and pasting. So this will definitely work for us, if we need more I
am sure I will post those needs under a new thread.

However, I would like to know in your script how you got it to look at the
website and pull the data...if you wouldn't mind adding a couple comments to
your code that would great. This way, I can hopefully learn a little instead
of just copying and pasting your work.

That is a matter of knowing a little (unfortunately too little) about the
InternetExplorer object (see

http://msdn.microsoft.com/en-us/library/aa752084(VS.85).aspx

Constructing the proper URL to download the web page
This was done by examining the URL when I had navigated to the web page
in question.

Examining the HTML source to try to figure out an unambiguous way of defining
the data I wanted to obtain, using the VBScript Regular Expression engine.

The URL parts all have "URL" as part of their name.

The Regular Expression pattern parts all have "Pat" in their name.


Hope this helps.


Thank you immensely,

Jason
--ron
 
P

Patrick Mc

You want to extract some numbers from a web page. The web page is
http://www.wunderground.com/history/airport/KNYC/2008/7/2/DailyHistory.h
tml?MR=1

In this case, it is for NYC on July 2, 2008 (from the
...NYC/2008/7/2/... in
the address)

You want to extract:

85 is the Max temperatue
76 is the mean
67 is the low temperature.


Here is a quick script in biterscripting.

# Script WeatherHistory.txt
var str html ; cat
"http://www.wunderground.com/history/airport/KNYC/2008/7/2/DailyHistory.
html?MR=1" > $html
while ( { sen -c "^<span class=\"b\">^" $html } > 0 )
do
stex -c "^<span class=\"b\">^]" $html > null
stex -c "]^</span>^" $html
done


Script is in biterscripting. I tested this script. It works. (The
markers are <span class="b">...</span>.) To try,

1. Save the above script as C:\WeatherHistory.txt.

2. Download biterscripting from http://www.biterscripting.com.

3. Start biterscripting. Enter the following command.

script "C:\WeatherHistory.txt"

You will get the numbers you want. I am seeing the following.

76
75
85
..
..
..



Email me if you have questions.

Patrick
 
J

Jason

Ron,

Just wanted to say thanks for your help. I had to leave town for a few
days, but will work on this and hopefully get the results I need.

Thanks!
 
R

Ron Rosenfeld

Ron,

Just wanted to say thanks for your help. I had to leave town for a few
days, but will work on this and hopefully get the results I need.

Thanks!

You're very welcome. Thanks for the feedback. Let me know what you need.
--ron
 
J

Jason

Ron,

Okay, I'm back...

I was able to actually take your code and reference the sheet with the 25
airport codes and add the data for each code in 3 columns.

Of course, then I got greedy and thought I wanted something faster. So I
tried taking your code and getting it to work for the Custom Date Range so
that data could be entered in much quicker.

I'm still learning, so I just want to have it work for 1 city and a hardset
date range. I think I can get it to repeat for the different airports later.

For this example, I am trying to run for Houston from 9/1/2008 to 6/28/2009.
I have setup a series of dates in column A of my Excel spreadsheet,
beginning with 9/1/2008 in A1 and daily increments down. I have claimed a
small victory in that when run, it does go the site and grab the correct max,
min and mean from the list.

The problem is that it enters the max, min, and mean for 9/1/2008 as the
temperatures for every day following. Hopefully it is something simple, that
will allow it to move down to the next date and grab that temperature.

Hopefully you can understand my explanation, if not, you can simply enter a
list of dates from 9/1/2008 to 9/10/2008 in A1:A10 and run the code below and
see the repeat.

As always thanks!


Option Explicit
Sub getCityTemps()
Dim AirCode As Range, ACrng As Range
Dim c As Range, rng As Range
Dim j As Long
Dim sURLairport As String
Dim myStr As String
Dim test As Variant
Dim Val As Variant
Dim IE As Object

'Set ACrng = Sheets("City_Airport").Range("B2:B26")
'For Each AirCode In ACrng
' sURLairport = AirCode
Const sURL1 As String =
"http://www.wunderground.com/history...009&req_city=NA&req_state=NA&req_statename=NA"
'Const sURL2 As String = "/"
Dim sURLdate As String
'Const sURL3 As String = "/DailyHistory.html?MR=1"

Application.Cursor = xlWait
Set IE = CreateObject("InternetExplorer.Application")


IE.Navigate sURL1
While IE.ReadyState <> 4
DoEvents
Wend
myStr = IE.Document.body.innerhtml

Set rng = Range("A2").CurrentRegion
Set rng = rng.Resize(rng.Rows.Count, 1)
For Each c In rng
sURLdate = Format(c.Value2, "yyyy/m/d")
c.Offset(0, 1).Value = RegexMid(myStr, "bl gb")
c.Offset(0, 2).Value = RegexMid(myStr, "br gb")
c.Offset(0, 3).Value = RegexMid(myStr, "class=gb")
j = j + 1
Next c



'i = i + 3

'Next AirCode

Application.Cursor = xlDefault
End Sub

Private Function RegexMid(s As String, sTempType As String) As String
Dim re As Object, mc As Object
Set re = CreateObject("vbscript.regexp")
re.IgnoreCase = True
re.MultiLine = True
re.Pattern = "\b" & sTempType & "\b\D+(\d+)"

If re.test(s) = True Then
Set mc = re.Execute(s)
RegexMid = mc(0).submatches(0)
End If
End Function
 
R

Ron Rosenfeld

Of course, then I got greedy and thought I wanted something faster. So I
tried taking your code and getting it to work for the Custom Date Range so
that data could be entered in much quicker.

I looked at that a bit when I was coming up with my original recommendations.
But I did not pursue the Custom Date Range approach. I don't know when I'll
have a chance to look at that in depth, though. This may be a busy week for
me, but if I have time ...

If I recall correctly, one problem is that the identifiers you are using:

"bl gb"
"br gb"
"class=gb"

may not uniquely identify that which you are looking for.
--ron
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top