Parse HMTL source to fill in cells?

J

Jason

Ron,

No worries, thanks!

Yeah, you are right about the un-unique identifiers. I am trying to come up
with a way to iterate the code below for each date then look for those
identifiers. I will take a look at the original link you sent a while back
on the IE Object.

<td><a href="/history/airport/KHOU/2008/9/1/DailyHistory.html">1</a></td>
<td class="bl gb">
91
</td>
<td class="gb">
84
</td>
<td class="br gb">
76
</td>
 
R

Ron Rosenfeld

Ron,

No worries, thanks!

Yeah, you are right about the un-unique identifiers. I am trying to come up
with a way to iterate the code below for each date then look for those
identifiers. I will take a look at the original link you sent a while back
on the IE Object.

<td><a href="/history/airport/KHOU/2008/9/1/DailyHistory.html">1</a></td>
<td class="bl gb">
91
</td>
<td class="gb">
84
</td>
<td class="br gb">
76
</td>

At least in a single test, I believe the following modifications will uniquely
identify the required segments. I suspect there are easier ways to do this,
but ...

Note that I changed the pattern; I also changed the RegexMid function and added
some "clean-up" to both the main Sub and the private Sub.

(If you don't explicitly quit IE, you wind up with multiple IE processes
running; and it will eventually crash. In some other program, the limit was
about sixteen).

Anyway try this:

====================================
Option Explicit
Sub getCityTemps()
Dim AirCode As Range, ACrng As Range
Dim c As Range, rng As Range
Dim j As Long
Dim sURLairport As String
Dim myStr As String
Dim test As Variant
Dim Val As Variant
Dim IE As Object

'Set ACrng = Sheets("City_Airport").Range("B2:B26")
'For Each AirCode In ACrng
' sURLairport = AirCode
Const sURL1 As String = _
"http://www.wunderground.com/history...009&req_city=NA&req_state=NA&req_statename=NA"
'Const sURL2 As String = "/"
Dim sURLdate As String
'Const sURL3 As String = "/DailyHistory.html?MR=1"

Application.Cursor = xlWait
Set IE = CreateObject("InternetExplorer.Application")

IE.Navigate sURL1
While IE.ReadyState <> 4
DoEvents
Wend
myStr = IE.Document.body.innerhtml

Set rng = Range("A2").CurrentRegion
Set rng = rng.Resize(rng.Rows.Count, 1)

For Each c In rng
sURLdate = Format(c.Value2, "yyyy/m/d")
c.Offset(0, 1).Value = RegexMid(myStr, sURLdate, "bl gb")
c.Offset(0, 2).Value = RegexMid(myStr, sURLdate, "br gb")
c.Offset(0, 3).Value = RegexMid(myStr, sURLdate, "class=gb")
Next c


'i = i + 3

'Next AirCode
IE.Quit
Set IE = Nothing
Application.Cursor = xlDefault
End Sub
Private Function RegexMid(s As String, sDate As String, sTempType As String) _
As String
Dim re As Object, mc As Object
Set re = CreateObject("vbscript.regexp")
re.IgnoreCase = True
re.MultiLine = True
re.Global = True
re.Pattern = "\b" & sDate & "/DailyHistory[\s\S]+?" & sTempType _
& "\D+(\d+)"

If re.test(s) = True Then
Set mc = re.Execute(s)
RegexMid = mc(0).submatches(0)
End If
Set re = Nothing
End Function
====================================
--ron
 
J

Jason

Ron,

Worked like a champ! I was able to edit to include referencing the airport
codes and get it to run for all the cities needed.

I am still a bit confused on why I cannot get it to run on a worksheet that
say, already has a couple hundred days of information listed. It will work
when I want it to start with Row 1 and then go down, but if I wanted to start
at Row 250 and go down it will not work. I have tried adjust rng and
researching CurrentRegion, but keep getting stuck.

Don't worry too much about this since it is not a big issue, just more of
curiosity on my part. The existing code works great, since we can just run it
in a blank sheet and then copy and and paste it into our existing data.

THANKS!


Ron Rosenfeld said:
Ron,

No worries, thanks!

Yeah, you are right about the un-unique identifiers. I am trying to come up
with a way to iterate the code below for each date then look for those
identifiers. I will take a look at the original link you sent a while back
on the IE Object.

<td><a href="/history/airport/KHOU/2008/9/1/DailyHistory.html">1</a></td>
<td class="bl gb">
91
</td>
<td class="gb">
84
</td>
<td class="br gb">
76
</td>

At least in a single test, I believe the following modifications will uniquely
identify the required segments. I suspect there are easier ways to do this,
but ...

Note that I changed the pattern; I also changed the RegexMid function and added
some "clean-up" to both the main Sub and the private Sub.

(If you don't explicitly quit IE, you wind up with multiple IE processes
running; and it will eventually crash. In some other program, the limit was
about sixteen).

Anyway try this:

====================================
Option Explicit
Sub getCityTemps()
Dim AirCode As Range, ACrng As Range
Dim c As Range, rng As Range
Dim j As Long
Dim sURLairport As String
Dim myStr As String
Dim test As Variant
Dim Val As Variant
Dim IE As Object

'Set ACrng = Sheets("City_Airport").Range("B2:B26")
'For Each AirCode In ACrng
' sURLairport = AirCode
Const sURL1 As String = _
"http://www.wunderground.com/history...009&req_city=NA&req_state=NA&req_statename=NA"
'Const sURL2 As String = "/"
Dim sURLdate As String
'Const sURL3 As String = "/DailyHistory.html?MR=1"

Application.Cursor = xlWait
Set IE = CreateObject("InternetExplorer.Application")

IE.Navigate sURL1
While IE.ReadyState <> 4
DoEvents
Wend
myStr = IE.Document.body.innerhtml

Set rng = Range("A2").CurrentRegion
Set rng = rng.Resize(rng.Rows.Count, 1)

For Each c In rng
sURLdate = Format(c.Value2, "yyyy/m/d")
c.Offset(0, 1).Value = RegexMid(myStr, sURLdate, "bl gb")
c.Offset(0, 2).Value = RegexMid(myStr, sURLdate, "br gb")
c.Offset(0, 3).Value = RegexMid(myStr, sURLdate, "class=gb")
Next c


'i = i + 3

'Next AirCode
IE.Quit
Set IE = Nothing
Application.Cursor = xlDefault
End Sub
Private Function RegexMid(s As String, sDate As String, sTempType As String) _
As String
Dim re As Object, mc As Object
Set re = CreateObject("vbscript.regexp")
re.IgnoreCase = True
re.MultiLine = True
re.Global = True
re.Pattern = "\b" & sDate & "/DailyHistory[\s\S]+?" & sTempType _
& "\D+(\d+)"

If re.test(s) = True Then
Set mc = re.Execute(s)
RegexMid = mc(0).submatches(0)
End If
Set re = Nothing
End Function
====================================
--ron
 
R

Ron Rosenfeld

Ron,

Worked like a champ! I was able to edit to include referencing the airport
codes and get it to run for all the cities needed.

I am still a bit confused on why I cannot get it to run on a worksheet that
say, already has a couple hundred days of information listed. It will work
when I want it to start with Row 1 and then go down, but if I wanted to start
at Row 250 and go down it will not work. I have tried adjust rng and
researching CurrentRegion, but keep getting stuck.

Don't worry too much about this since it is not a big issue, just more of
curiosity on my part. The existing code works great, since we can just run it
in a blank sheet and then copy and and paste it into our existing data.

THANKS!

You're very welcome. Glad to help. Thanks for the feedback.

Hopefully, the HTML code won't change in such a way as to break your routine.

I suspect your problem is here, or whatever is the equivalent code for your
real data.

-------------------
Set rng = Range("A2").CurrentRegion
Set rng = rng.Resize(rng.Rows.Count, 1)

For Each c In rng
---------------------

Something like the following would look for the first row that is not
completely filled in, and set rg = to start at that row, and continue down as
far as you have entered dates.

For example, if you have dates entered A2:A30; and data in B2:D11, then rg
would be set to A12:A30.

This routine assumes that there is nothing in Column E or below row 30 that
might extend CurrentRegion.

You'd have to set this up for your own data range. You also need to make sure
that the initial setting of rg is only where your data might be. But there are
a variety of ways to show that.

============================
Option Explicit
Sub SetRange()
Dim rg As Range, c As Range
Dim StartRow As Long
Set rg = Range("A2").CurrentRegion
'look for last filled in row
Set c = rg.SpecialCells(xlCellTypeBlanks).Areas(1)
Set c = c.Resize(1, 1)

StartRow = c.Row - rg.Row

Set rg = rg.Offset(rowoffset:=StartRow).Resize( _
rowsize:=rg.Rows.Count - StartRow, columnsize:=1)

Debug.Print rg.Address

End Sub
===============================
--ron
 
J

Jason

I guess this wear my newness comes in. Am I supposed to run this sub, then
run the original code or add this in?

I've tried running this new sub then running the original code, but it seems
to clear out the first 3 columns. Again, I know we are close and this is not
major, but it is more of a learning thing now. It's not urgent, but if you
would prefer me to just email you the spreadsheet, I can do that as well if
that would be quicker.
 
R

Ron Rosenfeld

I guess this wear my newness comes in. Am I supposed to run this sub, then
run the original code or add this in?

I've tried running this new sub then running the original code, but it seems
to clear out the first 3 columns. Again, I know we are close and this is not
major, but it is more of a learning thing now. It's not urgent, but if you
would prefer me to just email you the spreadsheet, I can do that as well if
that would be quicker.

It's just an example of code you can use to set rg to the range you want to
process. You would include it in your own code, modified to appropriately
select your range. You will need to study and understand what it is doing,
though, in order to modify it.
--ron
 
J

Jason

Okay, I'll tinker with the range and hopefully get it to work. I believe the
problem lies in some blank cells above the data where the header and title
information are located.

I've already managed to use the code to get ideas for other tasks and had
one final question in regards to Reg Exp syntax.

I modified the Pattern you included in the original code to grab a number
from a different site.

re.Pattern = "\b" & "nbsp;" & sYear & "[\s\S]+?" & sMonthTotal & "\D+(\d+)"

Where the sMonthTotal is a monthly total number

However the number includes commas and so the modified pattern only grabs
the numbers before the first comma.
Example:
-The number is 13,456,876 then the next line is 457,754
-The code grabs 13 and 457 for the next one

Thanks again! I am working with it and have no doubt I will benefit greatly
from your expertise.
 
J

Jason

Ron,

Okay I got the range to accurately work. The problem was the first 2 rows
had merged cells in them. If I delete the merged cells it works fine. Which
I can make work.

Is there a way to get the search for the first blank row to ignore merged
cells or not start the search until Row 4? I tried changing the Range"A2" to
Range "A4", Range "A10" etc, but no luck.

Thanks!

Jason
 
R

Ron Rosenfeld

Ron,

Okay I got the range to accurately work. The problem was the first 2 rows
had merged cells in them. If I delete the merged cells it works fine. Which
I can make work.

Is there a way to get the search for the first blank row to ignore merged
cells or not start the search until Row 4? I tried changing the Range"A2" to
Range "A4", Range "A10" etc, but no luck.

Thanks!

Jason


I'll look into that. Merged cells, in general are a PITA.
--ron
 
R

Ron Rosenfeld

Okay, I'll tinker with the range and hopefully get it to work. I believe the
problem lies in some blank cells above the data where the header and title
information are located.

I've already managed to use the code to get ideas for other tasks and had
one final question in regards to Reg Exp syntax.

I modified the Pattern you included in the original code to grab a number
from a different site.

re.Pattern = "\b" & "nbsp;" & sYear & "[\s\S]+?" & sMonthTotal & "\D+(\d+)"

Where the sMonthTotal is a monthly total number

However the number includes commas and so the modified pattern only grabs
the numbers before the first comma.
Example:
-The number is 13,456,876 then the next line is 457,754
-The code grabs 13 and 457 for the next one

Thanks again! I am working with it and have no doubt I will benefit greatly
from your expertise.

I have not tested this, but you might try using "[\d,]+" in place of the "\d".
It is not as robust as a regex that looks for proper comma placement, but after
an adult beverage, it might be good enough.
--ron
 
R

Ron Rosenfeld

Ron,

Okay I got the range to accurately work. The problem was the first 2 rows
had merged cells in them. If I delete the merged cells it works fine. Which
I can make work.

Is there a way to get the search for the first blank row to ignore merged
cells or not start the search until Row 4? I tried changing the Range"A2" to
Range "A4", Range "A10" etc, but no luck.

Thanks!

Jason

Here's one way to start at Row 4, even if there is confusing stuff in rows 1-3:

==========================
Sub SetRange()
Dim rg As Range, c As Range
Const FirstValidRow As Long = 4
Dim StartRow As Long
Set rg = Range("A2").CurrentRegion
Set rg = rg.Resize(rowsize:=rg.Rows.Count - _
FirstValidRow + rg.Row)
Set rg = rg.Offset(rowoffset:=FirstValidRow - rg.Row)
'look for last filled in row
Set c = rg.SpecialCells(xlCellTypeBlanks).Areas(1)
Set c = c.Resize(1, 1)

StartRow = c.Row - rg.Row

Set rg = rg.Offset(rowoffset:=StartRow).Resize( _
rowsize:=rg.Rows.Count - StartRow, columnsize:=1)

Debug.Print rg.Address

End Sub
==============================

So far as ignoring merged cells, and whether that would be effective, it could
be done. How to do it would depend on your precise layout. But there are VBA
properties and methods available to deal with merged cells.
--ron
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top