VBA Regular Expressions & URL Extraction


A

Akrobrat

Greetings all,

I am trying to extract the URLs of a set of animated movies off
various sites using regular expressions and then dump those URLs into
an Excel document (via VBA). I have a decent grasp of regex but I
have hit a brick wall lately with a particular site. I have
experimented with a number of patterns but cannot yet get the correct
result.

The expected result is:
However, if I do get a non-null result back, it is usually:

---------------------- Sample Patterns Tested:
----------------------
..Pattern = "\<a\s+href=\W?(.*?)\W?\s?class=\W?prodlink\W?"
..Pattern = "\<a\s+href=""([A-Za-z0-9/;&\.\?\+-=]+)""\s+class"
..Pattern = "\<a\s+href=\W?(.*?)\W?\s?class=\W?\w\W?"



---------------------- Partial Source Data (from website):
----------------------

<div class="logo">
<a href="http://www.bestbuy.com/site/olspage.jsp?
type=category&id=cat00000" name="&lid=hdr_logo"><img src="http://
images.bestbuy.com:80/BestBuy_US/en_US/images/global/header/logo.gif"
alt="Best Buy Logo"/></a>
</div>

<td class="skucontent">

<a href="/site/olspage.jsp?skuId=8936896&amp;st=Transformers
+Widescreen&amp;type=product&amp;id=1754542" class="prodlink">
Transformers - Widescreen Dubbed Subtitle AC3</a><br/>

---------------------- ---------------------- ----------------------

I'm most interested in utilizing the [class="prodlink"] string as this
is the tag that labels a movie URL. I know that regex in VBA can be a
bit tricky owing to the use of double quotes and other non-alpha
characters, but can any of you guys spot what I'm doing wrong? Thanks
for your help!
 
J

jaf

Hi,
If.
Let me stress that.
IF this were a hyperlink on a webpage "/site/olspage.jsp" would be pointing to the subfolder "/site" and the file "olspage.jsp" on
the WEBSERVER and would function.

As a hyperlink in Excel this will not work without the http://www. unless you have a folder on your hard drive named "/site" and
Excel can find it without the rest of the path.

John


Akrobrat said:
Greetings all,

I am trying to extract the URLs of a set of animated movies off
various sites using regular expressions and then dump those URLs into
an Excel document (via VBA). I have a decent grasp of regex but I
have hit a brick wall lately with a particular site. I have
experimented with a number of patterns but cannot yet get the correct
result.

The expected result is:
However, if I do get a non-null result back, it is usually:

---------------------- Sample Patterns Tested:
----------------------
.Pattern = "\<a\s+href=\W?(.*?)\W?\s?class=\W?prodlink\W?"
.Pattern = "\<a\s+href=""([A-Za-z0-9/;&\.\?\+-=]+)""\s+class"
.Pattern = "\<a\s+href=\W?(.*?)\W?\s?class=\W?\w\W?"



---------------------- Partial Source Data (from website):
----------------------

<div class="logo">
<a href="http://www.bestbuy.com/site/olspage.jsp?
type=category&id=cat00000" name="&lid=hdr_logo"><img src="http://
images.bestbuy.com:80/BestBuy_US/en_US/images/global/header/logo.gif"
alt="Best Buy Logo"/></a>
</div>

<td class="skucontent">

<a href="/site/olspage.jsp?skuId=8936896&amp;st=Transformers
+Widescreen&amp;type=product&amp;id=1754542" class="prodlink">
Transformers - Widescreen Dubbed Subtitle AC3</a><br/>

---------------------- ---------------------- ----------------------

I'm most interested in utilizing the [class="prodlink"] string as this
is the tag that labels a movie URL. I know that regex in VBA can be a
bit tricky owing to the use of double quotes and other non-alpha
characters, but can any of you guys spot what I'm doing wrong? Thanks
for your help!
 
Ad

Advertisements

R

Ron Rosenfeld

Greetings all,

I am trying to extract the URLs of a set of animated movies off
various sites using regular expressions and then dump those URLs into
an Excel document (via VBA). I have a decent grasp of regex but I
have hit a brick wall lately with a particular site. I have
experimented with a number of patterns but cannot yet get the correct
result.

The expected result is:
However, if I do get a non-null result back, it is usually:

---------------------- Sample Patterns Tested:
----------------------
.Pattern = "\<a\s+href=\W?(.*?)\W?\s?class=\W?prodlink\W?"
.Pattern = "\<a\s+href=""([A-Za-z0-9/;&\.\?\+-=]+)""\s+class"
.Pattern = "\<a\s+href=\W?(.*?)\W?\s?class=\W?\w\W?"



---------------------- Partial Source Data (from website):
----------------------

<div class="logo">
<a href="http://www.bestbuy.com/site/olspage.jsp?
type=category&id=cat00000" name="&lid=hdr_logo"><img src="http://
images.bestbuy.com:80/BestBuy_US/en_US/images/global/header/logo.gif"
alt="Best Buy Logo"/></a>
</div>

<td class="skucontent">

<a href="/site/olspage.jsp?skuId=8936896&amp;st=Transformers
+Widescreen&amp;type=product&amp;id=1754542" class="prodlink">
Transformers - Widescreen Dubbed Subtitle AC3</a><br/>

---------------------- ---------------------- ----------------------

I'm most interested in utilizing the [class="prodlink"] string as this
is the tag that labels a movie URL. I know that regex in VBA can be a
bit tricky owing to the use of double quotes and other non-alpha
characters, but can any of you guys spot what I'm doing wrong? Thanks
for your help!

Try this:

===================
Function MovieURL(str As String) As String
Dim myRegExp, myMatches, ResultString
Set myRegExp = New RegExp
myRegExp.IgnoreCase = True
myRegExp.Pattern = "href=""([^""]+)""\s*class=""prodlink"""
Set myMatches = myRegExp.Execute(str)
If myMatches.Count >= 1 Then
MovieURL = myMatches(0).SubMatches(0)
End If
End Function
=====================
--ron
 
R

Ron Rosenfeld

---------------------- Sample Patterns Tested:
I did not tease this apart, but:

To include " in VBA, you need to "double" them.

The angle bracket is not a meta-character and the initial "\" should be
removed. (Unless you are trying to match a beginning of word boundary, in
which case you could use "\b(?=\w)" )

Dot will not match newline in VBScript or JavaScript flavors. Since your data
extends over several lines, it is likely that your "." is not doing what you
expect.
--ron
 
A

Akrobrat

Jaf,

Good point. I do indeed concatenate "http://..." with the partial URL
and that works provided that I extract the correct URL. Thanks.

- Dilan
 
R

Ron Rosenfeld

===================
Function MovieURL(str As String) As String
Dim myRegExp, myMatches, ResultString
Set myRegExp = New RegExp
myRegExp.IgnoreCase = True
myRegExp.Pattern = "href=""([^""]+)""\s*class=""prodlink"""
Set myMatches = myRegExp.Execute(str)
If myMatches.Count >= 1 Then
MovieURL = myMatches(0).SubMatches(0)
End If
End Function
=====================
A bit simpler, but works on your example:

===================
Option Explicit
Function MovieURL(str As String) As String
Dim myRegExp As Object, myMatches As Object
Set myRegExp = CreateObject("vbscript.regexp")
myRegExp.Pattern = "[^""]+(?=""\s*class=""prodlink"")"
If myRegExp.Test(str) = True Then
Set myMatches = myRegExp.Execute(str)
MovieURL = myMatches(0).Value
End If
End Function
=====================

By the way, both of my variants depend on there not being quotes (") within the
extracted portion.
--ron
 
A

Akrobrat

Ron,

Thanks for the info. As you suggested, I tried:
..Pattern = "href=""([^""]+)""\s*class=""prodlink"""
.... but it didn't work. I too suspected the dot might be a source of
trouble, but I've implemented it successfully on other movie sites.

So that I'm not providing false assumptions, here is my code in
context. Note that I typically ignore the case and avoid a global
search because I only want the first accurate match - which is my core
problem. How do I get the regex pattern to match the HREF tag when it
contains the CLASS="prodlink" tag? I realize the code could be neater
(and I appreciate any advice), but all I want to do right now is
develop an accurate pattern.


--------------------------------------

Private Sub getBestBuy(strItem As String, lngRow As Long, IE As
InternetExplorer)

Dim RegexURL As RegExp
Dim RegexMatch As MatchCollection
Dim strURL As String

Set RegexURL = New RegExp

With RegexURL
.MultiLine = True
.IgnoreCase = True
.Global = False
'The following pattern was recommended by Ron from Google
Groups
.Pattern = "<a\s*href=""([^""]+)""\s*class=""prodlink"""
End With

With IE
.Visible = True
'Replace spaces with plus signs in URL as per BB's approach
.Navigate "http://www.bestbuy.com/site/olspage.jsp?
_dyncharset=ISO-8859-1&id=pcat17071&type=page&st=" & Replace(strItem,
" ", "+") & "&sc=Global&cp=1&nrp=15&sp=&qp=&list=n&iht=y&usc=All
+Categories&ks=960"
End With

'Load the page completely
Do Until IE.ReadyState = READYSTATE_COMPLETE And IE.Busy = False
Loop

'Get the URL of the first listed item, which is usually the
most accurate based on the keyword search
strPageContent = IE.Document.body.innerHTML

If RegexURL.Test(strPageContent) Then
Set RegexMatch = RegexURL.Execute(strPageContent)
Range("K" & lngRow).Value = "http://www.bestbuy.com" &
RegexMatch(0).SubMatches(0)
End If

'//--------------- Rest of code continues here ---------------//

End Sub
 
R

Ron Rosenfeld

Greetings all,

I am trying to extract the URLs of a set of animated movies off
various sites using regular expressions and then dump those URLs into
an Excel document (via VBA). I have a decent grasp of regex but I
have hit a brick wall lately with a particular site. I have
experimented with a number of patterns but cannot yet get the correct
result.

The expected result is:
However, if I do get a non-null result back, it is usually:

---------------------- Sample Patterns Tested:
----------------------
.Pattern = "\<a\s+href=\W?(.*?)\W?\s?class=\W?prodlink\W?"
.Pattern = "\<a\s+href=""([A-Za-z0-9/;&\.\?\+-=]+)""\s+class"
.Pattern = "\<a\s+href=\W?(.*?)\W?\s?class=\W?\w\W?"



---------------------- Partial Source Data (from website):
----------------------

<div class="logo">
<a href="http://www.bestbuy.com/site/olspage.jsp?
type=category&id=cat00000" name="&lid=hdr_logo"><img src="http://
images.bestbuy.com:80/BestBuy_US/en_US/images/global/header/logo.gif"
alt="Best Buy Logo"/></a>
</div>

<td class="skucontent">

<a href="/site/olspage.jsp?skuId=8936896&amp;st=Transformers
+Widescreen&amp;type=product&amp;id=1754542" class="prodlink">
Transformers - Widescreen Dubbed Subtitle AC3</a><br/>

---------------------- ---------------------- ----------------------

I'm most interested in utilizing the [class="prodlink"] string as this
is the tag that labels a movie URL. I know that regex in VBA can be a
bit tricky owing to the use of double quotes and other non-alpha
characters, but can any of you guys spot what I'm doing wrong? Thanks
for your help!

And here's another version that might work a bit better, depending on your
specific requirements. It has no problem with embedded quotes in the URL. This
uses the Replace method to get rid of everything else.

==============================
Option Explicit
Function MovieURL(str As String) As String
Dim re As Object
Set re = CreateObject("vbscript.regexp")
re.Global = True
re.IgnoreCase = True
re.Pattern = _
"[\s\S]*<a\shref=""([\s\S]+)""\s*class=""prodlink""[\s\S]*"
MovieURL = re.Replace(str, "$1")
End Function
==============================
--ron
 
R

Ron Rosenfeld

Thanks for the info. As you suggested, I tried:
.Pattern = "href=""([^""]+)""\s*class=""prodlink"""
... but it didn't work. I too suspected the dot might be a source of
trouble, but I've implemented it successfully on other movie sites.
Hmmm.

What does "didn't work" mean? No match or an error message?

That pattern works fine here using the data you provided, as well as a minor
variation I included in the data.

I assume you have a reference set to "Microsoft VBScript Regular Expressions"
5.5 or else you'd be getting some error message from VBA.

In this context, setting the Multiline and Global parameters is irrelevant, but
harmless.

Perhaps if you posted the full contents of strPageContent I could make some
further suggestions.

Or perhaps you could try the Replace method I posted in a different context.
--ron
 
A

Akrobrat

Hmmm.

What does "didn't work" mean?  No match or an error message?

That pattern works fine here using the data you provided, as well as a minor
variation I included in the data.

I assume you have a reference set to "Microsoft VBScriptRegularExpressions"
5.5 or else you'd be getting some error message fromVBA.

In this context, setting the Multiline and Global parameters is irrelevant, but
harmless.

Perhaps if you posted the full contents of strPageContent I could make some
further suggestions.

Or perhaps you could try the Replace method I posted in a different context.
--ron
Ron,

Sorry, by error I meant that there wasn't a match for either of the
examples. In fact, the last Pattern suggestion stalled Excel / VBA.

I do have a reference set to MS VBScript Regex 5.5, and I don't know
if this matters much, but I use a Windows XP SP2 machine.
Instead of the posting the contents of strPageContent (very long),
here is the URL to the webpage that it contains:
http://www.bestbuy.com/site/olspage.jsp?_dyncharset=ISO-8859-1&id=pcat17071&type=page&st=Transformers+optimus&sc=Global&cp=1&nrp=15&sp=&qp=&list=n&iht=y&usc=All+Categories&ks=960

I'll be AWOL for the next 36 hours but I appreciate your help!
 
R

Ron Rosenfeld

Ron,

Sorry, by error I meant that there wasn't a match for either of the
examples. In fact, the last Pattern suggestion stalled Excel / VBA.

I do have a reference set to MS VBScript Regex 5.5, and I don't know
if this matters much, but I use a Windows XP SP2 machine.
Instead of the posting the contents of strPageContent (very long),
here is the URL to the webpage that it contains:
http://www.bestbuy.com/site/olspage.jsp?_dyncharset=ISO-8859-1&id=pcat17071&type=page&st=Transformers+optimus&sc=Global&cp=1&nrp=15&sp=&qp=&list=n&iht=y&usc=All+Categories&ks=960

I'll be AWOL for the next 36 hours but I appreciate your help!
I don't believe the problem is in your regex pattern.

I obtained the data by going to the above URL, executing the View Source
command in FireFox; copying that data, and then processing it within VBA. The
expected URL was returned.

(I had to do the extraction entirely within VBA as the page source is too long
to fit into an Excel cell.

The following worked fine, after copying the data to the clipboard and also
setting a reference to Microsoft Forms 2.0 Object Library:

=============================
Option Explicit
Sub GetClipboardData()
Dim PageSource As String
Dim myData As DataObject
Set myData = New DataObject
myData.GetFromClipboard
PageSource = myData.GetText
Debug.Print MovieURL(PageSource)
End Sub
'---------------------------------------
Function MovieURL(str As String) As String
Dim myRegExp, myMatches, ResultString
Set myRegExp = New RegExp
myRegExp.MultiLine = True
myRegExp.IgnoreCase = True
myRegExp.Global = False
myRegExp.Pattern = "<a\s*href=""([^""]+)""\s*class=""prodlink"""
Set myMatches = myRegExp.Execute(str)
If myMatches.Count >= 1 Then
MovieURL = myMatches(0).SubMatches(0)
End If
End Function
==============================

Of interest, in the particular page you sent me to, there are 11 different URL
snippets that meet your criteria of ending with classid="prodlink"

However, unlike your example, all of them include this string in the middle:

;jsessionid=BGJY2UTT13ORTKC4D3IFAGA

Also, they differ by the skuid.

I'm guessing you don't want the jsessionid information.

A slight change will exclude that from the result:

==============================================
Option Explicit
Sub GetClipboardData()
Dim PageSource As String
Dim myData As DataObject
Set myData = New DataObject
myData.GetFromClipboard

PageSource = myData.GetText

Debug.Print MovieURL(PageSource)
End Sub
'-----------------------------------
Function MovieURL(str As String) As String
Dim myRegExp As RegExp
Dim myMatches As MatchCollection

Set myRegExp = New RegExp
myRegExp.IgnoreCase = True
myRegExp.Pattern = _
"""([^""]+)(;jsessionid[^""]+)(\?[^""]+)(?=""\s*class=""prodlink"")"

If myRegExp.Test(str) = True Then
Set myMatches = myRegExp.Execute(str)
MovieURL = myMatches(0).SubMatches(0) & _
myMatches(0).SubMatches(2)
End If
End Function
====================================
--ron
 
R

Ron Rosenfeld

Greetings all,

I am trying to extract the URLs of a set of animated movies off
various sites using regular expressions and then dump those URLs into
an Excel document (via VBA). I have a decent grasp of regex but I
have hit a brick wall lately with a particular site. I have
experimented with a number of patterns but cannot yet get the correct
result.

The expected result is:
/site/olspage.jsp?skuId=8936896&amp;st=Transformers+Widescreen&amp;type=product&amp;id=1754542
However, if I do get a non-null result back, it is usually:

---------------------- Sample Patterns Tested:
----------------------
.Pattern = "\<a\s+href=\W?(.*?)\W?\s?class=\W?prodlink\W?"
.Pattern = "\<a\s+href=""([A-Za-z0-9/;&\.\?\+-=]+)""\s+class"
.Pattern = "\<a\s+href=\W?(.*?)\W?\s?class=\W?\w\W?"



---------------------- Partial Source Data (from website):
----------------------

<div class="logo">
<a href="http://www.bestbuy.com/site/olspage.jsp?
type=category&id=cat00000" name="&lid=hdr_logo"><img src="http://
images.bestbuy.com:80/BestBuy_US/en_US/images/global/header/logo.gif"
alt="Best Buy Logo"/></a>
</div>

<td class="skucontent">

<a href="/site/olspage.jsp?skuId=8936896&amp;st=Transformers
+Widescreen&amp;type=product&amp;id=1754542" class="prodlink">
Transformers - Widescreen Dubbed Subtitle AC3</a><br/>

---------------------- ---------------------- ----------------------

I'm most interested in utilizing the [class="prodlink"] string as this
is the tag that labels a movie URL. I know that regex in VBA can be a
bit tricky owing to the use of double quotes and other non-alpha
characters, but can any of you guys spot what I'm doing wrong? Thanks
for your help!

And here's another version that might work a bit better, depending on your
specific requirements. It has no problem with embedded quotes in the URL. This
uses the Replace method to get rid of everything else.

==============================
Option Explicit
Function MovieURL(str As String) As String
Dim re As Object
Set re = CreateObject("vbscript.regexp")
re.Global = True
re.IgnoreCase = True
re.Pattern = _
"[\s\S]*<a\shref=""([\s\S]+)""\s*class=""prodlink""[\s\S]*"
MovieURL = re.Replace(str, "$1")
End Function
==============================
--ron
Well, given your data and noting that there are multiple matches, I probably
wouldn't use this method.
--ron
 
R

Rick Rothstein

The expected result is:
Does it have to be a Regular Expression solution? This function will return
the first prodlink site (your expected result above from the Partial Source
Data text you posted)...

Function GetURL(S As String) As String
GetURL = Split(S, "class=""prodlink""", , vbTextCompare)(0)
GetURL = Mid(GetURL, InStrRev(GetURL, "href=") + 6)
GetURL = Replace(Replace(GetURL, vbLf, ""), vbCr, "")
GetURL = Trim(Replace(GetURL, """", ""))
End Function

--
Rick (MVP - Excel)


Akrobrat said:
Greetings all,

I am trying to extract the URLs of a set of animated movies off
various sites using regular expressions and then dump those URLs into
an Excel document (via VBA). I have a decent grasp of regex but I
have hit a brick wall lately with a particular site. I have
experimented with a number of patterns but cannot yet get the correct
result.

The expected result is:
However, if I do get a non-null result back, it is usually:

---------------------- Sample Patterns Tested:
----------------------
.Pattern = "\<a\s+href=\W?(.*?)\W?\s?class=\W?prodlink\W?"
.Pattern = "\<a\s+href=""([A-Za-z0-9/;&\.\?\+-=]+)""\s+class"
.Pattern = "\<a\s+href=\W?(.*?)\W?\s?class=\W?\w\W?"



---------------------- Partial Source Data (from website):
----------------------

<div class="logo">
<a href="http://www.bestbuy.com/site/olspage.jsp?
type=category&id=cat00000" name="&lid=hdr_logo"><img src="http://
images.bestbuy.com:80/BestBuy_US/en_US/images/global/header/logo.gif"
alt="Best Buy Logo"/></a>
</div>

<td class="skucontent">

<a href="/site/olspage.jsp?skuId=8936896&amp;st=Transformers
+Widescreen&amp;type=product&amp;id=1754542" class="prodlink">
Transformers - Widescreen Dubbed Subtitle AC3</a><br/>

---------------------- ---------------------- ----------------------

I'm most interested in utilizing the [class="prodlink"] string as this
is the tag that labels a movie URL. I know that regex in VBA can be a
bit tricky owing to the use of double quotes and other non-alpha
characters, but can any of you guys spot what I'm doing wrong? Thanks
for your help!
 
A

Akrobrat

Ron, my approach with Regular Expression was to capture the *first*
occurence of the HREF tag with the CLASS="prodlink" tag. However, you
might be right in that there's more going on here than expected - no
matter how much I tweak the regex, the returned value is either
incorrect or null. Your pattern suggestions will be useful for other
sites I hope to cover soon.

Rick, I don't have to use regex for this task, but I found that it was
a good solution early on in my project. Your function (with a couple
of modifications to handle the full HTML source) helped me solve the
problem.

Thank you immensely to both of you for helping with this challenge!
 
R

Ron Rosenfeld

Ron, my approach with Regular Expression was to capture the *first*
occurence of the HREF tag with the CLASS="prodlink" tag. However, you
might be right in that there's more going on here than expected - no
matter how much I tweak the regex, the returned value is either
incorrect or null. Your pattern suggestions will be useful for other
sites I hope to cover soon.

Rick, I don't have to use regex for this task, but I found that it was
a good solution early on in my project. Your function (with a couple
of modifications to handle the full HTML source) helped me solve the
problem.

Thank you immensely to both of you for helping with this challenge!
Glad you got it working. But it would be interesting to figure out why the
regex approach isn't working, especially since I was using the same data
source.
--ron
 
A

Akrobrat

Glad you got it working.  But it would be interesting to figure out whythe
regex approach isn't working, especially since I was using the same data
source.
--ron
Ron, I still don't know why the regex worked for you and not for me.
I sit behind a large corporation's firewall and use IE 6.0 that's been
slightly customized for the company's purposes. Could that have
something to do with it? Outside of that, I can't imagine what could
be wrong. Everything else is pretty standard - I use Win XP, Excel
2003 and the Microsoft VBScript Regular Expressions 5.5 library.

Thanks again.
 
R

Ron Rosenfeld

Ron, I still don't know why the regex worked for you and not for me.
I sit behind a large corporation's firewall and use IE 6.0 that's been
slightly customized for the company's purposes. Could that have
something to do with it? Outside of that, I can't imagine what could
be wrong. Everything else is pretty standard - I use Win XP, Excel
2003 and the Microsoft VBScript Regular Expressions 5.5 library.

Thanks again.
I suppose there could be something about the data, but don't know what it is.

I tried again, using mostly the code you had previously posted, with some
variation to use data that had been copied to the clipboard, instead of
obtained via IE, and also to print the result to the immediate window, instead
of the worksheet, and it continues to work OK here.
--ron
 
A

Akrobrat

Ron (and all),

Here's something else I found this morning that might be of interest:
In the following string from an HTML doc:

<p id="height10" class="price-strike2">$2,599.00/EA Each</p>

.... I am unable to match it (aiming for the currency bit) using the
following reduced regular expressions within VBA:

..Pattern = "price-strike2"">(\$[0-9]*,?[0-9]*\.?[0-9]*)"
..Pattern = "price-strike2\W?>(\$[0-9]*,?[0-9]*\.?[0-9]*)"
..Pattern = "strike2"">(\$[0-9]*,?[0-9]*\.?[0-9]*)"
..Pattern = "2\W?>(\$[0-9]*,?[0-9]*\.?[0-9]*)"
..Pattern = "2"">(\$[0-9]*,?[0-9]*\.?[0-9]*)"


Any Regexp tester tells me that all of the above would match the
string. Also, there is only one instance of a tag including the
phrase "strike2" in the entire HTML source, so there can be, at most,
only one match. The currency regexp is fine (I know this from
previous tests) and the pattern preceding it seems to be the root of
the problem. It is not essential that I get a solution to this, but
you can imagine that this is frustrating! =)

Thanks,
Dilan
 
R

Ron Rosenfeld

Ron (and all),

Here's something else I found this morning that might be of interest:
In the following string from an HTML doc:

<p id="height10" class="price-strike2">$2,599.00/EA Each</p>

... I am unable to match it (aiming for the currency bit) using the
following reduced regular expressions within VBA:

.Pattern = "price-strike2"">(\$[0-9]*,?[0-9]*\.?[0-9]*)"
.Pattern = "price-strike2\W?>(\$[0-9]*,?[0-9]*\.?[0-9]*)"
.Pattern = "strike2"">(\$[0-9]*,?[0-9]*\.?[0-9]*)"
.Pattern = "2\W?>(\$[0-9]*,?[0-9]*\.?[0-9]*)"
.Pattern = "2"">(\$[0-9]*,?[0-9]*\.?[0-9]*)"


Any Regexp tester tells me that all of the above would match the
string. Also, there is only one instance of a tag including the
phrase "strike2" in the entire HTML source, so there can be, at most,
only one match. The currency regexp is fine (I know this from
previous tests) and the pattern preceding it seems to be the root of
the problem. It is not essential that I get a solution to this, but
you can imagine that this is frustrating! =)

Thanks,
Dilan
Dilan,

I wonder if you are running into the same problem you did with the previous
Regex you could not get to work. (What I suspect is a problem in your VBA
routines).

Your first regex works fine here, returning the "currency bit" into the first
submatch.

Here is the routine I used with your regex and sample:

============================
Option Explicit
Function GetStrike2(sStr As String)
Dim re As Object, mc As Object
Set re = CreateObject("vbscript.regexp")
re.IgnoreCase = True
re.Pattern = """price-strike2"">(\$[0-9]*,?[0-9]*\.?[0-9]*)"
If re.test(sStr) = True Then
Set mc = re.Execute(sStr)
GetStrike2 = mc(0).submatches(0)
End If
End Function
================================

Against your sample, this returns $2,599.00
--ron
 
Ad

Advertisements

A

Akrobrat

So do you think there's a difference between:
Dim re as Object
Set re = CreateObject("vbscript.regexp")
.... and ...
Dim re as RegExp
Set re = New RegExp
?

I'll test it but I wondered if you have some insight into this. The
rest of my code (the actual regexp pattern test) is the same as yours.

Thanks!
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Similar Threads


Top