Regular expressions for replacements in Excel?

S

Shannon Jacobs

Doesn't seem like many people are using the regular expression capability in
Excel. Some reason? Apart from the fact that it's kind of hard to figure
out? Actually, I'm trying to port some JavaScript regular expressions into
Excel... I think I've enabled the required library linkage, but I can't seem
to figure out the syntax required to refer to all of the lines in the
spreadsheet (which is really just a list of text strings). Don't seem to be
any good examples anywhere I can find them on the Web, and the Microsoft
documentation has been typically unenlightening.

Are the examples and useful instructions more liable to be found on the Word
side? Some of the later manipulations would be better with the spreadsheet's
capabilities, however.
 
J

Jezebel

There's a tutorial on the MVPS site. Word's regular expressions are pretty
well the same as any other, except that it uses minimal matching where unix
regex uses maximal.
 
S

Shannon Jacobs

Frank said:
Hi
post your current code which does not work
<old context snip>

Well, I'd be willing to consider pasting in the code, but right now I can't
do it. I can't even tell for sure whether or not it is working. If it is
working, then Microsoft's implementation is incredibly bad and slow, which
would also explain why it seems that very few people are using the feature.
The current status is that I can't even get Excel's attention to ask it to
look at the code, but it's using 95% of the machine cycles and seems happy,
so I'm waiting to see what results come out, if any.

From memory I can say that I just defined a simple regexp, basically two
options before a short trailer string in the Pattern. Then I tried to invoke
a Replace on the Cells (just over 2000 in a single column), which did
nothing yesterday. Today I created an outer loop and tried to pass each cell
in individually--but I'm still waiting for it to finish or crash or do
something. About 10 minutes now, and still no results, and CPU still around
95%.

In yesterday's experiments I just zapped Excel at this point, and upon
restarting it would attempt to recover the file, but whether I used the
original saved file or the recovered version, in either case after that
Excel would not allow me to use any macros. The error message complained
about the security settings, but no matter what I did to the security
settings I could not get back into the file. Had to zot the file completely
each time and start from scratch.

Like they say: Microsoft's way or the highway.
 
J

Jezebel

If you've been waiting that long, then clearly something has gone wrong.
Kill it in the task manager, restart, dig out your code. It might indeed be
Microsoft's fault, but probably not. The Find/Replace machinery has been
around for quite a while now and is generally pretty stable.
 
S

Shannon Jacobs

Jezebel said:
If you've been waiting that long, then clearly something has gone
wrong. Kill it in the task manager, restart, dig out your code. It
might indeed be Microsoft's fault, but probably not. The Find/Replace
machinery has been around for quite a while now and is generally
pretty stable.
<old stuff snip>

Okay, here's the current state of the code. It does the right thing, but it
runs like a really tired snail. Around 15 seconds for 2,000 lines. I'm using
the 5.5 version of the RegExp library which has the MultiLine parameter
because my goal is to chain a series of these together and gradually reduce
the file to a radically different state. As it stands now, I wound up
splitting it out on a one-line-at-a-time basis to get a look at what is
going on inside there, but of course this is a horrible way to do it,
recreating the regexp on each pass through the loop. What I really want to
do is just have two lines for each pass, one to reassign the Pattern, and
one Replace. However, whenever I try to handle things on a larger
"object-oriented" basis, I seem to wind up in some sort of infinite loop.

It's really hard for me to understand the minds behind a debugger that
apparently can't be interrupted short of zotting the entire process. The way
the parameters need to be passed is also driving me nuts...

It would be greatly appreciated if some wise guru of VB would comment on the
best way to restructure it. I tried moving the regexp preparation into the
top procedure with the hope that I could access it as a global variable, but
that didn't work... My next idea is to move the loop into the second
procedure, so I can at least avoid all those reallocations of the regexp.

Sub clean()
Dim I As Long
Dim lastI As Long
Dim S As String

ActiveCell.SpecialCells(xlLastCell).Select
lastI = ActiveCell.Row
For I = 65 To lastI
S = Cells(I, 1).Value
Call cleanALine(S)
Cells(I, 1).Value = S
Next
End Sub

Sub cleanALine(theLine)
Set aRegExp = CreateObject("vbscript.regexp")
With aRegExp
.Pattern = "(height|width)=""12"" "
.Global = True
.IgnoreCase = True
' .MultiLine = True
End With
theLine = aRegExp.Replace(theLine, "")
End Sub
 
J

Jezebel

You could try --

Sub clean()
Dim aRegExp as object
Dim I As Long

Set aRegExp = CreateObject("vbscript.regexp")
With aRegExp
.Pattern = "(height|width)=""12"" "
.Global = True
.IgnoreCase = True
' .MultiLine = True
End With

For I = 65 To ActiveSheet.UsedRange.Rows.Count
Cells(I, 1) = aRegExp.Replace(Cells(I, 1), "")
Next

End Sub



Are there more search strings you're dealing with? Otherwise, it would be
heaps quicker to use:

with Range("A65:A" & ActiveSheet.UsedRange.Rows.Count)
.Replace What:="Height=""12""", Replacement:="", LookAt:=xlPart,
SearchOrder:=xlByRows, MatchCase:=False
.Replace What:="Width=""12""", Replacement:="", LookAt:=xlPart,
SearchOrder:=xlByRows, MatchCase:=False
end with
 
S

Shannon Jacobs

Jezebel said:
You could try --

Sub clean()
Dim aRegExp as object
Dim I As Long

Set aRegExp = CreateObject("vbscript.regexp")
With aRegExp
.Pattern = "(height|width)=""12"" "
.Global = True
.IgnoreCase = True
' .MultiLine = True
End With

For I = 65 To ActiveSheet.UsedRange.Rows.Count
Cells(I, 1) = aRegExp.Replace(Cells(I, 1), "")
Next

End Sub

Are there more search strings you're dealing with? Otherwise, it
would be heaps quicker to use:

with Range("A65:A" & ActiveSheet.UsedRange.Rows.Count)
.Replace What:="Height=""12""", Replacement:="",
LookAt:=xlPart, SearchOrder:=xlByRows, MatchCase:=False
.Replace What:="Width=""12""", Replacement:="",
LookAt:=xlPart, SearchOrder:=xlByRows, MatchCase:=False
end with

This sounds very much like what I've been trying to figure out, though I
won't be able to pursue it farther until later, probably Friday at earliest.
However, right now I'm struggling with the following problem. I want to drop
the last part of the image tags, and all of the first part including the
directory information. All I want to have left is the stem name of the file.
The first triplet works properly, but the second one is overmatching
somehow, and the stem name is also dropped. Even worse, in this latest form
I tried to use the parenthetic grouping for a clause and then the \1 to
return the stem, but now it just sticks a "\1" into the file.

' remove ends of image tags
aRegExp.Pattern = "\.gif[\w+ ?= ?\w+]*>"
theLine = aRegExp.Replace(theLine, "")

' remove fronts of image tags, but save the last word
aRegExp.Pattern = "<img src ?= ?\./[\w+/]*(\w+)"
theLine = aRegExp.Replace(theLine, "\1")

I haven't even gotten to the really nasty part of the file transformation
yet, though so far I'm getting a moderate feel for how they work here...

Not sure if it will help, but my current primary reference is the following
URL, and I'm working with an English version of Excel XP right now.

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnclinic/html/scripting051099.asp
 
J

Jezebel

If you are dealing with just one column of the spreadsheet, it might be
easier to copy it into a Word document, make the changes there, and copy it
back. Word's Find and Replace has the regular expression functions built in
(check the 'Use Wildcards' checkbox), so you can see what's going on and you
don't need to screw around with coding at all.



Shannon Jacobs said:
Jezebel said:
You could try --

Sub clean()
Dim aRegExp as object
Dim I As Long

Set aRegExp = CreateObject("vbscript.regexp")
With aRegExp
.Pattern = "(height|width)=""12"" "
.Global = True
.IgnoreCase = True
' .MultiLine = True
End With

For I = 65 To ActiveSheet.UsedRange.Rows.Count
Cells(I, 1) = aRegExp.Replace(Cells(I, 1), "")
Next

End Sub

Are there more search strings you're dealing with? Otherwise, it
would be heaps quicker to use:

with Range("A65:A" & ActiveSheet.UsedRange.Rows.Count)
.Replace What:="Height=""12""", Replacement:="",
LookAt:=xlPart, SearchOrder:=xlByRows, MatchCase:=False
.Replace What:="Width=""12""", Replacement:="",
LookAt:=xlPart, SearchOrder:=xlByRows, MatchCase:=False
end with

This sounds very much like what I've been trying to figure out, though I
won't be able to pursue it farther until later, probably Friday at earliest.
However, right now I'm struggling with the following problem. I want to drop
the last part of the image tags, and all of the first part including the
directory information. All I want to have left is the stem name of the file.
The first triplet works properly, but the second one is overmatching
somehow, and the stem name is also dropped. Even worse, in this latest form
I tried to use the parenthetic grouping for a clause and then the \1 to
return the stem, but now it just sticks a "\1" into the file.

' remove ends of image tags
aRegExp.Pattern = "\.gif[\w+ ?= ?\w+]*>"
theLine = aRegExp.Replace(theLine, "")

' remove fronts of image tags, but save the last word
aRegExp.Pattern = "<img src ?= ?\./[\w+/]*(\w+)"
theLine = aRegExp.Replace(theLine, "\1")

I haven't even gotten to the really nasty part of the file transformation
yet, though so far I'm getting a moderate feel for how they work here...

Not sure if it will help, but my current primary reference is the following
URL, and I'm working with an English version of Excel XP right now.

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnclinic/html/scripting051099.asp
 
S

Shannon Jacobs

Jezebel said:
If you are dealing with just one column of the spreadsheet, it might
be easier to copy it into a Word document, make the changes there,
and copy it back. Word's Find and Replace has the regular expression
functions built in (check the 'Use Wildcards' checkbox), so you can
see what's going on and you don't need to screw around with coding at
all.

The Word wildcards seem to be much more limited than real regular
expressions, so I had already eliminated that option. I haven't listed all
of my transformations here, but a number of them are somewhat trickier, so
that won't work. Your suggestion with the loop simplification is good, but I
don't think the Call itself is really causing much of the overhead, but you
are right that I should move it to one level. The other idea about using the
range.replace doesn't seem to apply here.

(In addition, I admit that part of my motivation is to learn about macro
programming specifically within Excel, since my coworkers already use it a
lot (though none of them are into regular expressions).)

For now, I'll just note the mistake and fix for the problem I commented on
last time:

This is the incorrect code:
' remove fronts of image tags, but save the last word
aRegExp.Pattern = "<img src ?= ?\./[\w+/]*(\w+)"
theLine = aRegExp.Replace(theLine, "\1")

This is correct:
Typically dumb typo with typical fatal results in a RegExp. Within a
bracketed class of course the \w picked up the rest of the string, whether
or not it ended in the slash that I intended to require. By ending the match
after the last slash, I didn't have to worry about explicitly preserving the
last word.

I still don't understand why the \1 didn't work as expected, but just for
grins, I tried ye olde $1, which produced really weird results... This may
be a future problem, since I think that one of the later transformations is
not just a simple deletion, but will require submatch reference in the
replacement... Right now I'm just massaging the line breaks, which is kind
of a nuisance since I still can't get the MultiLine parameter to work as I
had hoped. If that would only work, I should be able to slurp the next major
transformation in one step or two...
 
J

Jezebel

Check some of the web tutorials (eg
http://word.mvps.org/FAQs/General/UsingWildcards.htm) on Word's Find and
Replace -- it really does support full regular expressions.

BTW, why do your posts use JIS (Japanese) encoding?




Shannon Jacobs said:
Jezebel said:
If you are dealing with just one column of the spreadsheet, it might
be easier to copy it into a Word document, make the changes there,
and copy it back. Word's Find and Replace has the regular expression
functions built in (check the 'Use Wildcards' checkbox), so you can
see what's going on and you don't need to screw around with coding at
all.

The Word wildcards seem to be much more limited than real regular
expressions, so I had already eliminated that option. I haven't listed all
of my transformations here, but a number of them are somewhat trickier, so
that won't work. Your suggestion with the loop simplification is good, but I
don't think the Call itself is really causing much of the overhead, but you
are right that I should move it to one level. The other idea about using the
range.replace doesn't seem to apply here.

(In addition, I admit that part of my motivation is to learn about macro
programming specifically within Excel, since my coworkers already use it a
lot (though none of them are into regular expressions).)

For now, I'll just note the mistake and fix for the problem I commented on
last time:

This is the incorrect code:
' remove fronts of image tags, but save the last word
aRegExp.Pattern = "<img src ?= ?\./[\w+/]*(\w+)"
theLine = aRegExp.Replace(theLine, "\1")

This is correct:
Typically dumb typo with typical fatal results in a RegExp. Within a
bracketed class of course the \w picked up the rest of the string, whether
or not it ended in the slash that I intended to require. By ending the match
after the last slash, I didn't have to worry about explicitly preserving the
last word.

I still don't understand why the \1 didn't work as expected, but just for
grins, I tried ye olde $1, which produced really weird results... This may
be a future problem, since I think that one of the later transformations is
not just a simple deletion, but will require submatch reference in the
replacement... Right now I'm just massaging the line breaks, which is kind
of a nuisance since I still can't get the MultiLine parameter to work as I
had hoped. If that would only work, I should be able to slurp the next major
transformation in one step or two...
 
S

Shannon Jacobs

As regards the Word regular expressions, I'm afraid the answer is "No, they
are not 'full' regular expressions." For example, my very first simple
transformation (expressed as "(height|width) ?=? ?""?\d+%?""? ?" using VB)
cannot be expressed in that system. At least I could find no way to do so,
and the URL you provided also shows no way to do the simple string
alternative. Here is another important bit of evidence from the URL you
provided:

"If you are using VBA, you might want to look at the RegExp object (from
VBScript, which you can include in your macro projects, and which offers
some features not included in Word wildcards)."

I'm never absolutely sure in dealing with Microsoft, but at this point I'm
pretty well convinced that the situation is that there are four "levels" of
search and replace available within Word: (1) literal, (2) very basic
wildcards with a few special notations, (3) extended wildcards with some
regexp notation, and (4) full regular expressions (which can only be
accessed via VBScript). (Actually, there also appear to be two levels of the
"full" regular expressions.) The advantage of being Microsoft is that you
can announce that all of these are now de facto standards.

Within Excel, I haven't found access to levels (2) or (3), but that's moot,
since they can't handle some of the transformations I'm already doing from
level (4). (Or is that really level (4)b?)

If I can just figure out some way to make the MultiLine option work, I think
I'll basically be done with the project. I feel like I need some minor trick
on the "Cells" object, but so far I haven't been able to find it. Hmm...
Have you ever made a *really* long string by just appending all of the
fields together?

And sorry about the Japanese encoding. Sometimes OE respects my "standing
request" to use a Western encoding, and sometimes it does not. It's pretty
annoying to have to check all the time trying to catch it when it slips
back. I have not been able to find the pattern of whatever is controlling
the encoding (apart from OE trying to use the original encoding when making
a reply, but that's actually a reasonable behavior and works properly (most
of the time)).
Check some of the web tutorials (eg
http://word.mvps.org/FAQs/General/UsingWildcards.htm) on Word's Find
and Replace -- it really does support full regular expressions.

BTW, why do your posts use JIS (Japanese) encoding?
<older stuff snipped>
 
S

Shannon Jacobs

Well, below is the current version of the code, but unless someone is
willing to help and can clarify how to make the MultiLine thing work, I'm
probably going to let it stand. This is about 5 times faster than the
original version, but basically it's just inlining the subroutine call and
reducing the overhead for object creation. This will process the 2000 lines
in around 4 seconds.

By using a "Dim theText() as String" statement the system seemed to accept
the MultiLine assignment, but in reality only the first string from Cells
was copied into my array, and all of my experiments failed. I wasn't able to
find any directly applicable source code examples, and I couldn't twist any
of the examples I did find into working as expected... My current conclusion
is that there are very few people using the MultiLine capability, and none
of them are monitoring any of these newsgroups.

The other approach of using large strings seems implausible since there's a
32 KB size limit on the strings, and the total file is something over 200
KB. However, I may still be able to pursue that approach if I can devise
some criteria for lumping things into reasonably small packages...

Basically you can take the following as an example of how to use true
regular expressions within Excel via a VB macro:

Sub FastClean()

Dim startTime
startTime = Now

Dim I As Long
Dim lastI As Long
Dim S As String

Set aRegExp = CreateObject("vbscript.regexp")
With aRegExp
.Global = True
.IgnoreCase = True
' .MultiLine = True
End With

ActiveCell.SpecialCells(xlLastCell).Select
lastI = ActiveCell.Row

For I = 1 To lastI
S = Cells(I, 1)

'kill first three parameters
aRegExp.Pattern = "(height|width|border)=""?\d+%?""? ?"
S = aRegExp.Replace(S, "")

' Eight more patterns and replacements

Cells(I, 1) = S
Next

MsgBox "Finished in " + Format(Now - startTime, "ss.s") + " seconds."
End Sub
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top