Hi Jay,
You are clearly one of the folks that I would describe as "more expert than
I in RegEx."
I agree; I break down regular expressions while I am developing them.
However, once I am comfortable that they work, I then combine them to
"simplify" the supporting code.
Code simplicity is an interesting term. Not sure I agree that combining two
or three (or ten) expressions creates simplicity. The code is certainly
shorter. However, I have no desire to make things simple for the compiler
or the runtime. I want to make things simple for myself and the developer
who will follow me, and have to maintain my code.
Is it? My concern is that with the manual looping you are adding unnecessary
complexity to the code, hence my question.
In my opinion, a loop is a fairly common construct, and therefore the
complexity of adding a loop is small compared to the complexity of making
the RegEx more difficult for a non-expert to read.
Plus you might be adding possible
performance problems (evaluating multiple RegEx as opposed to a single
complex one).
If RegEx were being used in an inner loop, in a situation where we were
processor-bound, I would agree. I haven't run across that situation. I
suppose my answer would become more cautious if I had. That said, RegEx is
pretty efficient.
Either method may be causing increased GC pressure.
Sorry to be thick, but I don't understand why. If I were doing a series of
RegEx matches in a loop, I would create the expressions outside the loop and
simply use them in the loop. A match is as good as a mile. Technically,
that should create the same number of matches.
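
A rough sketch of that arrangement, for anyone following along. The module
name, the two date patterns, and the sample inputs are made up for
illustration; the only point is that the Regex objects are built once and
only Match objects are created inside the loop.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module HoistedPatterns
    ' Built once, before any looping. Both patterns are illustrative only.
    Private ReadOnly IsoDate As New Regex("^\d{4}-\d{2}-\d{2}$")
    Private ReadOnly UsDate As New Regex("^\d{1,2}/\d{1,2}/\d{4}$")

    Sub Main()
        Dim inputs() As String = New String() {"2004-07-15", "7/15/2004", "not a date"}

        For Each candidate As String In inputs
            ' The Regex instances are reused; each call only produces a Match.
            Dim iso As Match = IsoDate.Match(candidate)
            Dim us As Match = UsDate.Match(candidate)

            If iso.Success Then
                Console.WriteLine(candidate & ": ISO style")
            ElseIf us.Success Then
                Console.WriteLine(candidate & ": US style")
            Else
                Console.WriteLine(candidate & ": no match")
            End If
        Next
    End Sub
End Module
```
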
Also, once again, most of the apps that I've done parsing in aren't tuned
for Garbage Collection. It is nearly always easier to find opportunities to
reduce GC pressure simply by applying StringBuilder where it is useful (the
"80-20" rule).
Also my concern (with both methods) is precedence, which is a problem my
expression has with 2 & 4 digit years (it actually allows a 3 digit year).
Manually looping over individual expressions may cause a different
expression to be matched than a properly constructed group with alternation
(I am not implying my expression is properly constructed!).
I completely agree. This is one place where I feel that a loop is better.
You can add in extra logic by structuring the code so that you match your
string against a couple of different patterns, and then YOU can apply a
complex rule to decide which to use. With the RegEx language, you can't
control precedence in as detailed a way as you can with logical constructs
and business rules.
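
To make that concrete, here is a minimal sketch. It assumes a hypothetical
rule, "prefer the longest match", and two made-up date patterns; BestMatch
and Candidates are names invented for the example. A plain alternation in
.NET tries its alternatives left to right, so the order written into the
pattern decides the winner; in the loop, the rule in code decides.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ChooseByRule
    ' Illustrative patterns: the same date shape with a 2 or 4 digit year.
    Private ReadOnly Candidates As Regex() = New Regex() { _
        New Regex("\d{1,2}/\d{1,2}/\d{2}"), _
        New Regex("\d{1,2}/\d{1,2}/\d{4}")}

    ' Tries every pattern, then applies the rule: the longest match wins.
    ' Returns Nothing when no pattern matches.
    Function BestMatch(ByVal input As String) As String
        Dim best As String = Nothing
        For Each pattern As Regex In Candidates
            Dim m As Match = pattern.Match(input)
            If m.Success AndAlso (best Is Nothing OrElse m.Value.Length > best.Length) Then
                best = m.Value
            End If
        Next
        Return best
    End Function

    Sub Main()
        Console.WriteLine(BestMatch("Due 7/15/2004"))  ' 7/15/2004
        Console.WriteLine(BestMatch("Due 7/15/04"))    ' 7/15/04
    End Sub
End Module
```

Swapping in a different rule (say, rejecting any year that is not exactly 2
or 4 digits) only changes the If condition, not the patterns.
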
Also in this instance I would consider something like:
Const pattern1 As String = "a"
Const pattern2 As String = "b"
Const pattern3 As String = "c"
Const pattern As String = pattern1 & "|" & pattern2 & "|" & pattern3
Which easily allows you to define & maintain the patterns separately, then
gain the "simplicity" of combining the RegEx call...
An excellent idea. One thing to consider, though. Each of the patterns
above would need to be tested individually, and the combined pattern would
need to be tested as well. If you do one, and not the other, it is possible
for small syntax errors in two patterns to balance each other out, allowing
the final construct to be legal, valid, and wrong.
This adds to the testing burden a bit. Not much, perhaps, but still a bit.
The unit tests that you describe should still cover it, as long as they look
for boundary conditions effectively.
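
A contrived illustration of two errors cancelling out, using deliberately
broken fragments that are not taken from anyone's real code. Compiles is a
helper name invented for the example; it relies on the Regex constructor
throwing ArgumentException when a pattern fails to parse.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module FragmentCheck
    ' Returns True when the pattern parses on its own.
    Function Compiles(ByVal pattern As String) As Boolean
        Try
            Dim r As New Regex(pattern)
            Return True
        Catch ex As ArgumentException
            Return False
        End Try
    End Function

    Sub Main()
        Const pattern1 As String = "(a"   ' stray opening parenthesis
        Const pattern2 As String = "b)"   ' stray closing parenthesis
        Const pattern As String = pattern1 & "|" & pattern2

        Console.WriteLine(Compiles(pattern1))  ' False
        Console.WriteLine(Compiles(pattern2))  ' False
        Console.WriteLine(Compiles(pattern))   ' True: "(a|b)" is legal
    End Sub
End Module
```

Testing only the combined pattern would pass here, which is why the
fragments need their own tests as well.
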
Note: | is the alternation operator, not the Or operator. Or implies
combining (when applied to numbers & Booleans), whereas | does not combine;
it provides alternatives!
I stand corrected.
--
--- Nick Malik [Microsoft]
MCSD, CFPS, Certified Scrummaster
http://blogs.msdn.com/nickmalik
Disclaimer: Opinions expressed in this forum are my own, and not
representative of my employer.
I do not answer questions on behalf of my employer. I'm just a
programmer helping programmers.