function to match parenthesis '(' and ')'

  • Thread starter Thread starter Steve Kirsch
  • Start date Start date
I certainly appreciate that regexes are useful. I just don't think this
is a place where they're suitable. I would guess that at least half of
the requests for regular expressions I see on the newsgroups are
inappropriate places to apply regular expressions. A lot of people seem
to approach any textual problem with the immediate idea of "use a
regular expression!" without considering whether or not it's the
best/simplest way of solving it.

Yes, I agree with you. that is a good summary of Regex usage today.
I will however add, for this topic (validating and parsing expressions with
parenthesis), that a known regex does all the job requested.
Of course you can write code, write your own libraries that should be more
elegant (it depends on what you think is elegant). I prefer reuse what is
already done and only optimise it if necessary.
 
Ludovic said:
First of all, I would like to answer to Michael's reply. No, I'm not evil
enough to write such a RegEx. I used the Regex from the Ryan Byington's one
http://blogs.msdn.com/bclteam/archive/2005/03/15/396452.aspx
that is very well explained.

<snip>

Sorry to reply twice to the same post, but I just looked at the blog
above.

There are two points which are interesting to note:

1) This is .NET-specific - as I stated (with some trepidation!) before,
there is no way to do this in "general" regular expressions. Therefore
anyone reading it and trying to understand it not only needs to know
regular expressions, but the .NET variety of them.

2) One of the comments is "My brain just exploded." That comment was
added by someone who is very familiar with regular expressions, given
the blog he links to.

I would be willing to bet that if you gave both versions to 1000
"average" C# coders, far, far more would be able to understand the code
from Marcus than the regular expression. Why include effectively "black
box" code when you can so easily avoid it?

Jon
 
Thanks for all your remarks, Jon.

I would however add some arguments :
Few years ago, I was not used to regular expressions. I would have written
my own classe to do balancing, and I would probably have done a code close
to Marcus' one. Then I would have written all the code necessary to store
subelements, and so on.
Years after years, I got used to regex. I also had to write an expression
validator. First, I thought it was impossible with regex and it was not the
right tool. I was right in your point of view. So I wrote everything. It did
not took me a lot, the source is clear, commented but all the parsing class
is important. I however wanted to know if it was possible with regex. I
tried to understand balancing in regex. Many days later, I found that a
regex was doing everything I wanted in one line.
Then I had to write another expression parser and I used this regex because
I did not need efficiency. With one line I had everything.
I also could have copied my old file where I had already developped a
similar parser and use it nearly the same.
I would have had a "black box" too. I don't need to look in my source
because I know how I did it and that it works, the same for the regex.
In my mind, the regex line only means "ParenthesisBalanceClasss".
If I want efficiency, I can replace this regex line by a call to my class I
wrote few years ago. It would be exactly the same, I don't see the
difference....except that with the class version is quicker but the code is
longer.
What is the best ? I think it depends on what you really want to do.
What is the more readable ? It depends on if you have ever seen this regex.
If yes, both are the same. If not, of course Marcus' code is the one to use.
So you will say that my code is not maintenable ? I disagree. Firstly, there
is no need to modify it (it's like modifying indexOf of string class). A
black box ? It's the same when you use a DLL or use any of the function in
C#.

I think I can't change your mind. I also understand your point of view,
which is logical.
I would end saying that Regex is not the better for this balancing, but
someone did an interesting thing for it. I prefer using it than copying a
long class doing exactly the same.
It's my choice and a choice of many developers that do not want to rewrite
what is already simply done.
 
: Hmm... I seem to remember something in CS classes saying that this is
: something that regular expressions actually can't do. You can develop
: them that will match down to a certain level of nesting (i.e. for any
: given number of levels of nesting, you can build a regular expression
: which will match it), but nothing to match *any* level of nesting. I
: could easily be wrong though...

Yes and no.

If you mean regular expressions in the same sense as theoretical
computer scientists use it, you're correct. Keep in mind, however,
that the toolbox is pretty bare: literals, concatenation, grouping,
alternation, and Kleene star.

The .NET framework's "regular expressions" are more powerful in that
they don't limit reconizers to regular languages (i.e., languages
that finite-state automata can recognize).

For example, by application of the pumping lemma, we can show that
the language of strings where each character is either an 'a' or a 'b'
and where each string consists of two copies of a substring appended
together is not regular. (Notation makes this clearer: L = { ww :
w \in {a, b}* }.)

Backreferences make the construction trivial:

Regex r = new Regex(@"^([ab]*)\1$");

In *Mastering Regular Expressions*, Friedl gives an VB.NET example for
matching balanced parentheses, which I've paraphrased and translated to
C# below:

string pattern =
@"^ \(
(?>
\( (?<DEPTH>) |
\) (?<-DEPTH>) )*
(?(DEPTH)(?!))
\) $";

Regex balanced = new Regex(pattern,
RegexOptions.IgnorePatternWhitespace);

See page 430 of http://www.oreilly.com/catalog/regex2/chapter/ch09.pdf

Hope this helps,
Greg
 
: [...]
: With the explicit code, it's easy to modify, easy to verify by
: inspection, easy to understand, and fast. The regular expression is
: none of these. It's just the wrong tool for the job, IMO.

I mean, really. What's wrong with MOV and SCASB?

Greg
 
So you will say that my code is not maintenable ? I disagree. Firstly, there
is no need to modify it (it's like modifying indexOf of string class). A
black box ? It's the same when you use a DLL or use any of the function in
C#.

It is true that THIS Regex will never need to be changed because this regex solves THIS problem.
But,
What if the problem changes???
Jon already brought up <,>.
What about {,}and [,]?
Or...Let's go crazy, what about /*,*/

These are all common Balanced pairs.
What if the problem domain demands that we handle them too? (or some arbitrary subset)

Granted, you may find another Off-the-shelf Regex to handle the problem, and you may not.
More importantly, The guy who inherited your code will need to find it.
So the original Regex may not need maintenance, but it may diverge form the problem set requiring a
new solution.

So, how likely is it that the Problem domain will change over time?
That depend on the problem.
If the problem will not likely change over time Perhaps the Regex will be fine.
If obvious changes to the problem can be handled easily by Regex as well then Perhaps the Regex will
be fine.
If it is likely that the problem will morph beyond a "simple" regex then perhaps it is NOT the way
to go.

I brought up the question of embedded parens in quotes as follows:
Console.WriteLine("(");
Should this match or not:? Depends on the problem domain
What about this:
(("Hello")("<G>")(":^)"))

That smiley in the third set of parens COULD ruin your day.


So the question is:
Can your solution easily handle small (obvious) changes to the problem set?
If so, go for it.
If not, then you are doing no favors to the guy who needs to fix the code.

I am not passing judgment of the use of the Regex.
I find it difficult to understand, but, if it works...what the heck.
When I solve a problem I always play devil's advocate with my solution.
Have I built a solution that is robust enough to handle obvious (and not so obvious) changes?

Bill
 
Agreed on all points. Jon!

--
Kevin Spencer
Microsoft MVP
..Net Developer
A watched clock never boils.
 
Ludovic SOEUR said:
I did not think that my post would start a debate. So as it has started, I
have to reply.

Well, you can't expect a nice troll like me not to participate in one. =)
While I'm a joker and probably will drive you nuts below and in future
posts;
I do think you have created a nice debate and have some serious points.
I put this sample to prove that it's possible with regex and I would like
to
do a time comparison too. I still stay on my point of view when I say I
prefer RegEx version if you do not need to do this check 100 000 times.

What works and what we find desirable are seldom the same.
We all know you CAN do it, but do we really WANT it. What do we find USEFUL?
First of all, I would like to answer to Michael's reply. No, I'm not evil
enough to write such a RegEx. I used the Regex from the Ryan Byington's
one
http://blogs.msdn.com/bclteam/archive/2005/03/15/396452.aspx
that is very well explained. The only improvment was to remove all 'names'
and to capture ONLY valid strings (the ones Steve is looking for). I
ALWAYS
explain my regular expressions and this one is a very usefull one that is
known by most of who works with regex.

And I was being funny/ironic.
If you do code such a complex regexp, I would expect a full description of
what it does.
The thing is that the commentary would probably take up a lot of space and
be forced to use complex language.

I would bet that for the comment to be useful; it would take up more space
than if you coded it 'by hand' as in the example in the previous post.
Hence, not only do you got a cryptic regexp, you also have a cryptic comment
that actually is poetry, for what code could better describe, taking up same
amount of .cs-space.
Maintenance for this case is useless. It's like providing a function for
the
System namespace. You dont have to do maintenance for "substring" method
of
string class. When a regular expression is known to be correct and used by
many developpers, you don't have to change anything. Only put a link to
the
original explanation and add comment in your code if you want to be more
precise. On the other hand, Marcus code must be commented, and i would
say,
more than the regex one...it should be surprising for most people who are
not used to balancing in regex.

Amagad. I don't know where to begin...

First things first:
This is a typical case for maintanance. Wanna bet that the OP (his/hers
customer really) comes back and says: - I need to to skip '(' and ')' that
are inside strings a la " and also support braces and curly braces.

Code tend to change...

Secondly:
This is also a typical example for when "developers" don't care for
"maintainers".
As an architect and technical project manager, a part of my job is to
'approve code'.
Often I read 'smart code' like in your example and simply won't accept it.
Too complex.

While being a joking lurking troll, with some (dubious) knowledge, in this
newsgroup; I can be a pretty serious dude at work. When I review code, there
is just one criteria: - Is this maintainable? If so, quality and
performance; are aspects I'd consider. But the first question takes care of
several sub-questions. Is it understandable? Is it readable? Does it need
commenting? Is it fault tolerant? Are all bounds, constraints and exits
handled? Is that visible? Do I like what I read?

Your regexp would never make it to the test-server if I eyeballed it. =)
Second, to answer to Jon, it took me 2 minutes to understand the original
regex from Ryan Byington. Firstly because it's well commented and also
because I am used to writting balanced regex.

<-snip->
Well, YOU are perhaps. The poor dude or gal that needs to modify your
regexp, 3 years after you wrote, it might not be as clever as you.... Or
even find your code as 'fun' as you thought it was when you wrote it. I've
seen it before and I'll probably see it again in the future: People with a
lot of free brainpower doing smart stuff to please themselves and not the
project, system and customer they work for.

Are you even sure that regexp will be still in use for the next 3 years? 5?
10? 20?
I'm not so sure. But what I do know is that keywords like while, for, if,
else and switch are here to stay.
I also believe that humans can and like to follow a sequence, and dislikes
trying to figure out how someone elses regexp works. Call me silly...
I totally disagree when you say that regular expression is not the job for
this work. With one line that you do not mind if it's understable, you can
do everything you need with parenthesis balancing. It's the right tool if
you not need to do a loop over thousands of checks, you don't have to
write
your own classes or method, everything is already done. It's like using a
dll, libray, in four words, reusing what was done.

I totally agree that this is a wonderful example of when not to use regexp.

Just to prove my point:
I for sure didn't even try to understand your regexp. Too complex...
Why don't you write a GOOD comment to your regexp and we'll see if thh rest
of us can understand that.

I dare you.. =)
Ludovic SOEUR.

High Regards
- Michael S
 
Ludovic SOEUR said:
Yes, I agree with you. that is a good summary of Regex usage today.
I will however add, for this topic (validating and parsing expressions with
parenthesis), that a known regex does all the job requested.
Of course you can write code, write your own libraries that should be more
elegant (it depends on what you think is elegant). I prefer reuse what is
already done and only optimise it if necessary.

You had to go to a web page to find that code. I'm sure you *could*
have found something like Marcus's code on a web page. Alternatively,
the code that Marcus wrote was easy to write in the first place - no
need to hunt it down.

Given the option of "look for something that someone else has done,
which I (and anyone reading the code) will need to put significant
effort into understanding" and "write something from scratch in about
10 lines of easily understandable code" I'll pick the second one every
time.
 
Michael S said:
Thanks Bill for living on the same planet as I.

Good post!
That's funny.
I just read your post with the same thought. {of course Bill was replaced with Michael in my thought
:^) }

Bill
 
What is the best ? I think it depends on what you really want to do.
What is the more readable ? It depends on if you have ever seen this regex.
If yes, both are the same. If not, of course Marcus' code is the one to use.
So you will say that my code is not maintenable ? I disagree. Firstly, there
is no need to modify it (it's like modifying indexOf of string class). A
black box ? It's the same when you use a DLL or use any of the function in
C#.

Let's take a step back for a second.

I think we can both agree that it's worth encapsulating the logic
(however it's implemented) in a method - or possibly even in its own
class, if it needs to be shared between different projects.

Let's look at the different "stakeholders" in the code, and which
version they'd prefer:

1) The "client" of the method - whatever calls it: they're unlikely to
care much, so long as it works. The regex will be a bit slower, but I
think it's unlikely to be a problem in most situations.

2) The original coder: you looked up a regular expression, and given
that you found it on a blog, I suspect that took you a little while of
searching (even if only a couple of minutes). You then had to modify it
to match parentheses rather than square brackets. I believe many coders
could easily have written the code Marcus presented (or very similar
code) in about the same amount of time - and been more confident of it
working. (There are plenty of regular expressions out there which
*don't* work, for instance for email address validation.)

3) The code reviewer: I would always want to understand any code I
approved for check-in. That would mean I'd have to go to that blog,
read through the whole entry, then look back and check that the change
from square brackets to parentheses had been done correctly. Or, I
could look over about 10 lines of very simple C#. For me, the regex
loses heavily here.

4) The maintenance engineer trying to debug a problem: The first rule
of debugging is not to assume that things are actually working - so the
maintenance engineer working through the code would probably have to
understand the regular expression too. Another trip to the blog...
Also, if a typo *had* been made somewhere when entering the regular
expression, it's a lot easier to step through the simple code than it
is to step through the regular expression. Yes, you could pull up a
regular expression workbench, but then it's got to accept the same
kinds of regular expressions as .NET, etc.

5) The maintenance engineer trying to change the behaviour: Again,
you'd have to understand the code first. Again, I believe that would
take longer with the regular expression version, even assuming a
reasonable understanding of regular expressions (and with someone who
is pretty rusty on regular expressions, it could take a *lot* longer).
Then you'd have to decide how to go about making the change - fiddle
with the existing regular expression (which involves making sure you
really, really understand it very thoroughly) or trying to find a new
one on the web? Alternatively, modify the simple C#... again, I believe
the regex loses heavily.

I can't see any stakeholder who benefits from using the regular
expression version, unless any of them are *really, really* familiar
with regular expressions (uses them day-in/day-out) and *unfamiliar*
with C#. I think that's an unlikely scenario.
 
Hehe.

We should set up a virtual company and hire Ludovic with a nice salary and
have him maintain code for a while. With that experience he'll come running
back late, in like a week, crawling with bugs Then we can fire him without
even pay him for doing nothing. =)

Happy Maintaining
- Michael S

ps.
Ludovic: I hope you get that I'm just being funny on your behalf. I hope
you'll forgive me.
 
Thanks all for all your comments about this debate. That 's very interesting
to see all these points of view.

I agree on all everybody have said since the beginning except for 2 things :

1) You say that in ten lines, you are able to write something that can say
if the balance matching is correct. You did not say how many lines it takes
to have a collection with all matching subelements. Again, in one line, that
is done. Do you know why there is XPath for xml ? To simplify treatment for
selecting a node. In one line, you can reach the nodes you wanted without
rewritting the same similar code may times. I consider it's the same with
Regex.
2) There some case about little changes where the regex modification is
quicker than a code like marcus' one. If instead of evaluating expression
with only parenthesis, you want to find all arguments of a function in a
dynamical compilation of a source code. You only have to add the name of the
function between the first parenthesis to match. With a code like marcus's
one, you will have to write a string parser wich is already implemented in
regex.

Seeing all you replies, I modify what I said at the beginning of this
debate. Maybe only for matching parenthesis, it's better not to use regex.
When you have to add parsing with the parenthesis matching, I'm sure that
regex is the proper tool.

P.S. Michael : I could not stop laughting seeing your reply about me. Why
not try to work in the same company ? It should be very interesting.... ;-)

Ludovic.
 
Ludovic SOEUR said:
Thanks all for all your comments about this debate. That 's very interesting
to see all these points of view.

I agree on all everybody have said since the beginning except for 2 things :

1) You say that in ten lines, you are able to write something that can say
if the balance matching is correct. You did not say how many lines it takes
to have a collection with all matching subelements. Again, in one line, that
is done. Do you know why there is XPath for xml ? To simplify treatment for
selecting a node. In one line, you can reach the nodes you wanted without
rewritting the same similar code may times. I consider it's the same with
Regex.

Indeed, it would be significantly more complicated to return the actual
elements. Of course, that's not what was asked for. If it had been,
it's quite possible that the regular expression would have been the
right solution. I'm not sure though - I'd have to look at both a
regular expression solution and a "hand-crafted" ones.
2) There some case about little changes where the regex modification is
quicker than a code like marcus' one. If instead of evaluating expression
with only parenthesis, you want to find all arguments of a function in a
dynamical compilation of a source code. You only have to add the name of the
function between the first parenthesis to match. With a code like marcus's
one, you will have to write a string parser wich is already implemented in
regex.

However, in any of those changes, you have to really understand the
original regular expression to start with - the same regular expression
that provoked a comment of "My brain just exploded" from someone who is
already familiar with regexes. You then end up with code which is that
hard to read again after making the change.
Seeing all you replies, I modify what I said at the beginning of this
debate. Maybe only for matching parenthesis, it's better not to use regex.
When you have to add parsing with the parenthesis matching, I'm sure that
regex is the proper tool.

I wouldn't say that I was *sure* that regular expressions would be the
way to go, but they'd certainly be much more likely to be the right
solution in that case than they are when just matching.
 
Couldn't you use the split function.

string a = "a( ( 'john' ) and ( 'jane' ) and ( 'joe' ) )";
string[] b;

b = a.Split( '(' );
MessageBox.Show((b.Length-1).ToString());

b = a.Split( ')' );
MessageBox.Show((b.Length-1).ToString());
 
Kevin said:
Couldn't you use the split function.

string a = "a( ( 'john' ) and ( 'jane' ) and ( 'joe' ) )";
string[] b;

b = a.Split( '(' );
MessageBox.Show((b.Length-1).ToString());

b = a.Split( ')' );
MessageBox.Show((b.Length-1).ToString());

Aside from the needless performance hit (which may not be a problem in
many situations) that doesn't check that the parentheses actually
balance - it just counts them. So "))))((((" would "succeed", even
though it's clearly wrong.

Jon
 
using System;
using System.Collections;

class ParenthesisParser
{
public static bool CorrectParenthesis(char[] elements)
{
Stack stack = new Stack();
char ch;
for (int i = 0; i < elements.Length; i++) {
ch = elements;
switch ( ch ) {
case '(':
stack.Push( ch );
break;
case ')':
if (stack.Count == 0)
return false;
stack.Pop( );
break;
}
}
return stack.Count == 0;
}
}
 
Back
Top