Opinion wanted on "Regex"

Mark Chambers

Hi there,

I'm seeking opinions on the use of regular expression searching. Is there a
general consensus on whether it's now a best practice to rely on this rather
than rolling your own (string) pattern search functions? Where performance
is an issue you can always write your own specialized routine, of course.
However, for the occasional pattern search where performance isn't an issue,
would most seasoned .NET developers rely on "Regex" and cousins? Are there
any disadvantages I should be aware of other than possible efficiency issues
(compared to a specialized routine)? Thanks in advance.
 
Nicholas Paldino [.NET/C# MVP]

Mark,

I think that if the work that you are doing is simple (say, replacing
one character with another), then I would not use a regular expression to do
the work. However, for more complex pattern matching (and it doesn't take
too much to get to that state, say, in processing zip codes, phone numbers,
IP addresses, and so on), I would definitely use a regular expression.
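
To make the distinction concrete, here is a minimal sketch. The class and method names and the ZIP pattern itself are assumptions chosen for illustration, not something taken from this thread, and real postal validation usually needs more than this.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SimpleVersusPattern
{
    // Hypothetical pattern: five digits, optionally followed by "-" and four digits.
    private static readonly Regex UsZipCode = new Regex(@"^[0-9]{5}(-[0-9]{4})?$");

    // Simple case: swapping one character for another needs no regex.
    public static string DashesToSpaces(string input)
    {
        return input.Replace('-', ' ');
    }

    // Pattern case: hand-coding this would take several length and digit checks.
    public static bool IsUsZipCode(string input)
    {
        return UsZipCode.IsMatch(input);
    }

    public static void Main()
    {
        Console.WriteLine(DashesToSpaces("a-b-c"));
        Console.WriteLine(IsUsZipCode("12345-6789"));
        Console.WriteLine(IsUsZipCode("1234"));
    }
}
```

The character replacement is a single call to String.Replace, while the ZIP check already has enough structure (fixed lengths, an optional part) that the pattern is shorter than the equivalent loop.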
 
Mark Chambers

Thanks for the feedback. Yes, I'm talking strictly about pattern matching.
You always want to take the path of least resistance so I wouldn't use it if
something simpler is readily available. However, I started writing an
elaborate routine the other day and after about 20 lines I decided that a
regular expression would be much simpler. After years of working in C++ on
Win32 however (without the luxury of "Regex"), I'm so used to rolling my own
that it didn't even occur to me to try it. Presumably there are few
negatives and your own experience is obviously positive based on your
response. I'll probably rely on it myself from here on. Thanks again for
your input.
 
Jesse Houwing

* Mark Chambers wrote, On 17-5-2007 14:31:
Hi there,

I'm seeking opinions on the use of regular expression searching. Is there a
general consensus on whether it's now a best practice to rely on this rather
than rolling your own (string) pattern search functions? Where performance
is an issue you can always write your own specialized routine, of course.
However, for the occasional pattern search where performance isn't an issue,
would most seasoned .NET developers rely on "Regex" and cousins? Are there
any disadvantages I should be aware of other than possible efficiency issues
(compared to a specialized routine)? Thanks in advance.

I usually prefer to rely on regular expressions. The engine optimizes
these string searches almost automatically. Just be very careful with
patterns that use .* everywhere; those kill performance.

The main reason I like regular expressions is that they are a unified
way to write string manipulation, whereas your own string manipulation
functions can all be different, yet work fine. So from a maintenance
perspective, it should be easier.

This does require you to work with RegexOptions.IgnorePatternWhitespace on
and verbatim strings, so you can easily insert comments and such:

string regex = @"

[a-z] (?# any letter of the alphabet)
(
[ ][a-z][0-9]
)+ (?# some other nifty comment)

";

It makes it much easier to read, just make sure you write your spaces as
[ ] as in the example.
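
For anyone following along, here is a minimal sketch of how such a pattern might be used. The class name and the sample inputs are made up for illustration; the pattern is the one above with ^ and $ added so that it has to match the whole input, and the option that makes the layout possible is RegexOptions.IgnorePatternWhitespace.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommentedPatternDemo
{
    // Verbatim string, so the pattern can span several lines.
    // Whitespace in the pattern is ignored, so a literal space is written [ ].
    private const string Pattern = @"
        ^
        [a-z]           (?# one lowercase letter)
        (
          [ ][a-z][0-9] (?# a space, a letter, a digit)
        )+              (?# one or more of those groups)
        $
    ";

    private static readonly Regex Commented =
        new Regex(Pattern, RegexOptions.IgnorePatternWhitespace);

    public static bool IsMatch(string input)
    {
        return Commented.IsMatch(input);
    }

    public static void Main()
    {
        Console.WriteLine(IsMatch("a b1 c2"));
        Console.WriteLine(IsMatch("a b1c2"));
    }
}
```

Without the option, the line breaks and indentation would be treated as literal characters to match, which is why the two have to be used together.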

Jesse
 
Mark Chambers

Thanks for the insight and I agree. I think that one drawback of regular
expressions however is that they can be tricky to get exactly right
depending on the complexity. When you write your own routine you're focused
on the task at hand and are completely responsible to get the algorithm
right. This is done using native language constructs which developers are
usually much more comfortable with. It's therefore easier to apply since you
just have to get the search logic itself right (granted, this isn't always
trivial). However, regular expressions are a language unto themselves. Most
developers probably only use them occasionally and so they aren't as
experienced with them as with their native language. It's therefore
(potentially) easier to get tripped up trying to write a complicated search
pattern than when writing your own routine (even if it's many times longer,
provided the logic itself is straightforward).
 
Jesse Houwing

* Mark Chambers wrote, On 17-5-2007 15:55:
Thanks for the insight and I agree. I think that one drawback of regular
expressions however is that they can be tricky to get exactly right
depending on the complexity. When you write your own routine you're focused
on the task at hand and are completely responsible to get the algorithm
right. This is done using native language constructs which developers are
usually much more comfortable with. It's therefore easier to apply since you
just have to get the search logic itself right (granted, this isn't always
trivial). However, regular expressions are a language unto themselves. Most
developers probably only use them occasionally and so they aren't as
experienced with them as with their native language. It's therefore
(potentially) easier to get tripped up trying to write a complicated search
pattern than when writing your own routine (even if it's many times longer,
provided the logic itself is straightforward).

Agreed, but a developer must be pretty good at his/her language to write
performant algorithms. I'd rather leave the performance stuff to the
people who're really good at that (the guys who wrote the regex engine
for .NET).

That's why every developer we have is forced to get regex training. I'm
one of the trainers, so for me it has always been 'easy' :)

Jesse
 
Mark Chambers

Agreed, but a developer must be pretty good at his/her language to write
performant algorithms. I'd rather leave the performance stuff to the
people who're really good at that (the guys who wrote the regex engine for
.NET).

Most of the time that's true. However, since regular expressions depend on
generalized algorithms, they usually won't perform better than a specialized
algorithm. The good thing is that performance isn't really an issue most of
the time.

That's why every developer we have is forced to get regex training. I'm
one of the trainers, so for me it has always been 'easy' :)

I'm a former C++ junkie still coping with my addiction. Do you have a
12-step program you can recommend? :)
 
Jesse Houwing

* Mark Chambers wrote, On 17-5-2007 17:29:
Most of the time that's true. However, since regular expressions depend on
generalized algorithms, they usually won't perform better than a specialized
algorithm. The good thing is that performance isn't really an issue most of
the time.


I'm a former C++ junkie still coping with my addiction. Do you have a
12-step program you can recommend? :)

My 12 step program was to develop SpamAssassin anti-spam rules for about
1.5 years. It's the best way to learn these things :).

But seriously, just try it. Take a couple of your older algorithms and try
to convert them to regex. You can always ask here (or by email) for
suggestions.

I've seen the regex engine outwit dedicated string searching functions
on more than one occasion. And usually with a lot less code (except for
that one barely readable line of regex in there ;)).

Jesse
 
Bruce Wood

IMHO there is a range of problems that are best solved using Regex.

As Nicholas points out, very simple problems can be solved using
regular string routines. Solving them with Regex is like killing a fly
with a sledgehammer. Stuff like finding a particular string in
another, or replacing one character with another.

At the other end of the scale, I've seen posts here asking how to do
such-and-so using Regex, where the patterns are so baroque that it
takes a Regex expert here to sort them out and get them right. This
strikes me as unduly brittle and hard to maintain. If you can't figure
out for yourself how to write the correct Regex to match something,
break it down into a combination of open code and Regex, or test
multiple, simpler Regex patterns one after the other. If it takes you
hours of struggle to come up with the correct pattern, then perhaps
it's a sign that you're over-reaching and you should break the problem
into more manageable chunks.
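
As a rough illustration of mixing open code with a simpler pattern, here is a sketch that checks a "host:port" string. The task, the class name and the host pattern are assumptions made up for this example; the point is only the division of labour.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class SplitTheProblem
{
    // Hypothetical simple pattern: dot-separated labels of letters, digits and hyphens.
    private static readonly Regex HostName =
        new Regex(@"^[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)*$");

    public static bool IsHostAndPort(string input)
    {
        // Open code: split at the last colon.
        int colon = input.LastIndexOf(':');
        if (colon <= 0 || colon == input.Length - 1)
        {
            return false;
        }
        string host = input.Substring(0, colon);
        string portText = input.Substring(colon + 1);

        // Open code: a numeric range check is awkward to express as a pattern.
        // NumberStyles.None accepts digits only (no sign, no whitespace).
        int port;
        if (!int.TryParse(portText, NumberStyles.None,
                          CultureInfo.InvariantCulture, out port)
            || port < 1 || port > 65535)
        {
            return false;
        }

        // Regex: only the part that really is a pattern.
        return HostName.IsMatch(host);
    }

    public static void Main()
    {
        Console.WriteLine(IsHostAndPort("example.com:8080"));
        Console.WriteLine(IsHostAndPort("example.com:70000"));
    }
}
```

A single pattern that also enforced the 1 to 65535 range would be much harder to read than the two-line numeric check here.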

Nonetheless, there are a huge number of string matching problems in
this mid-range: complex enough that open code becomes unwieldy, but
simple enough that I can write a Regex pattern in a few minutes to
match it.
 
Jon Skeet [C# MVP]

I've seen the regex engine outwit dedicated string searching functions
on more than one occasion. And usually with a lot less code (except for
that one barely readable line of regex in there ;)).

And that's exactly the problem - the regex which is barely readable.
I'd rather read five or six lines of simple string manipulation than
rely on not only *my* understanding of regex (and the subtleties) but
also the understanding of whoever's reading and maintaining my code at
a later date.

Regexes are useful in their place, but they can be horribly overused.
I've seen them being used (incorrectly, even!) to check whether a
string starts with a particular string (not a pattern, just a straight
string) and whether a string has a particular length. It's crazy - just
as crazy as writing a complicated pattern matcher rather than using a
regex where appropriate.
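
For those two cases the plain string members already do the job. The sketch below uses invented class and method names. One plausible way the regex version can go wrong (not necessarily what happened in the code described above) is that a literal such as "a.b" used unescaped as a pattern also matches "axb".

```csharp
using System;

public static class NoRegexNeeded
{
    // Prefix test against a literal string: no pattern involved,
    // so characters like '.' need no escaping.
    public static bool HasPrefix(string input, string prefix)
    {
        return input.StartsWith(prefix, StringComparison.Ordinal);
    }

    // Length test: a property read rather than a pattern such as ^.{8}$.
    public static bool HasLength(string input, int length)
    {
        return input.Length == length;
    }

    public static void Main()
    {
        Console.WriteLine(HasPrefix("a.b-rest", "a.b"));
        Console.WriteLine(HasPrefix("axb-rest", "a.b"));
        Console.WriteLine(HasLength("12345678", 8));
    }
}
```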

Basically, use the right tool for the job. Sometimes that may require
writing some code to achieve a goal both with "hand-coding" and with a
regex, then looking (or asking others) to see which is more readable.
 
Jesse Houwing

* Jon Skeet [C# MVP] wrote, On 17-5-2007 20:12:
And that's exactly the problem - the regex which is barely readable.
I'd rather read five or six lines of simple string manipulation than
rely on not only *my* understanding of regex (and the subtleties) but
also the understanding of whoever's reading and maintaining my code at
a later date.

Agreed. That's why I opted to use verbatim strings and regex comments in
combination with RegexOptions.IgnorePatternWhitespace. A well-commented
regex will do wonders :). But then it's no longer one line :(. (kidding).

Basically, use the right tool for the job. Sometimes that may require
writing some code to achieve a goal both with "hand-coding" and with a
regex, then looking (or asking others) to see which is more readable.

Agreed. One use for regexes I use very often is to use a simple regex to
find the general pattern. Then use string manipulation or another regex
to finish the job. MatchEvaluators are your friend there.

Also building your regex in code from several well named variables and
+ing them together in the end will improve readability.
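
A small sketch of both ideas, with the usual caveat that the date format, the class name and the variable names are invented for this example: the pattern is assembled from named pieces, and a MatchEvaluator finishes the job with ordinary string code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ComposedPatternDemo
{
    // Well-named pieces, concatenated into the final pattern.
    private const string Day = @"(?<day>[0-9]{1,2})";
    private const string Month = @"(?<month>[0-9]{1,2})";
    private const string Year = @"(?<year>[0-9]{4})";
    private const string Separator = "-";

    private static readonly Regex DayMonthYear =
        new Regex(@"\b" + Day + Separator + Month + Separator + Year + @"\b");

    // The regex finds the general shape; plain code rearranges the parts.
    private static string ToYearMonthDay(Match m)
    {
        return m.Groups["year"].Value + "-"
            + m.Groups["month"].Value.PadLeft(2, '0') + "-"
            + m.Groups["day"].Value.PadLeft(2, '0');
    }

    public static string Normalize(string input)
    {
        return DayMonthYear.Replace(input, new MatchEvaluator(ToYearMonthDay));
    }

    public static void Main()
    {
        Console.WriteLine(Normalize("posted on 17-5-2007 at 14:31"));
    }
}
```

The pattern only recognises the shape of a date; it does not check that the day and month are in range, which would be another job for plain code inside the evaluator.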

I've seen people writing awful string manipulations as well... Whichever
tool or language you choose, the art lies in putting it to good use
*and* making the result both 100% correct and maintainable.

Jesse
 
Arne Vajhøj

Mark said:
I'm seeking opinions on the use of regular expression searching. Is there a
general consensus on whether it's now a best practice to rely on this rather
than rolling your own (string) pattern search functions? Where performance
is an issue you can always write your own specialized routine, of course.
However, for the occasional pattern search where performance isn't an issue,
would most seasoned .NET developers rely on "Regex" and cousins? Are there
any disadvantages I should be aware of other than possible efficiency issues
(compared to a specialized routine)? Thanks in advance.

You should go for the solution that gives the simplest code.

Which means regex for everything that needs more than two Substring
and/or IndexOf calls.

Arne
 
Bruce Wood

You should go for the solution that gives the simplest code.

Which means regex for everything that needs more than two Substring
and/or IndexOf calls.

Sorry. I disagree in part: there's an upper limit at which one tries
to stuff too much into Regex and it becomes a delicate, unreadable
mess.

Yes, Regex for everything more complex than a few simple string
operations... up to where you can't easily understand the Regex, at
which point you should start thinking of ways to break the problem
into smaller, more manageable parts.
 
Arne Vajhøj

Bruce said:
Sorry. I disagree in part: there's an upper limit at which one tries
to stuff too much into Regex and it becomes a delicate, unreadable
mess.

Yes, Regex for everything more complex than a few simple string
operations... up to where you can't easily understand the Regex, at
which point you should start thinking of ways to break the problem
into smaller, more manageable parts.

Using Substring/IndexOf will be an even bigger mess for very
complex stuff.

Breaking up the problem is a solution. But that solution can be
used both with regex and Substring/IndexOf.

For really complex stuff a scanner and parser a la lex & yacc
may be the ultimate solution.

Arne
 
Jesse Houwing

For really complex stuff a scanner and parser a la lex & yacc
may be the ultimate solution.

And we've now been given exactly that for scanner/parser generation in C#.
It's included in the latest Visual Studio SDK for Visual Studio 2005.

Jesse
 
Bruce Wood

Using Substring/IndexOf will be an even bigger mess for very
complex stuff.

Perhaps, perhaps not. The point is that more complex problems require
some sort of design up front. Multiple regexes may be the solution, or
perhaps some sort of state machine, or perhaps, as you stated later, a
full-blown parser.

Breaking up the problem is a solution. But that solution can be
used both with regex and Substring/IndexOf.

True. We agree there.

For really complex stuff a scanner and parser a la lex & yacc
may be the ultimate solution.

Yes, and those may or may not use Regex internally. The point is that
the problem becomes so complex that attempting to tackle it in a
pattern is daunting, and even if you managed it, it would be
unmaintainable.

For my money, there are three groups of string-mashing problems:

1. So simple that using Regex is overkill.
2. Appropriately solved using Regex (this is a very large group).
3. So complex that they require a design phase, at which point the
technology finally employed may be string functions, regex calls, more
sophisticated approaches, or any combination of these.

So, one can try to do things in open code that are simpler to tackle
using Regex, and on the other hand there's the "since I got this great
big hammer, everything looks like a nail" problem, in which people try
to use Regex for _everything_, and sometimes end up with a baroque
mess.

Nonetheless, for the vast majority of day-to-day string parsing
problems, Regex is still the best way to go.
 
Arne Vajhøj

Bruce said:
Perhaps, perhaps not. The point is that more complex problems require
some sort of design up front. Multiple regexes may be the solution, or
perhaps some sort of state machine, or perhaps, as you stated later, a
full-blown parser.

Difficult not to agree. :)

For my money, there are three groups of string-mashing problems:

1. So simple that using Regex is overkill.
2. Appropriately solved using Regex (this is a very large group).
3. So complex that they require a design phase, at which point the
technology finally employed may be string functions, regex calls, more
sophisticated approaches, or any combination of these.

So, one can try to do things in open code that are simpler to tackle
using Regex, and on the other hand there's the "since I got this great
big hammer, everything looks like a nail" problem, in which people try
to use Regex for _everything_, and sometimes end up with a baroque
mess.

Nonetheless, for the vast majority of day-to-day string parsing
problems, Regex is still the best way to go.

I think we are in violent agreement.

Arne
 
