Using a regular expression to retrieve the text between two parentheses

Mark Rae

Hi,

Supposing I had a string made up of a person's name followed by their
profession in parentheses e.g.

string strText = "Tiger Woods (golfer)";

and I wanted to extract the portion of the string between the parentheses
i.e. "golfer"

Would a regular expression be the most efficient way of doing this...?

I'm trying to do something like this:

string strProfession = String.Empty;
Regex objRegEx = new Regex("(((.|\n)*?))", RegexOptions.IgnoreCase);
foreach (Match objMatch in objRegEx.Matches(strText))
{
strProfession = objMatch.ToString();
}

but that is returning an empty string, no doubt because I haven't defined
the regular expression correctly.

Also, is it even necessary to have a foreach loop here, as in this
particular scenario there can only ever be one match...?

Any assistance gratefully received.

Mark
 
Arne Vajhøj

Mark said:
Supposing I had a string made up of a person's name followed by their
profession in parentheses e.g.

string strText = "Tiger Woods (golfer)";

and I wanted to extract the portion of the string between the parentheses
i.e. "golfer"

Would a regular expression be the most efficient way of doing this...?

I'm trying to do something like this:

string strProfession = String.Empty;
Regex objRegEx = new Regex("(((.|\n)*?))", RegexOptions.IgnoreCase);
foreach (Match objMatch in objRegEx.Matches(strText))
{
strProfession = objMatch.ToString();
}

but that is returning an empty string, no doubt because I haven't defined
the regular expression correctly.

Also, is it even necessary to have a foreach loop here, as in this
particular scenario there can only ever be one match...?

string s = "Tiger Woods (golfer)";
Regex re = new Regex(@"(\()([^\)]*)(\))");
string prof = re.Match(s).Groups[2].Value;

seems to work.

No, a regex will typically not be the most efficient way of coding it,
but it is simple code with a well documented syntax.

Some spaghetti with IndexOf will be faster, but it would
also be much easier to introduce bugs if modifying the code.

Arne
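
For comparison, here is a minimal sketch of both approaches discussed in this reply: the regex above and a plain IndexOf version. The class and method names (ParenExtract, WithRegex, WithIndexOf) are made up for illustration, and returning an empty string when no parentheses are found is an assumption, not something the thread specifies.

```csharp
using System;
using System.Text.RegularExpressions;

static class ParenExtract
{
    // The pattern from the reply: group 1 is "(", group 2 is the text inside, group 3 is ")".
    static readonly Regex Re = new Regex(@"(\()([^\)]*)(\))");

    public static string WithRegex(string s)
    {
        Match m = Re.Match(s);
        // Assumption: return an empty string when there is no match.
        return m.Success ? m.Groups[2].Value : String.Empty;
    }

    public static string WithIndexOf(string s)
    {
        int open = s.IndexOf('(');
        if (open < 0) return String.Empty;
        int close = s.IndexOf(')', open + 1);
        if (close < 0) return String.Empty;
        // Take everything strictly between the two parentheses.
        return s.Substring(open + 1, close - open - 1);
    }

    static void Main()
    {
        string s = "Tiger Woods (golfer)";
        Console.WriteLine(WithRegex(s));
        Console.WriteLine(WithIndexOf(s));
    }
}
```

Since only one match is expected, a single call to Match is enough and no foreach loop over Matches is needed.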
 
Jon Shemitz

Regex re = new Regex(@"(\()([^\)]*)(\))");

@"\( ([^\)]+) \)", RegexOptions.IgnorePatternWhitespace

is probably a bit simpler and faster - there's no real need to capture
the parens.
No, a regex will typically not be the most efficient way of coding it,
but it is simple code with a well documented syntax.

You might be surprised. I compared a regex to find all tokens between
% signs (@"% (\w+) %") with a hand-coded state machine. Not only did
the hand-coded version take about 180 times as long to write (ie,
fifteen minutes vs five seconds) it also ran slower. As soon as the
task gets at all complex, a regex can save both programmer time and
run time.
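
As a rough illustration of the kind of regex described here, the sketch below collects every token between % signs using the quoted pattern. The class name PercentTokens, the Find method, and the sample sentence are invented for this example. It is not the code that was benchmarked.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class PercentTokens
{
    // With IgnorePatternWhitespace the spaces in the pattern are only for readability.
    static readonly Regex TokenRe =
        new Regex(@"% (\w+) %", RegexOptions.IgnorePatternWhitespace);

    public static List<string> Find(string text)
    {
        var tokens = new List<string>();
        foreach (Match m in TokenRe.Matches(text))
            tokens.Add(m.Groups[1].Value); // group 1 is the word between the % signs
        return tokens;
    }

    static void Main()
    {
        // Hypothetical input, chosen only to show the behaviour.
        foreach (string t in Find("Dear %name%, your %item% has shipped."))
            Console.WriteLine(t);
    }
}
```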
Some spaghetti with IndexOf will be faster, but it would
also be much easier to introduce bugs if modifying the code.

Yes, and the regex is easier to read and maintain - part of one line,
instead of three statements and some comments.
 
Gary Stephenson

Hi,
string strText = "Tiger Woods (golfer)";

Others have supplied working regexes - so I won't repeat.

But perhaps you should be made aware of the limitations implicit in
regexes - the main one being commonly rendered as "regexes can't count". As
long as you are not having to deal with recursive structures, nested
delimiters and so on, regexes will often work well. But they can't be used
to "find the balancing brace", verify correct nesting or suchlike.

In theory, you can explicitly construct a regex to cope with a given maximum
number of pairs of balancing delimiters, but even to cope with a single
extra level of nesting requires a regex pattern so complex that it's
clearer and simpler to just code the match algorithm explicitly.
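
To make "code the match algorithm explicitly" concrete, here is a minimal sketch that finds the closing parenthesis balancing the first opening one by counting nesting depth. The names NestedParens and OutermostContents are hypothetical, and returning an empty string for missing or unbalanced parentheses is an assumption made for this example.

```csharp
using System;

static class NestedParens
{
    // Returns the text inside the first balanced pair of parentheses,
    // including any nested pairs, or an empty string if there is none.
    public static string OutermostContents(string s)
    {
        int start = s.IndexOf('(');
        if (start < 0) return String.Empty;

        int depth = 0;
        for (int i = start; i < s.Length; i++)
        {
            if (s[i] == '(')
            {
                depth++;
            }
            else if (s[i] == ')' && --depth == 0)
            {
                // Found the parenthesis that balances the one at 'start'.
                return s.Substring(start + 1, i - start - 1);
            }
        }
        return String.Empty; // ran out of input before the depth returned to zero
    }

    static void Main()
    {
        Console.WriteLine(OutermostContents("Tiger Woods (golfer (professional))"));
    }
}
```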

And of course there is the school of thought that regexes are only rarely
the best solution:

"Some people, when confronted with a problem, think 'I know, I'll use
regular expressions.' Now they have two problems." - Jamie Zawinski *[1]

cheers,

gary

http://www.oxide.net.au

*[1] - For an entertaining discussion of the origins of this quote, see
Jeffrey Friedl's blog at
http://regex.info/blog/2006-09-15/247
 
Mark Rae

string s = "Tiger Woods (golfer)";
Regex re = new Regex(@"(\()([^\)]*)(\))");
string prof = re.Match(s).Groups[2].Value;

seems to work.

Yes indeed - thanks very much.
No, a regex will typically not be the most efficient way of coding it,
but it is simple code with a well documented syntax.
OK.

Some spaghetti with IndexOf will be faster, but it would
also be much easier to introduce bugs if modifying the code.

I guess so...
 
Mark Rae

Regex re = new Regex(@"(\()([^\)]*)(\))");

@"\( ([^\)]+) \)", RegexOptions.IgnorePatternWhitespace

is probably a bit simpler and faster - there's no real need to capture
the parens.

That returns an empty string...
You might be surprised. I compared a regex to find all tokens between
% signs (@"% (\w+) %") with a hand-coded state machine. Not only did
the hand-coded version take about 180 times as long to write (ie,
fifteen minutes vs five seconds) it also ran slower. As soon as the
task gets at all complex, a regex can save both programmer time and
run time.

I have a real "blind-spot" with regular expressions... After over 20 years
of programming in all sorts of languages, I *still* can't do them in my
head, or look at them and know intuitively what they're doing... :)
 
Arne Vajhøj

Mark said:
I have a real "blind-spot" with regular expressions... After over 20 years
of programming in all sorts of languages, I *still* can't do them in my
head, or look at them and know intuitively what they're doing... :)

The syntax is horrible.

But it is well documented. And there are a ton of supporting
tools out there.

Arne
 
Mark Rae

The syntax is horrible.

That's for sure!
But it is well documented. And there are a ton of supporting
tools out there.

Can you recommend one? I've looked at several over the years, but almost all
of them seem to be designed to show the effect of a regular expression on a
string, rather than "build me a regular expression which will..."

If you could have found one which would have built me the "find all the text
between the opening and closing parentheses" expression, I wouldn't have
troubled the newsgroup...
 
Arne Vajhøj

Mark said:
Can you recommend one? I've looked at several over the years, but almost all
of them seem to be designed to show the effect of a regular expression on a
string, rather than "build me a regular expression which will..."

If you could have found one which would have built me the "find all the text
between the opening and closing parentheses" expression, I wouldn't have
troubled the newsgroup...

I am not aware of any English-to-regex translator, but an
interactive tool that shows what you get out of a regex
is useful as well, because it helps you
build up the regex incrementally.

Arne
 
Jon Shemitz

Mark said:
@"\( ([^\)]+) \)", RegexOptions.IgnorePatternWhitespace

is probably a bit simpler and faster - there's no real need to capture
the parens.

That returns an empty string...

No, it doesn't. The Match.Value is the parenthesized expression;
Match.Groups[1].Value is the string in the parens.
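
A small sketch of the difference, using the pattern quoted above. The class and method names (MatchVsGroup, Whole, Inner) are invented for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class MatchVsGroup
{
    static readonly Regex Re =
        new Regex(@"\( ([^\)]+) \)", RegexOptions.IgnorePatternWhitespace);

    // The whole match, including the parentheses themselves.
    public static string Whole(string s)
    {
        return Re.Match(s).Value;
    }

    // Only the first capture group: the text between the parentheses.
    public static string Inner(string s)
    {
        return Re.Match(s).Groups[1].Value;
    }

    static void Main()
    {
        string s = "Tiger Woods (golfer)";
        Console.WriteLine(Whole(s)); // (golfer)
        Console.WriteLine(Inner(s)); // golfer
    }
}
```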
 
Jon Shemitz

Gary said:
But perhaps you should be made aware of the limitations implicit in
regexes - the main one being commonly rendered as "regexes can't count". As
long as you are not having to deal with recursive structures, nested
delimiters and so on, regexes will often work well. But they can't be used
to "find the balancing brace", verify correct nesting or suchlike.

Not true with .NET regexes, btw.
 
Ethan Strauss

I haven't been following this whole thread, so I may be somewhat off topic,
but it seems relevant...

Don't forget that you can name groups in a Regex.
For example you could take the expression below and add a name as follows
@"\( (?<StuffIWantToActuallySee>[^\)]+) \)",
RegexOptions.IgnorePatternWhitespace

You can then get at it from the Group name.
Match.Groups["StuffIWantToActuallySee"].Value

I find that much easier to deal with than trying to get the groups by
number.
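
Put together, that might look like the sketch below. The group name "profession" and the class and method names are placeholders chosen for this example. Any valid group name works the same way.

```csharp
using System;
using System.Text.RegularExpressions;

static class NamedGroupDemo
{
    // Same pattern as before, but the capture group is named "profession".
    static readonly Regex Re = new Regex(
        @"\( (?<profession>[^\)]+) \)",
        RegexOptions.IgnorePatternWhitespace);

    public static string Profession(string s)
    {
        // Look the group up by name instead of by number.
        return Re.Match(s).Groups["profession"].Value;
    }

    static void Main()
    {
        Console.WriteLine(Profession("Tiger Woods (golfer)"));
    }
}
```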

Ethan

Jon Shemitz said:
Mark said:
@"\( ([^\)]+) \)", RegexOptions.IgnorePatternWhitespace

is probably a bit simpler and faster - there's no real need to capture
the parens.

That returns an empty string...

No, it doesn't. The Match.Value is the parenthesized expression;
Match.Groups[1].Value is the string in the parens.
 
Gary Stephenson

Not true with .NET regexes, btw.

Really? How so? They must then be fundamentally different to all regex
implementations I have seen or heard about. A quick scan of the
documentation doesn't reveal anything significantly different about .NET
regexes ... hmmm ...

As I understand it, in order to solve matching-brace type problems, a
push-down automaton is required, as opposed to a finite-state automaton. Do
.NET regexes somehow provide that?

Please explain,

respectfully,

gary
 
Jon Shemitz

As I understand it, in order to solve matching-brace type problems, a
push-down automaton is required, as opposed to a finite-state automaton. Do
.NET regexes somehow provide that?

Yes.

.NET capture groups capture all matching expressions, not just the
last one. The "balancing group definition" grouping construct
(?<-name>expr) pops the most recent capture if expr matches; the
(?(name)a|b) alternation construct lets you force the match to fail if
a stack is not empty.

See pgs 289-290 in
<http://www.midnightbeach.com/.net/ShemitzBook.Chapter11.pdf> for more
details.
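
As a hedged illustration of those two constructs (a sketch written for this thread, not the example from the book chapter), the pattern below pushes a capture onto a group named "open" for every "(", pops one for every ")", and fails the match if anything is left on the stack at the end. The class, method, and group names are made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class BalancedParens
{
    // (?<open>\()    pushes a capture for each opening parenthesis
    // (?<-open>\))   pops one for each closing parenthesis; it cannot match if the stack is empty
    // (?(open)(?!))  fails the match if any capture is still on the stack
    static readonly Regex Balanced = new Regex(
        @"^(?:[^()]|(?<open>\()|(?<-open>\)))*(?(open)(?!))$");

    public static bool IsBalanced(string s)
    {
        return Balanced.IsMatch(s);
    }

    static void Main()
    {
        Console.WriteLine(IsBalanced("Tiger Woods (golfer (professional))")); // True
        Console.WriteLine(IsBalanced("Tiger Woods (golfer (professional)"));  // False
    }
}
```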
 
Gary Stephenson

Hi Jon

----- Original Message -----
From: "Jon Shemitz"
Newsgroups: microsoft.public.dotnet.languages.csharp
Sent: Tuesday, January 16, 2007 12:09 PM
Subject: Re: Using a regular expression to retrieve the text between two
parentheses

.NET capture groups capture all matching expressions, not just the
last one. The "balancing group definition" grouping construct
(?<-name>expr) pops the most recent capture if expr matches; the
(?(name)a|b) alternation construct lets you force the match to fail if
a stack is not empty.

See pgs 289-290 in
<http://www.midnightbeach.com/.net/ShemitzBook.Chapter11.pdf> for more
details.

Cool as! Thanks for that - very interesting indeed. Apologies to all for
misrepresenting (and underestimating) .NET regexes.

gary
 
