Regular expression returns round brackets

Eric · Feb 1, 2008

I have a string that contains tokens surrounded by percentage signs,
e.g. %token1%

I want to return any instance of a token that is not surrounded by
single quotes. For example:

Using the input string: "This is %token1% and here is '%token2%'."
I want to match "%token1%" but not "%token2%".

My match pattern is: "[^\x27a-zA-Z0-9]%[^%]+%[^\x27a-zA-Z0-9]"

If my input string is: "This is %token1% and here is '%token2%' and
also (%token3%)."

I am getting back "%token1%" and "(%token3%)".

Why am I getting back round brackets around %token3%? How can I still
get back %token3%, but without the round brackets?

Thanks,
Eric

Jon Skeet [C# MVP] · Feb 1, 2008

I suspect it's because of \x27a. Did you really mean Unicode U+027A,
or did you mean U+0027 followed by an a?

The \x escape sequence is a really, really bad one to use in almost
all situations. Especially in this case, as using a ' would be a lot
simpler to start with.

Jon

Anthony Jones · Feb 1, 2008

Nope, it has nothing to do with the \x27, but I agree there is no need to
escape it.

The pattern matches one character that isn't a ' or alphanumeric, then a %,
then one or more characters that aren't a %, then a %, and finally one more
character that isn't a ' or alphanumeric.

Therefore the spaces that precede and follow the first token are included in
the match (you have to look really carefully to see that in the result), and
in the second match the ( and ) are included too, because they're not ' or
alphanumeric.

Try this instead:-

"(?<!')%(?!')[^%]+%(?!')"
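
For anyone who wants to check the difference, here is a minimal sketch. The class and method names (TokenDemo, Find) are made up for illustration, and it assumes verbatim strings with a literal apostrophe in place of \x27. The first pattern consumes the neighbouring characters, while the lookaround version only inspects them.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TokenDemo
{
    // Hypothetical helper: returns the text of every match, in order.
    public static List<string> Find(string pattern, string input)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            results.Add(m.Value);
        }
        return results;
    }

    static void Main()
    {
        string input = "This is %token1% and here is '%token2%' and also (%token3%).";

        // Character classes consume the character on each side of the token.
        string consuming = @"[^'a-zA-Z0-9]%[^%]+%[^'a-zA-Z0-9]";

        // Lookarounds test the neighbours without adding them to the match.
        string lookaround = @"(?<!')%(?!')[^%]+%(?!')";

        Console.WriteLine(string.Join("|", Find(consuming, input)));
        // prints " %token1% |(%token3%)"
        Console.WriteLine(string.Join("|", Find(lookaround, input)));
        // prints "%token1%|%token3%"
    }
}
```

Note that the lookaround pattern only checks for an apostrophe directly next to a percent sign, so unbalanced input (an opening quote with no closing one) may still produce surprising matches.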

Jon Skeet [C# MVP] · Feb 1, 2008

Anthony Jones said:
Nope, it has nothing to do with the \x27, but I agree there is no need to
escape it.

The pattern matches one character that isn't a ' or alphanumeric, then a %,
then one or more characters that aren't a %, then a %, and finally one more
character that isn't a ' or alphanumeric.

No it doesn't. The escaping isn't just unnecessary, it's broken.

To see what I mean, compile and run this:

    using System;

    class Test
    {
        static void Main()
        {
            // Non-verbatim literal: the C# compiler handles \x27a before Regex sees it.
            string x = "[^\x27a-zA-Z0-9]%[^%]+%[^\x27a-zA-Z0-9]";
            Console.WriteLine(x);
        }
    }

You won't see a quote there at all, nor the 'a' of a-z.

The string is equivalent to

"[^\u027A-zA-Z0-9]%[^%]+%[^\u027A-zA-Z0-9]"

Definitely not what's intended. I dare say it's not responsible for the
brackets issue, but the regex isn't doing what it's meant to at the
moment.
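
A small sketch of why that happens, with illustrative names (EscapeDemo, Regular, Verbatim) that are not from the thread: in a regular C# string literal, \x takes between one and four hex digits, so it swallows the "a" as well. A verbatim literal leaves the backslash alone.

```csharp
using System;

static class EscapeDemo
{
    // Regular literal: \x greedily takes the hex digits 2, 7 and a.
    public static readonly string Regular = "\x27a-z";

    // Verbatim literal: all seven characters are kept exactly as typed.
    public static readonly string Verbatim = @"\x27a-z";

    static void Main()
    {
        Console.WriteLine(Regular.Length);             // prints 3
        Console.WriteLine((int)Regular[0] == 0x027A);  // prints True
        Console.WriteLine(Verbatim.Length);            // prints 7
    }
}
```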

Eric · Feb 1, 2008

Thanks Anthony and Jon.

Yes, it makes sense that I shouldn't be unnecessarily escaping the
apostrophe.

Anthony, the pattern you provided above works great, thanks.

Eric

Anthony Jones · Feb 1, 2008

Jon Skeet said:
The string is equivalent to

"[^\u027A-zA-Z0-9]%[^%]+%[^\u027A-zA-Z0-9]"

Based on the OP's described output, it was clear that he wasn't getting the
error that Regex would throw if that exact syntax was used. He was probably
using @"[^\x27a-zA-Z0-9]%[^%]+%[^\x27a-zA-Z0-9]", in which case the Regex
parser would see the \x27 correctly.
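
To make that concrete, here is a hedged sketch (VerbatimDemo and Compiles are hypothetical names). It assumes the .NET behaviour that a character-class range running from a higher code point to a lower one is rejected with an ArgumentException, which is what the non-verbatim string produces once \x27a has become U+027A followed by "-z".

```csharp
using System;
using System.Text.RegularExpressions;

static class VerbatimDemo
{
    // Hypothetical helper: true if the Regex constructor accepts the pattern.
    public static bool Compiles(string pattern)
    {
        try
        {
            new Regex(pattern);
            return true;
        }
        catch (ArgumentException)
        {
            return false;
        }
    }

    static void Main()
    {
        // Verbatim: Regex receives \x27 and reads it as a two-digit hex escape (').
        string verbatim = @"[^\x27a-zA-Z0-9]%[^%]+%[^\x27a-zA-Z0-9]";

        // Regular: the compiler has already turned \x27a into U+027A,
        // leaving the reversed range U+027A-z inside the character class.
        string regular = "[^\x27a-zA-Z0-9]%[^%]+%[^\x27a-zA-Z0-9]";

        Console.WriteLine(Compiles(verbatim));              // prints True
        Console.WriteLine(Compiles(regular));               // prints False
        Console.WriteLine(Regex.IsMatch("'", @"^\x27$"));   // prints True
    }
}
```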

Jon Skeet [C# MVP] · Feb 2, 2008

Fair enough - although in that case it's just a bug waiting to occur
when some maintenance programmer decides to remove the @. It's also an
example of why it's a good idea to post complete, runnable code - it
removes sources of ambiguity like this.

I've been meaning to add the revolting \x escape (in C#, not regex) to
my brainteaser list. Something along the lines of:

// Character 9 = tab
Console.WriteLine("You say:\x9Good compiler!");
Console.WriteLine("You say:\x9Bad compiler!");

I wonder how many people would quickly spot that the second line will
actually try to output U+9BAD (which falls in the CJK Unified Ideographs
block) followed by " compiler!", with no tab at all.
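
A quick way to check the brainteaser without relying on what the console can render is to look at the character that follows "You say:" (index 8). This is only a sketch; the class and field names are invented for illustration, and the last string shows one possible fix using \t.

```csharp
using System;

static class BrainteaserDemo
{
    // 'G' is not a hex digit, so \x9 stops after one digit: a tab.
    public static readonly string Good = "You say:\x9Good compiler!";

    // 'B', 'a' and 'd' are all hex digits, so \x9Bad is the single char U+9BAD.
    public static readonly string Bad = "You say:\x9Bad compiler!";

    // \t (or \u0009) has a fixed length, so the following text cannot change it.
    public static readonly string Safe = "You say:\tBad compiler!";

    static void Main()
    {
        Console.WriteLine(Good[8] == '\t');         // prints True
        Console.WriteLine((int)Bad[8] == 0x9BAD);   // prints True
        Console.WriteLine(Safe[8] == '\t');         // prints True
    }
}
```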
