Regular expression returns round brackets

  • Thread starter: Eric

Eric

I have a string that contains tokens surrounded by percentage signs,
e.g. %token1%

I want to return any instance of a token that is not surrounded by
single quotes. For example:

Using the input string: "This is %token1% and here is '%token2%'."
I want to match "%token1%" but not "%token2%".

My match pattern is: "[^\x27a-zA-Z0-9]%[^%]+%[^\x27a-zA-Z0-9]"

If my input string is: "This is %token1% and here is '%token2%' and
also (%token3%)."

I am getting back "%token1%" and "(%token3%)".

Why am I getting back round brackets around %token3%? How can I still
get back %token3%, but without the round brackets?

Thanks,
Eric
 
I suspect it's because of \x27a. Did you really mean Unicode U+027A,
or did you mean U+0027 followed by an a?

The \x escape sequence is a really, really bad one to use in almost
all situations. Especially in this case, as using a ' would be a lot
simpler to start with.

Jon
 
Nope, it has nothing to do with the \x27, but I agree there is no need to
escape it.

The pattern matches one character that isn't a ' or alphanumeric, then a %,
then one or more characters that aren't a %, then a closing %, and finally
one more character that isn't a ' or alphanumeric.

Both of those outer characters are part of the match. So the spaces that
precede and follow the first token are included in the result (you have to
look really carefully to see them), and for the third token the ( and ) are
included as well, because they aren't ' or alphanumeric either.

Try this instead. The lookbehind and lookaheads check the neighbouring
characters without making them part of the match:

"(?<!')%(?!')[^%]+%(?!')"
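
To make the difference visible, here is a minimal sketch that runs both patterns over Eric's input and prints each match inside square brackets. The class name TokenDemo and the helper FindAll are made up for this illustration, and the first pattern is written with a plain ' instead of \x27, which is an assumption about what the original pattern was meant to be.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class TokenDemo
{
    // Hypothetical helper: returns the text of every match, in order.
    public static string[] FindAll(string pattern, string input) =>
        Regex.Matches(input, pattern).Cast<Match>().Select(m => m.Value).ToArray();

    static void Main()
    {
        string input = "This is %token1% and here is '%token2%' and also (%token3%).";

        // Character classes consume one character on each side of the token.
        string consuming = @"[^'a-zA-Z0-9]%[^%]+%[^'a-zA-Z0-9]";
        // Lookarounds only test the neighbours; they are not part of the match.
        string lookaround = @"(?<!')%(?!')[^%]+%(?!')";

        foreach (string m in FindAll(consuming, input))
            Console.WriteLine("[" + m + "]");   // [ %token1% ] then [(%token3%)]

        foreach (string m in FindAll(lookaround, input))
            Console.WriteLine("[" + m + "]");   // [%token1%] then [%token3%]
    }
}
```

The brackets in the output are only there to expose the leading and trailing spaces that the first pattern captures around %token1%.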
 
Anthony Jones said:
Nope it has nothing to do with the \x27 but I agree there is no need to
escape it.

The pattern matches anything that isn't a ' or alphanumeric that precedes a
% followed by anything that isn't a % that is followed by a % and anything
that isn't a ' or alphanumeric.

No it doesn't. The escaping isn't just unnecessary, it's broken.

To see what I mean, compile and run this:

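using System;

class Test
{
    static void Main()
    {
        // Non-verbatim literal: the compiler reads \x27a as the single
        // character U+027A, not an apostrophe followed by 'a'.
        string x = "[^\x27a-zA-Z0-9]%[^%]+%[^\x27a-zA-Z0-9]";
        // Print the string exactly as the regex engine would receive it.
        Console.WriteLine(x);
    }
}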

You won't see a quote there at all, nor the 'a' of a-z.

The string is equivalent to

"[^\u027A-zA-Z0-9]%[^%]+%[^\u027A-zA-Z0-9]"

Definitely not what's intended. I dare say it's not responsible for the
brackets issue, but the regex isn't doing what it's meant to at the
moment.
 
Thanks Anthony and Jon.

Yes, it makes sense that I shouldn't be unnecessarily escaping the
apostrophe.

Anthony, the pattern you provided above works great, thanks.

Eric
 
Based on the OP's described output, it was clear that he wasn't getting the
error that Regex would throw if that exact syntax were used. He was
probably using @"[^\x27a-zA-Z0-9]%[^%]+%[^\x27a-zA-Z0-9]", in which case the
regex parser would see the \x27 correctly.
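
As a rough sketch of that distinction, the program below compares the regular and the verbatim spelling of just the character class. EscapeDemo and Compiles are names invented for this example. It relies on two things I believe hold for .NET: the regex parser reads \x as exactly two hex digits, and it rejects a character class range written in reverse order (U+027A sorts after z) with an ArgumentException.

```csharp
using System;
using System.Text.RegularExpressions;

static class EscapeDemo
{
    // Regular literal: the C# compiler turns \x27a into one character, U+027A.
    public const string Regular = "[^\x27a-zA-Z0-9]";
    // Verbatim literal: the backslash survives, so the regex parser sees \x27.
    public const string Verbatim = @"[^\x27a-zA-Z0-9]";

    // Hypothetical helper: true if the Regex constructor accepts the pattern.
    public static bool Compiles(string pattern)
    {
        try { new Regex(pattern); return true; }
        catch (ArgumentException) { return false; }
    }

    static void Main()
    {
        Console.WriteLine(Regular.Length);                 // 12
        Console.WriteLine(Verbatim.Length);                // 16
        Console.WriteLine(Compiles(Regular));              // False: U+027A-z is a reversed range
        Console.WriteLine(Compiles(Verbatim));             // True
        Console.WriteLine(Regex.IsMatch("'", Verbatim));   // False: the apostrophe is excluded
        Console.WriteLine(Regex.IsMatch("(", Verbatim));   // True: brackets are allowed
    }
}
```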
 
Fair enough - although in that case it's just a bug waiting to occur
when some maintenance programmer decides to remove the @. It's also an
example of why it's a good idea to post complete, runnable code - it
removes sources of ambiguity like this.

I've been meaning to add the revolting \x escape (in C#, not regex) to
my brainteaser list. Something along the lines of:

// Character 9 = tab
Console.WriteLine("You say:\x9Good compiler!");
Console.WriteLine("You say:\x9Bad compiler!");

I wonder how many people would quickly spot that the second line will
actually try to output U+9BAD (a character in the CJK Unified Ideographs
block) followed by " compiler!", with no tab at all.
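
For anyone who wants to check the brainteaser without squinting at console output, here is a small sketch that dumps the UTF-16 code units of a few literals. HexEscapeDemo and CodeUnits are illustrative names only, and the strings are shortened versions of the two lines above.

```csharp
using System;
using System.Linq;

static class HexEscapeDemo
{
    // Hypothetical helper: formats each UTF-16 code unit of s as U+XXXX.
    public static string CodeUnits(string s) =>
        string.Join(" ", s.Select(c => "U+" + ((int)c).ToString("X4")));

    static void Main()
    {
        // 'G' is not a hex digit, so \x9 stops after one digit: a tab.
        Console.WriteLine(CodeUnits("\x9Good"));    // U+0009 U+0047 U+006F U+006F U+0064

        // 'B', 'a' and 'd' are all hex digits, so \x9Bad is one character.
        Console.WriteLine(CodeUnits("\x9Bad"));     // U+9BAD

        // \u always takes exactly four hex digits, so nothing is swallowed.
        Console.WriteLine(CodeUnits("\u0009Bad"));  // U+0009 U+0042 U+0061 U+0064

        // \t sidesteps the problem entirely.
        Console.WriteLine(CodeUnits("\tBad"));      // U+0009 U+0042 U+0061 U+0064
    }
}
```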
 
