C# regex word boundaries

Guest

Hi All,
Being a bit of a newbie with regex, I am confused when using word boundaries.

For instance, I want to replace every standalone '.5k' that occurs in an
input string with 500. In other words:

"this is a .5k example" goes to "this is a 500 example"

The replace should not touch '.5k' that occurs inside a word. For example:

"this 30.5k is not an example" should be unchanged.

So I put together the regex below, thinking that the \b would match word
boundaries and replace only standalone occurrences of '.5k':

Regex r = new Regex(@"\b\.5k\b");
MatchCollection mColl = r.Matches(txtInput.Text);
StringBuilder sb = new StringBuilder(txtInput.Text);
foreach (Match m in mColl)
{
    // ".5k" and "500" are both 3 characters, so the match indices stay valid.
    sb.Remove(m.Index, 3);
    sb.Insert(m.Index, "500");
}
txtResults.Text = sb.ToString();

(I used the StringBuilder Remove method rather than the Regex.Replace
method, since the documentation states that \b matches a backspace when used
in a replace operation.)

If the txtInput.Text is this

.5k is an example, not like this 30.5k but this .5k or this .5k

the txtResults.Text is this

.5k is an example, not like this 30500 but this .5k or this .5k

which is the complete opposite of what I would expect.

Also, if I replace the regex with this @"\B\.5k", the output from the above
code is


500 is an example, not like this 30.5k but this 500 or this 500

But doesn't \B mean that the match must not occur on a word boundary? (Is
not the change from a [space] to [5] a word boundary?)

I am more than willing to believe that the fault is my comprehension of this
stuff, but I am a bit stuck to see where I am going wrong at the moment.

So, any pointers as to where to look would be most appreciated. Many thanks
regards,
Gary
 
Guest

.NET appears to use these symbols backwards. It appears .NET thinks \B is on
a word boundary and \b is not.

Very strange.

Ciaran O'Donnell
 
Guest

Hi Ciaran,

Thanks for the quick reply.
I just tried this regex

Regex r = new Regex(@"\B\.5k\B");

in the code, thinking that \B might then work as the word boundary. But the
input string

.5k is an example, not like this 30.5k but this .5k or this .5k

is now totally untouched. So the above regex does not match anything. (and
in fact if I debug the variable mColl it has a count of zero - showing there
were no matches).

So, I don't quite understand that either. Thanks anyway for the help, but I
am still a bit confused - no change there then :cool:
cheers,
Gary


 
Kevin Spencer

Hi Gary,

Try the following:

(?<!\d)\.\d+k

This uses a negative look-behind. The rules can be explained as:

A match is a dot followed by one or more digits, followed by the letter 'k',
ONLY if it is NOT immediately preceded by a digit (the negative look-behind).
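
For illustration, here is a minimal sketch of how a look-behind like this might be used with Regex.Replace. It is narrowed to the literal '.5k' from the question, because with \d+ every '.<digits>k' would be turned into 500. The class and method names are hypothetical, not from the thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HalfKReplacer
{
    // Hypothetical helper: replaces ".5k" with "500" unless a digit
    // immediately precedes the dot (negative look-behind).
    public static string Replace(string input)
    {
        return Regex.Replace(input, @"(?<!\d)\.5k", "500");
    }

    public static void Main()
    {
        Console.WriteLine(Replace(".5k is an example, not like this 30.5k but this .5k or this .5k"));
        // prints: 500 is an example, not like this 30.5k but this 500 or this 500
    }
}
```

Note that the look-behind only excludes a preceding digit, so 'a.5k' would still become 'a500', and nothing constrains what follows the 'k'.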

--
HTH,

Kevin Spencer
Microsoft MVP
Ministry of Software Development
http://unclechutney.blogspot.com

If you have little, is that your lot?


 
Martin Honnen

Gary said:
Being a bit of a newbie with regex, I am confused when using word boundaries.

A word boundary appears between a word character (\w) and a non-word
character (\W), or between a word character and the start or end of the
string.

Gary said:
For instance, I want to replace all the standalone '.5k' that occur in an
input string, with 500.

'.' is a non-word character, while '5' and 'k' are word characters, so there
is one word boundary in there, between '.' and '5'.
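
To make that definition concrete, the sketch below lists every index at which the zero-width \b anchor matches in a short sample string. The class name, method name, and sample string are hypothetical and chosen only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BoundaryProbe
{
    // Returns every index in the input at which the zero-width \b anchor matches.
    public static List<int> BoundaryPositions(string input)
    {
        var positions = new List<int>();
        foreach (Match m in Regex.Matches(input, @"\b"))
        {
            positions.Add(m.Index);
        }
        return positions;
    }

    public static void Main()
    {
        // Indices: 'a'=0, ' '=1, '.'=2, '5'=3, 'k'=4, ' '=5, 'b'=6
        Console.WriteLine(string.Join(",", BoundaryPositions("a .5k b")));
        // prints: 0,1,3,5,6,7
    }
}
```

Index 2, the position between the space and the '.', is missing from the list because both neighbours are non-word characters. That is why a pattern beginning with \b\. cannot match a '.5k' that follows a space.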
 
Guest

Hi Martin,

Many thanks for the help.

I think I see what you mean : the match does not work because the transition
from a '.' to a space is not a word boundary since they are both \W, (non
word) characters. Therefore my match can not work since '.5k' is never
surrounded by word boundaries.

Just to check that out, I tried the same regex, (@"\b\.5k\b"), on this string:


.5k is an example, not like this 30.5k but this a.5k or this .5k

and sure enough the answer was

.5k is an example, not like this 30500 but this a500 or this .5k

which makes sense now: the only \b word boundaries we are interested in are
between '0' and '.' in '30.5k', and between 'a' and '.' in 'a.5k'.
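
Following on from that, one possible way to express "standalone" without \b is to use whitespace look-arounds. This is only a sketch: it assumes that standalone means delimited by whitespace or by the start or end of the string, which is an interpretation rather than something stated in the thread, and the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StandaloneHalfK
{
    // Assumption: "standalone" means no non-whitespace character directly
    // before the dot or directly after the 'k'.
    public static string ReplaceStandalone(string input)
    {
        return Regex.Replace(input, @"(?<!\S)\.5k(?!\S)", "500");
    }

    public static void Main()
    {
        Console.WriteLine(ReplaceStandalone(".5k is an example, not like this 30.5k but this a.5k or this .5k"));
        // prints: 500 is an example, not like this 30.5k but this a.5k or this 500
    }
}
```

Under this assumption a '.5k' followed directly by punctuation, such as '.5k,', is left alone, so the look-ahead would need loosening if that case matters.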

brilliant - thanks again,
cheers,
Gary.
 
Guest

Hi Kevin,

Brilliant, that seems to work fine, so that gets the problem sorted.

I also got my misunderstanding cleared up, I think - see Martin's answer
in this thread.

Thanks again,
cheers,
Gary.

 
