Splitting a string with Regex and keeping the separator

nagar

I need to split a string whenever a separator string is present (let's
say #Key(val), where val is a variable) and rejoin it in the proper
order after doing some processing.

Is there a way to use the Regex.Split function to split the string
whenever #Key(val) occurs, but keep the #Key(val)
occurrences so that I can reconstruct the final string after doing
certain operations on each token? (I basically need to convert each
string into an array of characters, but I need to do this differently
if the string is a #Key(val).)

Thanks.
Andrea
 
Jesse Houwing

* (e-mail address removed) wrote, On 4-6-2007 23:16:
> I need to split a string whenever a separator string is present (let's
> say #Key(val), where val is a variable) and rejoin it in the proper
> order after doing some processing.
>
> Is there a way to use the Regex.Split function to split the string
> whenever #Key(val) occurs, but keep the #Key(val)
> occurrences so that I can reconstruct the final string after doing
> certain operations on each token? (I basically need to convert each
> string into an array of characters, but I need to do this differently
> if the string is a #Key(val).)
>
> Thanks.
> Andrea

This code splits the input at the start and at the end of every
#Key(val) occurrence:

Regex rx = new Regex(@"(?=#\w+\(\w+\))|(?<=#\w+\(\w+\))", RegexOptions.None);
string[] arr = rx.Split(input);

It matches the zero-width positions at the beginning and at the end of
the pattern you're looking for. It probably isn't the fastest on large
inputs, but I haven't tested that.
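
To make the result of the lookaround split concrete, here is a minimal sketch. The class name SplitDemo, the method SplitKeepingKeys and the sample input are made up for illustration; only the pattern comes from the post above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitDemo
{
    // Splits at the zero-width positions just before and just after each
    // #Key(val), so the separators stay in the returned array.
    public static string[] SplitKeepingKeys(string input)
    {
        Regex rx = new Regex(@"(?=#\w+\(\w+\))|(?<=#\w+\(\w+\))");
        return rx.Split(input);
    }

    public static void Main()
    {
        // Hypothetical sample input with text on both sides of the key.
        foreach (string piece in SplitKeepingKeys("abc#Key(F1)def"))
        {
            Console.WriteLine("[" + piece + "]");
        }
    }
}
```

Regex.Split includes an empty string when a match sits at the very start or end of the input, so an input that begins or ends with a #Key(val) will probably need those empty entries filtered out.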

You might be better off experimenting with a MatchEvaluator and a well
written replace call, but to help you with that I'd need a little more
info on what kind of string manipulation you'd be doing.

Jesse
 
Jesse Houwing

* (e-mail address removed) wrote, On 4-6-2007 23:16:
> I need to split a string whenever a separator string is present (let's
> say #Key(val), where val is a variable) and rejoin it in the proper
> order after doing some processing.
>
> Is there a way to use the Regex.Split function to split the string
> whenever #Key(val) occurs, but keep the #Key(val)
> occurrences so that I can reconstruct the final string after doing
> certain operations on each token? (I basically need to convert each
> string into an array of characters, but I need to do this differently
> if the string is a #Key(val).)
>
> Thanks.
> Andrea

This should work even better:

Regex rx2 = new Regex(
    @"(?<keyval>#\w+\(\w+\))|(?<other>((?!#\w+\(\w+\)).)*)",
    RegexOptions.None);
string result = rx2.Replace(input, new MatchEvaluator(ManipulateString));

private string ManipulateString(Match target)
{
    if (target.Groups["keyval"].Success)
    {
        // The match is a #Key(val) token.
        return ManipulateKeyVal(target.Groups["keyval"].Value);
    }
    else if (target.Groups["other"].Success)
    {
        // The match is a run of ordinary text.
        return ManipulateOther(target.Groups["other"].Value);
    }

    // Neither group matched: leave the text unchanged.
    return target.Value;
}

This passes each piece it finds, in order, to the manipulate function
and assembles the returned values into a new string.
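
As a rough, self-contained illustration of the MatchEvaluator approach: the processing below (brackets around key tokens, upper-casing everything else) is invented purely for the demo, as are the names ReplaceDemo, Evaluate and Process.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReplaceDemo
{
    static readonly Regex Rx = new Regex(
        @"(?<keyval>#\w+\(\w+\))|(?<other>((?!#\w+\(\w+\)).)*)");

    // Hypothetical processing: bracket key tokens, upper-case ordinary text.
    static string Evaluate(Match m)
    {
        if (m.Groups["keyval"].Success)
        {
            return "[" + m.Groups["keyval"].Value + "]";
        }
        return m.Groups["other"].Value.ToUpperInvariant();
    }

    // Regex.Replace calls Evaluate once per match, in input order,
    // and concatenates the returned strings.
    public static string Process(string input)
    {
        return Rx.Replace(input, new MatchEvaluator(Evaluate));
    }

    public static void Main()
    {
        Console.WriteLine(Process("abc#Key(F1)def"));
    }
}
```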

Kind regards,

Jesse
 
Walter Wang [MSFT]

Hi Andrea,

I'm not sure if I fully understand your question. Would you please let us
know if Jesse's reply helps? Thanks.


Regards,
Walter Wang ([email protected], remove 'online.')
Microsoft Online Community Support

==================================================
When responding to posts, please "Reply to Group" via your newsreader so
that others may learn and benefit from your issue.
==================================================

This posting is provided "AS IS" with no warranties, and confers no rights.
 
nagar

Thanks. I'll try that code. Just to explain things better, I don't
need to create an output string. I need to convert the strings to an
array of keycodes to send text. So I just need to have a list of
tokens (and know if they are normal text or #KeyVal) and treat them
differently.

I'll test the code and let you know.
Thanks.
Andrea
 
Jesse Houwing

* (e-mail address removed) wrote, On 5-6-2007 9:00:
> Thanks. I'll try that code. Just to explain things better, I don't
> need to create an output string. I need to convert the strings to an
> array of keycodes to send text. So I just need to have a list of
> tokens (and know if they are normal text or #KeyVal) and treat them
> differently.
>
> I'll test the code and let you know.
> Thanks.
> Andrea


Ok, I think I get where you're going. Try this:


Regex rx2 = new Regex(
    @"(?<keyval>#\w+\(\w+\))|(?<other>((?!#\w+\(\w+\)).)*)",
    RegexOptions.None);
MatchCollection ms = rx2.Matches(input);

foreach (Match m in ms)
{
    if (m.Groups["keyval"].Success)
    {
        // A #Key(val) token.
        ManipulateKeyVal(m.Groups["keyval"].Value);
    }
    else if (m.Groups["other"].Success)
    {
        // Ordinary text between tokens (can be empty at the end of the input).
        ManipulateOther(m.Groups["other"].Value);
    }
}


That should work best. I thought you were creating a new string at first ;)
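
For the "list of tokens" use case, a sketch along these lines might do. The "KEY:" and "TEXT:" prefixes and the names TokenListDemo and Tokenize are just placeholders I picked to tell the two kinds of token apart.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TokenListDemo
{
    static readonly Regex Rx = new Regex(
        @"(?<keyval>#\w+\(\w+\))|(?<other>((?!#\w+\(\w+\)).)*)");

    // Returns the pieces in input order, tagged as key or plain text.
    public static List<string> Tokenize(string input)
    {
        List<string> tokens = new List<string>();
        foreach (Match m in Rx.Matches(input))
        {
            if (m.Length == 0)
            {
                continue; // the "other" branch uses * and can match nothing
            }
            if (m.Groups["keyval"].Success)
            {
                tokens.Add("KEY:" + m.Value);
            }
            else
            {
                tokens.Add("TEXT:" + m.Value);
            }
        }
        return tokens;
    }

    public static void Main()
    {
        foreach (string t in Tokenize("abc#Key(F1)def"))
        {
            Console.WriteLine(t);
        }
    }
}
```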

Jesse
 
nagar

Thanks Jesse. I'll try it and let you know.
I'm fascinated by what you can do with Regex :) Could you briefly
comment on the regex you used? How do the match groups work?
Thanks.
Andrea
 
nagar

It works like a charm. Thanks.
Is there also a regex to get the information inside the round brackets
and tokenize it where there's a space?

Given

#Key(SHIFT F2)

I would like to get
SHIFT
F2

Thanks so much.
Andrea
 
nagar

One more thing Jesse.
I noticed that #Key(val) is not interpreted correctly if I have two
expressions attached to one another.

For example

test#Key(F1) is interpreted correctly
test#Key(F1)#Key(F2) is not

How should I change the expression?
Thanks again.
Andrea
 
Jesse Houwing

* (e-mail address removed) wrote, On 5-6-2007 22:01:
> It works like a charm. Thanks.
> Is there also a regex to get the information inside the round brackets
> and tokenize it where there's a space?
>
> Given
>
> #Key(SHIFT F2)
>
> I would like to get
> SHIFT
> F2
>
> Thanks so much.
> Andrea

That would be possible.

It would look something like:

#Key\(((?<token>(?>\w+))\s*)+\)

This will put all the single tokens in match.Groups["token"].Captures[*]
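
A small sketch of reading those captures, assuming the input is a single #Key(...) expression. CaptureDemo and TokensOf are names made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CaptureDemo
{
    // Returns the space-separated names inside #Key(...), in order.
    public static List<string> TokensOf(string keyExpression)
    {
        List<string> tokens = new List<string>();
        Match m = Regex.Match(keyExpression, @"#Key\(((?<token>(?>\w+))\s*)+\)");
        if (m.Success)
        {
            // A group inside a repeated construct records one Capture
            // per iteration; Groups["token"].Value alone holds only the last.
            foreach (Capture c in m.Groups["token"].Captures)
            {
                tokens.Add(c.Value);
            }
        }
        return tokens;
    }

    public static void Main()
    {
        foreach (string t in TokensOf("#Key(SHIFT F2)"))
        {
            Console.WriteLine(t);
        }
    }
}
```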

I'd have to look into the problem of two adjacent keyval thingies. I'll
get back to you on that.

Jesse
 
nagar

Thanks for your help Jesse.

I can parse the single tokens later.

I just need to be able to detect a #Key(val) such as #Key(SHIFT ALT)
inside the text. I saw that the regex you sent me doesn't work
properly if:

1. The #Key(val) contains a space between the parentheses
2. There are two or more #Key(val) attached to one another
3. Any string preceded by a # sign is present (it gets detected as a key)

Thanks.
Andrea
 
Jesse Houwing

* (e-mail address removed) wrote, On 5-6-2007 22:06:
> One more thing Jesse.
> I noticed that #Key(val) is not interpreted correctly if I have two
> expressions attached to one another.
>
> For example
>
> test#Key(F1) is interpreted correctly
> test#Key(F1)#Key(F2) is not
>
> How should I change the expression?
> Thanks again.
> Andrea

After some careful testing I got the other issues fixed as well. The
regex is quite big already. I'll try to explain what's going on where.

First, the regex:

\G((?<other>((?!#[^\(]+\([^\)]+\)).)+)|(?<keyval>#(?<key>\w+)\(((?<token>(?>\w+))\s*)+\)))

To extract the fields you can use this:

// ms is the MatchCollection returned by Regex.Matches for the regex above.
foreach (Match m in ms)
{
    if (m.Groups["keyval"].Success)
    {
        string key = m.Groups["key"].Value;
        foreach (Capture c in m.Groups["token"].Captures)
        {
            // One capture per token inside the parentheses.
            string token = c.Value;
        }
    }
    else if (m.Groups["other"].Success)
    {
        ManipulateOther(m.Groups["other"].Value);
    }
}

And now for the explanation:

\G -> Make sure every new match is directly adjacent to the previous
one, so we're not skipping invalid input

(?<other>) -> Match the 'other' text into a group named "other"
((?!#[^\(]+\([^\)]+\)).)+ -> Match every character that isn't the start of
a key/val pair. I'm doing this by looking ahead to see whether a key/val
structure starts here, and if it doesn't I add one character to the match (.).

If we're at the end of an "other" section there are two options: either
the end of the string, in which case the regex just stops matching, or
the start of a key/val thingy.

(?<keyval>) -> match the whole key/val structure into a group named "keyval"

#(?<key>\w+)\( -> match the key and put it in a group named "key". The
key comes directly after a "#" and contains only one or more
alphanumeric characters (\w+), followed by "("

(((?<token>(?>\w+))\s*)+) -> Match every token into a group called
"token". If this group captures multiple tokens they're added to the
group's Captures collection in the order in which they're found. A token
is made up of one or more alphanumeric characters (\w+). It can be
followed by zero or more spaces. The (?>...) construction is used to
prevent too much backtracking going on. The whole
token-followed-by-space can exist multiple times. As the final token
will not have a space behind it I used \s*.

\) -> and finally the closing parenthesis.
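
Putting the pieces of the explanation together, a runnable sketch could look like this. The output format ("text:..." and "key:Name[tokens]") and the names AnchoredTokenizer and Describe are only there for the demonstration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AnchoredTokenizer
{
    static readonly Regex Rx = new Regex(
        @"\G((?<other>((?!#[^\(]+\([^\)]+\)).)+)|(?<keyval>#(?<key>\w+)\(((?<token>(?>\w+))\s*)+\)))");

    // Describes each piece of the input in order. Because of \G the
    // matching stops at the first piece that fits neither alternative.
    public static List<string> Describe(string input)
    {
        List<string> result = new List<string>();
        foreach (Match m in Rx.Matches(input))
        {
            if (m.Groups["keyval"].Success)
            {
                List<string> tokens = new List<string>();
                foreach (Capture c in m.Groups["token"].Captures)
                {
                    tokens.Add(c.Value);
                }
                result.Add("key:" + m.Groups["key"].Value
                    + "[" + string.Join(",", tokens.ToArray()) + "]");
            }
            else
            {
                result.Add("text:" + m.Groups["other"].Value);
            }
        }
        return result;
    }

    public static void Main()
    {
        // Hypothetical input with two adjacent key/val pairs.
        foreach (string line in Describe("abc#Key(F1)#Alt(SHIFT F2)def"))
        {
            Console.WriteLine(line);
        }
    }
}
```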

Keep in mind that if you use RegexOptions.IgnorePatternWhitespace,
you can reflow the regex to be easier to read. It's also easier to add
comments that way. In this mode an unescaped # starts a comment that
runs to the end of the line, so the literal # characters have to be
written as \#.

@"
(?# Start of the previous match)
\G
(
    (?#
        Match any character until you find the start of
        a key/val pair.
    )
    (?<other>((?!\#[^\(]+\([^\)]+\)).)+)
    |
    (?#
        Match a key/val pair. Put the key name in a group
        and every token in another.
    )
    (?<keyval>\#(?<key>\w+)\(((?<token>\w+)\s*)+\))
)
";

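A quick way to check a reflowed pattern is to run it with RegexOptions.IgnorePatternWhitespace, as in the sketch below. It uses end-of-line # comments instead of (?# ...) and escapes each literal # as \#; the names VerbosePatternDemo and Pieces are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class VerbosePatternDemo
{
    // With IgnorePatternWhitespace, whitespace in the pattern is ignored
    // and an unescaped # begins a comment, so literal # is written as \#.
    const string Pattern = @"
        \G    # resume where the previous match ended
        (
            # ordinary text: any character that does not start a key/val pair
            (?<other>((?!\#[^\(]+\([^\)]+\)).)+)
          |
            # a key/val pair: key name in one group, every token in another
            (?<keyval>\#(?<key>\w+)\(((?<token>\w+)\s*)+\))
        )";

    static readonly Regex Rx =
        new Regex(Pattern, RegexOptions.IgnorePatternWhitespace);

    // Returns the raw text of every match, in order.
    public static List<string> Pieces(string input)
    {
        List<string> pieces = new List<string>();
        foreach (Match m in Rx.Matches(input))
        {
            pieces.Add(m.Value);
        }
        return pieces;
    }

    public static void Main()
    {
        foreach (string p in Pieces("abc#Key(F1)#Alt(SHIFT F2)def"))
        {
            Console.WriteLine(p);
        }
    }
}
```
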
One alternative to this whole approach, which I haven't tested yet but
which should work nonetheless, is to look only for the special key/val
thingies, using just the right-hand subexpression:

#(?<key>\w+)\(((?<token>(?>\w+))\s*)+\)

Then query the start and end position of each match to determine whether
there were any other characters since the previous match. You can
extract those characters with a substring function. I'm not sure which
option is faster, but I would not be surprised if the substring option
worked even better, though it would take more code.
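
The substring alternative could be sketched roughly as follows. This is an illustration only, not something measured for speed; SubstringTokenizer, Tokenize and the "text:" / "key:" prefixes are assumed names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SubstringTokenizer
{
    // Only the key/val subexpression; plain text is recovered by position.
    static readonly Regex KeyRx = new Regex(
        @"#(?<key>\w+)\(((?<token>(?>\w+))\s*)+\)");

    public static List<string> Tokenize(string input)
    {
        List<string> pieces = new List<string>();
        int last = 0; // index just past the previous key/val match
        foreach (Match m in KeyRx.Matches(input))
        {
            if (m.Index > last)
            {
                // Characters between the previous match and this one.
                pieces.Add("text:" + input.Substring(last, m.Index - last));
            }
            pieces.Add("key:" + m.Value);
            last = m.Index + m.Length;
        }
        if (last < input.Length)
        {
            // Trailing text after the final key/val pair.
            pieces.Add("text:" + input.Substring(last));
        }
        return pieces;
    }

    public static void Main()
    {
        foreach (string p in Tokenize("abc#Key(F1)#Alt(SHIFT F2)def"))
        {
            Console.WriteLine(p);
        }
    }
}
```

Unlike the \G version, this one silently treats malformed key/val text as ordinary text instead of stopping at it.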

Jesse Houwing
 
Jesse Houwing

* (e-mail address removed) wrote, On 6-6-2007 0:11:
> Thanks for your help Jesse.
>
> I can parse the single tokens later.
>
> I just need to be able to detect a #Key(val) such as #Key(SHIFT ALT)
> inside the text. I saw that the regex you sent me doesn't work
> properly if:
>
> 1. The #Key(val) contains a space between the parentheses

Just add \s* right after the opening parenthesis, which should fix this.

> 2. There are two or more #Key(val) attached to one another

Fixed that, see my other long post.

> 3. Any string preceded by a # sign is present (it gets detected as a key)

Ahhh, I had guessed this was something variable. You can of course use
Key instead of \w+ here to fix it.

This is an updated expression (see the explanation in my other mail):

\G((?<other>((?!#Key\([^\)]+\)).)+)|(?<keyval>#Key\(\s*((?<token>\w+)\s*)+\)))
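
To see the effect of the two changes (the literal Key and the \s* after the opening parenthesis), here is a small sketch. The sample input and the names FixedKeyDemo and Describe are invented for the demonstration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FixedKeyDemo
{
    static readonly Regex Rx = new Regex(
        @"\G((?<other>((?!#Key\([^\)]+\)).)+)|(?<keyval>#Key\(\s*((?<token>\w+)\s*)+\)))");

    // "key:" entries list the tokens joined with +, "text:" entries
    // hold everything that is not a #Key(...) expression.
    public static List<string> Describe(string input)
    {
        List<string> result = new List<string>();
        foreach (Match m in Rx.Matches(input))
        {
            if (m.Groups["keyval"].Success)
            {
                List<string> tokens = new List<string>();
                foreach (Capture c in m.Groups["token"].Captures)
                {
                    tokens.Add(c.Value);
                }
                result.Add("key:" + string.Join("+", tokens.ToArray()));
            }
            else
            {
                result.Add("text:" + m.Groups["other"].Value);
            }
        }
        return result;
    }

    public static void Main()
    {
        // #Foo(b) is not a key any more; spaces inside #Key( ... ) are allowed.
        foreach (string line in Describe("a#Foo(b)#Key( SHIFT ALT )c"))
        {
            Console.WriteLine(line);
        }
    }
}
```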

Jesse
 
nagar

Thanks so much for your help Jesse. I could implement it using your
suggestions.

I ended up using the following expression:
\G(?<keyval>#Key\([\w\s]+\))|(?<other>((?!#Key\([\w\s]+\)).)*)

which seems to work fine for me (it detects when two #Key() are placed
one beside the other).
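
For anyone landing here later, this is roughly how that expression can be used to get from a string to a flat list of things to send. It is only a sketch: the "char:" and "key:" strings stand in for whatever keycode type is actually used, and KeySequenceDemo and ToEvents are made-up names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class KeySequenceDemo
{
    static readonly Regex Rx = new Regex(
        @"\G(?<keyval>#Key\([\w\s]+\))|(?<other>((?!#Key\([\w\s]+\)).)*)");

    public static List<string> ToEvents(string input)
    {
        List<string> events = new List<string>();
        foreach (Match m in Rx.Matches(input))
        {
            if (m.Length == 0)
            {
                continue; // the "other" branch can match an empty string
            }
            if (m.Groups["keyval"].Success)
            {
                // Strip the leading "#Key(" (5 characters) and the trailing ")".
                string inner = m.Value.Substring(5, m.Length - 6);
                // A null separator array means "split on whitespace".
                foreach (string name in inner.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
                {
                    events.Add("key:" + name);
                }
            }
            else
            {
                // Ordinary text: one entry per character.
                foreach (char ch in m.Value)
                {
                    events.Add("char:" + ch);
                }
            }
        }
        return events;
    }

    public static void Main()
    {
        foreach (string e in ToEvents("ab#Key(F1)#Key(SHIFT F2)"))
        {
            Console.WriteLine(e);
        }
    }
}
```

Note that in this expression \G belongs only to the first alternative, because | has the lowest precedence; the "other" alternative is not anchored.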

I want also to thank you for the regex explanation. It's always a
difficult topic.
Andrea

 
Jesse Houwing

* (e-mail address removed) wrote, On 8-6-2007 0:22:
> Thanks so much for your help Jesse. I could implement it using your
> suggestions.
>
> I ended up using the following expression:
> \G(?<keyval>#Key\([\w\s]+\))|(?<other>((?!#Key\([\w\s]+\)).)*)
>
> which seems to work fine for me (it detects when two #Key() are placed
> one beside the other).
>
> I want also to thank you for the regex explanation. It's always a
> difficult topic.

You're welcome :)

Jesse
 
Walter Wang [MSFT]

Hi Andrea,

Please feel free to let me know if there's anything I can help with. Thanks.


Regards,
Walter Wang ([email protected], remove 'online.')
Microsoft Online Community Support

 
