Challenge - Regular Expression that divides a string at tokens

Roger Frost

Hi all

I've been messing with this since early yesterday. I thought it might come
to me in my sleep, but no such luck.

Here is the basic problem: I need to split a given string into substrings
of its "token" and "non-token" parts.

For instance, the string "This is {blue}, this is {red}, and this is {green}."

Should result in:

"This is "
"{blue}"
", this is "
"{red}"
", and this is "
"{green}"
"."

Now, I can do this in two parts, separating the tokens from the literals
(the output includes "}" and/or "{" on the literals, but I can deal with
this). What I can't seem to do is combine the two to get the above results,
which is what I need: it lets me rebuild the string in the correct order
easily with minimal code. Never mind that, though; the important part is
that I need to do this with Regular Expressions.

Here is a complete example program:

using System;
using System.Text.RegularExpressions;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Press Enter To Start.");
            Console.ReadLine();

            string mystr = "{id}: The best {item:{category}} of all {items} in {country}{industry} of the world.";

            // Just some validation to make things simpler
            mystr = mystr.Replace("}{", "} {");

            // First attempt: one pattern for tokens, one for literals
            string matchTokens = @"{(.+?)(}?)}";
            string matchLiterals =
                @"^([^{}]+?){|}([^{}]+?){|}([^{}]+?)$|^([^{}]+?)$";

            Regex findTokens = new Regex(matchTokens);
            Regex findLiterals = new Regex(matchLiterals);

            MatchCollection tokens = findTokens.Matches(mystr);
            MatchCollection literals = findLiterals.Matches(mystr);

            // Print the tokens
            foreach (Match m in tokens)
            {
                Console.WriteLine(m.Value);
            }

            Console.WriteLine();

            // Print the literals (these still carry a stray "{" and/or "}")
            foreach (Match m in literals)
            {
                Console.WriteLine(m.Value);
            }

            Console.WriteLine();

            Console.WriteLine("Press Enter To Exit.");
            Console.ReadLine();
        }
    }
}

I've tried the following pattern:

string matchTokens =
@"({(.+?)(}?)})|^([^{}]+?){|}([^{}]+?){|}([^{}]+?)$|^([^{}]+?)$";

It's just a combination of the two, but outputs the same as the matchTokens
pattern in the example.

If any
 
Roger Frost

[posted before completed]

....if anyone can help with this I would really appreciate it.

Thanks,
Roger
 
Reece

Not answering your question as asked, but it seems to me it would be trivial
to do what you want using Mystring.Split(array) where your array has "{" and
"}" in it. Then you could re-concatenate those as appropriate.

Thus you would get an array like:

"This is "
"blue"
", this is "
"red"
", and this is "
"green"
"."
With a little thought you could work out how to make all the odd-numbered
elements of that array (counting from zero) start with "{" and end with "}".
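
A minimal sketch of the Split idea described above, assuming flat (non-nested), balanced tokens. The class and method names (SplitDemo, SplitAndRebrace) are made up for illustration and are not from the thread.

```csharp
using System;

static class SplitDemo
{
    // Hypothetical helper: splits on both brace characters, then puts the
    // braces back on the elements at odd zero-based indexes.
    // Assumption: tokens are balanced and never nested.
    public static string[] SplitAndRebrace(string input)
    {
        string[] parts = input.Split('{', '}');

        // With flat tokens the array alternates literal, token, literal, ...
        for (int i = 1; i < parts.Length; i += 2)
        {
            parts[i] = "{" + parts[i] + "}";
        }

        return parts;
    }

    static void Main()
    {
        string input = "This is {blue}, this is {red}, and this is {green}.";

        foreach (string part in SplitAndRebrace(input))
        {
            Console.WriteLine("\"" + part + "\"");
        }
    }
}
```

Note that a string beginning with a token produces an empty first element, and a nested token such as {item:{category}} breaks the literal/token alternation, so this only covers the simple case.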

Good luck.

Reece
 
Roger Frost

Thanks for the reply Reece.

I've worked out various solutions to do this with methods on the string
class, such as split(), but they all lead to a very ugly series of more
string method calls. This is one reason why I need to use Regular
Expressions specifically.

The input string could begin with a token, and tokens can be nested (see the
example program), or could have zero tokens, or could be only tokens with no
literal text at all.

The end algorithm is recursive object creation, and in between I need to
keep the string manipulation to a bare minimum.

Your solution is a perfect workaround for the information you had, but this
is why I specifically said Regular Expressions. I'm not criticizing you;
in fact, that's good thinking! Do you have any ideas that use pure RegEx?

Thanks and keep up the good work,
Roger
 
Roger Frost

I've got a little closer with:

string matchTokens = @"(({)({?)(.+?)(}?)(}))|([^}][{}]?(.+?)[{}]?[{$])";

Maybe a RegEx guru can help me out.

Input:

string mystr = "{id}: The best {item} of all {items:{item}} in {country:{something}}{industry} of the world.";

Output:

{id}
: The best {
item} of all {
items:{
item}} in {
country:{
something}} {

I'm looking for something more like:

{id}
: The best
{item}
of all
{items:{item}}
in
{country:{something}}

{industry}
of the world.

Thanks,
Roger
 
Roger Frost

Okay, I got it. After finally deciding to use lookaround, I came up with:

string matchTokens = @"(?<=})(.+?)(?={|$)|{({?)(.+?)(}?)(})";

Input:

string mystr = "{id}: The best {item} of all {items:{item}} in {country:{something}}{industry} of the world.";

Output:

{id}
: The best
{item}
of all
{items:{item}}
in
{country:{something}}

{industry}
of the world.
 
J.B. Moreno

Roger said:
I've got a little closer with:

string matchTokens = @"(({)({?)(.+?)(}?)(}))|([^}][{}]?(.+?)[{}]?[{$])";

Maybe a RegEx guru can help me out.

Input:

string mystr = "{id}: The best {item} of all {items:{item}} in {country:{something}}{industry} of the world.";
-snip-

I'm looking for something more like:

{id}
: The best
{item}
of all
{items:{item}}
in
{country:{something}}

{industry}
of the world.

You might want to take a look at:
http://www.m-8.dk/resources/RegEx-Balancing-Group.aspx
 
Roger Frost

J.B. Moreno said:

J.B. Thanks for replying. Earlier I posted my final pattern, but after
that I noticed more problems with the nested tokens, so using Google results
for ".net regex nested brackets" (which includes your link), I was able to
come up with my final revised pattern:

String matchParts =
@"{(?>{(?<bal>)|}(?<-bal>)|.?)*(?(bal)(?!))}|(?<=}|^)(.+?)(?={|$)";

....which satisfies all of the requirements.
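
For anyone wanting to try it, here is a minimal sketch that runs the pattern above, unchanged, through Regex.Matches and collects each match in order. The class and method names (TokenSplitter, SplitParts) are made up for illustration. It assumes single-line input with balanced braces, since "." does not match a newline without RegexOptions.Singleline.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TokenSplitter
{
    // The final pattern from the post above, copied verbatim.
    // First alternative: a "{...}" token, using the .NET balancing group
    // "bal" to keep nested braces paired.
    // Second alternative: literal text that follows "}" or the start of the
    // string and runs up to the next "{" or the end of the string.
    const string MatchParts =
        @"{(?>{(?<bal>)|}(?<-bal>)|.?)*(?(bal)(?!))}|(?<=}|^)(.+?)(?={|$)";

    // Hypothetical helper: returns tokens and literals in input order.
    public static List<string> SplitParts(string input)
    {
        var parts = new List<string>();

        foreach (Match m in Regex.Matches(input, MatchParts))
        {
            parts.Add(m.Value);
        }

        return parts;
    }

    static void Main()
    {
        string input = "This is {blue}, this is {red}, and this is {green}.";

        foreach (string part in SplitParts(input))
        {
            Console.WriteLine("\"" + part + "\"");
        }
    }
}
```

Run against the string from the original question, this should give the seven parts listed there, and a nested token such as {items:{item}} should come back as a single part. The pattern relies on backtracking to find each closing brace, so it may be slow on very long inputs.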

I hope this helps someone else in the future... "regex recursion" is another
good keyword search.
 
