Regex: matching comma separated list?

Bob

I think this is very simple, but I am having difficulty doing it. Basically,
take a comma separated list:
abc, def, ghi, jk

A list with only one token does not have any commas:
abc

The first character of each token (e.g. abc) must not be a digit. I am simply
trying to parse it to get an array of tokens:
abc
def
ghi
jk

...or for the single token one:
abc

I can easily do this with String.Replace and String.Split, but would like to
do this with regular expressions. Yet I cannot seem to get it to work, here
is what I have so far:

String input = "abc, def, ghi, jk";
String pattern = @"^((?<name>\D.*?)(\x2C )?)+?$";
Match match = Regex.Match(input, pattern, RegexOptions.ExplicitCapture);

Any input would be appreciated,

Thanks
 
Kevin Spencer

I don't think Regular Expressions are the right tool for this job, Bob.
Regular Expressions are used to search for patterns, that is, strings which
share certain characteristics but are not identical. In your
case, you want to convert a comma-delimited string into an array, and
String.Split() does just that.

--
HTH,

Kevin Spencer
Microsoft MVP
.Net Developer
A watched clock never boils.
 
Greg Bacon

: I think this is very simple but I am having difficult doing it. Basically
: take a comma separated list:
: abc, def, ghi, jk
:
: A list with only one token does not have any commas:
: abc
:
: The first letter of each token (abc) must not be a number. I am simply
: trying to parse it to get an array of tokens:
: abc
: def
: ghi
: jk
:
: ...or for the single token one:
: abc
:
: I can easily do this with String.Replace and String.Split, but would like to
: do this with regular expressions. Yet I cannot seem to get it to work, here
: is what I have so far:
:
: String input = "abc, def, ghi, jk";
: String pattern = @"^((?<name>\D.*?)(\x2C )?)+?$";
: Match match = Regex.Match(input, pattern, RegexOptions.ExplicitCapture);
:
: Any input would be appreciated,

Consider the following code:

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        string[] inputs = new string[]
        {
            "abc, def, ghi, jk",
            "abc",
            "good, 1bad, good, 2bad",
            "trailingcomma,",
            ",",
            ",,",
            ",,,",
        };

        // Whitespace and # comments inside the pattern are ignored
        // because of RegexOptions.IgnorePatternWhitespace below.
        string pattern =
            @"^(
                (
                  |                 # ignore empties
                  (?<token>\D.*?)   # a token worth keeping
                  |\d.*?            # or one to ignore
                )
                \s*                 # eat trailing whitespace
                (,\s*|$)            # separator or done
              )+$                   # catch a sequence of the above
            ";

        Regex tokens = new Regex(pattern, RegexOptions.IgnorePatternWhitespace);

        foreach (string input in inputs)
        {
            Match m = tokens.Match(input);

            Console.WriteLine("input = [" + input + "]:");
            if (m.Success)
            {
                // Captures holds every capture of the group, not just the last one
                if (m.Groups["token"].Captures.Count > 0)
                    foreach (Capture c in m.Groups["token"].Captures)
                        Console.WriteLine(" - [" + c.Value + "]");
                else
                    Console.WriteLine(" - no captures");
            }
            else
                Console.WriteLine(" - no match.");
        }
    }
}

Its output is

input = [abc, def, ghi, jk]:
- [abc]
- [def]
- [ghi]
- [jk]
input = [abc]:
- [abc]
input = [good, 1bad, good, 2bad]:
- [good]
- [good]
input = [trailingcomma,]:
- [trailingcomma]
input = [,]:
- no captures
input = [,,]:
- no captures
input = [,,,]:
- no captures

It's easy to anticipate Jon Skeet's objections to the regular
expression above, and he'd certainly be on solid ground. Passing the
result of a split through a filter would be much clearer, e.g.,

// Requires: using System; using System.Collections;
//           using System.Text.RegularExpressions;
public static void ExtractGoodTokens(string[] inputs)
{
    // A good token starts with a non-digit character
    Regex goodtoken = new Regex(@"^\D");

    foreach (string input in inputs)
    {
        ArrayList goodtokens = new ArrayList();

        // Split on commas, swallowing the whitespace around them
        foreach (string token in Regex.Split(input, @"\s*,\s*"))
            if (goodtoken.IsMatch(token))
                goodtokens.Add(token);

        Console.WriteLine("input = [" + input + "]:");
        if (goodtokens.Count > 0)
            foreach (string token in goodtokens)
                Console.WriteLine(" - [" + token + "]");
        else
            Console.WriteLine(" - none");
    }
}

Hope this helps,
Greg
 
Kevin Spencer

How about

string[] aryList = strList.Split(new char[] {','});

???

 
Kevin Spencer

Forgot to add, remove the members that start with a number.
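
Put together, the split-then-filter idea might look like the sketch below. The class and method names (SplitFilter, GoodTokens) are made up for illustration, and trimming the whitespace around each piece and dropping empty pieces are assumptions about what is wanted.

```csharp
using System;
using System.Collections.Generic;

public static class SplitFilter
{
    // Hypothetical helper, not from the thread: split on commas, trim each
    // piece, and keep only non-empty tokens whose first character is not a digit.
    public static string[] GoodTokens(string list)
    {
        List<string> kept = new List<string>();
        foreach (string piece in list.Split(','))
        {
            string token = piece.Trim();
            if (token.Length > 0 && !char.IsDigit(token[0]))
                kept.Add(token);
        }
        return kept.ToArray();
    }
}
```

Calling SplitFilter.GoodTokens("good, 1bad, good, 2bad") keeps the two "good" entries and drops the ones that start with a digit.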

 
Marcus Andrén

: I can easily do this with String.Replace and String.Split, but would like to
: do this with regular expressions. Yet I cannot seem to get it to work, here
: is what I have so far:
:
: String input = "abc, def, ghi, jk";
: String pattern = @"^((?<name>\D.*?)(\x2C )?)+?$";

This pattern is far from what you want.

First of all, because the pattern starts with ^ and ends with $, it
will always match either the complete string or nothing at all.

Secondly, a group's Value doesn't hold multiple matches; it only stores
the last capture made within the given match (in .NET the earlier ones
are reachable only through Group.Captures). All ExplicitCapture does
is to make sure that (\x2C ) and the outer parentheses don't count as
groups. The Value of the "name" group will only contain the characters
captured on the last loop.

This leads to the third problem. As the regex is written, each pass
through the loop captures a single character and then simply repeats.
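
To see the second and third points in practice, here is a small sketch that runs the pattern from the question and returns every capture of the "name" group. The class and method names (CaptureDemo, NameCaptures) are invented for this illustration. By my reading of how the lazy quantifiers behave, the sample input should produce eleven one-character captures, so the group's Value ends up as just the last letter.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CaptureDemo
{
    // Hypothetical helper: returns all captures of the "name" group for the
    // pattern from the question, in the order they were made.
    public static string[] NameCaptures(string input)
    {
        string pattern = @"^((?<name>\D.*?)(\x2C )?)+?$";
        Match match = Regex.Match(input, pattern, RegexOptions.ExplicitCapture);

        // Group.Value would give only the last capture; Captures keeps them all.
        Group name = match.Groups["name"];
        string[] result = new string[name.Captures.Count];
        for (int i = 0; i < result.Length; i++)
            result[i] = name.Captures[i].Value;
        return result;
    }
}
```

Printing the result of CaptureDemo.NameCaptures("abc, def, ghi, jk") one entry per line makes the one-character-per-loop behaviour easy to spot.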

This is how it should be done:
(Using RegexOptions.IgnorePatternWhitespace)

string patternSplit =
@"
(?<=,|^) #The character preceding the match is either a comma or
#the beginning of the string

\D.*? #The string itself should be a non-digit followed by
#any number of characters

(?=,|$) #The first character after the match should be , or
#the end of the string
";

This will find all the valid substrings while ignoring those beginning
with a digit.

It will, however, not make a noise if the string contains invalid
entries. For example "12abc,def,ghi" will return "def" and "ghi" as
the two matches while just ignoring 12abc.
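
One thing to watch with input like "abc, def": a space is a non-digit too, so the pattern above treats the space after each comma as the start of the token and returns " def" with its leading space. The sketch below is a variant, not the original pattern. It assumes whitespace next to a comma belongs to the separator, and it relies on .NET allowing variable-length lookbehind. The class and method names (LookaroundTokens, Extract) are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LookaroundTokens
{
    // Assumed variant: a token must be preceded by a comma (or the start of
    // the string) plus optional whitespace, must start with a character that
    // is not a digit, whitespace, or comma, and must be followed by optional
    // whitespace and then a comma or the end of the string.
    private static readonly Regex Token = new Regex(
        @"(?<=(?:,|^)\s*)[^\d\s,].*?(?=\s*(?:,|$))");

    public static string[] Extract(string input)
    {
        List<string> tokens = new List<string>();
        foreach (Match m in Token.Matches(input))
            tokens.Add(m.Value);
        return tokens.ToArray();
    }
}
```

As with the original pattern, entries that start with a digit are skipped silently rather than reported.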

If you need to validate that the string doesn't contain any invalid
entries, you will have to write a separate regular expression that
tries to match the entire string.
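
A minimal sketch of such a whole-string check follows. The pattern and the names (ListValidator, IsValid) are illustrative assumptions, not something from the original post: every entry has to start with a character that is not a digit, whitespace, or comma, and empty entries (including a trailing comma) are rejected.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ListValidator
{
    // Assumed rule: one entry, then zero or more ", entry" repetitions,
    // anchored at both ends so the whole string has to conform.
    private static readonly Regex WholeList = new Regex(
        @"^\s*[^\d\s,][^,]*(,\s*[^\d\s,][^,]*)*$");

    public static bool IsValid(string input)
    {
        return WholeList.IsMatch(input);
    }
}
```

Running this check first and extracting tokens only when it passes keeps the validation rule and the extraction pattern separate.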
 
Greg Bacon

: Forgot to add, remove the members that start with a number.

Isn't that what I did in the second snippet?

Greg
 
