String.Split(), Regex.Split() - empty String

Rico

If the input contains consecutive delimiter characters, String.Split()
and Regex.Split() return an empty string as the token between each pair
of adjacent delimiters. That makes sense in principle, but has anyone
ever found it useful? Can this 'feature' be disabled?

Having used StringTokenizer from the J-language that's not to be named,
I was annoyed by this for hours before I figured out that it was just a
matter of modifying our own Split method to ignore any token whose start
index is the same as the current pointer.

Now... I went through the trouble of using NHibernate to get rid of SQL
strings, only to find myself rolling my own tokenizer. It feels weird.

Rico.
 
Greg Bacon

: If the input contains consecutive delimiter characters, String.Split()
: and Regex.Split() return an empty string as the token between each
: pair of adjacent delimiters. That makes sense in principle, but has
: anyone ever found it useful? Can this 'feature' be disabled?
: [...]

Sure. If you mean a run of multiple separators is semantically
equivalent to a single separator, then say what you mean:

    static void Main(string[] args)
    {
        string input = "one-two--three---four----five";

        foreach (string pattern in new string[] { "-", "-+" })
        {
            Console.WriteLine("Splitting on " + pattern + "...");

            Regex separator = new Regex(pattern);
            foreach (string field in separator.Split(input))
                Console.WriteLine(" - [" + field + "]");
        }
    }

The above program's output is

    Splitting on -...
     - [one]
     - [two]
     - []
     - [three]
     - []
     - []
     - [four]
     - []
     - []
     - []
     - [five]
    Splitting on -+...
     - [one]
     - [two]
     - [three]
     - [four]
     - [five]
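
When the delimiters are plain characters rather than a pattern,
String.Split also has an overload that takes
StringSplitOptions.RemoveEmptyEntries (available since .NET 2.0), which
drops the empty tokens without involving Regex at all. A minimal sketch;
the class name SplitDemo and the helper SplitNonEmpty are made up for
illustration:

```csharp
using System;

static class SplitDemo
{
    // Hypothetical helper: split on '-' and discard every empty token,
    // including the ones a leading or trailing '-' would produce.
    public static string[] SplitNonEmpty(string input)
    {
        return input.Split(new char[] { '-' },
                           StringSplitOptions.RemoveEmptyEntries);
    }

    static void Main()
    {
        foreach (string field in SplitNonEmpty("-one-two--three---four----five-"))
            Console.WriteLine("[" + field + "]");
    }
}
```

This only covers literal delimiter characters (or strings); for a
delimiter described by a pattern, the Regex approach still applies.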

Hope this helps,
Greg
 
Rico

: Sure. If you mean a run of multiple separators is semantically
: equivalent to a single separator, then say what you mean:
:
:     static void Main(string[] args)
:     {
:         string input = "one-two--three---four----five";
:
:         foreach (string pattern in new string[] { "-", "-+" })
:         {
:             Console.WriteLine("Splitting on " + pattern + "...");
:
:             Regex separator = new Regex(pattern);
:             foreach (string field in separator.Split(input))
:                 Console.WriteLine(" - [" + field + "]");
:         }
:     }
:
: The above program's output is
: [...]
: Hope this helps,

It helped. Thanks a lot.

However, how do I get Regex to do what I want when the input is
"-one-two--three---four----five-"?

I don't want the leading and trailing empty strings returned. If the
delimiter is a run of spaces, I can Trim() the input first, but what
about when it's not?

Rico.
 
Greg Bacon

: [...]
: However, how do I get Regex to do what I want when the input is
: "-one-two--three---four----five-"?
:
: I don't want the leading and trailing empty strings returned. If the
: delimiter is a run of spaces, I can Trim() the input first, but what
: about when it's not?

Oh, sorry, my sample code handled a separator, not a delimiter.

This should be more to your liking:

    static void Main(string[] args)
    {
        string input = "-one-two--three---four----five---";

        string delimiter = "-+";
        string pattern = String.Format(
            @"^({0}|{0}(?<field>.+?)(?={0}))*$",
            delimiter);

        Regex delimited = new Regex(pattern);
        Match m = delimited.Match(input);
        if (m.Success)
            foreach (Capture c in m.Groups["field"].Captures)
                Console.WriteLine("[" + c + "]");
        else
            Console.WriteLine("no match");
    }

One area to note is the first alternative that matches only the
delimiter. In cases with multiple trailing delimiters, e.g.,
"...-five---", this subpattern disambiguates by telling the matcher
to treat them as a single delimiter and not two.

You could also use \G:

    static void Main(string[] args)
    {
        string input = "----one-----two--three---four----five---";

        Regex delimited = new Regex(@"(-+|\G)(?<field>.+?)-+");
        Match m = delimited.Match(input);
        while (m.Success)
        {
            Console.WriteLine("[" + m.Groups["field"] + "]");

            m = m.NextMatch();
        }
    }
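
Another option, sketched here as an alternative to the two patterns
above rather than a replacement for them, is to keep the simple "-+"
separator pattern and discard the empty strings that Regex.Split returns
at either end of the input. The class name TokenizerDemo and the helper
Tokenize are hypothetical. One caveat: if the delimiter pattern contains
capturing groups, Regex.Split also returns the captured text, so this
sketch assumes a pattern without them.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class TokenizerDemo
{
    // Hypothetical helper: split on a delimiter pattern and keep only
    // the non-empty tokens. Assumes the pattern has no capturing groups.
    public static List<string> Tokenize(string input, string delimiterPattern)
    {
        List<string> fields = new List<string>();
        foreach (string field in Regex.Split(input, delimiterPattern))
        {
            if (field.Length > 0)
                fields.Add(field);
        }
        return fields;
    }

    static void Main()
    {
        foreach (string field in Tokenize("-one-two--three---four----five-", "-+"))
            Console.WriteLine("[" + field + "]");
    }
}
```

Unlike the anchored patterns, this does not require the input to start
or end with a delimiter.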

Hope this helps,
Greg
 