String.Split needs an enhancement to ignore empty fields

  • Thread starter: cody

cody

If String.Split doesn't fit your needs, you have to create your own split
method, which isn't very complicated. String.Split is designed to meet
the most common application needs.
 
We need an additional function in the String class. We need the ability
to suppress empty fields, so that we can more effectively parse. Right
now, multiple whitespace characters create multiple empty strings in the
resulting string array.
 
cody said:
If String.Split doesn't fit your needs, you have to create your own split
method, which isn't very complicated. String.Split is designed to meet
the most common application needs.

Which is what I have done. But parsing strings of data with multiple
whitespace characters between fields *is* a very common operation. So I
am disagreeing with the part about "meeting the most common application
needs."

Anyway, I just sent it out in case somebody thought "oh, yea, that would
be a good idea."

David Logan
 
Have you considered using regular expressions (REGEX) to split the string? I have used it to accomplish what you describe.

See System.Text.RegularExpressions
 
Yes, I have considered it, but I prefer not to use a very expensive
regex for an otherwise simple split. String.Split is perfect save the
fact that in something like the following, with long runs of spaces
between the fields:
"abc def ghi jkl mnop"

I get an array of 80 elements instead of 5.


I prefer to save regex for parsing strings when:

1) You don't know what you're going to get next
(in a loop of string processing), or
2) There are various optional pieces in a string
that may or may not occur.

In these instances, simple splitting and checking results is already
pretty expensive, so using regex isn't a stretch.

David Logan
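
To make the complaint concrete, here is a minimal sketch of the behavior being described. The class name SplitDemo and the sample string are made up for illustration; the only library call is String.Split(' ').

```csharp
using System;

public static class SplitDemo
{
    // Splits on a single space. Every additional space between two
    // fields produces one empty string in the result.
    public static string[] SplitOnSpace(string data)
    {
        return data.Split(' ');
    }

    public static void Main()
    {
        // Hypothetical input: three spaces between "abc" and "def".
        string[] parts = SplitOnSpace("abc   def");
        Console.WriteLine(parts.Length); // 4: "abc", "", "", "def"
    }
}
```

With dozens of spaces between fields, the same effect would account for an array of 80 elements where only 5 carry data.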
 
David Logan said:
Which is what I have done. But parsing strings of data with multiple
whitespace characters between fields *is* a very common operation. So I
am disagreeing with the part about "meeting the most common application
needs."

Anyway, I just sent it out in case somebody thought "oh, yea, that would
be a good idea."

In that case Regex.Split is your friend.
Use @"\s+" as the separator pattern in your case (IIRC).
 
I have to agree with David on this one. Every time I looked at String.Split
to do simple splitting I gave up on it because of all the extra empty
strings.

Philippe
 
David,
In addition to the other comments.

There are three Split functions in .NET:

Use Microsoft.VisualBasic.Strings.Split if you need to split a string based
on a specific word (string). It is the Split function from VB6.

Use System.String.Split if you need to split a string based on a collection
of specific characters. Each individual character is its own delimiter.

Use System.Text.RegularExpressions.Regex.Split to split based
on matching patterns.

In your example I would use Regex.Split, unless it was proven via profiling
to be a performance problem in the routine you are using (remember the 80-20
rule).

Hope this helps
Jay
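
As a rough illustration of the difference between the second and third options above, the sketch below splits the same string both ways. The class name ThreeSplits, the method names and the sample input are assumptions for the example; the VB Strings.Split variant is left out because it needs a reference to the Microsoft.VisualBasic assembly.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ThreeSplits
{
    // System.String.Split: each character in the array is a delimiter
    // on its own, so adjacent delimiters yield empty strings.
    public static string[] ByChars(string data)
    {
        return data.Split(new char[] { ' ', '\t' });
    }

    // Regex.Split: a run of one or more whitespace characters is
    // treated as a single delimiter.
    public static string[] ByPattern(string data)
    {
        return Regex.Split(data, @"\s+");
    }

    public static void Main()
    {
        string data = "abc \t def"; // hypothetical sample: space, tab, space
        Console.WriteLine(ByChars(data).Length);   // 4: "abc", "", "", "def"
        Console.WriteLine(ByPattern(data).Length); // 2: "abc", "def"
    }
}
```

Note that Regex.Split still returns an empty first or last element when the input starts or ends with whitespace, so trimming the input first may be needed.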
 
I was unaware of the .VisualBasic. namespace routines.

Performance may or may not be a problem depending upon which packets I
would need to parse in this manner. I just try to avoid regex in general
unless I need its flexibility.

What is the "80/20" rule?

David Logan
 
David,

David Logan said:
Performance may or may not be a problem depending upon which packets I
would need to parse in this manner. I just try to avoid regex in general
unless I need its flexibility.

Generally if I am going to be reusing the same Regex, I apply the
RegexOptions.Compiled option and keep the Regex itself in a static member.

David Logan said:
What is the "80/20" rule?

I've heard various variations of it; basically, 80% of the time is spent in
20% of the code.

Basically I write "correct" code first, rather than worrying about how well
it will perform. I only go back and optimize routines once those routines
have proven to be a performance problem. By "correct" I primarily mean OOP,
plus using the tools available, such as Regex, to solve a problem if those
tools fit the requirement. Of course "correct" is subjective.

Hope this helps
Jay
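
The suggestion above, a reused Regex built with RegexOptions.Compiled and held in a static member, could look roughly like this. The class name FieldParser, the field name Whitespace and the pattern are assumptions for the sketch, not code from the thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FieldParser
{
    // Built once and reused for every call. RegexOptions.Compiled trades
    // a higher construction cost for faster matching on repeated use.
    private static readonly Regex Whitespace =
        new Regex(@"\s+", RegexOptions.Compiled);

    public static string[] Fields(string data)
    {
        return Whitespace.Split(data);
    }

    public static void Main()
    {
        // Hypothetical input mixing runs of spaces and a tab.
        foreach (string field in Fields("abc   def \t ghi"))
            Console.WriteLine(field);
    }
}
```

Whether the compiled option pays off depends on how often the expression is reused, so it is worth measuring before relying on it.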
 
I am currently using a homegrown method:

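// Requires: using System.Collections;
protected String[] SplitNoEmpty(String data)
{
    // Collect only the non-empty fields produced by splitting on a space.
    ArrayList fieldarray = new ArrayList();
    foreach (string field in data.Split(' '))
        if (field.Length > 0) fieldarray.Add(field);

    // Copy the collected fields into a string array.
    String[] ret = new String[fieldarray.Count];
    for (int x = 0; x < fieldarray.Count; x++)
        ret[x] = (String)fieldarray[x];
    return ret;
}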

I mentioned it mainly because splitting strings over multiple whitespace
is such a common operation I think it would be worthwhile to consider
implementing in the common libraries.

David Logan
 
David Logan said:
ArrayList fieldarray = new ArrayList();
[...]
String[] ret = new String[fieldarray.Count];
for(int x=0;x<fieldarray.Count;x++)
ret[x]=(String)fieldarray[x];
return ret;

FYI, there's a more concise way of doing that:

return (string[]) fieldarray.ToArray(typeof(string));

David Logan said:
splitting strings over multiple whitespace is such
a common operation I think it would be worthwhile
to consider implementing in the common libraries.

I agree. An extra bool parameter to String.Split, indicating whether
to omit zero-length strings from the resulting array, wouldn't hurt.

P.
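
For readers on .NET Framework 2.0 or later: the library exposes this as a StringSplitOptions argument rather than a bool. The sketch below assumes that overload is available; the class name SplitNoEmptyDemo and the sample string are made up for illustration.

```csharp
using System;

public static class SplitNoEmptyDemo
{
    public static string[] SplitNoEmpty(string data)
    {
        // RemoveEmptyEntries drops the zero-length strings that runs of
        // spaces would otherwise produce.
        return data.Split(new char[] { ' ' },
                          StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        // Hypothetical input: three spaces, then one space.
        string[] fields = SplitNoEmpty("abc   def ghi");
        Console.WriteLine(fields.Length); // 3: "abc", "def", "ghi"
    }
}
```

This gives the same result as the homegrown SplitNoEmpty method without the intermediate ArrayList.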
 
Hi,

David Logan said:
We need an additional function in the String class. We need the ability
to suppress empty fields, so that we can more effectively parse. Right
now, multiple whitespace characters create multiple empty strings in the
resulting string array.

string [] fields = Regex.Split (strInput, "\\s+");

Why bother writing it yourself if it can be done as easily as this? There is
nothing wrong with regex.

I don't like the argument that it shouldn't be used in simple cases; for one
thing, you shouldn't be concerned about writing an inefficient pattern.



HTH
greetings
 
I completely agree that, for now, Regex is the best solution for most of us.

I wrote a test that split David's string 10,000 times. The String.Split method took 0.143 seconds while Regex took 1.104 seconds. Regex is almost an order of magnitude slower (roughly 8x); however, it is still a good solution.

Unless your application performance constraints are very strict, I would use Regex.


 
If you had used a compiled Regex instead of calling the static Regex.Split()
every time, which compiles the regex again on each call, I suspect that
Regex.Split() would have been even faster than String.Split().

Bill O'Neill said:
I completely agree that, for now, Regex is the best solution for most of us.

I wrote a test that split David's string 10,000 times. The string.split
method took 0.143 seconds while Regex took 1.104 seconds. Regex is almost an
order of magnitude slower; however, it is a good solution.
Unless your application performance constraints are very strict, I would use Regex.


 
That *is* using a compiled Regex instance, and not the static Split method.

 
Bill said:
I completely agree that, for now, Regex is the best solution for most of us.

I wrote a test that split David's string 10,000 times. The string.split method took 0.143 seconds while Regex took 1.104 seconds. Regex is almost an order of magnitude slower; however, it is a good solution.

That's exactly why I reserve regex for cases where it's really useful.

Bill said:
Unless your application performance constraints are very strict, I would use Regex.

Why use a very inefficient method when there is a perfectly good and
efficient one? And it *could* be supported by the library. :)

David Logan
 
David,

David Logan said:
That's exactly why I reserve regex to cases where it's really useful.

Yes, the Regex took almost 10 times longer. However, what happens when the
Regex is only 1% or even .01% of the total cost of your routine? Is it
really worth worrying about?

By routine I mean what you do with the array after splitting it, for example
placing the values into a DataTable. If the cost of using the DataTable is
significantly more than the cost of the Regex, is it really worth worrying
about avoiding the Regex?

My concern with coding around it is how much memory pressure (work for the
GC) you are creating to avoid the time spent in the Regex. Are you simply
robbing Peter to pay Paul?

Which is why I would not avoid the Regex simply because Regex is slow. I
would use the Regex because it is quicker to code and it's a good fit for
this problem. Once the Regex was proven to be too high a cost of the
routine, via profiling (the CLR Profiler, for example), I would take the
extra time to code a quicker solution...

Granted, if we get the String.Split ignore-empties option in Whidbey, that
option would be the better fit in Whidbey...

For info on the 80/20 rule & optimizing only the 20% see Martin Fowler's
article "Yet Another Optimization Article" at
http://martinfowler.com/ieeeSoftware/yetOptimization.pdf

For a list of Martin's articles see:

http://martinfowler.com/articles.html

Info on the CLR Profiler:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnpag/html/scalenethowto13.asp

http://msdn.microsoft.com/library/d...y/en-us/dndotnet/html/highperfmanagedapps.asp


Hope this helps
Jay



 
David,
Looking at this closer the expression you use makes a huge difference!

For example, BMermuys's statement:

string [] fields = Regex.Split (strInput, "\\s+");

is slower than

string [] fields = Regex.Split (strInput, " +");


Method           Time (seconds)
String.Split     0.105655048337149
SplitNoEmpty     0.168633723001108
regex(" +")      0.286259287144036
regex("\\s+")    0.713445703294692

If you know your string only has a space as a delimiter, then the Regex time
is only about 2x the SplitNoEmpty routine (0.286 vs. 0.169 seconds). However,
if you can have any white space character (\s is shorthand for
[\f\n\r\t\v\x85\p{Z}]) as a delimiter, then the Regex time is about 7x that
of String.Split, or about 4x the SplitNoEmpty routine...

Times are in seconds based on QueryPerformanceCounter &
QueryPerformanceFrequency, using a loop of 10,000 iterations. I compiled the
Regex outside the loop.

Hope this helps
Jay
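
A comparison along these lines could be sketched as follows. This is not the test code from the thread: System.Diagnostics.Stopwatch stands in for the QueryPerformanceCounter calls, RegexOptions.Compiled is an assumption about what "compiled outside the loop" meant, and the class name SplitTiming and the input string are made up. Absolute numbers will vary by machine and runtime.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class SplitTiming
{
    private const int Iterations = 10000;

    // Runs the given split Iterations times and returns elapsed seconds.
    public static double Time(Func<string, string[]> split, string data)
    {
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < Iterations; i++)
            split(data);
        watch.Stop();
        return watch.Elapsed.TotalSeconds;
    }

    public static void Main()
    {
        // Hypothetical input with runs of spaces between five fields.
        string data = "abc     def     ghi     jkl     mnop";

        // Both expressions are constructed once, outside the timed loop.
        Regex spaces = new Regex(" +", RegexOptions.Compiled);
        Regex whitespace = new Regex(@"\s+", RegexOptions.Compiled);

        Console.WriteLine("String.Split: " + Time(s => s.Split(' '), data));
        Console.WriteLine("regex( +): " + Time(s => spaces.Split(s), data));
        Console.WriteLine(@"regex(\s+): " + Time(s => whitespace.Split(s), data));
    }
}
```

Running each measurement more than once and discarding the first pass helps keep JIT and regex compilation time out of the comparison.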
