Multiple blanks

tshad · Apr 21, 2009

Is there an easy way to take multiple blanks out of a string?

I have a string that a user enters, but he may put multiple blanks in it, and I
would like to change 2 or 3 or 4... blanks to 1.

Thanks,

Tom

Pavel Minaev · Apr 21, 2009

Is there an easy way to take multiple blanks out of a string?

I have a string that a user puts in but he may put multiple blanks in and I
would like to change 2 or 3 or 4... blanks to 1.

1) You can use Regex.Replace:

Regex.Replace(s, "\x20+", "\x20");

This is likely the fastest method if you precompile the regex.

2) You can use String.Split with
StringSplitOptions.RemoveEmptyEntries, and then String.Join the
result.

3) You can use Mark's suggestion.
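
Option 2 might look something like the sketch below. The class and method names are made up for illustration, and it assumes only the space character counts as a blank.

```csharp
using System;

static class BlankCollapser
{
    // Hypothetical helper: split on the space character, drop the empty
    // entries that consecutive spaces produce, then join the remaining
    // pieces with a single space.
    public static string CollapseSpaces(string s)
    {
        string[] parts = s.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", parts);
    }
}
```

Note that, unlike the regex in option 1, this also drops leading and trailing spaces, because they only produce empty entries.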

Jesse Houwing · Apr 21, 2009

Hello Pavel,

1) You can use Regex.Replace:

Regex.Replace(s, "\x20+", "\x20");

That handles only the space character... there are a lot of blanks a user might
be able to enter (especially when copy-pasting)...

I'd suggest

private static readonly Regex rx = new Regex(@"\s+", RegexOptions.Compiled);

and then use this in your method:

string result = rx.Replace(userInput, " ");

This will find every run of one or more whitespace characters and replace it
with a single space.
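
To make the difference between the two patterns concrete, here is a small sketch that puts them side by side. The class and method names are illustrative only.

```csharp
using System.Text.RegularExpressions;

static class WhitespaceRegexDemo
{
    // Matches runs of the space character (U+0020) only.
    private static readonly Regex SpacesOnly = new Regex(@"\x20+", RegexOptions.Compiled);

    // Matches runs of any whitespace: spaces, tabs, line breaks, etc.
    private static readonly Regex AnyWhitespace = new Regex(@"\s+", RegexOptions.Compiled);

    public static string CollapseSpaces(string s)
    {
        return SpacesOnly.Replace(s, " ");
    }

    public static string CollapseWhitespace(string s)
    {
        return AnyWhitespace.Replace(s, " ");
    }
}
```

For the input "a \t b", CollapseSpaces leaves the string unchanged (each space is already a run of one), while CollapseWhitespace returns "a b". One thing to keep in mind is that \s+ also turns a single tab or line break into a space, which may or may not be what you want.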

Jesse

This is likely the fastest method if you precompile the regex.

2) You can use String.Split with
StringSplitOptions.RemoveEmptyEntries, and then String.Join the
result.

3) You can use Mark's suggestion.

But change it a bit. Contains will, by default, start looking from the front
of the string each time. This is OK for small strings, but when going through
a large input (I usually test such things by copy-pasting the whole contents
of "The Lord of the Rings" into a textbox), you're in trouble.

You already know where the last occurrence of two spaces was found, so you
can resume the search from there:

string tmp = userInput;
int i = 0;
// Find the next pair of spaces, starting the search at i,
// and set i to the position found.
while ((i = tmp.IndexOf("  ", i, StringComparison.Ordinal)) != -1)
{
    tmp = tmp.Remove(i, 1); // remove the first space of the pair
}
string result = tmp;

This will only pass over the whole string once.

It's probably even faster to load the string into a StringBuilder
and work from there:

StringBuilder tmp = new StringBuilder(userInput);
for (int i = 0; i < tmp.Length - 1; i++)
{
    // Check for a pair of spaces at the current position.
    if (tmp[i] == ' ' && tmp[i + 1] == ' ')
    {
        tmp.Remove(i, 1); // remove the first space of the pair
        i--;              // stay at the current position on the next iteration
    }
}
string result = tmp.ToString();

My guess is that the last option will be the fastest. The problem with these
string and StringBuilder options is that you'll need to make multiple passes
if you also want to collapse multiple tabs or other whitespace characters.
If you want that, look at the IndexOfAny function for the string-based solution,
extend the if statement to go through multiple comparisons for the StringBuilder
option, or just use the regex above, which looks more and more maintainable
compared to the other options.

Pavel Minaev · Apr 21, 2009

But change it a bit. Contains will, by default, start looking from the front
of the string each time. This is OK for small strings, but when going through
a large input (I usually test such things by copy-pasting the whole contents
of "The Lord of the Rings" into a textbox), you're in trouble.

You already know where the last occurrence of two spaces was found, so you
can resume the search from there:

string tmp = userInput;
int i = 0;
// Find the next pair of spaces, starting the search at i,
// and set i to the position found.
while ((i = tmp.IndexOf("  ", i, StringComparison.Ordinal)) != -1)
{
    tmp = tmp.Remove(i, 1); // remove the first space of the pair
}
string result = tmp;

This will only pass over the whole string once.

Yes, but it will cause many more copies of the string to be created: one
for every removed space, while Mark's version makes one per pass.
If a string has e.g. 20 occurrences of 2-space blanks, and no
blanks larger than that, then Mark's version will
only create one new copy of the string (and do 2 scans), but yours
will create 20. Those copies are going to be much more expensive than
the scans.

It's probably even faster to load the string into a StringBuilder
and work from there:

StringBuilder tmp = new StringBuilder(userInput);
for (int i = 0; i < tmp.Length - 1; i++)
{
    // Check for a pair of spaces at the current position.
    if (tmp[i] == ' ' && tmp[i + 1] == ' ')
    {
        tmp.Remove(i, 1); // remove the first space of the pair
        i--;              // stay at the current position on the next iteration
    }
}
string result = tmp.ToString();

This is better as it avoids the copies, but Remove(i, 1) is still an O(N)
operation (because it has to copy-shift all elements in the array
following the one being removed), so the whole algorithm is O(N^2).

A much better way to do it is to scan the input string and _build_ the
StringBuilder char-by-char. E.g.:

string input = ...;
StringBuilder result = new StringBuilder(input.Length);
// If trimming blanks at the start is desired, initialize prev to '\x20'.
char prev = '\0';
foreach (var ch in input) {
    // Perform any additional whitespace checks as needed.
    if (ch == '\x20' && prev == '\x20') continue;
    result.Append(ch);
    prev = ch;
}

However, I don't see any advantages of this version against the
compiled regex, performance-wise, and the latter is clearly shorter
(and, perhaps more importantly, it just says what to do, and not how
to do that).
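
For what it's worth, the "additional whitespace checks" could be folded into the same single pass roughly like this. It is only a sketch: the names are invented, and it assumes that any character for which char.IsWhiteSpace returns true should count as a blank, which is close to, but not guaranteed to be identical to, what \s matches in a regex.

```csharp
using System.Text;

static class SinglePassCollapser
{
    // Hypothetical helper: copies the input char by char and emits one
    // space for each run of whitespace characters.
    public static string CollapseWhitespace(string input)
    {
        var result = new StringBuilder(input.Length);
        bool prevWasWhitespace = false;
        foreach (char ch in input)
        {
            bool isWhitespace = char.IsWhiteSpace(ch);
            if (isWhitespace)
            {
                // Only the first character of a whitespace run produces output.
                if (!prevWasWhitespace) result.Append(' ');
            }
            else
            {
                result.Append(ch);
            }
            prevWasWhitespace = isWhitespace;
        }
        return result.ToString();
    }
}
```

Each input character is visited once and appended at most once, so nothing is shifted or copied along the way.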

Jesse Houwing · Apr 21, 2009

Hello Pavel,

This is better as it avoids the copies, but Remove(i) is still an O(N)
operation (because it has to copy-shift all elements in the array
following the one being removed), so the whole algorithm is O(N^2).

I always thought stringbuilder used a kind of partition table like structure
to mark which deleted parts to ignore when building the string, but I never
looked at the implementation, so you're probably right.

A much better way to do it is to scan the input string and _build_ the
StringBuilder char-by-char.

Agreed. I should have thought of that.

However, I don't see any advantages of this version against the
compiled regex, performance-wise, and the latter is clearly shorter
(and, perhaps more importantly, it just says what to do, and not how
to do that).

Well, the regex tends to be slower, though easier to read and maintain. I
have run performance tests against regex in many examples, and usually
the string manipulation code wins, but the margins are usually so small that
I would ignore them. I tend to like regex.

Pavel Minaev · Apr 22, 2009

I always thought stringbuilder used a kind of partition table like structure
to mark which deleted parts to ignore when building the string, but I never
looked at the implementation, so you're probably right.

I haven't looked at the implementation either, but I'm pretty sure that
StringBuilder works with an array of chars internally, in the same way
List<T> does.

In fact, if I remember correctly, it actually encapsulates an instance
of String directly, and mutates that (by arcane means). So it can
avoid copying the generated string data if you only call ToString()
once, and then discard the builder (so it can just hand you over the
string instance that it was mutating).

Well the regex tends to be slower, though easier to read and maintain. I
have tried performance tests against regex in many examples, and usually
the string manipulation code wins

I'm pretty sure it depends on what kind of regex. Some patterns are
inherently slower than a hand-crafted mess of IndexOf and Substring; but
for a forward-only scanner, it's probably the same with
precompilation.

Of course, as usual, it takes a profiler to find out...
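
Short of a real profiler, a rough Stopwatch comparison along these lines can give a first impression. This is only a sketch with invented names, and the numbers it produces will depend heavily on the input, the runtime version and the machine.

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

static class CollapseTiming
{
    private static readonly Regex Spaces = new Regex(@"\x20+", RegexOptions.Compiled);

    // Candidate 1: precompiled regex.
    public static string WithRegex(string s)
    {
        return Spaces.Replace(s, " ");
    }

    // Candidate 2: build the result char by char.
    public static string WithBuilder(string s)
    {
        var sb = new StringBuilder(s.Length);
        char prev = '\0';
        foreach (char ch in s)
        {
            if (ch == ' ' && prev == ' ') continue;
            sb.Append(ch);
            prev = ch;
        }
        return sb.ToString();
    }

    // Returns the elapsed milliseconds for `iterations` calls of f on input.
    public static long Time(Func<string, string> f, string input, int iterations)
    {
        f(input); // warm-up call, so JIT and regex compilation are not measured
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            f(input);
        }
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }
}
```

You would call it as CollapseTiming.Time(CollapseTiming.WithRegex, text, 1000) and the same with WithBuilder, ideally on input that resembles what your users actually enter.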

tshad · Apr 23, 2009

Mark Rae said:
string strStart = "This  is   a    string with     multiple blanks";
string strEnd = strStart;
while (strEnd.Contains("  "))
{
    strEnd = strEnd.Replace("  ", " ");
}

There might be a more efficient method involving RegEx...

There was.

I tried this, which works pretty well.

City = Regex.Replace("Hacienda    Heights", @"\s+", " ");

I wanted to get away from using some type of loop to take the blanks out.

Thanks,

Tom
