remove double space from string

mp

how to replace multiple blank spaces with a single space
and how to do that to multiple strings in a list of strings
(can't modify the iterator in a foreach loop)

I want to make all white space in a string one space char only.
So if there are two (or more) contiguous spaces, I want to remove all but one
of them.

This works but is probably not the best way to do it:
public void TestStringReplace()
{
    List<string> oldList = new List<string>();
    List<string> newList = new List<string>();

    oldList.Add("this  has   some    white     space");
    oldList.Add("this   also has   some  white   space");

    Debug.Print("old list");
    foreach (String s in oldList)
    {
        Debug.Print(s);
    }
    int lastLength = 0;
    int thisLength = 0;

    // i can't modify s directly since it's the enumerator
    foreach (String s in oldList)
    {
        StringBuilder sb = new StringBuilder();
        sb.Append(s);
        do
        {
            lastLength = sb.Length;
            sb.Replace("  ", " "); // two spaces become one
            thisLength = sb.Length;
        } while (thisLength != lastLength);

        newList.Add(sb.ToString());
    }

    oldList = newList;
    Debug.Print("modified list");
    foreach (String s in oldList)
    {
        Debug.Print(s);
    }
}

i'm sure there's a better way to do that but that's what i came up with so
far
mark
 
Boreas

how to replace multiple blank spaces with a single space
and how to do that to multiple strings in a list of strings

Run through the string character by character, skipping a space if the
previous character is also a space, and append everything else to the result.
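
A minimal sketch of that single-pass idea, assuming only the plain space character counts as white space. The names SpaceCollapser and CollapseSpaces are made up for illustration:

```csharp
using System;
using System.Text;

public static class SpaceCollapser
{
    // Copies s into a new string, skipping any space that directly
    // follows another space (hypothetical helper, not from the thread).
    public static string CollapseSpaces(string s)
    {
        StringBuilder result = new StringBuilder(s.Length);
        bool previousWasSpace = false;
        foreach (char c in s)
        {
            bool isSpace = (c == ' ');
            if (!isSpace || !previousWasSpace)
            {
                result.Append(c);
            }
            previousWasSpace = isSpace;
        }
        return result.ToString();
    }
}
```

Tracking the previous character in a flag also means an empty input string needs no special case.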
 
mp

mp said:
how to replace multiple blank spaces with a single space

well after 6 hrs this showed up (after i re-posted under same subject line a
few hrs ago)

message properties:
Date: Wed, 1 Dec 2010 14:25:00 -0600
Injection-Date: Wed, 1 Dec 2010 20:25:01 +0000 (UTC)

so consider this thread extraneous
mark
 
mp

Boreas said:
Run through the string character by character, skipping a space if the
previous character is also a space, and append everything else to the result.

yes, that could work too. do you think it is more efficient than
using StringBuilder.Replace()?
thanks for the response.
mark
 
Arne Vajhøj

how to replace multiple blank spaces with a single space
and how to do that to multiple strings in a list of strings
(can't modify the iterator in a foreach loop)

The shortest code is probably the regex solution:

foreach(String s in oldList)
{
newList.Add(Regex.Replace(s, "(?<= ) ", ""));
}
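
To spell out what that pattern does, as a small sketch: "(?<= ) " matches a space only when the character before it is also a space, so every space after the first one in a run is deleted. The class name LookbehindDemo is hypothetical:

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookbehindDemo
{
    public static string Collapse(string s)
    {
        // The lookbehind (?<= ) checks the preceding character without
        // consuming it; only the second and later spaces of a run match.
        return Regex.Replace(s, "(?<= ) ", "");
    }
}
```

With this, Collapse("a   b") gives "a b", and a string with no doubled spaces comes back unchanged.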

Arne
 
Boreas

mp said:
yes, that could work too. do you think it is more efficient than
using StringBuilder.Replace()?
thanks for the response.
mark

As I understood it, the runs of blank spaces differ in length (*).
With Replace, you'd have to keep running through
the string until no more double spaces are found.
If my assumption (*) is correct, that would be inefficient.
 
Arne Vajhøj

As I understood it, the runs of blank spaces differ in length (*).
With Replace, you'd have to keep running through
the string until no more double spaces are found.
If my assumption (*) is correct, that would be inefficient.

One pass solutions are possible.

Whether a multi pass solution is really a problem we
don't know.

The odds are against it.

Arne
 
mp

Arne Vajhøj said:
One pass solutions are possible.

Whether a multi pass solution is really a problem we
don't know.

The odds are against it.

Arne

surprisingly (unless i'm doing something wrong in my test)
the regex solution takes twice as long as my loop solution
Time for function :
RegexVersion
293 milliseconds
Time for function :
LoopVersion
111 milliseconds

public void CompareTimingMethods()
{
    int NumTests = 10000;
    List<string> oldList = new List<string>();
    oldList.Add("this  has   some    white     space");
    oldList.Add("this   also has   some  white   space");
    List<string> newList2 = null;
    List<string> newList = null;
    Profile.StartTiming("RegexVersion");
    for (int idx = 0; idx < NumTests; idx++)
    {
        newList2 = RemoveExcessWhitespaceFromListOfStrings2(oldList);
    }
    Profile.StopTiming("RegexVersion");

    Profile.StartTiming("LoopVersion");
    for (int idx = 0; idx < NumTests; idx++)
    {
        newList = RemoveExcessWhitespaceFromListOfStrings(oldList);
    }
    Profile.StopTiming("LoopVersion");

    Profile.Report();
}

the respective functions above call these two routines for each element in
oldList

regex version:
private string RemoveExcessWhitespace2(string inputString)
{
    string rep = @"\s+";
    return Regex.Replace(inputString, rep, " ");
}

loop version:
private string RemoveExcessWhitespace(string inputString)
{
    StringBuilder sb = new StringBuilder();
    sb.Append(inputString);
    int lastLength = 0;
    int thisLength = 0;
    sb.Replace('\t', ' ');
    do
    {
        lastLength = sb.Length;
        sb.Replace("  ", " "); // two spaces become one
        thisLength = sb.Length;
    } while (thisLength != lastLength);
    sb.Replace(") ", ")");
    sb.Replace(" (", "(");

    return sb.ToString();
}

the regex version is probably not doing:
sb.Replace(") ", ")");
sb.Replace(" (", "(");

but that's a simple matter to add, what i'm really timing is if the loop
version is more costly...it doesn't seem to be
maybe if i add more spaces in the timing comparisons would change...i'll try
that too
mark
 
mp

mp said:
One pass solutions are possible.

Whether a multi pass solution is really a problem we
don't know.

The odds are against it.

Arne

surprisingly (unless i'm doing something wrong in my test)
the regex solution takes twice as long as my loop solution
Time for function :
RegexVersion
293 milliseconds
Time for function :
LoopVersion
111 milliseconds
[...]


but that's a simple matter to add, what i'm really timing is if the loop
version is more costly...it doesn't seem to be
maybe if i add more spaces in the timing comparisons would change...i'll
try that too
mark

yep that favors the regex version...the more spaces the better regex
compares to loop
by adding this line to the test string list
oldList.Add("this has a lot of space ".PadLeft(1000)+ " inside it ");
(it looks funny but i couldn't find an equivalent of vb6 "Space(1000)")
(i also modified the regex version with a replace for the ") " and " ("
options)
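
For reference, one possible equivalent of VB6 Space(1000) is the string constructor that takes a character and a repeat count. A small sketch, with the names SpaceDemo and Spaces invented for illustration:

```csharp
using System;

public static class SpaceDemo
{
    // Returns a string made of count spaces, similar in spirit to VB6 Space(count).
    public static string Spaces(int count)
    {
        return new string(' ', count);
    }
}
```

Note that PadLeft(1000) pads the string up to a total width of 1000 characters, so it adds 1000 minus the original length in spaces rather than exactly 1000.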

private string RemoveExcessWhitespace2(string inputString)
{
    string rep = @"\s+";

    // another regex option, from
    // http://www.codeproject.com/Messages/1721335/How-to-replace-multiple-spaces-in-a-string-to-sing.aspx
    // str2 = Regex.Replace(str2, @"^\s*(.*?)\s*$", "$1");

    return Regex.Replace(inputString, rep, " ").Replace(") ", ")").Replace(" (", "(");
}

i now get these times

Time for function :
RegexVersion
305 milliseconds
Time for function :
LoopVersion
745 milliseconds
 
mp

Peter Duniho said:
[...]
i now get these times

Time for function :
RegexVersion
305 milliseconds
Time for function :
LoopVersion
745 milliseconds

You can also improve regex speed when performing the same thing over and
over by just creating a single (static) regex instance that you reuse.
When creating the instance, providing the RegexOptions.Compiled option
will help as well.

Pete

i was surprised to see i didn't need to create an instance to use regex
return Regex.Replace(inputString, rep, " ").Replace(") ", ")").Replace(" (",
"(").Trim();

it seems to act as if it were a static global object, i don't understand how
that's working but i saw in examples and used it that way

so you're suggesting

Private Static Regex rgx = new Regex();

or something like that?

hmmm that's wrong syntax it seems, I'll keep trying...

thanks

mark
 
mp

Peter Duniho said:
Inasmuch as there's anything like "global" in C#, all statics are
"global". That is, they exist at any time, from any code that meets
whatever accessibility restrictions exist, if any (i.e. obviously you
can't access a private static member from outside the class where that
member is declared).

The Regex class has static members to allow you to do basic regex
operations without creating an explicit instance of the Regex class. But
for scenarios where you are likely to want to perform the same operation
repeatedly, and especially frequently, it is useful to create a single new
instance of Regex and use the instance members for the work.


"private static Regex rgx = new Regex();"

C# is case-sensitive, and all keywords are all-lowercase.

But yes, that's basically what I mean.

Pete

some day i'll get that through my thick head that case sensitivity rules :)
of course then they try to confuse me because there's String and also string
and things like that....
doesn't take much to throw me off <g>
mark
 
mp

Peter Duniho said:
On 12/3/10 7:16 PM, mp wrote: [...]
I'm sure you'll get used to it. :)

Pete

i'm trying to implement your suggestion of a static regex
in class:
namespace LispViewer
{
class cLispFile
{
private static Regex _rgx;

then in the timing test method(in cLispFile) i tried
public void CompareTimingMethods()
{
_rgx = new Regex();

i get the error:
Error 1 'System.Text.RegularExpressions.Regex.Regex()' is inaccessible due
to its protection level

i thought with private any method inside the class would have access to it
changing to public(which i wouldn't want in this case) doesn't help either
so there's something else i'm missing
thanks
mark
 
mp

Peter Duniho said:
[...]
then in the timing test method(in cLispFile) i tried
public void CompareTimingMethods()
{
_rgx = new Regex();

i get the error:
Error 1 'System.Text.RegularExpressions.Regex.Regex()' is inaccessible due
to its protection level

i thought with private any method inside the class would have access to it
changing to public(which i wouldn't want in this case) doesn't help either
so there's something else i'm missing

Yes, there is. Look at the exact text of the error message again. Look
at what it's telling you is inaccessible.

It's not your field "_rgx". It's
"System.Text.RegularExpressions.Regex.Regex()", which is the parameterless
constructor for the Regex class itself. As you can see from the
documentation (http://msdn.microsoft.com/en-us/library/594w4665.aspx),
that constructor is protected, and so is accessible only to the Regex
class itself or a sub-class.

You should be initializing the instance with the actual pattern you plan
to reuse. And assuming you also want to have the regex compiled, you'll
want to pass that option to the constructor as well. That means you want
this one instead: http://msdn.microsoft.com/en-us/library/h5845fdz.aspx

Finally, note that initializing it anew each time you use it isn't going
to help. You need to initialize the field directly, or in the static
constructor for the "cLispFile" class, or at the very least use a "lazy"
pattern (e.g. "if (_rgx == null) _rgx = new Regex(...)"). That way, you
only ever create a single instance of the class, which is then reused
every time you need it.
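
A rough sketch of those three options side by side, assuming the \s+ pattern used elsewhere in this thread. The class name LispFileSketch is a hypothetical stand-in for cLispFile, and the method names are invented:

```csharp
using System;
using System.Text.RegularExpressions;

public class LispFileSketch
{
    // Option 1: initialize the static field directly.
    private static readonly Regex _direct =
        new Regex(@"\s+", RegexOptions.Compiled);

    // Option 2: initialize in the static constructor.
    private static readonly Regex _fromStaticCtor;
    static LispFileSketch()
    {
        _fromStaticCtor = new Regex(@"\s+", RegexOptions.Compiled);
    }

    // Option 3: lazy creation on first use (not thread-safe as written).
    private static Regex _lazy;
    private static Regex LazyRegex
    {
        get
        {
            if (_lazy == null)
            {
                _lazy = new Regex(@"\s+", RegexOptions.Compiled);
            }
            return _lazy;
        }
    }

    public static string CollapseDirect(string s) { return _direct.Replace(s, " "); }
    public static string CollapseStaticCtor(string s) { return _fromStaticCtor.Replace(s, " "); }
    public static string CollapseLazy(string s) { return LazyRegex.Replace(s, " "); }
}
```

In all three cases the Regex is constructed once and every later call reuses the same instance.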

Pete

it sounds like (if i'm understanding) if I create the regex in the
constructor of the class, i can only use one regex "search pattern"
since i'm using it in two alternate methods, to test timing of different
regex "searches"
i think i have to use your last suggestion
"if (_rgx == null) _rgx = new Regex(...)"
and then null it out after each test so the next test creates the next
search term
public void CompareTimingMethods()
{
    int NumTests = 10000;
    [...]
    for (int idx = 0; idx < NumTests; idx++)
    {
        newList2 = RemoveExcessWhitespaceFromListOfStrings2(oldList);
    }
    _rgx = null;
    for (int idx = 0; idx < NumTests; idx++)
    {
        newList3 = RemoveExcessWhitespaceFromListOfStrings3(oldList);
    }
    _rgx = null;

    [...]
}

private List<string> RemoveExcessWhitespaceFromListOfStrings2(List<string> oldList)
{
    List<string> newList = new List<string>();
    foreach (String s in oldList)
    {
        newList.Add(RemoveExcessWhitespace2b(s));
    }
    return newList;
}

// RemoveExcessWhitespaceFromListOfStrings3 is of course similar

private string RemoveExcessWhitespace2b(string inputString)
{
    if (_rgx == null)
    {
        _rgx = new Regex(@"\s+", RegexOptions.Compiled);
    }
    return _rgx.Replace(inputString, " ").Replace(") ", ")").Replace(" (", "(").Trim();
}

private string RemoveExcessWhitespace3b(string inputString)
{
    if (_rgx == null)
    {
        _rgx = new Regex(@"\s{2,}", RegexOptions.Compiled);
    }
    return _rgx.Replace(inputString, " ").Replace(") ", ")").Replace(" (", "(").Trim();
}

that gave the following times (showing "\s{2,}" to be the winner)

Time for function :
LoopVersion1
804 milliseconds
Time for function :
RegexVersion2
87 milliseconds
Time for function :
RegexVersion3
36 milliseconds
Time for function :
loopVersion4
249 milliseconds


so I guess, now knowing which pattern to use, i can create the _rgx object
further up the chain and just use the one pattern.

thanks much for your help

mrk
 
Arne Vajhøj

surprisingly (unless i'm doing something wrong in my test)
the regex solution takes twice as long as my loop solution

First:
* if you are a student and having fun with this, then just continue
* if you are being paid for this, then start by testing whether
performance is really a problem, if not then stop wasting time on it

Second, regarding the testing then look below for some test code.

Arne

====

using System;
using System.Text;
using System.Text.RegularExpressions;

namespace E
{
    public interface MultiSpaceTrimmerTest
    {
        string Trim(string s);
        string Name { get; }
    }
    public class WhileString : MultiSpaceTrimmerTest
    {
        public string Trim(string s)
        {
            string res = s;
            int len;
            do
            {
                len = res.Length;
                res = res.Replace("  ", " "); // two spaces become one
            }
            while(res.Length < len);
            return res;
        }
        public string Name
        {
            get
            {
                return "while loop with String";
            }
        }
    }
    public class WhileStringBuilder : MultiSpaceTrimmerTest
    {
        public string Trim(string s)
        {
            StringBuilder res = new StringBuilder(s);
            int len;
            do
            {
                len = res.Length;
                res.Replace("  ", " "); // two spaces become one
            }
            while(res.Length < len);
            return res.ToString();
        }
        public string Name
        {
            get
            {
                return "while loop with StringBuilder";
            }
        }
    }
    public class RegexNegativeLookBehind : MultiSpaceTrimmerTest
    {
        public string Trim(string s)
        {
            // removes every space that is preceded by a space
            return Regex.Replace(s, "(?<= ) ", "");
        }
        public string Name
        {
            get
            {
                return "static Regex Replace with ?<=";
            }
        }
    }
    public class RegexOneOrMore : MultiSpaceTrimmerTest
    {
        public string Trim(string s)
        {
            return Regex.Replace(s, " +", " ");
        }
        public string Name
        {
            get
            {
                return "static Regex Replace with +";
            }
        }
    }
    public class RegexNegativeLookBehindOpt : MultiSpaceTrimmerTest
    {
        private readonly Regex re = new Regex("(?<= ) ", RegexOptions.Compiled);
        public string Trim(string s)
        {
            return re.Replace(s, "");
        }
        public string Name
        {
            get
            {
                return "reused Regex Replace with ?<=";
            }
        }
    }
    public class RegexOneOrMoreOpt : MultiSpaceTrimmerTest
    {
        private readonly Regex re = new Regex(" +", RegexOptions.Compiled);
        public string Trim(string s)
        {
            return re.Replace(s, " ");
        }
        public string Name
        {
            get
            {
                return "reused Regex Replace with +";
            }
        }
    }
    public class ForLoop : MultiSpaceTrimmerTest
    {
        public string Trim(string s)
        {
            StringBuilder res = new StringBuilder();
            res.Append(s[0]);
            for(int i = 1; i < s.Length; i++)
            {
                // copy the character unless it is a space preceded by a space
                if(s[i] != ' ' || s[i-1] != ' ')
                {
                    res.Append(s[i]);
                }
            }
            return res.ToString();
        }
        public string Name
        {
            get
            {
                return "for loop";
            }
        }
    }
    public class Program
    {
        private const string BASE = "x x  x   x    x";
        private const int N = 10000000;
        private static void FuncTest()
        {
            Console.WriteLine(new WhileString().Trim(BASE));
            Console.WriteLine(new WhileStringBuilder().Trim(BASE));
            Console.WriteLine(new RegexNegativeLookBehind().Trim(BASE));
            Console.WriteLine(new RegexOneOrMore().Trim(BASE));
            Console.WriteLine(new RegexNegativeLookBehindOpt().Trim(BASE));
            Console.WriteLine(new RegexOneOrMoreOpt().Trim(BASE));
            Console.WriteLine(new ForLoop().Trim(BASE));
        }
        private static void PerfTest(string s, MultiSpaceTrimmerTest mstt)
        {
            long t1 = DateTime.Now.Ticks;
            for(int i = 0; i < N/s.Length; i++)
            {
                mstt.Trim(s);
            }
            long t2 = DateTime.Now.Ticks;
            Console.WriteLine(" {0} : {1}", mstt.Name, s.Length*(t2 - t1)/N);
        }
        private static void PerfTest(string s)
        {
            Console.WriteLine("string with length {0}:", s.Length);
            PerfTest(s, new WhileString());
            PerfTest(s, new WhileStringBuilder());
            PerfTest(s, new RegexNegativeLookBehind());
            PerfTest(s, new RegexOneOrMore());
            PerfTest(s, new RegexNegativeLookBehindOpt());
            PerfTest(s, new RegexOneOrMoreOpt());
            PerfTest(s, new ForLoop());
        }
        private static void PerfTest(int n)
        {
            StringBuilder data = new StringBuilder();
            for(int i = 0; i < n; i++)
            {
                data.Append(BASE);
            }
            PerfTest(data.ToString());
        }
        private static void PerfTest()
        {
            PerfTest(1);
            PerfTest(10);
            PerfTest(100);
        }
        public static void Main(string[] args)
        {
            FuncTest();
            PerfTest();
            Console.ReadKey();
        }
    }
}
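
A possible variation on the timing part, as a sketch only: DateTime.Now can have fairly coarse resolution, so System.Diagnostics.Stopwatch is commonly used for measurements like this. The names TimingSketch and TimeMilliseconds are made up for illustration:

```csharp
using System;
using System.Diagnostics;

public static class TimingSketch
{
    // Runs action the given number of times and returns the elapsed
    // wall-clock time in milliseconds (hypothetical helper).
    public static long TimeMilliseconds(Action action, int iterations)
    {
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }
}
```

It could be called as TimingSketch.TimeMilliseconds(() => trimmer.Trim(s), 10000), where trimmer is any of the test classes above.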
 
mp

Peter Duniho said:
it sounds like (if i'm understanding) if I create the regex in the
constructor of the class, i can only use one regex "search pattern"
since i'm using it in two alternate methods, to test timing of different
regex "searches" [...]

For future reference, note that there's nothing wrong with having more
than one static Regex object in your class. In the case of testing
performance, the approach you took is fine. But if you really had more
than one pattern you wanted to be able to use, based on whatever criteria
you want, you can just create more than one static Regex object for your
class.

class SomeClass
{
    private static Regex _regex1 = new Regex("first regex pattern", RegexOptions.Compiled);
    private static Regex _regex2 = new Regex("second regex pattern", RegexOptions.Compiled);

    // etc.
}

Pete

Thanks,
I had thought of that also, but was just trying to go with the idea of just
reusing a single one.
thanks for all your help on this (and all my other questions too)
:)
i sure see the usefulness in learning that arcane(to me) regex syntax to
take advantage of its power.

mark
 
mp

Arne Vajhøj said:
Arne Vajhøj said:
One pass solutions are possible.
[...]

surprisingly (unless i'm doing something wrong in my test)
[...]

First:
* if you are a student and having fun with this, then just continue

i'm not a student in the normal sense of the word, just an old man who hopes
(perhaps erroneously)
that continued effort at learning the presently unknown will help delay the
onset of total senility...
:)
and who has become helplessly addicted to the enjoyment of learning
programming even though i have no practical use for any of it.
(i'm just taking occasional breaks from remodeling my house to dabble with
this attempt to learn c#)
* if you are being paid for this, then start by testing whether
performance is really a problem, if not then stop wasting time on it

i'm definitely NOT being paid for this (or anything else for that matter )
:-(
my wife, i'm sure, would echo the sentiment of wasting time if she knew i
was on the computer and not in the attic pulling wires and knocking out my
punch list...
:)

Second, regarding the testing then look below for some test code.

Arne
=>
namespace E
{
[...]

that is fantastic!!...i will spend more time studying the framework you are
presenting here...very interesting
thank you very much for spending so much time on this example...i will learn
a lot...
now i better get back out there in the cold and finish screwing on my window
trim :)
mark
 
Arne Vajhøj

i sure see the usefulness in learning that arcane(to me) regex syntax to
take advantage of its power.

Regex is one of those things a programmer should learn
a bit about (similar to SQL, XPath etc.).

Arne
 
mp

Arne Vajhøj said:
Regex is one of those things a programmer should learn
a bit about (similar to SQL, XPath etc.).

Arne
and i much appreciate your and Peter's great help toward that end
mark
 
mp

Arne Vajhøj said:
[...]

Second, regarding the testing then look below for some test code.

Arne

wow
surprisingly(to me) your (elegant) for loop is winner hands down!

while loop with String : 13
while loop with StringBuilder : 15
static Regex Replace with ?<= : 105
static Regex Replace with + : 61
reused Regex Replace with ?<= : 44
reused Regex Replace with + : 29
for loop : 7

while loop with String : 79
while loop with StringBuilder : 76
static Regex Replace with ?<= : 908
static Regex Replace with + : 488
reused Regex Replace with ?<= : 384
reused Regex Replace with + : 270
for loop : 66

while loop with String : 751
while loop with StringBuilder : 660
static Regex Replace with ?<= : 9148
static Regex Replace with + : 4611
reused Regex Replace with ?<= : 3860
reused Regex Replace with + : 2508
for loop : 615

if (s[i] != ' ' || s[i - 1] != ' ')
{
    res.Append(s[i]);
}
now that's a thing of beauty

makes me laugh at my clumsy version :)

currentCharacter = inputString[CharPos];
if (currentCharacter == ' ')
{
    if (preceedingCharacter != ' ')
        returnString.Append(currentCharacter);
}
else
    returnString.Append(currentCharacter);
preceedingCharacter = currentCharacter;

must be why your forloop is so much faster than mine
thanks again
mark
 
Arne Vajhøj

surprisingly(to me) your (elegant) for loop is winner hands down!

Often there is a trade off between speed and maintainability.

The Regex solutions are a lot easier to extend to add
more functionality.

The hand coded solution will soon be buried in an
unreadable mess of if statements.
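
As a sketch of that extensibility, assuming the extra requirement is the one mentioned earlier in the thread (no space after ")" and none before "("), the regex version only needs one more pattern. The class and method names are hypothetical:

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexCleanupSketch
{
    // Any run of white space (spaces, tabs, newlines).
    private static readonly Regex Whitespace =
        new Regex(@"\s+", RegexOptions.Compiled);

    // Matches ") " or " (" so the space next to the parenthesis can be dropped.
    private static readonly Regex ParenSpace =
        new Regex(@"\) | \(", RegexOptions.Compiled);

    public static string Clean(string input)
    {
        string collapsed = Whitespace.Replace(input, " ");
        // Replace each match with itself minus the space.
        return ParenSpace.Replace(collapsed, m => m.Value.Trim());
    }
}
```

For example, Clean("f  (x)   y") gives "f(x)y". Doing the same in the hand-coded loop would need extra look-ahead and look-behind checks for the parentheses.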
if (s[i] != ' ' || s[i - 1] != ' ')
{
    res.Append(s[i]);
}
now that's a thing of beauty

makes me laugh at my clumsy version :)

currentCharacter = inputString[CharPos];
if (currentCharacter == ' ')
{
    if (preceedingCharacter != ' ')
        returnString.Append(currentCharacter);
}
else
    returnString.Append(currentCharacter);
preceedingCharacter = currentCharacter;

must be why your forloop is so much faster than mine


I would not expect that big a difference in speed between those
two versions. There is no big difference
in what they do; it is just how the code is written.

Arne
 
