Regex to remove \t \r \n from string

morleyc

Hi, I would like to remove a number of characters from my string (\t,
\r and \n, which occur throughout the string). I know regex can do this, but I
have no idea how. Any pointers much appreciated.

Chris
 
Nicholas Paldino [.NET/C# MVP]

Chris,

Why not just use three calls to the Replace method on the String class?

string myString = input.Replace("\t", "").Replace("\r", "").Replace("\n", "");

The character version, Replace(char, char), only fits if you want to
substitute another character (a space, say), since it cannot remove one.
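A minimal sketch of the two overloads side by side. The class and method names (ReplaceDemo, StripControlChars, ControlCharsToSpaces) are made up for illustration. The string overload deletes; the char overload can only swap one character for another, because C# has no empty char.

```csharp
using System;

public static class ReplaceDemo
{
    // String overload: each call returns a new string with one character removed.
    public static string StripControlChars(string input)
    {
        return input.Replace("\t", "").Replace("\r", "").Replace("\n", "");
    }

    // Char overload: substitutes a space for each control character.
    public static string ControlCharsToSpaces(string input)
    {
        return input.Replace('\t', ' ').Replace('\r', ' ').Replace('\n', ' ');
    }

    public static void Main()
    {
        Console.WriteLine(StripControlChars("a\tb\r\nc"));    // abc
        Console.WriteLine(ControlCharsToSpaces("a\tb\r\nc")); // a b  c
    }
}
```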
 
morleyc

Nicholas Paldino said:
Why not just use three calls to the Replace method on the String class?

I am currently using the three Replace calls :), but I have always
avoided regular expressions before, and this seemed the ideal excuse to
learn them! I would also be interested in turning \r\n in a string into
just \n. I'm sure it must be possible?
 
Nicholas Paldino [.NET/C# MVP]

Absolutely, just wondering why you wouldn't take the simpler, more
maintainable (depending on who is looking at it, at least from my point of
view) approach. =)

In this case, I believe you can have a regular expression of "[\t\r\n]"
and then call the Replace method, passing your input string and an empty
string (or whatever you want to replace any of the characters in that set
with) and it should work.
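As a sketch of that suggestion, the following uses the static Regex.Replace method with the character class, plus a second helper for the \r\n to \n question raised earlier in the thread. The names RegexStripDemo, Strip and NormalizeLineEndings are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexStripDemo
{
    // Removes every tab, carriage return and line feed in one pass over the input.
    public static string Strip(string input)
    {
        return Regex.Replace(input, @"[\t\r\n]", "");
    }

    // Turns each CR LF pair into a single LF; lone CR or LF characters are left alone.
    public static string NormalizeLineEndings(string input)
    {
        return Regex.Replace(input, @"\r\n", "\n");
    }

    public static void Main()
    {
        Console.WriteLine(Strip("a\tb\r\nc"));                               // abc
        Console.WriteLine(NormalizeLineEndings("one\r\ntwo") == "one\ntwo"); // True
    }
}
```

For the line-ending case, a plain input.Replace("\r\n", "\n") gives the same result without a regex.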
 
tomisarobot

It certainly is possible. You should create a little test project and
play with it. The thing to remember about regex is to start small and
build up. It's not hard really, but it's horribly easy to assume that
things will behave differently from the reality.

It's been a while since I've done captures with PCRE, but for the simple
replace you are probably looking at something like this: [\r\n\t]
(no | is needed inside a character class, where it would match a literal pipe).

.NET also has some context variable to make sure you have your
end-of-lines localized correctly, if that's all you are trying to do.
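One detail worth checking in such a test project: inside a character class, | is a literal character rather than alternation, so a class written as [\r|\n|\t] also matches pipes. A small sketch, with hypothetical names (PipeDemo, WithPipes, WithoutPipes):

```csharp
using System;
using System.Text.RegularExpressions;

public static class PipeDemo
{
    // The | characters are members of the class here, so pipes are removed too.
    public static string WithPipes(string input)
    {
        return Regex.Replace(input, @"[\r|\n|\t]", "");
    }

    // Only the three control characters are members of this class.
    public static string WithoutPipes(string input)
    {
        return Regex.Replace(input, @"[\r\n\t]", "");
    }

    public static void Main()
    {
        Console.WriteLine(WithPipes("a|b\tc"));    // abc
        Console.WriteLine(WithoutPipes("a|b\tc")); // a|bc
    }
}
```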
 
Ben Voigt

Nicholas Paldino said:
Absolutely, just wondering why you wouldn't take the simpler, more
maintainable (depending on who is looking at it, at least from my point of
view) approach. =)

Because your simpler method involves three complete string copies instead of
one!


Regex.Replace ought to do it.
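For comparison, a hand-written single-pass version is sketched below. Nobody in the thread proposed it, and whether it beats the other approaches would have to be measured; it only illustrates what "one copy instead of three" means. The names SinglePassDemo and Strip are hypothetical.

```csharp
using System;
using System.Text;

public static class SinglePassDemo
{
    // Walks the input once and copies only the characters that should be kept.
    public static string Strip(string input)
    {
        StringBuilder sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            if (c != '\t' && c != '\r' && c != '\n')
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Strip("a\tb\r\nc")); // abc
    }
}
```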
 
Jon Skeet [C# MVP]

Ben Voigt said:
Because your simpler method involves three complete string copies instead of
one!

Do we have any evidence that performance is an issue here? Further, do
we have evidence that regular expressions will actually make this
faster on the sample data?

Until both of those have been determined, I'd take a default course of
the simplest code which does the job.

Ben Voigt said:
Regex.Replace ought to do it.

At what cost to readability though?
 
Ben Voigt

Jon Skeet said:
Do we have any evidence that performance is an issue here? Further, do
we have evidence that regular expressions will actually make this
faster on the sample data?

Until both of those have been determined, I'd take a default course of
the simplest code which does the job.

Well, ok, but you asked why anyone would ever choose not to do it that way,
and I gave an example.

Jon Skeet said:
At what cost to readability though?

Admittedly, a String.Replace(Regex, String) method would be far more
readable, but it would set up a dependency from String on Regex.
 
Jon Skeet [C# MVP]

Ben Voigt said:
Well, ok, but you asked why anyone would ever choose not to do it that way,
and I gave an example.

That's fair enough.

Ben Voigt said:
Admittedly, a String.Replace(RegEx, String) method would be far more
readable, but set up a dependency from string on RegEx.

More importantly, it sets up a dependency on the reader understanding
regular expressions, which I've seen causing issues time and time again
in these newsgroups.

I'm all for regular expressions when their power is really needed, but
that tends to be pretty rare IME.
 
Arne Vajhøj

Jon said:
Do we have any evidence that performance is an issue here? Further, do
we have evidence that regular expressions will actually make this
faster on the sample data?

A simple test seems to indicate that regex is slower.

(test : input length -> output length x iterations : seconds, decimal comma)

String Replace            :   15 ->   12 x 6666666 : 6,6875
StringBuilder Replace     :   15 ->   12 x 6666666 : 6,546875
Regex Replace             :   15 ->   12 x 6666666 : 27,1875
Regex Replace Optimized   :   15 ->   12 x 6666666 : 15,828125
String Replace            :  960 ->  768 x  104166 : 3,3125
StringBuilder Replace     :  960 ->  768 x  104166 : 2,03125
Regex Replace             :  960 ->  768 x  104166 : 17,421875
Regex Replace Optimized   :  960 ->  768 x  104166 : 13,4375
String Replace            : 1000 -> 1000 x  100000 : 1,15625
StringBuilder Replace     : 1000 -> 1000 x  100000 : 2,4375
Regex Replace             : 1000 -> 1000 x  100000 : 3,78125
Regex Replace Optimized   : 1000 -> 1000 x  100000 : 2,703125

(see code below)

Jon said:
At what cost to readability though?

Actually I think the regex code is more readable.

Arne

==========================================================

using System;
using System.Text;
using System.Text.RegularExpressions;

namespace E
{
    public class MainClass
    {
        // Total number of characters processed per test, so n = N / s.Length.
        private const int N = 100000000;
        private const string FMT = "{0,-25} : {1} -> {2} x {3} : {4}";

        private static void TestStringReplace(string s)
        {
            int n = N / s.Length;
            string s2 = null;
            DateTime dt1 = DateTime.Now;
            for (int i = 0; i < n; i++)
            {
                s2 = s.Replace("\r", "").Replace("\n", "").Replace("\t", "");
            }
            DateTime dt2 = DateTime.Now;
            Console.WriteLine(String.Format(FMT, "String Replace", s.Length,
                s2.Length, n, (dt2 - dt1).TotalSeconds));
        }

        private static void TestStringBuilderReplace(string s)
        {
            int n = N / s.Length;
            // Note: sb is created once and Replace modifies it in place, so only
            // the first iteration removes anything; later iterations scan an
            // already-cleaned buffer and copy it out with ToString.
            StringBuilder sb = new StringBuilder(s);
            string s2 = null;
            DateTime dt1 = DateTime.Now;
            for (int i = 0; i < n; i++)
            {
                s2 = sb.Replace("\r", "").Replace("\n", "").Replace("\t", "").ToString();
            }
            DateTime dt2 = DateTime.Now;
            Console.WriteLine(String.Format(FMT, "StringBuilder Replace",
                s.Length, s2.Length, n, (dt2 - dt1).TotalSeconds));
        }

        private static void TestRegexReplace(string s)
        {
            int n = N / s.Length;
            string s2 = null;
            DateTime dt1 = DateTime.Now;
            for (int i = 0; i < n; i++)
            {
                // Static call: the pattern is looked up or parsed on each call.
                s2 = Regex.Replace(s, "[\r\n\t]", "");
            }
            DateTime dt2 = DateTime.Now;
            Console.WriteLine(String.Format(FMT, "Regex Replace", s.Length,
                s2.Length, n, (dt2 - dt1).TotalSeconds));
        }

        private static void TestRegexReplaceOptimized(string s)
        {
            int n = N / s.Length;
            // Compiled once, outside the timed loop.
            Regex re = new Regex("[\r\n\t]", RegexOptions.Compiled);
            string s2 = null;
            DateTime dt1 = DateTime.Now;
            for (int i = 0; i < n; i++)
            {
                s2 = re.Replace(s, "");
            }
            DateTime dt2 = DateTime.Now;
            Console.WriteLine(String.Format(FMT, "Regex Replace Optimized",
                s.Length, s2.Length, n, (dt2 - dt1).TotalSeconds));
        }

        private static void Test(string s)
        {
            TestStringReplace(s);
            TestStringBuilderReplace(s);
            TestRegexReplace(s);
            TestRegexReplaceOptimized(s);
        }

        public static void Main(string[] args)
        {
            // 15 characters, 3 of them to remove.
            string shortstr = "aaa\rbbb\nccc\tddd";
            Test(shortstr);
            // Doubled six times: 960 characters, 192 of them to remove.
            string longstr = shortstr;
            longstr += longstr;
            longstr += longstr;
            longstr += longstr;
            longstr += longstr;
            longstr += longstr;
            longstr += longstr;
            Test(longstr);
            // 1000 characters, nothing to remove.
            string nonestr = String.Empty.PadRight(1000, 'A');
            Test(nonestr);
            Console.ReadLine();
        }
    }
}
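A hedged variation on the timing harness, not taken from the thread: it uses System.Diagnostics.Stopwatch instead of DateTime.Now and builds a fresh StringBuilder for every call, so each iteration has something to replace. TimingSketch, StringBuilderReplace and TimeSeconds are hypothetical names, and the iteration count in Main is arbitrary.

```csharp
using System;
using System.Diagnostics;
using System.Text;

public static class TimingSketch
{
    // A new builder per call, so the control characters are present every time.
    public static string StringBuilderReplace(string s)
    {
        return new StringBuilder(s)
            .Replace("\r", "").Replace("\n", "").Replace("\t", "")
            .ToString();
    }

    // Runs f(s) n times and returns the elapsed wall-clock time in seconds.
    public static double TimeSeconds(Func<string, string> f, string s, int n)
    {
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < n; i++)
        {
            f(s);
        }
        sw.Stop();
        return sw.Elapsed.TotalSeconds;
    }

    public static void Main()
    {
        string s = "aaa\rbbb\nccc\tddd";
        double seconds = TimeSeconds(StringBuilderReplace, s, 1000000);
        Console.WriteLine("StringBuilder Replace : {0} s", seconds);
    }
}
```

Numbers from a harness like this would not be comparable with the figures posted above, since the work done per iteration differs.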
 
Jon Skeet [C# MVP]

Arne Vajhøj said:
Actually I think the regex code is more readable.

Well, it's interesting that your regex is "[\r\n\t]". I'm actually
slightly surprised this even works, as the \r, \n and \t are being
taken literally by the regex engine rather than having been escaped in
the normal way. I'd have expected "[\\r\\n\\t]" or @"[\r\n\t]" to make
it clear to the regex engine that you really meant the carriage return
etc to be part of the regex, and not incidental or for the sake of
readability (splitting the regex over several lines, as shown in
Jesse's example in another thread).

That extra level of escaping which is required in *some* cases (but
clearly not all) as well as having to understand the basic language of
regex in the first place is what makes it less readable in my opinion.
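The three spellings can be compared directly. In the sketch below (EscapingDemo and its members are hypothetical names), the regular literal hands the regex engine three raw control characters inside the class, while the verbatim and double-backslash literals hand it backslash escapes that the engine interprets itself. All three end up matching the same characters.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapingDemo
{
    // The C# compiler converts \r, \n and \t into the actual control characters.
    public const string Regular = "[\r\n\t]";

    // Verbatim literal: the backslashes reach the regex engine unchanged.
    public const string Verbatim = @"[\r\n\t]";

    // Doubled backslashes produce the same text as the verbatim literal.
    public const string Doubled = "[\\r\\n\\t]";

    public static string Strip(string input, string pattern)
    {
        return Regex.Replace(input, pattern, "");
    }

    public static void Main()
    {
        Console.WriteLine(Regular.Length);      // 5
        Console.WriteLine(Verbatim.Length);     // 8
        Console.WriteLine(Verbatim == Doubled); // True
        foreach (string pattern in new string[] { Regular, Verbatim, Doubled })
        {
            Console.WriteLine(Strip("a\tb\r\nc", pattern)); // abc each time
        }
    }
}
```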
 
Arne Vajhøj

Jon said:
Arne Vajhøj said:
Actually I think the regex code is more readable.

Well, it's interesting that your regex is "[\r\n\t]". I'm actually
slightly surprised this even works, as the \r, \n and \t are being
taken literally by the regex engine rather than having been escaped in
the normal way. I'd have expected "[\\r\\n\\t]" or @"[\r\n\t]" to make
it clear to the regex engine that you really meant the carriage return
etc to be part of the regex, and not incidental or for the sake of
readability (splitting the regex over several lines, as shown in
Jesse's example in another thread).

That extra level of escaping which is required in *some* cases (but
clearly not all) as well as having to understand the basic language of
regex in the first place is what makes it less readable in my opinion.

I just used the regex provided by Nicholas.

And yes there are different rules inside and outside character
classes.

And I can not see the readability problem. The intent of the
code is obvious.

You are not sure that it works correctly. But that can be
verified.

The Substring/IndexOf combo could be less obvious to read
and would still need to be verified that it works.

Arne
 
Jon Skeet [C# MVP]

Arne Vajhøj said:
I just used the regex provided by Nicholas.

And yes there are different rules inside and outside character
classes.

And I can not see the readability problem. The intent of the
code is obvious.

To you, possibly. To me, even - I've done just enough regex to work out
what it means, although I wouldn't necessarily say it's obvious. To
every maintenance engineer? Not necessarily.

Arne Vajhøj said:
You are not sure that it works correctly. But that can be
verified.

There are lots of things that can be verified, but which are still less
obvious than writing things in a simpler way.

Arne Vajhøj said:
The Substring/IndexOf combo could be less obvious to read
and would still need to be verified that it works.

There's no Substring/IndexOf to be done - just three calls to Replace.
It's blindingly obvious what *they* do.
 
Arne Vajhøj

Jon said:
To you, possibly. To me, even - I've done just enough regex to work out
what it means, although I wouldn't necessarily say it's obvious. To
every maintenance engineer? Not necessarily.

It is a feature in .NET - it is a feature in most programming
environments today.

If they don't know, then they should learn.

Jon said:
There's no Substring/IndexOf to be done - just three calls to Replace.
It's blindingly obvious what *they* do.

No Substring/IndexOf in this case. But often regex is replaced
with some string manipulation code in the worst tradition of
C str functions.

Arne
 
Jon Skeet [C# MVP]

Arne Vajhøj said:
It is a feature in .NET - it is a feature in most programming
environments today.

If they don't know, then they should learn.

I'd rather not have to check the ins and outs of regular expressions
when there's a *very* simple alternative. It's so easy to go wrong with
regular expressions - I only use them when they provide a clear
benefit, which I don't believe they do in this case.

Just because you *can* do something with a regex doesn't mean you
*should*. I'm happy to go back and be really careful with regular
expressions when there's a good reason to use them, like validating
something which is genuinely a *pattern*, but I've seen enough people
get confused by them to be wary of them myself.

Arne Vajhøj said:
No Substring/IndexOf in this case. But often regex is replaced
with some string manipulation code in the worst tradition of
C str functions.

And likewise simple string manipulation code is replaced with a regex
for no reason whatsoever, sometimes introducing bugs at the same time.
 
Jesse Houwing

* Arne Vajhøj wrote, On 20-5-2007 2:18:
Nicholas Paldino [.NET/C# MVP] said:
Absolutely, just wondering why you wouldn't take the simpler,
more maintainable (depending on who is looking at it, at least from
my point of view) approach. =)

Ben Voigt said:
Because your simpler method involves three complete string copies
instead of one!

Jon said:
Do we have any evidence that performance is an issue here? Further, do
we have evidence that regular expressions will actually make this
faster on the sample data?

I was intrigued by your results, so I expanded the test a little more.
My adjusted tests also take into account that the amount of text to
replace affects the execution speed.

And I was right.

What scares me, though, is that for String and StringBuilder
manipulation, a larger amount to remove has 'little' impact. In fact,
a StringBuilder is the best option if you have a lot to remove.

The regular expressions get much, much slower as the amount to
remove increases. It looks like there is some very expensive buffer
copying going on in there.

Attached you'll find my adjusted test app. I'll include the test results
at the bottom of this post. All tests were run as an x64-compiled
executable, no debugger attached, full optimization. This made quite a
difference, by the way.

Arne Vajhøj said:
Actually I think the regex code is more readable.

If there were more characters to strip, say 10 or more, the regex would
become the more readable option very quickly. Though I personally would
have chosen the following construction:

string victim = "...";
string[] stringsToRemove = new string[] { "\r", "\n", "\t" };
foreach (string stringToRemove in stringsToRemove)
{
    // String.Remove takes indexes, so use Replace with an empty string
    victim = victim.Replace(stringToRemove, "");
    // Or a StringBuilder variant
}

This is easier to read, variables have logical names and it is easy to
add new characters later, or switch strategy without having to go
through 7 or more calls which are all the same.

The regex variant I would have used would have looked like this:

Regex rx = new Regex(@"
    [\r\n\t] (?# Characters to replace )
", RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace);

Or:

Regex rx = new Regex(@"
( (?# Characters to replace )
\r
| \n
| \t
)
", RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace);

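A pattern laid out over several lines like these only behaves as intended when RegexOptions.IgnorePatternWhitespace is set. Without it, the line breaks and indentation inside the verbatim string are matched literally. A minimal sketch of the difference, using hypothetical names (VerbosePatternDemo, StripVerbose, StripLiteral):

```csharp
using System;
using System.Text.RegularExpressions;

public static class VerbosePatternDemo
{
    // The pattern text includes the surrounding line breaks and indentation.
    private const string Pattern = @"
        [\r\n\t]   (?# Characters to replace )
    ";

    // Whitespace in the pattern is ignored, leaving just the character class.
    public static string StripVerbose(string input)
    {
        return Regex.Replace(input, Pattern, "", RegexOptions.IgnorePatternWhitespace);
    }

    // Whitespace in the pattern must match literally, so ordinary input is left unchanged.
    public static string StripLiteral(string input)
    {
        return Regex.Replace(input, Pattern, "");
    }

    public static void Main()
    {
        Console.WriteLine(StripVerbose("a\tb\r\nc"));               // abc
        Console.WriteLine(StripLiteral("a\tb\r\nc") == "a\tb\r\nc"); // True
    }
}
```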

One thing we haven't looked at so far is the pressure put on the
garbage collector. Memory usage spiked to 120 MB in the most extensive
test case, which I think is quite a lot. At another point it started
yo-yoing between 80 and 107 MB, which are very large deltas. It was good
that my system had ample RAM and nothing else to do. Processor time isn't
the only thing that counts :).

Ok. As promised, the results:

=====================================================================

Elapsed time in milliseconds, by input string length.

0% whitespace to replace

Input length              :    1000 :   10000 :  100000 : 1000000
String Replace            :       0 :      16 :      94 :     891
StringBuilder Replace     :       0 :      47 :     203 :    1750
Regex Replace             :       0 :      31 :     344 :    2859
Regex Replace Optimized   :       0 :      31 :     391 :    2891
Regex Replace A           :      16 :      31 :     344 :    2891
Regex Replace A Optimized :       0 :      47 :     281 :    2438

5% whitespace to replace

Input length              :    1000 :   10000 :  100000 : 1000000
String Replace            :       0 :      16 :     219 :    1734
StringBuilder Replace     :       0 :      31 :     203 :    1703
Regex Replace             :      16 :      78 :     688 :    5531
Regex Replace Optimized   :      16 :      63 :     531 :    4406
Regex Replace A           :      16 :      78 :     656 :    5516
Regex Replace A Optimized :      16 :      63 :     563 :    4500

50% whitespace to replace

Input length              :    1000 :   10000 :  100000 : 1000000
String Replace            :       0 :      47 :     281 :    2344
StringBuilder Replace     :       0 :      16 :     156 :    1203
Regex Replace             :      47 :     281 :    2438 :   20609
Regex Replace Optimized   :      31 :     203 :    1828 :   14750
Regex Replace A           :      31 :     297 :    2375 :   19594
Regex Replace A Optimized :      31 :     219 :    1828 :   14875

95% whitespace to replace

Input length              :    1000 :   10000 :  100000 : 1000000
String Replace            :      16 :      47 :     344 :    2859
StringBuilder Replace     :       0 :      16 :      78 :     656
Regex Replace             :      63 :     500 :    4125 :   32750
Regex Replace Optimized   :      47 :     359 :    3141 :   24016
Regex Replace A           :      63 :     469 :    4047 :   31453
Regex Replace A Optimized :      47 :     344 :    2922 :   23953

100% whitespace to replace

Input length              :    1000 :   10000 :  100000 : 1000000
String Replace            :       0 :      31 :     328 :    2891
StringBuilder Replace     :      16 :       0 :      78 :     625
Regex Replace             :      63 :     516 :    4172 :   34203
Regex Replace Optimized   :      47 :     359 :    3406 :   24781
Regex Replace A           :      63 :     500 :    4203 :   32672
Regex Replace A Optimized :      47 :     375 :    3031 :   24547


=====================================================================

And the code:

=====================================================================
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

namespace ConsoleApplication1
{
    class Program
    {
        // Prints the test name, the input length and the elapsed milliseconds.
        private const string FMT = "{0,-25} : {1,-15} : {4,11:##########0}";
        private static Regex rxA = new Regex(@"[\r\n\t]",
            RegexOptions.Compiled);
        private static Regex rxB = new Regex(@"(\r|\n|\t)",
            RegexOptions.Compiled | RegexOptions.ExplicitCapture);

        private static void TestStringReplace(string s)
        {
            int n = ComputeRepetitions(s);
            string s2 = null;
            DateTime dt1 = DateTime.Now;
            for (int i = 0; i < n; i++)
            {
                s2 = s.Replace("\r", "").Replace("\n", "").Replace("\t", "");
            }
            DateTime dt2 = DateTime.Now;
            Console.WriteLine(String.Format(FMT, "String Replace",
                s.Length, s2.Length, n, (dt2 - dt1).TotalMilliseconds));
        }

        private static void TestStringBuilderReplace(string s)
        {
            int n = ComputeRepetitions(s);
            // Note: sb is created once and Replace modifies it in place, so only
            // the first iteration removes anything; later iterations scan an
            // already-cleaned buffer and copy it out with ToString.
            StringBuilder sb = new StringBuilder(s);
            string s2 = null;
            DateTime dt1 = DateTime.Now;
            for (int i = 0; i < n; i++)
            {
                s2 = sb.Replace("\r", "").Replace("\n", "").Replace("\t", "").ToString();
            }
            DateTime dt2 = DateTime.Now;
            Console.WriteLine(String.Format(FMT, "StringBuilder Replace",
                s.Length, s2.Length, n, (dt2 - dt1).TotalMilliseconds));
        }

        private static void TestRegexReplace(string s)
        {
            int n = ComputeRepetitions(s);
            string s2 = null;
            DateTime dt1 = DateTime.Now;
            for (int i = 0; i < n; i++)
            {
                s2 = Regex.Replace(s, @"[\r\n\t]", "");
            }
            DateTime dt2 = DateTime.Now;
            Console.WriteLine(String.Format(FMT, "Regex Replace",
                s.Length, s2.Length, n, (dt2 - dt1).TotalMilliseconds));
        }

        private static void TestRegexReplaceOptimized(string s)
        {
            int n = ComputeRepetitions(s);
            Regex re = rxA;
            string s2 = null;
            DateTime dt1 = DateTime.Now;
            for (int i = 0; i < n; i++)
            {
                s2 = re.Replace(s, "");
            }
            DateTime dt2 = DateTime.Now;
            Console.WriteLine(String.Format(FMT, "Regex Replace Optimized",
                s.Length, s2.Length, n, (dt2 - dt1).TotalMilliseconds));
        }

        private static void TestRegexReplaceAlternate(string s)
        {
            int n = ComputeRepetitions(s);
            string s2 = null;
            DateTime dt1 = DateTime.Now;
            for (int i = 0; i < n; i++)
            {
                s2 = Regex.Replace(s, @"(?:\r|\n|\t)", "", RegexOptions.None);
            }
            DateTime dt2 = DateTime.Now;
            Console.WriteLine(String.Format(FMT, "Regex Replace A",
                s.Length, s2.Length, n, (dt2 - dt1).TotalMilliseconds));
        }

        private static void TestRegexReplaceOptimizedAlternate(string s)
        {
            int n = ComputeRepetitions(s);
            Regex re = rxB;
            string s2 = null;
            DateTime dt1 = DateTime.Now;
            for (int i = 0; i < n; i++)
            {
                s2 = re.Replace(s, "");
            }
            DateTime dt2 = DateTime.Now;
            Console.WriteLine(String.Format(FMT, "Regex Replace A Optimized",
                s.Length, s2.Length, n, (dt2 - dt1).TotalMilliseconds));
        }

        private static void CollectGarbage()
        {
            GC.Collect();
        }

        private static void Test(string s)
        {
            // Collect before each test so earlier garbage is not charged to the next one.
            CollectGarbage();
            TestStringReplace(s);
            CollectGarbage();
            TestStringBuilderReplace(s);
            CollectGarbage();
            TestRegexReplace(s);
            CollectGarbage();
            TestRegexReplaceOptimized(s);
            CollectGarbage();
            TestRegexReplaceAlternate(s);
            CollectGarbage();
            TestRegexReplaceOptimizedAlternate(s);
        }

        // Fewer repetitions for longer strings (1000 divided by the natural log of the length).
        public static int ComputeRepetitions(string s)
        {
            int n = Convert.ToInt32(1000 / Math.Log(s.Length));
            return n;
        }

        public static void Main(string[] args)
        {
            // Warm up the compiled regexes before timing anything.
            rxA.Replace("", "");
            rxB.Replace("", "");

            int[] whitespace = new int[] { 0, 5, 50, 95, 100 };
            int minsize = 3;
            int maxsize = 6;
            foreach (int percentage in whitespace)
            {
                Console.WriteLine("\r\n{0}% whitespace to replace\r\n", percentage);
                for (int i = minsize; i <= maxsize; i++)
                {
                    // String lengths 10^3 up to 10^6.
                    int length = Convert.ToInt32(Math.Pow(10, i));
                    string test = GenerateString(length, length, percentage);
                    Test(test);
                }
            }
            Console.ReadLine();
        }

        private static readonly char[] PossibleChars = new char[]
        {
            'a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z',
            'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z',
            '0','1','2','3','4','5','6','7','8',',','.','"','\'','!','?','-'
        };

        // Includes ' ', which none of the tests remove.
        private static readonly char[] PossibleWhitespaceChars = new char[]
        {
            ' ', '\r', '\n', '\t'
        };

        public static Random _random = new Random(DateTime.Now.Millisecond);

        public static char GenerateRandomCharacter(char[] allowedChars)
        {
            // Random.Next's upper bound is exclusive, so pass Length (not Length - 1)
            // to make the last element of the array reachable.
            int pos = _random.Next(allowedChars.Length);

            return allowedChars[pos];
        }

        public static string GenerateString(int minLength, int maxLength, int spaceChance)
        {
            int length = minLength + _random.Next(maxLength - minLength);
            StringBuilder sb = new StringBuilder(length);
            for (int i = 0; i < length; i++)
            {
                // The first and last characters are never whitespace.
                if (spaceChance != 0 && i != 0 && i != length - 1 &&
                    _random.Next(100) <= spaceChance)
                {
                    sb.Append(GenerateRandomCharacter(PossibleWhitespaceChars));
                }
                else
                {
                    sb.Append(GenerateRandomCharacter(PossibleChars));
                }
            }
            return sb.ToString();
        }
    }
}


=====================================================================

Kind Regards,

Jesse Houwing
 
