Regular Expression taking excessive CPU

G

Guest

I have a process where I do some minimal reformating on a TAB delimited
document to prepare for DTS load. This process has been running fine, but I
recently made a change. I have a Full Text index on one column, and
punctuation in the column was causing some problems down the line. This
column is used only for full text indexing, and otherwise ignored. I decided
to use the following regular expression to remove all punctuation (actually
anything but alphanumeric). This is the only change I made to my code, and
it is now taking more than 4 times as long as it used to. Why is this
regular expression adding so much time to the process, and is there a better
way to strip out all non-alphanumeric data?

ftIndex = Regex.Replace(ftIndex, "[^a-zA-Z0-9 ]",String.Empty);

ftIndex is a string variable that typically won't exceed 100 characters. It
is nothing more than keywords associated with the data that I am loading.
 
J

Jon Skeet [C# MVP]

Brian Kitt said:
I have a process where I do some minimal reformating on a TAB delimited
document to prepare for DTS load. This process has been running fine, but I
recently made a change. I have a Full Text index on one column, and
punctuation in the column was causing some problems down the line. This
column is used only for full text indexing, and otherwise ignored. I decided
to use the following regular expression to remove all punctuation (actually
anything but alphanumeric). This is the only change I made to my code, and
it is now taking more than 4 times as long as it used to. Why is this
regular expression adding so much time to the process, and is there a better
way to strip out all non-alphanumeric data?

ftIndex = Regex.Replace(ftIndex, "[^a-zA-Z0-9 ]",String.Empty);

ftIndex is a string variable that typically won't exceed 100 characters. It
is nothing more than keywords associated with the data that I am loading.

Okay, there are some problems here:

1) By using the static Regex.Replace method, you're making the
framework parse your pattern every time you call it.

2) You're also not using RegexOptions.Compile, which can help
performance

3) You don't need to use a regular expression at all.

Here's a test program:

using System;
using System.Text;
using System.Text.RegularExpressions;

public class Test
{
// Number of strings
const int Size = 1000;
// Length of string
const int Length = 100;
// How often to include a non-alphanumeric character
const double NonAlphaProportion = 0.01;

// How many iterations to run
const int Iterations = 1000;

// Non-alphanumeric characters to pick from (at random)
static readonly char[] NonAlphaChars =
".,!\"$%^&*()_+-=".ToCharArray();
// Alphanumeric characters to pick from (at random)
static readonly char[] AlphaChars =
("ABCDEFGHIJKLMNOPQRSTUVWXYZ"+
"abcedfghijklmnopqrstuvwxyz"+
"0123456789 ").ToCharArray();

static void Main()
{
string[] strings = GenerateTestData();

int total=0;
DateTime start = DateTime.Now;
for (int i=0; i < Iterations; i++)
{
foreach (string x in strings)
{
total += RemoveNonAlpha1(x).Length;
}
}
DateTime end = DateTime.Now;
Console.WriteLine ("Time taken: {0}", end-start);
Console.WriteLine ("Total length: {0}", total);
}

static string RemoveNonAlpha1(string x)
{
return Regex.Replace(x, "[^a-zA-Z0-9 ]", string.Empty);
}

static Regex regex = new Regex("[^a-zA-Z0-9 ]",
RegexOptions.Compiled);
static string RemoveNonAlpha2(string x)
{
return regex.Replace(x, string.Empty);
}

static string RemoveNonAlpha3(string x)
{
StringBuilder builder = new StringBuilder(x.Length);
foreach (char c in x)
{
if (((c >= '0' && c <= '9') ||
(c >= 'A' && c <= 'Z') ||
(c >= 'a' && c <= 'z') ||
c==' '))
{
builder.Append(c);
}
}
return builder.ToString();
}

static string RemoveNonAlpha4(string x)
{
bool foundNonAlpha = false;
foreach (char c in x)
{
if (!((c >= '0' && c <= '9') ||
(c >= 'A' && c <= 'Z') ||
(c >= 'a' && c <= 'z') ||
c==' '))
{
foundNonAlpha = true;
break;
}
}
if (!foundNonAlpha)
{
return x;
}
StringBuilder builder = new StringBuilder(x.Length);
foreach (char c in x)
{
if (((c >= '0' && c <= '9') ||
(c >= 'A' && c <= 'Z') ||
(c >= 'a' && c <= 'z') ||
c==' '))
{
builder.Append(c);
}
}
return builder.ToString();
}

static string[] GenerateTestData()
{
Random random = new Random(0);
string[] ret = new string[Size];

for (int i=0; i < Size; i++)
{
StringBuilder builder = new StringBuilder(Length);
for (int j=0; j < Length; j++)
{
char[] selection;

if (random.NextDouble() < NonAlphaProportion)
{
selection = NonAlphaChars;
}
else
{
selection = AlphaChars;
}
builder.Append (
selection[random.Next(selection.Length)]);
}
ret = builder.ToString();
}
return ret;
}
}

Here, the first version is what you've currently got.

The second version is a version using a cached, compiled regular
expression instance.

The third version goes through each character in the string in a hard-
coded manner, and appends each alpha-numeric character to a
StringBuilder.

The fourth version checks whether or not there's anything to trim
before even creating the StringBuilder.

Now, with the sample data from above, here's the timing on my machine:

1) 40 seconds
2) 10 seconds
3) 3.73 seconds
4) 3 seconds


Now, here's the timing when the proportion of non-alphanumeric
characters is changed to 5% instead of 1%:

1) 44 seconds
2) 12 seconds
3) 3.8 seconds
4) 3.86 seconds

Finally, when the proportion is changed to 0.1%:

1) 37 seconds
2) 9 seconds
3) 3.7 seconds
4) 1.3 seconds

So, as you can see, the performance depends on the data. However, I
would actually suggest you go for number 2 unless it's absolutely
performance critical. A regular expression is the simplest way of
expressing what you're after in this case, and the performance is a lot
better than with the first case.
 
G

Guest

Wow, thanks. That was an extremely more helpful explantion that I had even
hoped for. This process is not entirely CPU critical, it's just parsing a
series of files and preparing them for DTS, but it used to run in about an
hour, and is now taking nearly 5 hours. I expect to be processing much more
data in the future, so more than anything, it's just irritating that it takes
so long to process. I will change to option #2, that should be sufficient
for what I'm doing. I agree about the regular expression, in the "Old Days"
I would have hard coded it much like you did in Option 3 or 4, but I like to
use the tools we have to make our life simpler.

Jon Skeet said:
Brian Kitt said:
I have a process where I do some minimal reformating on a TAB delimited
document to prepare for DTS load. This process has been running fine, but I
recently made a change. I have a Full Text index on one column, and
punctuation in the column was causing some problems down the line. This
column is used only for full text indexing, and otherwise ignored. I decided
to use the following regular expression to remove all punctuation (actually
anything but alphanumeric). This is the only change I made to my code, and
it is now taking more than 4 times as long as it used to. Why is this
regular expression adding so much time to the process, and is there a better
way to strip out all non-alphanumeric data?

ftIndex = Regex.Replace(ftIndex, "[^a-zA-Z0-9 ]",String.Empty);

ftIndex is a string variable that typically won't exceed 100 characters. It
is nothing more than keywords associated with the data that I am loading.

Okay, there are some problems here:

1) By using the static Regex.Replace method, you're making the
framework parse your pattern every time you call it.

2) You're also not using RegexOptions.Compile, which can help
performance

3) You don't need to use a regular expression at all.

Here's a test program:

using System;
using System.Text;
using System.Text.RegularExpressions;

public class Test
{
// Number of strings
const int Size = 1000;
// Length of string
const int Length = 100;
// How often to include a non-alphanumeric character
const double NonAlphaProportion = 0.01;

// How many iterations to run
const int Iterations = 1000;

// Non-alphanumeric characters to pick from (at random)
static readonly char[] NonAlphaChars =
".,!\"$%^&*()_+-=".ToCharArray();
// Alphanumeric characters to pick from (at random)
static readonly char[] AlphaChars =
("ABCDEFGHIJKLMNOPQRSTUVWXYZ"+
"abcedfghijklmnopqrstuvwxyz"+
"0123456789 ").ToCharArray();

static void Main()
{
string[] strings = GenerateTestData();

int total=0;
DateTime start = DateTime.Now;
for (int i=0; i < Iterations; i++)
{
foreach (string x in strings)
{
total += RemoveNonAlpha1(x).Length;
}
}
DateTime end = DateTime.Now;
Console.WriteLine ("Time taken: {0}", end-start);
Console.WriteLine ("Total length: {0}", total);
}

static string RemoveNonAlpha1(string x)
{
return Regex.Replace(x, "[^a-zA-Z0-9 ]", string.Empty);
}

static Regex regex = new Regex("[^a-zA-Z0-9 ]",
RegexOptions.Compiled);
static string RemoveNonAlpha2(string x)
{
return regex.Replace(x, string.Empty);
}

static string RemoveNonAlpha3(string x)
{
StringBuilder builder = new StringBuilder(x.Length);
foreach (char c in x)
{
if (((c >= '0' && c <= '9') ||
(c >= 'A' && c <= 'Z') ||
(c >= 'a' && c <= 'z') ||
c==' '))
{
builder.Append(c);
}
}
return builder.ToString();
}

static string RemoveNonAlpha4(string x)
{
bool foundNonAlpha = false;
foreach (char c in x)
{
if (!((c >= '0' && c <= '9') ||
(c >= 'A' && c <= 'Z') ||
(c >= 'a' && c <= 'z') ||
c==' '))
{
foundNonAlpha = true;
break;
}
}
if (!foundNonAlpha)
{
return x;
}
StringBuilder builder = new StringBuilder(x.Length);
foreach (char c in x)
{
if (((c >= '0' && c <= '9') ||
(c >= 'A' && c <= 'Z') ||
(c >= 'a' && c <= 'z') ||
c==' '))
{
builder.Append(c);
}
}
return builder.ToString();
}

static string[] GenerateTestData()
{
Random random = new Random(0);
string[] ret = new string[Size];

for (int i=0; i < Size; i++)
{
StringBuilder builder = new StringBuilder(Length);
for (int j=0; j < Length; j++)
{
char[] selection;

if (random.NextDouble() < NonAlphaProportion)
{
selection = NonAlphaChars;
}
else
{
selection = AlphaChars;
}
builder.Append (
selection[random.Next(selection.Length)]);
}
ret = builder.ToString();
}
return ret;
}
}

Here, the first version is what you've currently got.

The second version is a version using a cached, compiled regular
expression instance.

The third version goes through each character in the string in a hard-
coded manner, and appends each alpha-numeric character to a
StringBuilder.

The fourth version checks whether or not there's anything to trim
before even creating the StringBuilder.

Now, with the sample data from above, here's the timing on my machine:

1) 40 seconds
2) 10 seconds
3) 3.73 seconds
4) 3 seconds


Now, here's the timing when the proportion of non-alphanumeric
characters is changed to 5% instead of 1%:

1) 44 seconds
2) 12 seconds
3) 3.8 seconds
4) 3.86 seconds

Finally, when the proportion is changed to 0.1%:

1) 37 seconds
2) 9 seconds
3) 3.7 seconds
4) 1.3 seconds

So, as you can see, the performance depends on the data. However, I
would actually suggest you go for number 2 unless it's absolutely
performance critical. A regular expression is the simplest way of
expressing what you're after in this case, and the performance is a lot
better than with the first case.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top