Cleaning data - performance issue

  • Thread starter Jesper Stocholm

Jesper Stocholm

I have developed a data-cleaner that extracts some data from a database,
cleans it of illegal/unwanted characters and writes it to a CSV file for
later insertion into a SQL Server 2000 database. My problem is that it
performs like an old, limping man :-(

The method is:

public static StringBuilder RemoveChars(StringBuilder dataToClean_, string[] illegalChars_)
{
    // only try to remove chars if there is data to clean
    if (dataToClean_.Length > 0)
    {
        foreach (string s in illegalChars_)
        {
            // find the "\uXXXX" escape in each config entry
            MatchCollection reg = Regex.Matches(s, @"\\u([0-9A-F]{4})");
            for (int i = 0; i < reg.Count; i++)
            {
                // index into the collection for the current match
                dataToClean_.Replace((char)int.Parse(reg[i].Groups[1].Value, NumberStyles.HexNumber), ' ');
            }
        }
    }
    return dataToClean_;
}

The illegal chars are defined in a config file and used as a
string array. They are defined in the config file as

\u0000;
\u0009;
\u000A;

The config file is read using EnterpriseLibrary.

The problem is not that the method takes some time to run - the
problem is that the time it takes increases with each call, which
indicates to me that there might be some string issue I have not
taken care of.

So my question is roughly:

How do I most efficiently clean a string of unwanted chars?
Should I work on the individual bytes instead of using a
StringBuilder? The system creates roughly 1 GB of CSV files
on its largest job, so we really need to be able to clean
this amount of data as efficiently as possible.

Any help will be greatly appreciated.

:-)
 

Jon Skeet [C# MVP]

Jesper Stocholm said:
<snip>


It strikes me that the principal problem here is that you're parsing
the illegal characters every time you call the method. You should have
one method which converts the list of illegal characters from a string
array into a char array, and then you can reuse that char array each
time you call the method.

I expect that will be *much* faster than using regular expressions on
each iteration.

One of the key things to spot here is that you're doing the same work
every time the method is called - you're matching the same strings with
the same regular expressions each time. Any time you're looking for
performance gains and you find yourself duplicating effort, that's
somewhere to start.
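
Jon's suggestion might look roughly like this - a sketch only, with illustrative names (ParseIllegalChars is not from the original post), assuming the same "\uXXXX;" entry format as the config values quoted above:

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

class CharCleaner
{
    // Do the regex work once, up front: turn the "\uXXXX"-style config
    // entries into a plain char array that can be reused on every call.
    public static char[] ParseIllegalChars(string[] illegalChars)
    {
        Regex pattern = new Regex(@"\\u([0-9A-F]{4})");
        List<char> result = new List<char>();
        foreach (string entry in illegalChars)
        {
            Match m = pattern.Match(entry);
            if (m.Success)
            {
                result.Add((char)int.Parse(m.Groups[1].Value, NumberStyles.HexNumber));
            }
        }
        return result.ToArray();
    }

    // The per-call work is now just one StringBuilder.Replace per char -
    // no regex matching or hex parsing inside the hot path.
    public static StringBuilder RemoveChars(StringBuilder dataToClean, char[] illegalChars)
    {
        foreach (char c in illegalChars)
        {
            dataToClean.Replace(c, ' ');
        }
        return dataToClean;
    }
}
```

The idea is to call ParseIllegalChars once at startup and pass the resulting char array into RemoveChars for every row that needs cleaning.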
 

Nicholas Paldino [.NET/C# MVP]

Another thing here: a new Regex object is being created internally on
every iteration of the loop. The OP would see much better performance by
creating one Regex instance up front and setting the options on it to
pre-compile the expression. The performance would probably increase
dramatically.
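
A minimal sketch of that change, assuming the same escape format as the config entries above (the class and method names are illustrative, not from the thread):

```csharp
using System.Text.RegularExpressions;

class EscapeParser
{
    // One shared instance, compiled once, instead of the implicit
    // per-call parse done by the static Regex.Matches(...) overload.
    private static readonly Regex EscapePattern =
        new Regex(@"\\u([0-9A-F]{4})", RegexOptions.Compiled);

    public static MatchCollection FindEscapes(string configEntry)
    {
        return EscapePattern.Matches(configEntry);
    }
}
```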

Hope this helps.


--
- Nicholas Paldino [.NET/C# MVP]
- (e-mail address removed)


Jesper Stocholm

<snip>

Hi guys,

I did as suggested and moved the parsing of the chars out of the method
itself; I now pass the parsed char array in as a parameter. I have also
dropped the check on StringBuilder.Length, since it was not really needed.

The method now works as intended and memory consumption is moderate.

Thanks for your input.

:-)
 
