Cleaning data - performance issue

  • Thread starter Jesper Stocholm

Jesper Stocholm

I have developed a data-cleaner that extracts some data from a database,
cleans it of illegal/unwanted characters and writes it to a CSV file for
later insertion into a SQL Server 2000 database. My problem is that it
performs like an old, limping man :-(

The method is:

public static StringBuilder RemoveChars(StringBuilder dataToClean_, string[] illegalChars_)
{
    // only try to remove chars if there is data to clean
    if (dataToClean_.Length > 0)
    {
        foreach (string s in illegalChars_)
        {
            // find the "\uXXXX" escape in each config entry
            MatchCollection reg = Regex.Matches(s, @"\\u([0-9A-F]{4})");
            for (int i = 0; i < reg.Count; i++)
            {
                // index into the collection for the current match
                dataToClean_.Replace((char)int.Parse(reg[i].Groups[1].Value, NumberStyles.HexNumber), ' ');
            }
        }
    }
    return dataToClean_;
}

The illegal chars are defined in a config file and used as a
string array. They are defined in the config file as

\u0000;
\u0009;
\u000A;

The config file is read using EnterpriseLibrary.

The problem is not that the method takes some time to run - the
problem is that the time it takes increases with each call, which
indicates to me that there might be some string issue I have not
taken care of.

So my question is roughly:

How do I most efficiently clean a string of unwanted chars?
Should I work on the individual bytes instead of using a
StringBuilder? The system creates roughly 1 GB of CSV files
on its largest job, so we really need to be able to clean
this amount of data as efficiently as possible.

Any help will be greatly appreciated.

:-)
 

Jon Skeet [C# MVP]

Jesper Stocholm said:
<snip>


It strikes me that the principal problem here is that you're parsing
the illegal characters every time you call the method. You should have
one method which converts the list of illegal characters from a string
array into a char array, and then you can reuse that char array each
time you call the method.

I expect that will be *much* faster than using regular expressions on
each iteration.

One of the key things to spot here is that you're doing the same work
every time the method is called - you're matching the same strings with
the same regular expressions each time. Any time you're looking for
performance gains and you find yourself duplicating effort, that's
somewhere to start.
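
Jon's suggestion might look roughly like this - a sketch only, with illustrative names (ParseIllegalChars is not from the original post), assuming the same "\uXXXX;" entry format as the config values quoted above:

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

class CharCleaner
{
    // Do the regex work once, up front: turn the "\uXXXX"-style config
    // entries into a plain char array that can be reused on every call.
    public static char[] ParseIllegalChars(string[] illegalChars)
    {
        Regex pattern = new Regex(@"\\u([0-9A-F]{4})");
        List<char> result = new List<char>();
        foreach (string entry in illegalChars)
        {
            Match m = pattern.Match(entry);
            if (m.Success)
            {
                result.Add((char)int.Parse(m.Groups[1].Value, NumberStyles.HexNumber));
            }
        }
        return result.ToArray();
    }

    // The per-call work is now just one StringBuilder.Replace per char -
    // no regex matching or hex parsing inside the hot path.
    public static StringBuilder RemoveChars(StringBuilder dataToClean, char[] illegalChars)
    {
        foreach (char c in illegalChars)
        {
            dataToClean.Replace(c, ' ');
        }
        return dataToClean;
    }
}
```

The idea is to call ParseIllegalChars once at startup and pass the resulting char array into RemoveChars for every row that needs cleaning.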
 

Nicholas Paldino [.NET/C# MVP]

Another thing here: a new Regex object is being created internally on
every iteration of the loop. The OP would see much better performance by
creating one Regex instance up front and setting the options on it to
pre-compile the expression. The performance would probably increase
dramatically.
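
A minimal sketch of that change, assuming the same escape format as the config entries above (the class and method names are illustrative, not from the thread):

```csharp
using System.Text.RegularExpressions;

class EscapeParser
{
    // One shared instance, compiled once, instead of the implicit
    // per-call parse done by the static Regex.Matches(...) overload.
    private static readonly Regex EscapePattern =
        new Regex(@"\\u([0-9A-F]{4})", RegexOptions.Compiled);

    public static MatchCollection FindEscapes(string configEntry)
    {
        return EscapePattern.Matches(configEntry);
    }
}
```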

Hope this helps.


--
- Nicholas Paldino [.NET/C# MVP]
- (e-mail address removed)


Jesper Stocholm

<snip>

Hi guys,

I did as suggested and moved the parsing of the chars out of the method
itself; I now pass the parsed char array in as a parameter. I have also
dropped the check on StringBuilder.Length, since it was not really needed.

The method now works as intended and memory consumption is moderate.

Thanks for your input.

:-)
 
