Jesper Stocholm
I have developed a data cleaner that extracts data from a database,
strips illegal/unwanted characters, and writes the result to a CSV file
for later insertion into a SQL Server 2000 database. My problem is that
it performs like an old, lame man :(
The method is:
public static StringBuilder RemoveChars(StringBuilder dataToClean_, string[] illegalChars_)
{
    // only try to remove chars if there is data to clean
    if (dataToClean_.Length > 0)
    {
        foreach (string s in illegalChars_)
        {
            // each config entry contains one or more literal \uXXXX escapes
            MatchCollection reg = Regex.Matches(s, @"\\u([0-9A-F]{4})");
            for (int i = 0; i < reg.Count; i++)
            {
                // note: the collection must be indexed - reg[i].Groups, not reg.Groups
                dataToClean_.Replace((char)int.Parse(reg[i].Groups[1].Value, NumberStyles.HexNumber), ' ');
            }
        }
    }
    return dataToClean_;
}
The illegal chars are defined in a config file and read into a
string array. They are defined in the config file as:
\u0000;
\u0009;
\u000A;
The config file is read using EnterpriseLibrary.
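For reference, the parsing step in isolation boils down to something like this (a minimal standalone sketch of the same `\uXXXX` parsing, not the production code; `ConfigCharParser` is just a name for the sketch):

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

static class ConfigCharParser
{
    // Parses one config entry such as @"\u0009" into the char it names.
    // The pattern \\u([0-9A-F]{4}) matches a literal backslash, 'u',
    // and four hex digits; the digits are converted via hex int.Parse.
    public static char Parse(string entry)
    {
        Match m = Regex.Match(entry, @"\\u([0-9A-F]{4})");
        return (char)int.Parse(m.Groups[1].Value, NumberStyles.HexNumber);
    }

    static void Main()
    {
        foreach (string entry in new[] { @"\u0000", @"\u0009", @"\u000A" })
            Console.WriteLine((int)Parse(entry)); // prints 0, 9 and 10
    }
}
```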
The problem is not that the method takes some time to run - the
problem is that each call takes longer than the one before, which
suggests to me that there is some string issue I have not taken
care of.
So my question is, roughly:
How do I clean a string of unwanted chars most efficiently?
Should I work on the individual bytes instead of using a
StringBuilder? The system produces roughly 1 GB of CSV files
on its largest job, so we really need to be able to clean
that amount of data as efficiently as possible.
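To make the question concrete, the character-level approach I have in mind is roughly this (an untested sketch; `BuildLookup` and `Clean` are made-up names, and the lookup table would be built once from the chars parsed out of the config):

```csharp
using System;
using System.Text;

static class CharCleaner
{
    // Builds a per-char lookup table once from the parsed illegal chars.
    public static bool[] BuildLookup(char[] illegal)
    {
        var table = new bool[char.MaxValue + 1];
        foreach (char c in illegal)
            table[c] = true;
        return table;
    }

    // Single pass over the data, replacing each illegal char with a
    // space (as the original method does).
    public static string Clean(string data, bool[] isIllegal)
    {
        var sb = new StringBuilder(data.Length);
        foreach (char c in data)
            sb.Append(isIllegal[c] ? ' ' : c);
        return sb.ToString();
    }

    static void Main()
    {
        bool[] lookup = BuildLookup(new[] { '\0', '\t', '\n' });
        Console.WriteLine(Clean("a\tb\nc", lookup)); // prints "a b c"
    }
}
```

The idea is that one pass with a precomputed lookup is O(n) in the data size, independent of how many illegal chars are configured.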
Any help will be greatly appreciated.