Help with replacement pattern

Flomo Togba Kwele

I'm looking to replace all commas in a single string not contained within a pair of double-quotes,
with a tab. I don't know where to begin.

e.g., string line = "a,",,"b" would be changed to "a,"\t\t"b"

Can someone suggest a regex pattern to use, or any other way that will accomplish this?

TIA Flomo
--
 
You want to replace all commas preceded by an even number of double quotes.
Try using a look-behind pattern for that, e.g. something like

Regex.Replace(str, @"(?<=^([^""]*""[^""]*""[^""]*)*),", "\t");
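For anyone who wants to try the look-behind idea as a complete program, here is a minimal sketch. Note that it uses a slightly different pattern from the one above, with an extra leading [^""]* so that commas appearing before the first quote (as in x,"y,z",w) are matched too. The class name CommaToTab and the method name Replace are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class CommaToTab
{
    // Look-behind: from the start of the string, allow unquoted text, then
    // any number of complete quoted sections, each followed by more unquoted
    // text. A comma matches only when everything before it has that shape,
    // i.e. when it is preceded by an even number of double quotes.
    private static readonly Regex OutsideQuotes =
        new Regex(@"(?<=^[^""]*(?:""[^""]*""[^""]*)*),");

    public static string Replace(string line)
    {
        return OutsideQuotes.Replace(line, "\t");
    }

    static void Main()
    {
        // The example from the original question: "a,",,"b"
        Console.WriteLine(Replace("\"a,\",,\"b\""));
        // A line whose first field is not quoted: x,"y,z",w
        Console.WriteLine(Replace("x,\"y,z\",w"));
    }
}
```

In both inputs the commas inside the quotes are left alone and the ones outside become tabs.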
 
Hello Flomo,

From your post, my understanding is that you want to use a regex to
replace the commas that are not contained within a pair of quotes. If I'm
off base, please feel free to let me know.

You can use the following regular expression to replace the commas.
(A regex is not the approach I recommend for this problem, though; see the
performance comparison at the end of my reply.)
(?<head>".*?")*(?<remove>,*)(?<tail>".*?")*
Here is an explanation:
The first (".*?")* matches any quoted sections in front of the commas to
be replaced.
The last (".*?")* matches any quoted sections after the commas.
Once the quoted sections have been matched, the commas captured between
them are replaced with '\t'.

The complete C# code is listed below:
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        string test = "\"a,,,,,\",\"j,dd,\"b\",\",\"";
        Regex regex = new Regex("(?<head>\".*?\")*(?<remove>,*)(?<tail>\".*?\")*");
        MatchEvaluator myEvaluator = new MatchEvaluator(Program.ReplaceFunction);
        Console.WriteLine(regex.Replace(test, myEvaluator));
    }

    // Rebuilds each match, turning the commas in the "remove" group into tabs.
    public static string ReplaceFunction(Match m)
    {
        return m.Groups["head"].Value
            + m.Groups["remove"].Value.Replace(',', '\t')
            + m.Groups["tail"].Value;
    }
}

An alternative way to accomplish the task is to operate purely on the chars
of the string. By iterating over the characters in the string, the task can
be done in O(n), where n is the length of the string.
string test = "\"a,,,,,\",\"j,dd,\"b\",\",\"";
char[] str = test.ToCharArray();
bool isInQuotes = false;
for (int i = 0; i < str.Length; i++)
{
    // A comma outside quotes becomes a tab.
    if (!isInQuotes && str[i] == ',')
    {
        str[i] = '\t';
        continue;
    }
    // Every double quote toggles the in-quotes state.
    if (str[i] == '\"')
        isInQuotes = !isInQuotes;
}
Console.WriteLine(str);
In the code above, isInQuotes is a flag indicating whether the current
char is contained within a pair of quotes. If isInQuotes is false and the
char is a comma, then we should replace it with a '\t'.
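
As a usage example, the same flag-based scan can be wrapped in a small method and run on the string from the original question. This is only a sketch; the names QuoteAwareScan and CommasToTabs are made up for illustration.

```csharp
using System;

static class QuoteAwareScan
{
    // Returns a copy of the line in which every comma outside double quotes
    // has been replaced by a tab.
    public static string CommasToTabs(string line)
    {
        char[] chars = line.ToCharArray();
        bool isInQuotes = false;
        for (int i = 0; i < chars.Length; i++)
        {
            if (chars[i] == '"')
                isInQuotes = !isInQuotes;   // each quote toggles the state
            else if (chars[i] == ',' && !isInQuotes)
                chars[i] = '\t';            // comma outside quotes
        }
        return new string(chars);
    }

    static void Main()
    {
        // The example from the original question: "a,",,"b"
        Console.WriteLine(CommasToTabs("\"a,\",,\"b\""));
    }
}
```

Because every quote toggles the flag, a doubled quote inside a quoted section toggles it twice and leaves the state unchanged.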

Here is a performance comparison of the two approaches.
I ran both methods 100000 times on the test string:
string test = "\"a,,,,,\",\"j,dd,\"b\",\",\"";
The regex version took 6380ms, while the char-based version took only 39ms.
Therefore, I recommend the latter.
Regex is useful in more complicated cases, such as matching email
addresses, but it can be resource-consuming. When a problem can be solved
in a single pass over the string, operating directly on the chars is
recommended.

Please feel free to let me know if you have any other concerns.

Sincerely,
Jialiang Ge ([email protected], remove 'online.')
Microsoft Online Community Support

This posting is provided "AS IS" with no warranties, and confers no rights.
 
Hello Niels,

Thank you for the suggestion.

Hello Flomo,

Niels's regex is a third way to solve the problem, but it also suffers
from poor performance.
I ran a test to compare the three methods (see Code Listing 1).
The results are:
#direct operation on chars: 39ms.
#my regex: 6280ms.
#Niels's regex: 8997ms.
Therefore, I think the direct operation on chars is the best approach so far.

Code Listing 1:
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        string test = "\"a,,,,,\",\"j,dd,\"b\",\",\"";

        Test1(test);
        Test2(test);
        Test3(test);
    }

    // Direct operation on chars.
    public static void Test1(string test)
    {
        Stopwatch sw = new Stopwatch();
        sw.Start();

        for (int times = 0; times < 100000; times++)
        {
            char[] str = test.ToCharArray();
            bool isInQuotes = false;
            for (int i = 0; i < str.Length; i++)
            {
                if (!isInQuotes && str[i] == ',')
                {
                    str[i] = '\t';
                    continue;
                }
                if (str[i] == '\"')
                    isInQuotes = !isInQuotes;
            }
            //Console.WriteLine(str);
        }
        sw.Stop();
        Console.WriteLine(sw.ElapsedMilliseconds);
    }

    // Niels's look-behind regex. The Regex is constructed on every
    // iteration, so construction cost is included in the timing.
    public static void Test3(string test)
    {
        Stopwatch sw = new Stopwatch();
        sw.Start();

        for (int times = 0; times < 100000; times++)
        {
            Regex regex = new Regex("(?<=^([^\"]*\"[^\"]*\"[^\"]*)*),");
            regex.Replace(test, "\t");
        }
        sw.Stop();
        Console.WriteLine(sw.ElapsedMilliseconds);
    }

    // The head/remove/tail regex. As in Test3, the Regex is constructed on
    // every iteration.
    public static void Test2(string test)
    {
        Stopwatch sw = new Stopwatch();
        sw.Start();

        for (int times = 0; times < 100000; times++)
        {
            Regex regex = new Regex("(?<head>\".*?\")*(?<remove>,*)(?<tail>\".*?\")*");
            MatchEvaluator myEvaluator = new MatchEvaluator(Program.ReplaceFunction);
            regex.Replace(test, myEvaluator);
        }
        sw.Stop();
        Console.WriteLine(sw.ElapsedMilliseconds);
    }

    public static string ReplaceFunction(Match m)
    {
        return m.Groups["head"].Value
            + m.Groups["remove"].Value.Replace(',', '\t')
            + m.Groups["tail"].Value;
    }
}
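
Both regex tests in Code Listing 1 construct a new Regex object on every iteration, so their timings include the cost of parsing the pattern each time. A possible variation, shown here only as a sketch, is to build the Regex once and reuse it. The class and method names are made up for illustration, RegexOptions.Compiled is an optional choice, and no timing result is claimed: the char loop may well remain faster.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

static class CachedRegexTiming
{
    // Same pattern as Test3, but constructed once and reused.
    private static readonly Regex LookBehind =
        new Regex("(?<=^([^\"]*\"[^\"]*\"[^\"]*)*),", RegexOptions.Compiled);

    // Runs the replacement the given number of times and returns the
    // elapsed time in milliseconds.
    public static long TimeCachedRegex(string test, int iterations)
    {
        Stopwatch sw = Stopwatch.StartNew();
        for (int times = 0; times < iterations; times++)
        {
            LookBehind.Replace(test, "\t");
        }
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    static void Main()
    {
        string test = "\"a,,,,,\",\"j,dd,\"b\",\",\"";
        Console.WriteLine(TimeCachedRegex(test, 100000));
    }
}
```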

Sincerely,
Jialiang Ge ([email protected], remove 'online.')
Microsoft Online Community Support

=================================================
When responding to posts, please "Reply to Group" via your newsreader
so that others may learn and benefit from your issue.
=================================================
 
Hi Flomo,

Would you mind letting me know the result of the suggestions? If you need
further assistance, feel free to let me know. I will be more than happy to
be of assistance.

Have a great day!

Sincerely,
Jialiang Ge ([email protected], remove 'online.')
Microsoft Online Community Support
