Help with replacement pattern

Flomo Togba Kwele

I'm looking to replace all commas in a single string not contained within a pair of double-quotes,
with a tab. I don't know where to begin.

e.g., string line = "a,",,"b" would be changed to "a,"\t\t"b"

Can someone suggest a regex pattern to use, or any other way that will accomplish this?

TIA Flomo
--
 
Niels Ull

You want to replace every comma that is preceded by an even number of
double quotes. Try a look-behind pattern for that, e.g. something like

Regex.Replace(str, @"(?<=^([^""]*""[^""]*""[^""]*)*),", "\t");
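
For anyone who wants to try the look-behind idea, here is a minimal sketch as a complete program. The names LookBehindDemo and CommasToTabs are made up for illustration. One assumption to flag: the pattern below is a slight variant of the one above, with the first [^"]* moved outside the repeated group. As far as I can tell, the pattern as posted only matches a comma at the very start of the string or after at least one complete quoted pair, so a comma in a string like a,b would be left alone; the variant also accepts zero quotes before the comma.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookBehindDemo
{
    // Variant of the pattern above (an assumption, not the original):
    // a comma matches only if the text before it contains an even
    // number of double quotes, including zero.
    private static readonly Regex CommaOutsideQuotes =
        new Regex(@"(?<=^[^""]*(?:""[^""]*""[^""]*)*),");

    public static string CommasToTabs(string line)
    {
        return CommaOutsideQuotes.Replace(line, "\t");
    }

    public static void Main()
    {
        // The example from the question: "a,",,"b"
        string line = "\"a,\",,\"b\"";
        // Show each tab as the two characters \t so the result is visible.
        Console.WriteLine(CommasToTabs(line).Replace("\t", "\\t"));
    }
}
```

With the example from the question this should print "a,"\t\t"b", leaving the comma inside the quotes untouched.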
 
Jialiang Ge [MSFT]

Hello Flomo,

From your post, my understanding of the issue is that you want to use a
regex to replace the commas that are not contained within a pair of
quotes. If I'm off base, please feel free to let me know.

You can use the following regular expression to replace the commas. (Regex
is not the approach I recommend for this problem, though; see the
performance comparison at the end of my reply.)
(?<head>".*?")*(?<remove>,*)(?<tail>".*?")*
Here is an explanation:
The first (".*?")* matches any quoted pairs in front of the commas to be
replaced.
The last (".*?")* matches any quoted pairs behind the commas.
Once the quoted pairs are matched, any commas left between them are
replaced with '\t'.

The complete C# code is listed below:
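using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        string test = "\"a,,,,,\",\"j,dd,\"b\",\",\"";
        Regex regex = new Regex("(?<head>\".*?\")*(?<remove>,*)(?<tail>\".*?\")*");
        MatchEvaluator myEvaluator = new MatchEvaluator(Program.ReplaceFunction);
        Console.WriteLine(regex.Replace(test, myEvaluator));
    }

    // Rebuilds each match, turning only the commas captured by "remove" into tabs.
    // Caveat: for a repeated group, Groups[...].Value holds only the last capture,
    // so back-to-back quoted pairs with nothing between them are not all kept.
    public static string ReplaceFunction(Match m)
    {
        return m.Groups["head"].Value
            + m.Groups["remove"].Value.Replace(',', '\t')
            + m.Groups["tail"].Value;
    }
}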

An alternative is to operate directly on the chars of the string. By
iterating over the characters once, the task can be done in O(n), where n
is the length of the string.
string test = "\"a,,,,,\",\"j,dd,\"b\",\",\"";
char[] str = test.ToCharArray();
bool isInQuotes = false;
for (int i = 0; i < str.Length; i++)
{
    // A comma outside quotes becomes a tab.
    if (!isInQuotes && str[i] == ',')
    {
        str[i] = '\t';
        continue;
    }
    // Every double quote toggles the "inside quotes" state.
    if (str[i] == '\"')
        isInQuotes = !isInQuotes;
}
Console.WriteLine(str);
In the code above, isInQuotes is a flag indicating whether the current
char is contained within a pair of quotes. If isInQuotes is false and the
char is a comma, then we should replace it with a '\t'.
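
For reuse, the same flag-based scan can be wrapped in a method. This is only a sketch; the names CharScanDemo and CommasToTabs are hypothetical, and the second input is an assumed CSV-style example with doubled quotes that was not part of the original question.

```csharp
using System;

public static class CharScanDemo
{
    // Replaces every comma that is outside a pair of double quotes with a tab.
    public static string CommasToTabs(string line)
    {
        char[] chars = line.ToCharArray();
        bool isInQuotes = false;
        for (int i = 0; i < chars.Length; i++)
        {
            if (chars[i] == '"')
                isInQuotes = !isInQuotes;   // each quote toggles the state
            else if (chars[i] == ',' && !isInQuotes)
                chars[i] = '\t';
        }
        return new string(chars);
    }

    public static void Main()
    {
        // The example from the question: "a,",,"b"
        Console.WriteLine(CommasToTabs("\"a,\",,\"b\"").Replace("\t", "\\t"));
        // Assumed CSV-style input with doubled quotes: "say ""hi"", ok",x
        Console.WriteLine(CommasToTabs("\"say \"\"hi\"\", ok\",x").Replace("\t", "\\t"));
    }
}
```

A doubled quote toggles the flag twice, so the state after it is unchanged and the comma after ""hi"" is still treated as being inside the quotes.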

Here is a performance comparison of the two approaches.
I ran both methods 100000 times on the test string:
string test = "\"a,,,,,\",\"j,dd,\"b\",\",\"";
The regex took 6380ms, while the char-based method took only 39ms.
Therefore, I recommend the latter.
Regex is useful in complicated cases such as matching an email address,
but it can be resource-intensive. When a problem can be solved in a single
pass over the string, operating directly on the chars is the better
choice.

Please feel free to let me know if you have any other concern.

Sincerely,
Jialiang Ge ([email protected], remove 'online.')
Microsoft Online Community Support

This posting is provided "AS IS" with no warranties, and confers no rights.
 
Jialiang Ge [MSFT]

Hello Niels,

Thank you for the suggestion.

Hello Flomo,

Niels's regex is a third way to solve the problem, but it is also much
slower than working on the chars directly.
I ran a test to compare the three methods (see Code Listing 1).
The result is:
#direct operation on chars: 39ms.
#my regex: 6280ms.
#Niels's regex: 8997ms.
Therefore, I think the direct operation on chars is the best approach so
far.

Code Listing 1:
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        string test = "\"a,,,,,\",\"j,dd,\"b\",\",\"";

        Test1(test);
        Test2(test);
        Test3(test);
    }

    // Direct operation on chars.
    public static void Test1(string test)
    {
        Stopwatch sw = new Stopwatch();
        sw.Start();

        for (int times = 0; times < 100000; times++)
        {
            char[] str = test.ToCharArray();
            bool isInQuotes = false;
            for (int i = 0; i < str.Length; i++)
            {
                if (!isInQuotes && str[i] == ',')
                {
                    str[i] = '\t';
                    continue;
                }
                if (str[i] == '\"')
                    isInQuotes = !isInQuotes;
            }
            //Console.WriteLine(str);
        }
        sw.Stop();
        Console.WriteLine(sw.ElapsedMilliseconds);
    }

    // Niels's look-behind regex.
    public static void Test3(string test)
    {
        Stopwatch sw = new Stopwatch();
        sw.Start();

        for (int times = 0; times < 100000; times++)
        {
            Regex regex = new Regex("(?<=^([^\"]*\"[^\"]*\"[^\"]*)*),");
            regex.Replace(test, "\t");
        }
        sw.Stop();
        Console.WriteLine(sw.ElapsedMilliseconds);
    }

    // Regex with named groups and a MatchEvaluator.
    public static void Test2(string test)
    {
        Stopwatch sw = new Stopwatch();
        sw.Start();

        for (int times = 0; times < 100000; times++)
        {
            Regex regex = new Regex("(?<head>\".*?\")*(?<remove>,*)(?<tail>\".*?\")*");
            MatchEvaluator myEvaluator = new MatchEvaluator(Program.ReplaceFunction);
            regex.Replace(test, myEvaluator);
        }
        sw.Stop();
        Console.WriteLine(sw.ElapsedMilliseconds);
    }

    public static string ReplaceFunction(Match m)
    {
        return m.Groups["head"].Value
            + m.Groups["remove"].Value.Replace(',', '\t')
            + m.Groups["tail"].Value;
    }
}
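
One thing to keep in mind when reading the regex timings: in the listing, Test2 and Test3 construct a new Regex inside the timed loop, so each iteration also pays for building the Regex object. Here is a rough sketch of timing the replacement with a single reused instance instead. The names ReuseRegexDemo and TimeReplace are hypothetical, and I am not claiming any particular numbers for it; it only separates construction from matching.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class ReuseRegexDemo
{
    // Constructed once, outside the timed loop (same pattern as Test3).
    private static readonly Regex LookBehind =
        new Regex("(?<=^([^\"]*\"[^\"]*\"[^\"]*)*),");

    // Returns the elapsed milliseconds for the given number of replacements.
    public static long TimeReplace(string input, int iterations)
    {
        Stopwatch sw = Stopwatch.StartNew();
        for (int times = 0; times < iterations; times++)
        {
            LookBehind.Replace(input, "\t");
        }
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    public static void Main()
    {
        string test = "\"a,,,,,\",\"j,dd,\"b\",\",\"";
        Console.WriteLine(TimeReplace(test, 100000));
    }
}
```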

Sincerely,
Jialiang Ge ([email protected], remove 'online.')
Microsoft Online Community Support

 
Jialiang Ge [MSFT]

Hi Flomo,

Would you mind letting me know the result of the suggestions? If you need
further assistance, feel free to let me know. I will be more than happy to
be of assistance.

Have a great day!

Sincerely,
Jialiang Ge ([email protected], remove 'online.')
Microsoft Online Community Support

 
