Replace strings in a text file and get the number of replacements made

S

Smith

Hi

Usually I only replace strings with text = text.Replace("old text", "new
text");

Now I need to display the number of replacements made. Is there an easy way
or do I need some custom replacement method?
 
V

vanderghast

String a= ... ;
int n = (a.Length-a.Replace("old text", "").Length)/("old text".Length);



should do.


Vanderghast, Access MVP
 
P

Peter Duniho

Hi

Usually I only replace strings with text = text.Replace("old text", "new
text");

Now I need to display the number of replacements made. Is there an easy
way
or do I need some custom replacement method?

As Vanderghast suggests, as long as the new text is always a different
length than the original, you can simply look at the difference in length
of the modified string, which will be an exact multiple of the difference
in length between the searched-for text and the replacement text.

If you don't have that guarantee -- that is, the new text could be the
same length as the original -- then an alternative approach would be to
use the Regex class. Normally I'd say for simple replacement it's
overkill, but one thing it provides is an API that returns a list of
matched sites in your original string; the length of that list is exactly
the number you're looking for.

Custom code to do the replacement would be most efficient, but Regex won't
be awful, it's already there written for you, and if this isn't a
bottleneck in your code it won't matter anyway.

Pete
 
V

vanderghast

I should have specified that you use String.Empty, NOT the new REPLACING
string, in the replace statement, so the result of

myString.Replace(frament, String.Empty)


will be the original myString shortenend by frament.Length each time
fragment can be ***consumed*** from myString.


Sure,that assumes you use the convention that "ohoh" occurs only two times
in "ohohohohoh", not four times (***consuming*** assumption, rather than
***matching*** assumption).


Vanderghast, Access MVP
 
J

Jesse Houwing

Hello Peter,

Looping through the string in a stringbuilder is probably the safest way
to do this:

string input = "bla bla bla bla bla bla blabla";
string search = "bla";
string replacement = "bli";
StringBuilder sb = new StringBuilder(input);

int count = 0;
for (int i = 0; i < sb.Length; )
{
if (AreEqual(sb, search, i))
{
sb.Remove(i, search.Length);
sb.Insert(i, replacement);
i += replacement.Length;
count++;
}
else
{
i++;
}
}
Console.WriteLine(count);
Console.WriteLine(sb.ToString());

static bool AreEqual(StringBuilder sb, string val, int pos)
{
for (int i = 0; i < val.Length; i++)
{
if (sb[pos + i] != val)
{
return false;
}
}
return true;
}

It might be faster to use

sb.Replace(search, replacement, i, search.Length)

instead of a sb.Remove, sb.Insert, I'm not sure, but they won't differ that
much.

It is probably a lot faster than using a regex, though I haven't done any
measurements.

Jesse
 
P

Peter Duniho

Looping through the string in a stringbuilder is probably the safest way
to do this:

[...]
int count = 0;
for (int i = 0; i < sb.Length; )
{
if (AreEqual(sb, search, i))
{
sb.Remove(i, search.Length);
sb.Insert(i, replacement);
i += replacement.Length;
count++;
}
else
{
i++;
}
}
[...]

It might be faster to use sb.Replace(search, replacement, i,
search.Length)

instead of a sb.Remove, sb.Insert, I'm not sure, but they won't differ
that much.

If you simply call StringBuilder.Replace(string, string, int, int) instead
of having your own AreEqual() method followed by a call to Remove() and
Insert(), the performance should be practically identical, but you
wouldn't get any information about how many replacements occurred.

Alternatively, if you still call AreEqual() and then call
StringBuilder.Replace(string, string, int, int), you're duplicating effort
(which costs performance), because StringBuilder.Replace(string, string,
int, int) has to actually do the string comparison again. That would
actually be _slower_ than your original code.

You could do a little hack by searching for the first character that
differs between the search and replacement strings (as an initialization,
not as part of the loop), and then bumping a counter after each call to
StringBuilder.Replace() based on whether the character at the same offset
within the current StringBuilder has changed. That would be only slightly
slower than just calling StringBuilder.Replace(string, string, int, int),
but would include the count.

That said, I would hope that any Replace() method in .NET, including
Regex.Replace(), String.Replace(), or StringBuilder.Replace() would be
faster than the code you posted. The main reason being that all of those
methods have the opportunity to optimize the construction of the new
string, whereas your example doesn't optimize at all.

At the very least, I would not use the Remove()/Insert() pattern you've
shown. Instead, I would use a String as input, and a StringBuilder as
output, appending text segments to the output StringBuilder as I scan the
input String. That way the code avoids having to repeatedly shift your
character buffer in the StringBuilder (which happens _twice_ for each
replacement in your code). That's exactly the kind of optimization I'd
expect to find inside the .NET classes.

It might even be worthwhile to defer creation of the output StringBuilder
until you detect the first match that needs to be replaced, if there's an
expectation that for a significant frequency of input, no replacements
would be needed.
It is probably a lot faster than using a regex, though I haven't done
any measurements.

I would expect Regex to be on par with other explicit mechanisms like
that, especially given the need to count the replacements (which for
non-Regex solutions requires replacing the search text twice). If
performance is an issue, then a "scan and build" approach as I suggest
above is probably slightly faster than using built-in Replace() methods
simply because you can incorporate a count into the replacement logic.

All that said, if performance is an issue (and there's nothing in the OP
to suggest it is), the only way to know for sure what the best solution is
would be to try the different alternatives and measure them. Even
theoretical advantages and disadvantages may be irrelevant for a typical
data set, and intuition is a terrible way to measure performance. :)

For best performance, it may be that none of the suggestions offered so
far are probably suitable. There's an optimized text search algorithm,
the name of which I can't recall at the moment, that can probably be
adapted, but if not then a degenerate state-graph implementation (since
there's only one string to search for) would probably work too. Either
approach would avoid having to keep performing full string comparisons at
each character index in the original string (consider an original string
"aaaaaaaaaaaaaaaaaaaaaaaaaa" where you want to replace all occurrences of
"aaaaaab" with something :) ).

But even there, as I said, there's no way to know for sure without
measuring. Performance of the various choices is to some extent going to
be data dependent; liabilities that exist in the general case might not
really be that much of a problem. For example, if dealing with
essentially random data, it's not too terrible to just keep comparing over
and over at an incremented index, because those comparisons will normally
terminate quickly when there's no match.

In other words, even the theoretically worst-case implementation might not
turn out to be much different than the more optimized ones.

Until there's a performance problem shown, the OP should stick with
whatever solution _reads_ the best, and is the most maintainable. And if
there is a performance problem shown, measuring each viable alternative is
the only way to know for sure which will be fastest.

Pete
 
J

Jesse Houwing

Hello Peter,

Agreed on the readability part, but using regex.replace opens up a new can
of worms, which people aren't usually prepared for. Say this search/replace
action can be entered from the UI, then adding . or * or { into your search
pattern can lead to unexpected behaviour, or worse a regex parse error. The
regex will also be expensive, because it will have to be parsed/compiled
every time a new pattern is used (and if it is a user defined replacement,
that would be more often than not).

So this would have to be extended with a Regex.Escape call first. The same
applies for the replacement pattern. Say I want to search $2 and replace
it with $0.1 you'd get funny things... ($2.1 actually)... So it isn't just
using a different call to get the same results.

That said, I'd opt for an extention method on string and write an efficient
version (could use mine as an example) of a Replace method that returns the
number of matches. And from that moment on, use that. Just as readable (or
even better) and no crazy unexpected regex problems due to not exactly understanding
what is involved.

Jesse
Looping through the string in a stringbuilder is probably the safest
way to do this:

[...]
int count = 0;
for (int i = 0; i < sb.Length; )
{
if (AreEqual(sb, search, i))
{
sb.Remove(i, search.Length);
sb.Insert(i, replacement);
i += replacement.Length;
count++;
}
else
{
i++;
}
}
[...]
It might be faster to use sb.Replace(search, replacement, i,
search.Length)

instead of a sb.Remove, sb.Insert, I'm not sure, but they won't
differ that much.
If you simply call StringBuilder.Replace(string, string, int, int)
instead of having your own AreEqual() method followed by a call to
Remove() and Insert(), the performance should be practically
identical, but you wouldn't get any information about how many
replacements occurred.

Alternatively, if you still call AreEqual() and then call
StringBuilder.Replace(string, string, int, int), you're duplicating
effort (which costs performance), because
StringBuilder.Replace(string, string, int, int) has to actually do
the string comparison again. That would actually be _slower_ than
your original code.

You could do a little hack by searching for the first character that
differs between the search and replacement strings (as an
initialization, not as part of the loop), and then bumping a counter
after each call to StringBuilder.Replace() based on whether the
character at the same offset within the current StringBuilder has
changed. That would be only slightly slower than just calling
StringBuilder.Replace(string, string, int, int), but would include
the count.

That said, I would hope that any Replace() method in .NET, including
Regex.Replace(), String.Replace(), or StringBuilder.Replace() would be
faster than the code you posted. The main reason being that all of
those methods have the opportunity to optimize the construction of
the new string, whereas your example doesn't optimize at all.

At the very least, I would not use the Remove()/Insert() pattern
you've shown. Instead, I would use a String as input, and a
StringBuilder as output, appending text segments to the output
StringBuilder as I scan the input String. That way the code avoids
having to repeatedly shift your character buffer in the StringBuilder
(which happens _twice_ for each replacement in your code). That's
exactly the kind of optimization I'd expect to find inside the .NET
classes.

It might even be worthwhile to defer creation of the output
StringBuilder until you detect the first match that needs to be
replaced, if there's an expectation that for a significant frequency
of input, no replacements would be needed.
It is probably a lot faster than using a regex, though I haven't done
any measurements.
I would expect Regex to be on par with other explicit mechanisms like
that, especially given the need to count the replacements (which for
non-Regex solutions requires replacing the search text twice). If
performance is an issue, then a "scan and build" approach as I suggest
above is probably slightly faster than using built-in Replace()
methods simply because you can incorporate a count into the
replacement logic.

All that said, if performance is an issue (and there's nothing in the
OP to suggest it is), the only way to know for sure what the best
solution is would be to try the different alternatives and measure
them. Even theoretical advantages and disadvantages may be
irrelevant for a typical data set, and intuition is a terrible way to
measure performance. :)

For best performance, it may be that none of the suggestions offered
so far are probably suitable. There's an optimized text search
algorithm, the name of which I can't recall at the moment, that can
probably be adapted, but if not then a degenerate state-graph
implementation (since there's only one string to search for) would
probably work too. Either approach would avoid having to keep
performing full string comparisons at each character index in the
original string (consider an original string
"aaaaaaaaaaaaaaaaaaaaaaaaaa" where you want to replace all occurrences
of "aaaaaab" with something :) ).

But even there, as I said, there's no way to know for sure without
measuring. Performance of the various choices is to some extent going
to be data dependent; liabilities that exist in the general case
might not really be that much of a problem. For example, if dealing
with essentially random data, it's not too terrible to just keep
comparing over and over at an incremented index, because those
comparisons will normally terminate quickly when there's no match.

In other words, even the theoretically worst-case implementation might
not turn out to be much different than the more optimized ones.

Until there's a performance problem shown, the OP should stick with
whatever solution _reads_ the best, and is the most maintainable. And
if there is a performance problem shown, measuring each viable
alternative is the only way to know for sure which will be fastest.

Pete
 
V

vanderghast

There is a difference between matching and replacing.

Someone can say "ohoh" is matched twice in "ohohoh", once starting at
position 0 and once starting at position 2,

but if you speak to replace (consume) it, you have only one possible
'action'.

I haven't tried, but I assume Regex would find 2 matches, while replace will
replace just once the pattern.

And again, (InitialStrring.Length-InitialString.Replace(pattern,
String.Empty).Length) / pattern.Length is 'safe', as far as I know, for all
cases, and use no external loop, to count the number of replacements where
will be of pattern into InitialString (by whatever newPattern, which is
irrelevant).



Vanderghast, Access MVP


Jesse Houwing said:
Hello Peter,

Agreed on the readability part, but using regex.replace opens up a new can
of worms, which people aren't usually prepared for. Say this
search/replace action can be entered from the UI, then adding . or * or
{ into your search pattern can lead to unexpected behaviour, or worse a
regex parse error. The regex will also be expensive, because it will have
to be parsed/compiled every time a new pattern is used (and if it is a
user defined replacement, that would be more often than not).

So this would have to be extended with a Regex.Escape call first. The same
applies for the replacement pattern. Say I want to search $2 and replace
it with $0.1 you'd get funny things... ($2.1 actually)... So it isn't just
using a different call to get the same results.

That said, I'd opt for an extention method on string and write an
efficient version (could use mine as an example) of a Replace method that
returns the number of matches. And from that moment on, use that. Just as
readable (or even better) and no crazy unexpected regex problems due to
not exactly understanding what is involved.

Jesse
Looping through the string in a stringbuilder is probably the safest
way to do this:

[...]
int count = 0;
for (int i = 0; i < sb.Length; )
{
if (AreEqual(sb, search, i))
{
sb.Remove(i, search.Length);
sb.Insert(i, replacement);
i += replacement.Length;
count++;
}
else
{
i++;
}
}
[...]
It might be faster to use sb.Replace(search, replacement, i,
search.Length)

instead of a sb.Remove, sb.Insert, I'm not sure, but they won't
differ that much.
If you simply call StringBuilder.Replace(string, string, int, int)
instead of having your own AreEqual() method followed by a call to
Remove() and Insert(), the performance should be practically
identical, but you wouldn't get any information about how many
replacements occurred.

Alternatively, if you still call AreEqual() and then call
StringBuilder.Replace(string, string, int, int), you're duplicating
effort (which costs performance), because
StringBuilder.Replace(string, string, int, int) has to actually do
the string comparison again. That would actually be _slower_ than
your original code.

You could do a little hack by searching for the first character that
differs between the search and replacement strings (as an
initialization, not as part of the loop), and then bumping a counter
after each call to StringBuilder.Replace() based on whether the
character at the same offset within the current StringBuilder has
changed. That would be only slightly slower than just calling
StringBuilder.Replace(string, string, int, int), but would include
the count.

That said, I would hope that any Replace() method in .NET, including
Regex.Replace(), String.Replace(), or StringBuilder.Replace() would be
faster than the code you posted. The main reason being that all of
those methods have the opportunity to optimize the construction of
the new string, whereas your example doesn't optimize at all.

At the very least, I would not use the Remove()/Insert() pattern
you've shown. Instead, I would use a String as input, and a
StringBuilder as output, appending text segments to the output
StringBuilder as I scan the input String. That way the code avoids
having to repeatedly shift your character buffer in the StringBuilder
(which happens _twice_ for each replacement in your code). That's
exactly the kind of optimization I'd expect to find inside the .NET
classes.

It might even be worthwhile to defer creation of the output
StringBuilder until you detect the first match that needs to be
replaced, if there's an expectation that for a significant frequency
of input, no replacements would be needed.
It is probably a lot faster than using a regex, though I haven't done
any measurements.
I would expect Regex to be on par with other explicit mechanisms like
that, especially given the need to count the replacements (which for
non-Regex solutions requires replacing the search text twice). If
performance is an issue, then a "scan and build" approach as I suggest
above is probably slightly faster than using built-in Replace()
methods simply because you can incorporate a count into the
replacement logic.

All that said, if performance is an issue (and there's nothing in the
OP to suggest it is), the only way to know for sure what the best
solution is would be to try the different alternatives and measure
them. Even theoretical advantages and disadvantages may be
irrelevant for a typical data set, and intuition is a terrible way to
measure performance. :)

For best performance, it may be that none of the suggestions offered
so far are probably suitable. There's an optimized text search
algorithm, the name of which I can't recall at the moment, that can
probably be adapted, but if not then a degenerate state-graph
implementation (since there's only one string to search for) would
probably work too. Either approach would avoid having to keep
performing full string comparisons at each character index in the
original string (consider an original string
"aaaaaaaaaaaaaaaaaaaaaaaaaa" where you want to replace all occurrences
of "aaaaaab" with something :) ).

But even there, as I said, there's no way to know for sure without
measuring. Performance of the various choices is to some extent going
to be data dependent; liabilities that exist in the general case
might not really be that much of a problem. For example, if dealing
with essentially random data, it's not too terrible to just keep
comparing over and over at an incremented index, because those
comparisons will normally terminate quickly when there's no match.

In other words, even the theoretically worst-case implementation might
not turn out to be much different than the more optimized ones.

Until there's a performance problem shown, the OP should stick with
whatever solution _reads_ the best, and is the most maintainable. And
if there is a performance problem shown, measuring each viable
alternative is the only way to know for sure which will be fastest.

Pete
 
G

Göran Andersson

Smith said:
Hi

Usually I only replace strings with text = text.Replace("old text", "new
text");

Now I need to display the number of replacements made. Is there an easy way
or do I need some custom replacement method?

You can do the replacing yourself using IndexOf and a StringBuilder, so
that you can count them:


string original = "1234567890123456789012345678901234567890";
string find = "23";
string replacement = "twentythree";

StringBuilder result = new StringBuilder();
int replacements = 0;
int index = 0;
do {
int newIndex = original.IndexOf(find, index);
if (newIndex != -1) {
result.Append(original, index, newIndex - index);
result.Append(replacement);
replacements++;
index = newIndex + find.Length;
} else {
result.Append(original, index, original.Length - index);
index = original.Length;
}
} while (index < original.Length);

Console.WriteLine(result.ToString());
Console.WriteLine(replacements);
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top