Intersect of 2 strings

M

Mark Rogers

Happy New Year y'all,

It's my day to be stumped. I'm trying to extract an Intersect between
two strings where the letters stay in the same order as the original
strings.

For example: string 1) "Happy new year" and string 2) "new year rocks!"
The result I want is: " new year "

IEnumerable provides the correct characters, but the results are not in
order of the original strings.

CompareOrdinal (probably the wrong use) works where I can get the start
and end of each string, except when the strings begin or end has the
same position. For example: "MyString" and "MyStrings" results in some
weird number. I probably don't understand the method correctly. If the
start position and end positions aren't equal, it works fine. Like
"String works" and "MyString work" can generate "String work"

So, with that said I need to compare "Happy new year" and string 2) "new
year rocks!" and get the result: " new year "

I'm long winded, but I hope the problem is clear.

-- Mark
 
P

Peter Duniho

Mark said:
Happy New Year y'all,

It's my day to be stumped. I'm trying to extract an Intersect between
two strings where the letters stay in the same order as the original
strings.

For example: string 1) "Happy new year" and string 2) "new year rocks!"
The result I want is: " new year " [...]

I have no idea how helpful this will be, but a kind of similar question
came up last year, for which I already had a little code snippet that
might have been useful (unfortunately, the OP in that thread never
bothered to reply again, to report any successes or failures).

In that thread, the person was looking to qualify partial matches,
rather than actually returning the string. And the code snippet I
provided was more along those lines.

But, the basic idea is the same as your need, at least in the case of my
partial-matching algorithm. The code I posted could be modified so
that, instead of returning a metric, it actually returns an array of
characters it matched to come up with that metric (or indeed, a string
built from such an array). It also deals with exact matches, but you
can easily change that to handle case-insensitive matching if that's
important.

That thread is here:
http://groups.google.com/group/micr...read/thread/9b5f8cfeaa9479fa/962b604a2efeff34

Note that the code I posted is, and is advertised as such, not designed
for efficiency at all. But, I'm thinking that if your initial approach
was to use something like IEnumerable, maybe your inputs are short
enough that efficiency isn't going to be a huge problem.

In that same thread, some other people mentioned some algorithm that
accomplished something similar, but which _is_ designed for efficiency.
The only problem is that I don't know whether it would be able to be
modified to emit the characters themselves, rather than just simply the
metric. But that might be worth a look too.

Pete
 
M

Mark Rogers

Peter said:
Mark said:
Happy New Year y'all,

It's my day to be stumped. I'm trying to extract an Intersect between
two strings where the letters stay in the same order as the original
strings.

For example: string 1) "Happy new year" and string 2) "new year rocks!"
The result I want is: " new year " [...]

I have no idea how helpful this will be, but a kind of similar question
came up last year, for which I already had a little code snippet that
might have been useful (unfortunately, the OP in that thread never
bothered to reply again, to report any successes or failures).

In that thread, the person was looking to qualify partial matches,
rather than actually returning the string. And the code snippet I
provided was more along those lines.

But, the basic idea is the same as your need, at least in the case of my
partial-matching algorithm. The code I posted could be modified so
that, instead of returning a metric, it actually returns an array of
characters it matched to come up with that metric (or indeed, a string
built from such an array). It also deals with exact matches, but you
can easily change that to handle case-insensitive matching if that's
important.

That thread is here:
http://groups.google.com/group/micr...read/thread/9b5f8cfeaa9479fa/962b604a2efeff34


Note that the code I posted is, and is advertised as such, not designed
for efficiency at all. But, I'm thinking that if your initial approach
was to use something like IEnumerable, maybe your inputs are short
enough that efficiency isn't going to be a huge problem.

In that same thread, some other people mentioned some algorithm that
accomplished something similar, but which _is_ designed for efficiency.
The only problem is that I don't know whether it would be able to be
modified to emit the characters themselves, rather than just simply the
metric. But that might be worth a look too.

Pete

Thanks Pete,

I resorted to some wine, so I' check it out in th AM! I'll let you know
the result turns out.

-- Mark
 
K

kndg

Mark said:
Happy New Year y'all,

It's my day to be stumped. I'm trying to extract an Intersect between
two strings where the letters stay in the same order as the original
strings.

For example: string 1) "Happy new year" and string 2) "new year rocks!"
The result I want is: " new year "

IEnumerable provides the correct characters, but the results are not in
order of the original strings.

CompareOrdinal (probably the wrong use) works where I can get the start
and end of each string, except when the strings begin or end has the
same position. For example: "MyString" and "MyStrings" results in some
weird number. I probably don't understand the method correctly. If the
start position and end positions aren't equal, it works fine. Like
"String works" and "MyString work" can generate "String work"

So, with that said I need to compare "Happy new year" and string 2) "new
year rocks!" and get the result: " new year "

I'm long winded, but I hope the problem is clear.

-- Mark

Hi Mark,

I don't whether it is efficient or not, but here is my try:

using System;

class Program
{
static void Main()
{
string str1 = "Happy new year";
string str2 = "new year rocks!";
//string str1 = "MyString";
//string str2 = "MyStrings";

if (str1.Length < str2.Length) // swap
{
string temp = str1;
str1 = str2;
str2 = temp;
}

Console.WriteLine("String1: {0}", str1);
Console.WriteLine("String2: {0}", str2);
Console.WriteLine("Intersect: {0}", FindIntersect(str1, str2));

Console.ReadLine();
}

private static string FindIntersect(string str1, string str2)
{
for (int n = str2.Length; n > 0; n--)
{
for (int i = 0; i <= str1.Length - n; i++)
{
for (int j = 0; j <= str2.Length - n; j++)
{
if (IsPartialMatch(str2, str1, j, i, n)) // found
best intersect
{
return str2.Substring(j, n);
}
}
}
}

return String.Empty;
}

public static bool IsPartialMatch(string source, string target, int
sourceIndex, int targetIndex, int length)
{
for (int i = 0; i < length; i++)
{
if (source[sourceIndex + i] != target[targetIndex + i])
return false;
}

return true;
}
}

Regards.
 
H

Harlan Messinger

Mark said:
Happy New Year y'all,

It's my day to be stumped. I'm trying to extract an Intersect between
two strings where the letters stay in the same order as the original
strings.

For example: string 1) "Happy new year" and string 2) "new year rocks!"
The result I want is: " new year "

IEnumerable provides the correct characters, but the results are not in
order of the original strings.

What is your well-defined general rule for this operation? You've given
an example of a string operation you'd like to implement, but you
haven't explained what operation this example illustrates.

For example, I can't tell from what you wrote what kind of result you'd
expect from "*** one two three four five six ***" and "!!! three four
!!! six five !!! one two !!!".
 
T

Tim Roberts

kndg said:
private static string FindIntersect(string str1, string str2)
{
for (int n = str2.Length; n > 0; n--)
{
for (int i = 0; i <= str1.Length - n; i++)
{
for (int j = 0; j <= str2.Length - n; j++)
{
if (IsPartialMatch(str2, str1, j, i, n)) // found
best intersect
{
return str2.Substring(j, n);
}
}
}
}

return String.Empty;
}

Admittedly, the problem statement is not as detailed as one might like, but
I don't think it's really necessary to use nested loops. I would assert
that, in order for there to be an intersection as he describes it, the last
N characters of string1 must always equal the first N characters of
string2. Thus, we can just loop from the length of the shortest down to 0:

private static string FindIntersect(string str1, string str2)
{
for( int Start = str1.Length - Math.Min(str1.Length, str2.Length);
Start < str1.Length;
Start++
)
{
if( str2.StartsWith( str1.Substring(Start) ) )
return str1.Substring(Start);
}

return String.Empty;
}
 
K

kndg

Tim said:
Admittedly, the problem statement is not as detailed as one might like, but
I don't think it's really necessary to use nested loops. I would assert
that, in order for there to be an intersection as he describes it, the last
N characters of string1 must always equal the first N characters of
string2. Thus, we can just loop from the length of the shortest down to 0:

private static string FindIntersect(string str1, string str2)
{
for( int Start = str1.Length - Math.Min(str1.Length, str2.Length);
Start < str1.Length;
Start++
)
{
if( str2.StartsWith( str1.Substring(Start) ) )
return str1.Substring(Start);
}

return String.Empty;
}

Hi Tim,

I haven't look closely at your code (looks cool), but it failed (as I
understand from the OP's requirement) for below strings;

string str1 = "You've given an example of a string";
string str2 = "I'm trying to extract an Intersect";

I suppose it would return " an " but instead it give an empty string.
 
K

kndg

kndg said:
Hi Tim,

I haven't look closely at your code (looks cool), but it failed (as I
understand from the OP's requirement) for below strings;

string str1 = "You've given an example of a string";
string str2 = "I'm trying to extract an Intersect";

I suppose it would return " an " but instead it give an empty string.

Okay, it seems that your code just look at the start of the string and
that's why it would return empty string even there is intersect in the
middle of the string like "aaa bbb ccc" and "ddd bbb eee". But as you
said, the OP haven't been clear on his actual requirement.

Regards.
 
P

Peter Duniho

Tim said:
Admittedly, the problem statement is not as detailed as one might like,

I think for sure it's not. We really don't know whether the OP intends
to only qualify purely overlapping strings, or he's truly looking for
the character-by-character intersection of two strings. For example,
from his own example we know that the "intersection" of "abc" and "bcd"
would be "bc". But, what about the intersection of "abcde" and "ace"?

After all, he did imply that he'd tried to use Enumerable.Intersect(),
and that the only problem was ordering.

Even if we assume he really means to only match contiguous characters,
there are still ambiguities. For example, what about:

– "bcd" and "abc"?
– "abcde" and "bcd", or vice a versa?
– "abcde" and "abc"?
– "cdefg" and "abcde"?
– etc.

Of course, his actual stated example is impossible anyway, as he says he
wants "Happy new year" and "new year rocks!" to result in " new year ",
but doesn't explain why we should be including the leading and trailing
spaces? Those certainly aren't part of the true intersection of the two
strings, however you look at it.
but I don't think it's really necessary to use nested loops.

Well, it seems possible that _four_ nested loops are probably overkill
(but I'm not even sure about this myself…haven't really looked closely
at kndg's solution). But, it's not like there are no loops in the code
you posted. You just hide them in calls to StartsWith() and
Substring(). :)
I would assert
that, in order for there to be an intersection as he describes it, the last
N characters of string1 must always equal the first N characters of
string2.

I think at the very least, you have to be willing to assume that the
order of the input does not matter. That is, "Happy new year" and "new
year rocks!" should produce the same output as "new year rocks!" and
"Happy new year".

The solution you propose:
Thus, we can just loop from the length of the shortest down to 0:

private static string FindIntersect(string str1, string str2)
{
for( int Start = str1.Length - Math.Min(str1.Length, str2.Length);
Start < str1.Length;
Start++
)
{
if( str2.StartsWith( str1.Substring(Start) ) )
return str1.Substring(Start);
}

return String.Empty;
}

…certainly will work with the given examples (except you won't emit the
extra mystery spaces :) ), but it's hardly what I'd say implements the
most obvious-yet-simplistic interpretation of the OP's requirements.
Your implementation not only is dependent on input order, it strictly
assumes that the strings overlap end of one and beginning of the other.

That _could_ be what the OP wants, but it's not what I'd guess.

Of course, only the OP can say for sure.

Pete
 
P

Peter Duniho

Some minor observations:
[...]
private static string FindIntersect(string str1, string str2)
{
for (int n = str2.Length; n > 0; n--)
{
for (int i = 0; i <= str1.Length - n; i++)
{
for (int j = 0; j <= str2.Length - n; j++)
{
if (IsPartialMatch(str2, str1, j, i, n)) // found
best intersect
{
return str2.Substring(j, n);
}
}
}
}

return String.Empty;
}

Upon further reflection, I believe I still think that four nested loops
is overkill. You should be able to get by with three at the most, by
combining the matching with the length variation. That is, instead of
trying the longest possible match first, scan from a given starting
point and see how far you get.

Tracking the longest matching substring you can find, you should be able
to do the whole thing in three loops.

In reality, there's dependency in the range of the loops. I'm too tired
at the moment to do a big-O comparison, but I doubt the four-loop
solution is actually O(N^4). It only looks that way because of the way
the code is structured. Not that O(N^3) is great either, but it's
better than O(N^4). :)

The other thing I would change is to put the swap into the
FindIntersect() method, rather than making that an input requirement.

Pete
 
K

kndg

Peter said:
Some minor observations:
[...]
private static string FindIntersect(string str1, string str2)
{
for (int n = str2.Length; n > 0; n--)
{
for (int i = 0; i <= str1.Length - n; i++)
{
for (int j = 0; j <= str2.Length - n; j++)
{
if (IsPartialMatch(str2, str1, j, i, n)) // found
best intersect
{
return str2.Substring(j, n);
}
}
}
}

return String.Empty;
}

Upon further reflection, I believe I still think that four nested loops
is overkill. You should be able to get by with three at the most, by
combining the matching with the length variation. That is, instead of
trying the longest possible match first, scan from a given starting
point and see how far you get.

Tracking the longest matching substring you can find, you should be able
to do the whole thing in three loops.

Yes, I have thought of this but I think in order to track the longest
matching substring it will prevent me to terminate the loop as early as
possible. What is your opinion?
In reality, there's dependency in the range of the loops. I'm too tired
at the moment to do a big-O comparison, but I doubt the four-loop
solution is actually O(N^4). It only looks that way because of the way
the code is structured. Not that O(N^3) is great either, but it's
better than O(N^4). :)

The other thing I would change is to put the swap into the
FindIntersect() method, rather than making that an input requirement.

Hahaha... yeah that's the result if you write the code in hurry. I
should have encapsulate it into another method and call it from
FindIntersect. The IsPartialMatch method should be private also.
 
K

kndg

kndg said:
[...]
Upon further reflection, I believe I still think that four nested
loops is overkill. You should be able to get by with three at the
most, by combining the matching with the length variation. That is,
instead of trying the longest possible match first, scan from a given
starting point and see how far you get.

Tracking the longest matching substring you can find, you should be
able to do the whole thing in three loops.

Yes, I have thought of this but I think in order to track the longest
matching substring it will prevent me to terminate the loop as early as
possible. What is your opinion?

Reread your post, I think I get what you means. I will put some thoughs
on this and will try to repost on tomorrow (I'm off from work now --
offline :( )

Others also might have some improvement or better algorithm, please feel
free to post your suggestion.

Regards.
 
P

Peter Duniho

kndg said:
[...]
Yes, I have thought of this but I think in order to track the longest
matching substring it will prevent me to terminate the loop as early
as possible. What is your opinion?

Reread your post, I think I get what you means. I will put some thoughs
on this and will try to repost on tomorrow (I'm off from work now --
offline :( )

Don't bang your head on it too much. A) Mark may not need that degree
of refinement (of course, we don't really know what he needs at this
point), and b) after I thought about it some more, I am questioning my
original proposal and think maybe you need some equivalent to four loops
after all.

But if it turns out I was right after all, I'd sure love to know. :)
 
K

kndg

Peter said:
kndg said:
[...]
Tracking the longest matching substring you can find, you should be
able to do the whole thing in three loops.

Yes, I have thought of this but I think in order to track the longest
matching substring it will prevent me to terminate the loop as early
as possible. What is your opinion?

Reread your post, I think I get what you means. I will put some
thoughs on this and will try to repost on tomorrow (I'm off from work
now -- offline :( )

Don't bang your head on it too much.

No, I won't. It hurts :)
But it seems like a fun, so I will play around a bit.
A) Mark may not need that degree
of refinement (of course, we don't really know what he needs at this
point), and b) after I thought about it some more, I am questioning my
original proposal and think maybe you need some equivalent to four loops
after all.

But if it turns out I was right after all, I'd sure love to know. :)

Ok, as promised, here is my improvement (it has been reduced to 2
loops). I haven't benchmarked it yet, but it seems is much better than
the previous code.

public static void Swap<T>(ref T a, ref T b)
{
T temp = a;
a = b;
b = temp;
}

public static string FindIntersect(string str1, string str2)
{
if (str1.Length < str2.Length) Swap(ref str1, ref str2);

int matchIndex = 0;
int matchLength = 0;

for (int i = 0; i < str1.Length; i++)
{
int j = 0;
int index = i;
int length = 0;

while (j < Math.Min(str1.Length - i, str2.Length))
{
if (str2[j] != str1[index])
{
if (length > matchLength)
{
matchIndex = j - length;
matchLength = length;
}
else
{
index = i;
length = 0;
}
}
else
{
index++;
length++;
}

j++;
}

if (length > matchLength)
{
matchIndex = j - length;
matchLength = length;
}
}

return str2.Substring(matchIndex, matchLength);
}

Regards.
 
K

kndg

kndg said:
[...] I haven't benchmarked it yet, but it seems is much better than
the previous code.

Done benchmarking.
Test against random text from google (10K bytes text file and 1K bytes
text file). The improvement code is way way much better than previous
code. On my system, the improvement code only takes 200ms to complete
while the previous one takes about 2 minutes! Hahaha....
 
P

Peter Duniho

kndg said:
[...]
Ok, as promised, here is my improvement (it has been reduced to 2
loops). I haven't benchmarked it yet, but it seems is much better than
the previous code.

Hmmm…I'm not sure it works for all inputs. If the inner loop is working
on a match and reaches the end, it picks up working on the next match at
the character index where the match failed, rather than starting over at
the next character in str2 after where the first match started.

So, consider what happens when you call the method with "aabbcc" and
"aaabb". You'll match "aa", but then resume comparing characters
starting at the third "a" in the second string, missing the match
between "aab" and the first three characters of the first string.

I'm also skeptical of your inner loop's condition. It's true that you
only ever want to examine at most the number of characters left in the
first string after accounting for the current "i". But that's not to
say that you must then necessarily restrict yourself to the first
"str1.Length - i" characters of str2. The search needs to repeat
starting at each index in str2 in order to accommodate all the possible
matches.

Good effort though!

Pete

p.s. It's always easier to find things wrong with something than to fix
them. :)
 
K

kndg

Peter said:
kndg said:
[...]
Ok, as promised, here is my improvement (it has been reduced to 2
loops). I haven't benchmarked it yet, but it seems is much better than
the previous code.

Hmmm…I'm not sure it works for all inputs. If the inner loop is working
on a match and reaches the end, it picks up working on the next match at
the character index where the match failed, rather than starting over at
the next character in str2 after where the first match started.

So, consider what happens when you call the method with "aabbcc" and
"aaabb". You'll match "aa", but then resume comparing characters
starting at the third "a" in the second string, missing the match
between "aab" and the first three characters of the first string.

I'm also skeptical of your inner loop's condition. It's true that you
only ever want to examine at most the number of characters left in the
first string after accounting for the current "i". But that's not to
say that you must then necessarily restrict yourself to the first
"str1.Length - i" characters of str2. The search needs to repeat
starting at each index in str2 in order to accommodate all the possible
matches.

Hi Pete,
Thanks for your comments (yes, you are correct).
But the magic is, it produces the correct output (bug?).
"aabbcc","aaabb" -> "aabb"
I will look again at my code.

Regards.
 
P

Peter Duniho

kndg said:
Thanks for your comments (yes, you are correct).
But the magic is, it produces the correct output (bug?).
"aabbcc","aaabb" -> "aabb"

Interesting. Must be a code path I didn't recognize that handles that case.

As long as that code path reliably covers those kinds of overlap cases
that are otherwise missed, then perhaps there's no problem after all.

Pete
 
K

kndg

Peter said:
Interesting. Must be a code path I didn't recognize that handles that
case.

As long as that code path reliably covers those kinds of overlap cases
that are otherwise missed, then perhaps there's no problem after all.

Pete

It is indeed a bug.
"zaabbcc", "yyaaabb" -> "aab" which is wrong.
I will look again at my code.
 
K

kndg

Peter said:
kndg said:
[...]
Ok, as promised, here is my improvement (it has been reduced to 2
loops). I haven't benchmarked it yet, but it seems is much better than
the previous code.

Hmmm…I'm not sure it works for all inputs. If the inner loop is working
on a match and reaches the end, it picks up working on the next match at
the character index where the match failed, rather than starting over at
the next character in str2 after where the first match started.

So, consider what happens when you call the method with "aabbcc" and
"aaabb". You'll match "aa", but then resume comparing characters
starting at the third "a" in the second string, missing the match
between "aab" and the first three characters of the first string.

I'm also skeptical of your inner loop's condition. It's true that you
only ever want to examine at most the number of characters left in the
first string after accounting for the current "i". But that's not to
say that you must then necessarily restrict yourself to the first
"str1.Length - i" characters of str2. The search needs to repeat
starting at each index in str2 in order to accommodate all the possible
matches.

Your comments was very helpful.
Here is the bug-fix version.
Any inputs is highly appreciated.

public static string FindIntersect(string str1, string str2)
{
if (str1.Length < str2.Length) Swap(ref str1, ref str2);

int matchIndex = 0;
int matchLength = 0;

for (int i = 0; i < str1.Length; i++)
{
int j = 0;
int index = i;
int length = 0;

while (j < str2.Length && index < str1.Length)
{
if (str2[j] != str1[index])
{
if (length > matchLength)
{
matchIndex = j - length;
matchLength = length;
}

j -= length;
index = i;
length = 0;
}
else
{
index++;
length++;
}

j++;
}

if (length > matchLength)
{
matchIndex = j - length;
matchLength = length;
}
}

return str2.Substring(matchIndex, matchLength);
}

Regards.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top