Regex Matches

Kofi

Any takers?

I have a DNA string as an input sequence, GGATGGATG, and apply the simple
regex "GGATG" to it, as in

Regex r = new Regex("GGATG", (RegexOptions.Compiled));

MatchCollection matches = r.Matches("GGATGGATG");

Now I would expect to get two matches, right? One at index 0 in the
string and the second at index 4? Or am I being really dumb or
something (EricGu, where art thou?).

Thanks 4 help.

Kofi.
 
Gabriel Lozano-Morán

It would match twice if you put an extra G at index 4 of the input
string:
GGATGGGATG

Gabriel Lozano-Morán
 
Bruce Wood

Gabriel said:
It would match twice if you put an extra G at index 4 of the input
string:
GGATGGGATG

Gabriel Lozano-Morán

Well, yes, but I think that what the OP wanted to know is why Regex
doesn't re-scan after a match. That is, in the string GGATGGATG, the
Regex will match the initial string: GGATG. After that, where does the
Regex processor look to start matching next? Does it start with the
part of the string after the first matched character, so does it begin
matching the substring GATGGATG, in which case it would find a second
match in the fifth character of the original string (the fourth
character of the substring)? Or does it start looking for another match
after the last character matched in the first match, therefore matching
against GATG, which will result in no second match?

Regex appears to display the latter behaviour, according to the OP.
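
To see that behaviour concretely, here is a minimal sketch that lists where Matches finds GGATG in the OP's string. The class and method names are made up for illustration. Each scan resumes at the end of the previous match, so the occurrence starting at index 4 is never reported.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NonOverlappingDemo
{
    // Hypothetical helper: start indexes reported by Regex.Matches,
    // which only ever returns non-overlapping matches.
    public static List<int> MatchIndexes(string input, string pattern)
    {
        var indexes = new List<int>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            indexes.Add(m.Index);
        }
        return indexes;
    }

    public static void Main()
    {
        // The first match covers indexes 0-4, so the next scan only sees "GATG".
        foreach (int i in MatchIndexes("GGATGGATG", "GGATG"))
        {
            Console.WriteLine("Match at index {0}", i); // prints only index 0
        }
    }
}
```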

I checked the RegexOptions enumeration, and don't see any flag for
Rescan. I have seen this option for other Regex pattern matchers, but
it doesn't appear to be in the .NET one.

One thing the OP could do is use Match instead of Matches:

string dna = "GGATGGATG";
int matchIndex = 0;
Regex r = new Regex("GGATG");
Match sequence = r.Match(dna, matchIndex);
while (sequence != Match.Empty)
{
    matchIndex = sequence.Index;
    Console.WriteLine("Sequence matched at index {0}", matchIndex);
    matchIndex++;
    sequence = r.Match(dna, matchIndex);
}

Or something like that. Then he could determine where Regex should
start searching again after it finds a match.
 
Ludovic SOEUR

It is logical that you get only one result: each search resumes after the
end of the previous match, so matches never overlap.
If you want all the indexes where the pattern matches, you can use this trick:
use GGAT(?=G) instead of GGATG
That matches every GGAT that is followed by a G. Of course the match result
will contain GGAT rather than GGATG, but that doesn't matter because you
already know you are looking for GGATG.
So with
Regex r = new Regex("GGAT(?=G)", (RegexOptions.Compiled));
MatchCollection matches = r.Matches("GGATGGATG");

you will get 2 matches, the first at position 0 and the second at position 4
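
A variation on the same trick, in case the full matched text is wanted in the result: put the whole pattern inside the lookahead and capture it. This is only a sketch, and the class and method names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    // Hypothetical helper: start indexes of every GGATG, overlapping or not.
    public static List<int> OverlappingIndexes(string input)
    {
        // The lookahead consumes no characters, so after each (empty) match
        // the engine moves on by one character and can find overlaps.
        var r = new Regex("(?=(GGATG))");
        var indexes = new List<int>();
        foreach (Match m in r.Matches(input))
        {
            // m.Value is empty here; the matched text is in m.Groups[1].Value.
            indexes.Add(m.Index);
        }
        return indexes;
    }

    public static void Main()
    {
        foreach (int i in OverlappingIndexes("GGATGGATG"))
        {
            Console.WriteLine("GGATG found at index {0}", i); // indexes 0 and 4
        }
    }
}
```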

Hope it helps,

Ludovic SOEUR.
 
Bruce Wood

Bruce said:
string dna = "GGATGGATG";
int matchIndex = 0;
Regex r = new Regex("GGATG");
Match sequence = r.Match(dna, matchIndex);
while (sequence != Match.Empty)
{
    matchIndex = sequence.Index;
    Console.WriteLine("Sequence matched at index {0}", matchIndex);
    matchIndex++;
    sequence = r.Match(dna, matchIndex);
}

I should point out that there's a bug in my code. The loop test should
read:

while (sequence != Match.Empty && matchIndex < dna.Length) ...

The bug will show up only when matching a one-character Regex pattern
that matches on the last character of the string.
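
For reference, the same loop could be wrapped up as a small helper, assuming the goal is to collect every start index, overlapping or not. This is a sketch rather than a tested library routine: the class and method names are invented, it uses Match.Success instead of comparing against Match.Empty, and it checks the start position before each call to Match.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class OverlapScanner
{
    // Hypothetical helper: restart the search one character after the start
    // of each match, so overlapping occurrences are reported too.
    public static List<int> FindAll(Regex r, string input)
    {
        var indexes = new List<int>();
        int start = 0;
        // Regex.Match(input, startat) accepts startat from 0 up to input.Length.
        while (start <= input.Length)
        {
            Match m = r.Match(input, start);
            if (!m.Success)
            {
                break;
            }
            indexes.Add(m.Index);
            start = m.Index + 1;
        }
        return indexes;
    }

    public static void Main()
    {
        foreach (int i in FindAll(new Regex("GGATG"), "GGATGGATG"))
        {
            Console.WriteLine("Sequence matched at index {0}", i); // indexes 0 and 4
        }
    }
}
```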
 
