Regex Matches

  • Thread starter: Kofi

Kofi

Any takers?

I've got a string of DNA as an input sequence, GGATGGATG, and I apply the
simple regex "GGATG", as in:

Regex r = new Regex("GGATG", (RegexOptions.Compiled));

MatchCollection matches = r.Matches("GGATGGATG");

Now I would expect to get two matches right? One at index 0 in the
string and the second at index 4? Or am I being really dumb or
something (EricGu, where art thou?).

Thanks 4 help.

Kofi.
 
It would match two times if you put an extra G at index 4 in the matches
string:
GGATGGGATG

Gabriel Lozano-Morán
 
Gabriel said:
It would match two times if you put an extra G at index 4 in the matches
string:
GGATGGGATG

Gabriel Lozano-Morán

Well, yes, but I think what the OP wanted to know is why Regex doesn't
re-scan after a match. In the string GGATGGATG, Regex first matches the
initial GGATG. Where does the Regex processor start looking for the next
match? Does it resume just after the first matched character, so that it
scans the substring GATGGATG? In that case it would find a second match
at the fifth character of the original string (the fourth character of
the substring). Or does it resume after the last character of the first
match, so that it scans only GATG, which yields no second match?

Regex appears to display the latter behaviour, according to the OP.
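
A quick way to confirm this is to list the match indexes directly. The
snippet below is only a sketch; the class and method names
(NonOverlappingDemo, MatchIndexes) are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NonOverlappingDemo
{
    // Hypothetical helper: returns the start index of every match that
    // Regex.Matches reports for the given pattern.
    public static List<int> MatchIndexes(string input, string pattern)
    {
        var indexes = new List<int>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            indexes.Add(m.Index);
        }
        return indexes;
    }

    public static void Main()
    {
        // The two GGATG occurrences share the G at index 4,
        // so only the first one is reported.
        Console.WriteLine(string.Join(",", MatchIndexes("GGATGGATG", "GGATG")));  // prints "0"

        // With the extra G the occurrences no longer overlap.
        Console.WriteLine(string.Join(",", MatchIndexes("GGATGGGATG", "GGATG"))); // prints "0,5"
    }
}
```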

I checked the RegexOptions enumeration, and don't see any flag for
Rescan. I have seen this option for other Regex pattern matchers, but
it doesn't appear to be in the .NET one.

One thing the OP could do is use Match instead of Matches:

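string dna = "GGATGGATG";
int matchIndex = 0;
Regex r = new Regex("GGATG");
// Start the first search at the beginning of the string.
Match sequence = r.Match(dna, matchIndex);
while (sequence != Match.Empty)
{
    matchIndex = sequence.Index;
    Console.WriteLine("Sequence matched at index {0}", matchIndex);
    // Resume one character past the start of this match, not past its
    // end, so that overlapping matches are found too.
    matchIndex++;
    sequence = r.Match(dna, matchIndex);
}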

Or something like that. Then he could determine where Regex should
start searching again after it finds a match.
 
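It is logical that you get only one result: the two occurrences share the
G at index 4, and that G is consumed by the first match.
If you want the indexes of all the occurrences, you can use this trick:
use GGAT(?=G) instead of GGATG.
That matches every GGAT sequence that is followed by a G. The lookahead
(?=G) checks for the G without consuming it, so the next match can start
on it. Of course, the match result will contain only GGAT rather than
GGATG, but that doesn't matter because you know you are looking for GGATG.
So with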
Regex r = new Regex("GGAT(?=G)", (RegexOptions.Compiled));
MatchCollection matches = r.Matches("GGATGGATG");

you will get 2 matches, the first at index 0 and the second at index 4.
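
If the matched text itself is needed along with the index, one variation
on the same idea (not something from the posts above) is to put the whole
pattern inside the lookahead and capture it in a group. A rough sketch,
with a hypothetical class name LookaheadDemo:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    // Returns "index:text" for every position where GGATG occurs,
    // including occurrences that overlap.
    public static List<string> Find(string dna)
    {
        // The lookahead consumes no characters, so the next search can
        // start inside the previous occurrence. Group 1 holds the text
        // that the lookahead saw.
        var r = new Regex("(?=(GGATG))");
        var hits = new List<string>();
        foreach (Match m in r.Matches(dna))
        {
            hits.Add(m.Index + ":" + m.Groups[1].Value);
        }
        return hits;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(" ", Find("GGATGGATG"))); // prints "0:GGATG 4:GGATG"
    }
}
```

Each Match here has zero length, so the text has to be read from
Groups[1] rather than from Match.Value.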

Hope it helps,

Ludovic SOEUR.
 
Bruce said:
string dna = "GGATGGATG";
int matchIndex = 0;
Regex r = new Regex("GGATG");
Match sequence = r.Match(dna, matchIndex);
while (sequence != Match.Empty)
{
    matchIndex = sequence.Index;
    Console.WriteLine("Sequence matched at index {0}", matchIndex);
    matchIndex++;
    sequence = r.Match(dna, matchIndex);
}

I should point out that there's a bug in my code. The loop test should
read:

while (sequence != Match.Empty && matchIndex < dna.Length) ...

The bug will show up only when matching a one-character Regex pattern
that matches on the last character of the string.
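
For what it's worth, the same loop can also be written against
Match.Success, with the start index checked before each call. This is
only a sketch, assuming overlapping matches are what is wanted;
OverlapScanner and FindAll are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class OverlapScanner
{
    // Returns the start index of every match, overlapping ones included.
    public static List<int> FindAll(string input, string pattern)
    {
        var r = new Regex(pattern);
        var found = new List<int>();
        int start = 0;

        // Regex.Match accepts a start index from 0 up to and including
        // input.Length, so the check happens before the call is made.
        while (start <= input.Length)
        {
            Match m = r.Match(input, start);
            if (!m.Success)
            {
                break;
            }
            found.Add(m.Index);
            // Resume one character past the start of this match.
            start = m.Index + 1;
        }
        return found;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(",", FindAll("GGATGGATG", "GGATG"))); // prints "0,4"
        Console.WriteLine(string.Join(",", FindAll("GGATGGATG", "G")));     // prints "0,1,4,5,8"
    }
}
```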
 