Hello Chuck,
I have to disagree here. Trying it for myself will give me numbers, but
not understanding, which is the ultimate goal.
The fault here is mine for not explaining adequately what I wanted.
In the case above I'm looking for the special case where there is one
match per string for either regex; for instance, the result of running
both examples above against "09/26/07 The quick brown fox ran away."
The date would probably take just as long for each regex. However, it
seems like it might be more efficient, with static characters, to
search for a longer string than a shorter one (assuming that no match
is embedded inside another match). The reason for the increase is that
the pointer at the head of the search would move a greater distance
after each successful match.
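To make that head-pointer idea concrete, here is a small Python sketch (the two literal patterns are made up for illustration): after a successful match the scan resumes at the end of the match, so a longer literal moves the position further in one step.

```python
import re

text = "09/26/07 The quick brown fox ran away."

# Two hypothetical static patterns, one longer than the other.
short = re.compile(r"quick")
longer = re.compile(r"quick brown fox")

# After a successful match the next scan starts at match.end(), so the
# longer literal advances the search position further in a single step.
m1 = short.search(text)
m2 = longer.search(text)
print(m1.end())  # 18 - position just past "quick"
print(m2.end())  # 28 - position just past "quick brown fox"
```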
A rule of thumb:
the more variable-length parts (*, +, {1,3}, {1,}), the slower
reluctant modifiers (.*?, +?) may be slower
the more look-arounds ((?<=, (?!, (?=, (?<!), the slower
the more optional parts, the slower
the more choices, the slower
the longer the regex, the slower
the more general parts (e.g. .*), the slower, *if* there is a new part after
that
a short constant part is marginally faster than a large constant part
anchored regexes (^, $) are faster
CultureInvariant regexes are slower
named captures are slower than unnamed ones
if you don't use a group for capturing, it's faster to use (?:...) instead
of (...), unless you specify the ExplicitCapture option
Why?
Variable parts will try to capture as much as possible. So if you apply
the expression A.*B to the string "AB1234567890", the engine will first
match the A, and then the .* will try to capture everything that's left, so
the running match is AB1234567890. Then it concludes the B won't match,
backs the match up one character to AB123456789, again concludes it won't
match, and gives back the next character, AB12345678, all the way until it
finally returns AB. This is called backtracking, and you usually want to
prevent it as much as possible. If you know there should be a relatively
short distance between the A and the B, use a reluctant modifier .*?, or
limit the length to search .{0,100}, or even make that reluctant .{0,100}?.
See my note on reluctant modifiers in the next paragraph.
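Here is a quick Python sketch of that backtracking (Python's re engine behaves the same way on this point): the greedy .* first grabs everything after the A and then has to give it all back before the B can match.

```python
import re

s = "AB1234567890"

# Greedy: .* first grabs everything after "A", then backtracks, giving up
# one character at a time until "B" can match right after it.
m = re.search(r"A(.*)B", s)
print(m.group(0))  # "AB" - the overall match
print(m.group(1))  # ""   - the greedy .* had to backtrack all the way to empty
```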
Reluctant modifiers are generally slower, *unless* you know there is supposed
to be a short distance between this part and the next. Why? Well, consider
A.*?B with the following input: A1234567890B. It will try to match as
follows: it first matches A, then sees if there's a B behind it. There
isn't, so the .*? captures the 1; now we're at A1. The engine will again
try to see if the next character is a B. It isn't, so .*? captures the
next character: A12. This is repeated until it matches A1234567890B. Now,
going back to the previous example: if the input string from there is used,
this expression is actually faster. Why? Well, consider the string
AB1234567890. The engine will match the constant A first. It will then
look if there's a B behind it. There is, so the result AB is returned.
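The same two inputs, side by side in Python, show both faces of the reluctant modifier: it succeeds immediately when the B is close, and grows one character at a time when it is far away.

```python
import re

# Reluctant (lazy) quantifier: .*? tries the shortest capture first.
near = re.search(r"A(.*?)B", "AB1234567890")
far = re.search(r"A(.*?)B", "A1234567890B")

print(near.group(1))  # "" - B is right next to A, so .*? succeeds at once
print(far.group(1))   # "1234567890" - .*? had to grow one character at a time
```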
The same applies to look-arounds. When the engine finds a look-around, it
starts a new expression search from that position: either in the opposite
direction, or continuing from the current position and returning afterwards.
A look-around with a constant or constant-length value (e.g. (?<=AB) or
(?<=A.{8}B)) will be much faster than a variable-length check, even more so
if the variable-length check can itself backtrack (e.g. (?<=A.*B), or a
plain (?<=A.*)). The more look-arounds, the more expressions need to be
run. In some cases a look-around can make things faster, or is the only way
to get the correct result, but try to keep the possible length of the match
inside the look-around as short as possible.
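As an aside, Python's re module only accepts fixed-width look-behind, which happens to be exactly the cheap case described above: the engine can step back a known number of characters and run the sub-expression once. (.NET, by contrast, also allows variable-length look-behind such as (?<=A.*B), which can backtrack internally.) A sketch:

```python
import re

s = "A12345678Bx"

# Fixed-width look-behind: the engine steps back exactly 10 characters.
m = re.search(r"(?<=A.{8}B)x", s)
print(m.group(0))  # "x"

# A variable-width look-behind is rejected outright by Python's engine:
try:
    re.compile(r"(?<=A.*B)x")
except re.error as err:
    print("rejected:", err)
```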
Optional parts make things slower. Why? Again, it has to do with
backtracking: there is a chance the engine needs to revert if the optional
part doesn't match. This gets worse if the optional part has a variable
length.
The more choices, the slower. (A|B|C|D|E|F|G) will try each option one after
the other, in the order you specify them. So if you have a large batch of
data you're going to parse, it might make sense to see which alternatives
occur more often than others and put those up front.
Also, if you have multiple alternatives that begin with the same string,
it's faster to break the problem down further: (AAA|ABB|ACC|ADD) is slower
than (A(AA|BB|CC|DD)). Why? Because this saves the engine from backtracking
over the A. The same of course applies to [abcdefg], which is the equivalent
of the regex (?:a|b|c|d|e|f|g).
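A Python sketch of the factoring (how much the factored form actually saves depends on the engine, but the two patterns are equivalent, and the factored one only tests the shared "A" prefix once per position):

```python
import re

# Equivalent patterns: the factored form matches the shared "A" prefix once
# instead of re-testing it for every alternative.
flat = re.compile(r"AAA|ABB|ACC|ADD")
factored = re.compile(r"A(?:AA|BB|CC|DD)")

text = "xx" * 500 + "ADD"
print(flat.search(text).group(0))      # "ADD"
print(factored.search(text).group(0))  # "ADD"
```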
The longer the regex, the slower. Well, that seems logical, doesn't it? If
you have two almost identical regexes, the longer one will be slower, simply
because there are more parts to evaluate.
The more general parts, the slower. This comes back to the first rule. [a-z]*
has a more limited set of characters to match, so the chance that it
captures the whole input and backtracks all the way back is smaller than
with .* or, even worse, [\s\S]*. (Why is [\s\S]* worse, you might ask? That
has to do with the fact that by default .* will stop when it finds a
newline.)
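The newline difference is easy to see in Python, where the same default applies (and re.DOTALL turns it off):

```python
import re

s = "A line one\nB line two"

# By default . does not match a newline, so .* stops at the end of the line.
print(re.search(r"A.*", s).group(0))  # "A line one"

# [\s\S] matches any character including newlines, so it runs to the end.
print(re.search(r"A[\s\S]*", s).group(0))  # the whole string

# re.DOTALL makes . behave like [\s\S].
print(re.search(r"A.*", s, re.DOTALL).group(0))  # the whole string
```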
A short constant part is faster than a long one, simply because it takes
marginally longer to compare 10 characters than 5. I'd ignore this: in some
cases a longer string can be faster, just because the chances of finding
such a string are more remote, so there are fewer points in the input where
the regex might start and find a match. Especially if the large constant is
at the front of the pattern.
Anchored regexes are faster. If you anchor a match either to the beginning
(^) or the end ($) of a string, it'll be much faster, though adding $ to a
regex might in some cases make no difference at all. The ^ simply ensures
there are far fewer places to start searching, thus it's faster.
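In Python terms (the patterns here are just illustrative): an unanchored search may attempt a match at every position, while ^ restricts the candidates to position 0, so a non-match is decided almost immediately.

```python
import re

s = "ab date: 09/26/07"

# Unanchored: the engine may try a match at every position in the string.
print(re.search(r"\d{2}/\d{2}/\d{2}", s).group(0))  # "09/26/07"

# Anchored: only position 0 is a candidate, so failure is cheap.
print(re.search(r"^\d{2}/\d{2}/\d{2}", s))  # None - no date at the start
```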
CultureInvariant regexes are slower because in that case the engine will
try to match each character against its Unicode equivalents in other
language blocks. There might be more representations of the character i
among all the Unicode characters, so instead of actually searching for i it
now searches for something like [iiiiiiiiii] (each character having a
different Unicode value but an equal representation).
If you use capturing, (....) will be faster than (?<a>....). I'm not exactly
sure, but my guess is that this is caused by the fact that the engine has to
do string comparisons as opposed to a simple numeric counter. Do note that
for maintainability and readability a named capture is much better, even
though it's slower.
Finally, if you're not going to use a group's captured value, it's faster to
let it not capture at all. This can be achieved in two ways: 1. add ?: to
the beginning of the group to tell the engine it won't have to store the
value it finds; 2. if your regex doesn't use unnamed captures at all, set
the ExplicitCapture option to tell the engine it can ignore any non-named
group for capturing. That saves you from typing ?: at the start of each
group, making the parsing of your regex a tiny bit faster and your regex
more readable.
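The three kinds of groups side by side, sketched in Python (note that Python spells named groups (?P<name>...) rather than .NET's (?<name>...), and has no ExplicitCapture option, so ?: is the only way to suppress capturing there):

```python
import re

s = "09/26/07"

# Numbered, named, and non-capturing groups for the same date pattern.
numbered = re.search(r"(\d{2})/(\d{2})/(\d{2})", s)
named = re.search(r"(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{2})", s)
plain = re.search(r"(?:\d{2})/(?:\d{2})/(?:\d{2})", s)

print(numbered.group(1))     # "09" - accessed by number
print(named.group("month"))  # "09" - accessed by name
print(plain.groups())        # ()   - non-capturing groups store nothing
```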
So back to your problem:
From all these rules, only the larger-constant-part one applies, and because
there is no very restrictive part after the constant, the shorter pattern
will probably win, be it marginally.