Regex optimization


Chuck B

In C#, which Regex expression would be faster when run against, say, 10,000
strings:

new Regex(@"\d+/\d+/\d+ The quick brown fox.*");

or

new Regex(@"\d+/\d+/\d+ The.*");

The reason I'm asking is that I'm not sure how Regex works internally and
it's not clear why one would be faster than the other.
 

Chuck B

Yes I will - but then that won't tell me much about what went on internally.

I was hoping that someone with knowledge of the Regex engine could help me
understand why one was better than the other.

Thanks for your help. Oh wait... nm... ;)
 

Kevin Spencer

The Regex engine works by looping through the string one character at a
time, applying the rules of the test to each fragment. In some cases, it
may backtrack, depending on the rules. Therefore, the more rules there are
in the regular expression, the longer it will take. The rules in the regular
expression are defined by the characters in it.

--
HTH,

Kevin Spencer
Microsoft MVP

DSI PrintManager, Miradyne Component Libraries:
http://www.miradyne.net
 

Chris Shepherd

Chuck said:
Yes I will - but then that won't tell me much about what went on internally.

I was hoping that someone with knowledge of the Regex engine could help me
understand why one was better than the other.

Thanks for your help. Oh wait... nm... ;)

Trying it for yourself and seeing the results is a perfectly reasonable
answer.

While it may not allow you to recreate the source code, it certainly
will tell you which way is more efficient. Because your original regexes
were completely unequal, it's hard to say whether you could actually
learn anything about what is going on "internally" since in one instance
you'd get a LOT more hits than the other.

Assuming a somewhat normal dataset where some records have "The quick
brown fox" and more records have "The" in them, I'd suggest that it's
probably more work to return more results than it is simply to find
them, so #1 wins out. On 10,000 lines of "The quick brown fox" where the
returned values would be equal, I'd wager on the second regex being
faster (equal time spent returning values, less time spent searching).

Chris.
 

Chuck B

Chris Shepherd said:
Trying it for yourself and seeing the results is a perfectly reasonable
answer.

I have to disagree here. Trying it for myself will give me numbers, but not
understanding, which is the ultimate goal.
While it may not allow you to recreate the source code, it certainly will
tell you which way is more efficient. Because your original regexes were
completely unequal, it's hard to say whether you could actually learn
anything about what is going on "internally" since in one instance you'd
get a LOT more hits than the other.

Assuming a somewhat normal dataset where some records have "The quick
brown fox" and more records have "The" in them, I'd suggest that it's
probably more work to return more results than it is simply to find them,
so #1 wins out. On 10,000 lines of "The quick brown fox" where the
returned values would be equal, I'd wager on the second regex being faster
(equal time spent returning values, less time spent searching).

The fault here is mine for not explaining adequately what I wanted.

In the case above I'm looking for the special case where there is 1 match
per string for either Regex. For instance; the result of running both
examples above against "09/26/07 The quick brown fox ran away."

The date would probably take just as long for each regex. However, it seems
like it might be more efficient, with the static characters, to search for a
longer string than a shorter one (assuming that no match is embedded inside
another match). The reason for the expected speedup is that the pointer at the
head of the search would advance a greater distance after each successful
match.
 

Chuck B

Thanks Kevin.

I ran a test with 10,000 iterations of the test string - "9/19/09 This
is a test. This is only a test of the quick brown fox. If this had been a
real quick brown fox it would have eaten yer toes." - against the Regexes
"\d+/\d+/\d+.*real.*" and "\d+/\d+/\d+.*this had been a real.*", and it turns
out that there's about a 3 microsecond difference, with the shorter
expression being faster.

I'm guessing that rules involving escape classes like \d or \w would
take longer to match, due to the fact that there are more candidate
characters to sort through.
 

Kevin Spencer

Hi Chuck,

Actually, it's just a simple matter of more rules in the regular expression.

\d+/\d+/\d+ The quick brown fox.*
\d+/\d+/\d+ The.*

Note that each character represents a rule. "quick brown fox" is actually 15
rules, indicating specific characters that must match. So, the Regex engine
must test each successive character in the target string against each of
these characters/rules to ascertain a match before moving on. With the
wildcard (.*), only the newline character (1 character) must be looked for.
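Kevin's rule-counting intuition can be checked empirically. Below is a minimal timing sketch - written in Python's `re` module for brevity rather than .NET's `Regex`, on the assumption that both backtracking engines behave similarly for these constructs; the test string is the one from the original question, and the iteration count is arbitrary:

```python
import re
import timeit

text = "09/26/07 The quick brown fox ran away."

long_pat = re.compile(r"\d+/\d+/\d+ The quick brown fox.*")
short_pat = re.compile(r"\d+/\d+/\d+ The.*")

# Both patterns match the sample line; the only question is speed.
assert long_pat.match(text) is not None
assert short_pat.match(text) is not None

# Time 100,000 matches of each pattern against the same line.
for name, pat in [("long", long_pat), ("short", short_pat)]:
    elapsed = timeit.timeit(lambda: pat.match(text), number=100_000)
    print(f"{name}: {elapsed:.4f}s")
```

On a single matching line the difference is tiny either way; the interesting comparisons come from non-matching input, where the number of literal "rules" determines how quickly a candidate position can be rejected.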

--
HTH,

Kevin Spencer
Microsoft MVP

DSI PrintManager, Miradyne Component Libraries:
http://www.miradyne.net
 

Jesse Houwing

Hello Chuck,
I have to disagree here. Trying it for myself will give me numbers but
not understanding, which is the ultimate goal.

The fault here is mine for not explaining adequately what I wanted.

In the case above I'm looking for the special case where there is 1
match per string for either Regex. For instance; the result of running
both examples above against "09/26/07 The quick brown fox ran away."

The date would probably take just as long for each regex. However, it
seems like it might be more efficient with the static characters to
search for a longer string than a shorter one (assuming that there was
no match embedded inside of another match). The reason for the
increase is that the pointer pointing to the head of the search would
move a greater length after a successful match.

A rule of thumb:
- the more variable-length parts (*, +, {1,3}, {1,}), the slower
- reluctant (lazy) modifiers (.*?, +?) may be slower
- the more look-arounds ((?<=, (?!, (?=, (?<!), the slower
- the more optional parts, the slower
- the more choices (alternations), the slower
- the longer the regex, the slower
- the more general parts (eg .*), the slower, *if* there is another part
after that
- a short constant part is marginally faster than a long constant part
- anchored regexes (^, $) are faster than unanchored ones
- CultureInvariant regexes are slower
- named captures are slower than unnamed ones
- if you don't use a group for capturing, it's faster to use (?:...) instead
of (...), unless you specify the ExplicitCapture option

Why?

Variable-length parts will try to capture as much as possible. Look at the
expression A.*B applied to the string "AB1234567890": the engine will first
match the A, then the .* will try to capture everything that remains,
"B1234567890". It then concludes the B can't match, gives the last character
back so the .* holds "B123456789", again concludes the B won't match, and
keeps giving characters back until the .* holds nothing and the B finally
matches, returning "AB". This is called backtracking, and you usually want to
prevent it as much as possible. If you know there should be a relatively
short distance between the A and the B, use a reluctant modifier (.*?), limit
the length to search (.{0,100}), or even make that reluctant (.{0,100}?). See
the note on reluctant modifiers in the next paragraph.
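The greedy match described above can be seen directly. A small sketch (in Python's `re` module for brevity; .NET's backtracking engine produces the same match):

```python
import re

# Greedy: .* first grabs "B1234567890" (everything after the A), then
# backtracks one character at a time until the trailing B can match.
m = re.match(r"A.*B", "AB1234567890")
assert m is not None
assert m.group() == "AB"  # .* ends up matching the empty string
```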

Reluctant modifiers are generally slower, *unless* you know there is supposed
to be a short distance between this part and the next. Why? Well, consider
A.*?B with the input "A1234567890B". It will try to match as follows: it
first matches the A, then checks whether the next character is a B. It isn't,
so the .*? captures the 1; now we're at "A1". The engine again checks whether
the next character is a B; it isn't, so the .*? captures the next character,
giving "A12". This repeats until it finally matches "A1234567890B". Now,
going back to the previous example: if the input string from there is used,
this expression is actually faster. Why? Well, consider the string
"AB1234567890". The engine matches the constant A first, then checks whether
a B follows. It does, so the result "AB" is returned immediately.
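Both cases from this paragraph, sketched in Python's `re` module (the match results carry over to .NET):

```python
import re

# Lazy: .*? starts by matching nothing and only grows when B fails to
# match at the current position - slow when B is far away.
m = re.match(r"A.*?B", "A1234567890B")
assert m.group() == "A1234567890B"

# When B is right next to A, the lazy version succeeds almost immediately,
# while the greedy A.*B would first run to the end and backtrack.
m = re.match(r"A.*?B", "AB1234567890")
assert m.group() == "AB"
```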

The same applies to look-arounds. When the engine finds a look-around, it
starts a new sub-expression search from that position - either in the
opposite direction (a look-behind) or continuing from the current position
and returning afterwards (a look-ahead). A look-around with a constant or
constant-length body (eg (?<=AB) or (?<=A.{8}B)) will be much faster than a
variable-length check, even more so if the variable-length check can itself
backtrack (eg (?<=A.*B), or (?<=A.*)). The more look-arounds, the more
sub-expressions need to be run. In some cases a look-around can make things
faster, or is the only way to get the correct result; but try to keep the
possible length of the match inside the look-around as short as possible.
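A small sketch of constant-length look-arounds (in Python's `re` module; the sample text and labels are invented for illustration). Notably, Python *requires* look-behinds to be fixed-width, which dovetails with the advice above to keep them constant-length:

```python
import re

text = "price: 100 USD, 250 EUR"

# Fixed-width look-behind: the number preceded by "price: ".
assert re.search(r"(?<=price: )\d+", text).group() == "100"

# Look-ahead: the number followed by " EUR".
assert re.search(r"\d+(?= EUR)", text).group() == "250"
```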

Optional parts make things slower. Why? Again, it has to do with
backtracking: there is a chance the engine needs to revert if the optional
part doesn't match. This gets worse if the optional part has a variable
length.

The more choices, the slower. (A|B|C|D|E|F|G) will try each option one after
the other, in the order you specify them. So if you have a large batch of
data to parse, it might make sense to see which alternatives occur more often
than others and put those up front. Also, if you have multiple alternatives
that begin with the same string, it's faster to factor the common prefix out:
(AAA|ABB|ACC|ADD) is slower than (A(AA|BB|CC|DD)), because the factored form
saves the engine from backtracking over the A. The same of course applies to
[abcdefg], which is the equivalent of the regex (?:a|b|c|d|e|f|g).
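The prefix-factoring trick above, sketched in Python's `re` module (non-capturing groups used so neither pattern stores anything):

```python
import re

# Unfactored: the engine re-tests the leading "A" for every alternative.
flat = re.compile(r"(?:AAA|ABB|ACC|ADD)")

# Factored: "A" is matched once, then only the two-character tails are tried.
factored = re.compile(r"A(?:AA|BB|CC|DD)")

# Both accept exactly the same strings.
for s in ["AAA", "ABB", "ACC", "ADD"]:
    assert flat.fullmatch(s) is not None
    assert factored.fullmatch(s) is not None
assert factored.fullmatch("AXX") is None
```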

The longer the regex, the slower. Well that seems logical doesn't it? If
you have two almost identical regexes, the longest one will be slower. Just
because there are more parts to evaluate.

The more general parts, the slower. This comes back to the first rule: [a-z]*
has a more limited set of characters it can match, so the chance that it
captures the whole input and has to backtrack all the way is smaller than
with .* or, even worse, [\s\S]*. (Why is [\s\S]* worse, you might ask? That
has to do with the fact that, by default, .* stops when it finds a newline.)
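The newline behavior is easy to demonstrate (in Python's `re` module; .NET's `.` behaves the same way outside Singleline mode):

```python
import re

text = "line one\nline two"

# By default, . does not match a newline, so .* stops at the line break.
assert re.match(r".*", text).group() == "line one"

# [\s\S] matches any character, newlines included, so it can consume
# (and potentially backtrack over) the entire input.
assert re.match(r"[\s\S]*", text).group() == text

# A restricted class like [a-z ]* can give up even sooner.
assert re.match(r"[a-z ]*", text).group() == "line one"
```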

A short constant part is faster than a long one, simply because it takes
marginally longer to compare 10 characters than 5. I'd ignore this: in some
cases a longer string can actually be faster, because the chances of finding
such a string are more remote, so there are fewer points in the input where
the regex might start and find a match - especially if the long constant is
at the front of the pattern.

Anchored regexes are faster. If you anchor a match either to the beginning
(^) or the end ($) of a string, it'll be much faster, though adding $ might
in some cases make no difference at all. The ^ simply ensures there are far
fewer places to start searching; thus it's faster.

CultureInvariant regexes are slower because in that case the engine must try
to match each character against its Unicode equivalents in other language
blocks. There might be several representations of the character i among all
the Unicode characters, so instead of searching for i it effectively searches
for a class like [iiiiiiiiii] (each character having a different Unicode
value but an equal representation).

If you use capturing, (....) will be faster than (?<a>....). I'm not exactly
sure why, but my guess is that the engine has to do more string comparisons,
as opposed to working with a simple numeric counter. Do note that for
maintainability and readability a named capture is much better, even though
it's slower.

Finally, if you're not using a capture, it's faster not to capture at all.
This can be achieved in two ways: 1. add ?: to the beginning of the group to
tell the engine it won't have to store the value it finds; 2. if your regex
doesn't use unnamed captures at all, set the ExplicitCapture option to tell
the engine it can ignore any non-named group for capturing. That saves you
from typing ?: at the start of each group, makes the parsing of your regex a
tiny bit faster, and keeps your regex more readable.
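Named, numbered, and non-capturing groups side by side, sketched in Python's `re` module. Note the syntax differences from .NET: Python writes named groups as `(?P<name>...)` rather than `(?<name>...)`, and has no equivalent of `RegexOptions.ExplicitCapture`; the date pattern is an invented example:

```python
import re

text = "2007-09-26"

# Named groups: readable, but slightly more bookkeeping for the engine.
named = re.match(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})", text)
assert named.group("year") == "2007"

# Non-capturing groups: grouping without storing anything.
plain = re.match(r"(?:\d{4})-(?:\d{2})-(?:\d{2})", text)
assert plain is not None
assert plain.groups() == ()  # nothing was captured
```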

So back to your problem:

Of all these rules, only the longer-constant-part one applies. And because
there is no very restrictive part after the constant, the shorter pattern
will probably win - if only marginally.
 
