Efficient regular expression pattern?


Steve B.

Hi,

I'm building an application that analyses a stream of URLs in order to detect
certain pages.
I have a very large list of regular expressions (up to several thousand) that
I have to check against every URL.

Each regular expression will be evaluated against every URL.

How can I write the code so that it is as efficient as possible?

My first idea is to create a List<Regex> that I populate once; each time
a URL comes in, I do a foreach (Regex r in regexes) over it.
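
A minimal sketch of that idea (the class and method names are just
placeholders):

using System.Collections.Generic;
using System.Text.RegularExpressions;

// Minimal sketch of the "list of patterns" idea described above.
class UrlScanner
{
    private readonly List<Regex> regexes = new List<Regex>();

    public void Load(IEnumerable<string> patterns)
    {
        foreach (string pattern in patterns)
            regexes.Add(new Regex(pattern));
    }

    // Returns true as soon as any pattern matches the URL.
    public bool IsSuspect(string url)
    {
        foreach (Regex r in regexes)
            if (r.IsMatch(url))
                return true;
        return false;
    }
}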

But I'm afraid this process will be quite slow, since URLs can arrive at up
to 10 per second.

I'd appreciate any comments or suggestions.

Thanks,
Steve
 

Barry Kelly

Steve B. said:
> I'm building an application that analyses a stream of URLs in order to
> detect certain pages.
> I have a very large list of regular expressions (up to several thousand)
> that I have to check against every URL.

Can you show us some of your regular expressions? Are many of them
simple text matches, or do they all involve wildcards? Are they using ()
which end up capturing when they don't need to (consider the
ExplicitCapture option)? Do you use the Compiled option?

It can be quite easy to write regular expressions with exponential
matching behaviour. For example, an RE which is supposed to match the
whole string and looks like ".*/foo/" will be much slower than
"^.*/foo/". Also, the current .NET regular expression matcher doesn't
perform some optimizations that many other matchers do.
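
For what it's worth, this is how those options are combined when the Regex
is constructed (the pattern itself is only an illustration, not one of your
actual expressions):

using System;
using System.Text.RegularExpressions;

class OptionsExample
{
    static void Main()
    {
        // ExplicitCapture: plain (...) groups no longer capture; only (?<name>...) groups do.
        // Compiled: the pattern is compiled to IL, trading start-up cost for faster matching.
        // The leading ^ anchors the match at the start of the URL.
        Regex r = new Regex(@"^http://[\w.-]+\.example\.com/",
                            RegexOptions.Compiled | RegexOptions.ExplicitCapture);

        Console.WriteLine(r.IsMatch("http://ads.example.com/banner.gif"));   // True
    }
}
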
> Each regular expression will be evaluated against every URL.
>
> How can I write the code so that it is as efficient as possible?

Profile it to find out what bits are slow. You might consider trying
each regular expression on a test set of URL data, and dumping out which
ones are matching slowest. That'll tell you which ones you need to
optimize.
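
A rough timing harness along those lines might look like this (the method
and variable names are made up for the example):

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Text.RegularExpressions;

static class RegexProfiler
{
    // Runs every pattern over a test set of URLs and prints the slowest ones.
    public static void DumpSlowest(IList<Regex> patterns, IList<string> testUrls, int top)
    {
        List<KeyValuePair<Regex, long>> timings = new List<KeyValuePair<Regex, long>>();

        foreach (Regex r in patterns)
        {
            Stopwatch sw = Stopwatch.StartNew();
            foreach (string url in testUrls)
                r.IsMatch(url);
            sw.Stop();
            timings.Add(new KeyValuePair<Regex, long>(r, sw.ElapsedTicks));
        }

        // Sort slowest-first and print the worst offenders.
        timings.Sort(delegate(KeyValuePair<Regex, long> a, KeyValuePair<Regex, long> b)
        {
            return b.Value.CompareTo(a.Value);
        });

        for (int i = 0; i < top && i < timings.Count; i++)
            Console.WriteLine("{0,12} ticks  {1}", timings[i].Value, timings[i].Key);
    }
}
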
> But I'm afraid this process will be quite slow, since URLs can arrive at up
> to 10 per second.

How slow is it currently?

I imagine you should be able to check many tens of thousands of regular
expressions per second.

-- Barry
 

Steve B.

The goal is to detect pages loaded from specific "malicious" sites.
Accordingly, most of the regexps will look like this:

^http://www\.spywarerulez\.com/

This is supposed to detect any resource on that host.

However, most of these shady companies use several subdomain names, so the
regexps also look like this:

^http://[A-Za-z0-9_.]+\.spywarerulez\.com/

I'm quite new to using Regex, so I'm not aware of the behaviour of all the
options.

For the moment I haven't tested the actual speed of this process, since I
don't yet have all the URLs. The app is self-learning, and the URL list will
grow once it goes into production and administrators add new addresses.

Thanks,
Steve
 

Barry Kelly

Steve B. said:
> For the moment I haven't tested the actual speed of this process, since I
> don't yet have all the URLs.

You should run some tests with URLs and regexes you make up yourself, to
find out (in ballpark figures) what you can expect.
> The app is self-learning, and the URL list will grow once it goes into
> production and administrators add new addresses.

You should probably include some self-monitoring too then, and
periodically log a dump of the "top ten" slowest REs, if the scanning
process does in fact turn out to be a bottleneck.

-- Barry
 

Kevin Spencer

Hi Steve,

Forgive me if I repeat anything you already know.

A regular expression is an expression of a set of rules that define a
"pattern" (or a pattern that defines a set of rules!).

It looks to me like you're trying to create SPAM detection rules, a most
admirable quest, and one which I am chipping away at during my own spare
time.

SPAM detection rules act as a filter, and if I'm not mistaken, you want to
filter out suspected SPAM emails.

For the greatest efficiency, you want to apply first the filter that removes
the most items, so that fewer items remain for the filters that follow. The
fewer the items left to filter, the faster the process.

Some filters will remove more than others. For example, let's say that
you're filtering on domain names. In your list, you identify the following:

^http://www.spywarerulez.com
^http://a.spywarerulez.com
^http://b.spywarerulez.com
^http://spywarerulez.com
^http://spywarerulez.net
^http://spywarerulez.org

Now, you could filter on all of them individually using several filters, but
if, for example, you know that any domain ending in "spywarerulez." plus a
top-level domain is bad, you could filter them all out with:

\b(?:http://)?\w*\.?spywarerulez\.\w+\b

(Note that this assumes that the URL begins and ends at a word boundary. You
would probably want to change that part)

So, what I would do is identify which regular expressions filter out the most
items. Apply those filters first, and then create a succession of more and
more specific filters.

One good technique for avoiding multiple filters is to use an OR (alternation)
expression, as in:

\b(?:http://)?\w*\.?(spywarerulez|myspyware|spammersdelight|getrichquick|suckeraminute)\.\w+\b

This would match any domain with any of the domain names that are ORed
together.
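
One way to build such an alternation from a plain list of names (the list
below is made up, and Regex.Escape simply guards against metacharacters in
the names):

using System;
using System.Text.RegularExpressions;

class AlternationExample
{
    // Builds a single alternation pattern from a list of bad domain names.
    static Regex BuildFilter(string[] badNames)
    {
        string[] escaped = new string[badNames.Length];
        for (int i = 0; i < badNames.Length; i++)
            escaped[i] = Regex.Escape(badNames[i]);

        string pattern = @"\b(?:http://)?\w*\.?(?:" + string.Join("|", escaped) + @")\.\w+\b";
        return new Regex(pattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);
    }

    static void Main()
    {
        Regex filter = BuildFilter(new string[] { "spywarerulez", "myspyware", "spammersdelight" });
        Console.WriteLine(filter.IsMatch("http://a.spywarerulez.com/index.html"));   // True
    }
}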

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Chicken Salad Alchemist

A lifetime is made up of
Lots of short moments.

 

Steve B.

Thanks for your advice.

I'll take a deeper look and try to measure the execution time of each match
test.

Steve
 

Steve B.

The suspect URLs will be stored and used by multiple systems.
The first "client app" will be a proxy that intercepts the HTTP request and
replaces it with a warning message (so even if the user tries to display a
picture in Outlook, no request is actually sent to the server, and if a link
is clicked in an email, an IM, or anywhere else, the page won't load if it is
forbidden).

An email filter is not planned for now, but it would be a great idea to build
on the same list of malicious URLs.

Thanks,

Steve

 

Nick Malik [Microsoft]

You probably don't need regular expressions to do what you are doing.

Regular expressions are a generic pattern-matching system that uses a fairly
efficient matching algorithm to execute a compiled pattern. However, you pay
a price for the fact that it is generic and that patterns must be compiled.
It is better to use regular expressions when you have a small number of
patterns and a single long target string, or many target strings. It may not
be a good idea to use regular expressions when you have thousands of patterns
to match against thousands of target strings. Use regular expressions... just
not in this way.

URLs have a very regular structure. As such, you can use a single regular
expression to extract the domain name from each incoming URL. Then you can
use simple parsing and matching to determine whether that domain name
matches. If it does, you can look at the subdomains.

So, first create a hit list containing the domain names you are looking for,
followed by a list of subdomain rules you want to apply.

Step 1: take http://s1234563.fubar.com/cx/url=http://www.ebay.com and apply a
regex to get "s1234563" and "fubar.com".
Step 2: match directly against "fubar.com"; if there is no hit in your hit
list, you are done.
Step 3: if you do hit a domain, look at the list of subdomains to see whether
one is specifically targeted. If not, all of them are 'bad'.
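
A rough illustration of steps 1 and 2 might look like the following; the
parsing regex and the hit-list entries are only assumptions for the example,
not the exact layout I have in mind:

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class DomainHitList
{
    // One compiled regex pulls the host out of the URL (step 1); simple string
    // work reduces it to the registered domain, and a hash lookup replaces
    // thousands of per-pattern regex tests (step 2).
    static readonly Regex HostParser = new Regex(
        @"^https?://(?<host>[^/:?#]+)",
        RegexOptions.Compiled | RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase);

    // Made-up hit list for the example.
    static readonly Dictionary<string, bool> BadDomains = new Dictionary<string, bool>(
        StringComparer.OrdinalIgnoreCase);

    static DomainHitList()
    {
        BadDomains["fubar.com"] = true;
        BadDomains["spywarerulez.com"] = true;
    }

    static bool IsSuspect(string url)
    {
        Match m = HostParser.Match(url);
        if (!m.Success)
            return false;

        // Reduce "s1234563.fubar.com" to its last two labels: "fubar.com".
        string[] labels = m.Groups["host"].Value.Split('.');
        if (labels.Length < 2)
            return false;
        string domain = labels[labels.Length - 2] + "." + labels[labels.Length - 1];

        return BadDomains.ContainsKey(domain);
    }

    static void Main()
    {
        Console.WriteLine(IsSuspect("http://s1234563.fubar.com/cx/url=http://www.ebay.com"));   // True
    }
}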

Hope this helps,

--
--- Nick Malik [Microsoft]
MCSD, CFPS, Certified Scrummaster
http://blogs.msdn.com/nickmalik

Disclaimer: Opinions expressed in this forum are my own, and not
representative of my employer.
I do not answer questions on behalf of my employer. I'm just a
programmer helping programmers.
 

Steve B.

Your solution seems quite effective, but there is a scenario where it cannot
apply:

It will probably work well for domain filtering, but if I want to exclude a
specific page or set of sub-pages, it won't work.

Let's imagine, for example, a web-site hosting provider:
http://www.anygoodprovider.com/asiteofmalicioususer/spywareeldoradodownloads.htm

In this case, the domain is OK, but the page is not.

I'm thinking about an "IUrlFilter":

interface IUrlFilter {
    bool TestUrl(string urlToTest);
}

This interface will be implemented twice:

class HostFilter : IUrlFilter

class PageFilter : IUrlFilter

I would first test against the hostname (which will probably cover 95% of the
tests), and if the host is OK, also test against the page (with a much
smaller number of tests to execute) using the regex method, along the lines
of the sketch below.
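
Only a sketch of the chaining idea; the allowed/blocked convention of TestUrl
and the FilterChain class are assumptions, and the two concrete filters are
left as stubs:

using System.Collections.Generic;

// (Repeating the interface here so the sketch stands on its own.)
interface IUrlFilter
{
    // Assumed convention: true = URL is allowed, false = URL is blocked.
    bool TestUrl(string urlToTest);
}

class FilterChain : IUrlFilter
{
    private readonly List<IUrlFilter> filters = new List<IUrlFilter>();

    public FilterChain(params IUrlFilter[] filters)
    {
        this.filters.AddRange(filters);
    }

    public bool TestUrl(string urlToTest)
    {
        // Filters run in order: the cheap host check first, the page-level
        // regex check only for URLs whose host passes.
        foreach (IUrlFilter f in filters)
            if (!f.TestUrl(urlToTest))
                return false;
        return true;
    }
}

// Usage, once HostFilter and PageFilter exist:
//   IUrlFilter filter = new FilterChain(new HostFilter(...), new PageFilter(...));
//   bool allowed = filter.TestUrl(incomingUrl);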

I suppose this approach would also help with the extensibility of the system...

Do you think I'm on the right track?

Thanks,
Steve
 

Nick Malik [Microsoft]

Steve B. said:
> Your solution seems quite effective, but there is a scenario where it
> cannot apply:
>
> It will probably work well for domain filtering, but if I want to exclude
> a specific page or set of sub-pages, it won't work.
>
> Let's imagine, for example, a web-site hosting provider:
> http://www.anygoodprovider.com/asiteofmalicioususer/spywareeldoradodownloads.htm
>
> In this case, the domain is OK, but the page is not.


Yes, but you get a hit on the domain, so you proceed to the page. If you make
the domain matching efficient, you make 95% of your matching efficient.
> I'm thinking about an "IUrlFilter":
>
> interface IUrlFilter {
>     bool TestUrl(string urlToTest);
> }
>
> This interface will be implemented twice:
>
> class HostFilter : IUrlFilter
>
> class PageFilter : IUrlFilter

I disagree. I wouldn't use your mechanism for either method.
> I would first test against the hostname (which will probably cover 95% of
> the tests), and if the host is OK, also test against the page (with a much
> smaller number of tests to execute) using the regex method.

Regex can be used not only to match, but also to parse. Use it to parse,
but then match using a data structure.

Use a regex to return a list of terms separated by slashes. Create a tree
structure of all of the offending pages by URL. I'm willing to bet that 80%
of the hits against an offending page will be consumed by less than 5% of the
root elements and less than 30% of the first-level elements. In other words,
virus sites will have many virus pages, and hosting providers that don't care
about virus hosting will have many, many virus sites.

Allow for aliases by building your tree so that references from multiple
roots can point to a single subtree. This covers the fact that DNS routinely
allows nearly unlimited aliases.

hostingprovider.com
    subdomain: abc, jbx, sand, zx*
    dir: foo
    dir: bar
        page: viruspage1.html
        page: viruspage2.html
        page: viruspage3.html
    dir: fudge
        page: viruspage4.html
hostalias.com -- reference to hostingprovider.com
127.198.41.77 -- reference to hostingprovider.com

One regex can parse the entire URL into parts: domain, subdomain, a list of
directories, and a page reference, followed by a list of parameters. So you
match once against every URL, take the parts list, and run it through the
tree you create. Simple matching will catch every hit, and it scales well.
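
A very small sketch of that tree idea, using nested dictionaries; the
segment-splitting and the sample entries are assumptions for the example:

using System;
using System.Collections.Generic;

class UrlTrieNode
{
    // A node flagged Terminal means "everything at or below this prefix is bad".
    public bool Terminal;
    public readonly Dictionary<string, UrlTrieNode> Children =
        new Dictionary<string, UrlTrieNode>(StringComparer.OrdinalIgnoreCase);

    // Adds one blacklisted URL, already split into segments (host, dirs, page).
    public void Add(string[] segments, int index)
    {
        if (index == segments.Length) { Terminal = true; return; }
        UrlTrieNode child;
        if (!Children.TryGetValue(segments[index], out child))
        {
            child = new UrlTrieNode();
            Children[segments[index]] = child;
        }
        child.Add(segments, index + 1);
    }

    // Walks the segments of an incoming URL; a hit needs at most one
    // dictionary lookup per segment, however long the blacklist is.
    public bool Matches(string[] segments, int index)
    {
        if (Terminal) return true;
        if (index == segments.Length) return false;
        UrlTrieNode child;
        return Children.TryGetValue(segments[index], out child)
            && child.Matches(segments, index + 1);
    }
}

For example, root.Add(new string[] { "hostingprovider.com", "bar",
"viruspage1.html" }, 0) records one bad page, and root.Matches(parsedSegments,
0) answers an incoming URL; the regex (or Uri/String.Split) only has to run
once per URL to produce parsedSegments.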

Let's say that you have 10,000 URLs in your list, and that 80% of them share
the same 100 domains. That leaves 20% with a smattering of other domains;
I'll swag that down to about 1,000 domains. Let's say you create a tree
structure like the one above, but the root is a sorted list of those 1,000
domains, and you use my method with a binary search against the domain name.
You can find the domain hit in about ten comparisons or fewer. Then, if your
directory and page lists follow the same pattern, you are likely to find your
match or clear the URL in fewer than ten more comparisons. That's 20
comparisons. Add another 10,000 URLs to the list and the number of
comparisons goes up... to 21.

With your method, the first scenario requires 10,000 matches. The second
requires 20,000 matches. It scales EXCEEDINGLY badly.

Please, Steve, use a data structure for this mechanism. The fundamental
concept is a suffix tree, which is a specialized form of a trie. Note that
you could use a pure trie structure, especially a Patricia trie, if you are
clever and want REALLY fast performance, but that may require some
programming chops. You can find a good write-up on trie structures on
Wikipedia: http://en.wikipedia.org/wiki/Trie
You could also use a Directed Acyclic Word Graph, which will give you very
fast performance as well. To be honest, I may have described a compact DAWG
above.

Some other links:
http://www.nist.gov/dads/HTML/suffixtree.html
http://www.nist.gov/dads/HTML/directedAcyclicWordGraph.html

--
--- Nick Malik [Microsoft]
MCSD, CFPS, Certified Scrummaster
http://blogs.msdn.com/nickmalik

Disclaimer: Opinions expressed in this forum are my own, and not
representative of my employer.
I do not answer questions on behalf of my employer. I'm just a
programmer helping programmers.
