Tokenizing a large buffer

Jonathan Sion

Hi,
I have a large text string and a bunch of regular expression patterns
I need to find within it; in other words, I want to find all the
'tokens' inside it. Of course I could use the regex engine; the thing
is, if I have 500 different patterns, that means 500 searches over the
buffer. Does anybody have a general idea of how this could be sped
up? In other words, I am looking to do what the old 'lex' tool in Unix
used to do: one pass over the string, finding all patterns and making
them tokens. What I need is not THAT complex, so I am wondering
whether .NET regex could search for many patterns in a single pass.

And if the answer is no, is there any lex implementation in .NET? :)

Thanks
Jonathan
 
Peter Duniho

I have a large text string and a bunch of regular expression patterns
I need to find within it; in other words, I want to find all the
'tokens' inside it. Of course I could use the regex engine; the thing
is, if I have 500 different patterns, that means 500 searches over the
buffer. Does anybody have a general idea of how this could be sped
up? In other words, I am looking to do what the old 'lex' tool in Unix
used to do: one pass over the string, finding all patterns and making
them tokens. What I need is not THAT complex, so I am wondering
whether .NET regex could search for many patterns in a single pass.

And if the answer is no, is there any lex implementation in .NET? :)

I don't know how lex was implemented. And I don't know whether a state
machine is the best way to solve the problem. But I do know that it's a
reasonable way to solve the problem, and that I wrote a simple
implementation a while ago and posted it here. You can see it in this post:
http://groups.google.com/group/microsoft.public.dotnet.languages.csharp/msg/0f06f696d4500b77

I see some things in the implementation that I'd probably do differently
if I were doing it again today, but it ought to work. Or at least
something like it.

For what it's worth, just how large is this "large text string"? And how
frequently do you need to do this? If this is something that your code
needs to do over and over on a frequent basis, optimizing the
implementation would be useful. But I'd guess that 500 searches on even a
100K string or so wouldn't take that long, just to do it once.

There's some value in using the brute-force method, as it will keep the
code a _lot_ simpler. I wouldn't worry about the performance unless you
have a good reason to believe it will be a problem.

Hope that helps.

Pete
 
Samuel R. Neff

If you generate the regexes as compiled (constructor parameter) and cache
them in memory, and make sure that each regex is anchored to the
beginning of the search string, then I don't think looping through all
the regexes as you move through the text will perform badly at all.
Since the regexes have to match at the start of the text, they can
all fail very fast if they don't match.

In short, I would first do some testing to make sure you're not trying
to solve a problem that doesn't exist.

HTH,

Sam
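Sam's anchored-loop idea can be sketched as follows. Python is used here only for illustration (the thread's .NET equivalent would anchor with `\G` or `^` and match at the current offset); the token names and patterns are invented, not taken from Jonathan's actual grammar:

```python
import re

# Hypothetical token definitions; names and patterns are illustrative only.
TOKEN_DEFS = [
    ("NUMBER", re.compile(r"\d+")),
    ("IDENT",  re.compile(r"[A-Za-z_]\w*")),
    ("OP",     re.compile(r"[+\-*/=]")),
    ("WS",     re.compile(r"\s+")),
]

def tokenize(text):
    """Walk the buffer once; at each position, try every pattern anchored
    there (match() only succeeds at the given offset, so mismatches fail fast)."""
    pos, tokens = 0, []
    while pos < len(text):
        for name, pattern in TOKEN_DEFS:
            m = pattern.match(text, pos)   # anchored at pos
            if m:
                if name != "WS":           # drop whitespace tokens
                    tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            raise ValueError("no token matches at offset %d" % pos)
    return tokens
```

The key point is that each pattern is only ever asked "do you match *right here*?", so a non-matching pattern rejects almost immediately instead of scanning the whole buffer.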
 
Peter Duniho

If you generate the regexes as compiled (constructor parameter) and cache
them in memory, and make sure that each regex is anchored to the
beginning of the search string, then I don't think looping through all
the regexes as you move through the text will perform badly at all.
Since the regexes have to match at the start of the text, they can
all fail very fast if they don't match.

In short, I would first do some testing to make sure you're not trying
to solve a problem that doesn't exist.

In addition, how long can a Regex pattern string be? Assuming there's no
practical limit -- that is, you can put any arbitrary pattern there -- then
since Regex supports alternation ("or") in the search pattern, you could
just have a single search pattern string with all of the tokens in it.

So rather than looping with multiple searches using Regex, just loop on
the tokens in creating the search string. Then let Regex do all the hard
work.

Would this be as fast or faster than the state graph? I don't know...it
depends on whether the Regex authors put some effort into optimizing that
case. I don't know enough about Regex (implementation _or_ API :) ) to
have an answer to that. But even if they didn't, obviously Sam and I
agree that the simpler code is better as long as there's no direct
evidence that performance is actually going to be an issue.

Pete
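Pete's single combined pattern might look something like this; the sketch is in Python (in .NET you would use `(?<NAME>...)` groups and check `Match.Groups` to see which alternative hit), and the token names and patterns are made up for illustration:

```python
import re

# One master pattern: the token alternatives are OR-ed together,
# with a named group per token so we can tell which alternative matched.
master = re.compile(
    r"(?P<NUMBER>\d+)"
    r"|(?P<IDENT>[A-Za-z_]\w*)"
    r"|(?P<OP>[+\-*/=])"
)

def scan(text):
    # One pass over the buffer; m.lastgroup names the alternative that won.
    return [(m.lastgroup, m.group()) for m in master.finditer(text)]
```

Note that alternation tries its branches left to right at each position, which matters when patterns overlap: the first listed alternative that matches wins.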
 
Jonathan Sion

Hi Pete,
I thought about OR-ing them all into a single pattern. That would
certainly improve performance somewhat, but the problem is, many of the
regular expressions overlap, and when you run a combined pattern, it
stops on the first alternative found, while I need all of them. Right
now, I have around 500 regex patterns (one for each token), and while
some could be OR-ed together and would save me some time, some cannot.
I guess lex was created for a reason, and I thought I'd ask here whether
anybody knows of a .NET implementation.
For those who asked whether performance really is a problem: good
question, and the answer is yes. Running hundreds of search patterns
over a buffer can take 2-3 minutes, and I have plenty of such buffers.
(Basically, I am running over all stored procedures in a database to
find errors in code; I am building a T-SQL compiler.) So running over an
entire database, if it's large, means kicking it off and going for a
drink every time. I will proceed with Pete's advice; I am sure that
could help performance quite a bit, though like I said, I don't think
it's the ultimate solution.
thanks,
J
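The overlap Jonathan describes (a token that is also a valid instance of another token, e.g. a keyword that looks like an identifier) can often be handled inside a single combined pattern by ordering the alternatives, since alternation picks the first branch that matches at a position. A Python sketch of the idea; the keywords and patterns are invented stand-ins, not his actual T-SQL grammar:

```python
import re

# More specific alternative first: KEYWORD is tried before IDENT, and the
# \b boundaries stop it from swallowing the front of a longer identifier.
tsql = re.compile(
    r"(?P<KEYWORD>\b(?:BEGIN|END|SELECT)\b)"
    r"|(?P<IDENT>[A-Za-z_]\w*)",
    re.IGNORECASE,
)

def classify(text):
    return [(m.lastgroup, m.group()) for m in tsql.finditer(text)]
```

This doesn't cover every kind of overlap (patterns that match different-length spans at the same position need a longest-match rule, which is what lex provides), but it handles the common keyword-vs-identifier case.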
 
Jon Skeet [C# MVP]

For those who asked whether performance really is a problem: good
question, and the answer is yes. Running hundreds of search patterns
over a buffer can take 2-3 minutes, and I have plenty of such buffers.

When you say "could take 2-3 minutes" do you mean you've measured it
with a particular set of search patterns?

Jon
 
Jonathan Sion

Yes.
It can take something like 2 minutes for a single stored procedure
(a really large one), and when it comes to a database of 700 procedures,
it means going out for an adult beverage every time I run it :)
 
Jon Skeet [C# MVP]

It can take something like 2 minutes for a single stored procedure
(a really large one), and when it comes to a database of 700 procedures,
it means going out for an adult beverage every time I run it :)

Are you already precompiling the regular expressions? How complicated
are the expressions?

Jon
 
Samuel R. Neff

2 minutes for a single procedure is way too much. I suggest you post
some sample code and a sample procedure. We don't need all 500
expressions, just 2-3 representative ones and 2-3 representative
procedures, as well as the relevant code that builds the regexes and
runs the loop.

Sam
 
Jonathan Sion

Hi,
I am sorry for the delay in my response. Here is a code sample. I have
my object oTokBase, which I defined; it simply holds a regular
expression pattern (in the TokenDef string property) and also holds
the options that I pass to the .NET RE engine (case sensitivity,
multiline, etc.). This is the iteration, and I have about 500
patterns. I have to plead ignorance when asked whether it's 'compiled'
or not; is it worth reading about? (i.e., will it help performance?)
Also, I apologize for the VB.NET code in a C# group... the code is very
similar, and I came to this group knowing that the more serious
programmers would be here :)

here's the code, Thanks!
--------------------------------------------------------

For Each oTokBase In Me.m_TokDefs

    oTokDef = oTokBase
    'build the regex for this token definition:
    Dim oRegex As Regex = New Regex(oTokDef.TokenDef, oTokDef.RegExpOptions)
    'run the search:
    Dim mc As MatchCollection = oRegex.Matches(Me.m_sBuffer)
    If mc.Count > 0 Then
        'aha! found:
        Dim m As Match
        Dim iCharPosInLine As Integer, iLine As Integer
        For Each m In mc
            Dim oToken As New Token(oTokDef.TokenDef, m.Value, oTokDef.TokenID)
            oToken.Position.AbsolutePos = m.Index
            'translate absolute index into line/column:
            Me.GetPosInLine(m.Index, iCharPosInLine, iLine)
            oToken.Position.CharPosInLine = iCharPosInLine
            oToken.Position.Line = iLine
            Me.TokenPositions.Add(oToken)
        Next
    End If
Next
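One thing that jumps out of the loop above: a new Regex is constructed for every token definition on every buffer. Compiling each pattern once up front and reusing it across buffers (plus RegexOptions.Compiled, which is what Jon was asking about) is the usual fix. A rough Python sketch of that shape, with the patterns invented and GetPosInLine approximated by newline counting:

```python
import re

# Compiled once at startup, reused for every buffer.  The two patterns
# here are placeholders, not the poster's real token definitions.
PATTERNS = [re.compile(p) for p in (r"\d+", r"[A-Za-z_]\w*")]

def find_all(buffer):
    tokens = []
    for pattern in PATTERNS:                 # no per-buffer recompilation
        for m in pattern.finditer(buffer):
            # line/column from the absolute index, like GetPosInLine:
            line = buffer.count("\n", 0, m.start()) + 1
            col = m.start() - (buffer.rfind("\n", 0, m.start()) + 1)
            tokens.append((m.group(), m.start(), line, col))
    return tokens
```

With 500 patterns and hundreds of procedures, hoisting the compilation alone can be a significant saving before any smarter single-pass scheme is attempted.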
 
Ben Voigt [C++ MVP]

Jonathan Sion said:
Hi,
I have a large text string and a bunch of regular expression patterns
I need to find within it; in other words, I want to find all the
'tokens' inside it. Of course I could use the regex engine; the thing
is, if I have 500 different patterns, that means 500 searches over the
buffer. Does anybody have a general idea of how this could be sped
up? In other words, I am looking to do what the old 'lex' tool in Unix
used to do: one pass over the string, finding all patterns and making
them tokens. What I need is not THAT complex, so I am wondering
whether .NET regex could search for many patterns in a single pass.

And if the answer is no, is there any lex implementation in .NET? :)

Google gives: http://sourceforge.net/projects/csflex/
 
