Search a file and find all occurences

G

Guest

Hi All,

I have a process where I'd like to search the contents of a file(in a dir)
for all occurences (or the count of) of a given string. My goal is to focus
more on performance, as some of the files could be upwards of 25mb in size
and time is important. I don't want to take the route of loading the text of
the file into a giant string and searching it, but would rather focus on a
performance-minded solution. Any sugesstions for a performance - oriented
"find all occurences" type of search against a file?

Thanks in advance!
 
B

Bruce Wood

Dameon said:
Hi All,

I have a process where I'd like to search the contents of a file(in a dir)
for all occurences (or the count of) of a given string. My goal is to focus
more on performance, as some of the files could be upwards of 25mb in size
and time is important.
I don't want to take the route of loading the text of the file into a giant string and
searching it, but would rather focus on a performance-minded solution.

First of all you have to get clear in your mind what your requirements
are. First you state that you want a performant (for real time)
solution, then you state that you don't want to load the text of the
file into a string and search it. Well, if the file is up to 25Mb then
that wouldn't be feasible... nonetheless, are you concerned with time
or memory? Often you have to trade one to get the other. Where along
this scale are you:

1. Care about speed and memory consumption be damned.
2. Want speed without using unnecessarily large amounts of memory.
3. Want a compact footprint and you're willing to sacrifice some speed.
4. Want a compact footprint and speed be damned.

Obviously you're not near the bottom of the list, but are you willing
to pay for speed with memory? That may help limit possible solutions.
Any sugesstions for a performance - oriented "find all occurences" type of search against a file?

I don't have time to search for the code or the precise algorithm, but
I believe that what you want is a state-machine driven model. For a
search string you can build a state machine that indicates what to do
at each mismatch point. Then you can read the file as a sequence of
bytes and apply the state machine to the stream.

As I said, I don't have time to search for the code, but here is an
idea. Let's say you're searching for "abracadabra" in the stream
"abracabracadabra":

1. a matches a. Move along.
2. b matches b. Move along.
3. r matches r. Move along.
4. a matches a. Move along.
5. c matches c. Move along.
6. a matches a. Move along.
7. b does not match d. At this point, the state machine knows that you
don't have to back up all the way to the preceding "b". You can back up
to the 4th check, the "a" and start comparing again.
....and so on...

The idea is that you pre-process the search string into a state
machine. The states and transitions are unique to the search string and
allow you to avoid re-checking characters you don't have to. The
technique becomes more effective the longer the search string.

Can anyone point the OP to a link for this algorithm? It's an oldie...
I'm sure there are examples out there.
 
B

Bruce Wood

Bruce said:
Can anyone point the OP to a link for this algorithm? It's an oldie...
I'm sure there are examples out there.

I found it. It's called the Boyer-Moore Algorithm, and I had it
backward... it tests from the end of the search string to the start.
It's the Knuth-Pratt-Morris Algorithm that searches in the way I
described.

You can just Google those names. I found this page with some examples
of the algorithms in action:

http://www.cs.utexas.edu/users/moore/best-ideas/string-searching/index.html
 
B

Bruce Wood

Dameon said:
Any advice on how to implement Boyer-Moore in C# for the following scenario:

1. I have an automated process which needs to add a closing tag to a file
2. The close tag cannot be added until the file has X amount of occurences
of a specified string
3. The automated process will call methodX to check for X amount of
occurences in the specified file
4. When the correct # of occurences are found, the closing tag will be added.

The first question is, how many of the files are likely to need the
closing tag added?
1. Most of them
2. A few of them

This governs your solution. If the answer is "a few of them" then you
can tolerate an algorithm that first has to read the file to see if it
is a match, and then has to read it again to actually append the tag.
This is easier. If the answer is "most of them" then you have to search
and append the tag all in one operation, which is harder.
Unfortunately, there's no free lunch, either: if you choose the second,
harder algorithm, then its performance will suck if you have to append
to only a few files. If you choose the first, easier algorithm, then
its performance will suck if you have to append to most files.

In either case, in order to do the search, I would create an array of
characters in memory that is the same size as or larger than the search
string. I would then wrap it in a class that gives you "circular
buffer" semantics with indices that work backward (where index 0 is the
last character you just read, 1 is the one before that, 2 is the one
before that, etc):

public class CircularBuffer
{
private char[] _buffer;
private int _length;
private int _zeroIndex;

public CircularBuffer(int capacity)
{
this._buffer = new char[capacity];
this._length = 0;
this._zeroIndex = 0;
}

public void Add(char newChar)
{
this._zeroIndex -= 1;
if (this._zeroIndex < 0)
{
this._zeroIndex += this._buffer.Length;
}
this._buffer[this._zeroIndex] = newChar;
if (this._length < this._buffer.Length)
{
this._length += 1;
}
}

public char this[int index]
{
if (index < 0 || index >= this._length)
{
throw new IndexOutOfRangeException(...);
}
int trueIndex = this._zeroIndex + index;

// Yes, one could use a modulus operator to do this, but I
would expect
// a test followed by a subtraction to be slightly faster than
a modulus,
// which requires an integer division operation.
if (trueIndex > 0)
{
trueIndex -= this._buffer.Length;
}
return this._buffer[trueIndex];
}
}

Basically what this gives you is a buffer that holds the last
"capacity" characters that you have read from a stream. As you read
each character you can "Add" it to the buffer, which makes one
character "fall off" the "start" of the array to make room for the new
character.

With this, it should be easy enough to program Boyer-Moore: you have a
rolling record of enough characters from the stream that you can move
far enough backward through the stream to make the required
comparisons.

Now the only problem is adding the closing tag.

Here the only question is whether you want to be writing an output file
as you are doing Boyer-Moore, and throw away the copied file if it
turns out that the file fails the test, or whether you want to do
Boyer-Moore first and then, if the file passes the test, read the file
again, copying it into a new file and appending the closing tag.

In the first case, you will waste time writing files you don't need if
the Boyer-Moore test eventually reveals that the file isn't a match. In
the second case, you have to read the file again if the test reveals
that the file is a match. So, if few files will be a match, then the
second algorithm (read to test / read again to copy) is obviously more
performant. On the other hand, if most files will be a match, then the
first algorithm (always write new file / delete if not needed) will be
more performant. This first algorithm is, by the way, slightly easier
to write, as you can write two completely separate bits of code, each
of which has one job to do. However, it really depends upon your mix of
files which is the best way to do it.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top