Fastest way to search text file for string

Julie

What is the *fastest* way in .NET to search large on-disk text files (100+ MB)
for a given string?

The files are unindexed and unsorted, and for the purposes of my immediate
requirements, can't be indexed/sorted.

I don't want to load the entire file into physical memory; memory-mapped files
are ok (and preferred). Speed/performance is a requirement -- the target is to
locate the string in 10 seconds or less for a 100 MB file. The search string
is typically 10 characters or less. Finally, I don't want to spawn out to an
external executable (e.g. grep), but include the algorithm/method directly in
the .NET code base. For the first rev, wildcard support is not a requirement.

Thanks for any pointers!
 
Hermit Dave

I would suggest that you have a look at the Regex implementation. I think
Regex is the fastest when it comes to scanning.
You might need to use a FileStream to load the file, so I don't think it's
the most appropriate answer.
Anyway, make a local copy of one of those files and give Regex a try; see if
it comes anywhere near the 10-second mark.
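
For what it's worth, a rough sketch of that experiment (reading line by line
with a StreamReader so the whole file never sits in memory; escaping the
needle so Regex treats it as a literal is an assumption):

using System.IO;
using System.Text.RegularExpressions;

class RegexScan
{
    // Returns the 1-based line number of the first match, or -1.
    static int FindFirst(string path, string needle)
    {
        // Escape the needle so it is matched literally, not as a pattern.
        Regex re = new Regex(Regex.Escape(needle), RegexOptions.Compiled);
        using (StreamReader reader = new StreamReader(path))
        {
            string line;
            int lineNo = 0;
            while ((line = reader.ReadLine()) != null)
            {
                lineNo++;
                if (re.IsMatch(line)) return lineNo;
            }
        }
        return -1;
    }
}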

--

Regards,

Hermit Dave
(http://hdave.blogspot.com)
 
Drebin

I wouldn't spend any more time on seeing *if* you can do this, until you find
out *why* you have to do this!

A 100 MB flat file?? This is exactly the reason why relational databases were
made and are still used for just about everything. Without knowing more
about your app, I'd rather take the 2 minutes to load this into a SQL table,
build an index - and then what you want to do suddenly becomes quick
(sub-second), simple, and will support wildcards later. Maybe bulk-load your
file at night - and have your front-end hit the database during the day?
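
For illustration only, a hypothetical one-off load along those lines (the
table, column, and file path are made up, and the table must already exist;
note a B-tree index only speeds prefix searches -- a contains-style
LIKE '%...%' would still scan):

using System.Data.SqlClient;

class NightlyLoad
{
    // Hypothetical sketch: bulk-insert the flat file into a one-column
    // table, then index that column.
    static void Load(string connectionString)
    {
        using (SqlConnection conn = new SqlConnection(connectionString))
        {
            conn.Open();
            string sql =
                "BULK INSERT dbo.LabLines FROM 'C:\\data\\instrument.txt' " +
                "WITH (ROWTERMINATOR = '\\n'); " +
                "CREATE INDEX IX_LabLines ON dbo.LabLines (LineText);";
            using (SqlCommand cmd = new SqlCommand(sql, conn))
            {
                cmd.CommandTimeout = 600; // loading 100+ MB can take minutes
                cmd.ExecuteNonQuery();
            }
        }
    }
}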

I don't think you will be happy with just about any solution. Every response
you will get to this is either going to be way too slow -or- way too
complicated. You're re-inventing the wheel!!

My $.02
 
cody

Just load the text file into a large string and use string.IndexOf(); this
should be even faster than Regex.
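
A minimal sketch of that suggestion (note it pulls the whole 100+ MB file
into one string, which the original post wanted to avoid):

using System.IO;

class WholeFileSearch
{
    // Read everything, then let string.IndexOf do the work.
    static int Find(string path, string needle)
    {
        string text;
        using (StreamReader reader = new StreamReader(path))
        {
            text = reader.ReadToEnd();
        }
        return text.IndexOf(needle); // -1 if not found
    }
}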
 
John Timney (Microsoft MVP)

Given the size of the file, probably the only way would be to use a
file byte array: load the file in as bytes, one at a time, and convert them to
chars by creating an indexer. You will need to work out a way of checking
whether a series of chars makes up the string you are looking for.

I would start here.

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/csref/html/vcwlkindexerstutorial.asp

However, I very much doubt you will manage to scan 100 meg of data in 10
seconds.
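
For what it's worth, a minimal sketch of the buffered byte-scanning idea (the
1 MB buffer size and the naive inner comparison are assumptions; the carried
overlap keeps a match that straddles a buffer boundary from being missed):

using System.IO;

class BufferedByteScan
{
    // Scans the file through a fixed-size buffer instead of loading it whole.
    static long Find(string path, byte[] pattern)
    {
        if (pattern.Length == 0) return 0;
        byte[] buffer = new byte[1 << 20]; // 1 MB chunks; tune as needed
        using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            long basePos = 0; // file offset of buffer[0]
            int kept = 0;     // overlap bytes carried over from the last chunk
            int read;
            while ((read = fs.Read(buffer, kept, buffer.Length - kept)) > 0)
            {
                int filled = kept + read;
                int last = filled - pattern.Length;
                for (int i = 0; i <= last; i++)
                {
                    int j = 0;
                    while (j < pattern.Length && buffer[i + j] == pattern[j]) j++;
                    if (j == pattern.Length) return basePos + i;
                }
                // carry the tail forward so boundary matches survive
                kept = pattern.Length - 1;
                if (kept > filled) kept = filled;
                for (int k = 0; k < kept; k++)
                    buffer[k] = buffer[filled - kept + k];
                basePos += filled - kept;
            }
        }
        return -1; // not found
    }
}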

--
Regards

John Timney
Microsoft Regional Director
Microsoft MVP
 
Julie

Drebin said:
I wouldn't spend any more time on seeing *if* you can do this, until you find
out *why* you have to do this!

All requirements have been defined at this point by project management. This
isn't just a blind decision, but the result of examining the domain and the
expected results.
A 100 MB flat file?? This is exactly the reason why relational databases were
made and are still used for just about everything. Without knowing more
about your app, I'd rather take the 2 minutes to load this into a SQL table,
build an index - and then what you want to do suddenly becomes quick
(sub-second), simple, and will support wildcards later. Maybe bulk-load your
file at night - and have your front-end hit the database during the day?

Yes, 100+ MB flat files. These are loosely formatted datafiles from external
laboratory instruments.

Remember, proper implementation dictates that you implement what is required,
nothing more. The current requirements are simple access & management of these
text files, with immediate searching averaging 10 seconds or less.
WinGrep accomplishes this in 6 seconds on our target system (1.3 GHz, 500 MB
RAM).

Future requirements *may* dictate that additional time constraints be imposed,
which would then lead to db or other external indexing of the files. But that
will be implemented when and if necessary. If you have questions about this
approach, you may want to look into the (industrial) extreme programming
paradigm, which is what our shop successfully uses.
I don't think you will be happy with just about any solution. Every response
you will get to this is either going to be way too slow -or- way too
complicated. You're re-inventing the wheel!!

Nay, my good friend, not re-inventing the wheel, but asking where the wheel
is. Text-searching of large files isn't uncommon or inappropriate. I'm just
looking for comments on such searches in .NET; this stuff is fairly trivial in
C++/Win32 (I'd prefer *not* to drop down to managed/unmanaged C++ for this
project).
 
Julie

John Timney (Microsoft MVP) said:
Given the size of the file, probably the only way would be to use a
file byte array: load the file in as bytes, one at a time, and convert them to
chars by creating an indexer. You will need to work out a way of checking
whether a series of chars makes up the string you are looking for.

I would start here.

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/csref/html/vcwlkindexerstutorial.asp

However, I very much doubt you will manage to scan 100 meg of data in 10
seconds.

Thanks, I'll look into that.

WinGrep performs the search in 6 seconds on our target system (1.3 GHz, 500 MB
RAM). (WinGrep is not open source, and is written in C++/Win32.)
 
Frans Bouma [C# MVP]

Julie said:
Nay, my good friend, not re-inventing the wheel, but asking where the wheel
is. Text-searching of large files isn't uncommon or inappropriate. I'm just
looking for comments on such searches in .NET; this stuff is fairly trivial
in C++/Win32 (I'd prefer *not* to drop down to managed/unmanaged C++ for this
project).

Searching text in text blocks should use the text-search algorithm by
Knuth-Morris-More or the Boyer-Moore variant. These algorithms are much
faster than the brute-force algorithm implemented in the string class.

Algorithms in C by Sedgewick contains a description of these algorithms, and
I'm sure you'll find some descriptions on the internet.

Basically they come down to this:
string: ababababcababababacababababacbabababacbabababc

If you now try to find the string abc, do not start at the first character,
but at the last. So in the string, the 3rd character is an 'a', not a 'c', so
a match can never end at the 3rd position, and the pattern can be shifted
forward without examining the skipped characters. It works with skip arrays
and is quite clever; it will tremendously speed up string search, especially
with large texts.
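
For illustration, a compact sketch of the Horspool simplification of
Boyer-Moore (just the single "bad character" skip table Frans describes;
byte-oriented input is an assumption):

class HorspoolSearch
{
    // Boyer-Moore-Horspool: on a mismatch, shift by the skip value of the
    // byte currently under the *last* pattern position.
    public static int Find(byte[] text, byte[] pattern)
    {
        int m = pattern.Length;
        if (m == 0 || m > text.Length) return -1;

        int[] skip = new int[256];
        for (int c = 0; c < 256; c++) skip[c] = m;  // byte absent: jump m
        for (int k = 0; k < m - 1; k++) skip[pattern[k]] = m - 1 - k;

        int i = 0;
        while (i <= text.Length - m)
        {
            int j = m - 1;                          // compare right to left
            while (j >= 0 && text[i + j] == pattern[j]) j--;
            if (j < 0) return i;                    // full match at offset i
            i += skip[text[i + m - 1]];
        }
        return -1;
    }
}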

Frans.
 
James Curran

Julie said:
What is the *fastest* way in .NET to search large on-disk text files
(100+ MB) for a given string.

I don't want to load the entire file into physical memory,
memory-mapped files are ok (and preferred).

The problem is that fast access to a large file requires direct access
to memory, which is antithetical to managed code. Your best choice would be
to isolate the search in an unmanaged C++ function that is called by the
managed C# app. Accessing the file as a memory-mapped file is fairly easy in
unmanaged code, but I don't believe it's possible at all in a managed app.

The best algorithm for searching your file is probably the Boyer-Moore
method. Moore himself has a cool webpage graphically demonstrating it:
http://www.cs.utexas.edu/users/moore/best-ideas/string-searching/index.html
Googling "Boyer-Moore" will provide any number of implementations, such as
this one: http://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/713b.srch.c.html
 
William Stacey [MVP]

If you want to jam 'em out in C#, I for one would really appreciate it and
keep them for future use. Or post at CodeProject. TIA.
 
Drebin

Julie said:
Remember, proper implementation dictates that you implement what is
required,
nothing more.

Wow. Most every other developer would argue that "proper" implementation
dictates implementing the functionality that is required, plus any other
structural or design elements that will likely be used in the future.

That's like saying "my customers told me to build a house and they want 3
bedrooms". You need to have the foresight and wisdom to anticipate that
later they are going to say "we want a kitchen and bathrooms too!". If you
don't, you will *always* be doing way more work than you need to! Customers
often don't know what they want - it's up to your expertise to help them
define that.

For your app, common sense tells me: if they want to search this data, they
will want to sort it, and they will want to browse and filter results. Going
your route, EACH step will likely take just as long. If you did it the right
way once, in the beginning - then each additional thing the users wanted
would be practically free.
If you have questions about this
approach, you may want to look into the (industrial) extreme programming
paradigm, which is what our shop successfully uses.

You can call it what you want and justify it however, but I'm really glad I
don't work where you work. :)
 
Michael C

That's like saying "my customers told me to build a house and they want 3
bedrooms". You need to have the foresight and wisdom to anticipate that
later they are going to say "we want a kitchen and bathrooms too!". If you
don't, you will *always* be doing way more work than you need to! Customers
often don't know what they want - it's up to your expertise to help them
define that.

If you're storing 100 MB of data in a flat file, you're probably not running
very efficiently to begin with. Like Drebin said, you need to look at your
ultimate goals and you'll probably find much better solutions out there
(e.g., SQL).

That being said, you need to have the foresight, wisdom and *discipline* to
define all requirements *up-front*. Adding extra kitchens and bathrooms to
the floorplan halfway through the house building process jacks the cost - in
time and $$$'s - *way* up. It's true that a lot of users don't understand
the requirements definition process, and many aren't aware of the features
they'll want down the road; that's where you step in and help guide them
during the planning process. If you're constantly re-writing your
applications because the user requirements keep changing during the
implementation phase, you might want to look at improving your own planning
process.

Thanks,
Michael C., MCDBA
 
Frans Bouma [C# MVP]

William said:
If you want to jam 'em out in C#, I for one would really appreciate it and
keep them for future use. Or post at CodeProject. TIA.

I made a typo: it's Knuth-Morris-Pratt, not Knuth-Morris-More.

Here is a link to a lot of string-search algorithms. Both of the ones I
mentioned are there, with C code and an explanation. It's pretty
straightforward :)
http://www-igm.univ-mlv.fr/~lecroq/string/
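
And for completeness, a compact Knuth-Morris-Pratt sketch in C# (a straight
translation of the textbook algorithm, not code taken from the page above):

class KmpSearch
{
    // KMP: the failure table stores, for each pattern prefix, the length of
    // its longest proper prefix that is also a suffix, so the text pointer
    // never moves backwards.
    public static int Find(string text, string pattern)
    {
        int m = pattern.Length;
        if (m == 0) return 0;

        int[] fail = new int[m];
        int k = 0;
        for (int q = 1; q < m; q++)
        {
            while (k > 0 && pattern[k] != pattern[q]) k = fail[k - 1];
            if (pattern[k] == pattern[q]) k++;
            fail[q] = k;
        }

        k = 0;
        for (int i = 0; i < text.Length; i++)
        {
            while (k > 0 && pattern[k] != text[i]) k = fail[k - 1];
            if (pattern[k] == text[i]) k++;
            if (k == m) return i - m + 1; // match ends at i
        }
        return -1;
    }
}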

Frans.
 
Julie

Drebin said:
Wow. Most every other developer would argue that "proper" implementation
dictates implementing the functionality that is required, plus any other
structural or design elements that will likely be used in the future.

I was in that group of developers 6 months ago. I was raised old school,
where requirements, specs, and definitions are done before coding starts.

Believe me, the paradigm switch *wasn't* easy or intuitive. However, I have
to say that there is a lot to IXP that really impacts the production quality
*and* performance of the entire project. I'm no expert, but I'm making
progress, and probably won't switch back to the old way.

And I guarantee that I wasn't excited about the switch. However, I committed
to trying it for 6 months, and then reevaluate. So far, so good.
That's like saying "my customers told me to build a house and they want 3
bedrooms".

Right -- and that is what you build. The foresight is to build it in such a
fashion that future changes can be easily accomplished. One of the
primary keys is constant refactoring of code and TDD. That leaves the code in
a much more stable and flexible state. It doesn't sound efficient, but when
you are proficient at it, it really is more effective.
You need to have the foresight and wisdom to anticipate that
later they are going to say "we want a kitchen and bathrooms too!".

In your hypothetical example, what if they never want a kitchen/bathroom? Then
you have wasted work.
If you
don't, you will *always* be doing way more work than you need to! Customers
often don't know what they want - it's up to your expertise to help them
define that.

For your app, common sense tells me: if they want to search this data, they
will want to sort it, and they will want to browse and filter results. Going
your route, EACH step will likely take just as long. If you did it the right
way once, in the beginning - then each additional thing the users wanted
would be practically free.

See, your common sense is already wrong, and if followed, leads down the wrong
path and direction for this application.

Anyway, the thread isn't to discuss the merits of IXP, merely looking for a
quick text search implementation in C# -- nothing more, nothing less.
 
Julie

Michael said:
If you're storing 100 MB of data in a flat file, you're probably not running
very efficiently to begin with. Like Drebin said, you need to look at your
ultimate goals and you'll probably find much better solutions out there
(e.g., SQL).

Like I indicated in another follow-up: "These are loosely formatted datafiles
from external laboratory instruments." The requirement is to work w/ these
files, not change the file format.
That being said, you need to have the foresight, wisdom and *discipline* to
define all requirements *up-front*. Adding extra kitchens and bathrooms to
the floorplan halfway through the house building process jacks the cost - in
time and $$$'s - *way* up. It's true that a lot of users don't understand
the requirements definition process, and many aren't aware of the features
they'll want down the road; that's where you step in and help guide them
during the planning process. If you're constantly re-writing your
applications because the user requirements keep changing during the
implementation phase, you might want to look at improving your own planning
process.

Thank you for the compliment! Yes, we definitely do have foresight, wisdom and
discipline. Our investigation concluded with the following requirement:

Search an approx. 100 MB flat text file for a given string in an average of 10
seconds or less, no wildcards.

Now, if I can just have the *discipline* to implement something that meets
_those_ requirements, I get the job done, am appreciated by the team, we
release the product, I get paid, and everyone is happy.
 
Julie

James said:
The problem is that fast access to a large file requires direct access
to memory, which is antithetical to managed code. Your best choice would be
to isolate the search in an unmanaged C++ function that is called by the
managed C# app. Accessing the file as a memory-mapped file is fairly easy in
unmanaged code, but I don't believe it's possible at all in a managed app.

The best algorithm for searching your file is probably the Boyer-Moore
method. Moore himself has a cool webpage graphically demonstrating it:
http://www.cs.utexas.edu/users/moore/best-ideas/string-searching/index.html
Googling "Boyer-Moore" will provide any number of implementations, such as
this one: http://www.dcc.uchile.cl/~rbaeza/handbook/algs/7/713b.srch.c.html

Thanks for the tips. I wasn't aware that mm file support wasn't available in
.NET; seems short-sighted to me.

Managed/unmanaged code really isn't a possibility; part of the requirements is
that it be implemented in C#. I may be able to get away w/ using interop for
the mm file support, though.
 
Julie

Julie said:
Thanks for the tips. I wasn't aware that mm file support wasn't available in
.NET; seems short-sighted to me.

Managed/unmanaged code really isn't a possibility; part of the requirements is
that it be implemented in C#. I may be able to get away w/ using interop for
the mm file support, though.

Not interop, but P/Invoke...
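
Along those lines, a minimal unsafe sketch of what the P/Invoke route might
look like (compile with /unsafe; the naive inner loop and the ASCII needle
are assumptions -- one of the skip-table searches from earlier in the thread
could be dropped in instead):

using System;
using System.IO;
using System.Runtime.InteropServices;
using System.Text;

class MappedSearch
{
    const uint PAGE_READONLY = 0x02;
    const uint FILE_MAP_READ = 0x0004;

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern IntPtr CreateFileMapping(IntPtr hFile, IntPtr lpAttributes,
        uint flProtect, uint dwMaxSizeHigh, uint dwMaxSizeLow, string lpName);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern IntPtr MapViewOfFile(IntPtr hMapping, uint dwDesiredAccess,
        uint dwOffsetHigh, uint dwOffsetLow, UIntPtr dwBytesToMap);

    [DllImport("kernel32.dll")]
    static extern bool UnmapViewOfFile(IntPtr lpBaseAddress);

    [DllImport("kernel32.dll")]
    static extern bool CloseHandle(IntPtr hObject);

    // Returns the byte offset of the first match, or -1.
    public static unsafe long Find(string path, string needle)
    {
        byte[] pattern = Encoding.ASCII.GetBytes(needle);
        using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            long length = fs.Length;
            IntPtr mapping = CreateFileMapping(fs.Handle, IntPtr.Zero,
                PAGE_READONLY, 0, 0, null);
            if (mapping == IntPtr.Zero)
                throw new IOException("CreateFileMapping failed");
            IntPtr view = MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, UIntPtr.Zero);
            if (view == IntPtr.Zero)
            {
                CloseHandle(mapping);
                throw new IOException("MapViewOfFile failed");
            }
            try
            {
                byte* p = (byte*)view;
                long last = length - pattern.Length;
                for (long i = 0; i <= last; i++)
                {
                    if (p[i] != pattern[0]) continue; // cheap first-byte filter
                    int j = 1;
                    while (j < pattern.Length && p[i + j] == pattern[j]) j++;
                    if (j == pattern.Length) return i;
                }
                return -1;
            }
            finally
            {
                UnmapViewOfFile(view);
                CloseHandle(mapping);
            }
        }
    }
}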
 
Michael C

Use Perl.


 
Michael C

You've got some serious specifications, but haven't given enough information
to really help you find a solution, so I suggested using a language that was
designed specifically for text processing. Maybe you could provide more
information, like:

You specify 10 seconds to locate text matches in a 100 MB flat text file.
Is the file already loaded into memory? Do you want to load the whole file
into memory first? Does the load time count against your 10 seconds, or is
it in addition to them? If it counts against your 10 seconds, can your
recommended system configuration load the file in 10 seconds? (If not, the
whole point is moot.) How many matches are you searching for? One match,
every match? Is the file structured in such a way that its format can be
leveraged to speed up the process? Are there certain fields that are
searched more often than others?

Assuming I was *stuck* with a 100 MB flat text file, and no option to
utilize a SQL database or other method of access, I suppose I'd have to
*reinvent the wheel* and create a separate index file to retrieve the data
in reasonable time frames. Of course, that may not be an option for you.
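
For instance, a hypothetical sketch of building such an index (just a flat
list of line-start byte offsets; the two-file layout and names are made up):

using System.IO;

class OffsetIndex
{
    // One pass, run off-hours: record where every line starts so later
    // lookups can seek straight to candidate regions instead of rescanning
    // the whole 100 MB.
    static void Build(string dataPath, string indexPath)
    {
        using (FileStream data = new FileStream(dataPath, FileMode.Open, FileAccess.Read))
        using (BinaryWriter index = new BinaryWriter(
                   new FileStream(indexPath, FileMode.Create)))
        {
            long offset = 0;
            index.Write(offset); // the first line starts at offset 0
            int b;
            while ((b = data.ReadByte()) != -1) // FileStream buffers internally
            {
                offset++;
                if (b == '\n') index.Write(offset);
            }
        }
    }
}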

Thanks,
Michael C., MCDBA
 
