Fastest way to search text file for string

Phill

How would you load a 100MB txt file into a DB and then search it for a
word? How would that work?
 
Drebin

Create a loader program/BCP/DTS job to split the records accordingly and
load them into a table structure (say, one named "customer"); then you could
do something like:

select customerid from customer where customer_lastname = 'smith'

assuming you had customer data to load. This is pretty standard stuff - did
you mean this in a different way? SQL is already a really, really powerful
tool for loading, sorting and searching data. For someone to want to try,
thinking they could do a better job, is ambitious to say the least. SQL
concepts (like indexing and searching) are time-tested; I can't picture
challenging them and thinking I could do a better job!
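
For concreteness, here is a minimal sketch of the kind of loader being
described, assuming .NET 2.0+ (for SqlBulkCopy), SQL Server, and a
comma-delimited input file; the connection string, the "customer" table and
its column names are only placeholders:

using System;
using System.Data;
using System.Data.SqlClient;
using System.IO;

class CustomerLoader
{
    static void Main(string[] args)
    {
        // Build an in-memory table matching the target schema
        // ("customer" and its columns are only examples).
        DataTable table = new DataTable("customer");
        table.Columns.Add("customerid", typeof(int));
        table.Columns.Add("customer_lastname", typeof(string));

        // Split each delimited record into columns (comma format assumed).
        using (StreamReader reader = new StreamReader(args[0]))
        {
            string line;
            int id = 1;
            while ((line = reader.ReadLine()) != null)
            {
                string[] fields = line.Split(',');
                table.Rows.Add(id++, fields[0]);
            }
        }

        // Bulk-load the rows; the connection string is a placeholder.
        using (SqlConnection conn = new SqlConnection(
            "Server=.;Database=lab;Integrated Security=true"))
        {
            conn.Open();
            using (SqlBulkCopy bulk = new SqlBulkCopy(conn))
            {
                bulk.DestinationTableName = "customer";
                bulk.WriteToServer(table);
            }
        }
    }
}

Once the data is loaded, the SELECT above (or any indexed query) does the
actual searching.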
 
James Curran

Julie said:
Thanks for the tips. I wasn't aware that memory-mapped file support wasn't
available in .NET; that seems short-sighted to me.

What good is a memory mapped file in an environment where you cannot
directly access memory?
 
James Curran

Julie said:
Like I indicated in another follow-up: "These are loosely formatted
datafiles from external laboratory instruments." The requirement is
to work w/ these files, not change the file format.

Well, admittedly I don't know the specifics of the contract, but it
seems that you are taking that a bit too literally. As far as I can see,
you are required to:
A) Read in the 100 MB flat text file, and
B) Spit out results found within that file.

Exactly how you get from A) to B) is strictly your concern, and if you want
to implement it by loading it into a SQL database or otherwise indexing it,
no one else even needs to know about it.
 
Michael C

Exactly my point. The only way I can think of to search 100 MB files
quickly would require a separate index, preferably kept in a separate file
so that it didn't have to be re-created from scratch every darn time you run
the program. That being said, that's exactly what SQL does; any type of
separate indexing method would basically be re-inventing the wheel. And
heck, they're giving away MSDE for free, so you don't even have to buy SQL
Server. Any solution short of using a separate index of some sort is going
to be very non-scalable and comparatively slow. But alack and alas, to each
her own...

Thanks,
Michael C., MCDBA
 
Julie

James said:
What good is a memory mapped file in an environment where you cannot
directly access memory?

Can't directly access memory? Sure you can.

In this case, it could be an abstraction layer on something like char[] or
byte[].
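
As a rough illustration of that kind of abstraction layer (hypothetical
code, not anything from the thread), a class like the FileWindow below
exposes array-style byte access to a large file while keeping only a small
byte[] window in memory:

using System;
using System.IO;

// Hypothetical wrapper: array-style access to a large file backed by a
// small byte[] window, so the whole 100 MB file is never resident at once.
class FileWindow : IDisposable
{
    private readonly FileStream stream;
    private readonly byte[] window = new byte[1 << 20];   // 1 MB window
    private long windowStart = -1;
    private int windowLength;

    public FileWindow(string path)
    {
        stream = new FileStream(path, FileMode.Open, FileAccess.Read);
    }

    public long Length { get { return stream.Length; } }

    // Indexer that refills the window whenever the requested offset
    // falls outside the currently buffered range.
    public byte this[long offset]
    {
        get
        {
            if (offset < 0 || offset >= stream.Length)
                throw new ArgumentOutOfRangeException("offset");
            if (offset < windowStart || offset >= windowStart + windowLength)
            {
                stream.Seek(offset, SeekOrigin.Begin);
                windowStart = offset;
                windowLength = stream.Read(window, 0, window.Length);
            }
            return window[offset - windowStart];
        }
    }

    public void Dispose() { stream.Close(); }
}

A search routine can then walk offsets 0 through Length - 1 via the indexer;
it isn't as fast as a real memory-mapped view, but the memory footprint
stays bounded.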
 
Julie

Michael said:
Exactly my point. The only way I can think of to search 100 MB files
quickly would require a separate index, preferably kept in a separate file
so that it didn't have to be re-created from scratch every darn time you run
the program. That being said, that's exactly what SQL does; any type of
separate indexing method would basically be re-inventing the wheel. And
heck, they're giving away MSDE for free, so you don't even have to buy SQL
Server. Any solution short of using a separate index of some sort is going
to be very non-scalable and comparatively slow. But alack and alas, to each
her own...

As I indicated in my original post, indexing isn't an option (and this
therefore excludes any database access).

Apparently you aren't aware of any methods in .NET to accomplish what I need --
thanks for your input.
 
Julie

Michael said:
You've got some serious specifications, but haven't given enough information
for anyone to really help you find a solution.

I wasn't asking for a solution, I was asking:

"What is the *fastest* way in .NET to search large on-disk text files
(100+ MB) for a given string."
So I suggested using a language that was
designed specifically to perform text processing.

I'm not doing text processing, I'm searching a text file for a given string --
that is all.
Maybe you could provide more information, like:

You specify 10 seconds to locate text matches in a 100 MB flat text file.
Is the file already loaded into memory?

No, the file is on disk, as I indicated in my original post:

"I don't want to load the entire file into physical memory"
Do you want to load the whole file into memory first?

No: "I don't want to load the entire file into physical memory"
Does the load time count against your 10 seconds, or is
it in addition to? If it counts against your 10 seconds, can your
recommended system configuration load it in 10 seconds? (If not, the whole
point is moot).

N/A: "I don't want to load the entire file into physical memory"
How many matches are you searching for? One match, every
match?

Typically 0 or 1 (i.e. found/not found).
Is the file structured in such a way that its format can be
leveraged to speed up the process?

For the purposes of my requirements, the file is unordered, unstructured,
essentially random text.
Are there certain fields that are
searched more than others in searches?
No.

Assuming I was *stuck* with a 100 MB flat text file, and no option to
utilize a SQL database or other method of access, I suppose I'd have to
*reinvent the wheel* and create a separate index file to retrieve the data
in reasonable time frames. Of course that may not be an option for you.

Not stuck, that is the requirement. Or, do you presume the original
requirements for grep to be 'stuck'?

Again, I'm not interested in reinventing the wheel as you put it, I'm simply
after: "What is the *fastest* way in .NET to search large on-disk text files
(100+ MB) for a given string."
 
Julie

Drebin said:
Create a loader program/BCP/DTS job to split the records accordingly and
load them into a table structure (say, one named "customer"); then you could
do something like:

select customerid from customer where customer_lastname = 'smith'

Please explain how to do this w/ unstructured text?

"The files are unindexed and unsorted, and for the purposes of my immediate
requirements, can't be indexed/sorted."
 
Julie

James said:
Well, admittedly I don't know the specifics of the contract, but it
seems that you are taking that a bit too literally. As far as I can see,
you are required to:
A) Read in the 100 MB flat text file, and
B) Spit out results found within that file.

Exactly how you get from A) to B) is strictly your concern, and if you want
to implement it by loading it into a SQL database or otherwise indexing it,
no one else even needs to know about it.

Nope, sorry, I'm not taking it too literally, but thanks for implying that I'm
incapable of understanding requirements.

Assume the input/output requirements to be the same as grep.
 
Drebin

Julie,

Please don't take offense, I have no desire to get into a pissing contest -
I am just flabbergasted by your viewpoint. I can honestly say I've never met
anyone who thinks like this. So after this, I'll back off :)

First, to speak to what you said before about "getting solid requirements
first" - the problem with getting requirements isn't that people are
incompetent; it's very much human nature to not know what you want until you
see some of your project come to life. Using the analogy of building a
house, you may not realize until after the 2nd floor is in place that you
have a really cool view, and that it would've been nice to build a balcony.
So I think that people are many times UNABLE to make solid choices on
requirements because you just can't see that far ahead.

So I believe human nature makes it quite impossible to make 100% accurate
requirements. It's just not possible (for most large projects).

As to your point below, computers are based on structure. If you want to
KEEP your data unstructured, you are going to be fighting the computer the
whole way. So if you are asking me, I would be spending all of my time right
now getting this unstructured data into some sort of structure. If your
source for this data gives it to you unstructured, then you need to get
yourself a new data source or build a converter of some sort. You alluded to
this being data from an electronic device of some sort. It seems to me that
if it just gave you an array of random numbers and values, it would be
completely useless.

So - bottom line, if you have to deal with data that doesn't have structure,
you should address that first. You are going to spend 3x as much time trying
to do any little thing with this data versus just getting it straightened
out in the beginning. If you do this, then you can leverage a TON of
technology and products (like SQL Server) rather than writing a custom
version of them for this specific problem. "Write-once-use-once" software is
soooo "mid-90's"; "write-once-use-many-times" is what things have evolved to.

You seem pretty set in this mindset of yours, so I'm not trying to
convince you of anything - just giving you the computer science take on what
you are trying to do. Anyhow, good luck with this!! :)
 
Michael C

Simple. Since you require a simple Grep-like function, you find one that's
already made, and modify it to your taste as opposed to re-inventing the
wheel from scratch. If the Code Project link I already gave you doesn't fit
your needs, I would recommend Googling "C#.NET Grep". But I'm sure you've
already done that.

You're welcome,
Michael C., MCDBA
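
For reference, the simplest grep-like starting point in C# is just a
streaming scan; this sketch assumes the file is line-oriented text, which
may or may not hold for the instrument data:

using System;
using System.IO;

class SimpleGrep
{
    static void Main(string[] args)
    {
        string path = args[0];      // file to search
        string pattern = args[1];   // string to look for

        using (StreamReader reader = new StreamReader(path))
        {
            string line;
            long lineNumber = 0;
            while ((line = reader.ReadLine()) != null)
            {
                lineNumber++;
                if (line.IndexOf(pattern) >= 0)
                {
                    Console.WriteLine("found on line {0}", lineNumber);
                    return;   // found/not-found is all that was asked for
                }
            }
        }
        Console.WriteLine("not found");
    }
}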
 
Willy Denoyette [MVP]

The real problem is to find the fastest way to read the data into memory.
Reading a 100 MB file will take something like 5-10 seconds, depending on
the IO subsystem used (RAID 0, single 7200 RPM drive) and assuming the data
is not cached.
Depending on the results of the file data load time, you can decide to use a
naive algorithm or opt for a faster algorithm like the Boyer-Moore
algorithm.
Consider that:
- the Boyer-Moore algorithm (with a pattern length of 10) is about 4 - 6
times faster than a naive algorithm,
- a naive algorithm should be able to search a 10 char pattern in less than
one sec on a decent system (P4 - 2.8 GHz).

So it's really important to know exactly how much time will be spent to
bring the data in memory before you decide upon the searching algorithm.

Willy.
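
For illustration, here is a minimal sketch of the bad-character-shift
simplification of Boyer-Moore (the Boyer-Moore-Horspool variant), searching
a byte buffer that is assumed to have been read from the file already:

using System;

static class Horspool
{
    // Returns the offset of the first occurrence of pattern in text, or -1.
    public static int IndexOf(byte[] text, byte[] pattern)
    {
        if (pattern.Length == 0 || pattern.Length > text.Length)
            return -1;

        // Bad-character table: how far the window may slide when the byte
        // aligned with the last pattern position causes a mismatch.
        int[] shift = new int[256];
        for (int i = 0; i < shift.Length; i++)
            shift[i] = pattern.Length;
        for (int i = 0; i < pattern.Length - 1; i++)
            shift[pattern[i]] = pattern.Length - 1 - i;

        int pos = 0;
        while (pos <= text.Length - pattern.Length)
        {
            // Compare right-to-left, as Boyer-Moore variants do.
            int j = pattern.Length - 1;
            while (j >= 0 && text[pos + j] == pattern[j])
                j--;
            if (j < 0)
                return pos;                          // full match
            pos += shift[text[pos + pattern.Length - 1]];
        }
        return -1;
    }
}

Full Boyer-Moore adds a good-suffix rule on top of this, but the
bad-character shifts alone account for most of the speedup on longer
patterns. Matches that straddle buffer boundaries still have to be handled,
e.g. by overlapping successive reads by pattern.Length - 1 bytes.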
 
James Curran

Julie said:
Assume the input/output requirements to be the same as grep.

And it's *completely acceptable* for GREP to be implemented using a SQL
server behind the scenes. Granted that probably wouldn't be a very good
implementation, but as long as the *inputs* and *outputs* are as
expected, the implementation is irrelevant.
 
John Price

Drebin,
I have no desire to get into a pissing contest -

But that is exactly what you are doing. You are trying to prove you are
right and she is wrong. Just accept that she knows what her requirements are
and that (rightly or wrongly) a database is not suitable. Why does it matter
so much to you if she is wrong?

Her original question was quite clear and explicit, even to the point of
stating the file could not be indexed or sorted! Unfortunately sometimes in
life you have to deal with what you are given and you can't always demand
how information is provided.

Just answer the question as asked and don't start criticising people just
because you believe their approach is wrong. By all means offer an
alternative solution but don't take offence when somebody says it is not
suitable. And don't be so arrogant as to assume you know their requirements
or circumstances better than they do.

Regards

John

P.S. By all means start ranting and raving at me as well, but do so knowing
I won't be reading any more of your replies. You have a major attitude
problem. I suggest you refrain from contributing until you can control it.
 
Daniel O'Connell [C# MVP]

James Curran said:
And it's *completely acceptable* for GREP to be implemented using a SQL
server behind the scenes. Granted that probably wouldn't be a very good
implementation, but as long as the *inputs* and *outputs* are as
expected, the implementation is irrelevant.

It would be OK if the grep *ENGINE* could accept any input. A grep
implementation that uses a database is valid, but close to useless,
considering grep is built strictly to search files. What good is a program
that imports all of my files into a database and then tosses the database
out afterwards? Would you *WANT* that? Just because some of you think it's a
really great idea, because it means you don't have to do any work?

For what it's worth, this heavy push for a database is probably foolish. If
the search only occurs a couple of times, there is a pretty good chance that
the time spent loading the database and the extra space used up will be a
huge waste.

An index makes sense, but only if a lot of searches will occur over the same
files. For one-off searches I would imagine creating the index would take
more time and a lot more space than just searching the file.
 
Julie

Willy Denoyette said:
The real problem is to find the fastest way to read the data into memory.
Reading a 100 MB file will take something like 5-10 seconds, depending on
the IO subsystem used (RAID 0, single 7200 RPM drive) and assuming the data
is not cached.
Depending on the results of the file data load time, you can decide to use a
naive algorithm or opt for a faster algorithm like the Boyer-Moore
algorithm.
Consider that:
- the Boyer-Moore algorithm (with a pattern length of 10) is about 4 - 6
times faster than a naive algorithm,
- a naive algorithm should be able to search a 10 char pattern in less than
one sec on a decent system (P4 - 2.8 GHz).

So it's really important to know exactly how much time will be spent to
bring the data in memory before you decide upon the searching algorithm.

Yes, thanks for the comments on B-M searching.

As for loading the file into memory, I specifically do *not* want to do that.
Win32 has offered memory-mapped files for quite some time -- exactly what I'm
after in the .NET world.
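
The .NET Framework of that era has no managed memory-mapped file API, so the
Win32 calls have to be reached through P/Invoke. A rough sketch (error
handling omitted, and the search itself kept deliberately naive):

using System;
using System.IO;
using System.Runtime.InteropServices;
using System.Text;

class MappedSearch
{
    const uint PAGE_READONLY = 0x02;
    const uint FILE_MAP_READ = 0x0004;

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern IntPtr CreateFileMapping(IntPtr hFile, IntPtr lpAttributes,
        uint flProtect, uint dwMaximumSizeHigh, uint dwMaximumSizeLow, string lpName);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern IntPtr MapViewOfFile(IntPtr hMapping, uint dwDesiredAccess,
        uint dwFileOffsetHigh, uint dwFileOffsetLow, UIntPtr dwNumberOfBytesToMap);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool UnmapViewOfFile(IntPtr lpBaseAddress);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool CloseHandle(IntPtr hObject);

    static void Main(string[] args)
    {
        byte[] pattern = Encoding.ASCII.GetBytes(args[1]);

        using (FileStream fs = new FileStream(args[0], FileMode.Open, FileAccess.Read))
        {
            // Map the whole file read-only; the OS pages it in on demand,
            // so the 100 MB file is never copied into managed memory.
            IntPtr mapping = CreateFileMapping(fs.Handle, IntPtr.Zero,
                PAGE_READONLY, 0, 0, null);
            IntPtr view = MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, UIntPtr.Zero);

            // Naive scan over the view; a real implementation would use an
            // unsafe byte* (and something like Boyer-Moore) instead of
            // Marshal.ReadByte, which is slow and limited to int offsets.
            long length = fs.Length;
            long found = -1;
            for (long i = 0; i + pattern.Length <= length && found < 0; i++)
            {
                int j = 0;
                while (j < pattern.Length &&
                       Marshal.ReadByte(view, (int)(i + j)) == pattern[j])
                    j++;
                if (j == pattern.Length)
                    found = i;
            }
            Console.WriteLine(found >= 0 ? "found at offset " + found : "not found");

            UnmapViewOfFile(view);
            CloseHandle(mapping);
        }
    }
}

FileStream.Handle is the .NET 1.x way to reach the OS handle; later
framework versions expose SafeFileHandle instead, and .NET 4 finally added a
managed System.IO.MemoryMappedFiles API.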
 
Julie

Drebin said:
Julie,

Please don't take offense, I have no desire to get into a pissing contest -
I am just flabbergasted by your viewpoint. I can honestly say I've never met
anyone who thinks like this. So after this, I'll back off :)

I'll gladly take that as a compliment.
First, to speak to what you said before about "getting solid requirements
first" - the problem with getting requirements isn't that people are
incompetent; it's very much human nature to not know what you want until you
see some of your project come to life. Using the analogy of building a
house, you may not realize until after the 2nd floor is in place that you
have a really cool view, and that it would've been nice to build a balcony.
So I think that people are many times UNABLE to make solid choices on
requirements because you just can't see that far ahead.

So I believe human nature makes it quite impossible to make 100% accurate
requirements. It's just not possible (for most large projects).

Well, you may want to consider that it is possible -- this is actually a port
of an existing C++ application to C#. Requirements were
defined/refined/reevaluated years ago. My task is simple, as I originally
posted. Please don't assume that you know more about my requirements than I
do.
As to your point below, computers are based on structure. If you want to
KEEP your data unstructured, you are going to be fighting the computer the
whole way. So if you are asking me, I would be spending all of my time right
now getting this unstructured data into some sort of structure. If your
source for this data gives it to you unstructured, then you need to get
yourself a new data source or build a converter of some sort. You alluded to
this being data from an electronic device of some sort. It seems to me that
if it just gave you an array of random numbers and values, it would be
completely useless.

Right -- The source of the data is a 3rd party $100,000 laboratory instrument
(mass spectrometer). I'll just contact the manufacturer and tell them that
they are doing everything wrong and to output their data in some other format.
If they have any questions as to why, I'll refer them to you.
So - bottom line, if you have to deal with data that doesn't have structure,
you should address that first. You are going to spend 3x as much time trying
to do any little thing with this data versus just getting it straightened
out in the beginning. If you do this, then you can leverage a TON of
technology and products (like SQL Server) rather than writing a custom
version of them for this specific problem. "Write-once-use-once" software is
soooo "mid-90's"; "write-once-use-many-times" is what things have evolved to.

So, apparently, you implement more than was asked for. Interesting. I prefer
to implement what was asked for.
 
