Fastest way to search text file for string


Jon Skeet [C# MVP]

Julie said:
Yes, thanks for the comments on B-M searching.

As for loading the file into memory, I specifically do *not* want to do that.
Win32 has offered memory-mapped files for quite some time -- exactly what I'm
after in the .Net world.

Willy wasn't suggesting (as I read it) loading the whole file into
memory in one go - but you need to accept that if you're going to
search through every byte of the original data, all of it will need to
be loaded from disk into memory at that stage. It would be well worth
finding out how long it takes *just* for the loading part on the target
system (taking the cache into account) before looking at the searching
part, IMO.
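A rough way to isolate the load time Jon describes might look like the following (a sketch only; the path and buffer size are illustrative, not from the thread):

```csharp
using System;
using System.IO;

class LoadTimer
{
    static void Main()
    {
        string path = @"C:\data\bigfile.txt"; // hypothetical path -- substitute the real data file
        byte[] buffer = new byte[64 * 1024];  // 64 KB chunks
        long total = 0;

        DateTime start = DateTime.UtcNow;
        using (FileStream fs = File.OpenRead(path))
        {
            int read;
            // Read every byte but do no searching, so the measured time
            // is the raw transfer (disk or FS cache) cost only.
            while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
                total += read;
        }
        TimeSpan elapsed = DateTime.UtcNow - start;

        Console.WriteLine("Read {0} bytes in {1:F1} s", total, elapsed.TotalSeconds);
    }
}
```

Running this twice in a row also shows the FS-cache effect: the second pass is typically much faster than the first.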
 

Willy Denoyette [MVP]

Julie said:
"Willy Denoyette [MVP]" wrote:
Yes, thanks for the comments on B-M searching.

As for loading the file into memory, I specifically do *not* want to do
that.
Win32 has offered memory-mapped files for quite some time -- exactly what
I'm
after in the .Net world.


Julie,

Like Jon said, I wasn't suggesting loading the whole file in memory at once,
but as you need to search the whole file you will have to transfer all file
data to memory at some point in time.
Also, using memory mapped files doesn't make sense because:
- The total I/O time will be the same as when you read the data directly
into your process space: you have to create a "file mapping object" with a
size equal to the file size, then create individual file views to do the
search, but as you need to search the whole file you effectively load the
whole file into (virtual) memory.
- You don't share the file mapping object with other processes, which is one
of the main reasons to use mapped files.

Willy.
 

James Curran

Daniel said:
For what its worth, this heavy push for a database is probably
foolish.

But I'm not pushing heavily for a database. I'm pushing heavily in
favor of looking beyond a very naive reading of the requirements, and
putting some knowledge of the domain space behind the problem.

For example, I have trouble believing that the input data here
("datafiles from external laboratory instruments") are completely free-form.
There's probably recognizable rows & columns. Further, I don't believe
searching would generally be limited to one search per file, so reading,
indexing, do several indexed searches, and then expiring the indexing would
probably be faster than doing several simple text searches.
 

Daniel O'Connell [C# MVP]

James Curran said:
But I'm not pushing heavily for a database. I'm pushing heavily in
favor of looking beyond a very naive reading of the requirements, and
putting some knowledge of the domain space behind the problem.

For example, I have trouble believing that the input data here
("datafiles from external laboratory instruments") are completely
free-form. There's probably recognizable rows & columns. Further, I don't
believe searching would generally be limited to one search per file, so
reading, indexing, do several indexed searches, and then expiring the
indexing would probably be faster than doing several simple text searches.

Perhaps it has rows and columns, or perhaps it's a flat file that simply has
periodic data; perhaps the data is closer to an XML file than it is to a
database, having tagged segments that may exist in any order. Perhaps the
file isn't constant; perhaps the equipment or network setup changes the
file commonly. Rows & columns are only one possible representation, and
static content is only one possibility.

Also, we have no way of knowing how quickly this data is flowing in. What if
this particular piece of equipment is generating hundreds of gigs a week?
Does a database still make sense? Does an index? Or does each file get
processed once and sent along to permanent storage? Is this query only run
against data sometimes, when something else occurs, or is it run dozens of
times an hour? Are the searches automated or are they something an
individual user does when they have reason to? Does the search utility have
to operate over millions of potential files or just one?

Without all of this information, what right do you have to call the OP's
choice naive? As anything we state is probably far less informed than
theirs is, I can't really believe you would consider your own stance to be
the less naive of the bunch.
Whatever our personal experiences may be, that doesn't mean that our
experiences with any particular thing are the only ones possible.

Your suggestion may well be the absolute wrong thing to do. Or it may be the
right one; however, I just don't think you're in any better position than I
am to gauge that. One would hope the person who wrote the requirements *was*,
however.
 

Julie

James said:
But I'm not pushing heavily for a database. I'm pushing heavily in
favor of looking beyond a very naive reading of the requirements, and
putting some knowledge of the domain space behind the problem.

Thanks for the insult of calling me naive.

I really find it hard to believe that you think that I can't understand what my
requirements are.
For example, I have trouble believing that the input data here
("datafiles from external laboratory instruments") are completely free-form.
There's probably recognizable rows & columns. Further, I don't believe
searching would generally be limited to one search per file, so reading,
indexing, do several indexed searches, and then expiring the indexing would
probably be faster than doing several simple text searches.

I still don't see where I'm asking for anything faster than 1 hit in a 100 MB
file in 10 seconds or less. That is all that the performance requirement
dictates. Anything more is completely wasted effort.

I appreciate your comments and suggestions, but please, don't become so focused
on what you think is necessary that you completely ignore what I know to be the
requirements.

Thanks
 

Drebin

Julie,

Just one other random thought, if I may. I guess the reason why you are
finding so much resistance is that pretty much everyone - except for you -
finds "solutions" in their jobs and remains open-minded in their work. Very
rarely are we asked to blindly "fill these requirements". Our industry is
such that, most companies simply can't afford to work that inefficiently.
Instead, most companies give a developer a "problem" and we are to use our
expertise to find the most efficient solution. "Efficient" means not only
the fastest to implement, but also the most scalable and easiest to
change.

In other words, when we hear such unreasonable requirements as you've
vaguely defined, everyone's first reaction is to get a handle on those,
because they don't sound reasonable. When someone comes to me with
outrageous requirements, I don't even BOTHER trying to answer the original
question, because 99% of the time, they don't have a handle on the problem.
But I'm gathering from your reaction and style that you likely work in the
gov't or aerospace - and I know that is a completely different mindset
there.

Anyhow - for future reference, it would probably be a big time-saver if you
were to actually share (in some level of detail) what your requirements are.
What IS the format of this data? Because without that, you come off as an
inexperienced and stubborn developer that is not pushing back on
unreasonable requirements when you CLEARLY should be (in our minds). So, if
you want people to stop reacting to your requirements, it might be best to
actually go into some detail about why they are so immutable so people can
get past it - and start helping with your actual problem.

For whatever it's worth..
 

Julie

Drebin said:
Julie,

Just one other random thought, if I may. I guess the reason why you are
finding so much resistance is that pretty much everyone - except for you -
finds "solutions" in their jobs and remains open-minded in their work. Very
rarely are we asked to blindly "fill these requirements". Our industry is
such that, most companies simply can't afford to work that inefficiently.
Instead, most companies give a developer a "problem" and we are to use our
expertise to find the most efficient solution. "Efficient" means not only
the fastest to implement, but also the most scalable and easiest to
change.

Your points would be valid if: you were talking to someone that didn't know
what they were doing and/or asked for comments on process. Neither of those
apply to me.

I'm *very* open minded, I examine all of the potential solutions that I can,
and make informed decisions. I did that work before posing the original
question. I challenge you to be a little more open minded and realize that
simple-text searching can be (and IS! in this case) a valid solution to the
problem posed.

I work very closely w/ those that define the requirements, I know exactly what
they want, and they are informed as to the details, costs, issues, etc.

Honestly, had I gone your route and implemented this using a database, it would
have completely changed the disposition of the components that I'm working on,
the installation requirements, licensing requirements, base system
requirements, implementation time frame, complexity, maintainability, version
issues, etc., etc., etc. Absolutely none of that can be tolerated on the
project for a relatively simple part of the component that I'm working on.
In other words, when we hear such unreasonable requirements as you've
vaguely defined, everyone's first reaction is to get a handle on those,
because they don't sound reasonable.

Answer me one question: how can you determine that those requirements are
unreasonable?
When someone comes to me with
outrageous requirements, I don't even BOTHER trying to answer the original
question, because 99% of the time, they don't have a handle on the problem.
But I'm gathering from your reaction and style that you likely work in the
gov't or aerospace - and I know that is a completely different mindset
there.

Oh wise sage, I work in neither of those disciplines.
Anyhow - for future reference, it would probably be a big time-saver if you
were to actually share (in some level of detail) what your requirements are.
What IS the format of this data? Because without that, you come off as an
inexperienced and stubborn developer that is not pushing back on
unreasonable requirements when you CLEARLY should be (in our minds). So, if
you want people to stop reacting to your requirements, it might be best to
actually go into some detail about why they are so immutable so people can
get past it - and start helping with your actual problem.

How about this, for your future reference: try answering the question posed,
don't spend so much time trying to read more into it than exists. If you have
questions about the requirements, ask them _after_ you have answered the
original question. Otherwise, you come off as a know-it-all.

Finally, my requirements were well defined, you just don't happen to want to
believe them.
For whatever it's worth..

Julie said:
James said:
Daniel O'Connell [C# MVP] wrote:

For what its worth, this heavy push for a database is probably
foolish.

But I'm not pushing heavily for a database. I'm pushing heavily in
favor of looking beyond a very naive reading of the requirements, and
putting some knowledge of the domain space behind the problem.

Thanks for the insult of calling me naive.

I really find it hard to believe that you think that I can't understand what my
requirements are.
For example, I have trouble believing that the input data here
("datafiles from external laboratory instruments") are completely free-form.
There's probably recognizable rows & columns. Further, I don't believe
searching would generally be limited to one search per file, so reading,
indexing, do several indexed searches, and then expiring the indexing would
probably be faster than doing several simple text searches.

I still don't see where I'm asking for anything faster than 1 hit in a 100 MB
file in 10 seconds or less. That is all that the performance requirement
dictates. Anything more is completely wasted effort.

I appreciate your comments and suggestions, but please, don't become so focused
on what you think is necessary that you completely ignore what I know to be the
requirements.

Thanks
 

Julie

Julie said:
What is the *fastest* way in .NET to search large on-disk text files (100+ MB)
for a given string.

The files are unindexed and unsorted, and for the purposes of my immediate
requirements, can't be indexed/sorted.

I don't want to load the entire file into physical memory, memory-mapped files
are ok (and preferred). Speed/performance is a requirement -- the target is to
locate the string in 10 seconds or less for a 100 MB file. The search string
is typically 10 characters or less. Finally, I don't want to spawn out to an
external executable (e.g. grep), but include the algorithm/method directly in
the .NET code base. For the first rev, wildcard support is not a requirement.

Thanks to all those that replied.

I spent a little time researching some of the access and search methods
proposed, and the funny thing is that the most simple and straightforward
implementation actually turned out to be quite sufficient.

As indicated, my requirement was to search a 100 MB text file for a string in
10 seconds or less. My initial results (debug, unoptimized) are right around 5
seconds on the target system, presumably the release/optimized build will be a
bit faster.

Implementation is essentially opening a text stream (StreamReader) and reading
the contents, line by line looking for the search string. Total implementation
is about 10 lines of code.
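The thread doesn't show the actual code, but the ~10-line implementation described is presumably close to this sketch (names are illustrative; the file format reportedly guarantees the search string never spans a line break):

```csharp
using System;
using System.IO;

class Finder
{
    // Returns the zero-based number of the first line containing
    // 'needle', or -1 if the string is not found anywhere in the file.
    static int FindInFile(string path, string needle)
    {
        using (StreamReader reader = new StreamReader(path))
        {
            string line;
            int lineNumber = 0;
            while ((line = reader.ReadLine()) != null)
            {
                if (line.IndexOf(needle) >= 0)
                    return lineNumber;
                lineNumber++;
            }
        }
        return -1;
    }
}
```

StreamReader buffers its reads internally, so even this naive line-by-line loop is dominated by disk transfer time rather than the string comparison.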
 

John Timney (Microsoft MVP)

Julie,

Purely out of interest - how are you checking if the string doesn't exist
over two lines?

Regards

John Timney
Microsoft Regional Director
Microsoft MVP
 

Willy Denoyette [MVP]

Julie said:
Thanks to all those that replied.

I spent a little time researching some of the access and search methods
proposed, and the funny thing is that the most simple and straightforward
implementation actually turned out to be quite sufficient.

As indicated, my requirement was to search a 100 MB text file for a string
in
10 seconds or less. My initial results (debug, unoptimized) are right
around 5
seconds on the target system, presumably the release/optimized build will
be a
bit faster.

Implementation is essentially opening a text stream (StreamReader) and
reading
the contents, line by line looking for the search string. Total
implementation
is about 10 lines of code.

Did you try to flush the File System cache first?
I'm pretty sure the file was (partly) cached in the FS cache when you did
your test.

Willy.
 

Julie

John Timney (Microsoft MVP) said:
Julie,

Purely out of interest - how are you checking if the string doesn't exist
over two lines?

As a matter of definition of the file format, the search string cannot span
lines, so no extra processing required.
 

Julie

Willy Denoyette said:
Did you try to flush the File System cache first?
I'm pretty sure the file was (partly) cached in the FS cache when you did
your test.

At first I thought the same thing, but I ran the test & timing on several
machines w/ similar results on the first run.

Do you know of a way to programmatically flush the cache from .NET? If so,
I'll try it again w/ a forced flush and see if it changes my results.
 

Willy Denoyette [MVP]

Julie said:
At first I thought the same thing, but I ran the test & timing on several
machines w/ similar results on the first run.

If there's no difference between the first run and the second run, it
indicates that the file is read from the FS cache.

Do you know of a way to programmatically flush the cache from .NET? If
so,
I'll try it again w/ a forced flush and see if it changes my results.

You could try to fill the memory with data, this way you will reduce the FS
cache (though, you can't flush all of the cache), but also the Working Set
of all processes, so your system will become real slow.

byte[] bytes = new byte[100 * 5000000]; // allocate 500 MB (assumes 512 MB system RAM)
for (int i = 0; i < bytes.Length; i++)
    bytes[i] = 0;

Willy.
 

Drebin

Julie said:
Answer me one question: how can you determine that those requirements are
unreasonable?

Because with the exception of aerospace and gov't work, the private sector
can't tolerate gross inefficiencies like you described. So either you don't
care about being inefficient or you are not smart enough to know that it
is!! Either way, it's bad and either way, you are too stubborn to see people
here are trying to actually help you! Imagine that? Perfect strangers still
willing to help someone who has been a jackass and jerky the whole time.
Oh wise sage, I work in neither of those disciplines.

Why the constant attacks - I'm on the newsgroup to help people, and you're
still just being an ass - and to someone who is trying to help you!! Not
cool.
How about this, for your future reference: try answering the question posed,
don't spend so much time trying to read more into it than exists. If you have
questions about the requirements, ask them _after_ you have answered the
original question. Otherwise, you come off as a know-it-all.

Ya know what, I don't know where you get this sense of entitlement, everyone
that posted, did so because they WANT to. We don't owe you anything! You
aren't entitled to help! So you know what, if I see any more posts from you,
I won't bother to respond. And if you have further technical problems, take
it someplace else. Newsgroups are filled with people of varying levels of
experience and expertise - some answers are helpful, some - not-so-much (my
own answers included) - but ya know what, when you have a problem, maybe a
combination of several things you've read will help you come to a solution.
If people don't give you the answer you want, you don't blast them! Point
is, you are grossly missing the point of newsgroups and I, for one - think
it's kind of ****ed up.

I'm even more annoyed that I wasted this much time on you and this retarded
topic!
Finally, my requirements were well defined, you just don't happen to want to
believe them.

I'm completely underwhelmed. I still think you're an idiot.
 

Julie

Drebin said:
Because with the exception of aerospace and gov't work, the private sector
can't tolerate gross inefficiencies like you described. So either you don't
care about being inefficient or you are not smart enough to know that it
is!! Either way, it's bad and either way, you are too stubborn to see people
here are trying to actually help you! Imagine that? Perfect strangers still
willing to help someone who has been a jackass and jerky the whole time.

Can you please explain to me how I've 'tolerated gross inefficiencies'?
Why the constant attacks - I'm on the newsgroup to help people, and you're
still just being an ass - and to someone who is trying to help you!! Not
cool.

I haven't deliberately attacked anyone. I've been simply trying to clarify for
a few posters specifically what is required.

As near as I can tell, you don't want to believe me that I know what the
requirements are.

I've tried to calmly and effectively re-iterate my requirements. Again, you
may not want to agree with them, you may not understand them, and you may not
like them. But the fact remains, the requirements have been carefully examined
on our end, and what I originally posted is what is still required.

I *have* looked at a database solution, and the previous implementation *DID*
use secondary hash indexes into the data. This is *exactly* what we are trying
to move away from due to the added complexity, maintenance issues, etc. that
just aren't justified for this component. Meeting the goal of < 10 second
searches w/ a simple/maintainable solution is (in our case!) infinitely more
desirable than sub-second searches at the expense of increased complexity and
maintainability. Honestly, it is that simple, why do you want to make it so
much more complicated?
Ya know what, I don't know where you get this sense of entitlement, everyone
that posted, did so because they WANT to. We don't owe you anything! You
aren't entitled to help!

Sense of entitlement? I'm baffled by that one. I had a question about a
particular requirement, got a lot more comments about the _requirements_ and
very few responses to the original question.

One of the more important things that I've learned about answering questions in
this and other forums is this:

1) don't assume

2) have faith that the op knows what they want (unless they indicate otherwise)

3) if the op isn't clear, ask a follow-up question

4) when it is clear what the op is after, specifically answer that question

5) if there are other methods or alternatives available but differ slightly
from the stated requirements, add that as a post script

Here is an example:

Jill: I need to search an unformatted 100 MB text file for a string in less
than 10 seconds, what is the fastest way to do this in C#?

Jack: I'm not aware of anything in C# other than opening the file as a text
stream and just searching over the data (brute force). You could use B-M (and
related) searching techniques to speed up the actual string search. Can you
index the data or store it in a database?

Jill: Thanks for the info. I was hoping there was something native to the
language/framework and more efficient than brute force, but maybe it will work
in this case. I'll give it a shot and let you know what kind of results I
get. BTW: my target system is 1.3 GHz w/ 500 MB RAM. Regarding
databases/indexing: we specifically do not want to go down this route (we
already implemented it that way, and for the specific needs, it turned out to
be a *huge* pain to maintain, the benefit just wasn't there for this
component).

Jill: Tried out the brute force method, and it works just great. Times are
around 5 seconds, implementation is simple and it will be easily maintained.
Thanks!

Jack: No problem, glad to help out.
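As an aside, the B-M family of searches Jack mentions can be sketched as a Boyer-Moore-Horspool scan (illustrative only; this is not code from the thread):

```csharp
using System;

class Horspool
{
    // Boyer-Moore-Horspool: returns the index of the first occurrence
    // of 'needle' in 'haystack', or -1 if not found.
    static int IndexOf(string haystack, string needle)
    {
        if (needle.Length == 0) return 0;
        if (haystack.Length < needle.Length) return -1;

        // Bad-character table: how far the window may skip when the
        // last character of the window is 'c'.
        int[] shift = new int[char.MaxValue + 1];
        for (int i = 0; i <= char.MaxValue; i++)
            shift[i] = needle.Length;
        for (int i = 0; i < needle.Length - 1; i++)
            shift[needle[i]] = needle.Length - 1 - i;

        int pos = 0;
        while (pos <= haystack.Length - needle.Length)
        {
            int j = needle.Length - 1;
            while (j >= 0 && haystack[pos + j] == needle[j])
                j--;
            if (j < 0)
                return pos; // full match
            pos += shift[haystack[pos + needle.Length - 1]];
        }
        return -1;
    }
}
```

On typical text the window skips up to the pattern length per mismatch, which is why this family beats a naive character-by-character comparison for longer patterns; for a ~10-character needle the win over a simple scan is modest once disk I/O dominates.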

So you know what, if I see any more posts from you,
I won't bother to respond. And if you have further technical problems, take
it someplace else.

"Take it someplace else"??? Sorry, I wasn't aware that you were the moderator
of this newsgroup.
Newsgroups are filled with people of varying levels of
experience and expertise - some answers are helpful, some - not-so-much (my
own answers included) - but ya know what, when you have a problem, maybe a
combination of several things you've read will help you come to a solution.
If people don't give you the answer you want, you don't blast them!

I didn't blast anyone. I simply tried to convey my requirements, asking for
possible solutions, you kept attacking my requirements. Do you disagree with
that?
I'm even more annoyed that I wasted this much time on you and this retarded
topic!


I'm completely underwhelmed. I still think you're an idiot.

Sorry to hear that. I don't think that you are an idiot, and even if I did, I
wouldn't post it in a public forum.
 

Drebin

Alright.. last one and I'll let you have the last word if you want it..

Julie said:
Can you please explain to me how I've 'tolerated gross inefficiencies'?

Flat-files are the most inefficient and error-prone way to store, search and
sort data. Without structure that is enforced, you can't guarantee the data
integrity. Without data integrity, you completely undermine the usefulness
of your entire System! Without structure that is enforced, you can't rely on
more efficient ways to search and sort.

Databases were invented because they simplified what everyone was doing
manually again and again. After you've written a sorting algorithm enough
times, you think "Damn, I wish I didn't have to keep doing this over and
over". You start "leveraging" technology and reusability.

Imagine there is a "program" that stores data in a structured way, can
enforce the integrity of that data, allows for various ways to increase the
efficiency of searching and can sort very quickly. Sounds great! But it's
"called" a RDBMS.. and now all of a sudden you hate it again??

All I'm saying - is that when someone CHOOSES to stay with a text file, in
that native format, it's bad for several reasons:

-Structure is not enforced
-No guarantee of data integrity
-No efficient, native way of searching
-No efficient, native way of sorting

In other words, this is the most inefficient and useless form of data. You
have to write from scratch, anything you want to do to the data.

When this data is in a database, all of these are taken care of for free.
And when talking about price and manageability - I'll take managing a bunch
of databases over managing free-form text files all day long.

The bottom line of this, is that when someone is fired-up about working in
text-files - that spells disaster for me, because you are having a logical
computer rely on data that could very well be inconsistent/illogical, and
you have to write even the most basic utility (like searching). And I call
that gross inefficiency. In fact, you can't GET any more inefficient!!
As near as I can tell, you don't want to believe me that I know what the
requirements are.

I STILL question that you understand the implications of writing a System in
the year 2004 based on raw text files. I can't imagine a computer person
that understands the implications of doing this - and being OK with it..
nay, thinking that it's the BETTER way to do things!! I can't imagine not
leveraging the technologies that are available. This is not the first or
last time someone has needed to search a file - someone has already invented
the wheel for searching, USE IT!!
I *have* looked at a database solution, and the previous implementation *DID*
use secondary hash indexes into the data. This is *exactly* what we are trying
to move away from due to the added complexity, maintenance issues, etc. that
just aren't justified for this component. Meeting the goal of < 10 second
searches w/ a simple/maintainable solution is (in our case!) infinitely more
desirable than sub-second searches at the expense of increased complexity and
maintainability. Honestly, it is that simple, why do you want to make it so
much more complicated?

I consider text-files to be much more complicated, because they typically
need so much care and feeding. A database is pretty trouble-free. I can't
imagine the problems you'd run across to make you resort to going back to
text files! This is not a reasonable argument at all (given the information
that you've given).
Sense of entitlement? I'm baffled by that one. I had a question about a
particular requirement, got a lot more comments about the _requirements_ and
very few responses to the original question.

Because again, *many* people who post, don't understand the implications of
what they are saying. Most people don't post questions about what they
already know about - they post about something they are not as familiar
with. And if they aren't familiar, then they likely don't know of X and Y
shortcuts or easier ways to do what they are doing.

And you do come off as arrogant when you come around asking for comments,
then criticize the people who respond. You come around asking for help and
then criticize the way in which it's provided!
One of the more important things that I've learned about answering questions in
this and other forums is this:

1) don't assume
2) have faith that the op knows what they want (unless they indicate
otherwise)

I completely disagree. Most do not. Newsgroups are used by people during the
development process, not after the project is done and they are presenting a
demo. Many times, nay, MOST times - people just want to accomplish a fixed
task "I want to search through 100MB of data in < 10s".. one would expect
that people are going to say "There isn't anything that you can write or
implement that is going to be quicker and search faster than a database"
3) if the op isn't clear, ask a follow-up question
4) when it is clear what the op is after, specifically answer that question
5) if there are other methods or alternatives available but differ slightly
from the stated requirements, add that as a post script

OK, so after your little lesson, I got an 80% woo hoo!!! :)
"Take it someplace else"??? Sorry, I wasn't aware that you were the moderator
of this newsgroup.

Just my opinion.
Sorry to hear that. I don't think that you are an idiot, and even if I did, I
wouldn't post it in a public forum.

hahah oh this is different, taking the high road now, eh? Nice.

Well, this has been fun.
 

Julie

Drebin said:
Flat-files are the most inefficient and error-prone way to store, search and
sort data. Without structure that is enforced, you can't guarantee the data
integrity. Without data integrity, you completely undermine the usefulness
of your entire System! Without structure that is enforced, you can't rely on
more efficient ways to search and sort.

As I've previously said, the data source is an external piece of equipment of
which I (and our entire team) don't have control over.

Sorting is not a requirement.
Databases were invented because they simplified what everyone was doing
manually again and again. After you've written a sorting algorithm enough
times, you think "Damn, I wish I didn't have to keep doing this over and
over". You start "leveraging" technology and reusability.

I realize all of that. As I've said, a database simply isn't required for this
component for about 10 reasons that I posted in a separate thread.
Imagine there is a "program" that stores data in a structured way, can
enforce the integrity of that data, allows for various ways to increase the
efficiency of searching and can sort very quickly. Sounds great! But it's
"called" a RDBMS.. and now all of a sudden you hate it again??

Nope, I use database(s) all of the time, and it is used as well in the project
I'm working on. As I've stated, this particular component simply doesn't have
the requirement for database (structured) access, and as an average, actually
decreases performance/productivity when factoring in all costs associated with
a database.
All I'm saying - is that when someone CHOOSES to stay with a text file, in
that native format, it's bad for several reasons:

As I've said, I'm not _choosing_ to stay with a text file; that is the
requirement.
-Structure is not enforced

Not required
-No guarantee of data integrity

Basic file system integrity is sufficient
-No efficient, native way of searching

Nothing more than 1 hit in a 100 MB file in 10 seconds or less on a 1.3 GHz/500
MB RAM is required.
-No efficient, native way of sorting

Not required.
In other words, this is the most inefficient and useless form of data. You
have to write from scratch, anything you want to do to the data.

In my case, the 'writing from scratch' amounts to 10 lines of code.
When this data is in a database, all of these are taken care of for free.

That is where you are just plain wrong. You are not considering the entire
'database' picture:

- implementation and development costs
- licensing costs
- installation requirements/costs
- initial/subsequent processing of data costs (how is the data going to get
into the database? Remember, this is *not* structured data, forcing it into
columns would essentially dictate that either a new table is created for each
(of *many*) variant formats, or all of the data is stored in a single blob)
- maintenance costs

None of that is free, and *must* be factored into the analysis of the problem.
I can say that all of these items were considered, /prior/ to the formal
definition of the requirements.
And when talking about price and manageability - I'll take managing a bunch
of databases over managing free-form text files all day long.

That sounds great for your project. For ours, this isn't the case.
The bottom line of this, is that when someone is fired-up about working in
text-files - that spells disaster for me, because you are having a logical
computer rely on data that could very well be inconsistent/illogical, and
you have to write even the most basic utility (like searching). And I call
that gross inefficiency. In fact, you can't GET any more inefficient!!

Not at all fired up about using text files, however that is what the
requirement is, and I can therefore work within those bounds.

Personally, I prefer to use the most simple solution that solves the problem,
nothing more.
I STILL question that you understand the implications of writing a System in
the year 2004 based on raw text files. I can't imagine a computer person
that understands the implications of doing this - and being OK with it..
nay, thinking that it's the BETTER way to do things!! I can't imagine not
leveraging the technologies that are available. This is not the first or
last time someone has needed to search a file - someone has already invented
the wheel for searching, USE IT!!

And that is what I was asking for. Instead of providing an answer to my
question, you changed the question.
Because again, *many* people who post, don't understand the implications of
what they are saying. Most people don't post questions about what they
already know about - they post about something they are not as familiar
with. And if they aren't familiar, then they likely don't know of X and Y
shortcuts or easier ways to do what they are doing.

I understand that. I do not fall into that category.
And you do come off as arrogant when you come around asking for comments,
then criticize the people who respond. You come around asking for help and
then criticize the way in which it's provided!

Please show me where I've been critical. Some respondents started by saying
that a database is a solution to the problem. I simply stated that a database
(and indexing/sorting) is not a viable solution.
"There isn't anything that you can write or
implement that is going to be quicker and search faster than a database"

I'm not asking for anything quicker than a database, I'm asking for the
simplest solution to the problem as defined by the requirements.
hahah oh this is different, taking the high road now, eh? Nice.

No, taking the same road. You were the one who called me an idiot.
 

Daniel O'Connell [C# MVP]

For what its worth, by everything you've written in this thread, a flat file
search was almost certainly the right way to go.

Let me guess, this was a semi-structured tag based file, perhaps something
like

SMPL 1848.44 48174 85771
MP21 181.11 87

etc?
 

Julie

Daniel O'Connell said:
For what its worth, by everything you've written in this thread, a flat file
search was almost certainly the right way to go.

Let me guess, this was a semi-structured tag based file, perhaps something
like

SMPL 1848.44 48174 85771
MP21 181.11 87

etc?

Yes, pretty close. The datafiles contain protein information in a very loose
format. The objective is to search the file for a particular protein.
 
