Phill said:How would you load a 100MB txt file into a DB and then search it for a word? How would that work?
Julie said:Thanks for the tips. I wasn't aware that memory-mapped file support wasn't
available in .NET; that seems short-sighted to me.
Julie said:Like I indicated in another follow-up: "These are loosely formatted
datafiles from external laboratory instruments." The requirement is
to work w/ these files, not change the file format.
James said:What good is a memory mapped file in an environment where you cannot
directly access memory?
Michael said:Exactly my point. The only way I can think of to search 100 MB files
quickly would require a separate index, preferably kept in a separate file
so that it didn't have to be re-created from scratch every darn time you run
the program. That being said, that's exactly what SQL does; any type of
separate indexing method would basically be re-inventing the wheel. And
heck, they're giving away MSDE for free, so you don't even have to buy SQL
Server. Any solution short of using a separate index of some sort is going
to be very non-scalable and comparatively slow. But alack and alas, to each
her own...
Michael said:You've got some serious specifications, but you haven't given enough information
to really help you find a solution. That's why I suggested using a language
that was designed specifically for text processing.
Maybe you could provide more information, like:
You specify 10 seconds to locate text matches in a 100 MB flat text file.
Is the file already loaded into memory?
Do you want to load the whole file into memory first?
Does the load time count against your 10 seconds, or is
it in addition to them? If it counts against your 10 seconds, can your
recommended system configuration load it in 10 seconds? (If not, the whole
point is moot).
How many matches are you searching for? One match, or every match?
Is the file structured in such a way that its format can be
leveraged to speed up the process?
Are certain fields searched more often than others?
Assuming I was *stuck* with a 100 MB flat text file, and no option to
utilize a SQL database or other method of access, I suppose I'd have to
*reinvent the wheel* and create a separate index file to retrieve the data
in reasonable time frames. Of course that may not be an option for you.
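To make that concrete, here's a rough C# sketch of what such an index might look like (the delimiters, the word-level granularity, and all the names are my own assumptions, not anything taken from the actual file format): one pass builds a word-to-line-number map, it gets persisted to a side file, and lookups then never touch the 100 MB file again.

using System;
using System.Collections.Generic;
using System.IO;

class WordIndex
{
    // word -> line numbers it appears on (case-insensitive)
    private Dictionary<string, List<int>> index =
        new Dictionary<string, List<int>>(StringComparer.OrdinalIgnoreCase);

    // One pass over the 100 MB file; done once, or whenever the file changes.
    public void Build(string dataFile)
    {
        char[] separators = { ' ', '\t', ',', ';' };   // assumed delimiters
        int lineNo = 0;
        using (StreamReader reader = new StreamReader(dataFile))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                lineNo++;
                foreach (string word in line.Split(separators, StringSplitOptions.RemoveEmptyEntries))
                {
                    List<int> lines;
                    if (!index.TryGetValue(word, out lines))
                        index[word] = lines = new List<int>();
                    lines.Add(lineNo);
                }
            }
        }
    }

    // Persist the index so it isn't rebuilt from scratch on every run.
    public void Save(string indexFile)
    {
        using (StreamWriter writer = new StreamWriter(indexFile))
            foreach (KeyValuePair<string, List<int>> entry in index)
                writer.WriteLine(entry.Key + "\t" + string.Join(",", entry.Value));
    }

    // A lookup is now a hash probe instead of a 100 MB scan.
    public IList<int> Lookup(string word)
    {
        List<int> lines;
        return index.TryGetValue(word, out lines) ? (IList<int>)lines : new int[0];
    }
}

Whether line numbers, byte offsets, or whole records are the right thing to store depends on how the results need to be presented, which is exactly the kind of requirement that hasn't been nailed down yet.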
Drebin said:Create a loader program/BCP/DTS job to split the records accordingly and
load them into a table structure (say, a table named "customer")... then you could do
something like:
select customerid from customer where customer_lastname = 'smith'
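A bare-bones loader along those lines might look like the following C# sketch (the connection string, file name, tab delimiter, and column layout are only illustrative guesses, since the real record format hasn't been posted):

using System;
using System.Data;
using System.Data.SqlClient;
using System.IO;

class Loader
{
    static void Main()
    {
        // Placeholder connection string - point it at the real server/database.
        string connStr = "Server=(local);Database=LabData;Integrated Security=SSPI;";
        using (SqlConnection conn = new SqlConnection(connStr))
        {
            conn.Open();
            SqlCommand insert = new SqlCommand(
                "INSERT INTO customer (customer_lastname, customer_firstname) " +
                "VALUES (@last, @first)", conn);
            SqlParameter last = insert.Parameters.Add("@last", SqlDbType.VarChar, 100);
            SqlParameter first = insert.Parameters.Add("@first", SqlDbType.VarChar, 100);

            using (StreamReader reader = new StreamReader("labdata.txt"))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    string[] fields = line.Split('\t');   // assumed tab-delimited
                    if (fields.Length < 2) continue;      // skip malformed records
                    last.Value = fields[0];
                    first.Value = fields[1];
                    insert.ExecuteNonQuery();
                }
            }
        }
    }
}

With an index on customer_lastname, the SELECT above comes back in a fraction of the 10-second budget; BCP or DTS would simply replace the row-at-a-time loop with a bulk load.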
James said:Well, admittedly I don't know the specifics of the contract, but it
seems that you are taking that a bit too literally. As far as I can see,
you are required to:
A) Read in the 100 MB flat text file, and
B) Spit out results found within that file.
Exactly how you get from A) to B) is strictly your concern, and if you want
to implement it by loading it into a SQL database or otherwise indexing it,
no one else even needs to know about it.
Julie said:Assume the input/output requirements to be the same as grep.
James Curran said:And it's *completely acceptable* for GREP to be implemented using a SQL
server behind the scenes. Granted that probably wouldn't be a very good
implementation, but as long as the *inputs* and *outputs* are as
expected, the implementation is irrelevant.
Michael said:Oh that's simple then. You point your browser at
http://www.thecodeproject.com/csharp/wingrep.asp#xx825897xx and contact
Jean-Michel Bezeau for a copy of his Grep stand-alone class. Then you
implement it and see if it does the job in 10 seconds or less.
Willy Denoyette said:The real problem is to find the fastest way to read the data into memory.
Reading a 100 MB file will take something like 5 - 10 seconds, depending on
the IO subsystem used (RAID 0, single 7200 RPM drive) and assuming the data
is not cached.
Depending on the results of the file data load time, you can decide to use a
naive algorithm or opt for a faster algorithm like the Boyer-Moore
algorithm.
Consider that:
- the Boyer-Moore algorithm (with a pattern length of 10) is about 4 - 6
times faster than a naive algorithm,
- a naive algorithm should be able to search a 10 char pattern in less than
one sec on a decent system (P4 - 2.8 GHz).
So it's really important to know exactly how much time will be spent to
bring the data in memory before you decide upon the searching algorithm.
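For reference, here is a rough C# sketch of the Horspool simplification of Boyer-Moore (bad-character shifts only; the full algorithm adds a good-suffix rule on top), assuming the whole file has already been read into a string:

using System;

class Search
{
    // Boyer-Moore-Horspool: returns the first index of pattern in text, or -1.
    static int IndexOf(string text, string pattern)
    {
        int n = text.Length, m = pattern.Length;
        if (m == 0) return 0;
        if (m > n) return -1;

        // Shift table: how far the search window may slide when the character
        // under its last position is c. Default shift is the full pattern length.
        int[] shift = new int[char.MaxValue + 1];
        for (int c = 0; c <= char.MaxValue; c++) shift[c] = m;
        for (int i = 0; i < m - 1; i++) shift[pattern[i]] = m - 1 - i;

        int pos = 0;
        while (pos <= n - m)
        {
            int j = m - 1;
            while (j >= 0 && text[pos + j] == pattern[j]) j--;   // compare right to left
            if (j < 0) return pos;                               // full match at pos
            pos += shift[text[pos + m - 1]];                     // bad-character shift
        }
        return -1;
    }
}

The shift table is what lets the search skip up to pattern-length characters per mismatch, which is where the 4 - 6 times advantage over a character-by-character scan comes from on longer patterns.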
Drebin said:Julie,
Please don't take offense; I have no desire to get into a pissing contest -
I am just flabbergasted by your viewpoint. I can honestly say I've never met
anyone who thinks like this. So after this, I'll back off!
First, to talk to what you said before about "getting solid requirements
first" - the problem with getting requirements isn't that people are
incompetent; it's very much human nature not to know what you want until you
see some of your project come to life. Using the analogy of building a house,
you may not realize until after the 2nd floor is in place that you have a
really cool view, and it would've been nice to build a balcony. So I think
that people many times are UNABLE to make solid choices on requirements
because you just can't see that far ahead.
So I believe human nature makes it quite impossible to make 100% accurate
requirements. It's just not possible (for most large projects).
As to your point below, computers are based on structure. If you want to
KEEP your data unstructured, you are going to be fighting the computer the
whole way. So if you are asking me, I would be spending all of my time right
now, getting this unstructured data into some sort of structure. If your
source for this data gives it to you unstructured, then you need to get
yourself a new data source or build a converter of some sort. You alluded to
this being data from an electronic device of some sort. It seems to me that
if it just gave you a random array of numbers and values, it would be
completely useless.
So - bottom line: if you have to deal with data that doesn't have structure,
you should address that first. You are going to spend 3x as much time trying
to do any little thing with this data, versus just getting it straightened
out at the beginning. If you do that, then you can leverage a TON of
technology and products (like SQL Server) rather than writing a custom
version of them for this specific problem. "Write-once-use-once" software is
soooo "mid-90's"... "Write-once-use-many-times" is what things have evolved to.