Help - parsing large text files


ArunPrakash

Hi,
I have a web application that searches for a particular string in a set
of huge files (the files grow into the MBs; the largest I have seen is 30
MB), using regular expressions. The string can occur multiple times in a
file, and whenever the string is found in a line, the whole line must be
printed in the output. What I am doing is:
1. Traverse each file in the directory.
2. Read each file line by line.
3. Match the regular expression against the line. If the search string is
there, add the line to a DataTable (which will be used for display) - see
the sketch below.
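
Roughly, the code looks like this (a simplified sketch with illustrative
names, not the actual code):

```csharp
using System.Data;
using System.IO;
using System.Text.RegularExpressions;

class LineByLineSearch
{
    // Current approach: visit every file, read it line by line,
    // and keep each matching line for display.
    static DataTable Search(string directory, string pattern)
    {
        Regex re = new Regex(pattern);
        DataTable results = new DataTable();
        results.Columns.Add("File");
        results.Columns.Add("Line");

        foreach (string file in Directory.GetFiles(directory))
        {
            using (StreamReader reader = new StreamReader(file))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    if (re.IsMatch(line))
                        results.Rows.Add(file, line);
                }
            }
        }
        return results;
    }
}
```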

The problem I am facing with this approach is that the operation takes too
long, due to both the size of the files and the number of files to search.

The limitations are:
1. I cannot read the entire file into a string and then do the search,
because of the size of the files.
2. Network - the files are scattered over a LAN.
3. I could not find a way to use a BufferedStream or something like that
(what if the searched string itself is split across different chunks?).

Can anyone help me with this?

Thanks & Regards,
Arun Prakash. B
 

Ignacio Machin (.NET/C# MVP)

Hi,

The only way I see to improve your performance is to use threads to read
several files at the same time, and even then I'm not sure this would help
you much.
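
Something along these lines (a rough sketch; Parallel.ForEach assumes .NET
4 or later, on older frameworks you would hand the work to the ThreadPool
instead):

```csharp
using System.Collections.Concurrent;
using System.IO;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

class ParallelSearch
{
    static ConcurrentBag<string> Search(string[] files, Regex re)
    {
        ConcurrentBag<string> hits = new ConcurrentBag<string>();

        // Search several files at the same time;
        // each file is still read line by line.
        Parallel.ForEach(files, file =>
        {
            foreach (string line in File.ReadLines(file))
                if (re.IsMatch(line))
                    hits.Add(file + ": " + line);
        });
        return hits;
    }
}
```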

Also, this does not seem like a good task for a web app; you may get
timeouts too often.
It would be better if the user could "schedule a job"; the search is then
done in the background (MSMQ?) and does not impact the user experience.
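
The queuing part could be as simple as this (a sketch; the queue path and
message format are made up for illustration):

```csharp
using System.Messaging;

class SearchJobQueue
{
    const string QueuePath = @".\private$\searchjobs"; // illustrative name

    // The web page only enqueues the job and returns immediately;
    // a background service reads the queue and runs the actual search.
    public static void Schedule(string directory, string pattern)
    {
        if (!MessageQueue.Exists(QueuePath))
            MessageQueue.Create(QueuePath);

        using (MessageQueue queue = new MessageQueue(QueuePath))
            queue.Send(directory + "|" + pattern, "search job");
    }
}
```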


Hope this helps,
 

ArunPrakash Balakrishnan

Actually, we considered a Windows application too, but found that a web
application suits our needs better. The problem with the background process
is how to pass the information back to the end user (the HTTP protocol!).
 

Niki Estner

I think you should first identify the bottleneck: measure the number of
bytes/second in your current application, and compare that to the
throughput of reading big chunks of data without any processing.
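
For example, something like this gives the raw read throughput to compare
against (a sketch):

```csharp
using System.Diagnostics;
using System.IO;

class ThroughputTest
{
    // Read the file with no processing at all and report MB/second,
    // to compare against the full application's throughput.
    static double MegabytesPerSecond(string file)
    {
        byte[] buffer = new byte[64 * 1024];
        long total = 0;
        Stopwatch watch = Stopwatch.StartNew();

        using (FileStream stream = File.OpenRead(file))
        {
            int read;
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
                total += read;
        }
        return (total / (1024.0 * 1024.0)) / watch.Elapsed.TotalSeconds;
    }
}
```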
If the network is the bottleneck, the best thing you can do is do the
processing locally, transferring only the results; or buy a faster network.
Otherwise, I'd suggest reading the file in big chunks: searching for a RE
should be faster on one long string than on many short ones. You can look
for the newlines to the "left" and to the "right" of the matches.
I assume the RE is compiled?
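
A sketch of the chunked approach (illustrative; it assumes the pattern
never matches across a line break, which holds for typical line-oriented
patterns):

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

class ChunkedSearch
{
    // Read the file in big chunks and run a compiled regex over each chunk.
    // Chunks are cut at the last newline, so no line (and hence no match)
    // is ever split across two chunks. Each match is expanded to the
    // enclosing line by scanning for the surrounding newlines.
    static IEnumerable<string> Search(string file, string pattern)
    {
        Regex re = new Regex(pattern, RegexOptions.Compiled);
        List<string> lines = new List<string>();

        using (StreamReader reader = new StreamReader(file))
        {
            char[] buffer = new char[1 << 20]; // 1 MB chunks
            string carry = "";                 // text after the last newline

            int read;
            while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
            {
                string chunk = carry + new string(buffer, 0, read);
                int lastNewline = chunk.LastIndexOf('\n');
                if (lastNewline < 0) { carry = chunk; continue; }

                carry = chunk.Substring(lastNewline + 1);
                chunk = chunk.Substring(0, lastNewline);

                foreach (Match m in re.Matches(chunk))
                {
                    // Expand the match to the whole enclosing line.
                    int start = chunk.LastIndexOf('\n', m.Index) + 1;
                    int end = chunk.IndexOf('\n', m.Index + m.Length);
                    if (end < 0) end = chunk.Length;
                    lines.Add(chunk.Substring(start, end - start));
                }
            }
            if (carry.Length > 0 && re.IsMatch(carry))
                lines.Add(carry); // leftover last line
        }
        return lines;
    }
}
```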

Niki
 

ArunPrakash Balakrishnan

Yeah, we've considered copying the files to a local folder too. But again,
I guess the bottleneck is the size of the files themselves (based on some
benchmarks we did with files on various machines and locally).
Does anybody know how Unix grep optimizes its search? I found a grep
utility in C# which again does the same thing (reading line by line and
finding a match). If I can find out how Unix grep optimizes the search, I
can implement that and see how it works.
 

Ignacio Machin (.NET/C# MVP)

Hi,

If the process takes a long time, as I'm sure it does, it is not suitable
to run in real time. Do as I said: schedule it, and the user can then get
notified by email or when he logs back into the page. Otherwise you will
have to create a "wait" page, like the one you get when using Expedia,
hotels.com, etc.
Even so, I believe this search can take longer than what is advisable in a
web app.

cheers,
 

Niki Estner

ArunPrakash Balakrishnan said:
> Yeah, we've considered copying the files to a local folder too.

I don't think this would do much good. As far as I understand, the files
you search are distributed over a network; instead of reading complete
files over the network just to find a few lines, you should send the
search pattern to the computer which hosts the file and do the processing
there.
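
A rough sketch of that idea: a small agent runs on the machine that hosts
the files, and only the matching lines travel back over the LAN (the
endpoint, port, and query parameters here are made up for illustration):

```csharp
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

class SearchAgent
{
    // Runs on the file server; the web app sends the pattern over HTTP
    // and receives only the matching lines.
    static void Main()
    {
        HttpListener listener = new HttpListener();
        listener.Prefixes.Add("http://+:8080/search/"); // illustrative endpoint
        listener.Start();

        while (true)
        {
            HttpListenerContext ctx = listener.GetContext();
            string file = ctx.Request.QueryString["file"];
            string pattern = ctx.Request.QueryString["pattern"];
            Regex re = new Regex(pattern, RegexOptions.Compiled);

            using (StreamWriter writer = new StreamWriter(ctx.Response.OutputStream))
            {
                foreach (string line in File.ReadLines(file))
                    if (re.IsMatch(line))
                        writer.WriteLine(line);
            }
        }
    }
}
```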
> But again, I guess the bottleneck is the size of the files themselves
> (based on some benchmarks we did with files on various machines and
> locally).

No, you need to find the bottleneck that limits your application's
throughput. You have two factors, network bandwidth and processing speed.
One of the two is the limiting factor, the "bottleneck". Tweaking the
other one will have little or no effect on the throughput. (This doesn't
depend on file sizes.)
> Does anybody know how Unix grep optimizes its search? I found a grep
> utility in C# which again does the same thing (reading line by line and
> finding a match). If I can find out how Unix grep optimizes the search,
> I can implement that and see how it works.

If you only have fixed search strings, google for "Boyer-Moore". (.NET
regexes use this optimization.)
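
For reference, the core of the simpler Horspool variant of Boyer-Moore
looks like this (a sketch, not necessarily what .NET does internally):

```csharp
class Horspool
{
    // Boyer-Moore-Horspool: precompute, for each character, how far the
    // pattern may be shifted when that character sits under the pattern's
    // last position. Mismatches then skip several characters at once.
    static int IndexOf(string text, string pattern)
    {
        int m = pattern.Length;
        int[] shift = new int[char.MaxValue + 1];
        for (int i = 0; i <= char.MaxValue; i++) shift[i] = m;
        for (int i = 0; i < m - 1; i++) shift[pattern[i]] = m - 1 - i;

        int pos = 0;
        while (pos <= text.Length - m)
        {
            int j = m - 1;                       // compare right to left
            while (j >= 0 && text[pos + j] == pattern[j]) j--;
            if (j < 0) return pos;               // full match
            pos += shift[text[pos + m - 1]];     // skip ahead
        }
        return -1;
    }
}
```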

Niki
 
