Multiple threads reading from the same file... is it possible?

WATYF

Hi there... I have a huge text file that needs to be processed. At the
moment, I'm loading it into memory in small chunks (x amount of lines)
and processing it that way. I'd like the process to be faster, so I'd
like to try creating multiple threads, and having them load different
chunks of the file at the same time and process it asynchronously.

Is it possible to do something like that, and if so, what would be
needed to do so?


WATYF
 
Michel Posseth [MCP]

Yes, it is possible. You can open a file in shared mode, so multiple
apps or threads can read it at the same time.

However, are you sure that file access is your bottleneck?

I once had to write a data conversion program for a 7+ GB file. At first
it took hours to complete; in the end it took only a few minutes, with
submitting the data to the database taking the most time.

The change that gave me the biggest performance boost was using a
StringBuilder object instead of a temporary string.
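
A rough illustration of the buffer-versus-temp-string idea, written in Python because the thread has no code of its own. This is only an analogue: io.StringIO stands in for .NET's StringBuilder, and the function names are made up for the example.

```python
import io


def build_with_concat(pieces):
    # Temp-string style: every += may copy everything accumulated so far.
    out = ""
    for piece in pieces:
        out += piece
    return out


def build_with_buffer(pieces):
    # Buffer style: append to a growable buffer and create the string once.
    buf = io.StringIO()
    for piece in pieces:
        buf.write(piece)
    return buf.getvalue()


if __name__ == "__main__":
    rows = [f"row {i};" for i in range(100_000)]
    # Both approaches produce the same text; only the cost of building differs.
    assert build_with_concat(rows) == build_with_buffer(rows)
```

How much the buffer helps depends on the runtime, so it is worth timing both versions on real data before committing to either.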


HTH

Michel
 
Guest

First, you have to devise a way for the multiple threads to know which lines
to process. If you plan to delete the processed lines somehow, you would
have to open the file with a read/write lock so that only one thread
can get to it at a time. If you do it this way, then you must
wrap the file open in Try/Catch, because two or more threads can try
to open the file concurrently and only one can hold the lock at a time.

If you are not going to delete from the file, then you still have to devise
a way for each thread to know what to process. You could have a "global"
counter/pointer, locked with a mutex while updating from a thread, that told
you where the next thread was to start. However, if one or more threads
abort for some reason, then the counter/pointer would not be updated to
reflect the unprocessed records.

If you simply want to have multiple threads reading the file concurrently,
keeping up with whether a record has been processed by another thread, they
can do that by opening the file with read/share access options.
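
A minimal sketch of the "global counter/pointer locked with a mutex" idea, in Python rather than VB.NET. ChunkCursor and read_with_threads are hypothetical names, and the chunks are plain byte ranges, so line-oriented code would still have to deal with lines that straddle a chunk boundary. As noted above, a thread that aborts after claiming a range leaves that range unprocessed; the sketch does not handle that.

```python
import os
import tempfile
import threading


class ChunkCursor:
    """Shared pointer that hands out non-overlapping byte ranges of a file."""

    def __init__(self, file_size, chunk_size):
        self._next_start = 0
        self._file_size = file_size
        self._chunk_size = chunk_size
        self._lock = threading.Lock()

    def claim(self):
        # Only one thread at a time may read and advance the shared position.
        with self._lock:
            if self._next_start >= self._file_size:
                return None  # nothing left to hand out
            start = self._next_start
            length = min(self._chunk_size, self._file_size - start)
            self._next_start = start + length
            return start, length


def read_with_threads(path, chunk_size=1 << 20, thread_count=2):
    cursor = ChunkCursor(os.path.getsize(path), chunk_size)
    chunks = {}
    chunks_lock = threading.Lock()

    def worker():
        # Each thread opens its own read-only handle, so seeks do not collide.
        with open(path, "rb") as f:
            while True:
                claimed = cursor.claim()
                if claimed is None:
                    return
                start, length = claimed
                f.seek(start)
                data = f.read(length)
                with chunks_lock:
                    chunks[start] = data

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Reassemble in file order to show every byte was claimed exactly once.
    return b"".join(chunks[start] for start in sorted(chunks))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo_path = os.path.join(tmp, "demo.txt")
        with open(demo_path, "wb") as f:
            f.write(b"line\n" * 1000)
        data = read_with_threads(demo_path, chunk_size=64, thread_count=4)
        assert data == b"line\n" * 1000
```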

HTH
 
Trevor Benedict

Why not read the file with a single thread and let other thread(s)
process the chunks? That should be easier for you to deal with.
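
One way that could look, as a hedged Python sketch (the thread itself has no code): a single reader puts chunks of lines on a bounded queue and worker threads take them off. The names read_and_process and process_chunk are invented for the example, and error handling is left out. In CPython, threads mainly help while waiting on I/O, so the gain for CPU-heavy processing may be small.

```python
import itertools
import os
import queue
import tempfile
import threading


def read_and_process(path, process_chunk, lines_per_chunk=1000, worker_count=2):
    # Bounded queue: the reader blocks instead of loading the whole file ahead.
    chunks = queue.Queue(maxsize=worker_count * 2)
    results = {}
    results_lock = threading.Lock()

    def worker():
        while True:
            item = chunks.get()
            if item is None:  # sentinel: the reader has finished
                return
            index, lines = item
            value = process_chunk(lines)
            with results_lock:
                results[index] = value

    workers = [threading.Thread(target=worker) for _ in range(worker_count)]
    for w in workers:
        w.start()

    # Single reader: this loop is the only code that touches the file.
    index = 0
    with open(path, "r", encoding="utf-8") as f:
        while True:
            lines = list(itertools.islice(f, lines_per_chunk))
            if not lines:
                break
            chunks.put((index, lines))
            index += 1

    for _ in workers:
        chunks.put(None)  # one sentinel per worker
    for w in workers:
        w.join()
    return [results[i] for i in range(index)]  # results in file order


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo_path = os.path.join(tmp, "demo.txt")
        with open(demo_path, "w", encoding="utf-8") as f:
            f.writelines(f"row {i}\n" for i in range(2500))
        counts = read_and_process(demo_path, len)
        assert sum(counts) == 2500
```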

Regards,

Trevor Benedict
MCSD
 
WATYF

I need the reading to be multi-threaded the most. Reading from the
file is what takes the longest. Actually processing the chunks of the
file takes much less time than loading it into memory.

I may have figured it out, though... although I'm not sure yet if it's
really doing what I expect it to be doing.

I used a combination of TextReaders/TextWriters that were created
using TextReader.Synchronized and also synchronized arraylists which
were created using ArrayList.Synchronized. One of the FileOptions for
the Streams is FileOptions.Asynchronous, so I used that when opening
the files for reading/writing.
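
For readers outside .NET, a loose Python analogue of that arrangement: one shared reader and one shared list, each guarded by a lock, which is roughly the kind of thread-safe wrapper TextReader.Synchronized and ArrayList.Synchronized return. SynchronizedLineReader and count_lines_with_threads are hypothetical names, this is not the code from the post, and it does not model FileOptions.Asynchronous.

```python
import itertools
import os
import tempfile
import threading


class SynchronizedLineReader:
    """Wraps one open text file so threads take turns reading chunks of lines."""

    def __init__(self, file_obj):
        self._file = file_obj
        self._lock = threading.Lock()

    def read_lines(self, count):
        # The lock makes each chunk a contiguous run of lines from the file.
        with self._lock:
            return list(itertools.islice(self._file, count))


def count_lines_with_threads(path, lines_per_chunk=100, thread_count=2):
    totals = []  # shared list, guarded by totals_lock
    totals_lock = threading.Lock()

    with open(path, "r", encoding="utf-8") as f:
        reader = SynchronizedLineReader(f)

        def worker():
            while True:
                lines = reader.read_lines(lines_per_chunk)
                if not lines:
                    return  # end of file
                with totals_lock:
                    totals.append(len(lines))

        threads = [threading.Thread(target=worker) for _ in range(thread_count)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    return sum(totals)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo_path = os.path.join(tmp, "demo.txt")
        with open(demo_path, "w", encoding="utf-8") as f:
            f.writelines(f"row {i}\n" for i in range(1000))
        assert count_lines_with_threads(demo_path, thread_count=4) == 1000
```

Because every read goes through one lock, the reads themselves are serialized; any gain has to come from overlapping one thread's processing with another thread's read.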

WATYF
 
eBob.com

If the huge text file is on one hard drive and if the CPU processing is
minimum (as you have said in a separate post) I don't see how multiple
threads will help. Multiple threads are not going to make the hard drive
spin any faster.

Bob
 
C-Services Holland b.v.

WATYF wrote:
> I need the reading to be multi-threaded the most. Reading from the
> file is what takes the longest. Actually processing the chunks of the
> file takes much less time than loading it into memory.

You're basically saying that the hard disk is the bottleneck. It's
supplying the data as fast as it can. Creating multiple threads that read
the file would only make it slower, because you'd be making the head of
the drive jump all over the place to satisfy all the requests. Moving
that head takes time.

The only time you can probably win back is the processing time, so
you could hand that off to another thread so the reading thread can keep
reading and doesn't have to stop to process the data.
 
WATYF

eBob.com wrote:
> If the huge text file is on one hard drive and if the CPU processing is
> minimum (as you have said in a separate post) I don't see how multiple
> threads will help. Multiple threads are not going to make the hard drive
> spin any faster.
>
> Bob

You know... what you're saying makes perfect sense, but apparently, it
doesn't work that way for whatever reason...maybe someone smarter than
me can explain it. :) Anyway... I figured out how to do it. I did a
write-up on it in case anyone else runs into the same scenario.

http://www.musicalnerdery.com/net-p...file-sequentially-using-multiple-threads.html

Using multiple threads on a ~2GB file increased my performance
significantly. The total processing time went from 4.1 minutes to 3.0
minutes.

WATYF
 
WATYF

Michel Posseth [MCP] wrote:
> Yes, it is possible. You can open a file in shared mode, so multiple
> apps or threads can read it at the same time.
>
> However, are you sure that file access is your bottleneck?
>
> I once had to write a data conversion program for a 7+ GB file. At first
> it took hours to complete; in the end it took only a few minutes, with
> submitting the data to the database taking the most time.
>
> The change that gave me the biggest performance boost was using a
> StringBuilder object instead of a temporary string.
>
> HTH
>
> Michel


I am sure that disk I/O was the main issue, but thanks for reminding
me about StringBuilders... they're next on my list to investigate for
ways to speed this thing up.

WATYF
 
eBob.com

Thanks for sharing your experience.

A while ago there was a performance problem known (to me at least) as
"missing a revolution". This would happen when you read and processed
record 1; then you go to read record 2; but by then record 2 would have
already passed the read head and the program then has to wait almost a whole
revolution of the disk to read the next record. BUT today with hard drives
having a cache I don't see how this would explain what you are seeing. I
hope someone who knows more about hard drive I/O than we do will explain
this to us.

Bob
 
Guest

Unless you are programming multiple processors, I find it hard to understand
how threads will help you since the critical path of your application seems
to be to completely process the file. I don't think threads run at the same
time, i.e., when one thread is executing, others are waiting for processor
time. It used to be called time sharing on the old IBM mainframes.
 
WATYF

eBob.com wrote:
> Thanks for sharing your experience.
>
> A while ago there was a performance problem known (to me at least) as
> "missing a revolution". This would happen when you read and processed
> record 1; then you go to read record 2; but by then record 2 would have
> already passed the read head and the program then has to wait almost a whole
> revolution of the disk to read the next record. BUT today with hard drives
> having a cache I don't see how this would explain what you are seeing. I
> hope someone who knows more about hard drive I/O than we do will explain
> this to us.
>
> Bob


I think I figured it out. I put an updated explanation at the very end
of the article.

http://www.musicalnerdery.com/net-p...file-sequentially-using-multiple-threads.html

WATYF
 
WATYF

Guest wrote:
> Unless you are programming multiple processors, I find it hard to understand
> how threads will help you since the critical path of your application seems
> to be to completely process the file. I don't think threads run at the same
> time, i.e., when one thread is executing, others are waiting for processor
> time. It used to be called time sharing on the old IBM mainframes.

Yes, I'm programming for multiple processors. There is logic in the
code that only creates as many threads as there are processors on the
machine. So a dual core would use two threads and a quad core would
use four, etc. You're essentially correct, though... multiple threads
on a single CPU don't do any good in this case (so the code is set up
to only create one thread in that instance).
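
That sizing rule could be sketched like this in Python (not the actual code from the write-up; pick_thread_count and max_threads are made-up names):

```python
import os


def pick_thread_count(max_threads=None):
    # os.cpu_count() can return None when the count cannot be determined.
    cpus = os.cpu_count() or 1
    if max_threads is not None:
        cpus = min(cpus, max_threads)
    # Always use at least one thread, which covers the single-CPU case.
    return max(1, cpus)


if __name__ == "__main__":
    print(f"Using {pick_thread_count()} reader thread(s)")
```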

WATYF
 
Cor Ligthert[MVP]

Dennis,

> It used to be called time sharing on the old IBM mainframes.

But those had multiport disk access, or whatever it was called in those
days. IBM tried that on their PS systems with what I think was called
multi channel; as far as I know, no disk manufacturer ever made a disk
for that channel. In my opinion it was one of the failures of the PS system.

Cor
 
Cor Ligthert[MVP]

Michel,

Rinze has written almost the same thing, but I cannot resist.

I always answer this with: "Do you want to hear the rumble of your disk?"

:)

For the rest, see my answer to Dennis.

Cor
 
