Multiple threads reading from the same file... is it possible?

WATYF · Sep 20, 2007

Hi there... I have a huge text file that needs to be processed. At the
moment, I'm loading it into memory in small chunks (x amount of lines)
and processing it that way. I'd like the process to be faster, so I'd
like to try creating multiple threads, and having them load different
chunks of the file at the same time and process it asynchronously.

Is it possible to do something like that, and if so, what would be
needed to do so?

WATYF

Michel Posseth [MCP] · Sep 20, 2007

Yes, it is possible. You can open a file in shared mode, so it should be
possible to read it from multiple apps or threads.

However, are you sure that file access is your bottleneck?

I once had the task of writing a data conversion program for a 7+ GB file.
At the start it took hours to complete; in the end it took only a few
minutes, with submitting the data to the database taking the most time.

The change that gave me the biggest performance boost was using a
StringBuilder object instead of a temp string.
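
For anyone wondering why the StringBuilder swap matters: appending to an immutable string can copy everything accumulated so far on each append, while a buffer grows in place and is turned into a string once at the end. Here is a rough sketch of the two approaches in Python (not .NET, just so it is easy to run). `io.StringIO` stands in for StringBuilder, and the function names and the `upper()` "processing" step are made up for illustration.

```python
import io


def build_with_concat(lines):
    """Temp-string approach: every += may copy the text built so far."""
    out = ""
    for line in lines:
        out += line.upper() + "\n"
    return out


def build_with_buffer(lines):
    """Buffer approach: append into a growable buffer, build the string once."""
    buf = io.StringIO()
    for line in lines:
        buf.write(line.upper())
        buf.write("\n")
    return buf.getvalue()


sample = ["alpha", "beta", "gamma"]
# Both approaches produce the same text; only the cost of building it differs.
assert build_with_concat(sample) == build_with_buffer(sample)
```

How big the difference is depends on the runtime and the input size, so it is worth timing both on real data before assuming a gain.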

HTH

Michel

Guest · Sep 20, 2007

First, you have to devise a way for the multiple threads to know which lines
to process. If you plan to delete the processed lines somehow, you would
have to open the file with read/write lock invoked so that only one thread
can get to it at a time. If you do it this way, then you obviously must
protect your file open with Try/Catch b/c two or more threads can be trying
to open the file concurrently and only one can lock it at one time.

If you are not going to delete from the file, then you still have to devise
a way for each thread to know what to process. You could have a "global"
counter/pointer, locked with a mutex while updating from a thread, that told
you where the next thread was to start. However, if one or more threads
abort for some reason, then the counter/pointer would not be updated to
reflect the unprocessed records.

If you simply want to have multiple threads reading the file concurrently,
keeping up with whether a record has been processed by another thread, they
can do that by opening the file with read/share access options.
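
The "global counter locked with a mutex" idea above could look roughly like this. It is a sketch in Python rather than .NET, and the class name, function names, and the tiny chunk size are all assumptions for the demo. Each thread opens its own read-only handle (the shared-read case) and asks the counter which chunk to take next.

```python
import os
import tempfile
import threading

CHUNK_SIZE = 16  # bytes per chunk; tiny only so the demo exercises many chunks


class ChunkCounter:
    """Shared 'next chunk' pointer, guarded by a lock (the mutex)."""

    def __init__(self, total_chunks):
        self._next = 0
        self._total = total_chunks
        self._lock = threading.Lock()

    def claim(self):
        """Return the next unclaimed chunk index, or None when none are left."""
        with self._lock:
            if self._next >= self._total:
                return None
            index = self._next
            self._next += 1
            return index


def worker(path, counter, results):
    # Each thread has its own read-only handle, so seeks do not interfere.
    with open(path, "rb") as f:
        while True:
            index = counter.claim()
            if index is None:
                break
            f.seek(index * CHUNK_SIZE)
            results[index] = f.read(CHUNK_SIZE)


def read_in_chunks(path, thread_count=4):
    size = os.path.getsize(path)
    total = (size + CHUNK_SIZE - 1) // CHUNK_SIZE  # round up
    results = [None] * total
    counter = ChunkCounter(total)
    threads = [threading.Thread(target=worker, args=(path, counter, results))
               for _ in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return b"".join(results)


def demo():
    """Write a small file, read it back with several threads, compare."""
    data = b"".join(b"record %03d\n" % i for i in range(50))
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(data)
    try:
        return read_in_chunks(tmp.name) == data
    finally:
        os.remove(tmp.name)
```

Two caveats. The chunks here are byte ranges, so a chunk can end in the middle of a line, and real code would need to handle line boundaries. And the abort problem described above shows up directly: if a thread dies after `claim()`, its slot in `results` stays `None` and nothing reprocesses it.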

HTH

Trevor Benedict · Sep 20, 2007

Why not read the file using a single thread and let other thread(s)
process the chunks? That should be easier for you to deal with.
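
A minimal sketch of that single-reader layout, in Python rather than .NET so it is easy to try. The function names, the word-count "processing", and the queue size are all placeholders. One reader groups lines into chunks and puts them on a bounded queue; worker threads take chunks off the queue and process them.

```python
import queue
import threading


def reader(lines, work_queue, chunk_size, worker_count):
    """Single reader: group lines into chunks and enqueue them."""
    chunk = []
    for line in lines:
        chunk.append(line)
        if len(chunk) == chunk_size:
            work_queue.put(chunk)
            chunk = []
    if chunk:
        work_queue.put(chunk)  # final partial chunk
    for _ in range(worker_count):
        work_queue.put(None)  # one stop marker per worker


def processor(work_queue, totals, lock):
    """Worker: take chunks until a stop marker arrives."""
    while True:
        chunk = work_queue.get()
        if chunk is None:
            break
        count = sum(len(line.split()) for line in chunk)  # placeholder work
        with lock:
            totals["words"] += count


def count_words(lines, worker_count=2, chunk_size=100):
    # Bounded queue: the reader blocks instead of loading the whole file
    # into memory when the workers fall behind.
    work_queue = queue.Queue(maxsize=8)
    totals = {"words": 0}
    lock = threading.Lock()
    workers = [threading.Thread(target=processor,
                                args=(work_queue, totals, lock))
               for _ in range(worker_count)]
    for w in workers:
        w.start()
    reader(lines, work_queue, chunk_size, worker_count)  # runs in this thread
    for w in workers:
        w.join()
    return totals["words"]
```

`lines` can be any iterable, including an open text file, e.g. `count_words(open("big.txt"))` with a hypothetical file name. Note that in standard CPython, threads doing pure-Python CPU work do not run in parallel, so this shows the structure rather than a guaranteed speedup.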

Regards,

Trevor Benedict
MCSD

WATYF · Sep 20, 2007

I need the reading to be multi-threaded the most. Reading from the
file is what takes the longest. Actually processing the chunks of the
file takes much less time than loading it into memory.

I may have figured it out, though... although I'm not sure yet if it's
really doing what I expect it to be doing.

I used a combination of TextReaders/TextWriters that were created
using TextReader.Synchronized and also synchronized arraylists which
were created using ArrayList.Synchronized. One of the FileOptions for
the Streams is FileOptions.Asynchronous, so I used that when opening
the files for reading/writing.
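
For readers unfamiliar with the synchronized wrappers: the general idea is a lock around a shared reader and a lock around a shared list, so several threads can use them without corrupting state. The sketch below is a Python analogue of that idea, not the actual code from this post; the class and function names are invented, and reading in batches of lines is an assumption.

```python
import io
import threading


class SynchronizedReader:
    """Wraps a text stream so that only one thread reads at a time."""

    def __init__(self, stream):
        self._stream = stream
        self._lock = threading.Lock()

    def read_lines(self, count):
        """Atomically read up to `count` lines; returns [] at end of input."""
        with self._lock:
            lines = []
            for _ in range(count):
                line = self._stream.readline()
                if not line:
                    break
                lines.append(line)
            return lines


def drain(reader, sink, sink_lock, batch):
    """Worker: pull batches from the shared reader into the shared list."""
    while True:
        lines = reader.read_lines(batch)
        if not lines:
            break
        with sink_lock:  # the list is shared too, so guard it as well
            sink.extend(lines)


def read_all(stream, thread_count=3, batch=5):
    reader = SynchronizedReader(stream)
    sink = []
    sink_lock = threading.Lock()
    threads = [threading.Thread(target=drain,
                                args=(reader, sink, sink_lock, batch))
               for _ in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sink
```

Every line arrives exactly once, but the order of batches in `sink` depends on thread scheduling. Also note that the lock serializes the reads themselves: threads take turns reading, and only the work done outside the lock can overlap.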

WATYF

eBob.com · Sep 21, 2007

If the huge text file is on one hard drive and if the CPU processing is
minimum (as you have said in a separate post) I don't see how multiple
threads will help. Multiple threads are not going to make the hard drive
spin any faster.

Bob

C-Services Holland b.v. · Sep 21, 2007

WATYF wrote:

I need the reading to be multi-threaded the most. Reading from the
file is what takes the longest. Actually processing the chunks of the
file takes much less time than the loading it into memory.

You're basically saying that the harddisk is the bottleneck. It's
supplying the data as fast as it can. Creating multiple threads reading
that file would only make it slower because you'd be making the head of
the drive jump all over the place to satisfy all the requests. Moving
that head takes time.

The only time you can win back is probably the processing time, so you
could hand that off to another thread so the reading thread can keep
reading and doesn't have to stop to process the data.

WATYF · Sep 21, 2007

eBob.com said:
If the huge text file is on one hard drive and if the CPU processing is
minimum (as you have said in a separate post) I don't see how multiple
threads will help. Multiple threads are not going to make the hard drive
spin any faster.

Bob

You know... what you're saying makes perfect sense, but apparently, it
doesn't work that way for whatever reason...maybe someone smarter than
me can explain it.

Anyway... I figured out how to do it. I did a write-up on it in case
anyone else runs into the same scenario.

http://www.musicalnerdery.com/net-p...file-sequentially-using-multiple-threads.html

Using multiple threads on a ~2GB file increased my performance
significantly. The total processing time went from 4.1 minutes to 3.0
minutes.

WATYF

WATYF · Sep 21, 2007

Michel Posseth [MCP] said:
Yes, it is possible. You can open a file in shared mode, so it should be
possible to read it from multiple apps or threads.

However, are you sure that file access is your bottleneck?

I once had the task of writing a data conversion program for a 7+ GB file.
At the start it took hours to complete; in the end it took only a few
minutes, with submitting the data to the database taking the most time.

The change that gave me the biggest performance boost was using a
StringBuilder object instead of a temp string.

HTH

Michel

I am sure that disk I/O was the main issue, but thanks for reminding
me about StringBuilders... they're next on my list to investigate for
ways to speed this thing up.

WATYF

eBob.com · Sep 21, 2007

Thanks for sharing your experience.

A while ago there was a performance problem known (to me at least) as
"missing a revolution". This would happen when you read and processed
record 1; then you go to read record 2; but by then record 2 would have
already passed the read head and the program then has to wait almost a whole
revolution of the disk to read the next record. BUT today with hard drives
having a cache I don't see how this would explain what you are seeing. I
hope someone who knows more about hard drive I/O than we do will explain
this to us.

Bob

Trevor Benedict · Sep 21, 2007

This is an interesting read:
http://forums.storagereview.net/index.php?showtopic=3200

Regards,

Trevor Benedict

Guest · Sep 22, 2007

Unless you are programming multiple processors, I find it hard to understand
how threads will help you since the critical path of your application seems
to be to completely process the file. I don't think threads run at the same
time, i.e., when one thread is executing, others are waiting for processor
time. It used to be called time sharing on the old IBM mainframes.

WATYF · Sep 22, 2007

eBob.com said:
Thanks for sharing your experience.

A while ago there was a performance problem known (to me at least) as
"missing a revolution". This would happen when you read and processed
record 1; then you go to read record 2; but by then record 2 would have
already passed the read head and the program then has to wait almost a whole
revolution of the disk to read the next record. BUT today with hard drives
having a cache I don't see how this would explain what you are seeing. I
hope someone who knows more about hard drive I/O than we do will explain
this to us.

Bob

I think I figured it out. I put an updated explanation at the very end
of the article.

http://www.musicalnerdery.com/net-p...file-sequentially-using-multiple-threads.html

WATYF

WATYF · Sep 22, 2007

Guest said:
Unless you are programming multiple processors, I find it hard to understand
how threads will help you since the critical path of your application seems
to be to completely process the file. I don't think threads run at the same
time, i.e., when one thread is executing, others are waiting for processor
time. It used to be called time sharing on the old IBM mainframes.

Yes, I'm programming for multiple processors. There is logic in the
code that only creates as many threads as there are processors on the
machine. So a dual core would use two threads and a quad core would
use four, etc. You're essentially correct, though... multiple threads
on a single CPU don't do any good in this case (so the code is set up
to only create one thread in that instance).
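
In case it helps, the "one thread per processor" logic can be sketched like this in Python (the real code is .NET and is not shown in this thread). `os.cpu_count()` is the standard call for the processor count; the range-splitting helper is only an assumed illustration of how work might then be divided among that many threads.

```python
import os


def pick_thread_count():
    """One thread per logical processor, and never fewer than one."""
    return max(1, os.cpu_count() or 1)  # cpu_count() can return None


def split_ranges(total_size, parts):
    """Split [0, total_size) into `parts` contiguous (start, end) ranges."""
    base, extra = divmod(total_size, parts)
    ranges = []
    start = 0
    for i in range(parts):
        # The first `extra` ranges get one additional unit each.
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges
```

On a single-processor machine `pick_thread_count()` gives 1, and `split_ranges(size, 1)` is just the whole file, which matches the single-thread fallback described above.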

WATYF

Cor Ligthert[MVP] · Sep 22, 2007

Dennis,

Guest said:
It used to be called time sharing on the old IBM mainframes.

But those had multiple ports for disk access, or whatever they called it
in those days. IBM tried that on their PS systems with what I think was
called multi channel, but as far as I know no disk manufacturer ever made
a disk for that channel. In my view it was one of the failures of the PS
system.

Cor

Cor Ligthert[MVP] · Sep 22, 2007

Michel,

Rinze has written almost the same thing, but I cannot resist.

My answer to this is always: "Do you want to hear the rumble of your disk?"

:-)

For the rest, see my answer to Dennis.

Cor

eBob.com · Sep 22, 2007

WATYF said:
I think I figured it out. I put an updated explanation at the very end
of the article.

http://www.musicalnerdery.com/net-p...file-sequentially-using-multiple-threads.html

WATYF

Sounds plausible to me. Thanks
