Problem with reading a large text file


haibhoang

I have a Windows Service that is trying to parse a large (> 1 GB) text
file. I keep getting an OutOfMemoryException. Here is the code that's
having the problem:

using (StreamReader streamReader = new StreamReader(stream, Encoding.ASCII))
{
    string line = "";
    DateTime currentDate = DateTime.Now.Date;
    while (streamReader.Peek() > -1)
    {
        line = streamReader.ReadLine();
    }
}

I read the documentation and realized that the ReadLine() method is not
very efficient. Is there another way I can do this?

Thanks,
Hai
 

Guest

According to the StreamReader.ReadLine documentation
(http://msdn.microsoft.com/library/d...lrfSystemIOStreamReaderClassReadLineTopic.asp),
an OutOfMemoryException is thrown when there is insufficient memory to
allocate a buffer for the returned string.

From the sounds of it, you are trying to have it read a line that is
simply too big to be read all at once.

One way around this would be to explicitly read smaller blocks at a time.
Given that you are using a StreamReader, take a look at the two overloads
of Read().

The first reads a single character at a time, while the second (the one that
takes arguments) reads a block from the stream of a specified size.

Whichever way you go, both are far safer than ReadLine(), or worse yet
ReadToEnd(), when it comes to reading extremely large sets of data all
at once.
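
For example, a quick sketch of both overloads (the buffer size here is just
an illustration):

using (StreamReader reader = new StreamReader(stream, Encoding.ASCII))
{
    // Overload 1: read a single character at a time; returns -1 at end of stream.
    int ch = reader.Read();

    // Overload 2: read a block of characters into a buffer and return
    // how many were actually read (0 at end of stream).
    char[] buffer = new char[4096];
    int count = reader.Read(buffer, 0, buffer.Length);
}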

Brendan
 

Guest

Every time you use "line = " in the loop, .NET allocates a new string
object in memory. At the same time, the previous content becomes eligible
for GC, but you never know when that will happen. With heavy processing
such as parsing, it might not happen until the end of the loop, so you run
out of memory. To avoid this, simply use a StringBuilder. That way you
allocate the memory only once, before the loop begins.
 

Willy Denoyette [MVP]

RayProg said:
Every time you use "line = " in the loop, .NET allocates a new string
object in memory. At the same time, the previous content becomes eligible
for GC, but you never know when that will happen. With heavy processing
such as parsing, it might not happen until the end of the loop, so you run
out of memory. To avoid this, simply use a StringBuilder. That way you
allocate the memory only once, before the loop begins.

No, this is not the reason for the OOM; Brendan's answer is right on
point. Also, your description does not reflect how the GC works: whenever
the GC reaches the gen0 threshold (sizes vary between 256 KB and a few MB),
the CLR will hijack the current thread and force a GC collection, and
nothing can stop this from happening. Don't forget that an application has
to enter the CLR to instantiate a new object; at that time the CLR inspects
the GC heap statistics and decides to start a collection when a "trigger
point" is met.

Willy.
 

haibhoang

I followed your instructions, but the process is very slow now.

using (Stream stream = System.IO.File.OpenRead(fileName))
{
    using (StreamReader streamReader = new StreamReader(stream, System.Text.Encoding.ASCII))
    {
        char[] buffer = new char[202];
        int read = 0;
        while (streamReader.Peek() > -1)
        {
            read = streamReader.Read(buffer, 0, 202);
        }
    }
}
 

Guest

If at all possible I would read more than 202 characters at a time.

I'm going to guess that each record you want to read from your file is 202
characters long. If we assume a 1 gigabyte file of those records
(1,073,741,824 characters/bytes), you have roughly 5,315,553 such records.
Reading a single record at a time requires 5.3 million separate accesses
to the disk.

On the other hand, if you increase the size of each read and then parse
that data out, you save yourself a huge amount of work. For example, let's
say you read 10 records at a time: you bring the required number of
separate disk accesses down to ~530 thousand, much better than 5.3 million.

Increase the read size by another 10-fold and you drop your disk accesses
down to ~53 thousand.
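
As a rough sketch of that idea (the 202-character record length comes from
your buffer; the records-per-read figure and the file path are only
illustrative assumptions):

string fileName = "records.txt";         // placeholder path
const int RecordLength = 202;            // assumed fixed record size
const int RecordsPerRead = 5000;         // illustrative; tune as needed

using (StreamReader reader = new StreamReader(fileName, Encoding.ASCII))
{
    char[] buffer = new char[RecordLength * RecordsPerRead];
    int read;
    // ReadBlock keeps reading until the buffer is full or the stream ends,
    // so whole records stay aligned inside the buffer.
    while ((read = reader.ReadBlock(buffer, 0, buffer.Length)) > 0)
    {
        for (int offset = 0; offset + RecordLength <= read; offset += RecordLength)
        {
            string record = new string(buffer, offset, RecordLength);
            // ... parse the record here ...
        }
    }
}

Each pass through the outer loop now covers up to 5,000 records with a
single call, and the inner loop walks them in memory.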

Depending on the amount of memory available, feel free to play around with
the amount of data you read each time. Granted, reading larger amounts
reduces the number of times you have to hit the disk, but it also increases
the memory requirements of your application and the possibility of other
slowdowns. Keep testing and tweaking until you get it right, or at least as
fast as is acceptable.

Just remember, disk access is one of the slowest forms of I/O you can do on
a computer.

Brendan
 

Bill Butler

Question:
Does the HUGE file have any carriage returns? (It sounds like it doesn't.)

haibhoang said:
I followed your instructions, but the process is very slow now.

I am confused.
Prior to this it sounded like it didn't work at all. Is that true?
If so, how can it be slower NOW when it didn't work before?

What exactly do you mean by "the process is very slow now"?
Parsing a file over a gig in size will never be fast.

using (Stream stream = System.IO.File.OpenRead(fileName))
{
    using (StreamReader streamReader = new StreamReader(stream, System.Text.Encoding.ASCII))
    {
        char[] buffer = new char[202];
        int read = 0;
        while (streamReader.Peek() > -1)
        {
            read = streamReader.Read(buffer, 0, 202);
        }
    }
}

You may be able to improve your performance by making the file reads more "Chunky".
Read 202000 bytes in one read instead of 1000 reads of 202 bytes.

Good luck,
Large file processing is fun
Bill
 

Jon Skeet [C# MVP]

I have a Windows Service that is trying to parse a large (> 1 GB) text
file. I keep getting an OutOfMemoryException. Here is the code that's
having the problem:

using (StreamReader streamReader = new StreamReader(stream, Encoding.ASCII))
{
    string line = "";
    DateTime currentDate = DateTime.Now.Date;
    while (streamReader.Peek() > -1)
    {
        line = streamReader.ReadLine();
    }
}

I read the documentation and realized that the ReadLine() method is not
very efficient. Is there another way I can do this?

A few points/questions:

1) Rather than calling Peek, the usual way of writing the above is:

while ((line = streamReader.ReadLine()) != null)
{
    // Do something with line
}

2) Given the above, you don't need to initialise line to "" to start
with.

3) What are you actually doing with the lines? If you're keeping them
in memory in an ArrayList or something, then yes, you'll run out of
memory. If you're just reading them and discarding them (as in your
code sample) you shouldn't have any problems.

4) What's the longest line in your text file? If it's enormous, that
could be the problem.
 

Gordon Smith (eMVP)

Bill said:
Good luck,
Large file processing is fun
Bill

Exactly. For arbitrarily large files, many systems use a background thread
that reads ahead into a buffer, and then synchronize consumption of that
buffer with a separate parsing thread. You can have as simple or as complex
a solution as you can imagine, depending on what exactly you're trying to
optimize (RAM usage, raw speed, etc.).
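
Just as a bare-bones sketch of that pattern (block size, queue depth, and
the parsing step are all placeholder assumptions):

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Threading;

class ReadAheadParser
{
    const int MaxQueuedBlocks = 8;       // how far the reader may get ahead
    const int BlockSize = 64 * 1024;     // characters per read

    readonly Queue<char[]> queue = new Queue<char[]>();
    bool finished;

    public void Run(string fileName)
    {
        Thread reader = new Thread(() => ReadBlocks(fileName));
        reader.Start();
        ParseBlocks();                   // consume on the current thread
        reader.Join();
    }

    void ReadBlocks(string fileName)
    {
        using (StreamReader sr = new StreamReader(fileName, Encoding.ASCII))
        {
            while (true)
            {
                char[] buffer = new char[BlockSize];
                int read = sr.Read(buffer, 0, buffer.Length);
                if (read <= 0) break;

                char[] block = new char[read];
                Array.Copy(buffer, block, read);

                lock (queue)
                {
                    while (queue.Count >= MaxQueuedBlocks)
                        Monitor.Wait(queue);   // back off if the parser falls behind
                    queue.Enqueue(block);
                    Monitor.PulseAll(queue);
                }
            }
        }
        lock (queue)
        {
            finished = true;
            Monitor.PulseAll(queue);
        }
    }

    void ParseBlocks()
    {
        while (true)
        {
            char[] block;
            lock (queue)
            {
                while (queue.Count == 0 && !finished)
                    Monitor.Wait(queue);
                if (queue.Count == 0) return;  // reader is done and queue is drained
                block = queue.Dequeue();
                Monitor.PulseAll(queue);       // wake the reader if it was waiting
            }
            // ... parse 'block' here ...
        }
    }
}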
 

Jon Skeet [C# MVP]

Just remember, disk access is one of the slowest forms of I/O you can do on
a computer.

Also note, however, that modern OSes do buffering. I've just written a
1 GB file to disk, and then read it using the previously posted code but
with various different buffer sizes.

I would *expect* that as the file is as big as my physical memory, OS
file caching itself won't come into play here - only OS buffering.

Here are the results:

Size: Time taken
100: 00:00:42.1406250
200: 00:00:41.8906250
500: 00:00:41.6406250
5000: 00:00:42
50000: 00:00:41.7500000

(Note that this is on a laptop, so the disk is pretty slow.)

In other words, changing the buffer size really doesn't help here.
(I've tried a few other things, and they don't help much either...)
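
For reference, here's a rough sketch of how that kind of comparison can be
timed (the file name and the list of buffer sizes are just placeholders):

using System;
using System.Diagnostics;
using System.IO;
using System.Text;

class BufferSizeTimer
{
    static void Main()
    {
        string fileName = "big.txt";                  // placeholder ~1 GB test file
        int[] sizes = { 100, 200, 500, 5000, 50000 }; // buffer sizes to compare

        foreach (int size in sizes)
        {
            Stopwatch sw = Stopwatch.StartNew();
            using (StreamReader reader = new StreamReader(fileName, Encoding.ASCII))
            {
                char[] buffer = new char[size];
                while (reader.Read(buffer, 0, buffer.Length) > 0)
                {
                    // just read and discard, as in the earlier code
                }
            }
            sw.Stop();
            Console.WriteLine("{0}: {1}", size, sw.Elapsed);
        }
    }
}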
 
