Count Lines in (Huge) Text Files

NvrBst

What's the best way to count the lines? I'm using the following code
at the moment:


public long GetNumberOfLines(string fileName) {
    int buffSize = 65536;
    int streamSize = 65536;

    long numOfLines = 0;
    byte[] bArr = new byte[buffSize];
    using(FileStream br = new FileStream(fileName, FileMode.Open,
                                         FileAccess.Read, FileShare.None, streamSize,
                                         FileOptions.RandomAccess))
        for(int i = br.Read(bArr, 0, buffSize); i > 0; i = br.Read(bArr, 0, buffSize))
            if(i == buffSize)
                foreach(byte A in bArr) { if(A == (byte)'\n') numOfLines++; }
            else
                for(int ii = 0; ii < i; ii++)
                    if(bArr[ii] == (byte)'\n') numOfLines++;
    return numOfLines;
}



I'm able to count files under 300MB (100MB in about 2 or 3 seconds,
300MB in about 20 seconds). My problem is that I need to count lines
in really big files (1GB to 2GB). Are there any suggestions to make
that possible? Or is the above basically as good as it gets?

--Notes--
1. I've tried a lot of combinations of buffSize/streamSize, and
64KB/64KB seems to work fastest for files above 100MB. Making them
higher doesn't seem to increase speed much; streamSize can be ~8KB
without noticeable changes, though it doesn't hurt to keep them the
same.

2. I've tried other classes/methods like StreamReader, LINQ "Count" on
byte[], ReadByte(), etc.; they are all usually a lot slower than the
above.

3. foreach seems to work a lot faster than the "for(...)" statement,
which is why I do both. Is there a cleaner way to do this?

4. FileOptions.RandomAccess seems to work better than
FileOptions.SequentialScan for me, especially if I call
"CountLines()" twice in a row; e.g. for a 300MB file, 20 seconds for
the first call and ~5 seconds for the second. Is this normal? Or
should SequentialScan be better for the above? Is disposing what's
messing that up?

The first scan is what I'm mostly trying to achieve, since I can play
around with the caching options once I'm able to count the lines in a
decent amount of time.

Thanks
 
Pavel Minaev

What's the best way to count the lines? I'm using the following code
at the moment:

public long GetNumberOfLines(string fileName) {
    int buffSize = 65536;
    int streamSize = 65536;

    long numOfLines = 0;
    byte[] bArr = new byte[buffSize];
    using(FileStream br = new FileStream(fileName, FileMode.Open,
                                         FileAccess.Read, FileShare.None, streamSize,
                                         FileOptions.RandomAccess))
        for(int i = br.Read(bArr, 0, buffSize); i > 0; i = br.Read(bArr, 0, buffSize))
            if(i == buffSize)
                foreach(byte A in bArr) { if(A == (byte)'\n') numOfLines++; }
            else
                for(int ii = 0; ii < i; ii++)
                    if(bArr[ii] == (byte)'\n') numOfLines++;
    return numOfLines;
}

Unless your files are restricted to ASCII or Latin-1, the above is
wrong. For some multibyte encodings, '\n' can occur as part of
another character, and shouldn't be treated as a newline in that
case.
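
As an illustration (not from the original posts), a rough BOM sniff like the
following could decide whether the raw byte scan is safe. UTF-8 never uses 0x0A
inside a multibyte sequence, so the byte scan is fine for it, but UTF-16/UTF-32
are not safe and would need a decoding reader instead:

// Rough sketch, assumes: using System.IO;
static bool IsSafeForByteScan(string fileName) {
    byte[] bom = new byte[4];
    int n;
    using (FileStream fs = File.OpenRead(fileName))
        n = fs.Read(bom, 0, 4);
    if (n >= 2 && bom[0] == 0xFF && bom[1] == 0xFE) return false; // UTF-16 LE (or UTF-32 LE)
    if (n >= 2 && bom[0] == 0xFE && bom[1] == 0xFF) return false; // UTF-16 BE
    if (n >= 4 && bom[0] == 0x00 && bom[1] == 0x00 && bom[2] == 0xFE && bom[3] == 0xFF)
        return false;                                             // UTF-32 BE
    return true;   // no BOM or UTF-8 BOM: scanning raw bytes for '\n' is safe
}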
2. I've tried other classes/methods like StreamReader, LINQ "Count" on
byte[], ReadByte(), etc.; they are all usually a lot slower than the
above.

StreamReader is probably slower because it opens files as UTF-8 by
default, so it has to do the conversions. Even if it never actually
encounters a multibyte sequence, the checks are still made for every
char read.

LINQ Count is slower because it uses IEnumerable to access the
array's elements. This means two virtual method calls for every
byte read.

ReadByte() is slower because you forego caching then, but I wonder
what it would look like if you wrapped your FileStream in a
BufferedStream before using it.
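
For what it's worth, a minimal sketch of that BufferedStream idea (the method
name and buffer size are just assumptions, not something from the thread) might
look like this:

// Sketch only, assumes: using System.IO;
public long CountLinesReadByte(string fileName) {
    long numOfLines = 0;
    using (FileStream fs = new FileStream(fileName, FileMode.Open, FileAccess.Read))
    using (BufferedStream bs = new BufferedStream(fs, 65536)) {
        int b;
        while ((b = bs.ReadByte()) != -1)   // -1 signals end of stream
            if (b == '\n') numOfLines++;
    }
    return numOfLines;
}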
3. foreach seems to work a lot faster than the "for(...)" statement,
which is why I do both. Is there a cleaner way to do this?

Yes - try using the generic three-argument static method
Array.IndexOf<T>() in a loop.
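
As an illustration (not from Pavel's post), the inner scanning loop could look
roughly like the following, using the four-argument overload so that a short
final read only searches the bytes actually returned; bArr is the read buffer
and bytesRead stands for the return value of the last FileStream.Read call:

// Limiting the search to bytesRead keeps stale bytes from a previous read out of the count.
for (int pos = Array.IndexOf(bArr, (byte)'\n', 0, bytesRead);
     pos != -1;
     pos = Array.IndexOf(bArr, (byte)'\n', pos + 1, bytesRead - pos - 1))
    numOfLines++;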
 
Hilton

What about:

int count = 0;
using (StreamReader sr = new StreamReader(filename))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        count++;
    }
}

or if that doesn't help, then use your code but instead of using foreach,
use IndexOf. Let us know what you find.

Hilton
 
NvrBst

Thank you all for your suggestions
On what kind of hard drive are you testing?
WDC WD800JD-60LSA5 (http://www.wdc.com/en/products/productspecs.asp?
driveid=83). It's a SATA HD, and the site says 150MB/s transfer. My HD
hasn't been defragmented in a while, so that might be why initial
loading of the large files is so poor.

I made a mistake about 20sec vs. 3sec. With re-testing, the very
initial load is about 60 seconds on a 270MB file, and ~0.5 seconds
after that. I was looking at ticks initially and misread them by
about a factor of 10.

I should have posted some numbers instead of saying "a lot". What I
noticed was the following: averaging 10 results, calling CountLines()
twice right after each other, calling the EXE 10 times. Times are in
seconds.

--FileOptions.RandomAccess-- ForEach
1st Min(0.5) Max(0.656) Avg(0.656)
2nd Min(0.328) Max(0.406) Avg(0.3592)

--FileOptions.RandomAccess-- For Loop (11-25% slower than ForEach)
1st Min(0.609) Max(0.625) Avg(0.619)
2nd Min(0.453) Max(0.484) Avg(0.467)

--FileOptions.RandomAccess-- For /w Array.IndexOf (20-25% faster than
ForEach)
1st Min(0.406) Max(0.422) Avg(0.414)
2nd Min(0.266) Max(0.312) Avg(0.284)

--FileOptions.Sequential-- For /w Array.IndexOf
This turned out weird for me. Basically the 1st and 2nd are always
about the same, but sometimes it does ~4s/~4s and sometimes
~0.5s/~0.5s. The very initial load shows up as ~60s/~60s vs. the
~60s/0.284s that RandomAccess does. Again, I didn't look into this
much; maybe it's because I'm disposing the file handle after I'm done
with it. When it is low (under 1s), its average is about 5% higher
than the RandomAccess version.

I also noticed that if I changed my two for loops to a "while((i = XX) >
0)" type construct, there was a ~10% drop in speed, which seemed
strange to me (I thought they'd be the same).

Yes - try using the generic three-argument static method Array.IndexOf<T>() in a loop.
Ahh, this implementation was faster than all the ones I've tried
before, hands down. Thanks.
How accurate do you need your count to be?
What I was doing was playing around with the DataGridView / ListView
classes (in virtual mode), but I needed to set the VirtualListSize. I
was thinking counting the lines should be quick enough, and then I'd
have the exact amount needed. Since the (very first) initial load
time is too slow for 1GB-2GB files, I'm going to play around with
other approaches. Basically I was thinking one of these:

a) Have a BackgroundWorker counting the lines, and a timer that
updates the VirtualListSize every 5 or so seconds (see the sketch below).
b) A "Memory Mode" type thing where I base everything on the file size
and treat the Index as a percent (i.e. the Index is at 22% of the size,
so Seek to that position and build a few pages of the cache). When out
of range of the personal cache, rebuild on the current Index %.
c) Pure would be harder if I don't have the seek position of the line
I need to process for the DGV/LV. This would turn out to be similar
to b), except at the element level instead of the page level.


I'm leaning towards b) (maybe just for files above 50 or 100MB), but
a) might be okay too and work for all file sizes. I'll play around
with them; thanks for all your suggestions.
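
Purely as an illustration of approach a) (not the OP's code; countWorker,
listView1 and fileName are placeholder names): a BackgroundWorker counts
newlines off the UI thread and reports progress, and the progress handler
grows the ListView's VirtualListSize as the count comes in.

// Sketch only, assumes: using System.ComponentModel; using System.IO; using System.Windows.Forms;
BackgroundWorker countWorker = new BackgroundWorker { WorkerReportsProgress = true };

countWorker.DoWork += delegate(object s, DoWorkEventArgs e) {
    BackgroundWorker bw = (BackgroundWorker)s;
    long lines = 0;
    byte[] buf = new byte[65536];
    using (FileStream fs = new FileStream((string)e.Argument, FileMode.Open,
                                          FileAccess.Read, FileShare.Read))
        for (int n = fs.Read(buf, 0, buf.Length); n > 0; n = fs.Read(buf, 0, buf.Length)) {
            for (int ii = 0; ii < n; ii++)
                if (buf[ii] == (byte)'\n') lines++;
            bw.ReportProgress(0, lines);            // real code would throttle this
        }
    e.Result = lines;
};

countWorker.ProgressChanged += delegate(object s, ProgressChangedEventArgs e) {
    listView1.VirtualListSize = (int)(long)e.UserState;   // runs on the UI thread
};

countWorker.RunWorkerAsync(fileName);

Reporting on every buffer is overkill; as described above, updating only every
few seconds (or every N buffers) would be closer to what the OP intends.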

In the end, the best line-count implementation I got was the
following, in case it's useful to anyone, or in case someone can
suggest an improvement.


readonly int buffSize = 65536, streamSize = 65536;

public long GetNumberOfLines(string fileName) {
    long numOfLines = 0;
    byte[] bArr = new byte[buffSize];
    using(FileStream fs = new FileStream(fileName, FileMode.Open,
                                         FileAccess.Read, FileShare.Read, streamSize,
                                         FileOptions.RandomAccess))
        for(int i = fs.Read(bArr, 0, buffSize); i > 0; i = fs.Read(bArr, 0, buffSize))
            // the "ii < i" check stops a short final read from counting stale '\n'
            // bytes left in the buffer by the previous read
            for(int ii = Array.IndexOf<byte>(bArr, (byte)'\n', 0); ii != -1 && ii < i;
                ii = Array.IndexOf<byte>(bArr, (byte)'\n', ii + 1))
                numOfLines++;
    return numOfLines;
}
 
NvrBst

What about:

int count = 0;
using (StreamReader sr = new StreamReader(filename))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        count++;
    }
}

or if that doesn't help, then use your code but instead of using foreach,
use IndexOf.  Let us know what you find.

Hilton

Oh, I tried StreamReader for the first post. Basically it's about the
same for the initial load (~1 min for a 270MB file), but the runs
after that are about 3x longer than the implementation above with
Array.IndexOf.
 
NvrBst

Thank you all for your suggestions
WDC WD800JD-60LSA5 (http://www.wdc.com/en/products/productspecs.asp?
driveid=83). It's a SATA HD, and the site says 150MB/s transfer. My HD
hasn't been defragmented in a while, so that might be why initial
loading of the large files is so poor.

Keep in mind that the transfer rate described in that sheet is the
interface, not the actual throughput of the drive. For example, from the
storagereview.com web site again, the big brother to that drive (WD3200JD)
has a _maximum_ transfer rate of about 66MB/s, and drops to as low as
40MB/s at the slowest part of the drive (with a constant rotational speed,
the linear speed of the disk media depends on whether data's being read
from the inner or outer parts of the drive platter).

The interface rate is useful when you're reading small amounts of data,  
when it's been prefetched by the drive into its internal buffer.  But when  
you start reading large files, you run into the actual physical limits of 
the drive, and you'll never see the published transfer rate.
I made a mistake about 20sec vs. 3sec. With re-testing, the very
initial load is about 60 seconds on a 270MB file, and ~0.5 seconds
after that. I was looking at ticks initially and misread them by
about a factor of 10.
I should have posted some numbers instead of saying "a lot". What I
noticed was the following: averaging 10 results, calling CountLines()
twice right after each other, calling the EXE 10 times. Times are in
seconds. [...]

I don't understand those numbers.  The first read should be a lot slower  
than the second, not just a little slower.  The numbers you posted suggest  
that the file you're reading is so small that the disk transfer limits  
don't wind up being an issue.  If that's true, you'll want to be very  
careful extrapolating those results to larger files.

If those are in fact numbers from very large files, then perhaps you could  
elaborate on why a) the times are so fast even for the first read, and b) 
why the second read is only a little faster than the first.

Seems odd to me.

I think it's great that Pavel's suggestion to use IndexOf() produced a  
significant improvement. But with your elaboration, I'm not sure you're  
measuring something that will turn out to be significant.

For one, the really interesting case is the slow one for a large file.  
For another, since you're trying to implement a virtual view in a control,  
you're going to need a quick way to go to specific lines, not just know  
how many lines there are.  I suspect that the code that scans input data  
for line-breaks just won't be the bottleneck.  For the first read of a  
file, the disk itself is going to slow you down, and if you're creating an  
index of lines as you read the file, then the work of managing that data  
structure is likely to swamp the effort of just scanning the bytes.

It seems to me that your implementation is fine, and it's great that it  
performs well.  But I'm wondering what's going to happen when you try to  
use this with some actual large file and the necessary work to avoid  
having to keep scanning the file looking for specific lines as the user  
scrolls through the control.

Pete

Sorry, I should have elaborated more. For the numbers, the initial
(very first) load was always about 60 seconds (50-80 seconds) for all
cases. After that, what was happening was a batch script that ran the
EXE 10 times to get the resulting numbers; it then gave the Min, Max,
and Average of those 10 runs. So the improvements in the numbers I
gave were all relative to the file's initial opening having been done
already. The "1st"/"2nd" relates to the program calling
"GetLineCount()" twice in a row, so the 1st call gave the "1st"
results and the 2nd call gave the "2nd" results (one EXE run).
RandomAccess seems to make the 2nd call a lot better, while others
(i.e. Sequential) keep the "1st" and "2nd" the same.

Since the initial load is taking so long, the small improvements
probably won't matter, but eventually I'm going to be processing the
log files (i.e. picking out certain lines, etc.). The logs are mostly
TCP/IP traffic, so I will be picking out (filtering) data from certain
IP addresses a lot. So this might help with that.

What I was doing was adding the Position of the newline to a
"List<uint>" (counting was just simpler for me to convey the problem),
and using that Position to get the item in the
"GetVirtualItem(Index#Requested)" call. With my 300MB log files, the
List only grows to about 3MB so far, but this could potentially be
bad if, say, there are only about 10 characters per line in a 1GB
file (I don't think I will run into this though; the lines are
usually very long in my files). I missed a word in "c)", which was
supposed to say "Pure Estimation"; that type of thing is what I was
talking about with "don't have the seek position".

Using the "Memory Type" approach, I wouldn't have to store seek
positions, which is why it'd be good for huge (1G+) files. Right now
in my implementation (using a backgroundworker + ProgressChange to
update VirtualListSize) seems to work okay. On the 300MB files it
takes about 1.5 mins to finish processing, but the list is populated
with data almost instantly, so I'm able to look at it and scroll while
data is still being added. I havn't done much testing with the real
1-2GB files yet though, and havn't attempt the "Memory Type" approach.
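
For what it's worth, a rough sketch of that "Memory Type" percent-seek idea
(illustrative only; the method name, the resync-on-the-next-newline step, and
the default StreamReader encoding are assumptions, not something from the
thread):

// Sketch only, assumes: using System.Collections.Generic; using System.IO;
// Map a scroll position (0.0-1.0) to a byte offset, skip forward to the next
// newline so we start on a line boundary, then read a small window of lines
// to cache for the virtual list.
public List<string> ReadWindowAtFraction(string fileName, double fraction, int linesWanted) {
    List<string> window = new List<string>();
    using (FileStream fs = new FileStream(fileName, FileMode.Open,
                                          FileAccess.Read, FileShare.Read)) {
        long start = (long)(fs.Length * fraction);
        fs.Seek(start, SeekOrigin.Begin);
        if (start > 0) {                            // resync: skip the partial line we landed in
            int b;
            while ((b = fs.ReadByte()) != -1 && b != '\n') { }
        }
        using (StreamReader sr = new StreamReader(fs)) {
            string line;
            while (window.Count < linesWanted && (line = sr.ReadLine()) != null)
                window.Add(line);
        }
    }
    return window;
}

The price is that line numbers are only approximate, which is exactly the
trade-off of treating the Index as a percent of the file size.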
 
