Reading text files using pointers?

G

Guest

Hi,

I'm trying to learn a bit about performance; I hope someone can help me out.

I have a text file with 8-bit characters in it. In order to improve performance, I'm using a BinaryReader instead of a StreamReader. I've made two versions of my method, one which uses typesafe code, and one which uses unsafe code with pointers. I've read in several places that direct pointer access will eliminate bounds-checking when accessing an array, and would like to see the effect myself. However, the typesafe code is as fast or faster than the unsafe code. The methods are practically identical, except for the access bit:

Shared code:
BinaryReader reader = ... // Open data reader.
int fileSize = ... // Get size of file.
int bufSize = 32768;
byte[] buf = null;
int bytesRead = 0;
int totalBytesRead = 0;
byte msgStartCode = (byte) '$';

Typesafe version is like:
do
{
buf = reader.ReadBytes(bufSize);
bytesRead = buf.Length;
totalBytesRead += bytesRead;

for (int bufIndex = 0; bufIndex < buf.Length; bufIndex++)
{
if (buf[bufIndex] == msgStartCode)
{
// Parse message.
}
} // for

} while (totalBytesRead < fileSize);

whereas the unsafe version is like:

do
{
buf = reader.ReadBytes(bufSize);
bytesRead = buf.Length;
totalBytesRead += bytesRead;

// Pin memory.
fixed (byte* bufPtrUnsigned = &buf[0])
{
sbyte* bufPtr = (sbyte*) bufPtrUnsigned; // Use to build message string.

for (int bufIndex = 0; bufIndex < buf.Length; bufIndex++)
{
if (buf[bufIndex] == msgStartCode)
{
// Parse message, using e.g. new String(bufPtr, bufIndex, msgLength).
}
} // for

} // fixed

} while (totalBytesRead < fileSize);

Shouldn't the last version be faster?

Thanks in advance for any help!
 
E

Einar Buffer

Darn typos...

Einar Høst said:
[snip]
// Pin memory.
fixed (byte* bufPtrUnsigned = &buf[0])
{
sbyte* bufPtr = (sbyte*) bufPtrUnsigned; // Use to build message string

for (int bufIndex = 0; bufIndex < buf.Length; bufIndex++)
{
if (buf[bufIndex] == msgStartCode)

Sorry, this should be:
if (bufPtr[bufIndex] == msgStartCode)
....cut & paste is a bad idea.
 
J

Jon Skeet [C# MVP]

Einar Høst said:
I have a text file with 8-bit characters in it.

What *exactly* do you mean by "8-bit characters"? Which encoding is the
file using?
In order to improve performance, I'm using a BinaryReader instead of a
StreamReader.

What makes you think BinaryReader will be faster than StreamReader?

In particular, are you sure you have a performance problem to start
with? How have you identified this to be the bottleneck?
I've made two versions of my method, one which uses typesafe code, and
one which uses unsafe code with pointers. I've read several places
that direct pointer access will eliminate bounds-checking when
accessing an array, and would like to see the effect myself. However,
the typesafe code is as fast or faster than the unsafe code. The
methods are practically identical, except for the access bit:

Bounds checking is often removed by the JIT compiler anyway.
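For instance, when the loop condition compares the index directly against the array's own Length property, the JIT can prove every access is in range and elide the per-element check. A minimal sketch of the idea (the method names and sample data are made up for illustration, not taken from the thread):

```csharp
using System;
using System.Text;

class BoundsCheckDemo
{
    // Pattern the JIT recognizes: the bound is data.Length itself,
    // so it can prove 0 <= i < data.Length and skip the bounds check.
    public static int CountDollars(byte[] data)
    {
        int count = 0;
        for (int i = 0; i < data.Length; i++)
        {
            if (data[i] == (byte) '$')
                count++;
        }
        return count;
    }

    // Hoisting the length into a local can defeat the optimization
    // on older JITs, since the bound is no longer visibly the array's Length.
    public static int CountDollarsHoisted(byte[] data)
    {
        int count = 0;
        int len = data.Length;
        for (int i = 0; i < len; i++)
        {
            if (data[i] == (byte) '$')
                count++;
        }
        return count;
    }

    static void Main()
    {
        byte[] sample = Encoding.ASCII.GetBytes("$GPGGA,...$GPRMC");
        Console.WriteLine(CountDollars(sample));        // prints 2
        Console.WriteLine(CountDollarsHoisted(sample)); // prints 2
    }
}
```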
 
G

Guest

Hi Jon

Thanks for your reply

You see, I'm a bit of a curious newbie. I don't really have any performance problems; I just want to try tweaking my routine in order to learn more about performance and .NET. It's a learning thing more than a necessity. And one of the things I learn from is good questions that make me think :)

Regarding the encoding of the file, I'm not really sure what it is, but it seems that the number of bytes in the file corresponds to the number of characters. Is there any check I can do to determine the encoding precisely?

I don't know if BinaryReader is any faster than StreamReader. The move from StreamReader to BinaryReader was basically done because I wanted to do without the ReadLine method. You see, I'm reading this file containing messages starting with a dollar sign. I just want one of six messages, and so I thought I'd create less garbage by avoiding creating strings for the unwanted messages. I was examining the app in CLR Profiler, and found I allocated 16MB to read a file of 3.5MB, of which 8.5MB was strings. Creating my own byte-buffer and searching for dollar signs has reduced the allocation amount to approximately 6.5MB, of which 3.5MB is bytes and 2.0MB is strings. The number of garbage collections went down from 44 to 4! I've doubled the speed of my routine. To try to squeeze out a little more, I decided to experiment with unsafe code, but so far, I haven't got much effect out of that. I guess the benefits are small compared to the other processing I'm doing in my routine...

If you have any further comments, I'd love to hear them!
 
J

Jon Skeet [C# MVP]

Einar Høst said:
You see, I'm a bit of a curious newbie. I don't really have any
performance problems, I just want to try tweaking my routine in order
to learn more about performance and .NET. Its a learning thing more
than a necessity. And one of the things I learn from is good questions
that make me think :)

While in general I applaud such sentiments (and I like tweaking with
things myself) I would recommend avoiding unsafe code until you
*really* need it. I haven't even *looked* at it myself, on the grounds
that I can't see myself needing it and if I don't actually know it,
I'll be less tempted to start using it where I don't really need it.
Regarding the encoding of the file, I'm not really sure what it is,
but it seems that the number of bytes in the file correspond to the
number of characters. Is there any check I can do to determine the
encoding precisely?

Not really - it could be any number of encodings. What's producing the
file in the first place?
I don't know if BinaryReader is any faster than StreamReader. The move
from StreamReader to BinaryReader was basically done because I wanted
to do without the ReadLine method.

Well, you can use StreamReader without using ReadLine. You can read a
character at a time, or a block of characters.
You see, I'm reading this file
containing messages starting with a dollar sign. I just want one of
six messages, and so I thought I'd create less garbage by avoiding to
create strings for the unwanted messages. I was examining the app in
CLR profiler, and found I allocated 16MB to read a file of 3.5MB, of
which 8.5MB was strings.

That sounds about right, yes, assuming the whole thing was being loaded
at a time.
Creating my own byte-buffer and searching for
dollar signs has reduced the allocation amount to approximately 6.5MB,
of which 3.5MB is bytes and 2.0MB is strings. The number of garbage
collections went down from 44 to 4! I've doubled the speed of my
routine. To try to squeeze out a little more, I decided to experiment
with unsafe code, but so far, I haven't got much effect out of that. I
guess the benefits are small compared to the other processing I'm
doing in my routine...

To find the dollars, you could read chunks in at a time (e.g. 16K
chars) into a fixed buffer, and search within that buffer. When you've
found the appropriate dollar, read the rest of that buffer and then all
the buffers after that (or whatever).
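That suggestion might be sketched like this; a rough, self-contained example where the in-memory stream and names are stand-ins I've made up, not code from the thread:

```csharp
using System;
using System.IO;
using System.Text;

class ChunkScanDemo
{
    // Read up to 16K chars at a time into one reused buffer and
    // count '$' message-start markers within each filled chunk.
    public static int CountDollarsChunked(Stream source)
    {
        char[] buf = new char[16384];
        int dollars = 0;
        using (StreamReader reader = new StreamReader(source))
        {
            int charsRead;
            while ((charsRead = reader.Read(buf, 0, buf.Length)) > 0)
            {
                // Only scan the portion of the buffer this read actually filled.
                for (int i = 0; i < charsRead; i++)
                {
                    if (buf[i] == '$')
                        dollars++;
                }
            }
        }
        return dollars;
    }

    static void Main()
    {
        // Stand-in for the real log file.
        byte[] data = Encoding.ASCII.GetBytes("junk$MSG1,data junk$MSG2");
        Console.WriteLine(CountDollarsChunked(new MemoryStream(data))); // prints 2
    }
}
```

The point is that the buffer is allocated once and reused, so scanning a large file creates almost no garbage beyond the strings you deliberately build for the messages you keep.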
 
E

Einar Buffer

Jon Skeet said:
While in general I applaud such sentiments (and I like tweaking with
things myself) I would recommend avoiding unsafe code until you
*really* need it. I haven't even *looked* at it myself, on the grounds
that I can't see myself needing it and if I don't actually know it,
I'll be less tempted to start using it where I don't really need it.

Indeed, I think it will be a while before I include it in any of my
professional work - if ever. As it turns out, the unsafe approach was
slightly faster (5-10%) than the typesafe one when I did no other
processing, just looked for dollars. However, once I started doing other
stuff - even with the exact same code - it all evened out. Perhaps some side
effect of having pinned memory for a prolonged time?
Not really - it could be any number of encodings. What's producing the
file in the first place?

I guess I could find out - it's a data logging program written by a
co-worker. It's reading data from the serial port and persisting it to file.
It's written in C++... I'd guess the guy who wrote it used some default
value in the win32 API if possible.
Well, you can use StreamReader without using ReadLine. You can read a
character at a time, or a block of characters.

Yeah, I guess you're right - still, C# characters are 16 bit, right? In
general, would there be any performance differences if the two classes are
used for the same task, I wonder?
To find the dollars, you could read chunks in at a time (e.g. 16K
chars) into a fixed buffer, and search within that buffer. When you've
found the appropriate dollar, read the rest of that buffer and then all
the buffers after that (or whatever).

Indeed, this is approximately what I do - I read 32K bytes, scan for
dollars, check the message type, skip some bytes if it's one of the five I
don't want, parse it otherwise. If I need some extra bytes to figure out the
message type or message content, I read the amount I need from the stream.

Thanks again!
 
J

Jon Skeet [C# MVP]

Einar Buffer said:
Indeed, I think it will be a while before I include it in any of my
professional work - if ever. As it turns out, the unsafe approach was
slightly faster (5-10%) than the typesafe one when I did no other
processing, just looked for dollars. However, once I started doing other
stuff - even with the exact same code - it all evened out. Perhaps some side
effect of having pinned memory for a prolonged time?

Probably more that the actual work was the bottleneck, not looking for
the dollars.
I guess I could find out - it's a data logging program written by a
co-worker. It's reading data from the serial port and persisting it to file.
It's written in C++... I'd guess the guy who wrote it used some default
value in the win32 API if possible.

It may well be Encoding.Default - the default ANSI encoding for the
platform. (That's not the default encoding for StreamReader though.)
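Concretely, you'd have to pass that encoding to the StreamReader constructor yourself, since its own default is UTF-8. A small sketch; the temp-file round trip is just to keep the example self-contained and isn't how the real logger works:

```csharp
using System;
using System.IO;
using System.Text;

class EncodingDemo
{
    static void Main()
    {
        string path = Path.GetTempFileName();
        // Simulate the C++ logger writing with the platform's default encoding.
        File.WriteAllText(path, "$MSG,data", Encoding.Default);

        // StreamReader defaults to UTF-8, so name the encoding explicitly
        // when the file was written with the platform ANSI code page.
        using (StreamReader reader = new StreamReader(path, Encoding.Default))
        {
            Console.WriteLine(reader.ReadToEnd()); // prints $MSG,data
        }
        File.Delete(path);
    }
}
```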
Yeah, I guess you're right - still, C# characters are 16 bit, right? In
general, would there be any performance differences if the two classes are
used for the same task, I wonder?

Probably not, but I wouldn't like to say for sure.
 
