How to read data structures from binary files


herobeat

Hi all,

I'm new to C# programming, though a long time ago I was fairly
competent in C and C++.

I have data structures in a known format in binary files. I need to
be able to read them, preferably as easily as possible into a struct.

For example, I have a struct that looks something like this:

public struct MyStructure
{
    public UInt32 MyNumber1;
    public UInt32 MyNumber2;
    public UInt32 MyNumber3;
    ...
}

In my code, I'd like to do something like this:

MyStructure s;
BinaryReader f =
    new BinaryReader(File.Open("myfile.bin", FileMode.Open));
s = f.ReadBytes(sizeof s); // This doesn't work!!!

Obviously, that last line is heinously wrong in managed code, but I
don't know how to do what it is trying to do. What is the best way to
take binary data out of a file and get it into a structure in C#?
Surely there's got to be something better than:

s.MyNumber1 = f.ReadUInt32();
s.MyNumber2 = f.ReadUInt32();
s.MyNumber3 = f.ReadUInt32();
....

On a side note, since I am still pretty new to structs as C# knows
them, is this even the entity I should be using? If not, what is?

Thanks for any advice!
 

Jon Skeet [C# MVP]

Obviously, that last line is heinously wrong in managed code, but I
don't know how to do what it is trying to do. What is the best way to
take binary data out of a file and get it into a structure in C#?
Surely there's got to be something better than:

s.MyNumber1 = f.ReadUInt32();
s.MyNumber2 = f.ReadUInt32();
s.MyNumber3 = f.ReadUInt32();
...

On a side note, since I am still pretty new to structs as C# knows
them, is this even the entity I should be using? If not, what is?

Thanks for any advice!

The more explicit code is much better in my view - it puts everything
directly under your control. There are probably ways of doing it in the
same way as you did in C, but it's very brittle IMO. If anything
changes (compiler struct packing etc) then it breaks. Just put the
serialization/deserialization code in the relevant type and it's nice
and easy to use.
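For instance, a minimal sketch of that suggestion (the ReadFrom name
is just illustrative):

using System.IO;

public struct MyStructure
{
    public uint MyNumber1;
    public uint MyNumber2;
    public uint MyNumber3;

    // The type knows how to deserialize itself from a stream.
    public static MyStructure ReadFrom(BinaryReader reader)
    {
        MyStructure s;
        s.MyNumber1 = reader.ReadUInt32();
        s.MyNumber2 = reader.ReadUInt32();
        s.MyNumber3 = reader.ReadUInt32();
        return s;
    }
}

Calling code then reduces to MyStructure s = MyStructure.ReadFrom(f);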
 

Göran Andersson

Hi all,

I'm new to C# programming, though a long time ago I was fairly
competent in C and C++.

I have data structures in a known format in binary files. I need to
be able to read them, preferably as easily as possible into a struct.

For example, I have a struct that looks something like this:

public struct MyStructure
{
    public UInt32 MyNumber1;
    public UInt32 MyNumber2;
    public UInt32 MyNumber3;
    ...
}

In my code, I'd like to do something like this:

MyStructure s;
BinaryReader f =
    new BinaryReader(File.Open("myfile.bin", FileMode.Open));
s = f.ReadBytes(sizeof s); // This doesn't work!!!

For that to even be possible, you would have to apply some attributes to
the structure, so that the members are guaranteed to be stored exactly
in the order you want, and without padding.

By default the members are stored with padding where needed, so that any
member larger than one byte starts on a machine-word boundary. The
members may also be rearranged to minimise the amount of padding used.

As this is done for efficiency, I suggest that you leave the member
ordering to the compiler. Even if the code for populating the data gets
a bit longer, the overall performance should be better.
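For reference, the attributes alluded to are presumably StructLayout
from System.Runtime.InteropServices; a sketch that forces declaration
order with no padding:

using System.Runtime.InteropServices;

// Sequential keeps the declared field order; Pack = 1 removes padding,
// so the in-memory layout matches the on-disk format.
[StructLayout(LayoutKind.Sequential, Pack = 1)]
public struct MyStructure
{
    public uint MyNumber1;
    public uint MyNumber2;
    public uint MyNumber3;
}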
Obviously, that last line is heinously wrong in managed code, but I
don't know how to do what it is trying to do. What is the best way to
take binary data out of a file and get it into a structure in C#?
Surely there's got to be something better than:

s.MyNumber1 = f.ReadUInt32();
s.MyNumber2 = f.ReadUInt32();
s.MyNumber3 = f.ReadUInt32();
...

Not really. That is the most straightforward and manageable way of doing
it. As Jon suggested, that code should go into the struct/class.
On a side note, since I am still pretty new to structs as C# knows
them, is this even the entity I should be using? If not, what is?

There is a distinct difference between structs in C++ and structs in C#.
A struct in C# is always a value type, which might cause you trouble if
you expect it to work as a reference type.

Generally in C# you should use a class instead of a struct, unless you
actually want a value type and know exactly how to implement it properly.
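A quick illustration of the difference being described (the types are
hypothetical):

struct PointStruct { public int X; }
class PointClass { public int X; }

class CopySemanticsDemo
{
    static void Main()
    {
        PointStruct s1 = new PointStruct();
        PointStruct s2 = s1;  // copies the whole value
        s2.X = 42;            // s1.X is still 0

        PointClass c1 = new PointClass();
        PointClass c2 = c1;   // copies only the reference
        c2.X = 42;            // c1.X is now 42 as well
    }
}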
 

herobeat

The more explicit code is much better in my view - it puts everything
directly under your control. There are probably ways of doing it in the
same way as you did in C, but it's very brittle IMO. If anything
changes (compiler struct packing etc) then it breaks. Just put the
serialization/deserialization code in the relevant type and it's nice
and easy to use.

As this is done for efficiency, I suggest that you leave the member
ordering to the compiler. Even if the code for populating the data gets
a bit longer, the overall performance should be better.

Thanks a ton; this is exactly the kind of "best practices" knowledge
that I'm trying to learn.
 

Eyal Safran

Thanks a ton; this is exactly the kind of "best practices" knowledge
that I'm trying to learn.

I just love it when people write an answer without supplying a code
snippet.

I hope this will help you:

using System;
using System.Runtime.InteropServices; // for GCHandle and Marshal

class Program
{
    static void Main(string[] args)
    {
        MyStructure structure;
        byte[] buffer = new byte[1024];
        // Fill the buffer with data you read
        // ...
        GCHandle handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
        try
        {
            // Now you need a pointer. Pick one of the two options:
            // Option A: if your struct does not start at the beginning of the buffer
            // Option B: if your struct starts at the beginning of the buffer
            IntPtr ptr;

            // Option A.
            int index = 5;
            ptr = Marshal.UnsafeAddrOfPinnedArrayElement(buffer, index);

            // Option B.
            ptr = handle.AddrOfPinnedObject();

            structure = (MyStructure)Marshal.PtrToStructure(ptr,
                typeof(MyStructure));
        }
        finally
        {
            handle.Free();
        }
    }
}

[StructLayout(LayoutKind.Sequential)]
public struct MyStructure
{
    public UInt32 MyNumber1;
    public UInt32 MyNumber2;
    public UInt32 MyNumber3;
}
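To fill in the elided read step above, a minimal sketch (the path and
the size check are illustrative, and it assumes a using System.IO;
directive alongside the usings above):

byte[] buffer = new byte[1024];
using (FileStream fs = File.OpenRead("myfile.bin"))
{
    int bytesRead = fs.Read(buffer, 0, buffer.Length);
    // Make sure at least Marshal.SizeOf(typeof(MyStructure)) bytes were
    // read before calling PtrToStructure on the pinned buffer.
}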

Best Regards,

Eyal Safran
 

Carl Daniel [VC++ MVP]

Göran Andersson said:
There is a distinct difference between structs in C++ and structs in
C#. A struct in C# is always a value type, which might cause you
trouble if you expect it to work as a reference type.

A pedantic point - structs (and classes) in C++ are always value types - C++
doesn't have the concept of a reference type (in the sense that it's used by
the CLR, that is). A C# struct is just about exactly the same as a C struct
or a C++ struct or class. It's the C# class that's distinctly different, as
reference types in the CLR cannot be allocated anywhere except on the
managed heap. C++/CLI blurs the line by working both ways.

-cd
 

herobeat

I just love it when people write an answer without supplying a code
snippet.

(snip...)

Best Regards,

Eyal Safran

Immensely! Unfortunately, the files I'm reading total around a
gigabyte, so all of that seeking and reading and individual assignment
is killing the performance. I'm hoping that this will make a big
difference.

Thanks again,
TonyV
 

Jon Skeet [C# MVP]

Immensely! Unfortunately, the files I'm reading total around a
gigabyte, so all of that seeking and reading and individual assignment
is killing the performance. I'm hoping that this will make a big
difference.

Why would you have to do a lot of seeking? With a BufferedStream
wrapping the FileStream, I wouldn't have thought you'd lose very much
performance in doing things the (IMO) robust way.
 

herobeat

Why would you have to do a lot of seeking? With a BufferedStream
wrapping the FileStream, I wouldn't have thought you'd lose very much
performance in doing things the (IMO) robust way.

The seeking is because some of the data later in the file indicates
which of the data earlier in the file I need to read. Don't look at
me, I didn't design the files. :-(

The reading is because, I dunno, it just seems intuitive to me that if
I could read 16 bytes directly into a block of four Int32s, for
example, it would be more efficient than if I read 4 bytes and
assigned it to one Int32, 4 more bytes and assigned it to another
Int32, 4 more bytes and assigned it to yet another Int32, and last but
not least, 4 last bytes to assign to yet another Int32. I could be
wrong, but right now, my application's performance is miserable. I'm
hoping there's some way to fix it in C# so that I don't have to go
back to C++, but I'll admit, right now, it's looking pretty grim.
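For what it's worth, a block read like that is possible in safe code
with Buffer.BlockCopy; a sketch (the file name is illustrative, and it
assumes the file's byte order matches the machine's):

using System;
using System.IO;

class BlockReadSketch
{
    static void Main()
    {
        using (FileStream fs = File.OpenRead("myfile.bin"))
        {
            byte[] raw = new byte[16];
            if (fs.Read(raw, 0, raw.Length) == raw.Length)
            {
                // Reinterpret 16 bytes as four Int32s in a single copy.
                int[] values = new int[4];
                Buffer.BlockCopy(raw, 0, values, 0, 16);
            }
        }
    }
}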
 

Jon Skeet [C# MVP]

The seeking is because some of the data later in the file indicates
which of the data earlier in the file I need to read. Don't look at
me, I didn't design the files. :-(

In that case it sounds like the seeking is inherent - i.e. you won't be
able to avoid it whatever language you use (unless you read the whole
thing into memory, of course).
The reading is because, I dunno, it just seems intuitive to me that if
I could read 16 bytes directly into a block of four Int32s, for
example, it would be more efficient than if I read 4 bytes and
assigned it to one Int32, 4 more bytes and assigned it to another
Int32, 4 more bytes and assigned it to yet another Int32, and last but
not least, 4 last bytes to assign to yet another Int32. I could be
wrong, but right now, my application's performance is miserable. I'm
hoping there's some way to fix it in C# so that I don't have to go
back to C++, but I'll admit, right now, it's looking pretty grim.

There's a trade-off between efficiency and elegance/flexibility. I tend
to advise more in favour of the latter than the former. See the next
post (which I've started writing already :) for more information.
 

G.Doten

The reading is because, I dunno, it just seems intuitive to me that if
I could read 16 bytes directly into a block of four Int32s, for
example, it would be more efficient than if I read 4 bytes and
assigned it to one Int32, 4 more bytes and assigned it to another
Int32, 4 more bytes and assigned it to yet another Int32, and last but
not least, 4 last bytes to assign to yet another Int32. I could be
wrong, but right now, my application's performance is miserable. I'm
hoping there's some way to fix it in C# so that I don't have to go
back to C++, but I'll admit, right now, it's looking pretty grim.

Well, that wouldn't be very type-safe, would it? You are asking to be
able to load up an int or a string or a struct with some random set of
bytes. That's not how C# works. It is possible to do this, but not
recommended at all. Stick to C/C++ for that. :)

How do you know the performance is miserable? How are you measuring the
performance, and what are you comparing the numbers to? I find it hard
to see how loading bytes from the file into type-safe variables would be
any slower than doing it with a non-type-safe method. It sounds like any
performance hit comes from the seeking you mention in another reply.
 

Jon Skeet [C# MVP]

How do you know the performance is miserable? How are you measuring the
performance, and what are you comparing the numbers to? I find it hard
to see how loading bytes from the file into type-safe variables would be
any slower than doing it with a non-type-safe method. It sounds like any
performance hit comes from the seeking you mention in another reply.

Without a bit of tuning (for buffering, for instance) I would imagine
the performance would suffer quite a bit. Likewise loading the bytes
from the file and calling BitConverter.ToUInt32 *is* likely to be a bit
slower than just loading them straight into memory. For one thing, the
memory needs to be copied, once per uint involved. Then each call to
BitConverter needs to perform bounds checking - again, cheap but not
free.

I believe the actual conversion itself is cheap, but there's "stuff"
around it. In most cases, the elegance of this solution is likely to
outweigh the performance costs, but just *sometimes* it won't.
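As a sketch of the BitConverter approach being discussed (the buffer
size and file name are illustrative):

using System;
using System.IO;

class BitConverterSketch
{
    static void Main()
    {
        using (FileStream stream = File.OpenRead("test.dat"))
        {
            byte[] buffer = new byte[4096];
            int read;
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                // One copy into the buffer, then one bounds-checked
                // conversion per uint - the "stuff" around the conversion.
                for (int offset = 0; offset + 4 <= read; offset += 4)
                {
                    uint value = BitConverter.ToUInt32(buffer, offset);
                }
            }
        }
    }
}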
 

G.Doten

Jon said:
Without a bit of tuning (for buffering, for instance) I would imagine
the performance would suffer quite a bit. Likewise loading the bytes
from the file and calling BitConverter.ToUInt32 *is* likely to be a bit
slower than just loading them straight into memory. For one thing, the
memory needs to be copied, once per uint involved. Then each call to
BitConverter needs to perform bounds checking - again, cheap but not
free.

I believe the actual conversion itself is cheap, but there's "stuff"
around it. In most cases, the elegance of this solution is likely to
outweigh the performance costs, but just *sometimes* it won't.

I would hazard a guess that it would not be possible to show any real
difference between reading the binary data "straight into" a struct as
opposed to using various conversion methods. But you're right, there is
a difference; I'll bet it can be ignored, though. Because... my guess is
that any perceived slowness is 99.9999% due to seeking within the file.

But more to the point: does the OP have any numbers to compare this
conversion/seeking process against, or is the claimed slowness simply a
subjective perception? That would interest me.
 

Bill Butler

<snip>

Have you profiled to see where the performance bottleneck actually is?

Are you attempting to load the whole file into memory at the same time
as opposed to sequential processing?
If so you might have paging issues.

Are you actually looking at every field in every object, or just a small
subset of the data?
If you are only looking at a subset of the data you might get by with
some lazy evaluation.

Before you rewrite your process for low level efficiencies, make sure
that you know exactly where the bottleneck is.

Bill
 

Jon Skeet [C# MVP]

G.Doten said:
I would hazard a guess that it would not be possible to show any real
difference between reading the binary data "straight into" a struct as
opposed to using various conversion methods.

I've done a few tests, without any seeking involved. I have a file
(Orcas beta 2 DVD, in fact!) which is over 3GB long.

Test 1: Use BinaryReader over a FileStream directly to read ~750
million uints, one at a time.

Code:
using System;
using System.IO;

class Test4BytesAtATime
{
    const int Iterations = 750*1000*1024;

    static void Main(string[] args)
    {
        DateTime start = DateTime.Now;
        using (FileStream input = File.OpenRead("test.dat"))
        using (BinaryReader reader = new BinaryReader(input))
        {
            for (int i = 0; i < Iterations; i++)
            {
                reader.ReadUInt32();
            }
        }
        DateTime end = DateTime.Now;
        Console.WriteLine((end - start).TotalSeconds);
    }
}



Test 2: As test 1, but using a BufferedStream with a 4K buffer
Code:
using System;
using System.IO;

class Test4BytesAtATimeBuffered
{
    const int Iterations = 750*1000*1024;

    static void Main(string[] args)
    {
        DateTime start = DateTime.Now;
        using (FileStream input = File.OpenRead("test.dat"))
        using (BufferedStream buffer = new BufferedStream(input))
        using (BinaryReader reader = new BinaryReader(buffer))
        {
            for (int i = 0; i < Iterations; i++)
            {
                reader.ReadUInt32();
            }
        }
        DateTime end = DateTime.Now;
        Console.WriteLine((end - start).TotalSeconds);
    }
}


Test 3: Use BinaryReader to read the same ~750M*4 bytes, 4K at a time,
effectively discarding the result each time. The reason for using
BinaryReader rather than FileStream is to avoid having to manually loop
round to read the right amount each time.
Code:
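// (The listing was missing from the archived post; this is a plausible
// reconstruction matching the description above: read 4K chunks with
// ReadBytes and discard them.)
using System;
using System.IO;

class Test4KAtATime
{
    const int Iterations = 750*1000*1024;

    static void Main(string[] args)
    {
        DateTime start = DateTime.Now;
        using (FileStream input = File.OpenRead("test.dat"))
        using (BinaryReader reader = new BinaryReader(input))
        {
            // 4096 bytes per read = 1024 uints, so Iterations/1024 reads
            for (int i = 0; i < Iterations / 1024; i++)
            {
                reader.ReadBytes(4096);
            }
        }
        DateTime end = DateTime.Now;
        Console.WriteLine((end - start).TotalSeconds);
    }
}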


All files were compiled with csc options "/o+ /debug-". Each test was
run 3 times. My laptop has 2GB of memory, so I don't expect the file
cache to be relevant between runs. Times for the tests:

1) 88/86/90
2) 94/85/91
3) 107/66/66/95 (4 runs to investigate the inconsistency)

The CPU is reasonably idle running test 3, but fairly heavily loaded
running 1 and 2. We don't know whether that's significant to the OP or
not.

My conclusions:
a) BufferedStream isn't worth using in this case - there must be enough
buffering lower down the food chain to make it irrelevant
b) Conversions cost CPU time
c) Using straight reads can be significantly quicker (25%)
d) My system isn't very good for benchmarking, given the inconsistent
results of test 3
But you're right, there is
a difference; I'll bet it can be ignored, though. Because... my guess is
that any perceived slowness is 99.9999% due to seeking within the file.

One correction to my previous post - when using BinaryReader instead of
BitConverter, the IL for reading a UInt32 involves bitshifting. I don't
know if the JIT gets rid of that though.
But more to the point: does the OP have any numbers to compare this
conversion/seeking process against, or is the claimed slowness simply a
subjective perception? That would interest me.

Likewise.
 

herobeat

I don't have any performance numbers, at least nothing measured in any
detail. Just eyeballing my clock, it took around eleven seconds to
iterate through the headers of the files I'm working on. I imagine
extracting the data I need out of them would take considerably
longer. When I said that the performance was "miserable," I didn't
really mean relative to anything else, just that it's slow in an
absolute sense and I'm trying to improve it.

I may indeed be barking up the wrong tree. I really don't know how
the innards of C# code works, and for all I know, the optimization of
the code will allow it to run just as fast (for practical purposes) as
if I load the structures directly.

As for the seeking, I wouldn't say there's a LOT of it, but there's
definitely some. The files are large archive files that are kind of
like tar files (but not the same format), and I'm trying to write a
program that will allow one to view and extract individual files and/
or directories within them. The directory data is stored in headers
before file data and contain things such as filenames, sizes,
datestamps, etc. The directory data is stored with a fixed-size
header that stores information about the directory, plus an array of
fixed-size blocks that describe each of its child files and
directories. The names are actually stored in a separate chunk, hence
the need to bounce around some. I'm having to seek to the names,
read them into an array, then seek to the directory data structures to
get the hierarchical structure. It's weird, I know, and kind of hard
to explain. I didn't design it.

It's something like:

0x0000: File header
0x0010: Directory entries
0x0010 + 0x30 * # of entries: String table header
0x001C + 0x30 * # of entries: Strings
(filenames to match up with directory entries)
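In code, the seek targets would presumably be computed along these
lines (a sketch based only on the offsets above; the names are made up):

// Hypothetical helper for the layout sketched above.
static class ArchiveLayout
{
    public const long DirEntriesOffset = 0x0010;
    public const long EntrySize = 0x30;
    public const long StringTableHeaderSize = 0x0C; // 0x001C - 0x0010

    public static long StringTableHeaderOffset(long entryCount)
    {
        return DirEntriesOffset + EntrySize * entryCount;
    }

    public static long StringsOffset(long entryCount)
    {
        return StringTableHeaderOffset(entryCount) + StringTableHeaderSize;
    }
}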

In all, there are around 50,000 directory entries I'm trying to
retrieve information about, spread across around 50 files. Each file
has between 10 and 1,500 entries in it.

I hope I'm not coming off as being like, "C# sux!!!" or anything, I'm
just trying to learn about the language and accomplish something
productive at the same time. And for what it's worth, I REALLY
appreciate everyone's help and advice. I'm still pretty new to C#
(although I do have experience in C and C++), so it means a lot to
me. :)
 
