Reading Binary Files

I need to split a large binary file into two binary files. The file contains a
delimiter (say, a newline). I need to split it so that the first file holds
everything up to the newline and the second file holds everything from the
newline to the end of the file. Kindly let me know whether this is possible.

Thanks
Rohith
 
Rohith said:
I need to split a large binary file into two binary files. The file contains a
delimiter (say, a newline). I need to split it so that the first file holds
everything up to the newline and the second file holds everything from the
newline to the end of the file. Kindly let me know whether this is possible.

Just open a System.IO.FileStream on the file. Read it in chunks
into a byte[] and examine each chunk for your delimiter. Write the chunks
to the first output FileStream until you hit the delimiter, then to the second.
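
A minimal sketch of that chunked approach, written in Python rather than .NET only to keep it short (in .NET the same loop would sit on top of FileStream.Read). The function name, the 64 KB chunk size, and the choice to drop the delimiter from both outputs are assumptions, not something stated in this thread. The few bytes carried over between reads are what let a multi-byte delimiter be found even when it straddles two chunks.

```python
import shutil


def split_at_delimiter(src_path, first_path, second_path,
                       delimiter=b"\n", chunk_size=64 * 1024):
    """Write everything before the first delimiter to first_path and
    everything after it to second_path (the delimiter itself is dropped).

    Returns the byte offset of the delimiter, or -1 if it never occurs,
    in which case first_path is a full copy and second_path is empty.
    """
    keep = len(delimiter) - 1   # bytes that might be the start of a split delimiter
    written = 0                 # bytes already flushed to the first file
    tail = b""                  # unflushed bytes carried into the next read
    with open(src_path, "rb") as src, \
            open(first_path, "wb") as first, \
            open(second_path, "wb") as second:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                first.write(tail)           # end of input, no delimiter found
                return -1
            buf = tail + chunk
            pos = buf.find(delimiter)
            if pos >= 0:
                first.write(buf[:pos])
                second.write(buf[pos + len(delimiter):])
                shutil.copyfileobj(src, second, chunk_size)  # stream the rest
                return written + pos
            # No match yet: flush all but the last `keep` bytes.
            cut = max(0, len(buf) - keep)
            first.write(buf[:cut])
            written += cut
            tail = buf[cut:]
```

Memory use stays at roughly one chunk regardless of file size, so a 1 GB input is read once, front to back.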

David
 
Yes, this will work. But I have a huge binary file, nearly 1 GB. Is there an
alternative way to find the delimiter position without looping through and
checking every chunk?

 
Thanks David. Checking every byte for the delimiter would be a huge
performance hit. I was wondering whether there is an alternative solution
for this?
 
Rohith said:
Thanks David. Checking every byte for the delimiter would be a huge
performance hit. I was wondering whether there is an alternative solution
for this?

First:
Are you sure that the binary data CANNOT contain a newline?
If it can.....Oh well.

Are there any data length headers embedded in the binary data?
If so, you can possibly seek right to the position you need.
Most binary files have header fields that provide data lengths or offsets.

Without knowing anything about the structure of this file, it is difficult to be more helpful.

Good luck
Bill
 
Hi Bill,

Bill said:
First:
Are you sure that the binary data CANNOT contain a newline?
If it can.....Oh well.

The delimiter need not be a newline. My actual requirement is to
serialize two binary files into a single binary file and then deserialize it.
The delimiter can be anything, for that matter. I just need a way to find the
position of the delimiter in that file.

Bill said:
Are there any data length headers embedded in the Binary data?
If so, you can possibly seek right to the position you need.
Most binary files have header fields that provide data lengths or offsets.

The thing is that I will not know the actual position. I will
know only the delimiter.

Bill said:
Without knowing anything about the structure of this file, it is difficult to be more helpful.

As for the structure, it is only a raw chunk of bytes.

Thanks
Rohith
 
Rohith said:
The delimiter need not be a newline. My actual requirement is to
serialize two binary files into a single binary file and then deserialize it.
The delimiter can be anything, for that matter. I just need a way to find the
position of the delimiter in that file.

The thing is that I will not know the actual position. I will
know only the delimiter.

As for the structure, it is only a raw chunk of bytes.

Typically you would prepend a header onto the file indicating, say the
number of files contained, their names and offsets. Then you can seek
around in the file to find the offsets.
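
A rough sketch of the simplest form of such a header, for the two-file case discussed here: a single length field in front of the data. It is in Python for brevity (in .NET, BinaryWriter/BinaryReader over a FileStream would play the same role). The 8-byte little-endian layout and all function names are assumptions made for illustration; a fuller header with file counts and names would follow the same pattern.

```python
import os
import shutil
import struct

# Assumed header layout: one unsigned 64-bit little-endian integer holding
# the length of the first file. The second file is whatever follows it.
HEADER = struct.Struct("<Q")


def pack_two(first_path, second_path, out_path):
    """Write header + first file + second file into out_path."""
    with open(out_path, "wb") as out:
        out.write(HEADER.pack(os.path.getsize(first_path)))
        for path in (first_path, second_path):
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


def second_file_offset(in_path):
    """Offset where the second file starts; seek here to skip the first."""
    with open(in_path, "rb") as src:
        (first_len,) = HEADER.unpack(src.read(HEADER.size))
    return HEADER.size + first_len


def unpack_two(in_path, first_path, second_path, chunk_size=64 * 1024):
    """Split a file written by pack_two back into its two parts."""
    with open(in_path, "rb") as src:
        (first_len,) = HEADER.unpack(src.read(HEADER.size))
        with open(first_path, "wb") as first:
            remaining = first_len
            while remaining:
                chunk = src.read(min(chunk_size, remaining))
                if not chunk:
                    raise ValueError("file is shorter than its header claims")
                first.write(chunk)
                remaining -= len(chunk)
        with open(second_path, "wb") as second:
            shutil.copyfileobj(src, second)
```

No byte of the payload is ever inspected, so the data may contain any value, including whatever would have been the delimiter.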

David
 
Rohith said:
The delimiter need not be a newline. My actual requirement is to
serialize two binary files into a single binary file and then deserialize it.
The delimiter can be anything, for that matter. I just need a way to find the
position of the delimiter in that file.

If YOU are the one responsible for combining the data and then separating it, is there any reason
why you can't have a header in the file? If you could include a header, you could easily include the
sizes/offsets of the raw chunks. Then you would have no need of a delimiter.
If your hands are tied and you can do nothing more than a delimiter, then you have problems. You
need to choose a delimiter that CANNOT exist in the binary data, but ANY value can exist in binary
data. You would need to scan your data to make sure that the delimiter is acceptable, and then find
a way to keep track of what the delimiter was.
If your only option is to use a delimiter, you have no choice but to search for it linearly,
and you may need a multi-byte delimiter if every 8-bit combination exists in the data.
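
To illustrate the scan described above, here is a small Python sketch (the function name and chunk size are made up for this example) that reports which single-byte values never occur in a file. An empty result is exactly the case where no one-byte delimiter is safe and a multi-byte one, or a header, is needed.

```python
def unused_byte_values(path, chunk_size=64 * 1024):
    """Return the byte values (0-255) that do not occur anywhere in the file."""
    seen = bytearray(256)            # seen[v] becomes 1 once value v is found
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            for value in set(chunk):  # distinct byte values in this chunk
                seen[value] = 1
    return [value for value in range(256) if not seen[value]]
```

Note that this is a full pass over the data before writing, on top of the linear search needed later when splitting.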

I personally would fight for the header.

Good luck,
Bill
 
Rohith said:
Thanks David. Checking every byte for the delimiter would be a huge
performance hit. I was wondering whether there is an alternative solution
for this?

The cost of looking through memory is likely to be much smaller than
the IO cost in the first place.

As Bill suggested though, if you're the one who gets to combine the
files, it's easy - just include the lengths of each file.
 
You're gonna have to read the file to split it anyway. If there is no
header that tells where the delimiter is or if you cannot create one,
then you will have to read the file in manually.

Typically, you would read a certain amount at a time into a memory
buffer, for example 4K, then search that buffer for the delimiter.

The performance should not be too bad.
 
Thanks for the replies.

I would not be able to add a header to the files, as I have a set of binary files
from the previous version of my application that do not have a header. My new
requirement is to store two separate binary files in a single binary file and
deserialize them. But if I add a header to the new files, I will not be able to
tell which files need to be split and which do not.
 
Rohith said:
Thanks for the replies.

I would not be able to add a header to the files, as I have a set of binary files
from the previous version of my application that do not have a header. My new
requirement is to store two separate binary files in a single binary file and
deserialize them. But if I add a header to the new files, I will not be able to
tell which files need to be split and which do not.

Sure you can.

Add the header to the Compound files only.
Start it off with a MAGIC String of bytes that remains the same.
Although there is a tiny possibility of incorrectly identifying a Simple file as being Compound, you
can control how tiny by extending the MAGIC String length.

Also

If the binary data is not random there will be some sequences of bytes that are FAR more likely than
others. Careful selection of the MAGIC String can effectively eliminate a false positive.

Good Luck
Bill
 
As it's a huge file (nearly 2 GB), it would not be easy to come up with magic bytes
that will be present only once. Also, the text present in the binary file will
not always be the same. So to find the magic bytes, do I have to search through the
file every time before serializing?
 
Rohith said:
As it's a huge file (nearly 2 GB), it would not be easy to come up with magic bytes
that will be present only once.

They'd only have to not be present at the start of the old files.
Compare that with your delimiter idea which relies on the delimiter
*never* being present.

Rohith said:
Also, the text present in the binary file will
not always be the same. So to find the magic bytes, do I have to search through the
file every time before serializing?

No, you'd look for the magic bytes at the start of the file when
*deserializing*.
 
Jon Skeet said:
No, you'd look for the magic bytes at the start of the file when
*deserializing*.

But how do I ensure that the files from my previous application version do not
have these magic bytes at the start of the file? Also, how do I identify the
magic bytes?
 
Rohith said:
But how do I ensure that the files from my previous application version do not
have these magic bytes at the start of the file?

Well, what are these files? Many file formats already have a magic
number at the start of the file.

It seems to me that if you're only now considering how to deal with the
problem, then you've got that problem whether you use extra headers or
not. I don't see how your delimiter idea is any better (and it strikes
me as more likely to be a lot worse).

Where are these files coming from? Can you change existing ones when
you upgrade to a new version of your software?

If you generate a random sequence of 16 bytes, the chance of any
existing file happening to start with that same sequence is
*extremely* small (the same as two GUIDs colliding). I suspect that
would actually be good enough, and the best you can do in the situation
you're in.
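
As a small illustration, in Python: the magic number would be generated once, at development time, and then pasted into the application as a constant. The function names are hypothetical, and the probability figure assumes the leading bytes of the old files behave like uniformly random data, which real file formats usually do not.

```python
import secrets


def new_magic(length=16):
    """Generate `length` random bytes to use as a magic number."""
    return secrets.token_bytes(length)


def chance_of_accidental_match(length=16):
    """Chance that `length` uniformly random bytes equal a fixed magic number."""
    return 1 / 256 ** length


if __name__ == "__main__":
    # Run once, then hard-code the printed value in the reader and the writer.
    print(new_magic().hex())
    print(chance_of_accidental_match())   # 2**-128 for 16 bytes
```

Each extra byte of magic number divides the chance of a false match by 256.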

Rohith said:
Also, how do I identify the magic bytes?

That's simple - by reading the first 16 bytes (or however long your
magic number is) and seeing whether or not they are the same as the
magic number.
 
Jon Skeet said:
Well, what are these files? Many file formats already have a magic
number at the start of the file.
It seems to me that if you're only now considering how to deal with the
problem, then you've got that problem whether you use extra headers or
not. I don't see how your delimiter idea is any better (and it strikes
me as more likely to be a lot worse).

Where are these files coming from? Can you change existing ones when
you upgrade to a new version of your software?

No. I will not even be able to tell whether this is an older or a newer version
file.

Jon Skeet said:
If you generate a random sequence of 16 bytes, the chances of any
existing files happening to start with that same sequence is
*extremely* small (the same as two GUIDs colliding). I suspect that
would actually be good enough, and the best you can do in the situation
you're in.

My previous version files do not have these magic bytes written. But when I
deserialize them, I will not be in a position to tell whether it is a previous
version or a new version file. So I will not be able to get the length of the
first file from the magic bytes.
 
Rohith said:
No. I will not even be able to tell whether this is an older or a newer version
file.

Okay. In that case, you would certainly have no chance with a single-byte
delimiter as you were planning, would you?

Rohith said:
My previous version files do not have these magic bytes written. But when I
deserialize them, I will not be in a position to tell whether it is a previous
version or a new version file. So I will not be able to get the length of the
first file from the magic bytes.

You can find the length of a whole file very easily (eg use
FileStream.Length after opening it, or FileInfo.Length).

Basically, if you start deserializing and don't see the magic number,
the content is just the whole of the file.

If you *do* see the magic number, you then read whatever header
information you've put into the new files, and deserialize
appropriately.
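
The decision logic above might look roughly like this in Python (os.path.getsize stands in for FileInfo.Length). The MAGIC value, the header layout (magic number followed by an 8-byte length of the first payload), and the function name are all hypothetical; the real header would hold whatever the new files need.

```python
import os
import struct

# Hypothetical 16-byte magic number, generated once and then hard-coded.
MAGIC = bytes.fromhex("8f3a1c5e9b7240d6a1e4c3b2f0975d18")
# Assumed header after the magic: length of the first payload, 8 bytes, little-endian.
LENGTH = struct.Struct("<Q")


def locate_payloads(path):
    """Return a list of (offset, length) pairs, one per payload in the file.

    Old-format files (no magic number) yield one pair covering the whole
    file; new-format files yield two pairs, read from the header.
    """
    total = os.path.getsize(path)
    with open(path, "rb") as f:
        if f.read(len(MAGIC)) != MAGIC:
            return [(0, total)]                   # old file: all of it is data
        raw = f.read(LENGTH.size)
        if len(raw) < LENGTH.size:
            raise ValueError("truncated header")
        (first_len,) = LENGTH.unpack(raw)
    body = len(MAGIC) + LENGTH.size               # where the first payload starts
    if body + first_len > total:
        raise ValueError("header length exceeds file size")
    return [(body, first_len),
            (body + first_len, total - body - first_len)]
```

The caller can then seek to each offset and read exactly that many bytes, so neither format requires searching through the data.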
 
