Parsing Binary Files

  • Thread starter: Hemang Shah

Hemang Shah

Hello fellow coders!

OK, I'm trying to write a very simple application in C#. (Yes, it's my first
program.)

What I want to do is:

1) Open a binary file.
2) Search this file for a particular string.
3) Close the file.

Is there anything special I should do because this is a binary file?

Any code examples would be greatly appreciated.

Thank you

Hemang Shah
 
Hemang Shah said:
OK, I'm trying to write a very simple application in C#. (Yes, it's my first
program.)

What I want to do is:

1) Open a binary file.
2) Search this file for a particular string.
3) Close the file.

Is there anything special I should do because this is a binary file?

Well, if you're trying to search for a *string*, you'll need to know
the encoding - or by "string" do you mean "sequence of bytes"?
 
Hello Jon

I'm trying to search for occurrences of "OU=" in the binary file, so yes, it's
a sequence of bytes.

If I open the file in a hex viewer, I can see these and search for them.
Rather than opening the file in a hex viewer every time, I want to write a
utility to search it and display the results.

I did find some code online which opens the file in binary mode and displays
it in a text box.
But what you see in the text box is not the same as what you see in the hex
viewer. Moreover, I don't really understand the code.

Here is the code:

void DisplayFile()
{
    int nCols = 16; // bytes shown per output line

    FileStream inStream = new FileStream(chosenfile, FileMode.Open,
                                         FileAccess.Read);

    // show at most the first 16 KB of the file
    long nBytesToRead = inStream.Length;
    if (nBytesToRead > 65536/4)
        nBytesToRead = 65536/4;

    int nLines = (int)(nBytesToRead/nCols) + 1;
    string[] lines = new string[nLines];
    int nBytesRead = 0;

    for (int i = 0; i < nLines; i++)
    {
        StringBuilder nextLine = new StringBuilder();
        nextLine.Capacity = 4*nCols;

        for (int j = 0; j < nCols; j++)
        {
            int nextByte = inStream.ReadByte(); // -1 at end of file
            nBytesRead++;
            if (nextByte < 0 || nBytesRead > 65536)
                break;

            char nextChar = (char)nextByte;
            if (nextChar < 16)
                // single hex digit, padded with a leading zero
                nextLine.Append(" x0" + string.Format("{0,1:X}", (int)nextChar));
            else if (char.IsLetterOrDigit(nextChar) ||
                     char.IsPunctuation(nextChar))
                // printable: show the character itself
                nextLine.Append(" " + nextChar + " ");
            else
                nextLine.Append(" x" + string.Format("{0,2:X}", (int)nextChar));
        }

        // store the finished line in its slot of the array
        lines[i] = nextLine.ToString();
    }

    inStream.Close();
    this.textBoxContents.Lines = lines;
}

Thank You

 
Hemang Shah said:
I'm trying to search for occurrences of "OU=" in the binary file, so yes, it's
a sequence of bytes.

But OU= is a sequence of *characters*. Do you mean you're looking for
the sequence of bytes which form the ASCII encoding for "OU="? I
suspect that's what you're after.

If I open the file in a hex viewer, I can see these and search for them.
Rather than opening the file in a hex viewer every time, I want to write a
utility to search it and display the results.

I did find some code online which opens the file in binary mode and displays
it in a text box.
But what you see in the text box is not the same as what you see in the hex
viewer. Moreover, I don't really understand the code.

The first thing is to ditch that code. It's bad in many, many ways.

I don't have time to write some sample code for you right now, but I'll
try tomorrow afternoon. Basically, you should read the file in chunks,
and then look through for the correct sequence, knowing that it might
go across a "chunk boundary".
 
Yes, you are right, that is what I'm trying to achieve: a sequence of
*characters*, which I thought comprised a string.

I can send you a sample of the type of files I'm trying to read if you like.

I would really appreciate it if you could write me a sample; that would be
going above and beyond!

You can write it tomorrow or whenever you can. Or you can point me to some
good resources which would teach / explain the logic behind it.

Reading in chunks makes sense. The files that I'll be parsing will sometimes
be as large as 16 to 80 GB. But I'll only have to parse the first few hundred
MB of data to get the "OU=".

Thanks a lot again in advance.

Hemang
 
Hemang Shah said:
Yes you are right, that is what I'm trying to achieve.. A sequence of
*Characters* which I thought comprised a string.

I think what Jon was trying to say is that *bytes* and *characters* are two
different things. In .NET, a char is a 16-bit Unicode (UTF-16) code unit,
i.e. it has a size of 2 bytes. You can convert strings to a variety of binary
representations (including plain ASCII), which have a different byte layout.
Now, in your binary file, do you want to look for occurrences of a string in
*Unicode* (UTF-16) representation or in ASCII (or some other) representation?
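
To make the byte/character distinction concrete, here is a minimal sketch that prints the bytes "OU=" turns into under two encodings. The class and method names (`EncodingDemo`, `ToHex`) are made up for illustration; `Encoding.ASCII`, `Encoding.Unicode` (UTF-16, little-endian) and `BitConverter.ToString` are standard .NET calls.

```csharp
using System;
using System.Text;

static class EncodingDemo
{
    // Formats a byte array as space-separated hex pairs.
    public static string ToHex(byte[] bytes)
    {
        return BitConverter.ToString(bytes).Replace("-", " ");
    }

    public static void Main()
    {
        string text = "OU=";

        // One byte per character: 4F 55 3D
        Console.WriteLine("ASCII : " + ToHex(Encoding.ASCII.GetBytes(text)));

        // Two bytes per character: 4F 00 55 00 3D 00
        Console.WriteLine("UTF-16: " + ToHex(Encoding.Unicode.GetBytes(text)));
    }
}
```

Whichever of these patterns shows up in the hex viewer is the byte sequence the search has to look for.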
...
I would really appreciate if you could write me a sample, that would be
going over & beyond!

Here's a little sample I've come up with:
It reads binary blocks of data from a file, then tests every possible
position. After that, it copies the trailing n bytes of the buffer to the
beginning and starts reading after byte n, so it can find matches on "chunk
boundaries". (I think it works.)
Note that this is not the fastest search algorithm (google for
"Boyer-Moore" for more info). But I'd guess in your case the hard disk is the
bottleneck anyway.


using System;
using System.IO;
using System.Text;

class BinarySearch
{
    static void Main()
    {
        string stringToLookFor = "7777";
        string filePath = @"C:\SomePath\pi.txt";

        // convert the string to a binary (ASCII) representation
        byte[] bufferToLookFor = Encoding.ASCII.GetBytes(stringToLookFor);

        int matchCounter = 0; // count matches for nicer output

        // open the file in binary mode
        using (Stream stream = new FileStream(filePath, FileMode.Open,
                                              FileAccess.Read))
        {
            byte[] readBuffer = new byte[16384]; // our input buffer
            int bytesRead;    // number of bytes read
            int offset = 0;   // bytes carried over at the start of the buffer
            long filePos = 0; // position inside the file before read operation
            while ((bytesRead = stream.Read(readBuffer, offset,
                                            readBuffer.Length - offset)) > 0)
            {
                int valid = offset + bytesRead; // valid bytes in the buffer
                for (int i = 0; i <= valid - bufferToLookFor.Length; i++)
                {
                    bool match = true;
                    for (int j = 0; j < bufferToLookFor.Length; j++)
                        if (bufferToLookFor[j] != readBuffer[i + j])
                        {
                            match = false;
                            break;
                        }
                    if (match)
                    {
                        Console.WriteLine("{0,5}. \"{1}\" found at {2:x}",
                            ++matchCounter, stringToLookFor, filePos + i - offset);
                        //return;
                    }
                }
                // store file position before next read
                filePos = stream.Position;

                // keep the last (pattern length - 1) bytes to ensure matches
                // on "chunk boundaries" without reporting any match twice
                offset = Math.Min(bufferToLookFor.Length - 1, valid);
                for (int i = 0; i < offset; i++)
                    readBuffer[i] = readBuffer[valid - offset + i];
            }
        }
        if (matchCounter == 0)
            Console.WriteLine("No match found");
    }
}


Niki
 
Hemang Shah said:
I would really appreciate if you could write me a sample, that would be
going over & beyond!

Is the sample Niki provided okay for you? (I like the idea of copying
the buffer - nice simple way of dealing with boundaries.)
 
Thank you Niki & Jon

I took the sample and it worked for me. I was able to get proper matches.

Now I have some questions if you don't mind me asking:

1) The code right now is case-sensitive with respect to the string we want to
search for, is that correct?
2) If I want it to be case-insensitive, do I type the string in every
possible combination and search for each of those byte sequences? Or is there
a better solution?
3) After a match is found, I want to read some number of characters after it
and display them. The number of characters after the match is not fixed; it
could be one word or it could be a sentence. I would know where it ends
because it is terminated by another search string.
4) I don't understand the copying of the buffer so that we can check across
boundaries. I understand the concept, but I cannot follow the code from there.
Also, how do I handle fetching the info if it lies across a boundary?
5) Our input buffer is set to 16 bytes. Is there any reason it's 16, or
could it be any size?

I hope I was able to ask the right questions.

Thank You

Hemang.






 
Hemang Shah said:
Thank you Niki & Jon

I took the sample and it worked for me. I was able to get proper matches.

Now I have some questions if you don't mind me asking:

1) The code right now is case-sensitive with respect to the string we want to
search for, is that correct?

Yes.

2) If I want it to be case-insensitive, do I type the string in every
possible combination and search for each of those byte sequences? Or is there
a better solution?

Well, you could supply multiple byte arrays, and check whether the nth
byte is any of the acceptable ones, rather than just a single
acceptable one. You then just supply a lower case version and an upper
case version - you don't need to come up with every combination.
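
A minimal sketch of that two-array idea follows, assuming the pattern is plain ASCII. The names `CaseInsensitiveMatch` and `MatchesAt` are hypothetical and do not come from the thread; the check could replace the inner comparison loop of a chunked search.

```csharp
using System;
using System.Text;

static class CaseInsensitiveMatch
{
    // Returns true if the bytes starting at 'pos' match the pattern, where
    // each byte may equal either the lower-case or the upper-case form of
    // the pattern at that position. 'lower' and 'upper' have equal length.
    public static bool MatchesAt(byte[] buffer, int pos, byte[] lower, byte[] upper)
    {
        if (pos < 0 || pos + lower.Length > buffer.Length) return false;
        for (int j = 0; j < lower.Length; j++)
        {
            byte b = buffer[pos + j];
            if (b != lower[j] && b != upper[j]) return false;
        }
        return true;
    }

    public static void Main()
    {
        string pattern = "OU=";
        byte[] lower = Encoding.ASCII.GetBytes(pattern.ToLowerInvariant());
        byte[] upper = Encoding.ASCII.GetBytes(pattern.ToUpperInvariant());

        byte[] data = Encoding.ASCII.GetBytes("xx ou=Sales, Ou=Support");
        for (int i = 0; i < data.Length; i++)
            if (MatchesAt(data, i, lower, upper))
                Console.WriteLine("match at offset {0}", i);
    }
}
```

Only two arrays are needed because each position is checked independently, so mixed-case spellings such as "Ou=" are covered as well.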

3) After a match is found, I want to read some number of characters after it
and display them. The number of characters after the match is not fixed; it
could be one word or it could be a sentence. I would know where it ends
because it is terminated by another search string.

To what extent is this *really* a binary file? Pretty much everything
you've said has been in terms of text.

4) I don't understand the copying of the buffer so that we can check across
boundaries. I understand the concept, but I cannot follow the code from there.

I haven't actually looked at Niki's code myself.

Also, how do I handle fetching the info if it lies across a boundary?
5) Our input buffer is set to 16 bytes. Is there any reason it's 16, or
could it be any size?

It could be set to any size. I'd usually use about 32K myself.
 
Hemang Shah said:
Thank you Niki & Jon

I took the sample and it worked for me. I was able to get proper matches.

Now I have some questions if you don't mind me asking:

1) The code right now is case-sensitive with respect to the string we want to
search for, is that correct?

Yes.

2) If I want it to be case-insensitive, do I type the string in every
possible combination and search for each of those byte sequences? Or is there
a better solution?

I'd convert the input string to uppercase, and convert each byte in the
buffer to uppercase too before comparing.
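
As a rough sketch of that suggestion, assuming the data of interest is ASCII: each buffer byte is folded to upper case on the fly and compared against a pattern that was upper-cased once up front. `AsciiFold`, `ToUpperAscii` and `MatchesAt` are illustrative names only.

```csharp
using System;
using System.Text;

static class AsciiFold
{
    // Maps ASCII 'a'..'z' to 'A'..'Z'; every other byte is returned unchanged.
    public static byte ToUpperAscii(byte b)
    {
        return (b >= (byte)'a' && b <= (byte)'z') ? (byte)(b - 32) : b;
    }

    // Compares the bytes starting at 'pos' against an already upper-cased
    // pattern, ignoring ASCII case in the buffer.
    public static bool MatchesAt(byte[] buffer, int pos, byte[] upperPattern)
    {
        if (pos < 0 || pos + upperPattern.Length > buffer.Length) return false;
        for (int j = 0; j < upperPattern.Length; j++)
            if (ToUpperAscii(buffer[pos + j]) != upperPattern[j]) return false;
        return true;
    }

    public static void Main()
    {
        byte[] pattern = Encoding.ASCII.GetBytes("ou=".ToUpperInvariant());
        byte[] data = Encoding.ASCII.GetBytes("cn=Bob,Ou=Sales");
        Console.WriteLine(MatchesAt(data, 7, pattern)); // "Ou=" starts at offset 7
        Console.WriteLine(MatchesAt(data, 0, pattern)); // "cn=" is not a match
    }
}
```

Folding bytes directly avoids decoding the buffer into a string, which matters when most of the file is not text at all.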

3) After a match is found, I want to read some number of characters after it
and display them. The number of characters after the match is not fixed; it
could be one word or it could be a sentence. I would know where it ends
because it is terminated by another search string.

If you have the offset in the file, you can use Stream.Seek & Stream.Read to
do that.
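
One way this could look, as a sketch under assumptions: the value is ASCII, it ends at a known terminator, and it is never longer than some upper bound. `ValueExtractor`, `ReadUntil`, the `maxBytes` limit and the "/CN=" terminator are all hypothetical choices for the example; `Stream.Seek` and `Stream.Read` are the calls mentioned above.

```csharp
using System;
using System.IO;
using System.Text;

static class ValueExtractor
{
    // Reads up to maxBytes starting at file offset 'start' and returns the
    // bytes that precede the first occurrence of 'terminator'. If the
    // terminator does not occur within maxBytes, everything read is returned.
    public static byte[] ReadUntil(Stream stream, long start, byte[] terminator, int maxBytes)
    {
        stream.Seek(start, SeekOrigin.Begin);

        byte[] buffer = new byte[maxBytes];
        int total = 0;
        // Stream.Read may return fewer bytes than requested, so loop.
        while (total < maxBytes)
        {
            int n = stream.Read(buffer, total, maxBytes - total);
            if (n == 0) break; // end of stream
            total += n;
        }

        int end = total;
        for (int i = 0; i <= total - terminator.Length; i++)
        {
            bool hit = true;
            for (int j = 0; j < terminator.Length; j++)
                if (buffer[i + j] != terminator[j]) { hit = false; break; }
            if (hit) { end = i; break; }
        }

        byte[] result = new byte[end];
        Array.Copy(buffer, result, end);
        return result;
    }

    public static void Main()
    {
        byte[] data = Encoding.ASCII.GetBytes("\0\0OU=Sales/CN=Bob\0");
        using (MemoryStream stream = new MemoryStream(data))
        {
            // The match "OU=" is at offset 2, so the value starts at 2 + 3 = 5.
            byte[] value = ReadUntil(stream, 5, Encoding.ASCII.GetBytes("/CN="), 64);
            Console.WriteLine(Encoding.ASCII.GetString(value)); // prints "Sales"
        }
    }
}
```

Because this is a separate read starting at the match offset, it does not matter whether the value straddles a chunk boundary of the search buffer. The search loop should save and restore `stream.Position` around such a call if it keeps scanning the same stream afterwards.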

4) I don't understand the copying of the buffer so that we can check across
boundaries. I understand the concept, but I cannot follow the code from
there.

Try to use a short buffer (e.g. 20 bytes) and a short file, and step through
the code with the debugger. IMO that's generally the best way to see what a
program does.
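
For readers without a debugger at hand, the following sketch does something similar in print statements: it scans a small in-memory stream with an 8-byte buffer and logs each chunk, so the carried-over bytes are visible. It is an illustrative variant, not the code posted in this thread; `ChunkTrace`, `FindAll` and the sample data are assumptions, and it requires 0 < pattern length < buffer size.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

static class ChunkTrace
{
    // Scans 'stream' for 'pattern' with a deliberately small buffer and
    // returns the stream offsets of all matches.
    public static List<long> FindAll(Stream stream, byte[] pattern, int bufferSize, bool trace)
    {
        List<long> hits = new List<long>();
        byte[] buffer = new byte[bufferSize];
        int carried = 0;     // bytes kept from the previous chunk
        long chunkStart = 0; // stream offset of buffer[0]
        int read;
        while ((read = stream.Read(buffer, carried, buffer.Length - carried)) > 0)
        {
            int valid = carried + read;
            if (trace)
                Console.WriteLine("offset {0,3}: \"{1}\" ({2} carried)", chunkStart,
                    Encoding.ASCII.GetString(buffer, 0, valid), carried);

            for (int i = 0; i <= valid - pattern.Length; i++)
            {
                int j = 0;
                while (j < pattern.Length && buffer[i + j] == pattern[j]) j++;
                if (j == pattern.Length) hits.Add(chunkStart + i);
            }

            // Keep the last pattern.Length - 1 bytes: just enough for a match
            // that starts in this chunk and ends in the next one.
            int keep = Math.Min(pattern.Length - 1, valid);
            Array.Copy(buffer, valid - keep, buffer, 0, keep);
            chunkStart += valid - keep;
            carried = keep;
        }
        return hits;
    }

    public static void Main()
    {
        // "OU=" at offset 6 straddles the first 8-byte chunk boundary.
        byte[] data = Encoding.ASCII.GetBytes("abcdefOU=xyzOU=q");
        using (MemoryStream stream = new MemoryStream(data))
        {
            foreach (long pos in FindAll(stream, Encoding.ASCII.GetBytes("OU="), 8, true))
                Console.WriteLine("match at {0}", pos);
        }
    }
}
```

In the trace, the first chunk ends with "OU", those two bytes are copied to the front of the buffer, and the next read appends "=" right after them, which is how the match across the boundary gets found.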

Also, how do I handle fetching the info if it lies across a boundary?

As I said, I'd use a separate Stream.Read call to extract that info.

5) Our input buffer is set to 16 bytes. Is there any reason it's 16, or
could it be any size?

It's set to 16 KB (16384 bytes). Disk access typically happens in pages of
about 4 KB, so the buffer should be at least 4 KB (otherwise the same page may
have to be read more than once). I usually make it a little bigger so the
overhead of calling into the OS isn't incurred that often.
If you don't care about performance (e.g. for testing or debugging) you can
make it any size as long as it's bigger than the search string.

I hope I was able to ask the right questions.

There are no stupid questions. Only stupid answers...

Niki
 