Search through a (large) binary file.

Tom Spink · Sep 15, 2009

Michelle said:
UPDATE

[ . . . ]

But during the reverse engineering we discovered a second one.
I'm now looking for some 'Rabin-Karp algorithm' C# examples.
I think that's a better solution then run the brute-force search twice.

Click to expand...

The challenge is even greater. The record header contains two variables.
So the search must take place at two locations with wildcards.

0x07 0x00 0x?? 0x00 0x00 0x00 0x07 0x00 0x?? 0x00 0x00 0x00 0x08 0x00

(We know the possible byte values)

So my only solution is to use Regex ?

Regards,

Michelle

This sounds like a highly structured file - *surely* there is some sort
of descriptor at the start of it that contains a pointer to these
records.

Presumably there is some software out there to read and write these
files - I doubt very much they do any binary searching. There must be
some kind of allocation table, or header structure that defines the rest
of the file, even a pointer to the first record, which contains a
pointer to the next (in a linked-list style).

When programming becomes this complex, it's usually best to step back
and take another look at the problem.

Tom Spink · Sep 15, 2009

Peter said:
I know that wouldn't be
_my_ first choice approaching a problem like this.

My first choice is the Microsoft "documentation" for the PE format.

Michelle · Sep 16, 2009

Peter,

[. . . ]
Frankly, the more you explain about the basic problem, the less I feel
Files have structures; I can guarantee you whatever this kind of file is,
the intended user code doesn't need to search for things. It simply
parses the data and knows the precise location of particular kinds of
data within the file.

The original file has a structure with non fixed-length records.
There's less documentation (and no official one) available, so reverse
engineering is the only option.
We have the data obtained through data-recovery.
So, between the requested records, there's other data stored.
We can no longer rely on the original file structure.
Because we don't need the complete record information, it's enough to find
the mentioned patterns and reed some bytes before or after these patterns.
Of course there is risk of false positives. But the combination of the
pattern and the available bytes found, makes it acceptable.
Because we need to do this more than once, and using WinHex is a lot of
work, we decided to (try) write an application for this (only for internal
use).
Unfortunately I can't share all the details to the "why"s and "what"s, etc.
related to our problem.

Summarized it comes down:
Search for the pattern: 0xFF 0x56 0x13 0x1A 0x1B 0x08 0x7B 0x15 0x61 0x08
0x00 0x15 0x1E
Read the previous 4 bytes (convert them to a Decimal value)

Search for the pattern: 0x07 0x00 0xXX 0x00 0x00 0x00 0x07 0x00 0xYY 0x00
0x00 0x00 0x08 0x00 ( where 0xXX can have 4 and 0xYY 6 different values)
Read the next 8 bytes (convert them to a Decimal value)

[. . .] You could search for the sub-components individually. Look for
one, then look for the other in the specific place it should be if you
find the first. Though, if Regex makes the code simpler, it might well
be worth it anyway, even if it doesn't perform as well.

Can I do this with the example you created ?

Yesterday I spent my day reading documentation on various algorithms and C#
examples.
The problem is that all the examples I've found on the Internet, they
intended to search for strings and not bytes.
So, I've got a challenge

)

I appreciate your help extremely !

Michelle

Michelle · Sep 16, 2009

Tom,

This sounds like a highly structured file - *surely* there is some sort
of descriptor at the start of it that contains a pointer to these
records.

[ . . . ]

Please read my reply on Peter's contribution.

And also for you Tom.
I appreciate your help extremely !

Michelle

Peter Duniho · Sep 16, 2009

[...]
Summarized it comes down:
Search for the pattern: 0xFF 0x56 0x13 0x1A 0x1B 0x08 0x7B 0x15 0x61 0x08
0x00 0x15 0x1E
Read the previous 4 bytes (convert them to a Decimal value)

Search for the pattern: 0x07 0x00 0xXX 0x00 0x00 0x00 0x07 0x00 0xYY 0x00
0x00 0x00 0x08 0x00 ( where 0xXX can have 4 and 0xYY 6 different values)
Read the next 8 bytes (convert them to a Decimal value)

I'm concerned that you keep writing "Decimal", when nothing about any of
the description of the problem suggests you have or need Decimal values.
Decimal is a specific type in .NET, a base-10 floating point structure.
As I mentioned before, it takes 16 bytes.

You can, of course, store an Int32 or Int64 value in a Decimal variable
_after_ you've converted the raw bytes to Int32 or Int64 as appropriate.
But that's a step completely independent of the file i/o and searching,
and so any discussion of the Decimal type seems out of place here.

The fact that it keeps coming up makes me concerned that you may not
understand the distinction between Decimal and other numeric types, and/or
the implications regarding how the number is stored.

[. . .] You could search for the sub-components individually. Look
for
one, then look for the other in the specific place it should be if you
find the first. Though, if Regex makes the code simpler, it might well
be worth it anyway, even if it doesn't perform as well.

Click to expand...

Can I do this with the example you created ?

Can you do which? You quoted two options: searching for sub-components
individually, and using Regex.

You can do the former by modifying the code I posted. You'll simply have
to come up with a way of representing your search string in a way that can
be translated into calls to the FRangesEqual() method. For example, have
a List<T> of structs where the struct data type stores a reference to a
byte[] containing the byte string you want to find, along with an offset
within the search range for that byte string. Then pass that list to the
"find" method, where it starts the search with the first element in the
list, and then upon finding each element in the list, it executes another
search at the offset relative to the current position in the file for the
next element in the list. Repeat that until you run out of elements in
the list or find a mis-match; if you run out of elements, you've found a
match.

Using your two sample search strings, the first time you search, the
List<T> would have just one element, referencing a single byte string to
look for, "0xFF 0x56 0x13 0x1A 0x1B 0x08 0x7B 0x15 0x61 0x08 0x00 0x15
0x1E", and an offset of 0. The next time you search, the List<T> would
have three elements. The first would reference the byte string "0x07
0x00" and an offset of 0, the next the byte string "0x00 0x00 0x00 0x07
0x00" and an offset of 3, and the last the byte string "0x00 0x00 0x00
0x08 0x00" and an offset of 9.

You can't use the code I posted to do a Regex search, not directly
anyway. IMHO, if you're going to use Regex, you might as well go back to
porting the PowerShell script you found.

Yesterday I spent my day reading documentation on various algorithms and
C#
examples.
The problem is that all the examples I've found on the Internet, they
intended to search for strings and not bytes.
So, I've got a challenge )

Any algorithm you find that is specifically for strings, you should be
able to easily modify to handle bytes instead. The main issue you may run
into would be examples that take advantage of string functions in existing
libraries rather than implementing the algorithm themselves. You can
either provide versions of those functions that work with bytes, or just
stick to those algorithm examples that don't depend on library functions,
but instead do all their own processing (in which case, simply changing
any place a string is used to byte[] and any place a char is used to a
byte).

Pete

Michelle · Sep 16, 2009

Peter,

I'm concerned that you keep writing "Decimal", when nothing about any of
the description of the problem suggests you have or need Decimal values.
Decimal is a specific type in .NET, a base-10 floating point structure.
As I mentioned before, it takes 16 bytes.

Is using 'Decimal notation' better ?

[. . . ]
and so any discussion of the Decimal type seems out of place here.

Correct, it's not the main issue.

[. . . ]

Can you do which? You quoted two options: searching for sub-components
individually, and using Regex.

I meant search for the sub-components individually.

You can do the former by modifying the code I posted. You'll simply have
to come up with a way of representing your search string in a way that can
be translated into calls to the FRangesEqual() method.

[ . . .]

I examine this and see if I can get it done

You can't use the code I posted to do a Regex search, not directly
anyway. IMHO, if you're going to use Regex, you might as well go back to
porting the PowerShell script you found.

Okay, that's not an option.

[. . . ]

You can either provide versions of those functions that work with bytes,
or just stick to those algorithm examples that don't depend on library
functions, but instead do all their own processing (in which case, simply
changing any place a string is used to byte[] and any place a char is
used to a byte).

See if I can get it done.

Michelle

Michelle · Sep 16, 2009

Peter,

[. . .]

You can use the Position property to adjust from where you're reading in
the file; save the current position, set the current position to 4 bytes
earlier than the offset of the found string, read the 4 bytes of
interest, then restore the current position to the previously saved
value.

Int64 Offset1 = (ibBaseOffset + ibOffset); offset = 41152
Int64 Offset2 = stream.Position; offset = 49152

Why is Offset1 not equal to Offset2 ?
Offset1 is the right offset. Changing the block size has affect ( byte[]
rgbBlockCur = new byte[4096]; )

I tried several options with stream.Seek(Offset, SeekOrigin) to set the
current position and restore the previously saved position.
But when I read the previous 4 bytes and restore the position to the
previous saved,
the returned offset is not right anymore. The search continues, but it's not
the right offset anymore.

The first 'hit' has a right offset and the previous read bytes are right.
After restoring the position to the previous saved, then it goes wrong.

Michelle

Peter Duniho · Sep 16, 2009

Peter,

[. . .]
Int64 Offset1 = (ibBaseOffset + ibOffset); offset = 41152
Int64 Offset2 = stream.Position; offset = 49152

Why is Offset1 not equal to Offset2 ?

Because the old position is how far you've read into the file, not
necessarily where you're processing in the file.

Offset1 is the right offset. Changing the block size has affect ( byte[]
rgbBlockCur = new byte[4096]; )

I tried several options with stream.Seek(Offset, SeekOrigin) to set the
current position and restore the previously saved position.
But when I read the previous 4 bytes and restore the position to the
previous saved,
the returned offset is not right anymore. The search continues, but it's
not
the right offset anymore.

It sounds to me as though you are somehow also resetting your processing
position. Without you showing how you modified my original code, I can't
offer any specific advice. But you _ought_ to be only modifying the
stream's Position; the ibBaseOffset and ibOffset variables should keep
their old values, and of course you also need to retain the byte[] buffers
used in the searching. Obviously, the easiest way to manage this is to
not return from the "FindByteString()" method at all when you find a
match; just do the necessary processing with the file then continue.

If for some reason that's not possible, and you don't want to literally
save and restore the necessary values and buffers, an alternative would be
to modify the "FindByteString()" method so that it takes an argument
specifying where to start searching. Then the caller, having done
whatever appropriate processing with the search result, can call the
method again, specifying the start offset as the previously returned
offset plus one. In the method, you'd simply set the stream Position to
this offset immediately after opening the stream.

Note that this latter approach will be less efficient, because the code
will be closing and reopening the stream, as well as re-reading bytes it'd
already read the previous time. I'd recommend simply not returning from
the method until you reach the end of the stream, rather than for each
time you find the string being searched for.

Pete

Michelle · Sep 17, 2009

Peter,

[. . . ]

Without you showing how you modified my original code, I can't offer any
specific advice.

[. . . ]

while (true)
{
if (ibOffset + rgbPattern.Length <= cbBlockCur)
{
if (FRangesEqual(rgbBlockCur, ibOffset,
rgbPattern.Length, rgbPattern))
{
Int64 currOffset = (ibBaseOffset + ibOffset);
//correct offset match
byte[] prevBytes = new Byte[4];

stream.Seek(currOffset-4, SeekOrigin.Begin);
//is 'stream.Seek(-4, SeekOrigin.Current);' possible ?)
stream.Read(prevBytes, 0, 4);
//or 'stream.Read(prevBytes, 0, prevBytes.Length);'
// do something using the byte array prevBytes
stream.Seek(currOffset, SeekOrigin.Begin);
// or maybe do nothing, because after reading 4 bytes is back again

Michelle

Peter Duniho · Sep 17, 2009

[...]
Int64 currOffset = (ibBaseOffset + ibOffset);
//correct offset match
byte[] prevBytes = new Byte[4];

stream.Seek(currOffset-4, SeekOrigin.Begin);
//is 'stream.Seek(-4, SeekOrigin.Current);' possible ?)
stream.Read(prevBytes, 0, 4);
//or 'stream.Read(prevBytes, 0, prevBytes.Length);'
// do something using the byte array
prevBytes
stream.Seek(currOffset, SeekOrigin.Begin);
// or maybe do nothing, because after reading 4 bytes is back again

You need to restore the _actual_ stream Position, not the calculated
search offset. The latter is simply what bytes you're inspecting at the
moment, while the former is the next byte the code will need to read, if
and when it gets around to reading more data.

Your code should look more like this:

Int64 currOffset = stream.Position;
byte[] prevBytes = new Byte[4];

stream.Position = ibBaseOffset + ibOffset - 4;
stream.Read(prevBytes, 0, 4);
stream.Position = currOffset;

Pete

Michelle · Sep 17, 2009

Peter,

It works properly. I'm all the way now

)
Indeed, it's the difference between how far I've read into the file and
where I'm processing in the file.
For me an example says more than just an description.

It's perhaps a stupid question, but 'stream.Read' needs to reed the bytes
from the file again (disk access)
Is it possible to read direct from the byte array, because that's in memory
?
It may not a big difference, but I'm curious.

Michelle

Peter Duniho · Sep 17, 2009

[...]
It's perhaps a stupid question, but 'stream.Read' needs to reed the bytes
from the file again (disk access)
Is it possible to read direct from the byte array, because that's in
memory
?
It may not a big difference, but I'm curious.

Sure, it's possible. At the most simple, you can just check to make sure
that "ibOffset" is >= 4, and if it's not have some fallback code that goes
ahead and reads from the file as you're doing now (since the byte[]
buffers for the scanning won't contain the 4 bytes you need).

If you really want to avoid extra stream accesses under all circumstances,
you could keep a 4 byte (or 8 byte, since it appears from your previous
example you could need as many as 8 bytes) buffer that's updated each time
you update the byte[] buffers. That is, rather than discarding the
previous buffer entirely, save the last 4 (or 8) bytes in the buffer to a
new "scan-back" buffer you add to the code.

That said, Windows is buffering your file data for you already. There's
some extra overhead repositioning the stream, reading, and resetting the
position, but it's probably not nearly as much as one might think.

Pete

Michelle · Sep 17, 2009

Peter,

Thanks for the explanation.
As mentioned before, it's not so important but I was only curious.

Regards,

Michelle

Michelle · Sep 17, 2009

Peter,

My poor man's solution for searching the header [ 0x07 0x00 0x?? 0x00 0x00
0x00 0x07 0x00 0x?? 0x00 0x00 0x00 0x08 0x00 ]
It performs not bad, but probably a nightmare for a real programmer

))

// # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
# # # # # # # # # # # # # # # # # # # # # # # # # # #
byte[] rgbPattern = { 0x04, 0x00 };
//[ . . .]
if (FRangesEqual(rgbBlockCur, ibOffset, rgbPattern.Length, rgbPattern))
{
Int64 currOffset = stream.Position;
byte[] HeaderBytes = new Byte[14];
stream.Position = ibBaseOffset + ibOffset;
stream.Read(HeaderBytes, 0, 14);
string strHeaderBytes =
(Regex.Replace(BitConverter.ToString(HeaderBytes), "-", ""));

if (IsValidHeader(strHeaderBytes))
{
//Read record
BinaryReader br = new BinaryReader(stream, Encoding.Default);
int firstValue = br.ReadInt32();
int secVaue = br.ReadInt32();
// . . . etc
}

static bool FRangesEqual(byte[] rgb, Int64 ibStart, Int64 ibLength, byte[]
rgbPattern)
{
// unchanged
}

static bool IsValidHeader(string strDataEntered)
{
return Regex.IsMatch(strDataEntered,
@"07000[1-3]00000007000[1-6]0000000800");
}

// # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
# # # # # # # # # # # # # # # # # # # # # # # # # # #

Regards,

Michelle

Michelle · Sep 18, 2009

Any feedback on my code ?

You can do the former by modifying the code I posted. You'll simply have
to come up with a way of representing your search string in a way that can
be translated into calls to the FRangesEqual() method.

Couldn't get that working.

Michelle

Peter Duniho · Sep 18, 2009

Any feedback on my code ?

Sorry...haven't had much time to look at it. A quick skim suggests to me
that it ought to work fine; you're delegating most of the work to reading
straight from the file again, which is an effective way of keeping the
code simple and reducing the change for bugs.

Of course, it's not as efficient, but I think you're aware of that. And
correct and working beats fast any day.

Couldn't get that working.

Sorry to hear that. I did offer a description of a specific mechanism you
might use to accomplish that. Hopefully with some more effort, you'll be
able to translate that to code. If having that particular efficiency is
important to you, that is. It might not be.

Pete

Michelle · Sep 19, 2009

Peter,

Sorry...haven't had much time to look at it.

I did not know whether it was clear that I would like your opinion.
Thanks for your response.

Of course, it's not as efficient, but I think you're aware of that. And
correct and working beats fast any day.

How can I make it more efficient ?

[. . . ]
Sorry to hear that. I did offer a description of a specific mechanism you
might use to accomplish that. Hopefully with some more effort, you'll be
able to translate that to code. If having that particular efficiency is
important to you, that is. It might not be.

It works without error (so far we've verified) and that's most important.
But efficiency is the second most important.
I do not have enough knowledge and experience with C # to construct your
explanation into code.
If you have time and opportunity to spare, you might want to send an
example. I would at least grateful.
I've learned a lot from studying your earlier example.

(because of my translation from English, the aim sometimes lost :-((( )

Thanks in advance,

Michelle

Peter Duniho · Sep 19, 2009

[...]
It works without error (so far we've verified) and that's most important.
But efficiency is the second most important.
I do not have enough knowledge and experience with C # to construct your
explanation into code.
If you have time and opportunity to spare, you might want to send an
example. I would at least grateful.

Here's a short snippet illustrating what I mean. Assume a structure like
this:

struct PatternSubset
{
public readonly int ib;
public readonly byte[] rgb;

public PatternSubset(int ib, byte[] rgb)
{
this.ib = ib;
this.rgb = rgb;
}
}

then, rather than just calling FRangesEqual() for a given pattern, call it
in a loop like this:

// This is probably passed into the search method, but I'll just show
a regular
// local variable initialization here
PatternSubset[] rgps = new
{
new PatternSubset(0, new byte[] { 0x07, 0x00 }),
new PatternSubset(3, new byte[] { 0x00, 0x00, 0x00, 0x07, 0x00 }),
new PatternSubset(9, new byte[] { 0x00, 0x00, 0x00, 0x08, 0x00 })
};

// Instead of a single call to FRangesEqual(), you'll have something
like this:
bool fMatches = true;
foreach (PatternSubset ps in lps)
{
if (!FRangesEqual(rgbBlockCur, ibOffset + ps.ib, ps.rgb.Length,
ps.rgb))
{
fMatches = false;
break;
}
}

if (fMatches)
{
// do whatever
}

That's the basic idea. The above doesn't do anything to deal with the
cross-block boundary, so you'll have to adapt the idea shown in the
example to the code you're actually using. It won't work as it is shown
exactly above. But if you take the time to understand both the original
example, and the example above, you should be able to combine them
correctly.

One suggestion for doing that combination: you'll note that in the above,
the call to FRangesEqual() is logically equivalent to the call to
FRangesEqual() in the original example. So, the calculated value
"ibOffset + ps.ib" is logically equivalent to the "ibOffset" value in the
original example. It may make the code more readable if you abstract out
the logic that handles dealing with one or two blocks, like this:

In the original example, instead of the section of code that looks like
this:

if (ibOffset + rgbPattern.Length <= cbBlockCur)
{
if (FRangesEqual(rgbBlockCur, ibOffset, rgbPattern.Length,
rgbPattern))
{
return ibBaseOffset + ibOffset;
}
}
else if (ibOffset + rgbPattern.Length <= cbBlockCur + cbBlockNext)
{
if (FRangesEqual(rgbBlockCur, ibOffset, cbBlockCur - ibOffset,
rgbPattern) &&
FRangesEqual(rgbBlockNext, 0, ibOffset + rgbPattern.Length -
cbBlockCur, rgbPattern))
{
return ibBaseOffset + ibOffset;
}
}
else
{
return -1;
}

You'd replace it with a section of code that looks like this:

if (ibOffset + rgbPattern.Length <= cbBlockCur + cbBlockNext)
{
if (FRangesEqualInBlocks(rgbBlockCur, cbBlockCur, rgbBlockNext,
ibOffset, rgbPattern))
{
return ibBaseOffset + ibOffset;
}
}
else
{
return -1;
}

The FRangesEqualInBlocks() method then looks like this (yes, I know that's
a lot of method arguments...and that's not even allowing for the one more
argument required if you want error checking):

bool FRangesEqualInBlocks(byte[] rgbBlockCur, int cbBlockCur, byte[]
rgbBlockNext, long ibOffset, byte[] rgbPattern)
{
if (ibOffset + rgbPattern.Length <= cbBlockCur)
{
return FRangesEqual(rgbBlockCur, ibOffset, rgbPattern.Length,
rgbPattern);
}

return FRangesEqual(rgbBlockCur, ibOffset, cbBlockCur - ibOffset,
rgbPattern) &&
FRangesEqual(rgbBlockNext, 0, ibOffset + rgbPattern.Length -
cbBlockCur, rgbPattern);
}

Finally, applying that to the example I started out with in this post,
instead of the code I wrote where you'd just call FRangesEqual(), instead
you'd call this new FRangesEqualInBlocks() method:

foreach (PatternSubset ps in lps)
{
if (!FRangesEqualInBlocks(rgbBlockCur, cbBlockCur, rgbBlockNext,
ibOffset + ps.ib, ps.rgb))
{
fMatches = false;
break;
}
}

Oh, one other thing. The original code example was able to detect the
termination condition based solely on "ibOffset" as compared to the
remaining data available. You can do basically the same thing even with
the list of segments to compare; just apply that logic to the last segment
in the list (you may want to validate the list of segments by ensuring
that they are in order of the offset, and that they don't overlap...that
is, the offset plus the length of the byte[] for one segment is less than
the offset for the next segment).

Validation might look something like this:

int cbPattern = -1;

foreach (PatternSubset ps in rgps)
{
if (ps.ib <= cbPattern)
{
// report error

// note that technically, "ps.ib < cbPattern" would
// be okay, but you may want to consider the "equals"
// condition an error anyway, because if it happens,
// it means someone passed in multiple pattern subsets
// that could have been merged. It just depends on how
// strict you want to be with callers.
}

cbPattern = ps.ib + ps.rgb.Length;
}

And then the termination condition becomes this:

if (ibOffset + cbPattern <= cbBlockCur + cbBlockNext)
{
// code to loop through PatternSubset[]
}
else
{
return -1;
}

Hope that helps, and that walking you through the different variations
leading to the eventual solution is okay. Hopefully that allows you to
understand the evolution of the design better, without confusing you too
much, and leaving some actual work for _you_ to do.

Pete

Michelle · Sep 20, 2009

Peter,

Thanks for the code and explanation.
This week i'm going to work with the code and I'll let you know if it
succeeded.

Regards,

Michelle

Michelle · Sep 24, 2009

Peter,

I'm stuck.
I have your code and explanations read several times teh last days, but I do
not manage it.
I give up. I'm lost with the different variants. It's confusing me to much.
You've inserted so much time and ernergie already, that's all I can ask of
you.
Very sincere thanks for all the work and patience.

Regards,

Michelle.

Hex contents of a binary file	1	Sep 20, 2009
Binary value	6	Dec 8, 2006
Read/Write binary file data to/from string	2	Dec 10, 2009
File.Move large binary file	5	Nov 21, 2005
Opening Large Binary file efficiently	11	Nov 30, 2006
Hex to Binary Conversion	3	Oct 7, 2008
read binary file to a byte array	11	Feb 25, 2010
Dealing with very large binary files	5	Aug 6, 2007

Search through a (large) binary file.

Tom Spink

Tom Spink

Michelle

Michelle

Peter Duniho

Michelle

Michelle

Peter Duniho

Michelle

Peter Duniho

Michelle

Peter Duniho

Michelle

Michelle

Michelle

Peter Duniho

Michelle

Peter Duniho

Michelle

Michelle

Ask a Question

Similar Threads