Search through a (large) binary file.

P

Peter Duniho

Peter,

Thank you for the explanation. It's more difficult then I thought.
I thought that searching through binary data a common issue was (you can
do
this with every hex editor) and that there were plenty of examples
available.

I don't know about the relative quantity of examples one way or the
other. As far as it being a common feature in hex editors, well...a) not
all hex editors are created equal (any editor that requires the entire
file data to be loaded into memory at once, which is a lot of them, can
trivially search the entire file contents without worrying about the
block-level i/o), and b) for someone with enough programming experience to
be able to write a good hex editor, this really isn't a very difficult
problem.
[...]
To try your example, I created a console application (using Visual
Studio 2008)
Maybe it's a little bit silly, but how can I add my pattern (hex value)
as
an argument to FindByteString -> byte[] rgbPattern?

The same way you'd initialize any array. For example:

byte[] rgbPattern = { 0xff, 0x56, 0x13, 0x1a };

If you want to provide a text string as initialization, that's possible as
well. You'll just have to parse each pair of hex digits in the string as
a byte and add them to an appropriate array. For example:

string strPattern = "ff56131a";
byte[] rgbPattern = RgbFromString(strPattern);

where:

byte[] RgbFromString(string strPattern)
{
byte[] rgbRet = new byte[strPattern.Length / 2];

for (int ib = 0; ib < rgbRet.Length; ib++)
{
rgbRet[ib] = byte.Parse(strPattern.Substring(ib * 2, 2),
NumberStyles.HexNumber);
}

return rgbRet;
}
int FindByteString(string strInputFile, byte[] rgbPattern)

Get an error on this and don't know how to solve it:
if (++ibOffset < cbBlockCur.Length)

'int' does not contain a definition for 'Length' and no extension method
'Length' accepting a first argument of type 'int' could be found (are you
missing a using directive or an assembly reference?)

As mentioned before, I'm a rookie using C#, so please be patient.
But I'm willing to learn.

For what it's worth, it's my opinion that one of the best ways to learn
programming is to read other people's code and figure out how it works.
And of course, if you do that, in the process you should be able to figure
out how to fix any parts of the code that might not be exactly correct.

As far as that specific error goes, the purpose of that line of code is
simply to continue the loop if there are still unprocessed bytes left in
the current block (all of the rest of the code in the loop, after that
"if" statement and its block, handles updating the in-memory blocks
holding the data being processed to the next block-size part of the
file). Knowing that, it should be clear that the correct statement is:

if (++ibOffset < cbBlockCur)

That is, just remove the ".Length"...it's superfluous and left over from a
variation of the code I had before I changed it and posted my message.

Hope that helps.

Pete
 
M

Michelle

Rossum,

I spend some time reading about the mentioned algorithms.
The "Boyer-Moore Algorithm" seems to be the most popular and fastest. (Also
used in several hex editors).
Unfortunately (so far) I couldn't find a c# example for searching binary
files.
Because I've not much experience using C#, it's difficult to translate the
'theory' to a working example.
So if you know one, I appreciate a link to it.
I'd like to read other people's code and figure out how it works to learn
about. (It's for education only :)) )

Regards,

Michelle
 
M

Michelle

Peter,

It's a shame, but i can't get it to work.
I'm stuck with this error: "An object reference is required for the
non-static field, method, or property
'binSearch.Program.FindByteString(string, byte[])'"
I tried several solutions found, but no luck so far.

namespace binSearch
{
class Program
{
static void Main(string[] args)
{
byte[] rgbPattern = { 0xFF, 0x56, 0x13, 0x1A, 0x1B, 0x08, 0x7B,
0x15, 0x61, 0x08, 0x00, 0x15, 0x1E };
FindByteString(@"D:\BinarySearch\myfile.bin");
}
int FindByteString(string strInputFile, byte[] rgbPattern)
{
byte[] rgbBlockCur = new byte[4096],
[. . .]

Michelle
 
T

Tom Spink

Michelle said:
Peter,

It's a shame, but i can't get it to work.
I'm stuck with this error: "An object reference is required for the
non-static field, method, or property
'binSearch.Program.FindByteString(string, byte[])'"
I tried several solutions found, but no luck so far.

namespace binSearch
{
class Program
{
static void Main(string[] args)
{
byte[] rgbPattern = { 0xFF, 0x56, 0x13, 0x1A, 0x1B, 0x08, 0x7B,
0x15, 0x61, 0x08, 0x00, 0x15, 0x1E };
FindByteString(@"D:\BinarySearch\myfile.bin");
}
int FindByteString(string strInputFile, byte[] rgbPattern)
{
byte[] rgbBlockCur = new byte[4096],
[. . .]

Michelle

You need to qualify the method definition with the static keyword:

///
static int FindByteString(string strInputFile, byte[] rgbPattern)
///

Put the keyword 'static' before the 'int', and it should remove that
error. It's probably best if you know why... so I suggest a bit of
research on the 'static' keyword and what static members are.
 
M

Michelle

Tom,
You need to qualify the method definition with the static keyword:
static int FindByteString(string strInputFile, byte[] rgbPattern)

Put the keyword 'static' before the 'int', and it should remove that
error. It's probably best if you know why... so I suggest a bit of
research on the 'static' keyword and what static members are.

I did some research on that and tried this solution already.
When I Put the keyword 'static' before the 'int' i get more errors (3 times)
"An object reference is required for the non-static field, method, or
property 'binSearch.Program.FRangesEqual(byte[], int, int, byte[])' "

Michelle
 
T

Tom Spink

Michelle said:
Tom,
You need to qualify the method definition with the static keyword:
static int FindByteString(string strInputFile, byte[] rgbPattern)

Put the keyword 'static' before the 'int', and it should remove that
error. It's probably best if you know why... so I suggest a bit of
research on the 'static' keyword and what static members are.

I did some research on that and tried this solution already.
When I Put the keyword 'static' before the 'int' i get more errors (3 times)
"An object reference is required for the non-static field, method, or
property 'binSearch.Program.FRangesEqual(byte[], int, int, byte[])' "

Michelle

That's because you've got more methods that you need to qualify with the
static keyword.

So, you've got a method called FRangesEqual - I bet that doesn't have
'static' on it either!

Stick it on, and on any others methods you've defined.

I don't mean to be rude, but if you'd done your research, you'll see why
you need to qualify all these method declarations with 'static'. If a
static method calls another method defined a class - and you don't have
an object reference, then that method must be static.
 
M

Michelle

Tom,

Errors fixed.
I become a little bit tirred, so I'm not as sharp longer.
Excuses for that.

Michelle
 
T

Tom Spink

Michelle said:
Tom,

Errors fixed.
I become a little bit tirred, so I'm not as sharp longer.
Excuses for that.

Michelle

I'm glad you've got it sorted! That makes me smile.

See! :)
 
M

Michelle

Peter,
Try something like this (warning: uncompiled, untested, unoptimized, no
error-checking, doesn't handle files > 2GB, and assumes that the search
pattern is always no longer than the length of a single block that's
read...proof of concept only):

With the help of Tom I've solved some errors. So it works :))
I want to first understand and optimize this part, before I get busy with
reading the previous four bytes.

Why doesn't it handle files > 2GB ? (this could be an issue for me. The
files are sometimes even larger then 2 Gb)
Is it correct that it finds only the first occurence of the pattern. (if so,
how to find all occurences ?)
About the error-checking (and as part of that) compare the search pattern
length and the length of a single block, I think I can solve that.
But can you give me some hits about the optimalisation (and what about the
"Boyer-Moore Algorithm")

Thanks in advance,

Michelle
 
T

Tom Spink

Hello, Michelle.
Why doesn't it handle files > 2GB ? (this could be an issue for me. The
files are sometimes even larger then 2 Gb)

It's the old, "There's not enough atoms in the Universe" problem - but
in this case, there's not enough bits in an integer.

An 'int' is 32 bits.

2^32 = 4294967296

This is 4 GiB, I hear you think. But, an int is signed, so in actual
fact, the largest value for an int is:

int.MaxValue = (2^31 - 1) = 2147483647 = 2GiB

(it's "minus one" because the maximum value is one less than the
number of possible values)

This is because of the way negative numbers are represented, half the
possible values represent a positive value (0 to 2147483647) and half
the possible values represent a negative value (-2,147,483,648 to -1).
 
P

Peter Duniho

Peter,


With the help of Tom I've solved some errors. So it works :))
I want to first understand and optimize this part, before I get busy with
reading the previous four bytes.

Why doesn't it handle files > 2GB ? (this could be an issue for me. The
files are sometimes even larger then 2 Gb)

Tom's answered that. If you want larger files to be handled, you need to
use 64-bit integers instead of 32-bit integers anywhere the code has an
int used as an offset within the file.
Is it correct that it finds only the first occurence of the pattern. (if
so,
how to find all occurences ?)

You are correct...it returns only the first occurrence. To find all
occurrences, instead of returning when a match is found, you'd simply do
some appropriate action there and then continue processing as if you
hadn't found a match. That appropriate action could be to call a delegate
(e.g. Action<int>, where the delegate takes the offset into the file and
does something), or perhaps just add the offset to a List<int> and then
return the List said:
About the error-checking (and as part of that) compare the search pattern
length and the length of a single block, I think I can solve that.
But can you give me some hits about the optimalisation (and what about
the
"Boyer-Moore Algorithm")

Boyer-Moore is a special-case of the finite-state-machine solution I
alluded to earlier. For a search for a single string, the general-purpose
FSM solution has a small potential to be slightly faster than the current
brute-force approach, and the Boyer-Moore the potential to be slightly
faster than that. Any speed improvement when searching for a single
string would be highly dependent on the input data.

The reason those might be a little faster is that they both guarantee that
you only ever have to examine a given byte from the file once. The
brute-force solution I posted has the potential for examining a given byte
multiple times, depending on how many times it follows something that
looks like a potential match.

In practice, both the generalized FSM and the Boyer-Moore solution involve
a little bit of overhead themselves. At the same time, assuming random
data, on average the brute-force algorithm will only have a 2-byte false
start (the smallest false start possible) only 1 out of 256 iterations,
and only longer than that 1 out of 65536 iterations. I.e. most of the
time even the brute-force algorithm only examines each byte once, if you
only searching for one string.

Where Boyer-Moore and the generalized FSM shine is when you have multiple
strings to compare against, because they continue to allow you to only
ever examine each byte of the file once, whereas the brute-force solution
increases exponentially in cost as your list of search strings increases
in size.

If you have only a single string at a time you're ever looking for, I
wouldn't bother with the FSM or Boyer-Moore (or similar).

As far as optimizing the code I posted goes, I haven't really thought
about it much. I would say that if it performs acceptably well for you,
you might as well spend your time on other things. Don't bother trying to
make it faster until you know for sure that making it faster will make a
difference to your user, and then make it faster by measuring the
performance of the code with a profiler so you know what parts are slower
and should be improved.

Pete
 
M

Michelle

Peter,
Tom's answered that. If you want larger files to be handled, you need to
use 64-bit integers instead of 32-bit integers anywhere the code has an
int used as an offset within the file.

Solved that already.
If you read the explanation is actually very logical and simple.
. . To find all occurrences, instead of returning when a match is found,
you'd simply do some appropriate action there and then continue
processing as if you hadn't found a match.

Solved that already.
Rest and take away what, works enlightening,
Boyer-Moore is a special-case of the finite-state-machine solution I
alluded to earlier. For a search for a single string, the general-purpose

Thanks for the clear explanation.
.... I would say that if it performs acceptably well for you, you might
as well spend your time on other things.

It performs much faster then the Powershell script. For now it is more than
enough.
I am now involved in reading the previous four bytes.

Michelle
 
M

Michelle

I've tried to write code to read the previous four bytes.
This is my code:

static void FindPrevBytes(int ioffset)
{
FileStream fs = File.OpenRead(@"D:\myfile.bin");
// opens the file for reading
BinaryReader br = new BinaryReader(fs);
br.BaseStream.Position = ioffset - 4;
// position = offset pattern found -4 bytes
byte[] arrBytes = br.ReadBytes(4);
// array containing the four bytes
string strBytes = (Regex.Replace(BitConverter.ToString(arrBytes), "-",
"")); // string output WITHOUT dashes.
int intDecValue = int.Parse(strBytes,
System.Globalization.NumberStyles.HexNumber); // convert Hex to
Dec

Console.WriteLine("Value found: " + intDecValue );
// write
}


I'm not happy about this; FileStream fs = File.OpenRead(@"D:\myfile.bin");
Because it opens the file for reading the second time.
It works, but I'm not sure this is the best solution.
Please give me feedback.

Best regards,

Michelle
 
P

Peter Duniho

[...]
I'm not happy about this; FileStream fs =
File.OpenRead(@"D:\myfile.bin");
Because it opens the file for reading the second time.
It works, but I'm not sure this is the best solution.
Please give me feedback.

The only reason to use BinaryReader is if you want to convert directly
from bytes read to some more complex type. If you're only ever going to
read the plain bytes, just read them directly from the FileStream you're
working with. You can use the Position property to adjust from where
you're reading in the file; save the current position, set the current
position to 4 bytes earlier than the offset of the found string, read the
4 bytes of interest, then restore the current position to the previously
saved value.

Your processing of those four bytes seems odd too, but perhaps we just
don't know enough about the format. If you have 4 bytes in an array, it
would seem that you ought to just be using BitConverter to convert those
directly to a 32-bit int, rather than doing all that stuff with the string.

If, as it appears from your conversion code, the bytes in the file are
big-endian (most-significant-byte first) you can handle that much more
cleanly simply by swapping the bytes before converting to a 32-bit int.
If you like, you can just encapsulate that in a single method:

int Int32FromMSBArray(byte[] rgb)
{
byte bT = rgb[0];

rgb[0] = rgb[3];
rgb[3] = bT;
bT = rgb[1];
rgb[1] = rgb[2];
rgb[2] = rgbT;

return BitConverter.ToInt32(rgb, 0);
}

(The above code modifies the input array; normally this shouldn't be a
problem, but if you need the original array preserved, you'll want to make
a copy).

Pete
 
P

Peter Duniho

[...]
(The above code modifies the input array; normally this shouldn't be a
problem, but if you need the original array preserved, you'll want to
make a copy).

Or alternatively, just swap the bytes _after_ the conversion, using
traditional bit-shift/or operations (the code's less obvious to some
people, but it will be much more efficient).

Pete
 
M

Michelle

Peter,
[ . . .] If you're only ever going to read the plain bytes, just read
them directly from the FileStream you're working with. You can use the
Position property to adjust from where you're reading in the file; save
the current position, set the current position to 4 bytes earlier than
the offset of the found string, read the 4 bytes of interest, then
restore the current position to the previously saved value.

So I must use Get and Set using the FileStream.Position property.
But how can I read the bytes from a specific position from the FileStream .
I couldn't find that using the FileStream, so I'd used BinaryReader.
If, as it appears from your conversion code, the bytes in the file are
big-endian (most-significant-byte first) you can handle that much more
cleanly simply by swapping the bytes before converting to a 32-bit int.

All I need is the Decimal value of these four bytes.
I can't use BitConverter to convert an Array directly to Decimal.
I can use a System.IO.MemoryStream object instead:

//Create a decimal from a bye array.
public static decimal ByteArrayToDecimal (byte[] src)
{
//Create a MemoryStream containing the byte array.
using (MemoryStream stream = new MemoryStream(src))
{
//Create a BinaryReader to read the decimal from the stream.
using (BinaryReader reader = new BinaryReader(stream))
{
// Read and return the decimal from the BinaryReader/MemoryStream
return reader.ReadDecimal();
}
}
}

Instead of that, I used:

string strBytes = (Regex.Replace(BitConverter.ToString(arrBytes), "-",
"")); // string output WITHOUT dashes.
int intDecValue = int.Parse(strBytes,
System.Globalization.NumberStyles.HexNumber); //convert to dec


Best regards,

Michelle
 
P

Peter Duniho

So I must use Get and Set using the FileStream.Position property.
But how can I read the bytes from a specific position from the
FileStream .
I couldn't find that using the FileStream, so I'd used BinaryReader.

FileStream _only_ allows reading bytes, either one at a time or into a
byte[]. As far as reading bytes from a specific position in the file, you
have to set the Position property to the position where you want to read,
and then you simply read. It's just that simple.
All I need is the Decimal value of these four bytes.
I can't use BitConverter to convert an Array directly to Decimal.

I don't really understand that. I mean, yes it's true...BitConverter
doesn't provide support for reading or writing the Decimal type. But the
code you posted before doesn't use the Decimal type anyway.
I can use a System.IO.MemoryStream object instead:

//Create a decimal from a bye array.
public static decimal ByteArrayToDecimal (byte[] src)
{
//Create a MemoryStream containing the byte array.
using (MemoryStream stream = new MemoryStream(src))
{
//Create a BinaryReader to read the decimal from the stream.
using (BinaryReader reader = new BinaryReader(stream))
{
// Read and return the decimal from the BinaryReader/MemoryStream
return reader.ReadDecimal();
}
}
}

That looks like a fine way to convert a byte array to the Decimal numeric
type. But if you've already got a FileStream, you should be able to use
BinaryReader directly on it, and if the stored data in the file isn't
actually in the Decimal numeric format, then reading it as Decimal isn't
going to work anyway.

Note that the Decimal numeric type takes up 16 bytes in a file. So if you
are reading only 4 bytes, then obviously either you are reading the data
as the wrong format, or not reading enough bytes. Either way, you won't
get valid results.
Instead of that, I used:

string strBytes = (Regex.Replace(BitConverter.ToString(arrBytes), "-",
"")); // string output WITHOUT dashes.
int intDecValue = int.Parse(strBytes,
System.Globalization.NumberStyles.HexNumber); //convert to dec

Do you get the values you expect when you do that? If so, then your
number must not be Decimal in the first place.

Beyond that, as I mentioned before, the code above will parse your bytes
as if they are big-endian. So if the data in the file isn't in big-endian
format, that's also a problem. Again, the question is, do you get the
values you expect?

Basically, you've provided no information here that would allow anyone to
know for sure what the format of the numbers in your file are. Unless and
until you do so, no one can offer any specific advice as to how to read
that data; we can only make assumptions based on code you've shown,
including the assumption the code you've shown does what you want.

Pete
 
M

Michelle

Peter,
[...]
As far as reading bytes from a specific position in the file, you have to
set the Position property to the position where you want to read, and
then you simply read. It's just that simple.
Clear.

Note that the Decimal numeric type takes up 16 bytes in a file. So if you
are reading only 4 bytes, then obviously either you are reading the data
as the wrong format, or not reading enough bytes. Either way, you won't
get valid results.

Bytes read: 00-C0-22-4F
Decimal value: 00C0224F -> 12591695
It starts always (far as we know now) with 0x00, so actually I only need 3
bytes

It's the way it's used in my file.
Do you get the values you expect when you do that? If so, then your
number must not be Decimal in the first place.

Yes, as expected.
Basically, you've provided no information here that would allow anyone to
know for sure what the format of the numbers in your file are.
[...]

We're reverse engineering a proprietary file.
As mentioned above i'd need to read 3/4 bytes and convert it to a Decimal
value.

In my earlier postings, I'll assume that I only wanted to search for one
pattern.
But during the reverse engineering we discovered a second one.
I'm now looking for some 'Rabin-Karp algorithm' C# examples.
I think that's a better solution then run the brute-force search twice.
As you can see, the truth is closer today than yesterday ;-))

Best regards,

Michelle
 
M

Michelle

UPDATE

[ . . . ]
But during the reverse engineering we discovered a second one.
I'm now looking for some 'Rabin-Karp algorithm' C# examples.
I think that's a better solution then run the brute-force search twice.

The challenge is even greater. The record header contains two variables.
So the search must take place at two locations with wildcards.

0x07 0x00 0x?? 0x00 0x00 0x00 0x07 0x00 0x?? 0x00 0x00 0x00 0x08 0x00

(We know the possible byte values)

So my only solution is to use Regex ?

Regards,

Michelle
 
P

Peter Duniho

UPDATE

[ . . . ]
But during the reverse engineering we discovered a second one.
I'm now looking for some 'Rabin-Karp algorithm' C# examples.
I think that's a better solution then run the brute-force search twice.

The challenge is even greater. The record header contains two variables.
So the search must take place at two locations with wildcards.

0x07 0x00 0x?? 0x00 0x00 0x00 0x07 0x00 0x?? 0x00 0x00 0x00 0x08 0x00

(We know the possible byte values)

So my only solution is to use Regex ?

No, not necessarily. You could search for the sub-components
individually. Look for one, then look for the other in the specific place
it should be if you find the first. Though, if Regex makes the code
simpler, it might well be worth it anyway, even if it doesn't perform as
well.

That said, you have a broader problem in that the more variability in the
data that's allowed, the greater the chance that you'll find the pattern
you're looking for, but not in the context you intend (i.e. a false
positive search result). You have that chance even with a regular search
pattern, but as the pattern gets shorter with more variation allowed, the
odds increase.

And I note that the above string of bytes is quite a bit different from,
and quite a bit simpler than, the search pattern you showed in earlier
posts. I would guess there's a much higher chance of seeing that pattern
in the wrong context than the other.

Frankly, the more you explain about the basic problem, the less I feel
that a simple search-and-replace is really the right way to go about
things. Files have structures; I can guarantee you whatever this kind of
file is, the intended user code doesn't need to search for things. It
simply parses the data and knows the precise location of particular kinds
of data within the file.

IMHO, your efforts would be better spent trying to reverse engineer the
file to the point where you can accomplish the same, rather than investing
effort on speeding up string searches on the data. Even better, just get
the documentation for the file format and code from that, rather than all
this investment in reverse-engineering.

I obviously don't have all the details with respect to the "why"s,
"what"s, etc. related to your problem. But it seems like you've taken a
time-consuming, difficult path that is practically guaranteed to be the
one that provides the least reliable results. I know that wouldn't be
_my_ first choice approaching a problem like this. :)

Pete
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top