Search through a (large) binary file.

M

Michelle

I need some help to search through a (sometimes large) binary file.

I'd like to search within this binary file for a pattern containing a
particular hex value (e.g. FF56131A1B087B15610800151E).
When the pattern is found, i need to know the (start) offset, because then I'd
like to read the 4 previous bytes (need the hex values).
I suppose that i need to read the file in blocks (500 kb or 1 mb) because
it's a large file.

I 'am a rookie using C#, so if possible please share a piece of code.

Thanks in advance,

Michelle
 
J

Jeff Johnson

I need some help to search through a (sometimes large) binary file.

I'd like to search within this binary file for a pattern containing a
particular hex value (e.g. FF56131A1B087B15610800151E).
When the pattern is found, i need to know the (start) offset, because then
I'd like to read the 4 previous bytes (need the hex values).
I suppose that i need to read the file in blocks (500 kb or 1 mb) because
it's a large file.

Define "large," because you'd be better of if you can read the entire thing
into a byte array and then just work off that byte array. Using chunks
forces you to deal with the situation where the data you're searching for is
split across the end of one chunk and the beginning of the next.
 
M

Michelle

It's not only reading a binary file, but also searching for a particular hex
value (pattern).
And what about if my pattern is at the block border ?
 
M

Michelle

Mostly more then 1 Gb.

Jeff Johnson said:
Define "large," because you'd be better of if you can read the entire
thing into a byte array and then just work off that byte array. Using
chunks forces you to deal with the situation where the data you're
searching for is split across the end of one chunk and the beginning of
the next.
 
P

Peter Duniho

[...]
Define "large," because you'd be better of if you can read the entire
thing into a byte array and then just work off that byte array. Using
chunks forces you to deal with the situation where the data you're
searching for is split across the end of one chunk and the beginning of
the next.

Mostly more then 1 Gb.

Jeff is correct that it will be more of a hassle to deal with per-block
reads from the file when searching for a specific string of bytes. But
with a file that large, you probably will want to anyway. A general
purpose solution would involve some kind of list or circular buffer of
blocks you've read, but if you can ensure that the length of the search
string is always less than the size of a single read block, you could get
away with a couple of variables or a two-element array.

Other than dealing with the cross-boundary comparison logic, it should be
straightforward. Just read data and at each byte offset, compare your
search string to the bytes starting at that offset. Just repeat until you
either find the byte string you're looking for or you've reached a byte
offset that is closer to the end of the file than you have bytes in your
search string.

Pete
 
T

Tom Spink

Hi Michelle,
I need some help to search through a (sometimes large) binary file.

I'd like to search within this binary file for a pattern containing a
particular hex value (e.g. FF56131A1B087B15610800151E).
When the pattern is found, i need to know the (start) offset, because then I'd
like to read the 4 previous bytes (need the hex values).
I suppose that i need to read the file in blocks (500 kb or 1 mb) because
it's a large file.

I 'am a rookie using C#, so if possible please share a piece of code.

Thanks in advance,

Michelle

Do you know if the hex value you are searching for is aligned at all?
 
T

Tom Spink

Peter said:
[...]
Define "large," because you'd be better of if you can read the entire
thing into a byte array and then just work off that byte array. Using
chunks forces you to deal with the situation where the data you're
searching for is split across the end of one chunk and the beginning of
the next.

Mostly more then 1 Gb.

Jeff is correct that it will be more of a hassle to deal with per-block
reads from the file when searching for a specific string of bytes. But
with a file that large, you probably will want to anyway. A general
purpose solution would involve some kind of list or circular buffer of
blocks you've read, but if you can ensure that the length of the search
string is always less than the size of a single read block, you could get
away with a couple of variables or a two-element array.

Other than dealing with the cross-boundary comparison logic, it should be
straightforward. Just read data and at each byte offset, compare your
search string to the bytes starting at that offset. Just repeat until you
either find the byte string you're looking for or you've reached a byte
offset that is closer to the end of the file than you have bytes in your
search string.

Pete

I'd say it'd probably be easier to use a finite state machine,
and just read one byte at a time. The maximum size of the buffer would
be the length of the array to find.
 
T

Tom Spink

Hi All,

Tom said:
I'd say it'd probably be easier to use a finite state machine,
and just read one byte at a time. The maximum size of the buffer would
be the length of the array to find.

I haven't tested this in the slightest (I've got to run off to work),
except compile-tested and ran with a very (very) contrived sample.

Let me know what you guys think - and what I've forgotten about, and
please rip it apart and correct it!

It takes a stream, and finds the offset of the first matching sequence
of bytes in 'arr'. Hopefully.

///
public static long FindArrayOffset(Stream s, byte[] arr)
{
int testByte, testIndex = 0;
long startingOffset = 0;

for (testByte = s.ReadByte(); testByte >= 0; testByte = s.ReadByte()) {
if (arr[testIndex++] != (byte)testByte) {
testIndex = 0;
startingOffset = s.Position;
}

if (testIndex == arr.Length)
return startingOffset;
}

return -1;
}
///

I haven't thought through all the code paths, but I've got a brain cell
nagging me about off-by-one errors, so pay particular attention to the
post-increment operator, as I may have cocked that up.
 
M

Michelle

Hi Tom,
Do you know if the hex value you are searching for is aligned at all?

I don't exactly understand what you mean with "aligned" (maybe it's because
English is not my native language)
I need to search in all the rubbish for the pattern
(FF56131A1B087B15610800151E)
and then read the previous 4 bytes. (For example:
0923080709C0224FFF56131A1B087B15610800151E -> 09C0224F)
The pattern is a kind of 'record footer'. But the 'records' doesn't have a
fixed length..

Is this the information you're needed?

Michelle
 
M

Michelle

This is what i've done so far:

-<code>-

void ReadFile()
{
string pattern = "FF56131A1B087B15610800151E";
string inFile = @"D:\myfile.bin";
string outFile = @"D:\myfile.txt";
long blockSize = 512000;
int prevBytes = 4;

FileStream stream = File.OpenRead(inFile);
if (stream.Length < blockSize)
{
blockSize = stream.Length;
}
long lastBlock = stream.Length % blockSize;
byte[] buffer = new byte[blockSize];
int extraBytes = pattern.Length / 2 + prevBytes - 1;

// ERROR
// long blocks = (long)Math.Ceiling(stream.Length / blockSize);
// The call is ambiguous between the following methods or
properties: 'System.Math.Ceiling(decimal)' and 'System.Math.Ceiling(double)'

int block = 1;

while (block <= blocks)
{
if (block==blocks && block==lastBlock)
{
blockSize = lastBlock;

// ERROR
// byte[] buffer = new byte[blockSize];
// A local variable named 'buffer' cannot be declared in
this scope because it would give a different meaning to 'buffer',
// which is already used in a 'parent or current' scope
to denote something else

}
}

stream.Read(buffer, 0, blockSize);
string hexStr = String.Join("", (rimBytes + buffer));
String.Format("{0:X2}",hexStr);
// more to go
}

-<code>-

It's based on a Powershell script i found with Google.

-<script.ps1>-

param (
[string]$pattern = $(throw '-pattern can''t be $null'),
[string]$file = $(throw '-file can''t be $null'),
[long]$blockSize = 50kb,
[int]$prevBytes = 8
)
if ($blockSize -lt 1kb) {throw '-blockSize must be at least 1kb'}
$file = $(if (test-path $file) {(rvpa $file).path} else {
throw "Cannot find $path"})
$stream = [io.file]::OpenRead($file)
if ($stream.length -lt $blockSize) {$blockSize = $stream.length}
$lastBlock = $stream.length % $blockSize
$buffer = new-object byte[] $blockSize
$extraBytes = $pattern.length / 2 + $prevBytes - 1
$blocks = [math]::ceiling($stream.length / $blockSize)
$block = 1
while ($block -le $blocks) {
if ($block -eq $blocks -and $lastBlock) {$blockSize = $lastBlock
$buffer = new-object byte[] $blockSize}

[void]$stream.read($buffer, 0, $blockSize)
$hexStr = [string]::join('', ($rimBytes + $buffer | % {'{0:X2}' -f $_}))
$rimBytes = $buffer[-$extraBytes..-1]
[regex]::matches($hexStr, $pattern, 'ignoreCase') |
% {$i = $_.index - $prevBytes * 2
[string]::join('', $hexStr[$i..($i + $prevBytes * 2 - 1)]) |
% {$target = $_
$hexBytes = 0..($_.length - 1) | ? {!($_ -band 1)} |
% {"0x$($target.subString($_,2))"}
}
}
$block++
}
[void]$stream.close
[void]$stream.dispose

-<script.ps1>-
 
T

Tom Spink

Michelle said:
Hi Tom,


I don't exactly understand what you mean with "aligned" (maybe it's because
English is not my native language)
I need to search in all the rubbish for the pattern
(FF56131A1B087B15610800151E)
and then read the previous 4 bytes. (For example:
0923080709C0224FFF56131A1B087B15610800151E -> 09C0224F)
The pattern is a kind of 'record footer'. But the 'records' doesn't have a
fixed length..

Is this the information you're needed?

Michelle

Hi Michelle,

I posted a possible solution in reply to another message - I was waiting
for community approval, however!
 
P

Peter Duniho

[...]
public static long FindArrayOffset(Stream s, byte[] arr)
{
int testByte, testIndex = 0;
long startingOffset = 0;

for (testByte = s.ReadByte(); testByte >= 0; testByte =
s.ReadByte()) {
if (arr[testIndex++] != (byte)testByte) {
testIndex = 0;
startingOffset = s.Position;
}

if (testIndex == arr.Length)
return startingOffset;
}

return -1;
}
///

I haven't thought through all the code paths, but I've got a brain cell
nagging me about off-by-one errors, so pay particular attention to the
post-increment operator, as I may have cocked that up.

I would say that the biggest issue is that if the real match is found
starting within a partial match, that won't be found. For example,
searching for "ABC" within the string "AABC". By the time the code
"knows" that the string "AA" doesn't match with "AB", it's passed where
the actual location of the searched-for string starts (the next character
it's going to examing for a match with the beginning of the search string
is 'B').

The problem is fixable, but would result in a more elaborate FSM
implementation (it's not as simple as just backing up one byte, because
the problem is more generalized than that...e.g. search for "AABC" within
"AAABC", etc.). It's not clear that the OP really needs this level of
complexity; managing cross-block searches isn't really _that_ hard, while
implementing a correct FSM can actually be a little tricky. And assuming
false starts during the search are few and very short, repeatedly
examining a given byte after a false start shouldn't hurt performance too
much.

All that said, it's true that one significant advantage of an FSM approach
is the ability to scan strictly sequentially, one byte at a time, through
the file, no cross-block issues to worry about. If that's an appealing
feature to the OP, in spite of the added complexity the FSM will bring,
there have been several good solutions suggested along these lines in past
threads in this newsgroup that essentially address the "search for a
string in a stream of characters" problem. Applying the same solution to
bytes instead of characters is trivial.

Here's a link to one such message thread:
http://groups.google.com/group/micr...csharp/browse_thread/thread/462037d91c08e693/

It includes a bit of a discussion on the topic, includes suggestions for
at least two different viable approaches, and one of my posts in the
thread links to a couple of posts in yet another previous thread where (in
the first post) I posted a non-tricky, basic, generic state graph
implementation, and then (in the follow-up) a little performance-fix to
that original code.

Pete
 
P

Peter Duniho

Hi Tom,


I don't exactly understand what you mean with "aligned" (maybe it's
because
English is not my native language)

I believe that Tom's asking if the pattern you're looking for will always
begin at a byte the offset of which is exactly some multiple of some
number larger than 1.

For example, a 32-bit-aligned pattern could be found at byte offset 0, 4,
8, etc. but never at 1, 2, 3, 5, 6, 7, etc.

If it's aligned, then that obviously can reduce somewhat the overhead of
non-matching characters, because you have to examine on average 1/cb the
number of bytes, where "cb" is the byte length of the alignment (e.g. for
32-bit alignment, "cb" is "4").

Pete
 
M

Michelle

It's not aligned.


Peter Duniho said:
I believe that Tom's asking if the pattern you're looking for will always
begin at a byte the offset of which is exactly some multiple of some
number larger than 1.

For example, a 32-bit-aligned pattern could be found at byte offset 0, 4,
8, etc. but never at 1, 2, 3, 5, 6, 7, etc.

If it's aligned, then that obviously can reduce somewhat the overhead of
non-matching characters, because you have to examine on average 1/cb the
number of bytes, where "cb" is the byte length of the alignment (e.g. for
32-bit alignment, "cb" is "4").

Pete
 
M

Michelle

Tested the Powershell script.

It's very slow. . . but it works !!
Is it possible to convert the script to C#. I suppose it's much faster in
C#.
Maybe it needs some performance tuning, but it illustrates what I need.

Regards,

Michelle
 
P

Peter Duniho

Tested the Powershell script.

It's very slow. . . but it works !!
Is it possible to convert the script to C#. I suppose it's much faster
in C#.
Maybe it needs some performance tuning, but it illustrates what I need.

Actually, the main reason it's slow is the fundamental implementation, not
the language And if you port that logic to C#, you'll have the same
problems. It's converting bytes to characters for the purpose of the
searching, as well as doing string concatenation to deal with the
cross-block issues.

It may be that you gain some speed by moving to a compiled .NET
implementation, but you'll still have the other overhead.

That said, sure...it's possible to port the script to C#. It's just code.

Pete
 
P

Peter Duniho

This is what i've done so far: [...]

Unfortunately, it's not even close. You haven't even successfully ported
the block-read logic, and that's not the interesting part of sample. It
"handles" cross-block boundaries simply by maintaining enough "previous
data" to account for the length of the search pattern, and then prepending
that to the recently read data before each match attempt. It's also
formatting the bytes as text before doing the match, which is very
inefficient.

Try something like this (warning: uncompiled, untested, unoptimized, no
error-checking, doesn't handle files > 2GB, and assumes that the search
pattern is always no longer than the length of a single block that's
read...proof of concept only):

int FindByteString(string strInputFile, byte[] rgbPattern)
{
byte[] rgbBlockCur = new byte[4096],
rgbBlockNext = new byte[rgbBlockCur.Length];
FileStream stream = File.OpenRead(strInputFile);
int cbBlockCur, cbBlockNext, ibOffset = 0, ibBaseOffset = 0;

cbBlockCur = stream.Read(rgbBlockCur, 0, rgbBlockCur.Length);
cbBlockNext = stream.Read(rgbBlockNext, 0, rgbBlockNext.Length);

while (true)
{
if (ibOffset + rgbPattern.Length <= cbBlockCur)
{
if (FRangesEqual(rgbBlockCur, ibOffset, rgbPattern.Length,
rgbPattern))
{
return ibBaseOffset + ibOffset;
}
}
else if (ibOffset + rgbPattern.Length <= cbBlockCur +
cbBlockNext)
{
if (FRangesEqual(rgbBlockCur, ibOffset, cbBlockCur -
ibOffset, rgbPattern) &&
FRangesEqual(rgbBlockNext, 0, ibOffset +
rgbPattern.Length - cbBlockCur, rgbPattern))
{
return ibBaseOffset + ibOffset;
}
}
else
{
return -1;
}

if (++ibOffset < cbBlockCur.Length)
{
continue;
}

byte[] rgbT = rgbBlockCur;
rgbBlockCur = rgbBlockNext;
rgbBlockNext = rgbT;

ibOffset = 0;
ibBaseOffset += cbBlockCur;

cbBlockCur = cbBlockNext;
cbBlockNext = stream.Read(rgbBlockNext, 0,
rgbBlockNext.Length);
}
}

bool FRangesEqual(byte[] rgb, int ibStart, int ibLength, byte[]
rgbPattern)
{
for (int ibPattern = 0; ibPattern < ibLength; ibPattern++)
{
if (rgb[ibStart + ibPattern] != rgbPattern[ibPattern])
{
return false;
}
}

return true;
}

There are a number of refinements one might make to the above. But even
that basic example ought to perform much better than a C# port of the
PowerShell script you posted. I apologize in advance for any typos or
bugs; hopefully it's enough to get you pointed in the right direction, in
spite of any flaws that might be present.

Pete
 
T

Tom Spink

Peter said:
I believe that Tom's asking if the pattern you're looking for will always
begin at a byte the offset of which is exactly some multiple of some
number larger than 1.

For example, a 32-bit-aligned pattern could be found at byte offset 0, 4,
8, etc. but never at 1, 2, 3, 5, 6, 7, etc.

If it's aligned, then that obviously can reduce somewhat the overhead of
non-matching characters, because you have to examine on average 1/cb the
number of bytes, where "cb" is the byte length of the alignment (e.g. for
32-bit alignment, "cb" is "4").

Pete

Thanks Pete,

Those were my thoughts - which would also tie in nicely with your
chunking algorithm - and help to eliminate the block-overlap problem, if
the block size used was some multiple of the alignment value.
 
M

Michelle

Peter,

Thank you for the explanation. It's more difficult then I thought.
I thought that searching through binary data a common issue was (you can do
this with every hex editor) and that there were plenty of examples
available.
For me there is only a single pattern to match, I don't need to search for
multiple patterns in a single pass.
Speed is not the main issue (although it's important), but finding
everything is the main issue.

To try your example, I created a console application (using Visual Studio
2008)
Maybe it's a little bit silly, but how can I add my pattern (hex value) as
an argument to FindByteString -> byte[] rgbPattern?
int FindByteString(string strInputFile, byte[] rgbPattern)

Get an error on this and don't know how to solve it:
if (++ibOffset < cbBlockCur.Length)

'int' does not contain a definition for 'Length' and no extension method
'Length' accepting a first argument of type 'int' could be found (are you
missing a using directive or an assembly reference?)

As mentioned before, I'm a rookie using C#, so please be patient.
But I'm willing to learn.

Regards,

Michelle
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top