Identify File Difference

Josh Carlisle · Dec 8, 2004

I've been tasked with developing a document/file versioning system of sorts.
Similar to a very scaled down source control system. We're integrating it
very closely with an existing application so it must be 100% home grown.
Similar to how many source code controls systems behave I'd like to only
save only differences from the original in each of the versions and then
reassemble those differences when the latest version is requested. I have
some vague ideas but I thought I'd get some others opinions. What makes it a
little more complex is that I have to apply this to files with possible
binary data in them also. Can anyone point me into the direction of some
usefull or helpful framework classes or hints? Thanks!

Josh

W.G. Ryan eMVP · Dec 8, 2004

You can generate a Hash on both files and compare the hashed value.
http://www.dotnetspider.com/technology/KBPages/397.aspx

http://dotnetjunkies.com/WebLog/darrell.norton/archive/2004/07/20/19820.aspx

Josh Carlisle · Dec 8, 2004

Thanks for the quick reply

This looks like a good starting point, although the hashing algorithm will
only tell me if there is a difference but not what the differences are. Non
the less this is a very useful exercise for verification prior to actual
byte by byte comparison. Are you aware of any algorithms that are helpful in
this area? I was initially considering doing a byte for byte comparison for
actually identifying the changes (so only the changes are stored) but I
wasn't sure if there was a better way. I was about to check out the source
code of some open source windiff type projects but like most developers I'm
looking for some shortcuts

Anyway thanks again for the two links and any further advice you may have
would be appreciated.

Josh

Nick Malik · Dec 8, 2004

Hi Josh,

Comparing differences between files, especially binary files, in an
interesting topic. It has been described in various places. I did a little
google searching and found this excerpt:
Tower, and Paul Eggert. Wayne Davison designed and implemented the unified
output format. The basic algorithm is described in "An O(ND) Difference
Algorithm and its Variations", Eugene W. Myers, Algorithmica Vol. 1 No. 2,
1986, pp. 251--266; and in "A File Comparison Program", Webb Miller and
Eugene W. Myers, Software--Practice and Experience Vol. 15 No. 11, 1985, pp.
1025--1040. The algorithm was independently discovered as described in
"Algorithms for Approximate String Matching", E. Ukkonen, Information and
Control Vol. 64, 1985, pp. 100--118. <<

The cited articles are probably a good place to start. Unfortunately, I
have not read them, so I cannot comment on the algorithm itself.

I will say one thing though: modern code control systems do NOT store the
original file and then store the differences to get newer versions.

Modern systems store the MOST RECENT file and store the differences needed
to recreate Previous versions (since 99% of the time, you don't want the
first version... you want the most recent one.)

Also, given the low cost of hard drive space and the ability to simply
compress prior versions, you may want to simply consider keeping the entire
contents of each version of each file, simply compressing old versions to
save space.

One more thing to look at: If you have Windows Server 2003, you can
download, for free, Windows Sharepoint Services. This system gives you
simple document management capabilities, including the ability to set up a
virtual "folder" tree that contains "files" where you can store every
version of any or all files. It's pretty nice, and because it's free, you
would avoid most licensing issues. That's the upside. The downside: it
only runs on Windows Server 2003. If your customers cannot upgrade their
OS, then this can't be used as your back end. Still, it's worth
considering, if for no other reason that to simply Write Less Code.

Good Luck. I hope this helps,

--- Nick

Josh Carlisle · Dec 8, 2004

Nick,

I actually downloaded the windiff code so I'm going to take a look at that
but GNU diff looks interesting also. Luckily I'm comfortable enough with c
and c++ to get algorithms out of the code I need so hopefully that will be
usefull. Luckily we're not looking at some of the other features of most
source code controls systems (branching, merging, etc) so I'm hoping to keep
it simple. Also because of some of the unique aspects of what we are tieing
it into using a product like sharepoint isn't possible. However you make a
very good point about alternative mechansims. I had done some googling and
found some references about only storing differences so not knowing any
better I started to pursue that route but after reading your statement that
most modern systems store the current and only the differences for historic
purposes is some very valuable information and does make more sense. I had
originally thought of storing the complete versions but discounted it for
space concerns but I'm actually heading that direction more to reduce
complexity and as you say with the use of some compression (which I've used
with some remoting sinks in the past) makes it a very feasible.

Thanks for your replies.

Josh

Identify File Difference

Josh Carlisle

W.G. Ryan eMVP

Josh Carlisle

Nick Malik

Josh Carlisle