Comparing Files - quickly and definitively


Smithers

I would appreciate some recommendations for programmatically determining if
files differ.

I'm writing a utility that backs up files that customers upload to Web
sites. Rather than mindlessly copying any/all files from each Web site to
the backup server (and wasting space), I'm looking to copy only files that
have been modified since the last backup took place. The files include
anything from PDF to GIF/JPG to XML, text, etc. Max size is currently under
5MB, but that could be increased later depending on customer demand.

I understand that I can look at the LastModified date or other file
properties, but I would prefer something more reliable. By "more reliable" I
mean this: I have noticed that the time can differ by a couple of seconds
after copying a file from one server to another. If the logic compared
those date/times directly, we would get "false positives" - files that
appear to be newer (different) based on date/time, but are in fact no
different. At least, that scenario would arise if the logic compared the
last backup (on the backup server) against the current file on a Web
server.
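
To illustrate, here's roughly the kind of comparison I mean (a quick Python
sketch, not our actual code; the two-second tolerance is just my guess at
the skew I've been seeing):

import os

# Treat timestamps within this many seconds as "the same", so copy-induced
# skew doesn't register as a change. (The value is an arbitrary allowance,
# not a measured figure.)
TOLERANCE_SECONDS = 2.0

def looks_newer(source_path, backup_path):
    """Return True if the source file's modification time is newer than
    the backup's by more than the tolerance."""
    src_mtime = os.path.getmtime(source_path)
    bak_mtime = os.path.getmtime(backup_path)
    return src_mtime > bak_mtime + TOLERANCE_SECONDS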

So I'm thinking that there may be a more reliable way to determine if the
file content is actually different. While it would be a no-brainer to open
each file and compare the contents, that could be a rather costly
operation - given the large number of files to potentially compare, and
their potentially large sizes.

So I'm looking for a reliable means through which to determine which files
have, in fact, been changed - and make that determination with fast
performance.

Suggestions? Ideas?

Thanks!

-S
 

Peter Duniho

Smithers said:
[...]
So I'm looking for a reliable means through which to determine which files
have, in fact, been changed - and make that determination with fast
performance.

Depends on your definition of "reliable". Many backup programs use only
the filename, size, and modified date to determine whether the file has
changed. Some even just use the archive bit. When they use these
things, they make sure that they copy not only the file but also the
file attributes they are checking. So if you are relying on the
modified date, for example, you'd have to copy the modified date too (I
know that Windows Explorer does this when copying files by hand).

But since these things aren't actually tied to the file contents, they
aren't 100% reliable, though they often are "good enough".

If you really want to know whether the file is different, you have to
compare it somehow. A common method would be to generate and store an
MD5 hash of the file, and then generate the same hash for the file that
is eligible for copying. If the hashes are the same, don't copy.

Of course, you would check the file size first, since that's a quick way
to know for sure if the files are different. :)

There is a theoretical possibility of hash collisions even using that
technique, so technically speaking it's not 100% reliable. But it's far
more reliable than looking just at date and file size, and is probably
good enough for almost any real-world application.
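
Something along these lines, just as a sketch (Python for brevity; where
the stored size and hash come from - a manifest file, a database - is up
to you):

import hashlib
import os

def md5_of_file(path, chunk_size=64 * 1024):
    # Hash the file in chunks so large files needn't fit in memory.
    digest = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()

def needs_copy(path, stored_size, stored_md5):
    # A size difference means the file has definitely changed; no need
    # to read or hash it at all.
    if os.path.getsize(path) != stored_size:
        return True
    # Same size: compare content hashes.
    return md5_of_file(path) != stored_md5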

Pete
 

Smithers

Thanks Pete - hadn't thought about the hashing alternatives.

"good enough" is criteria I can live with on this. An occasional false
positive won't be the end of the world. It would simply mean that we archive
a file unnecessarily. No big deal. I think I'll go with a comparison of the
date/times after all, do a bunch of testing, and if there are very few false
positives, then we'll be done. We can go with more involved analyses and
possibly hashing if we need to tighten things up later.
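
Roughly what I have in mind, then - cheap checks now, hashing held in
reserve (a Python sketch; the two-second tolerance is just a guess at the
copy skew):

import os

def probably_changed(source_path, backup_path, tolerance=2.0):
    # Cheapest check first: a size difference is a definite change.
    if os.path.getsize(source_path) != os.path.getsize(backup_path):
        return True
    # Next, the date/time comparison, with a small tolerance so copy
    # skew doesn't show up as a false positive. A content hash could
    # slot in here later if this proves too loose.
    skew = os.path.getmtime(source_path) - os.path.getmtime(backup_path)
    return skew > tolerance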

-S


 

Peter Duniho

Smithers said:
[...]
I think I'll go with a comparison of the date/times after all, do a bunch
of testing, and if there are very few false positives, then we'll be done.

I think false negatives are probably the bigger problem. And they can
occur with either approach, though IMHO they are unlikely with either.

You could have a file with the same name and modification date and time,
but which isn't actually the one that's been archived. There are ways
the user can force this situation, but it's even theoretically possible
simply as an accident.

Likewise, hashes can collide, so it's possible that a hash comparison
would report the files as identical even though they are different and
the file needs archiving.

Even the date/time false negative is extremely unlikely IMHO, and the
hash is probably (many) orders of magnitude less likely than that. So I
don't really think they are of great concern. I just want to ensure
those issues aren't overlooked. It's fine to call it "good enough",
but one needs to at least be aware of them.

Pete
 

Smithers

Agreed. Failing to back up a modified file (a false negative) is, at least
in our case, far worse than backing up a file unnecessarily. Part of our
strategy for protecting against that is that we back up all files,
whether modified or not, on a weekly basis - which is acceptable given
the nature of the data. In fact, we have been doing this on a _daily_ basis
for the past couple of years now. I'm looking to get away from the _daily_
full backup and go with weekly full backup, with incremental backups between
full backups - in order to reduce the amount of space taken up
unnecessarily. Separately, we always advise customers to maintain their own
local copies. And YES - I know this has practically nothing to do with us
meeting our own SLA. But experience shows that that's where customers have
typically gone anyway - i.e., they get their own backups without even asking
us for them, even though we had 'em ready to go if necessary.

-S
 
