File dupe finder - when file names are not the same?

Sam

I was wondering if there was a utility that could search a local hard
drive and scan file contents of common file types (documents,
presentations, spreadsheets, text files) and report back those that
seem to be identical or near perfect matches. Maybe using some sort
of percentage match relationship.

With the advent of e-mail file attachments, it's easy to get large
numbers of duplicate files. My e-mail program (at home) is Eudora and
one of its constructs is that all attachments are in fact detached and
stored in a directory. If you get the same file many times - Eudora
dutifully stores them by indexing the files (e.g. sam.txt becomes
sam1.txt, sam2.txt, etc.).

At work, we iterate on lots of documents during preparation (nothing
new I'm sure) and so have many versions up to the final. It would be
nice to find all the similar files then have the option of doing
something with them (sort, group zip, delete, etc.).

If anyone has any idea I'd be glad to hear it. File discipline is one
of my strong points but once overwhelmed it's really hard to clean up
the mess.

Thanks.

Sam
 
(ProteanThread)

I was wondering if there was a utility that could search a local
There is also one called DoubleKiller, a single no-install app; as soon
as I find it I'll post the link, unless someone beats me to it.
 
Susan Bugher

Alex said:

I believe this is the last freeware version:

Program: DupeLocater
Author: Midnight Blue Software
Install: n.i.
W: LFW
Ware: v 1.0.0.1
http://ftp.tdcnorge.no/pub/windows/misc/

More apps are listed here:

http://www.pricelesswarehome.org/acf/P_FILEUTILITIES.php#3.10DuplicateFileChecker

For graphics files see this list:

http://www.pricelesswarehome.org/acf/P_GRAPHICS.php#6.04DuplicateFileChecker;Images

Susan
--
Posted to alt.comp.freeware
Search alt.comp.freeware (or read it online):
http://google.ca/advanced_group_search?q=+group:alt.comp.freeware
Pricelessware & ACF: http://www.pricelesswarehome.org
Pricelessware: http://www.pricelessware.org (not maintained)
 
John Fitzsimons

I was wondering if there was a utility that could search a local hard
drive and scan file contents of common file types (documents,
presentations, spreadsheets, text files) and report back those that
seem to be identical or near perfect matches. Maybe using some sort
of percentage match relationship.

< snip >

If such a program existed then it would be very handy. IIRC there is a
graphic file compare program that gives you a % identical in two files
but I have never heard of one for non-graphics files. If you find one
please post the info here. Most (all?) dupe detectors only register
files as duplicates if the contents match 100%.

Regards, John.
--
****************************************************
,-._|\ (A.C.F FAQ) http://clients.net2000.com.au/~johnf/faq.html
/ Oz \ John Fitzsimons - Melbourne, Australia.
\_,--.x/ http://www.vicnet.net.au/~johnf/welcome.htm
v http://clients.net2000.com.au/~johnf/
 
Sam


just some feedback to the group -

Downloaded and tried DupeLocater - it's rather simplistic but
it does work. It does not provide any "match" data, only file names.
It does catch "incremented" file names. It seems to catch exactly the
same content under totally different file names, but not across formats.

I took a text file 'name.txt' and copied it as mena.xls, and eman.doc
(did not change the format - just the name). Everything worked as
expected. DupeLocater identified the two new files as dupes of the
original.

Then I took mena.xls - and saved it using Excel as mena2.xls as a
worksheet, and I used Word to save eman.doc as eman2.doc. This didn't
work as a document. DupeLocater still reported the renamed text files
as dupes but the Word & Excel formatted files were not reported even
though to me (the user) the "content" was the same.

Going one step further - I took the mena2.xls (real) excel file and
copied it to drat3.xls and similarly for eman2.doc (real) word saved
as lost.doc. This did work in that DupeLocater identified the two new
files as dupes of their respective originals.

It will sometimes report non-matching files - I haven't figured out why.
I have a couple of index files (binary formats) that get reported as
dupes of unrelated list files. The files are not the same size (or
anywhere close). The content of the list files is (it appears)
something like rtf (text + markup) or it could be a delimited file in
some way. The index files are binary format (I developed the program
that creates them.) The good news is that they are so dissimilar that
it's easy to catch.

And lastly - just for fun; I took a key phrase which appears in all
these files and plugged it into Google Desktop Search and it reported
back all the files regardless of format.

end of report...

Sam
 
David

Then I took mena.xls - and saved it using Excel as mena2.xls as a
worksheet, and I used Word to save eman.doc as eman2.doc. This didn't
work as a document. DupeLocater still reported the renamed text files
as dupes but the Word & Excel formatted files were not reported even
though to me (the user) the "content" was the same.

Which is what I would expect. Did you think to compare the file sizes
after the Excel and Word saves? They will be much larger and thus
could not possibly be considered duplicates.

And lastly - just for fun; I took a key phrase which appears in all
these files and plugged it into Google Desktop Search and it reported
back all the files regardless of format.

DupeLocater is not about the content of the files but about whether
they are identical or not. One byte of difference makes files
non-identical.
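
To make the "one byte of difference" point concrete, here is a minimal Python sketch of a byte-for-byte comparison. It is only an illustration: the thread does not say how DupeLocater is implemented, and the function name is_exact_duplicate is made up for this example.

```python
import filecmp
import os


def is_exact_duplicate(path_a: str, path_b: str) -> bool:
    """Return True only if the two files contain exactly the same bytes.

    Hypothetical helper for illustration; not taken from any tool
    mentioned in this thread.
    """
    # Files of different sizes can never be identical, so there is no
    # need to read their contents at all.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    # shallow=False makes filecmp compare the actual contents instead of
    # trusting the os.stat() signature (size and modification time).
    return filecmp.cmp(path_a, path_b, shallow=False)
```

Under this test a renamed copy (name.txt copied to mena.xls) counts as a duplicate, while the same text re-saved by Word or Excel does not, because the saved file carries extra formatting bytes.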
 
Sam


DupeLocater is not about the content of the files but about whether
they are identical or not. One byte of difference makes files
non-identical.

Which sadly is the point. That is why I was looking for "match data"
in my original post. If DupeLocater (just as an example) had reported
that the content of File X is 99.8% the same as File Y - I would have
more to go on.
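
As an illustration of the kind of "match data" meant here, the sketch below uses Python's standard difflib module to turn two pieces of text into a percentage. It is an assumption about how such a figure could be computed, not a description of any program named in this thread, and the function name percent_match is hypothetical.

```python
import difflib


def percent_match(text_a: str, text_b: str) -> float:
    """Return a similarity score between 0.0 and 100.0 for two strings.

    Hypothetical helper for illustration. The score is difflib's
    ratio(): twice the number of matching characters divided by the
    combined length of both strings, scaled to a percentage.
    """
    # autojunk=False turns off difflib's heuristic that ignores very
    # frequent characters in long inputs, so the score stays predictable.
    matcher = difflib.SequenceMatcher(None, text_a, text_b, autojunk=False)
    return 100.0 * matcher.ratio()
```

Two drafts of the same document would score close to 100, and unrelated files close to 0. The comparison time grows quickly with file size, so this approach is more practical for documents than for large binaries.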

In today's world of streaming information, we need 'similar.' But this
implies that the tools contain the necessary mechanisms to decipher
file formats. This is particularly notable in the plain text file
versus the MS Word document.

I am sadly all too familiar with the computer's binary view of files -
and I don't really care (in this context). What I care about is the
user's view of files. If the text, in a text formatted file, is
_identical_ to the text in a Word document, I think the tools should
report that in some way. The fact that their checksum is different is
not the deciding factor. It _is_ about the content.
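
A rough sketch of what "comparing the user's view" could look like follows. It makes a loud assumption: the Word file is in the newer .docx format (a zip archive containing XML), not the older binary .doc format, which cannot be read this simply. The function names docx_text, normalize and same_visible_text are hypothetical, and only Python's standard library is used.

```python
import re
import xml.etree.ElementTree as ET
import zipfile

# XML namespace used for the main document part of a .docx file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_text(path: str) -> str:
    """Extract the visible paragraph text from a .docx file (assumed format)."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def normalize(text: str) -> str:
    """Collapse all whitespace so line wrapping does not affect the result."""
    return re.sub(r"\s+", " ", text).strip()


def same_visible_text(txt_path: str, docx_path: str) -> bool:
    """Return True if a text file and a .docx file hold the same words."""
    with open(txt_path, encoding="utf-8", errors="replace") as handle:
        plain = handle.read()
    return normalize(plain) == normalize(docx_text(docx_path))
```

The two files would still differ byte for byte, yet this check reports them as the same content. Headers, footnotes, text boxes and tracked changes are ignored in this sketch.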

That is why I made the point about Google DTS. While DupeLocater
determined that the files are not 'identical,' Google dutifully
reported however that the 'content' - regardless of format - was found
in every file.

There are other tools and I'm still looking. I'm patient and not all
that old.

Sam
 
Alan

Hi Sam,

I am sorry to post here, since NoClone, which I am recommending, is
shareware, but if freeware can't finish your job, why not shareware?
NoClone finds duplicate files based on file contents regardless of
file name; it compares files byte by byte. Its time-saving Smart
Marker filters duplicates for removal. It also offers image previews
and flexible removal/archival options.
http://noclone.net

Alan
 
Sam

Thanks, I'll take a look

Sam
 
John Fitzsimons

I am sorry to post here as NoClone being recommended is a shareware,

No need to be sorry. Just post about freeware and you will have
nothing to be sorry for. If you need to be sorry be sorry for being
too lazy to post your answer after Sam's question.

but if freeware can't finish your job, why not shareware?

Perhaps because it doesn't do what he wanted?

NoClone compares duplicate files based on file contents regardless of
file name, she compares file byte by byte. Unique time-saving Smart
Marker filters duplicates for removal. Preview images and flexible
removal/archival options.
http://noclone.net

It looks to me that that would be a complete waste of time/effort for
Sam, and probably many others, here. There are a number of freeware
programs that find identical files but with different names etc. That
isn't what Sam was asking for.

A quick test of the above showed that it was far inferior to a number
of freeware duplicate checkers I compared it to. For example, from
what I could see at a quick check if two files were identical to a
third then only two files would be listed. Not all three.
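
For anyone who wants to check that behaviour independently, a minimal Python sketch of grouping exact duplicates is shown here, so that three identical files come back as one group of three. This is an assumed approach (size check first, then a SHA-256 digest of the contents), not the method of any checker mentioned in this thread; file_digest and find_duplicate_groups are made-up names.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path: str, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_groups(root: str) -> list:
    """Return a list of groups; each group lists paths with identical bytes."""
    # Step 1: bucket files by size, since different sizes cannot match.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    # Step 2: within each size bucket, bucket again by content digest.
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        # Keep every file in the group, not just the first matching pair.
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Because every file with the same digest lands in the same list, a file with two copies is reported together with both of them.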

Also, if two files were 95% the same then noclone wouldn't tag them
as "near perfect matches".

It is possible noclone does either/both of these things but at first
look it seems to be pretty useless for the job in question.

Regards, John.
 
Sam

Thanks for saving me the trouble of looking. I've been distracted
lately by more pressing issues and have not pursued this topic very
much. It will be a few more days until I can get back to it. Thanks
again.

Sam
 
