File Duplication check

giftson.john

Hi,

I am creating an application that migrates all documents from one
repository to another. Before migration I have to verify that all the
documents are unique; no duplicates may be uploaded. The created date,
modified date, and filename can differ between copies of the same
document. How can I tell whether a document is a duplicate?

What I did was create a file, use Save As to save a copy to another
location, and compare the two. I am not able to detect that the copy is
a duplicate. I have tried MD5, CRC, and SHA-1; each gives different
values for the two files.

Can anyone give me a solution for this?

Thanks in advance.

Giftson John
 
Nicholas Paldino [.NET/C# MVP]

John,

Well, using a hash is the right way to go, but I don't understand why
everything gives you different values. I mean, if you have no duplicates,
then yes, you SHOULD get different values.

What you have to do is scan the contents of the directory, hashing each
file as you go. You then store the values of the hashes. While scanning
the directory, you check the value of the hash against the list you have
already compiled. If the hash exists in the list, then the two files could
be duplicates (to be completely accurate, you have to compare the two
files byte by byte at that point, because different files can share a
hash).
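
The scan described above can be sketched roughly as follows. This is only an illustration in Python, not code from the thread: the function names `file_digest` and `find_duplicates` are made up for the example, and SHA-256 is an arbitrary choice of hash (MD5 or SHA-1 would slot in the same way).

```python
import filecmp
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return (first_seen, duplicate) path pairs for files under root."""
    seen = {}        # digest -> paths already kept with that digest
    duplicates = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        digest = file_digest(path)
        for kept in seen.get(digest, []):
            # A matching hash only suggests a duplicate; confirm byte by byte.
            if filecmp.cmp(kept, path, shallow=False):
                duplicates.append((kept, path))
                break
        else:
            # No earlier file has identical bytes, so remember this one.
            seen.setdefault(digest, []).append(path)
    return duplicates
```

Calling `find_duplicates("some/folder")` returns pairs of paths whose bytes are identical. File names and timestamps play no part, since only the file contents are read.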

Hope this helps.
 
giftson.john

Hi Nicholas,

I was a bit confused about the MD5 hashing.

Could you please tell me how to compare the contents of Word
documents? MS Word stores a set of metadata in each file, so even when
the document contents are the same, the metadata difference gives a
different MD5 hash value.

Thanks for your help.
 
Nicholas Paldino [.NET/C# MVP]

That's the thing, from a file perspective, the files are different
because the metadata is embedded in the file. If you want to check to see
if specific portions of the file are different, then you are going to have
to open the file using Word, and then compare word for word, style for
style, etc, etc. Not an easy task.
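
As a rough illustration of comparing document content rather than file bytes, here is a Python sketch that rests on a big assumption the thread does not state: that the files are in the .docx (Office Open XML) format, which is a ZIP package whose body text lives in `word/document.xml`. It does not handle legacy binary .doc files, and it compares text only, not styles. The names `docx_text` and `docx_content_digest` are made up for the example.

```python
import hashlib
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_text(path):
    """Return the body text of a .docx file, one line per paragraph."""
    with zipfile.ZipFile(path) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Text is split across runs; join the w:t pieces of each paragraph.
        runs = [t.text or "" for t in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def docx_content_digest(path):
    """Hash only the extracted text, ignoring metadata parts of the package."""
    return hashlib.sha256(docx_text(path).encode("utf-8")).hexdigest()
```

Two files with the same text but different author or save-date metadata should get the same `docx_content_digest`. Two files that differ only in formatting would also match, so this is weaker than the style-for-style comparison described above.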


--
- Nicholas Paldino [.NET/C# MVP]
- (e-mail address removed)

DeveloperX

A possibly simpler alternative would be to save each file as a text
file and compute the hash on the text files. I say simpler because the
files could be exported in code in a few simple lines. Open file, save
as text, close file, done :)
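
A sketch of the hashing half of that idea, in Python. It assumes the documents have already been exported to plain-text files by some other means (the export step through Word is not shown), and that the export is UTF-8 unless another `encoding` is passed. The names `text_digest` and `group_by_text` are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def text_digest(path, encoding="utf-8"):
    """Hash a text export after normalizing line endings and trailing spaces."""
    text = Path(path).read_text(encoding=encoding)
    lines = [line.rstrip() for line in text.splitlines()]
    normalized = "\n".join(lines).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_by_text(paths, encoding="utf-8"):
    """Return lists of paths whose normalized text is identical."""
    groups = defaultdict(list)
    for path in paths:
        groups[text_digest(path, encoding)].append(path)
    return [group for group in groups.values() if len(group) > 1]
```

The normalization step is a design choice, not a requirement: it keeps trivial differences such as CRLF versus LF line endings from hiding a duplicate. Formatting differences are lost in a text export, so matches are duplicates in text only.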
 
