File Duplication check

  • Thread starter: giftson.john

giftson.john

Hi,

I am creating an application that migrates all documents from one
repository to another. Before migration I have to verify that
all the documents are unique; no duplicates should be uploaded. Even
the document's created date, modified date, and filename can be different.
How can I find out whether a document is a duplicate?

What I did was create a file, use Save As, and save it to another
location. I am not able to detect that the document is a duplicate. I have
tried an MD5 hash, a CRC check, and SHA-1. Everything gives different values.

Can anyone give me a solution for this?

Thanks in advance.

Giftson John
 
John,

Well, using a hash is the right way to go, but I don't understand why
everything gives you different values. I mean, if you have no duplicates,
then yes, you SHOULD get different values.

What you have to do is scan the contents of the directory, hashing each
file as you go. You then store the values of the hashes. While scanning
the directory, you check the value of the hash against the list you have
already compiled. If the hash exists in the list, then the two files could
be duplicates (to be completely accurate, you then have to compare the two
files byte by byte, since different files can share a hash).

Hope this helps.
 
--
- Nicholas Paldino [.NET/C# MVP]
- (e-mail address removed)
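
The scan described above can be sketched roughly as follows. This is only an illustration in Python, not code from the thread: the names `file_hash` and `find_duplicates` are made up for the example, and SHA-256 is an arbitrary choice of hash. It flags only byte-identical files, so two Word documents that differ in embedded metadata will still count as different.

```python
import filecmp
import hashlib
import os
from collections import defaultdict


def file_hash(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return (first_seen, duplicate) path pairs for byte-identical files under root."""
    seen = defaultdict(list)  # hash -> paths already accepted as unique
    duplicates = []
    for folder, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(folder, name)
            key = file_hash(path)
            for candidate in seen[key]:
                # Same hash: confirm with a full byte-by-byte comparison.
                if filecmp.cmp(candidate, path, shallow=False):
                    duplicates.append((candidate, path))
                    break
            else:
                # No identical file seen so far; remember this one.
                seen[key].append(path)
    return duplicates
```

Calling `find_duplicates` on the source folder before migration would give the list of files to skip; names and dates play no part in the comparison.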





Hi Nicholas,

I was a bit confused about the MD5 hashing.

Could you please tell me how to compare the contents of Word
documents? MS Word stores a set of metadata in the file, so
even when the file contents are the same, the metadata difference gives
a different MD5 hash value.

Thanks for your help.
 
That's the thing: from a file perspective, the files are different
because the metadata is embedded in the file. If you want to check whether
specific portions of the file are different, then you are going to have
to open the file using Word and compare word for word, style for
style, and so on. Not an easy task.


--
- Nicholas Paldino [.NET/C# MVP]
- (e-mail address removed)

A possibly simpler alternative would be to save each file as a text
file and compute the hash on the text files. I say simpler because the
files could be exported in code in a few simple lines. Open file, save
as text, close file, done :)
 