What is the best algorithm to check for duplicated rows?

Ryan Liu

Hi,

If I have tens of thousands of DataRows in a DataTable and allow the end user to
pick any DataColumn(s) to check for duplicated rows, is there a better API or
algorithm for this purpose, given how large the data is?

Thanks a lot!
Ryan
 
Michael Nemtsev

Hello Ryan,

Hmm,
I see two ways: using a hashtable, or sorting plus binary search.
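
A minimal sketch of the hashtable approach for a single column, assuming the column values can be compared as strings. The class name `DuplicateFinder` and the method `FindDuplicates` are hypothetical names chosen for illustration, and skipping null or empty values is an assumption, not something stated in this thread.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public static class DuplicateFinder
{
    // Returns every row whose value in columnName already appeared on an earlier row.
    public static List<DataRow> FindDuplicates(DataTable table, string columnName)
    {
        // Ordinal comparison is an assumption; pick a comparer that matches your data.
        var seen = new HashSet<string>(StringComparer.Ordinal);
        var duplicates = new List<DataRow>();

        foreach (DataRow row in table.Rows)
        {
            // Assumption: null and empty values are skipped, not treated as duplicates.
            if (row.IsNull(columnName)) continue;
            string key = row[columnName].ToString();
            if (key.Length == 0) continue;

            // HashSet.Add returns false when the key is already in the set.
            if (!seen.Add(key))
                duplicates.Add(row);
        }

        return duplicates;
    }
}
```

Each row is looked at once and each lookup is a hash probe, so the work grows roughly linearly with the number of rows and no sorting is needed.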

---
WBR, Michael Nemtsev [.NET/C# MVP].
My blog: http://spaces.live.com/laflour
Team blog: http://devkids.blogspot.com/

"The greatest danger for most of us is not that our aim is too high and we
miss it, but that it is too low and we reach it" (c) Michelangelo

 
Guest

Ryan,
The first question I would ask is: how did you get tens of thousands of rows
into this DataTable? If they came out of a database, shouldn't that be
where you are enforcing your referential and unique column integrity?
Peter
 
Ryan Liu

Hi Peter,

The data is imported by the end user from an external file, most of the time
a CSV text file.

After the import into the DataTable, the end user specifies the criteria used
to pick rows from the DataTable.

Then, for the selected rows, the end user wants to check for duplicated lines
before inserting them into the database. The criteria for checking duplicated
lines are also specified by the end user.

I am also required to check for duplicated entries against data already in
the database.

Thank you, and thanks to everyone who replied to this message!

Ryan
 
Ryan Liu

Thanks!

I just hope the hash algorithm for strings is as efficient as the one for ints.

Also, the criteria for checking duplicated DataRows could be based on multiple
DataColumns (AND logic), which makes it difficult for me to come up with a hash
algorithm.

Ryan
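
One possible way to handle several columns at once is to treat the selected column values as a composite key and combine their individual hash codes, so no special hash algorithm has to be invented. The sketch below is only an illustration: `RowKeyComparer` and `MultiColumnDuplicateFinder` are hypothetical names, values are compared with `object.Equals` (so strings are case-sensitive here), and two DBNull values in the same column are assumed to count as equal.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

// Compares two arrays of column values element by element.
public sealed class RowKeyComparer : IEqualityComparer<object[]>
{
    public bool Equals(object[] x, object[] y)
    {
        if (x.Length != y.Length) return false;
        for (int i = 0; i < x.Length; i++)
            if (!object.Equals(x[i], y[i])) return false;
        return true;
    }

    public int GetHashCode(object[] key)
    {
        // Combine the per-column hash codes; 17 and 31 are conventional small primes.
        unchecked
        {
            int hash = 17;
            foreach (object value in key)
                hash = hash * 31 + (value == null ? 0 : value.GetHashCode());
            return hash;
        }
    }
}

public static class MultiColumnDuplicateFinder
{
    // Returns rows whose values in ALL of the given columns match an earlier row.
    public static List<DataRow> FindDuplicates(DataTable table, params string[] columnNames)
    {
        var seen = new HashSet<object[]>(new RowKeyComparer());
        var duplicates = new List<DataRow>();

        foreach (DataRow row in table.Rows)
        {
            // Build the composite key from the user-selected columns.
            var key = new object[columnNames.Length];
            for (int i = 0; i < columnNames.Length; i++)
                key[i] = row[columnNames[i]];

            // Add returns false when an equal key was already stored.
            if (!seen.Add(key))
                duplicates.Add(row);
        }

        return duplicates;
    }
}
```

A call such as `MultiColumnDuplicateFinder.FindDuplicates(table, "A", "B")` (with placeholder column names) would report a row only when both columns match an earlier row, which corresponds to the AND logic described above.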
 
Assuming "DN" is the column you want to check for duplicates:

DataTable data = ???;

string duplicatedNumber = string.Empty;

// Sorting by DN puts rows with equal values next to each other.
DataRow[] dr = data.Select("DN<>'' AND DN IS NOT NULL", "DN");

string prevKey = string.Empty;
string lastReported = null;

for (int row = 0; row < dr.Length; row++)
{
    string key = dr[row]["DN"].ToString();

    // Report each duplicated value once. Comparing with the last reported
    // value avoids the substring false matches that IndexOf can produce
    // (e.g. "23" being found inside an already reported "123").
    if (prevKey.Equals(key) && key != lastReported)
    {
        duplicatedNumber += string.Format("{0}<br>", key);
        lastReported = key;
    }
    prevKey = key;
}

I have 12,000+ rows, and it takes a second to locate the duplicates.
 
