Fastest way to sort

Bill

What is the fastest way to sort two columns and one million rows?

I currently have no database for this application and am wondering if
I should distribute Access or SQL Server Express to speed up a sort
routine.
I am currently adding rows, one at a time, to a dataset. The data
comes from a text file, where I parse out two sets of data, one a
number and the second an alpha string. I insert them into the dataset.
At certain points I need to sort based on the first column (number), and
I create a DataView for this and use the DataView Sort method. I then
write the dataset out to a text file, clear the dataset and start
adding data for the next sort. When I have to sort a large number of
rows, it is very slow. One million rows takes over three minutes for
the sort.

What would be faster? I tried Access, but with a large number of rows
the inserts slowed down, and I could not find a way to do a bulk insert
with Access; if there is one, then maybe Access would be faster than
what I have. I tried the insert with and without having the first
column indexed in the table, and it was slow both ways. Does SQL
Server Express support any type of bulk loading?

Other than the need to sort two columns, I have a lot of control over
the data source, etc.
 
Scott M.

Bill said:
What is the fastest way to sort two columns and one million rows?

I currently have no database for this application and am wondering if
I should distribute Access or SQL Server Express to speed up a sort
routine.

SQL Server will generally yield better performance than Access in all
areas.
I am currently adding rows, one at a time, to a dataset. The data
comes from a text file, where I parse out two sets of data, one a
number and the second an alpha string. I insert them into the dataset.
At certain points I need to sort based on the first column (number), and
I create a DataView for this and use the DataView Sort method. I then
write the dataset out to a text file, clear the dataset and start
adding data for the next sort. When I have to sort a large number of
rows, it is very slow. One million rows takes over three minutes for
the sort.

The time it takes to do the sort is affected by several things, not just
the .NET object containing the data. But, to be sure, sorting a million
records will take some time.
What would be faster? I tried Access, but with a large number of rows
the inserts slowed down, and I could not find a way to do a bulk insert
with Access; if there is one, then maybe Access would be faster than
what I have. I tried the insert with and without having the first
column indexed in the table, and it was slow both ways. Does SQL
Server Express support any type of bulk loading?
http://dotnetslackers.com/Community.../sql-bulk-insert-and-ado-net-sqlbulkcopy.aspx
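
For what it's worth, a minimal SqlBulkCopy sketch along the lines of that
article might look like this (the connection string and table name below
are illustrative placeholders you would replace with your own):

// Sketch: bulk-load a DataTable into SQL Server Express via SqlBulkCopy.
// The connection string and "dbo.SortData" are assumptions for the demo.
using System.Data;
using System.Data.SqlClient;

class BulkLoadDemo
{
    static void Load(DataTable table)
    {
        string connStr = @"Server=.\SQLEXPRESS;Database=Scratch;Integrated Security=true";
        using (SqlConnection conn = new SqlConnection(connStr))
        {
            conn.Open();
            using (SqlBulkCopy bulk = new SqlBulkCopy(conn))
            {
                bulk.DestinationTableName = "dbo.SortData";
                bulk.BatchSize = 10000;       // commit in chunks rather than row by row
                bulk.WriteToServer(table);    // streams all rows in one call
            }
        }
    }
}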


Other than the need to sort two columns, I have a lot of control over
the data source, etc.

Have you considered using a generic Dictionary(Of T) type? These generic
types support sorting and may prove to be quicker than sorting a DataSet via
a DataView.

-Scott
 
Peter Duniho

Bill said:
What is the fastest way to sort two columns and one million rows?

In-memory quicksort, which is what you get if you use .NET's built-in
sorting methods.
I currently have no database for this application and am wondering if
I should distribute Access or SQL Server Express to speed up a sort
routine.

If you need a database-like API for other reasons, using SQL Server
Express or similar could be helpful. I doubt it would speed up your
sort times though. Even if you could host the database on a different,
much faster computer, the i/o adding data to the database would probably
be prohibitive, consuming whatever gain you got by offloading the sort
routine.
I am currently adding rows, one at a time, to a dataset. The data
comes from a text file, where I parse out two sets of data, one a
number and the second an alpha string. I insert them into the dataset.
At certain points I need to sort based on the first column (number), and
I create a DataView for this and use the DataView Sort method.

Why are you using a DataSet and DataView? Certainly one possible issue
is the use of overly complex data structures like that, rather than a
simple array or List<T>. If all you are doing is parsing input,
sorting, and then writing it back out, I would expect DataSet to be
overkill and possibly less efficient.
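
As a rough sketch of what I mean (assuming, purely for illustration,
comma-delimited input and the file names "input.txt"/"output.txt";
adjust to your real format):

// Sketch: parse "number,string" lines, sort on the number, write back out.
using System;
using System.Collections.Generic;
using System.IO;

struct Row
{
    public int Num;
    public string Text;
}

class SortDemo
{
    static void Main()
    {
        var rows = new List<Row>();
        foreach (string line in File.ReadAllLines("input.txt"))
        {
            string[] parts = line.Split(',');
            rows.Add(new Row { Num = int.Parse(parts[0]), Text = parts[1] });
        }

        // Sort on the first column (the number) only.
        rows.Sort((a, b) => a.Num.CompareTo(b.Num));

        using (StreamWriter writer = new StreamWriter("output.txt"))
        {
            foreach (Row row in rows)
                writer.WriteLine("{0},{1}", row.Num, row.Text);
        }
    }
}
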
I then
write the dataset out to a text file, clear the dataset and start
adding data for the next sort. When I have to sort a large number of
rows, it is very slow. One million rows takes over three minutes for
the sort.

Frankly, a few minutes to sort a million records sounds pretty
reasonable to me. But then, I'm obviously a bit out-of-date. I did a
quick test using the plain Array.Sort() method, and I can sort 1 million
records in under 2 seconds.
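
Incidentally, Array.Sort() has an overload that sorts two parallel arrays
together by the first one, which maps directly onto the two-column case
here. A tiny illustration (the values are made up):

// Sketch: Array.Sort(keys, items) reorders both arrays by the keys.
int[] numbers = { 42, 7, 19 };         // column 1: the sort key
string[] strings = { "c", "a", "b" };  // column 2: follows the keys
Array.Sort(numbers, strings);          // numbers ascending; strings reordered to match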

Seems to me the most direct way to improve performance is to stop
sorting in the context of a database. :)
What would be faster? I tried Access, but with a large number of rows
the inserts slowed down, and I could not find a way to do a bulk insert
with Access; if there is one, then maybe Access would be faster than
what I have. I tried the insert with and without having the first
column indexed in the table, and it was slow both ways. Does SQL
Server Express support any type of bulk loading?

Other than the need to sort two columns, I have a lot of control over
the data source, etc.

How much control? Can you get the data to appear in the original text
file already in sorted order? That would allow for the best
performance. :)

Pete
 
Vista Succubus Hunter

Bill said:
What is the fastest way to sort two columns and one million rows?

I currently have no database for this application and am wondering if
I should distribute Access or SQL Server Express to speed up a sort
routine.
I am currently adding rows, one at a time, to a dataset. The data
comes from a text file, where I parse out two sets of data, one a
number and the second an alpha string. I insert them into the dataset.
At certain points I need to sort based on the first column (number), and
I create a DataView for this and use the DataView Sort method. I then
write the dataset out to a text file, clear the dataset and start
adding data for the next sort. When I have to sort a large number of
rows, it is very slow. One million rows takes over three minutes for
the sort.

What would be faster? I tried Access, but with a large number of rows
the inserts slowed down, and I could not find a way to do a bulk insert
with Access; if there is one, then maybe Access would be faster than
what I have. I tried the insert with and without having the first
column indexed in the table, and it was slow both ways. Does SQL
Server Express support any type of bulk loading?

Other than the need to sort two columns, I have a lot of control over
the data source, etc.


I would just use List<T>.Sort.

<http://dotnetslackers.com/Community.../How-to-sort-a-generic-List_3C00_T_3E00_.aspx>

The Product class/object would be what you're sorting; the name is
whatever you choose to call it. Note that for the parameterless Sort()
call below, Product must implement IComparable<Product> (or you can pass
Sort a comparison delegate instead).

var products = new List<Product>();

// In your reader loop, build one Product per record:
using (var reader = new StreamReader("input.txt"))  // your data source
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        var product = new Product();
        product.Name = line;  // or parse fields out of the line
        products.Add(product);
    }
}

products.Sort();

foreach (var prod in products)
{
    string name = prod.Name;
    // ... do other processing
}

products.Clear();

Then you start it all over again.


Why do you need a database to do a sort? It doesn't get any faster than
an in-memory sort, which is what you're doing with List<T>.Sort().

And you don't have to use or deploy a database.
 
Bill

Pete,

I have to keep the two data items (columns) together, which is why I
used a DataSet/DataView. The sort must be on the first column only. I
tried doing this with a custom sort routine within the List class Sort
method, but that was even slower than the DataSet. I will look into
the Dictionary class to see what it can do.
 
Bill

Scott,

The Dictionary class did the trick. THANK YOU. It sorted over 3
million data items in 2.5 minutes, down from 8.5 minutes.
 
Peter Duniho

Bill said:
Pete,

I have to keep the two data items (columns) together, which is why I
used a DataSet/DataView. The sort must be on the first column only. I
tried doing this with a custom sort routine within the List class Sort
method, but that was even slower than the DataSet. I will look into
the Dictionary class to see what it can do.

You must have a really slow computer, or you didn't implement the
List<T>-based sort correctly. On my computer (32-bit Windows 7 running
in a VM on a MacBook Pro Core 2 Duo 2.33) I can sort an array of 10
million int/string pairs in 20 seconds.

I also don't understand what you mean by using Dictionary. The
Dictionary<TKey, TValue> class doesn't sort items. The
SortedDictionary<TKey, TValue> class does, but it uses a binary tree
which has _at best_ the same performance characteristics as the
quicksort implementation of Array.Sort() or List<T>.Sort(), and since
you have all the overhead of the internal memory objects and operations
of the SortedDictionary<TKey, TValue> class, it's actually going to be
quite a bit slower.

The fact that you need to keep the two data items together isn't
relevant. You can easily create a simple struct to pair them, or just
use the KeyValuePair<TKey, TValue> struct if you like.
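
For example, a KeyValuePair-based version might look like this (just a
sketch, with made-up values):

// Sketch: keep the number/string pairs together, sort on the key only.
var pairs = new List<KeyValuePair<int, string>>();
pairs.Add(new KeyValuePair<int, string>(42, "foo"));
pairs.Add(new KeyValuePair<int, string>(7, "bar"));
pairs.Sort((a, b) => a.Key.CompareTo(b.Key));  // values travel with their keys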

I suspect that using an array and sorting it is going to be, by far, the
most efficient way to solve your problem.

Pete
 
Göran Andersson

Scott said:
Have you considered using a generic Dictionary(Of T) type? These generic
types support sorting and may prove to be quicker than sorting a DataSet via
a DataView.

A Dictionary<TKey, TValue> can't be sorted, as it doesn't preserve the
order of its elements...
 
Göran Andersson

Bill said:
What is the fastest way to sort two columns and one million rows?

I currently have no database for this application and am wondering if
I should distribute Access or SQL Server Express to speed up a sort
routine.
I am currently adding rows, one at a time, to a dataset. The data
comes from a text file, where I parse out two sets of data, one a
number and the second an alpha string. I insert them into the dataset.
At certain points I need to sort based on the first column (number), and
I create a DataView for this and use the DataView Sort method. I then
write the dataset out to a text file, clear the dataset and start
adding data for the next sort. When I have to sort a large number of
rows, it is very slow. One million rows takes over three minutes for
the sort.

What would be faster? I tried Access, but with a large number of rows
the inserts slowed down, and I could not find a way to do a bulk insert
with Access; if there is one, then maybe Access would be faster than
what I have. I tried the insert with and without having the first
column indexed in the table, and it was slow both ways. Does SQL
Server Express support any type of bulk loading?

Other than the need to sort two columns, I have a lot of control over
the data source, etc.

Using a DataTable means a lot of overhead, compared to objects
specifically created to hold the data that you have.

I made this test class for holding an item:

private class Test {

    public int Id { get; private set; }
    public string Name { get; private set; }

    public Test(int id, string name) {
        Id = id;
        Name = name;
    }

}

Using this code I created a million random items, which I then sorted on
the numerical and textual properties:

List<Test> list = new List<Test>();
Random rnd = new Random();
for (int i = 0; i < 1000000; i++) {
    list.Add(new Test(rnd.Next(), rnd.Next().ToString()));
}

Stopwatch w = new Stopwatch();
w.Start();
list.Sort((x, y) => x.Id.CompareTo(y.Id));
w.Stop();
Console.WriteLine("{0} ms.", w.ElapsedMilliseconds);

w.Reset();
w.Start();
list.Sort((x, y) => x.Name.CompareTo(y.Name));
w.Stop();
Console.WriteLine("{0} ms.", w.ElapsedMilliseconds);

Output:

881 ms.
4514 ms.

So, what you should expect for sorting a million objects in a List<T> is
a few seconds, not a few minutes.
 
Jesse Houwing

* Peter Duniho wrote, On 31-10-2009 3:36:
You must have a really slow computer, or you didn't implement the
List<T>-based sort correctly. On my computer (32-bit Windows 7 running
in a VM on a MacBook Pro Core 2 Duo 2.33) I can sort an array of 10
million int/string pairs in 20 seconds.

That probably excludes loading them from file.

The original poster has included the I/O time of reading and writing in
the total time, as far as I can see.

My guess is that a lot of optimization can take place in the reading and
writing of the file as well, though I do not know what the original
format is, nor the target format.
I also don't understand what you mean by using Dictionary. The
Dictionary<TKey, TValue> class doesn't sort items. The
SortedDictionary<TKey, TValue> class does, but it uses a binary tree
which has _at best_ the same performance characteristics as the
quicksort implementation of Array.Sort() or List<T>.Sort(), and since
you have all the overhead of the internal memory objects and operations
of the SortedDictionary<TKey, TValue> class, it's actually going to be
quite a bit slower.

The fact that you need to keep the two data items together isn't
relevant. You can easily create a simple struct to pair them, or just
use the KeyValuePair<TKey, TValue> struct if you like.

I suspect that using an array and sorting it is going to be, by far, the
most efficient way to solve your problem.

I agree. You might even create a small class or struct as you said,
implement IComparable, and let the framework do the rest.
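
Something like this, say (a sketch; the type and field names are just
illustrative):

// Sketch: a struct that sorts on its numeric field via IComparable<T>.
using System;

struct Record : IComparable<Record>
{
    public int Number;
    public string Text;

    public int CompareTo(Record other)
    {
        return Number.CompareTo(other.Number);  // compare on the first column only
    }
}

// With this in place, Array.Sort(records) and List<Record>.Sort()
// need no explicit comparer.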
 
Peter Duniho

Jesse said:
* Peter Duniho wrote, On 31-10-2009 3:36:

That probably excludes loading them from file.

Yes, true. My test just initialized random data in-memory. I was
assuming that Bill's numbers were for the sort only, and didn't include
the preparatory reading of the data. But yes...if he's including file
i/o time, that could explain the difference.
The original poster has included the I/O time of reading and writing in
the total time, as far as I can see.

My guess is that a lot of optimization can take place in the reading and
writing of the file as well, though I do not know what the original
format is, nor the target format. [...]

In fact, if the data has to come from a file, and that file hasn't been
recently accessed (i.e. the data's not already in an in-memory cache
somewhere), the file i/o is probably going to dominate the processing
time, at least for that number of records (as the size of the data
approaches the available physical RAM, VM disk swapping starts
dominating...but at that point, performance drops precipitously to
completely unusable levels, so presumably his scenario isn't reaching
that point).

I believe he said that his original data was some sort of
character-delimited (CSV?) file. The most effective optimizations are
likely to be along the lines of simply making sure the i/o is buffered,
the buffers are large enough, and (if possible) the storage where the
data will be kept is pre-allocated.
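
To make that concrete, here is a sketch of explicitly sized stream
buffers (the 1 MB figure is just an illustrative starting point; measure
before settling on one):

// Sketch: large explicit buffers on the underlying file streams.
using System.IO;

class BufferedCopyDemo
{
    const int BufferSize = 1 << 20;  // 1 MB; an assumption to be tuned

    static void Process()
    {
        using (var reader = new StreamReader(new FileStream(
            "input.txt", FileMode.Open, FileAccess.Read, FileShare.Read, BufferSize)))
        using (var writer = new StreamWriter(new FileStream(
            "output.txt", FileMode.Create, FileAccess.Write, FileShare.None, BufferSize)))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // parse/collect/sort here...
                writer.WriteLine(line);
            }
        }
    }
}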

To some extent, the disk bandwidth will always be a bottleneck and may
preclude any significant gains in the sorting department. Who knows?
Maybe the DataSet/DataView solution is, in-memory, practically as fast
as the plain array/List<T> implementation.

Pete
 
vanderghast

In a discussion here not so long ago, we found that .NET does not use
the quicksort method, or at least not the basic algorithm of that
method. Basic quicksort is a divide-and-conquer method that never
compares any two elements more than once: in a first pass, only the
pivot is involved in comparisons with other elements, and once the pivot
is positioned, the two partitions deal only with their own members, not
with the pivot, nor with the elements of the other partition. It was
established (by code supplied by Random Coder) that the .NET Sort may
compare two given elements more than once, so the .NET Sort does not use
the basic quicksort method, although it may be strongly inspired by that
method.
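
(For what it's worth, here is a minimal sketch of how one can observe
this, in the spirit of, but not identical to, the code from that
discussion: wrap the comparison and count how often each unordered pair
of values gets compared.)

// Sketch: count comparisons per pair of distinct values during Array.Sort.
using System;
using System.Collections.Generic;

class ComparisonCounter
{
    static void Main()
    {
        int[] data = { 5, 3, 8, 1, 9, 2, 7, 4, 6, 0 };
        var counts = new Dictionary<string, int>();

        Array.Sort(data, (x, y) =>
        {
            string key = x < y ? x + "," + y : y + "," + x;  // unordered pair
            int n;
            counts.TryGetValue(key, out n);
            counts[key] = n + 1;
            return x.CompareTo(y);
        });

        foreach (KeyValuePair<string, int> kv in counts)
            if (kv.Value > 1)
                Console.WriteLine("pair ({0}) compared {1} times", kv.Key, kv.Value);
    }
}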

Vanderghast, Access MVP

Peter Duniho said:
Bill wrote: (...)
In-memory quicksort, which is what you get if you use .NET's built-in
sorting methods.
(...)
 
Peter Duniho

vanderghast said:
In a discussion here not so long ago, we found that .NET does not use
the quicksort method, or at least not the basic algorithm of that
method. Basic quicksort is a divide-and-conquer method that never
compares any two elements more than once: in a first pass, only the
pivot is involved in comparisons with other elements, and once the
pivot is positioned, the two partitions deal only with their own
members, not with the pivot, nor with the elements of the other
partition. It was established (by code supplied by Random Coder) that
the .NET Sort may compare two given elements more than once, so the
.NET Sort does not use the basic quicksort method, although it may be
strongly inspired by that method.

From the documentation of Array.Sort(): "This method uses the QuickSort
algorithm".

From the documentation of List<T>.Sort(): "This method uses Array.Sort,
which uses the QuickSort algorithm."

I'm not sure which discussion you're talking about, but I don't recall
one in which it was established that a) .NET compares non-pivot elements
to each other at all, nor that b) a quicksort necessarily implies that
any given element is compared with any other given element only once.

Bear in mind that in quicksort, the pivot has to wind up in one or the
other of the partitions. Whether or not it winds up as the pivot again,
the pivot will necessarily be compared to at least one other element it
had already been compared with (i.e. at a minimum, the new pivot for
that partition, and every element in the partition if it should happen
to be selected as the pivot again).

The discussion I recall hinged on whether it was okay for a comparison
function to return random results. The answer is, of course, no. It
may not return random results because even if there were such a thing as
a perfect sort in which no two elements are ever compared more than
once, the comparison results need to be transitively consistent in order
for the sort algorithm to work (i.e. if A < B and B < C, then when you
compare A to C, it must compare as less than C).
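
(As a concrete illustration of why: a comparer like the one below
violates transitivity, and the sort may produce an arbitrary order or
throw an ArgumentException complaining about an inconsistent comparer.
Deliberately broken; never do this.)

// Deliberately broken: random comparison results violate transitivity.
Random rnd = new Random();
int[] data = { 3, 1, 2 };
Array.Sort(data, (x, y) => rnd.Next(-1, 2));  // returns -1, 0, or 1 at random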

In any case, the documentation clearly states the sort algorithm used,
and this can easily be verified by examining the implementation, either
via Reflector or Microsoft's published source code for .NET.

Pete
 
vanderghast

If we assume there are no duplicated values, and that we are at the
stage of the algorithm where the partitions are made, consider those
partitions: A and C, such that every element a in A satisfies a < b, and
every element c in C satisfies c > b (b being the pivot). No element a
in A will further interact with any element c in C, and vice versa, by
virtue of what a partition is. We agree on that? (It also implies that
transitivity is inherently respected by the partition, once the
partition is made.)

So, from that point, let us move back in time to the step where we
establish the partitions around the pivot value, k. Any comparison at
that step involves k as one of the operands. And, at least in the basic
algorithm (see http://www.sorting-algorithms.com/quick-sort for the
stage and variable names I am using), all comparisons are made in the
for loop:

if a < pivot

And as shown by the animation's pair of red triangles, marking the i and
k values involved, I simply fail to see where the comparison with the
pivot would involve an element a[j] twice within that "2-way partition"
stage.


So, since the "recursive sorts" stage does not involve the actual pivot,
a[k], and since the "2-way partition" stage seems to involve at most one
comparison between the pivot and any other element a, I don't see why the
'basic' quick sort algorithm could involve two given elements compared
together more than once, ever.


Still under the assumption that there are no duplicated values.



Vanderghast, Access MVP
 
Peter Duniho

vanderghast said:
[...]
So, since the "recursive sorts" stage does not involve the actual pivot,
a[k], and since the "2-way partition" stage seems to involve at most one
comparison between the pivot and any other element a, I don't see why
the 'basic' quick sort algorithm could involve two given elements
compared together more than once, ever.

Still under the assumption that there is no duplicated values.


Actually, duplicated values don't (and shouldn't) matter. And you're
right: in theory you can do the partitioning such that the pivot is left
out of future comparisons (in particular, an in-place quicksort allows
this fairly easily).

But that's very implementation-dependent. The most abstract quicksort
simply puts each partition into a logical "bucket", and then re-runs the
algorithm on the new buckets. The pivot winds up in one of the buckets
and so winds up part of future comparisons.

You can't assert that .NET doesn't use quicksort just on the basis of
observed comparison behavior. And this is doubly true given that it's
trivially demonstrable that .NET _does_ use quicksort.

Pete
 
