The Right Collection for the Job

orekin · Apr 30, 2005

Hi There

I have a huge array of objects being displayed on a UI. These are
actually 'jobs' that the users need to deal with before close of
business.

As the operator deals with a particular job, I want to add the unique
identifier for that job to a collection. The collection simply keeps
track of jobs that have been dealt with.

The issue is speed of retrieval. Given a job, I know its unique
identifier and want to be able to quickly search the collection to see
if the job has been dealt with.

At the moment I am using a HashTable, but my use of the HashTable is
not intuitive - I add items with myHashTable.Add(jobID,jobID) ... ie I
use the unique identifier for both key and value, because really, I
don't care for the value, I just want a quick way to see if the JobID
is in the table.

Is there a better way to go about things?

Thanks
Bill

Tom Shelton · Apr 30, 2005

Hi There

I have a huge array of objects being displayed on a UI. These are
actually 'jobs' that the users need to deal with before close of
business.

As the operator deals with a particular job, I want to add the unique
identifier for that job to a collection. The collection simply keeps
track of jobs that have been dealt with.

The issue is speed of retrieval. Given a job, I know its unique
identifier and want to be able to quickly search the collection to see
if the job has been dealt with.

At the moment I am using a HashTable, but my use of the HashTable is
not intuitive - I add items with myHashTable.Add(jobID,jobID) ... ie I
use the unique identifier for both key and value, because really, I
don't care for the value, I just want a quick way to see if the JobID
is in the table.

Is there a better way to go about things?

Thanks
Bill

Bill... I think using that hastable is still pretty good. You could do
something like this:

using System;
using System.Collections;

public sealed class App
{
public static void Main ()
{
// create a hashtable
Hashtable ht = new Hashtable ();

// add some keys
ht.Add (1, null);
ht.Add (2, null);

Console.WriteLine (ht.ContainsKey (1));
Console.WriteLine (ht.ContainsKey (10));
}
}

Output:

True
False

Ollie Riches · Apr 30, 2005

You should check the specialized collection namespace
(System.Collections.Specialized) it has a string dictionary that
implements a hash table with the key and the value strongly typed to be
strings rather than objects.

How many entries are you going to have in the collection?

And do you only ever want to check for Job once?

--
HTH

Ollie Riches
http://www.phoneanalyser.net

Disclaimer: Opinions expressed in this forum are my own, and not
representative of my employer.
I do not answer questions on behalf of my employer. I'm just a programmer
helping programmers.

Dave · Apr 30, 2005

If you don't care about the "Value" in the Hashtable, then I recommend using an ArrayList and the Sort and BinarySearch methods.

I believe that sorting the list, and "binary searching" will be the fastest implementation in your case, but I believe that this
method performs the best with 10 or more entries (if i remember correctly). Either way, 10 entries should be extremely fast so that
should work just fine.

Nick Malik [Microsoft] · Apr 30, 2005

if the list of jobs is constrained to "the number of jobs that a single
human can perform in a single day," then the word "huge" is misleading.

To create a parallel, let's say that I wanted to count, as a task, each
breath I take while at work. I take appoximately 3 seconds per breath. In
other words, I inhale about 20 times a minute while at rest. In an eight
hour day, I will inhale 9,600 times. This is tiny.

Assuming I wanted to place 9,600 rows in any collection object that the
framework makes available, it would be difficult (at best) for a human being
to percieve any difference in the performance for looking up a single entry,
since the number of entries is so small.

You have already spent more time thinking about it, just by reading these
responses, than all of your users, combined, will ever percieve in variation
based on the choice you make.

In other words, don't optimize until you know where your time is being
spent.

--
--- Nick Malik [Microsoft]
MCSD, CFPS, Certified Scrummaster
http://blogs.msdn.com/nickmalik

Disclaimer: Opinions expressed in this forum are my own, and not
representative of my employer.
I do not answer questions on behalf of my employer. I'm just a
programmer helping programmers.

orekinbck · May 1, 2005

Thanks everyone for your replies.

A particular jobID would be checked against the collection several
thousand times, not just once. Also, the jobID is an integer.

In terms of the size of the array that holds the number of jobs that
are dealt with - this ranges from 0 to approx 250,000 items [There are
thousands of employees available to deal with jobs and each job takes
on average 10 minutes]

So on the absolute worst case (please correct if these calcs are
wrong):
- Linear search requires 250,000 hits
- Binary search requires 18 hits [Log 250,000 base 2]
- HashTable with default load factor requires 3.6 hits [1/(1-0.72)]

If my calculations are correct that means that we have roughly 15 ms to
gain every time this collection is searched. The search routine will
be put away in its own function and that function could be called up to
10,000 times (1 refresh every 3 seconds over a standard business day).
This equates to a time saving of 150 seconds throughout the day.

I am really interested to read further feedback. For now I am leaning
toward the approach suggested by Tom as it is quick and sticks to
HashTables, which the other programmers on my team are familiar with..

Kind Regards
Bill

Dave · May 1, 2005

Hi Bill,

I was wondering if you could please explain this calculation:

- HashTable with default load factor requires 3.6 hits [1/(1-0.72)]

Default load is 1, I believe. How did you determine the other values?

HashTable is obviously the way to go (aside from algorithms not built into the framework that may perform better).

--
Dave Sexton
[email protected]
-----------------------------------------------------------------------

Thanks everyone for your replies.

A particular jobID would be checked against the collection several
thousand times, not just once. Also, the jobID is an integer.

In terms of the size of the array that holds the number of jobs that
are dealt with - this ranges from 0 to approx 250,000 items [There are
thousands of employees available to deal with jobs and each job takes
on average 10 minutes]

So on the absolute worst case (please correct if these calcs are
wrong):
- Linear search requires 250,000 hits
- Binary search requires 18 hits [Log 250,000 base 2]
- HashTable with default load factor requires 3.6 hits [1/(1-0.72)]

If my calculations are correct that means that we have roughly 15 ms to
gain every time this collection is searched. The search routine will
be put away in its own function and that function could be called up to
10,000 times (1 refresh every 3 seconds over a standard business day).
This equates to a time saving of 150 seconds throughout the day.

I am really interested to read further feedback. For now I am leaning
toward the approach suggested by Tom as it is quick and sticks to
HashTables, which the other programmers on my team are familiar with..

Kind Regards
Bill

orekinbck · May 1, 2005

Hi Dave!

I got this from:

http://msdn.microsoft.com/library/d...s/dv_vstechart/html/datastructures_guide2.asp

Search the document for "1 / (1 - lf), where lf is the load factor"

I could probably be incorrect, but I am thinking worse case scenario is
a collision, and using the above formula with default load factor I get
3.6 (note that Microsoft does a transformation on the load factor,
0.72 is really 1 ... search for paragraph containing "Realize, however,
that whatever value you provide, it is scaled down 72 percent, so even
if you pass in a value of 1.0, the Hashtable class's actual loadFactor
will be 0.72").

Cheers
Bill

Nick Malik [Microsoft] · May 3, 2005

doing this in memory is not scalable. It is fast, true, but not scalable.

In other words, you can get good numbers from 100 folks hitting the
collection, but when you have 10,000, you are going to need another machine
or two, because you will bottleneck on the networking or communication
aspects. Therefore, all the speed in the world in the collection won't
help.

If you are going to scale, you will need a nice fast database under the
covers. Put in SQL (on a seperate machine from your app server). Then, at
the app server level, Cache the hits you have looked up, but no more, and
even then keep them in memory only for a short time, since their data will
age pretty fast.

Your speed issue still has nothing to do with the collection. Use the hash
table and go. Save this "thinking" time to consider actual scalability.
--
--- Nick Malik [Microsoft]
MCSD, CFPS, Certified Scrummaster
http://blogs.msdn.com/nickmalik

Disclaimer: Opinions expressed in this forum are my own, and not
representative of my employer.
I do not answer questions on behalf of my employer. I'm just a
programmer helping programmers.

The Right Collection for the Job

orekin

Tom Shelton

Ollie Riches

Dave

Nick Malik [Microsoft]

orekinbck

Dave

orekinbck

Nick Malik [Microsoft]