what is the best datatype for..


calvert4rent

I need some sort of data type that will hold a listing of IDs and
their counts/frequencies. As I do some processing I get an ID back,
which I need to store while keeping an accurate count of how many
times that ID appears. For example, I might get a listing of these
numbers:
1,4,6,23,6,2,54,1,6,23,2,4,1,6

I will get those numbers one at a time and need to build something
like
1,3
4,2
6,4
23,2
2,2
54,1

In above number X,Y:
X- is the number
Y- is the count of the number

I then need to sort the data on the Y (the count) column:
6,4
1,3
4,2
23,2
2,2
54,1

Can anyone suggest how best to do this? Any suggestions or help would
be much appreciated!

Thanks,

Calvert
 

Nicholas Paldino [.NET/C# MVP]

I would store this in a Dictionary<int, int>, where the key is the id and
the value is the count.

Then, when you need to sort the data, I would enumerate through the
KeyValuePair instances and sort those (Dictionary<TKey, TValue> implements
IEnumerable<KeyValuePair<TKey, TValue>>). You can simply put them into an
array and then call the static Sort method on the Array class, passing in an
anonymous method for the Comparison<T> delegate (which in this case would
be Comparison<KeyValuePair<int, int>>) that compares the pairs based
on the count.
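
For illustration, a minimal sketch of that approach (C# 2.0-style anonymous
method; the sample input is taken from the original post):

using System;
using System.Collections.Generic;

class CountDemo
{
    static void Main()
    {
        int[] input = { 1, 4, 6, 23, 6, 2, 54, 1, 6, 23, 2, 4, 1, 6 };

        // Count occurrences: key = id, value = count.
        Dictionary<int, int> counts = new Dictionary<int, int>();
        foreach (int id in input)
        {
            if (counts.ContainsKey(id))
                counts[id]++;
            else
                counts[id] = 1;
        }

        // Copy the pairs into an array and sort by count, descending,
        // using an anonymous method for the Comparison<T> delegate.
        KeyValuePair<int, int>[] pairs =
            new List<KeyValuePair<int, int>>(counts).ToArray();
        Array.Sort(pairs, delegate(KeyValuePair<int, int> a,
                                   KeyValuePair<int, int> b)
        {
            return b.Value.CompareTo(a.Value);
        });

        foreach (KeyValuePair<int, int> pair in pairs)
            Console.WriteLine("{0},{1}", pair.Key, pair.Value);
    }
}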

You could also get away with using a DataTable with an id column and a
count column, and then just placing a data view on the data table which is
sorted by the count.
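
A rough sketch of that alternative (assuming a reference to System.Data):

using System;
using System.Data;

class DataTableDemo
{
    static void Main()
    {
        DataTable table = new DataTable();
        table.Columns.Add("id", typeof(int));
        table.Columns.Add("count", typeof(int));
        table.PrimaryKey = new DataColumn[] { table.Columns["id"] };

        foreach (int id in new int[] { 1, 4, 6, 23, 6, 2, 54, 1, 6, 23, 2, 4, 1, 6 })
        {
            // Rows.Find uses the primary key to locate an existing row.
            DataRow row = table.Rows.Find(id);
            if (row == null)
                table.Rows.Add(id, 1);
            else
                row["count"] = (int)row["count"] + 1;
        }

        // A view over the table, sorted by count descending.
        DataView view = new DataView(table);
        view.Sort = "count DESC";
        foreach (DataRowView row in view)
            Console.WriteLine("{0},{1}", row["id"], row["count"]);
    }
}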
 

Chris Dunaway

*snip*

As for data types, a general rule of thumb would be that if you plan
to do some calculations with it (like addition, subtraction, etc.)
then it should be some sort of numeric type (int, double, decimal).
If you are only going to store it, then use a string. For example,
even though a phone number is all digits, you should use a string
since you don't do calculations with a phone number.

For your problem, I think your best bet would be to use a generic
dictionary. Use the ID as the key and the count as the value. Read
each value in the array and check whether it already exists in the
dictionary. If it does, then just increment the count; otherwise, add
it to the dictionary.

Dim values() As Integer = {1, 4, 6, 23, 6, 2, 54, 1, 6, 23, 2, 4, 1, 6}
Dim idCount As New Dictionary(Of String, Integer)()

For Each i As Integer In values                   ' Check each value in the array
    If idCount.ContainsKey(i.ToString()) Then     ' If the value already exists in the dictionary
        idCount(i.ToString()) += 1                ' Increment the count for that value
    Else                                          ' Otherwise
        idCount.Add(i.ToString(), 1)              ' Add the value to the dictionary
    End If
Next


Hope this helps

Chris
 

calvert4rent

Thank you both very much. That was exactly the information I was
looking for. I am going to give it a shot now. Thanks again.
 

Hilton

Hi,

Being a CF engineer, I'm always looking to optimize, so even though hashes,
dictionaries etc. will work, the whole process is inefficient. Do you have a
range of values? Let's say that you know that the values will be 0-1000:
define a struct with two fields, one being the value, the other being the
count. Create an array of the structs and off you go. Alternatively you
could use two separate arrays.

Again, this is for performance reasons and I'll bet it'll be way faster than
dictionaries etc. Of course, only do this extra work if you need the
performance and you're not just running a one-off process.

I get this feeling of pending flames - it always seems to happen when I
suggest more efficient techniques. :)

Hilton
 

Peter Duniho

Hi,

Being a CF engineer, I'm always looking to optimize, so even though hashes,
dictionaries etc. will work, the whole process is inefficient.

Define "inefficient".

A dictionary implementation will be faster than your suggested implementation.
Do you have a
range of values? Let's say that you know that the values will be 0-1000:
define a struct with two fields, one being the value, the other being the
count. Create an array of the structs and off you go. Alternatively you
could use two separate arrays.

If I understand your suggestion correctly, you are proposing creating
an array that will be scanned searching for the index value, at which
point the count can be updated.

For small data sets, I think that would work fine. Nothing wrong at
all. But claiming that it's more "efficient" begs the question: in
what way?

It's more efficient in memory usage, but it's certainly not more
efficient with respect to performance. Furthermore, for large data
sets where the difference in memory footprint is likely to be of
concern, the performance difference will be phenomenal. The dictionary
implementation will always provide a basically constant order retrieval
of the element in question, while your linear search will be linear
order, taking orders of magnitude longer as the data set is also orders
of magnitude larger.

In other words, the performance for just 100 elements is ten times
slower than the performance for just 10 elements, and for 1000 elements
is another ten times slower than for 100 elements.
Again, this is for performance reasons and I'll bet it'll be way faster than
dictionaries etc.

You'd bet wrong. Your suggestion is more memory efficient, but it's
not going to be faster.
Of course, only do this extra work if you need the
performance and you're not just running a one-off process.

No, only do this extra work if you absolutely must minimize the memory
footprint of the implementation. It's not going to perform better at
all.
I get this feeling of pending flames - it always seems to happen when I
suggest more efficient techniques. :)

Well, a) you seem to misunderstand how the dictionary solution would
work, and b) you do seem to insist on mischaracterizing techniques as
"efficient" when they either are not at all, or are not efficient in a
way that is interesting in the more general case (this is not a
newsgroup about writing for the Compact Framework).

It is true that there are often trade-offs between being memory
efficient and computation efficient. It's false for you to assert that
you know best which is more important. Even on a CF-based
implementation, sometimes you'll still prefer a faster implementation
over one that uses less memory. It just depends on the priorities of
the situation.

Pete
 

Hilton

Peter,
A dictionary implementation will be faster than your suggested
implementation.

You misunderstood probably because I never explained it well enough.
If I understand your suggestion correctly, you are proposing creating an
array that will be scanned searching for the index value, at which point
the count can be updated.

Nope, no scanning whatsoever.

If you know you'll have say 1000 possible indices, then define an array of
1000 structs, then when you see a value of 605, you simply do
"array[605].count++" (i.e. your basic histogram code). I bet that is faster
than any dictionary implementation. In fact, you don't even need the struct
if most of the work is 'reading' the value and the sorting doesn't need to
be 'high-performance' - then you can define an array of ints and not structs
and spend a little more time doing the sorting. To decide, we'd need to
understand more about the nature of the data.
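
In C#, a minimal sketch of that direct-indexing approach (assuming the ids
are known in advance to fall in 0..999):

using System;

class Histogram
{
    static void Main()
    {
        // Direct-indexing histogram: the id itself is the array index,
        // so there is no scanning, hashing, or lookup.
        int[] counts = new int[1000];   // assumes ids fall in 0..999
        int[] input = { 1, 4, 6, 23, 6, 2, 54, 1, 6, 23, 2, 4, 1, 6 };

        foreach (int id in input)
            counts[id]++;

        for (int id = 0; id < counts.Length; id++)
            if (counts[id] > 0)
                Console.WriteLine("{0},{1}", id, counts[id]);
    }
}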

Hilton
 

Wingot

Err, isn't that VB? Correct me if I'm wrong, but the "Dim" was the
biggest clue, and the fact that the comments start with ', among other
things.

Just thought it was weird to see it in the csharp newsgroup :).

Regards,
Wingot

-----Original Message-----
From: Chris Dunaway [mailto:[email protected]]
Posted At: Tuesday, 20 November 2007 1:14 AM
Posted To: microsoft.public.dotnet.languages.csharp
Conversation: what is the best datatype for..
Subject: Re: what is the best datatype for..

*snip*

For you problem, I think your best bet would be to use a generic
dictionary. Use the ID as the key and the value as the count. Read
each value in the array, check to see if it already exists in the
dictionary. If it does, then just increment the count, otherwise, add
it to the dictionary.

Dim values() As Integer = {1, 4, 6, 23, 6, 2, 54, 1, 6, 23, 2, 4, 1, 6}
Dim idCount As New Dictionary(Of String, Integer)()

For Each i As Integer In values                   ' Check each value in the array
    If idCount.ContainsKey(i.ToString()) Then     ' If the value already exists in the dictionary
        idCount(i.ToString()) += 1                ' Increment the count for that value
    Else                                          ' Otherwise
        idCount.Add(i.ToString(), 1)              ' Add the value to the dictionary
    End If
Next


Hope this helps

Chris
 

Peter Duniho

Peter,


You misunderstood probably because I never explained it well enough.

Yes, it's hard to know what a person is talking about when they don't
explain themselves.
If I understand your suggestion correctly, you are proposing creating an
array that will be scanned searching for the index value, at which point
the count can be updated.

Nope, no scanning whatsoever.

If you know you'll have say 1000 possible indices, then define an array of
1000 structs, then when you see a value of 605, you simply do
"array[605].count++" (i.e. your basic histogram code). I bet that is faster
than any dictionary implementation.

Yes, that's probably true. Based on your updated description, I'd
agree the dictionary implementation will be slower. However, it will
only be slower by a very tiny amount (assuming the rest of the code is
doing anything interesting, it would be difficult to even measure the
difference...the Dictionary class is still constant order just like
your lookup-array), and your implementation will use a lot more memory
than a dictionary would unless the range of values is completely used
or at least nearly so.

It also won't be flexible if the input data changes for any reason,
since the implementation itself is so dependent on the input data.

Finally, one could use the SortedDictionary class for simplicity. It
would be much slower for the counting part of the algorithm (O(log n)
instead of constant order), but when you're done counting the data
would already be sorted. In the end, I'd be surprised if the total
time spent was significantly different (for sure, the order of the
algorithm would be the same: O(n log n) for all three variations).

"Way faster" isn't a well-defined phrase, but I doubt that any
reasonable interpretation of "way faster" would properly describe the
minimal difference in performance between using a dictionary and your
lookup-array proposal.
In fact, you don't even need the struct
if most of the work is 'reading' the value and the sorting doesn't need to
be 'high-performance' - then you can define an array of ints and not structs
and spend a little more time doing the sorting. To decide, we'd need to
understand more about the nature of the data.

Well, personally I would almost never implement a design like the one
you proposed. I certainly would never do it based solely on
performance reasons. If my input data was a reasonably narrow range
(say, 16-bit values or less), _and_ I expected the input values to
cover the range with very few exceptions, then I might choose a
solution like the one you proposed. But mainly because it's simpler,
not because it might be a little faster.

Frankly, when you write stuff like "Being a CF engineer, I'm always
looking to optimize", you are REALLY insulting to those of us who don't
write CF code. As if we aren't concerned with optimization as well.
Furthermore, it's pretty clear from this thread and some others we've
seen that you have a very specific and narrow idea of what "optimize"
means, and if someone doesn't find your choice of trade-offs useful,
you dismiss them as someone who isn't optimizing their code.

The fact is, here you've suggested a solution that isn't really going
to gain any significant performance, and which has its own costs that
may not be appropriate for the situation (and in fact is likely to be
inappropriate for most situations, _especially_ in CF where memory is
at a much greater premium than on a desktop computer).

Calling the dictionary solution "inefficient" is just plain dumb; it
might not be quite as fast as a straight array lookup, but it's not
like it's a _slow_ design. It's not "inefficient" at all. It's just a
different way of approaching the problem and your implication that
anyone who might use a dictionary implementation is writing inefficient
code is very annoying.

So, if you're wondering why your posts seem to generate what you think
are "flames", you might take a good look at how you write your
"suggestions" and whether they really qualify as "more efficient"
anyway.

Pete
 

Hilton

Dude, chill. The guy asked how to write a histogram, I said use an array of
integers.

Hilton


 

Jon Skeet [C# MVP]

Nope, no scanning whatsoever.

If you know you'll have say 1000 possible indices, then define an array of
1000 structs, then when you see a value of 605, you simply do
"array[605].count++" (i.e. your basic histogram code). I bet that is faster
than any dictionary implementation.

That depends on how many samples there are. Suppose there's only one
sample, but a range of size 100,000 - with the dictionary approach,
you'd only have a single entry to sort, whereas with your approach
you'd have to sort 100,000 entries. Which of those do you think would
be quicker?

(Note that using a struct may well be slower than using a class, by
the way - when sorting, each swap would be swapping more data. I
wouldn't like to say without testing though.)

Personally I'd go for the simplest code first and only optimise when
it's shown to be necessary - but at that point I'd definitely be
benchmarking instead of guessing. I would *try* using a Dictionary to
collect the values, then converting it to a list and sorting the list.
That way you wouldn't pay a penalty for unused values, although
there's a higher constant factor due to more copying being involved.

Jon
 

Peter Duniho

Dude, chill. The guy asked how to write a histogram, I said use an array of
integers.

If that's _all_ you'd said, we'd not have a problem.

It's the other crap you insist on writing that causes the issues.

Stick to the facts and you won't get flamed.
 

Marc Gravell

Just to turn things around... in .NET 3.5 and C# 3 (VS2008) you can
use LINQ to do the same thing, but focusing more on what you want to
do, rather than how it should happen. See below, but you can read at a
glance that we are grouping by the value itself, ordering by the count
(descending) and then value (ascending), and projecting the value and
the count.

If you do the same with a dictionary, it'll work, but it won't be as
clear what the code is actually trying to achieve. IMO, at least ;-p
The advantage of this is that it also ports very well both to database
usage, and to (for example) Parallel LINQ - without you having to redo
it from scratch.

Marc

// data-stream could be any source
IEnumerable<int> dataStream = new[] {
    1, 4, 6, 23, 6, 2, 54, 1, 6, 23, 2, 4, 1, 6 };

// prepare a query
var query = from value in dataStream
            group value by value into grouped
            orderby grouped.Count() descending,
                    grouped.Key
            select new { Value = grouped.Key, Count = grouped.Count() };

// execute and iterate
foreach (var row in query) {
    Console.WriteLine("{0}: {1}", row.Value, row.Count);
}
 

Marc Gravell

As a semi-retraction... for reference, looking deeper into the code it
appears that Enumerable.GroupBy() uses a Lookup<TKey,TElement>
internally, which adds the elements to each grouping; as such, it
probably isn't suited to cases where you want to forget about the
contents and just keep the aggregates...

But nevertheless, worth knowing. Now to see if I can write a more
efficient LINQ aggregator ;-p

Marc
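
For illustration only, a sketch of what such a count-only aggregator might
look like (the CountBy name and its extension-method shape are invented
here; this is not Marc's code or a framework API):

using System;
using System.Collections.Generic;

static class EnumerableCountExtensions
{
    // Counts occurrences without materializing the grouped elements,
    // unlike GroupBy, which stores every element in its grouping.
    public static Dictionary<T, int> CountBy<T>(this IEnumerable<T> source)
    {
        Dictionary<T, int> counts = new Dictionary<T, int>();
        foreach (T item in source)
        {
            int count;
            counts.TryGetValue(item, out count);
            counts[item] = count + 1;
        }
        return counts;
    }
}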
 

Chris Dunaway

Err, isn't that VB? Correct me if I'm wrong, but the "Dim" was the
biggest clue, and the fact that the comments start with ', among other
things.

Just thought it was weird to see it in the csharp newsgroup :).

Regards,
Wingot

-----Original Message-----
From: Chris Dunaway [mailto:[email protected]]

Posted At: Tuesday, 20 November 2007 1:14 AM
Posted To: microsoft.public.dotnet.languages.csharp
Conversation: what is the best datatype for..
Subject: Re: what is the best datatype for..

*snip*

For you problem, I think your best bet would be to use a generic
dictionary. Use the ID as the key and the value as the count. Read
each value in the array, check to see if it already exists in the
dictionary. If it does, then just increment the count, otherwise, add
it to the dictionary.

Dim values() As Integer = {1,4,6,23,6,2,54,1,6,23,2,4,1,6}
Dim idCount As New Dictionary(Of String, Integer)()

For Each i As Integer In values 'Check
each value in the array
If idCount.ContainsKey(i.ToString()) Then 'If the value
already exists in the dictionary
idCount(i.ToString) += 1
'increment the count for that value

Else
'otherwise
idCount.Add(i.ToString(), 1) 'add
the value to the dictionary
End If
Next

Hope this helps

Chris

Oops! I had been looking at the VB group and obviously forgot where I
was! I'm too young to have a "senior moment"!

Chris
 

Hilton

OK.... I'm not looking to get into a pissing match here, so this will be my
last post to this thread, but for the record, I'll correct just one of the
many incorrect things in your post.

You said: "Frankly, when you write stuff like 'Being a CF engineer, I'm
always looking to optimize', you are REALLY insulting to those of us who
don't write CF code. As if we aren't concerned with optimization as well."

Peter, this logic is so flawed that it is laughable. If a red car has
wheels, then a blue car doesn't? Secondly, it is pretty obvious that
running code on a small device with a computationally-limited 200 MHz CPU
with a tiny bus to slow memory, etc. requires some additional optimizations
that a dual-core desktop may not.

Well, there you go.

Hilton
 

Peter Duniho

OK.... I'm not looking to get into a pissing match here, so this will be my
last post to this thread, but for the record, I'll correct just one of the
many incorrect things in your post.

Sorry? You wrote you were going to correct something. But you just
wrote another untrue thing.
You said: "Frankly, when you write stuff like 'Being a CF engineer, I'm
always looking to optimize', you are REALLY insulting to those of us who
don't write CF code. As if we aren't concerned with optimization as well."

Peter, this logic is so flawed that it is laughable.

I'm not laughing.
If a red car has
wheels, then a blue car doesn't?

I have never seen anyone write "being a red car, it has wheels" and if
I did, I would call out that statement as being just as wrong as what
you wrote. Clearly the phrasing you used makes the exact implication I
describe.

Specifically, the word "being" along with an adjective ("CF" in your
original statement, "red" in this latest example) implies some
significance to the adjective, not just the noun.

If you don't want to accept or admit that implication, fine. You'll
probably keep making the same mistake. But stop being so surprised
when the things you write set someone off.
Secondly, it is pretty obvious that
running code on a small device with a computationally-limited 200 MHz CPU
with a tiny bus to slow memory, etc. requires some additional optimizations
that a dual-core desktop may not.

"Requires" does not mean "is the only platform in which are used".

Just because you write code on the CF platform, that in no way
necessarily gives you a leg up on people who don't with respect to
optimal techniques. If anything, the performance of your platform is
more likely to affect the quantity of data you are able to deal with,
rather than how efficient an algorithm you need. Optimal techniques
are needed across ALL platforms.

Frankly, your contribution to this thread is a very good example of the
fallaciousness in believing that working on CF somehow gives you an
advantage, since your claims about the relative performance of the
techniques offered are just plain wrong.

Again, there was nothing wrong with the basic technique you suggested.
But you couldn't resist comparing yourself and your technique to the
other people, and _that_ is the problem. It would be annoying if there
was even some basis to the comparison, but the fact that the comparison
itself is completely flawed just aggravates the situation.
Well, there you go.

Yes, there I go.

Pete
 

Hilton

Jon,
Nope, no scanning whatsoever.

If you know you'll have say 1000 possible indices, then define an array of
1000 structs, then when you see a value of 605, you simply do
"array[605].count++" (i.e. your basic histogram code). I bet that is faster
than any dictionary implementation.

That depends on how many samples there are. Suppose there's only one
sample, but a range of size 100,000 - with the dictionary approach,
you'd only have a single entry to sort, whereas with your approach
you'd have to sort 100,000 entries. Which of those do you think would
be quicker?

I agree 100% - the nature of the data is significant. BTW: I wouldn't have
to sort 100,000; the algorithm could, for example, remove the unused counts
by creating another array, copying over only the used counts, and converting
from an int to a struct which would include the index.

(Note that using a struct may well be slower than using a class, by
the way - when sorting, each swap would be swapping more data. I
wouldn't like to say without testing though.)

The "int count, index" struct would have double the data to swap so you are
correct (instead of just a reference) - assuming that we use ints (again,
depends on the nature of the data - we might require bytes or longs).
However, since arrays of structs are contiguous, cache misses would/might be
reduced and that might end up being significant (or not). On the flip side,
creating (and disposing) the objectes incurs a penalty. There are several
'exercises for the reader' here; e.g. how much more expensive are objects
than structs during creation and swapping, how much penalty is incurred
during boxing and unboxing (if required), faster to swap structs or
references etc. Another thing, seems like accessing a x[y].z should be
faster when using structs than objects (I think).

Personally I'd go for the simplest code first and only optimise when
it's shown to be necessary - but at that point I'd definitely be
benchmarking instead of guessing. I would *try* using a Dictionary to
collect the values, then converting it to a list and sorting the list.
That way you wouldn't pay a penalty for unused values, although
there's a higher constant factor due to more copying being involved.

I think it really all depends on the nature of the data and the nature of
the app. With large amounts of data (e.g. large bitmap or network streams)
I would definitely favor the array of ints, but the techniques you suggest
definitely should be part of the same toolkit to solve the problem.

Jon, do you have any URLs with information on the internals of "Dictionary<int,
int>"? I'm particularly interested in boxing, unboxing, casting, hashing,
etc. Performance info would be great too.

Thanks,

Hilton
 

Jon Skeet [C# MVP]

Hilton said:
If you know you'll have say 1000 possible indices, then define an array of
1000 structs, then when you see a value of 605, you simply do
"array[605].count++" (i.e. your basic histogram code). I bet that is faster
than any dictionary implementation.

That depends on how many samples there are. Suppose there's only one
sample, but a range of size 100,000 - with the dictionary approach,
you'd only have a single entry to sort, whereas with your approach
you'd have to sort 100,000 entries. Which of those do you think would
be quicker?

I agree 100% - the nature of the data is significant. BTW: I wouldn't have
to sort 100,000; the algorithm could, for example, remove the unused counts
by creating another array, copying over only the used counts, and converting
from an int to a struct which would include the index.

Removing unused counts is still O(R) where R is the theoretical range -
that may be *much* larger than the set of actual numbers used.
(Note that using a struct may well be slower than using a class, by
the way - when sorting, each swap would be swapping more data. I
wouldn't like to say without testing though.)

The "int count, index" struct would have double the data to swap so you are
correct (instead of just a reference) - assuming that we use ints (again,
depends on the nature of the data - we might require bytes or longs).
However, since arrays of structs are contiguous, cache misses would/might be
reduced and that might end up being significant (or not). On the flip side,
creating (and disposing) the objectes incurs a penalty. There are several
'exercises for the reader' here; e.g. how much more expensive are objects
than structs during creation and swapping, how much penalty is incurred
during boxing and unboxing (if required), faster to swap structs or
references etc. Another thing, seems like accessing a x[y].z should be
faster when using structs than objects (I think).

Possibly. My bigger point was that it's not a good idea to claim
performance benefits without testing, which is what you were doing.
Optimization is a very subtle art - and a time consuming exercise,
which should only be attempted where really needed.
I think it really all depends on the nature of the data and the nature of
the app. With large amounts of data (e.g. large bitmap or network streams)
I would definitely favor the array of ints, but the techniques you suggest
definitely should be part of the same toolkit to solve the problem.

Why would you "definitely favour the array of ints"? You've already
suggested that with a range of size 100,000 you wouldn't use an array -
that could be the case with large streams.

In particular, would you "definitely favour the array of ints" without
benchmarking? If so, that's foolish.
Jon, do you have any URLs with information on the internals of "Dictionary<int,
int>"? I'm particularly interested in boxing, unboxing, casting, hashing,
etc. Performance info would be great too.

There'd be no boxing involved, due to the way that generics are handled
by the JIT. General performance info is given in MSDN - lookup on a
Dictionary<K,V> approaches an O(1) operation; adding is O(1) unless
extra capacity is required, in which case it becomes O(n).
 

Scott Gifford

Jon Skeet said:
Nope, no scanning whatsoever.

If you know you'll have say 1000 possible indices, then define an array of
1000 structs, then when you see a value of 605, you simply do
"array[605].count++" (i.e. your basic histogram code). I bet that is faster
than any dictionary implementation.

That depends on how many samples there are. Suppose there's only one
sample, but a range of size 100,000

Just for the OP's reference (and some Google keywords), this is a
standard tradeoff, and it depends on whether the data is "dense" or
"sparse". For dense data the array approach is generally more
efficient; for sparse data the dictionary is likely to be best. But
of course, performance specifics will depend on a lot of factors.

Since both ways are fairly simple, if performance matters it would be
straightforward to write it both ways and benchmark, as Jon suggests.
It would also give some useful data for the next time you have to make
this tradeoff.
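
A minimal sketch of such a benchmark (the workload here is invented for
illustration - one million samples, dense in the range 0..999 - so the
numbers it prints say nothing about the sparse case):

using System;
using System.Collections.Generic;
using System.Diagnostics;

class CountBenchmark
{
    static void Main()
    {
        // Hypothetical workload: one million samples, dense in 0..999.
        Random rng = new Random(42);
        int[] data = new int[1000000];
        for (int i = 0; i < data.Length; i++)
            data[i] = rng.Next(1000);

        // Array (histogram) version.
        Stopwatch sw = Stopwatch.StartNew();
        int[] counts = new int[1000];
        foreach (int id in data)
            counts[id]++;
        sw.Stop();
        Console.WriteLine("Array:      {0} ms", sw.ElapsedMilliseconds);

        // Dictionary version.
        sw = Stopwatch.StartNew();
        Dictionary<int, int> dict = new Dictionary<int, int>();
        foreach (int id in data)
        {
            int c;
            dict.TryGetValue(id, out c);
            dict[id] = c + 1;
        }
        sw.Stop();
        Console.WriteLine("Dictionary: {0} ms", sw.ElapsedMilliseconds);
    }
}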

Good luck!

----Scott.
 
