PC Review


Reply
Thread Tools Rate Thread

what is the best datatype for..

 
 
calvert4rent@gmail.com
Guest
Posts: n/a
 
      19th Nov 2007
I need to some sort of data type that will hold a listing of ID's and
their counts/frequency. As I do some processing I get an ID back
which I need to store and keep an accurate count for how many times
that ID is listed. For example I might get a listing of these
numbers:
1,4,6,23,6,2,54,1,6,23,2,4,1,6

I will get those numbers one at a time and need to build something
like
1,3
4,2
6,4
23,2
2,2
43,1

In above number X,Y:
X- is the number
Y- is the count of the number

I then need to sort the data on the Y (the count) column:
6,4
1,3
4,2
23,2
2,2
43,1

Can anyone suggest how best to do this. Any suggestions or help is
much appreciated!

Thanks,

Calvert
 
Reply With Quote
 
 
 
 
Nicholas Paldino [.NET/C# MVP]
Guest
Posts: n/a
 
      19th Nov 2007
I would store this in a Dictionary<int, int> where the key is the id and
the value is the count.

Then, when you need to sort the keys, I would enumerate through the
KeyValuePair instances and sort those (Dictionary<TKey, TValue> implements
IEnumerable<KeyValuePair<TKey, TValue>>). You can simply put them into an
array and then call the static Sort method on the Array class, passing in an
anonymous method for the Comparison<T> delegate (which in this case, would
be Comparison<KeyValuePair<int, int>>) which would compare the values based
on the count.

You could also get away with using a DataTable with an id column and a
count column, and then just placing a data view on the data table which is
sorted by the count.


--
- Nicholas Paldino [.NET/C# MVP]
- (E-Mail Removed)

<(E-Mail Removed)> wrote in message
news:89f5d237-ed03-46ce-862d-(E-Mail Removed)...
>I need to some sort of data type that will hold a listing of ID's and
> their counts/frequency. As I do some processing I get an ID back
> which I need to store and keep an accurate count for how many times
> that ID is listed. For example I might get a listing of these
> numbers:
> 1,4,6,23,6,2,54,1,6,23,2,4,1,6
>
> I will get those numbers one at a time and need to build something
> like
> 1,3
> 4,2
> 6,4
> 23,2
> 2,2
> 43,1
>
> In above number X,Y:
> X- is the number
> Y- is the count of the number
>
> I then need to sort the data on the Y (the count) column:
> 6,4
> 1,3
> 4,2
> 23,2
> 2,2
> 43,1
>
> Can anyone suggest how best to do this. Any suggestions or help is
> much appreciated!
>
> Thanks,
>
> Calvert



 
Reply With Quote
 
Chris Dunaway
Guest
Posts: n/a
 
      19th Nov 2007
On Nov 19, 9:53 am, calvert4r...@gmail.com wrote:
> I need to some sort of data type that will hold a listing of ID's and
> their counts/frequency. As I do some processing I get an ID back
> which I need to store and keep an accurate count for how many times
> that ID is listed. For example I might get a listing of these
> numbers:
> 1,4,6,23,6,2,54,1,6,23,2,4,1,6
>
> I will get those numbers one at a time and need to build something
> like
> 1,3
> 4,2
> 6,4
> 23,2
> 2,2
> 43,1
>
> In above number X,Y:
> X- is the number
> Y- is the count of the number
>
> I then need to sort the data on the Y (the count) column:
> 6,4
> 1,3
> 4,2
> 23,2
> 2,2
> 43,1
>
> Can anyone suggest how best to do this. Any suggestions or help is
> much appreciated!
>
> Thanks,
>
> Calvert


As for data types, a general rule of thumb would be that if you plan
to do some calculations with it (like addition, subtraction, etc.)
then it should be some sort of numeric type (int, double, decimal).
If you are only going to store it, then use a string. For example,
even though a phone number is all digits, you should use a string
since you don't do calculations with a phone number.

For you problem, I think your best bet would be to use a generic
dictionary. Use the ID as the key and the value as the count. Read
each value in the array, check to see if it already exists in the
dictionary. If it does, then just increment the count, otherwise, add
it to the dictionary.

Dim values() As Integer = {1,4,6,23,6,2,54,1,6,23,2,4,1,6}
Dim idCount As New Dictionary(Of String, Integer)()

For Each i As Integer In values 'Check
each value in the array
If idCount.ContainsKey(i.ToString()) Then 'If the value
already exists in the dictionary
idCount(i.ToString) += 1
'increment the count for that value

Else
'otherwise
idCount.Add(i.ToString(), 1) 'add
the value to the dictionary
End If
Next


Hope this helps

Chris
 
Reply With Quote
 
calvert4rent@gmail.com
Guest
Posts: n/a
 
      19th Nov 2007
Thank you both very much. That was exactly the information I was
looking for. I am going to give it a shot now. Thanks again.
 
Reply With Quote
 
Hilton
Guest
Posts: n/a
 
      19th Nov 2007
Hi,

Being a CF engineer, I'm always looking to optimize, so even though hashes,
dictionaries etc will work, the whole process is inefficient. Do you have a
range of values? Let's say that you know that the values will be 0-1000,
define a struct with two values; one being the value, the other being the
count. Create an array of the structs and off you go. Alternatively you
could use two separate arrays.

Again, this is for performance reasons and I'll bet it'll be way faster than
dictionaries etc. Of course, only do this extra work if you need the
performance and you're not just running a one-off process.

I get this feeling of pending flames - it always seems to happen when I
suggest more efficient techniques.

Hilton


<(E-Mail Removed)> wrote in message
news:89f5d237-ed03-46ce-862d-(E-Mail Removed)...
>I need to some sort of data type that will hold a listing of ID's and
> their counts/frequency. As I do some processing I get an ID back
> which I need to store and keep an accurate count for how many times
> that ID is listed. For example I might get a listing of these
> numbers:
> 1,4,6,23,6,2,54,1,6,23,2,4,1,6
>
> I will get those numbers one at a time and need to build something
> like
> 1,3
> 4,2
> 6,4
> 23,2
> 2,2
> 43,1
>
> In above number X,Y:
> X- is the number
> Y- is the count of the number
>
> I then need to sort the data on the Y (the count) column:
> 6,4
> 1,3
> 4,2
> 23,2
> 2,2
> 43,1
>
> Can anyone suggest how best to do this. Any suggestions or help is
> much appreciated!
>
> Thanks,
>
> Calvert



 
Reply With Quote
 
Peter Duniho
Guest
Posts: n/a
 
      19th Nov 2007
On 2007-11-19 12:38:47 -0800, "Hilton" <(E-Mail Removed)> said:

> Hi,
>
> Being a CF engineer, I'm always looking to optimize, so even though hashes,
> dictionaries etc will work, the whole process is inefficient.


Define "inefficient".

A dictionary implementation will be faster than your suggested implementation.

> Do you have a
> range of values? Let's say that you know that the values will be 0-1000,
> define a struct with two values; one being the value, the other being the
> count. Create an array of the structs and off you go. Alternatively you
> could use two separate arrays.


If I understand your suggestion correctly, you are proposing creating
an array that will be scanned searching for the index value, at which
point the count can be updated.

For small data sets, I think that would work fine. Nothing wrong at
all. But claiming that it's more "efficient" begs the question: in
what way?

It's more efficient in memory usage, but it's certainly not more
efficient with respect to performance. Furthermore, for large data
sets where the difference in memory footprint is likely to be of
concern, the performance difference will be phenomenal. The dictionary
implementation will always provide a basically constant order retrieval
of the element in question, while your linear search will be linear
order, taking orders of magnitude longer as the data set is also orders
of magnitude larger.

In other words, the performance for just 100 elements is ten times
slower than the performance for just 10 elements, and for 1000 elements
is another ten times slower than for 100 elements.

> Again, this is for performance reasons and I'll bet it'll be way faster than
> dictionaries etc.


You'd bet wrong. Your suggestion is more memory efficient, but it's
not going to be faster.

> Of course, only do this extra work if you need the
> performance and you're not just running a one-off process.


No, only do this extra work if you absolutely must minimize the memory
footprint of the implementation. It's not going to perform better at
all.

> I get this feeling of pending flames - it always seems to happen when I
> suggest more efficient techniques.


Well, a) you seem to misunderstand how the dictionary solution would
work, and b) you do seem to insist on mischaracterizing techniques as
"efficient" when they either are not at all, or are not efficient in a
way that's is interesting in the more general case (this is not a
newsgroup for writing using Compact Framework).

It is true that there are often trade-offs between being memory
efficient and computation efficient. It's false for you to assert that
you know best which is more important. Even on a CF-based
implementation, sometimes you'll still prefer a faster implementation
over one that uses less memory. It just depends on the priorities of
the situation.

Pete

 
Reply With Quote
 
Hilton
Guest
Posts: n/a
 
      20th Nov 2007
Peter,

> A dictionary implementation will be faster than your suggested
> implementation.


You misunderstood probably because I never explained it well enough.

> If I understand your suggestion correctly, you are proposing creating an
> array that will be scanned searching for the index value, at which point
> the count can be updated.


Nope, no scanning whatsoever.

If you know you'll have say 1000 possible indices, then define an array of
1000 structs, then when you see a value of 605, you simply do
"array[605].count++" (i.e. your basic histogram code). I bet that is faster
than any dictionary implementation. In fact, you don't even need the struct
if most of the work is 'reading' the value and the sorting doesn't need to
be 'high-performance' - then you can define an array of ints and not structs
and spend a little more time doing the sorting. To decide, we'd need to
understand more about the nature of the data.

Hilton


 
Reply With Quote
 
Wingot
Guest
Posts: n/a
 
      20th Nov 2007
Err, isn't that VB? Correct me if I'm wrong, but the "Dim" was the
biggest clue, and the fact that the comments start with ', among other
things.

Just thought it was weird to see it in the csharp newsgroup .

Regards,
Wingot

-----Original Message-----
From: Chris Dunaway [private.php?do=newpm&u=]
Posted At: Tuesday, 20 November 2007 1:14 AM
Posted To: microsoft.public.dotnet.languages.csharp
Conversation: what is the best datatype for..
Subject: Re: what is the best datatype for..

*snip*

For you problem, I think your best bet would be to use a generic
dictionary. Use the ID as the key and the value as the count. Read
each value in the array, check to see if it already exists in the
dictionary. If it does, then just increment the count, otherwise, add
it to the dictionary.

Dim values() As Integer = {1,4,6,23,6,2,54,1,6,23,2,4,1,6}
Dim idCount As New Dictionary(Of String, Integer)()

For Each i As Integer In values 'Check
each value in the array
If idCount.ContainsKey(i.ToString()) Then 'If the value
already exists in the dictionary
idCount(i.ToString) += 1
'increment the count for that value

Else
'otherwise
idCount.Add(i.ToString(), 1) 'add
the value to the dictionary
End If
Next


Hope this helps

Chris

 
Reply With Quote
 
Peter Duniho
Guest
Posts: n/a
 
      20th Nov 2007
On 2007-11-19 16:56:41 -0800, "Hilton" <(E-Mail Removed)> said:

> Peter,
>
>> A dictionary implementation will be faster than your suggested
>> implementation.

>
> You misunderstood probably because I never explained it well enough.


Yes, it's hard to know what a person is talking about when they don't
explain themselves.

>> If I understand your suggestion correctly, you are proposing creating an
>> array that will be scanned searching for the index value, at which point
>> the count can be updated.

>
> Nope, no scanning whatsoever.
>
> If you know you'll have say 1000 possible indices, then define an array of
> 1000 structs, then when you see a value of 605, you simply do
> "array[605].count++" (i.e. your basic histogram code). I bet that is faster
> than any dictionary implementation.


Yes, that's probably true. Based on your updated description, I'd
agree the dictionary implementation will be slower. However, it will
only be slower by a very tiny amount (assuming the rest of the code is
doing anything interesting, it would be difficult to even measure the
difference...the Dictionary class is still constant order just like
your lookup-array), and your implementation will use a lot more memory
than a dictionary would unless the range of values is completely used
or at least nearly so.

It also won't be flexible if the input data changes for any reason,
since the implementation itself is so dependent on the input data.

Finally, one could use the SortedDictionary class for simplicity. It
would be much slower for the counting part of the algorithm (O(log n)
instead of constant order), but when you're done counting the data
would already be sorted. In the end, I'd be surprised if the total
time spent was significantly different (for sure, the order of the
algorithm would be the same: O(n log n) for all three variations).

"Way faster" isn't a well-defined phrase, but I doubt that any
reasonable interpretation of "way faster" would properly describe the
minimal difference in performance between using a dictionary and your
lookup-array proposal.

> In fact, you don't even need the struct
> if most of the work is 'reading' the value and the sorting doesn't need to
> be 'high-performance' - then you can define an array of ints and not structs
> and spend a little more time doing the sorting. To decide, we'd need to
> understand more about the nature of the data.


Well, personally I would almost never implement a design like the one
you proposed. I certainly would never do it based solely on
performance reasons. If my input data was a reasonably narrow range
(say, 16-bit values or less), _and_ I expected the input values to
cover the range with very few exceptions, then I might choose a
solution like the one you proposed. But mainly because it's simpler,
not because it might be a little faster.

Frankly, when you write stuff like "Being a CF engineer, I'm always
looking to optimize", you are REALLY insulting to those of us who don't
write CF code. As if we aren't concerned with optimization as well.
Furthermore, it's pretty clear from this thread and some others we've
seen that you have a very specific and narrow idea of what "optimize"
means, and if someone doesn't find your choice of trade-offs useful,
you dismiss them as someone who isn't optimizing their code.

The fact is, here you've suggested a solution that isn't really going
to gain any significant performance, and which has its own costs that
may not be appropriate for the situation (and in fact is likely to be
inappropriate for most situations, _especially_ in CF where memory is
at a much greater premium than on a desktop computer).

Calling the dictionary solution "inefficient" is just plain dumb; it
might not be quite as fast as a straight array lookup, but it's not
like it's a _slow_ design. It's not "inefficient" at all. It's just a
different way of approaching the problem and your implication that
anyone who might use a dictionary implementation is writing inefficient
code is very annoying.

So, if you're wondering why your posts seem to generate what you think
are "flames", you might take a good look at how you write your
"suggestions" and whether they really qualify as "more efficient"
anyway.

Pete

 
Reply With Quote
 
Hilton
Guest
Posts: n/a
 
      20th Nov 2007
Dude, chill. The guy asked how to write a histogram, I said use an array of
integers.

Hilton


"Peter Duniho" <(E-Mail Removed)> wrote in message
news:2007111919142216807-NpOeStPeAdM@NnOwSlPiAnMkcom...
> On 2007-11-19 16:56:41 -0800, "Hilton" <(E-Mail Removed)> said:
>
>> Peter,
>>
>>> A dictionary implementation will be faster than your suggested
>>> implementation.

>>
>> You misunderstood probably because I never explained it well enough.

>
> Yes, it's hard to know what a person is talking about when they don't
> explain themselves.
>
>>> If I understand your suggestion correctly, you are proposing creating an
>>> array that will be scanned searching for the index value, at which point
>>> the count can be updated.

>>
>> Nope, no scanning whatsoever.
>>
>> If you know you'll have say 1000 possible indices, then define an array
>> of
>> 1000 structs, then when you see a value of 605, you simply do
>> "array[605].count++" (i.e. your basic histogram code). I bet that is
>> faster
>> than any dictionary implementation.

>
> Yes, that's probably true. Based on your updated description, I'd agree
> the dictionary implementation will be slower. However, it will only be
> slower by a very tiny amount (assuming the rest of the code is doing
> anything interesting, it would be difficult to even measure the
> difference...the Dictionary class is still constant order just like your
> lookup-array), and your implementation will use a lot more memory than a
> dictionary would unless the range of values is completely used or at least
> nearly so.
>
> It also won't be flexible if the input data changes for any reason, since
> the implementation itself is so dependent on the input data.
>
> Finally, one could use the SortedDictionary class for simplicity. It
> would be much slower for the counting part of the algorithm (O(log n)
> instead of constant order), but when you're done counting the data would
> already be sorted. In the end, I'd be surprised if the total time spent
> was significantly different (for sure, the order of the algorithm would be
> the same: O(n log n) for all three variations).
>
> "Way faster" isn't a well-defined phrase, but I doubt that any reasonable
> interpretation of "way faster" would properly describe the minimal
> difference in performance between using a dictionary and your lookup-array
> proposal.
>
>> In fact, you don't even need the struct
>> if most of the work is 'reading' the value and the sorting doesn't need
>> to
>> be 'high-performance' - then you can define an array of ints and not
>> structs
>> and spend a little more time doing the sorting. To decide, we'd need to
>> understand more about the nature of the data.

>
> Well, personally I would almost never implement a design like the one you
> proposed. I certainly would never do it based solely on performance
> reasons. If my input data was a reasonably narrow range (say, 16-bit
> values or less), _and_ I expected the input values to cover the range with
> very few exceptions, then I might choose a solution like the one you
> proposed. But mainly because it's simpler, not because it might be a
> little faster.
>
> Frankly, when you write stuff like "Being a CF engineer, I'm always
> looking to optimize", you are REALLY insulting to those of us who don't
> write CF code. As if we aren't concerned with optimization as well.
> Furthermore, it's pretty clear from this thread and some others we've seen
> that you have a very specific and narrow idea of what "optimize" means,
> and if someone doesn't find your choice of trade-offs useful, you dismiss
> them as someone who isn't optimizing their code.
>
> The fact is, here you've suggested a solution that isn't really going to
> gain any significant performance, and which has its own costs that may not
> be appropriate for the situation (and in fact is likely to be
> inappropriate for most situations, _especially_ in CF where memory is at a
> much greater premium than on a desktop computer).
>
> Calling the dictionary solution "inefficient" is just plain dumb; it might
> not be quite as fast as a straight array lookup, but it's not like it's a
> _slow_ design. It's not "inefficient" at all. It's just a different way
> of approaching the problem and your implication that anyone who might use
> a dictionary implementation is writing inefficient code is very annoying.
>
> So, if you're wondering why your posts seem to generate what you think are
> "flames", you might take a good look at how you write your "suggestions"
> and whether they really qualify as "more efficient" anyway.
>
> Pete
>



 
Reply With Quote
 
 
 
Reply

Thread Tools
Rate This Thread
Rate This Thread:

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
database datatype bit what is the C# datatype? Jeff Microsoft ASP .NET 2 16th Apr 2007 09:26 PM
conversion date datatype to long datatype raju Microsoft Dot NET Compact Framework 3 9th Jun 2006 12:28 PM
Database datatype to c# datatype Peter Kirk Microsoft C# .NET 3 1st Jun 2006 09:12 AM
Boolean datatype column refuses to Copy to Bit datatype SQL Table =?Utf-8?B?RmlkZGVsbTM3NDI=?= Microsoft ADO .NET 1 30th May 2006 07:28 PM
The Active Directory datatype cannot be converted to/from a native DS datatype Hechmi Microsoft Dot NET 1 1st Mar 2004 07:15 AM


Features
 

Advertising
 

Newsgroups
 


All times are GMT +1. The time now is 06:11 AM.