what is the best datatype for..

  • Thread starter: calvert4rent
Jon said:
Why would you "definitely favour the array of ints"? You've already
suggested that with a range of size 100,000 you wouldn't use an array -
that could be the case with large streams.

In particular, would you "definitely favour the array of ints" without
benchmarking? If so, that's foolish.

You're combining two of my comments into one. When I refer to 'range size'
I mean whether all the values are 0-255, 0-65535, etc. So a huge network stream
will only have a range size of 0-255 if you look at the individual bytes.
Same goes with calculating a histogram of an 8-bit indexed image; e.g. GIF
or PNG8 which is a very common operation - look at (almost) any image
application. You think they use an array or a dictionary? If I have a huge
input stream and was only looking at bytes (i.e. range of 0-255), would I
"definitely favour the array of ints?" - absolutely. Would using a
dictionary be faster than "x[y]++" where x is an array of ints? And which
would be clearer to read? Jon, Peter, take that as my throwing down a
challenge. :)

Hilton
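
To make the "x[y]++" comparison concrete, here is a minimal sketch of the two approaches under discussion (illustrative only; it assumes the input is already in a byte[] and that System.Collections.Generic is imported):

// Array of ints: the byte value is the index, so counting is just "x[y]++".
static int[] CountBytesWithArray(byte[] buffer) {
    int[] counts = new int[256];           // one slot per possible byte value
    foreach (byte b in buffer) {
        counts[b]++;
    }
    return counts;
}

// Dictionary alternative: same result, but a lookup/insert per byte.
static Dictionary<byte, int> CountBytesWithDictionary(byte[] buffer) {
    var counts = new Dictionary<byte, int>();
    foreach (byte b in buffer) {
        int soFar;
        counts.TryGetValue(b, out soFar);  // leaves soFar as 0 if the key isn't there yet
        counts[b] = soFar + 1;
    }
    return counts;
}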
 
Hilton said:
You're combining two of my comments into one.

Um, you said in a single post:

"With large amounts of data (e.g. large bitmap or network streams) I
would definitely favour the array of ints".

How is that combining two comments?
When I refer to 'range size'
I mean whether all the values are 0-255, 0-65535, etc. So a huge network stream
will only have a range size of 0-255 if you look at the individual bytes.

You never suggested that "huge network stream" implies "we're only
examining a single byte at a time". Why can't a network stream be a
stream of longs, for instance?
Same goes with calculating a histogram of an 8-bit indexed image; e.g. GIF
or PNG8 which is a very common operation - look at (almost) any image
application. You think they use an array or a dictionary? If I have a huge
input stream and was only looking at bytes (i.e. range of 0-255), would I
"definitely favour the array of ints?" - absolutely. Would using a
dictionary be faster than "x[y]++" where x is an array of ints? And which
would be clearer to read? Jon, Peter, take that as my throwing down a
challenge. :)

In the case of only 256 entries, I'd probably go with the array as
well. That's not always the case with large network streams, nor with
images (which may well *not* be 8-bit indexed).
 
You're combining two of my comments into one. When I refer to 'range size'
I mean whether all the values are 0-255, 0-65535, etc. So a huge network stream
will only have a range size of 0-255 if you look at the individual bytes.
Same goes with calculating a histogram of an 8-bit indexed image; e.g. GIF
or PNG8 which is a very common operation - look at (almost) any image
application. You think they use an array or a dictionary? If I have a huge
input stream and was only looking at bytes (i.e. range of 0-255), would I
"definitely favour the array of ints?" - absolutely. Would using a
dictionary be faster than "x[y]++" where x is an array of ints? And which
would be clearer to read? Jon, Peter, take that as my throwing down a
challenge. :)

Challenge? Looks more like a straw man to me.

Jon's point appears to me to be that you wrote "with large amounts of
data I would definitely favor the array of ints", but there's really
nothing about the amount of data that predisposes it to being better
solved by an array of ints when that amount is large.

He's not combining two comments. The single statement is by itself
flawed. Even if you have a large amount of data, if the range is large
and the population within that range is sparse, an array of ints is not
likely to be a good solution.

I have already agreed that in many situations, I would prefer an array
of ints for its simplicity. Raising that as an arguing point doesn't
make any sense; we're already in agreement there.

The point of contention, which you seem to keep overlooking, is that
your original claim was that an array of ints is efficient while a
dictionary is NOT efficient. The claim is wrong twice: first, because
the dictionary isn't inefficient at all even in cases where it's slower
than the array of ints, and second because there are situations in
which a dictionary solution is actually more efficient than the array
of ints.

As far as readability goes, which is more readable depends on context.
While I think that the array of ints is likely to be the simpler, more
clear implementation in most cases, that doesn't mean it would be in
all cases. For example, using a SortedDictionary produces very simple
code IMHO because you never even have to have a separate section of
code to sort the data. Even a Dictionary may be found to be simpler
because of the ability to do compile-time type checking, allowing for
it to be used in a simpler way elsewhere.

Context is everything, and there's not enough context to claim a priori
that the array of ints is in fact the clearer implementation.

But besides, your contention has been that the array solution is
superior due to _performance_, not readability. And on that basis,
your claim has no firm foundation.

Pete
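
As a rough sketch of the SortedDictionary point (illustrative only; the key type and output are assumed), counting into a SortedDictionary<int, int> means the results already enumerate in key order, with no separate sorting pass:

// Illustrative sketch: the enumeration order is the key order "for free".
static void CountAndPrint(IEnumerable<int> values) {
    var counts = new SortedDictionary<int, int>();
    foreach (int v in values) {
        int soFar;
        counts.TryGetValue(v, out soFar);
        counts[v] = soFar + 1;
    }
    foreach (KeyValuePair<int, int> pair in counts) {
        Console.WriteLine("{0}: {1}", pair.Key, pair.Value);   // already sorted by key
    }
}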
 
Jon,

Yet another thread where being in the same room would resolve it in 2
minutes but NG posts result in ambiguity etc. I agree I wasn't specific
enough with 'large network stream'; that's why I clarified it. You and I are
agreeing here that small range, lots of data would imply array of int usage;
sparse and/or big-range data would imply Dictionary or other non-array
solution. And as we both keep saying/implying, the nature of the data
ultimately determines the course of action.

Cheers,

Hilton
 
Hilton said:
Again, this is for performance reasons and I'll bet it'll be way faster
than dictionaries etc. Of course, only do this extra work if you need the
performance and you're not just running a one-off process.

Well, as both Jon and Peter have pointed out, your array of ints has some
potential problems. It's not a clear-cut "This is better across the board"
type of decision. It's likely better in many cases, but not all. Sadly, that
always seems to be the case...

Here's one more piece of fuel for the fire:
If you allocate a large array of ints, it's going to be allocated on the
Large Object Heap. This may have some long-term implications for your
process, as the LOH is never compacted. In the limited-memory environment of
a mobile platform, this impact may be much larger than it would be on a
desktop system. For long-term stability, something more linked-list based may
be better, as it won't have the appearance of leaking memory...
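
For a rough sense of scale on the desktop CLR (the 85,000-byte LOH threshold is documented; the rest is simple arithmetic): an int[] crosses that threshold at about 85,000 / 4 ≈ 21,250 elements, so a full 0-65535 histogram (65,536 ints ≈ 256 KB) would land on the LOH, while a 0-255 histogram (1 KB) would not.
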
I get this feeling of pending flames - it always seems to happen when I
suggest more efficient techniques. :)

Ya know, the most efficient technique would be to scan the values, and see if
you can further compact them. You should run something like the PunyCode
algorithm over them, so as to guarantee that all of your values are in a
default range. Then you can fit, on average, 2 values into the space that
was taken by 1. You could then (in memory) ZIP the punycoded array, and
compact it by a factor of 5:1. This would give you an overall memory saving
of nearly 10:1.

For further savings, you could dump the values to the screen, and ask the
end-user to remember them for a brief amount of time. Then run a sliding
visual window through, and have the user enter the values. This would let
you completely eliminate the array from memory, thereby significantly
decreasing your hardware specs and opening up your target market.

For long term storage, you should use the built-in IR channel to dump the
array to a printer in a 3D barcode format. When you need to recover the
data, use the built-in camera on the phone as a scanner, and voila, you have
reduced footprint even more.

If you're really brave, see if you can set up a harmonic based on the 802.11
receivers built into many of the Windows Mobile devices. With proper
encoding, you could offload everything to the wireless....

(It's a slow day. Honestly though, there is no cut-and-dried answer.... :) )
 
-----Original Message-----
*snip*
For further savings, you could dump the values to the screen, and ask the
end-user to remember them for a brief amount of time. Then run a sliding
visual window through, and have the user enter the values. This would let
you completely eliminate the array from memory, thereby significantly
decreasing your hardware specs and opening up your target market.
(It's a slow day. Honestly though, there is no cut-and-dried answer.... :) )

Nice one Chris, I love that idea. :)

You could even pretend it's a game for the user to test his memory :P.
 
Jon,

Yet another thread where being in the same room would resolve it in 2
minutes but NG posts result in ambiguity etc.

I disagree that that's the root of the disagreement here. Hopefully
you can handle the recursion. :)
I agree I wasn't specific
enough with 'large network stream'; that's why I clarified it.

No, you accused Jon of "combining two of [your] comments into one".
That's not clarification, that's finger-pointing.
You and I are
agreeing here that small range, lots of data would imply array of int usage;

I'm not sure that Jon would agree that "lots of data" has anything to
do with it. I know I don't. The amount of data is essentially
irrelevant; the distribution of the data is what matters, along with
code maintainability (something that is almost always in direct
conflict with squeezing the last few percentage points of performance
out of some code).
sparse and/or big-range data would imply Dictionary or other non-array
solution. And as we both keep saying/implying, the nature of the data
ultimately determines the course of action.

No disagreement there. But you continue to refuse to acknowledge and
retract your false statement at the outset, claiming that using a
dictionary implementation would be inefficient. You've simply shifted
your stance to a point where you can claim a lack of disagreement,
silently abandoning your original premise.

There are some scenarios in which an array implementation would be
slightly more efficient (not "way faster" as you wrote), but that's not
a given and in no case would a dictionary implementation actually be
considered _inefficient_.

Pete
 
But nevertheless, worth knowing. Now to see if I can write a more
efficient LINQ aggregator ;-p

Aside, but I did manage to cobble something together that aggregates
efficiently while streaming, allowing usage as below... it uses a
Dictionary<TKey, ...> approach, and will have a little overhead
compared to doing it manually, but it is a little more expressive (but
works for LINQ-to-objects only; it won't parse down to the database!)

If anybody wants the source let me know (extensions are to
IEnumerable<T>).

Marc

var query = data.GroupAggregate(
        // grouping key (with optional comparer)
        x => x.Type,
        // "params" of desired aggregates
        data.GroupMin(x => x.SomeValue),
        data.GroupSum(x => x.SomeValue),
        data.GroupAvg(x => x.SomeValue),
        data.GroupCount(),
        data.GroupMax(x => (decimal) x.SomeOtherValue)
    )
    // could further compose the query via LINQ, but this will do...
    .OrderByDescending(x => x.Key).ThenBy(x => x.Value[0]);

// look at the results; the key is in x.Key; x.Value is a decimal[] of the accumulators
foreach (var x in query) {
    Console.WriteLine("{0}: {1}\t{2}\t{3}\t{4}",
        x.Key,
        x.Value[0], x.Value[1], x.Value[2], x.Value[3]);
}
 
Chris said:
Well, as both Jon and Peter have pointed out, your array of ints has some
potential problems. It's not a clear-cut "This is better across the board"
type of decision. It's likely better in many cases, but not all. Sadly,
that always seems to be the case...

Chris, I totally agree (as mentioned in previous posts). If you look at the
OP, the numbers mentioned were: 1,4,6,23,6,2,54,1,6,23,2,4,1,6, hence my
suggestion for using the array.

I think that if there was always one best solution, our skill/art/work
wouldn't be as challenging and rewarding. :)

Hilton
 
Marc Gravell said:
Aside, but I did manage to cobble something together that aggregates
efficiently while streaming, allowing usage as below... it uses a
Dictionary<TKey, ...> approach, and will have a little overhead
compared to doing it manually, but it is a little more expressive (but
works for LINQ-to-objects only; it won't parse down to the database!)

If anybody wants the source let me know (extensions are to
IEnumerable<T>).

Just to say I've just implemented this myself, or rather a variation,
due to something else I'm writing at the moment, where I want to group
and count. Here's what the unit tests look like, so you can see the
kind of thing it does. Like Marc, let me know if you'd like it...
(It'll be part of MiscUtil when I get round to it.)

Note that my code only does group counting - no summing etc, at the
moment. I'm not sure whether either CountGroup or Marc's GroupCount
names really capture the functionality nicely. I did call mine
"CountBy" for a while... not sure whether that's better or not...

[Test]
public void CountGroupIdentity()
{
    var query = new[] { "One", "Two", "Two", "Three" }.CountGroup();

    query = from element in query
            orderby element.Count, element.Key
            select element;

    AssertAreEqual(new[] { 1, 1, 2 },
                   query.Select(elt => elt.Count));
    AssertAreEqual(new[] { "One", "Three", "Two" },
                   query.Select(elt => elt.Key));
}

[Test]
public void CountGroupWithMapping()
{
    var query = new[] { "a", "b", "c", "aa", "bb",
                        "111", "222", "333", "444" }
                .CountGroup(x => x.Length);
    query = from element in query
            orderby element.Count
            select element;

    AssertAreEqual(new[] { 2, 3, 4 },
                   query.Select(elt => elt.Count));
    AssertAreEqual(new[] { 2, 1, 3 },
                   query.Select(elt => elt.Key));
}

[Test]
public void CountGroupWithMappingAndComparison()
{
    var query = new[] { "a", "A", "a", "B", "b" }
                .CountGroup(x => x, StringComparer.OrdinalIgnoreCase);
    query = from element in query
            orderby element.Count
            select element;

    AssertAreEqual(new[] { 2, 3 },
                   query.Select(elt => elt.Count));
    AssertAreEqual(new[] { "B", "a" },
                   query.Select(elt => elt.Key));
}
 
(I'm assuming that this is a single-pass counter that discards the
elements and keeps the aggregates)

The group scenario is such a common case that an *efficient*
aggregator probably is worthy of a place in MiscUtil. The more general
case (multiple concurrent aggregates) gets very scrappy very quickly,
but a single-purpose "does one thing well" is a good thing. And IMO
perhaps a very good educational aid for people wanting to understand
LINQ-to-objects. As for the names - it stumped me too ;-p I didn't
stress over it...

It almost seems a shame that this common case didn't make it into the
base-libs so that the same code could port to the database too... but
I guess it helps to keep things simple.

Now... to get some sleep...

Marc
 
Marc Gravell said:
(I'm assuming that this is a single-pass counter that discards the
elements and keeps the aggregates)
Yup.

The group scenario is such a common case that an *efficient*
aggregator probably is worthy of a place in MiscUtil. The more general
case (multiple concurrent aggregates) gets very scrappy very quickly,
but a single-purpose "does one thing well" is a good thing.

Right. I may get round to doing sum/min/max etc at some point. Not just
yet though :)
And IMO perhaps a very good educational aid for people wanting to understand
LINQ-to-objects. As for the names - it stumped me too ;-p I didn't
stress over it...
:)

It almost seems a shame that this common case didn't make it into the
base-libs so that the same code could port to the database too... but
I guess it helps to keep things simple.

Indeed. Another thing that I'm rather amazed isn't in the framework is
a way of reading a file, line by line, as an IEnumerable<string>. It's
simple to implement, but *so* handy. Again, it'll be in MiscUtil next
time I do a release (the code's there already, and is more flexible
than just files, of course).
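
For anyone who hasn't seen the trick, a minimal sketch of that kind of line reader (illustrative only; the MiscUtil version is more flexible, as noted):

// Illustrative sketch; assumes using System.Collections.Generic and System.IO.
static IEnumerable<string> ReadLines(string path) {
    using (StreamReader reader = File.OpenText(path)) {
        string line;
        while ((line = reader.ReadLine()) != null) {
            yield return line;   // lazily yields one line at a time
        }
    }
}

// Usage: the result plugs straight into LINQ-to-objects, e.g.
// int errors = ReadLines("app.log").Count(line => line.Contains("ERROR"));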
 
Right. I may get round to doing sum/min/max etc at some point. Not just
yet though :)

Brain dump...

For info, I think I've reached the conclusion that the optimal
approach here is to get all the common aggregates (and offer no
bespoke), rather than worry about individual aggregate functions
(which make for a lot of overhead) - but the main difficulty arises
from generics not supporting operators like plus (for the accumulator)
and divide (for an average). Min/max are slightly easier thanks to
Comparer<T>.Default. The simplest (but less pleasing?) approach is to
cheat and use "decimal" or similar. As a middle level, support for the
common types could be offered by a private factory, i.e. the generic
Accumulator<TSource, TValue> might need a "Func<TValue, TValue,
TValue> add" and a "Func<TValue, int, TValue> quotient" (just for
Average at the end), which a factory could supply by switching on TValue and
using specific Int32, Single, etc. functions. At the far end, the other thing
I had in mind to look at was building an Expression of the same
(probably in a generic static for cache) and compiling it... certainly
easier than using Reflection.Emit, and ?should? cover the wider range
of types?

And of course the other problem is how to offer the different
aggregates to the caller if they want aggregates on multiple
concurrent properties, with different types... again "decimal" makes
this simple but feels unsatisfying - but the alternatives... boxing?
or an IAccumulator with generic wrapper methods like Min<T>() (and
hope the user asks for the same T as TValue)...

But I have other things that are more pressing, and no real /need/ to
look at the above, other than academic interest. I'll probably take
another look at some point...

Marc
 
First off - apologies to the group; this is turning into a bit more
free-thought/discussion than question/answer. As such, it occurs to me that
I should really start blogging this stuff instead (as a more
appropriate forum). If anybody can recommend an appropriate blog site,
I'd be interested...

I looked into the Expression stuff above, and upon consideration I
believe it is a pretty good solution for that old chestnut: how to do
maths with generics. The following works, and is easily extended to
other operators... the generated delegate is fully cached etc...

This has some application to the LINQ / grouping scenario above, but I
suspect it has a lot *more* uses, and it looks like it should
automatically support "JoesRandomStruct" etc...

Marc

using System;
using System.Linq.Expressions;

namespace ConsoleApplication1 {
    class Program {
        static void Main() {
            var data = new[] { 1.2, 3.1, 2.9 };
            Test(data);
            /*double i = 0;
            int j = 1;
            double x = i / j;*/
        }
        static void Test<T>(T[] args) {
            T sum = default(T);
            int count = 0;
            foreach (T t in args) {
                sum = MathOperator<T>.Add(sum, t);
                count++;
            }
            T avg = MathOperator<T>.DivideInt32(sum, count);
        }
    }
}

static class MathOperator<T> {
    // cached compiled delegates; one set per closed generic type T
    private static Func<T, T, T> add, divide;
    private static Func<T, int, T> divideInt32;

    static Func<TLhs, TRhs, TReturn> BuildBinary<TLhs, TRhs, TReturn>(
            Func<Expression, Expression, BinaryExpression> method) {
        ParameterExpression lhs = Expression.Parameter(typeof(TLhs), "lhs"),
                            rhs = Expression.Parameter(typeof(TRhs), "rhs");
        try {
            return Expression.Lambda<Func<TLhs, TRhs, TReturn>>(
                method(lhs, rhs), lhs, rhs).Compile();
        } catch (InvalidOperationException) {
            // try casting rhs to the lhs type (e.g. the int count when T is double)
            Expression castRhs = Expression.Convert(rhs, typeof(TLhs));
            return Expression.Lambda<Func<TLhs, TRhs, TReturn>>(
                method(lhs, castRhs), lhs, rhs).Compile();
        }
    }
    public static T Add(T arg1, T arg2) {
        if (add == null) {
            add = BuildBinary<T, T, T>(Expression.Add);
        }
        return add(arg1, arg2);
    }
    public static T Divide(T arg1, T arg2) {
        if (divide == null) {
            divide = BuildBinary<T, T, T>(Expression.Divide);
        }
        return divide(arg1, arg2);
    }
    public static T DivideInt32(T arg1, int arg2) {
        if (divideInt32 == null) {
            divideInt32 = BuildBinary<T, int, T>(Expression.Divide);
        }
        return divideInt32(arg1, arg2);
    }
}
 
Marc Gravell said:
First off - apologies to the group; this is turning into a bit more
free-thought/discussion than question/answer.

Fine by me - although I've got nothing to add to the particular
suggestion you've presented here.

What I've been thinking about is a sort of "Split" method which could
take a further query expression as its parameter, and apply that query
expression to each group within a set of results. That way we wouldn't
need to write "group and count", "group and sum" etc - just split,
passing count, sum or whatever as the parameter.

Unfortunately, I can't get this to work in my head with IEnumerable<T>
without creating one thread per group - we get into stack difficulties,
effectively. (Or you have to read the whole lot and cache it first,
which is what we're trying to avoid.) It may not be possible, but that
would certainly be a pity... it's *logically* feasible, but LINQ
doesn't help us as much as we might want.
 
What I've been thinking about is a sort of "Split" method which could
take a further query expression as its parameter, and apply that query
expression to each group within a set of results. That way we wouldn't
need to write "group and count", "group and sum" etc - just split,
passing count, sum or whatever as the parameter.

That is exactly what I tried to do in my Nov 21 implementation... the
"GroupAggregate" method /is/ the split... the "Sum" etc *is* a
parameter (in fact a set of "params" parameters). Each of these "Sum"
etc parameters was a seed plus a Func<Func<...>> - i.e. a factory
method that returned an accumulator method. The first time a new
branch is found, it invokes the *outer* method to get hold of the
accumulator method to use for that branch... which might be a common
one, or might be stateful. For existing branches it simply cranks the
handle on the accumulator it got previously. This means it only stores
one state per branch (group) [per accumulator].

I'm not entirely happy with my implementation (in fact I intend to
redo from scratch - it passes the selector too far into the stack and
uses "decimal", etc), but unless I've misunderstood your post we are
thinking along the same lines. Again, while I think the concept is OK,
I stress that I'm not very happy with the current /code/, hence I'm
not wildly posting it here in case it just adds confusion, but I can e-
mail it if you are interested.

At the moment the crank-handle drives aggregates (counters, sum, etc),
but the same approach (i.e. a factory returning new crank-handles for
each group) could theoretically drive any operation that just needs
the group-key and some prior "so far" state - so it could start
progressing the first group even though the last entry (for that
group) is at the end of the stream.

Marc
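
Roughly, the shape being described (an illustrative sketch, not Marc's actual GroupAggregate code; the names and types are assumed, and it drives a single aggregate for brevity):

// Illustrative sketch only: each aggregate is a seed plus a factory that
// hands out a fresh accumulator delegate per group (branch).
sealed class AggregateSpec<TSource> {
    public decimal Seed;
    public Func<Func<decimal, TSource, decimal>> AccumulatorFactory;
}

static Dictionary<TKey, decimal> RunOneAggregate<TSource, TKey>(
        IEnumerable<TSource> data,
        Func<TSource, TKey> keySelector,
        AggregateSpec<TSource> spec) {
    var state = new Dictionary<TKey, decimal>();
    var accumulators = new Dictionary<TKey, Func<decimal, TSource, decimal>>();
    foreach (TSource item in data) {
        TKey key = keySelector(item);
        Func<decimal, TSource, decimal> acc;
        if (!accumulators.TryGetValue(key, out acc)) {
            // first time this branch is seen: ask the factory for its accumulator
            acc = spec.AccumulatorFactory();
            accumulators[key] = acc;
            state[key] = spec.Seed;
        }
        // existing branch: just crank the handle, keeping one state per group
        state[key] = acc(state[key], item);
    }
    return state;
}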
 
but unless I've misunderstood your post we are
thinking along the same lines.

<snip>

I don't *think* so. As far as I can see, your aggregation works on a
single value at a time, so you can't use the existing extension
methods. I was thinking of something like:

var query = data.Split(x => x.Count());

foreach (var group in query)
{
    Console.WriteLine(group.Key + ": " + group.Result);
}

"x" in the first part would either be an IEnumerable<T> of the same
kind as data, or possibly IGrouping<T> to allow the lower levels to
access the key if they wanted - but Count() is the standard
Enumerable.Count().
 
Ah right, so each branch yields an IEnumerable over the source
entities for that key...

I see what you mean... the thread approach is fine until you get lots
of groups...

Marc
 
Marc Gravell said:
Ah right, so each branch yields an IEnumerable over the source
entities for that key...

I see what you mean... the thread approach is fine until you get lots
of groups...

Indeed. It feels like it *should* be feasible, but basically we need to
push the data rather than pulling it. Very annoying.
 
Well, for data where we know the grouping key is ascending (to use
your log-file example, grouping per day or per week) there is an
obvious single-threaded optimisation, since we can simply "yield
break" when we find the key changes, and start a new IEnumerable<T>.

In the more general case... for small volume the inbuilt (Lookup<,>
based) code suffices, so I'm thinking of the worst case high-volume.
In this case, *generally* the number of records is order(s) of
magnitude greater than the number of groups. I'm excluding scenarios
like "group orders by customer" where each customer only has a small
number of orders (perhaps a database is the answer there!); but things
like "group site hits by [major] browser-flavor" would be perfectly
reasonable.

In such volume cases, pre-loading everything is a non-starter, and the
simplest non-cached drip of data (over m=groups threads) would lead to
a lot of idle threads, but there is perhaps a middle-ground for those
cases where you want a bit more parallelism (rather than 20 threads
only 1 of which is doing anything). If you allocated each group a
*small* queue (either a Queue<T> or T[] or whatever), you would then
hopefully find (assuming reasonable distribution) that *most* of the
time there is space in the queue for the item you are adding. If full
it could Monitor.Wait on the group's sync-lock until there is space to
add. If the queue was empty prior to adding, it could Monitor.Pulse the
group. And the reverse for the enumerator: while the queue is empty
then Monitor.Wait; if the queue was full prior to dequeuing then
Monitor.Pulse. And some close-down test for the end-of-stream.
Essentially parallel single-producer/single-consumer queues, but where
the producer is writing to multiple such queues.

Obviously pool-threads would be risky due to saturation, but regular
Threads could be used (perhaps at lower priority than the thread that
is reading the stream).

You've probably already considered this, though...

Marc
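
For anyone following along, a minimal sketch of the per-group bounded queue described above (illustrative only; end-of-stream signalling and error handling are omitted, and a real version needs the close-down test Marc mentions; assumes using System.Collections.Generic and System.Threading):

// Illustrative single-producer/single-consumer bounded queue using Monitor.Wait/Pulse.
sealed class BoundedQueue<T> {
    private readonly Queue<T> queue = new Queue<T>();
    private readonly int capacity;
    private readonly object sync = new object();

    public BoundedQueue(int capacity) { this.capacity = capacity; }

    public void Enqueue(T item) {          // called by the single reader thread
        lock (sync) {
            while (queue.Count == capacity) {
                Monitor.Wait(sync);        // queue full: wait for the consumer
            }
            queue.Enqueue(item);
            if (queue.Count == 1) {
                Monitor.Pulse(sync);       // was empty prior to adding: wake the consumer
            }
        }
    }

    public T Dequeue() {                   // called by the per-group worker thread
        lock (sync) {
            while (queue.Count == 0) {
                Monitor.Wait(sync);        // queue empty: wait for the producer
            }
            T item = queue.Dequeue();
            if (queue.Count == capacity - 1) {
                Monitor.Pulse(sync);       // was full prior to dequeuing: wake the producer
            }
            return item;
        }
    }
}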
 
