are arrays contiguous in memory?


Arnaud Debaene

Peter said:
Wow, you ask a simple question..... : )

As I understand it, the 'old' STL containers make a distinction in
this regard. Lists are linked lists, and so are not necessarily
contiguous in memory. Vectors are the STL equivalent of contiguous
lists. The reason for both is that vectors are better at lookup
performance (randomly accessible, and therefore a fixed length of
lookup time), but suffer when being deleted from or resized, since
these operations are often done by moving the entire end of the list, or by
allocating a bigger vector somewhere else and copying everything over,
respectively. Lists are worse at lookup performance since they
require serial lookup (linking through the list from the head,
which takes a variable length of time based on 'distance' from
the head), but benefit when it comes to deletion and resizing, since an
element can be removed without moving anything else, and resizing
just means finding a large enough memory space anywhere for the added
element.
And there is also an intermediate "mix" between the two: the deque.
My opinion is this: with a faster computer and more memory it really
doesn't matter which method is used! : )
This is of course basically wrong: processor speed has never changed
anything about algorithmic complexity. When manipulating a great number of
elements (several tens of thousands), the difference between logarithmic and
linear complexity is as important today as it was 10 years ago.
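
To make the contiguity point from the thread title concrete, here is a minimal C++ sketch (the container size is arbitrary, and std::list is walked with std::advance since it has no operator[]):

#include <vector>
#include <list>
#include <iterator>
#include <cassert>

int main()
{
    std::vector<int> v( 100000, 42 ) ;
    // A vector's storage is one contiguous block: element i lives at &v[0] + i,
    // which is why indexed access is constant time.
    assert( &v[50000] == &v[0] + 50000 ) ;

    std::list<int> l( 100000, 42 ) ;
    // A list has no random access: reaching element i means following i links,
    // which takes time proportional to i no matter how fast the processor is.
    std::list<int>::iterator it = l.begin() ;
    std::advance( it, 50000 ) ;

    return 0 ;
}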

Arnaud
MVP - VC
 

Peter Oliphant

This is of course basically wrong: processor speed has never changed
anything about algorithmic complexity. When manipulating a great number of
elements (several tens of thousands), the difference between logarithmic
and linear complexity is as important today as it was 10 years ago.

Since in my day I've had to write 'clever' code to combat slow computer
speeds, I have to disagree with any blanket statement along the lines of
"processor speed has never changed anything about algorithmic complexity".

For example, this is very algorithmically simple:

int Square( int X )
{
    return X * X ;
}

This however, is not:

int Square( int X )
{
    if ( X < 0 ) { X = -X ; }
    int Y = 1 ;
    int DY = 3 ;
    for ( int i = 1 ; i < X ; i++ )
    {
        Y += DY ;
        DY += 2 ;
    }
    return Y ;
}

These methods both do the same thing (if you don't believe me, test them!).
But, in the 'old days' the former method would have been the wrong way to go
for small values of X, since in those days people counted clock cycles,
and the multiply instruction took about 50-100 times longer than the add
instruction.

But since computers, through pipelining and other methods, have gotten adds and
multiplies down to the same 1-cycle speed, nowadays the algorithmically
simpler method is the way to go. That's because today the algorithmically
simpler method executes faster than the 'clever' one (and as a side benefit
saves memory, is more intuitive, is less prone to error, and can be
programmed in less time).

So, theoretically you are correct. In practice, you're not-so-correct... : )

[==P==]

PS - A third method would be to create a lookup table by pre-calculating all
the squares and looking them up, which works even for large values of X. This is
even more algorithmically complex, as it requires a setup phase. But look-up
tables were the norm until a few years ago, when computers got fast enough to
do things 'on the fly'. This is why we can entertain 'procedural textures'
in 3D worlds, something that would have been algorithmic suicide to attempt
when computers were slower and less efficient at dealing with
graphics...
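
For what it's worth, here is a minimal sketch of that third, table-driven approach (the table size of 256 is an arbitrary choice for illustration):

// Precompute the squares once, then answer each call with a single lookup.
static int g_squares[256] ;

void InitSquares()
{
    for ( int i = 0 ; i < 256 ; i++ )
    {
        g_squares[i] = i * i ;
    }
}

int SquareLookup( int X )
{
    if ( X < 0 ) { X = -X ; }
    return g_squares[X] ;   // assumes 0 <= X < 256
}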
 

alextha

One thing that is missing from this discussion is the importance of cache effects when iterating. If your memory accesses are distributed all over memory, as they might be with a linked list that has been sorted, then your cache miss rate will be significantly higher than if you iterate over a contiguous block of memory. This effect is drastic because L1 cache can be well over 20 times faster than RAM.

Another thing here is the fact that modern compilers have extremely good optimizations for looping over contiguous arrays because the compiler knows that it is working with contiguous memory.

That said, depending on what you are doing, it is not always worth it to sacrifice an order of complexity on an algorithm just to get these benefits. The two things I describe above are only constant-factor improvements, while dispersed-memory containers (such as linked lists) can sometimes give you order-N or greater improvements over contiguous arrays.
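
As a rough way to see the cache effect in practice, here is a sketch (the sizes are arbitrary, and a freshly built list may still be fairly local in memory, so the gap is most dramatic once the nodes have been scattered by insertions or sorting):

#include <vector>
#include <list>
#include <numeric>
#include <ctime>
#include <cstdio>

int main()
{
    const int N = 10000000 ;
    std::vector<int> v( N, 1 ) ;   // one contiguous block
    std::list<int> l( N, 1 ) ;     // one heap node per element

    std::clock_t t0 = std::clock() ;
    long sv = std::accumulate( v.begin(), v.end(), 0L ) ;
    std::clock_t t1 = std::clock() ;
    long sl = std::accumulate( l.begin(), l.end(), 0L ) ;
    std::clock_t t2 = std::clock() ;

    std::printf( "vector: sum %ld in %ld ticks, list: sum %ld in %ld ticks\n",
                 sv, (long)( t1 - t0 ), sl, (long)( t2 - t1 ) ) ;
    return 0 ;
}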

We kind of got off topic from the original question but this is a good discussion!

-Alex

 

Brian Muth

These methods both do the same thing (if you don't believe me, test
them!). But, in the 'old days' the former method would have been the wrong
way to go for small values of X, since in those days people counted
clock cycles, and the multiply instruction took about 50-100 times longer
than the add instruction.

But since computers, through pipelining and other methods, have gotten adds
and multiplies down to the same 1-cycle speed, nowadays the
algorithmically simpler method is the way to go. That's because today the
algorithmically simpler method executes faster than the 'clever' one (and
as a side benefit saves memory, is more intuitive, is less prone to error,
and can be programmed in less time).

So, theoretically you are correct. In practice, you're not-so-correct...
: )

This argument is entirely irrelevant.

Arnaud has correctly pointed out that lookups in lists are O(n), whereas
lookups in vectors are constant time. Advances in processor speeds
don't change this. Advances in floating point arithmetic don't change this.
He has correctly challenged your statement that "with a faster computer and
more memory it really doesn't matter which method is used!". Indeed, for
large amounts of data, it matters a lot.

Best regards,

Brian
 

Holger Grund

Brandon Bray said:
Of course, that's a trade off between safety and performance. One should
be very careful when making that decision. Because the JIT compiler has
the ability to eliminate bounds checks in some cases, switching to
pointers is not always a worthwhile exercise. Though, it is definitely
useful in some cases where bitmap manipulation is done. This is probably
the only reason why interior pointers exist in the new C++. If it wasn't
for this case, I probably would have succeeded at keeping them out of the
language entirely.
Ok, maybe I've had too many Islays today. But you can't be serious about that,
can you?

Am I missing something, or are you saying (plain) pointers are (and have
always been) useless? What about by-ref params in managed signatures?

-hg
 

Tamas Demjen

Willy said:
That's true, though the overhead is real small (something like a cmp and a
jmp) and only important to consider for some operations on arrays.

In most cases it is small, but if you have 200 million bytes to iterate
through in a large scanned image (let's say to apply a lookup table to
each pixel), it's better to go back to unmanaged. It's not that hard to
make such operations pretty safe. So it's good to know that a pinned
array is contiguous. In an average algorithm with 100 items in an array
the performance most likely doesn't matter and it's better to be on the
safe side.
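
A minimal C++/CLI sketch of the pinning idea, assuming the image lives in a managed Byte array (the names and sizes here are made up for illustration):

int width = 10000, height = 20000 ;
array<System::Byte>^ image = gcnew array<System::Byte>( width * height ) ;
array<System::Byte>^ lut = gcnew array<System::Byte>( 256 ) ;

// Pinning keeps the GC from moving the (contiguous) array while we hold raw pointers into it.
pin_ptr<System::Byte> pinImage = &image[0] ;
pin_ptr<System::Byte> pinLut = &lut[0] ;
unsigned char* p = pinImage ;
unsigned char* table = pinLut ;

for ( int i = 0 ; i < image->Length ; i++ )
{
    p[i] = table[ p[i] ] ;   // per-pixel lookup with no bounds checks
}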

Tom
 

David Wilkinson

Peter said:
Since in my day I've had to write 'clever' code to combat slow computer
speeds, I have to disagree with any blanket statement along the lines
of "processor speed has never changed anything about algorithmic complexity".

For example, this is very algorithmically simple:

int Square( int X )
{
    return X * X ;
}

This however, is not:

int Square( int X )
{
    if ( X < 0 ) { X = -X ; }
    int Y = 1 ;
    int DY = 3 ;
    for ( int i = 1 ; i < X ; i++ )
    {
        Y += DY ;
        DY += 2 ;
    }
    return Y ;
}

These methods both do the same thing (if you don't believe me, test them!).
But, in the 'old days' the former method would have been the wrong way to go
for small values of X, since in those days people counted clock cycles,
and the multiply instruction took about 50-100 times longer than the add
instruction.
[snip]

Peter:

Cute, but actually your function does not work for X=0. A correct (and
likely more efficient) version of the same idea is:

int Square( int X )
{
    if ( X < 0 ) { X = -X ; }
    int Y = 0 ;
    for ( int i = 1 ; i < X ; i++ )
    {
        Y += i ;
    }
    return Y + Y + X ;
}

A simple way to see that this is correct is to observe that Y is a sum
of (X-1) terms whose average is X/2 (average of first and last term).
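In other words, Y = X(X-1)/2, so the function returns 2Y + X = X(X-1) + X = X*X. As a quick check with X = 5: Y = 1 + 2 + 3 + 4 = 10, and 10 + 10 + 5 = 25.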

David Wilkinson
 

Peteroid

He has correctly challenged your statement that "with a faster computer
and more memory it really doesn't matter which method is used!". Indeed,
for large amounts of data, it matters a lot.

Except this statement is taken out of context. My ORIGINAL statement was
made in an obviously jovial manner (note the smiley at the end, which has since
been omitted), it was about the SPECIFIC problem regarding contiguous memory (the
topic of this thread), and the 'challenge' was then made in the GENERAL
case. That's not fair... : )

To be serious, we are talking about two different things here. I'm
talking about how processor speed makes a big difference in algorithmic
design choices in most real-world applications. You guys are talking about
computer science. Apples and IBMs - er - oranges...

[==P==]
 

Rodrigo Corral [MVP]

I agree that pin is not needed.
The point in the article is that you get better performance using pointer
arithmetic, so in some scenarios using pointer arithmetic is valuable.

Carl Daniel wrote:

"Not of much (any?) value unless you're writing unmanaged code."

I wanted to show that pointer arithmetic is valuable even in managed code.

--
Regards,
Rodrigo Corral González [MVP]

FAQ de microsoft.public.es.vc++
http://rcorral.mvps.org
 

Peter Oliphant

Dang, you're right! I needed to add the line at the top:

if ( X == 0 ) { return 0 ; }

And your Square( ) method is cool too! : )

[==P==]

 

Willy Denoyette [MVP]

Tamas Demjen said:
In most cases it is small, but if you have 200 million bytes to iterate
through in a large scanned image (let's say to apply a lookup table to
each pixel), it's better to go back to unmanaged. It's not that hard to
make such operations pretty safe. So it's good to know that a pinned array
is contiguous. In an average algorithm with 100 items in an array the
performance most likely doesn't matter and it's better to be on the safe
side.

Tom

Tom,

It all depends on the time you spend on the lookup table per pixel: say
each pixel requires 8 instructions; then, agreed, the two-instruction
overhead is important. However, if it takes 40 instructions per pixel (not to
forget the high number of L1 and L2 cache misses, typical for such large
arrays), the overhead is just negligible. Note that you are talking about a
pinned array, but I'm pretty sure your scanned image will reside in
unmanaged heap memory, so pinning isn't applicable here. Also, if you are
building your lookup table in the native heap and you are looking for the
highest performance, I would go for pure native code and forget about
managed code.
Note, as I said in another reply, arrays are always contagious (pinned or
not), if they aren't they are no arrays.

Willy.
 

Peter Oliphant

emphasis mine:
Note, as I said in another reply, arrays are always **contagious** (pinned
or not), if they aren't they are no arrays.

AHHH! That might explain why I think I'm getting the flu! LOL

[==P==]

PS - I'm the KING of typos, so don't feel too bad... : )
 

Willy Denoyette [MVP]

Rodrigo Corral said:
I agree that pin is not needed.
The point in the article is that you get better performance using pointer
arithmetic, so in some scenarios using pointer arithmetic is valuable.

Yes, better performance, but compared to what? What is done in this article's
code (C#) is "compensating" for the inherently slow GDI+ access to image frame
buffers by using native pointer accesses (by the way, there is no other way
to access native heap memory from managed code). IMO if you feel that you
need to resort to "unsafe" constructs like this (in C#), you have probably
chosen the wrong language :).

Willy.
 

Tamas Demjen

Willy said:
It all depends on the time you spend on the lookup table per pixel: say
each pixel requires 8 instructions; then, agreed, the two-instruction
overhead is important

Extra array bounds checking matters with very simple algorithms (such
as applying a lookup table, an FIR filter, or an FFT). Even the difference
between pointer and index arithmetic matters. array[i++] requires two
registers, while *p_array++ requires only one. The Intel processors have
a very unfortunate design with especially few registers, and in a triple
nested loop it may make a significant difference. Often, if you can
eliminate a register at the price of some extra calculation (such as * 2
or + Const), overall you may still gain speed. I wish the CPU had at
least twice as many registers -- image processing could be 25% faster.
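
As a sketch of the two loop styles being compared (the buffer and table names are made up, and whether the pointer form actually wins depends on the compiler and target):

// Indexed form: the compiler has to keep both the base pointer and the index live.
void ApplyLutIndexed( unsigned char* buf, const unsigned char* lut, int n )
{
    for ( int i = 0 ; i < n ; i++ )
    {
        buf[i] = lut[ buf[i] ] ;
    }
}

// Pointer form: a single pointer is advanced, freeing one register in the loop.
void ApplyLutPointer( unsigned char* buf, const unsigned char* lut, int n )
{
    for ( unsigned char* end = buf + n ; buf != end ; ++buf )
    {
        *buf = lut[ *buf ] ;
    }
}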

On the other hand, optimizing compilers are so good these days that
hand-optimizing the assembly code will rarely be any faster than writing it
in C++. FIR filtering is one of the cases where carefully hand-optimized
assembly code is twice as fast as the best C code.

However, what matters by far the most is to optimize (reduce) the cache
misses. If you can organize the loops so that you minimize cache misses,
you can sometimes be twice as fast, even if such organization requires
quite a bit of extra calculations and a couple of additional loop
nesting levels.

Tom
 

Willy Denoyette [MVP]

Peter Oliphant said:
emphasis mine:
Note, as I said in another reply, arrays are always **contagious** (pinned
or not), if they aren't they are no arrays.

AHHH! That might explain why I think I'm getting the flu! LOL

[==P==]

PS - I'm the KING of typos, so don't feel too bad... : )

Now I'm disappointed, I thought I was the KING :-((.

Willy.
 

Willy Denoyette [MVP]

Tom,

Tamas Demjen said:
Willy said:
All depends on the time you spend to build the lookup table (per pixel),
say that each pixel requires 8 instructions, then agreed, the two
instructions overhead is important

Extra array bounds checking matters with very simple algorithms (such
as applying a lookup table, an FIR filter, or an FFT). Even the difference
between pointer and index arithmetic matters. array[i++] requires two
registers, while *p_array++ requires only one. The Intel processors have a
very unfortunate design with especially few registers, and in a triple
nested loop it may make a significant difference. Often, if you can
eliminate a register at the price of some extra calculation (such as * 2
or + Const), overall you may still gain speed. I wish the CPU had at least
twice as many registers -- image processing could be 25% faster.


I assume we are still talking about managed code, are we? Well, here you are
at the mercy of the JIT compiler to do the right thing, and believe me, while
it does a great job, it will not (yet) be able to optimize this the way you
expect. That's why I said that if you ever find out that there is a performance
issue (after careful measurement/profiling), you have probably chosen the
wrong language/platform (by platform I mean the CLR).
But again, don't underestimate the impact of cache misses when accessing
large data sets (nothing kills your throughput faster than a ton of L1/L2
cache misses), and don't overestimate the importance of register
allocation; the on-chip cache (L1) is extremely fast as well.
I've run a lot of benchmarks lately, using 32-bit and 64-bit native and
managed code on both X86 and X64, and believe me, the results were on par
despite the fact that X64 (an AMD Athlon64 dual core) has a much larger
register set (and SSE/SSE2).
On the other hand, optimizing compilers are so good these days that
hand-optimizing the assembly code will rarely be any faster than writing it in
C++. FIR filtering is one of the cases where carefully hand-optimized
assembly code is twice as fast as the best C code.
Well, I'm sure you could prove your claim by running some microbenchmarks,
but is this representative of a real-world application, where a small
'mistake' in your design might ruin all you did in your highly optimized ASM
code? I could be wrong, but I don't think so. Anyway, we are still talking
about managed code, and this is not the problem domain targeted by the
CLR.
However, what matters by far the most is to optimize (reduce) the cache
misses. If you can organize the loops so that you minimize cache misses,
you can sometimes be twice as fast, even if such organization requires
quite a bit of extra calculations and a couple of additional loop nesting
levels.
How would you reduce the cache misses when you have to deal with large (>
1 MB) arrays? The cache lines are restricted in size and number, and the
caches are shared resources (beware the multi-core rage!); a few context
switches (your process is not the only player) might ruin the cache, you
know.

Willy.
 

Brandon Bray [MSFT]

Holger said:
Ok, maybe I've had too many Islays today. But you can't be serious about
that, can you?

For a while, we were planning on exposing interior pointers only for the
purpose of iterating over arrays. Effectively, array would have an iterator
nested type and that's it. Other than iterating over arrays, interior
pointers have very little use. The only other uses they have are as an
intermediate stage towards pinning, and as something to talk about when
explaining how tracking references work.
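
A minimal C++/CLI sketch of that array-iteration use, assuming a managed int array (the names are illustrative only):

array<int>^ data = gcnew array<int>( 100 ) ;

// An interior pointer keeps tracking the array even if the GC relocates it,
// so it can be advanced element by element like a native pointer.
interior_ptr<int> p = &data[0] ;
for ( int i = 0 ; i < data->Length ; i++, p++ )
{
    *p = i * i ;
}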

Admittedly, the motivation behind avoiding interior pointers in the new
syntax was the tendency to overuse __gc in the old syntax. Much of that was
due to the compiler issuing diagnostics that went away by adding __gc to
pointers everywhere. Many programmers didn't know what they were doing other
than making the compiler happy.
Am I missing something, or are you saying (plain) pointers are (and have
always been) useless? What about by-ref params in managed signatures?

Interior pointers do have very little use in managed programming. Ref
parameters do have a purpose, and I never advocated leaving those out. In
fact, we spent a lot of time discussing how to go even further and support
C# out parameters in syntax.
 

Tamas Demjen

Willy said:
How would you reduce the cache misses when you have to deal with large (>
1 MB) arrays?

By reading and/or writing to the memory in the optimal order, even if
the algorithm would normally require a random (or non-linear) order.
Example: break enormously large images into tiles. Group read/write
those pixels that fit in the cache. For example, read lines, not
columns. It gets tricky when you have to turn an image by 90 degrees,
because you have to involve columns. The correctly written code can be
twice as fast as the incorrect one, due to the enormous number of cache
misses involved while reading/writing columns instead of rows.
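
A rough sketch of the tiling idea for the 90-degree case (the 64-pixel tile size is an arbitrary guess; real code would tune it to the cache and probably unroll the inner loops):

// Rotate a width x height 8-bit image 90 degrees clockwise, tile by tile,
// so that both the row-wise reads and the column-wise writes stay inside a
// cache-friendly block instead of striding across the whole image.
void RotateCW( const unsigned char* src, unsigned char* dst, int width, int height )
{
    const int TILE = 64 ;
    for ( int ty = 0 ; ty < height ; ty += TILE )
    {
        for ( int tx = 0 ; tx < width ; tx += TILE )
        {
            int yEnd = ( ty + TILE < height ) ? ty + TILE : height ;
            int xEnd = ( tx + TILE < width ) ? tx + TILE : width ;
            for ( int y = ty ; y < yEnd ; y++ )
            {
                for ( int x = tx ; x < xEnd ; x++ )
                {
                    // Source pixel (x, y) lands at (height - 1 - y, x) in the
                    // rotated image, whose row width is 'height'.
                    dst[ x * height + ( height - 1 - y ) ] = src[ y * width + x ] ;
                }
            }
        }
    }
}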

Tom
 

Willy Denoyette [MVP]

Tamas Demjen said:
By reading and/or writing to the memory in the optimal order, even if the
algorithm would normally require a random (or non-linear) order. Example:
break enormously large images into tiles. Group read/write those pixels
that fit in the cache. For example, read lines, not columns. It gets
tricky when you have to turn an image by 90 degrees, because you have to
involve columns. The correctly written code can be twice as fast as the
incorrect one, due to the enormous number of cache misses involved while
reading/writing columns instead of rows.

I see, but that's my point: you might need some assembly code to do this,
and you won't do this from managed code (at least not that part). That's exactly
why we won't see a managed version of DirectX any time soon (note I'm not
talking about the managed DirectX wrapper!).

Willy.
 
