Boxing and Unboxing?

  • Thread starter: Peter Olcott
Barry Kelly said:
Peter said:
Jesse McGrew said:
Does that mean that you do have to call a method every time you add or
compare two integers that are stored in reference types?

From the point of view of C#, an integer (or any other value type) is
only boxed if it's been assigned to a location of type 'object' -
whether local variable, argument or field.

Value types that are fields of a reference type are stored inline in the
memory for that object on the heap.

For example:

class A { int x; }

... can be imagined as being roughly equivalent (from a memory layout
perspective) to this in C:

typedef void *MethodTable; // CLR implementation detail
typedef struct A_ { MethodTable *mt; int x; } *A;

In fact, you can't add two boxed integers in C#, since it's got no way
to represent them as anything other than 'object'. You need to cast them
to 'int' to add them - and that unboxes them.

So a member function cannot add two integer members without unboxing them
first? That sounds like horrendous design.
 
Peter said:
So a member function cannot add two integer members without unboxing them
first? That sounds like horrendous design.

No. _Only_ if you declare the integer as being of type "object". Viz:

public class SimpleClass
{
    private int a;
    private int b;

    public SimpleClass(int first, int second)
    {
        this.a = first;
        this.b = second;
    }

    public int Sum()
    {
        return this.a + this.b;
    }
}

No boxing, no unboxing. The integers are stored on the heap with the
object's state, and used directly in the sum operation.

Versus:


public class SimpleClass
{
    private object a;
    private object b;

    public SimpleClass(int first, int second)
    {
        this.a = first; // Causes boxing
        this.b = second; // Causes boxing
    }

    public int Sum()
    {
        // "int" cast causes unboxing
        return (int)this.a + (int)this.b;
    }
}

In the first case, the integers are value types, stored just as they
would be in C++. In the second case, the fields are declared as
'object', a reference type, so the CLR allocates a box for each integer
on the heap and copies the values "first" and "second" into those
boxes. The object state then holds references to the two boxes. In
order to use the values, you must fetch them from the boxes on the
heap, or "unbox" them (which, in reality, is a type check followed by
copying the value out of the box, which is probably about what you
would expect anyway).
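
A minimal sketch of boxing and unboxing behaviour, using a throwaway class named BoxingDemo that is not from the thread. It is only meant to show two things: a box holds its own copy of the value, and the unboxing cast checks the boxed type before handing the value back.

```csharp
using System;

public static class BoxingDemo
{
    // Boxing copies the value into a new heap object, so later changes
    // to the original variable are not visible through the box.
    public static int BoxHoldsCopy()
    {
        int i = 5;
        object boxed = i;   // boxing: allocates a box and copies 5 into it
        i = 6;              // changes only the local variable
        return (int)boxed;  // unboxing: the box still holds 5
    }

    // Unboxing must name the exact value type that was boxed.
    public static bool UnboxToWrongTypeThrows()
    {
        object boxed = 5;           // a boxed int
        try
        {
            long n = (long)boxed;   // the box holds an int, not a long
            return n < 0;           // never reached
        }
        catch (InvalidCastException)
        {
            return true;
        }
    }

    public static void Main()
    {
        Console.WriteLine(BoxHoldsCopy());            // prints 5
        Console.WriteLine(UnboxToWrongTypeThrows());  // prints True
    }
}
```

To convert a boxed int to a long you would unbox first and convert second: (long)(int)boxed.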
 

Well that's not too bad then. It would seem that good design might be able to
completely eliminate the boxing and unboxing overhead penalty. Is it possible to
pass data around as unboxed data? Can I pass the address of a struct, so that a
class member can update this struct without boxing and unboxing?

What is the best way to get one class to update the struct data of another
class?
 
Peter said:
Well that's not too bad then. It would seem that good design might be able to
completely eliminate the boxing and unboxing overhead penalty.

Yup. Nobody I know with experience worries much about this.

Peter said:
Is it possible to pass data around as unboxed data?

Yes - declare your types rather than using 'object'.

Peter said:
Can I pass the address of a struct, so that a
class member can update this struct without boxing and unboxing?

You can, but in a strictly downwards (call stack) fashion, via the 'ref'
modifier on arguments. You can't safely store the address.

With unsafe code, you can use the '&' operator to get the address, and
basically write C code to manipulate the data. But that's unsafe code:
it's not verifiable, it won't work if the executable is run from a
network location, and almost certainly won't work if you're (e.g.)
writing an ASP.NET application for hosting on a server somewhere -
unless you control the server & permissions completely.

Peter said:
What is the best way to get one class to update the struct data of another
class?

By calling methods on the other class.

-- Barry
 
Peter said:
So a member function can not add two integer members without unboxing them
first? That would sound like horrendous design.

If you carefully read what I wrote, you'll notice:

You *cannot*, I repeat *CANNOT*, have two boxed integer members in C# -
the members would need to be of type *object*, not int, in order for
them to be boxed.

-- Barry
 
Barry Kelly said:
Yup. Nobody I know with experience worries much about this.


Yes - declare your types rather than using 'object'.


You can, but in a strictly downwards (call stack) fashion, via the 'ref'
modifier on arguments. You can't safely store the address.

So I can call one member function from another member function of a different
class and pass the address of a struct to the second member function so that
the second member function can directly update the contents of the struct,
without any boxing and unboxing overhead? If the answer is yes, then what is the
syntax for doing this?
 
Barry Kelly said:
If you carefully read what I wrote, you'll notice:


You *cannot*, I repeat *CANNOT*, have two boxed integer members in C# -
the members would need to be of type *object*, not int, in order for
them to be boxed.

-- Barry

I carefully read it, yet, did not fully understand the meaning of all of the
terminology that was used. For one thing, I don't see why there is ever any need
for boxing and unboxing. I know that there is no such need in C++. I also know
that it must somehow support GC, and that is why it is needed. I don't see how
it supports GC. Is it something like maintaining a chain of pointers indicating
who owns what?
 
Peter said:
So I can call one member function from another member function of a different
class and pass the address of a struct to the second member function so that
the second member function can directly update the contents of the struct,
without any boxing and unboxing overhead?

Yes.

Peter said:
If the answer is yes, then what is the syntax for doing this?

void Foo(ref MyStruct value) { } // declaration

// ...
MyStruct myStructValue;
// ...
Foo(ref myStructValue); // usage

There is also an 'out', which is similar but (a) argument need not be
definitely assigned when passed in (but it will be definitely assigned
after the call) and (b) it is treated as unassigned in the body of the
method taking the parameter, and will be so treated until it's assigned
(and must be assigned before the method returns).

But be sure to measure that:

1) MyStruct being a struct (value type) is the right thing to do.
Typically, if sizeof(MyStruct) is greater than (say) 16 bytes, it's
looking like it might be too big. Of course, there are exceptions to
this, like in all performance work. Measure, etc.

2) The savings by passing by-ref outweigh the fact that it's a mutable
reference. In other words, beware that there's no const by-ref
mechanism.
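
A fuller sketch of the 'ref' and 'out' syntax, assuming hypothetical types Point, Shape and Mover that are not from the thread. It tries to show one class letting another update its private struct field in place, with no boxing involved.

```csharp
using System;

public struct Point
{
    public int X;
    public int Y;
}

public class Shape
{
    private Point origin;   // stored inline in the Shape object, not boxed

    public Point Origin { get { return origin; } }

    // Hands another class a reference to the private struct field.
    public void MoveWith(Mover mover)
    {
        mover.Shift(ref origin, 3, 4);
    }
}

public class Mover
{
    // 'ref': the caller's variable is updated in place, no copy and no box.
    public void Shift(ref Point p, int dx, int dy)
    {
        p.X += dx;
        p.Y += dy;
    }

    // 'out': the method must assign the argument before returning.
    public void MakeUnit(out Point p)
    {
        p = new Point();
        p.X = 1;
        p.Y = 1;
    }
}

public static class RefDemo
{
    public static void Main()
    {
        Shape shape = new Shape();
        shape.MoveWith(new Mover());
        Console.WriteLine(shape.Origin.X + "," + shape.Origin.Y);  // prints 3,4

        Point unit;                      // not assigned yet, which 'out' allows
        new Mover().MakeUnit(out unit);
        Console.WriteLine(unit.X + "," + unit.Y);                  // prints 1,1
    }
}
```

Note that the reference only lives for the duration of the call: Mover cannot keep it in a field for later use.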

-- Barry
 
Peter said:
I carefully read it, yet, did not fully understand the meaning of all of the
terminology that was used. For one thing, I don't see why there is ever any need
for boxing and unboxing.

Consider how these things would be implemented in a memory-safe[1]
manner without boxing (whether manual boxing like Java 1.4, or
autoboxing like C# and Java 1.5+):

* IEnumerable

Peter said:
I know that there is no such need in C++.

If you try to create a C++ analogue of IEnumerable in a memory-safe way,
you'll need to reinvent boxing. In other words, you'll need some way of
unifying all values into some interface that can be queried for type and
safely converted into its actual value.

And just because a feature is useful doesn't mean that it is necessary.
C++ isn't memory-safe.

Peter said:
I also know that it must somehow support GC, and that is why it is needed.

GC is an orthogonal issue to boxing per se. Autoboxing, however,
requires some kind of GC, even if it's as dumb as reference counting, if
it's to be sane (IMHO).

Peter said:
I don't see how it supports GC. Is it something like maintaining a chain
of pointers indicating who owns what?

GC in no way requires boxing. GC follows the same references you program
with. There are no magic references behind the scenes.

[1] By "memory-safe", I mean that it's provably impossible to violate
the language's memory model. See e.g. type safety on Wikipedia for more
info:

http://en.wikipedia.org/wiki/Type_safety

-- Barry
 
Peter said:
So with Generics Boxing and UnBoxing becomes obsolete?

Not in general.

Generics make boxing and unboxing obsolete in the context
of storing value types in collections.
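
A small sketch of boxing that generics do not remove, assuming a hypothetical ICounter interface and Counter struct that are not from the thread. Converting a struct to an interface type, or assigning any value type to 'object', still creates a box.

```csharp
using System;

public interface ICounter
{
    void Increment();
    int Value { get; }
}

public struct Counter : ICounter
{
    private int n;
    public void Increment() { n++; }
    public int Value { get { return n; } }
}

public static class StillBoxingDemo
{
    // Returns the value seen through the original struct and through the box.
    public static int[] IncrementThroughInterface()
    {
        Counter c = new Counter();
        ICounter boxed = c;      // struct to interface: boxes a copy of c
        boxed.Increment();       // increments the boxed copy only
        return new int[] { c.Value, boxed.Value };
    }

    // Each conversion to object allocates a separate box.
    public static bool TwoBoxesAreSameObject()
    {
        int x = 1;
        object a = x;
        object b = x;
        return object.ReferenceEquals(a, b);
    }

    public static void Main()
    {
        int[] values = IncrementThroughInterface();
        Console.WriteLine(values[0] + "," + values[1]);   // prints 0,1
        Console.WriteLine(TwoBoxesAreSameObject());       // prints False
    }
}
```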

Arne
 
Peter said:
What I am looking for is all of the extra steps that form what is referred to as
boxing and unboxing. In C/C++ converting a value type to a reference type is a
very simple operation and I don't think that there are any runtime steps at all.
All the steps are done at compile time. Likewise for converting a reference type
to a value type.

in C/C++
int X = 56;
int *Y = &X;
Now both X and *Y hold 56, and Y is a reference to X;

That code is not equivalent to what we are discussing in C#.

In fact it does not really have any equivalent in C# (not using
unsafe code).

Arne
 
Peter said:
I carefully read it, yet, did not fully understand the meaning of all of the
terminology that was used. For one thing, I don't see why there is ever any need
for boxing and unboxing. I know that there is no such need in C++.

That's because in C# (and Java) you _can't_ say:

int x = 3;
int *p = &x;

because the "&" operator simply isn't available (in C#, not outside of
unsafe code). You can't take the address of an arbitrary variable.

Peter said:
I also know that it must somehow support GC, and that is why it is needed.

Well, more to the point, a language that supports garbage collection
can't allow one to take addresses of arbitrary memory locations, as the
garbage collector could then never determine what objects were
referenced and which weren't (because an address into the midst of an
object would then be legal).

Peter said:
Is it something like maintaining a chain of pointers indicating who owns what?

Well, sort of. The GC walks the stack and all static objects, looking
for references to objects on the heap. It then follows references
stored in those objects, etc, until it exhausts the network of
references. Any objects left thus unmarked are available for
collection.
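
The walk described above can be sketched as a toy "mark" pass. This is only an illustration of reachability using a hypothetical Node class; it is not how the CLR collector is actually implemented.

```csharp
using System;
using System.Collections.Generic;

// A toy heap object: a name plus references to other toy objects.
public class Node
{
    public readonly string Name;
    public readonly List<Node> References = new List<Node>();
    public bool Marked;

    public Node(string name) { Name = name; }
}

public static class MarkDemo
{
    // Mark everything reachable from the roots by following references.
    public static void Mark(IEnumerable<Node> roots)
    {
        Stack<Node> pending = new Stack<Node>(roots);
        while (pending.Count > 0)
        {
            Node current = pending.Pop();
            if (current.Marked) continue;   // already visited (handles cycles)
            current.Marked = true;
            foreach (Node next in current.References) pending.Push(next);
        }
    }

    // Builds a small graph and returns the names of the unmarked nodes.
    public static List<string> FindGarbage()
    {
        Node a = new Node("a"), b = new Node("b"), c = new Node("c"), d = new Node("d");
        a.References.Add(b);
        b.References.Add(a);   // a cycle between a and b
        c.References.Add(d);   // c refers to d, but no root reaches c
        Mark(new[] { a });     // 'a' plays the role of a stack or static root

        List<string> garbage = new List<string>();
        foreach (Node n in new[] { a, b, c, d })
            if (!n.Marked) garbage.Add(n.Name);
        return garbage;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(",", FindGarbage()));   // prints c,d
    }
}
```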

Of course, it's rather more complex than that, but you get the idea. If
you allow references into the midst of objects, then it's much more
difficult to decide what is referenced and what isn't.

In C# and Java, every reference that you can directly manipulate in
code is to a valid object on the heap. That's why, if you want to treat
an int as an object (and thus have a reference to it) then the CLR has
to create an object wrapper for it and put it on the heap.
 
Barry Kelly said:
But be sure to measure that:

1) MyStruct being a struct (value type) is the right thing to do.
Typically, if sizeof(MyStruct) is greater than (say) 16 bytes, it's
looking like it might be too big. Of course, there are exceptions to
this, like in all performance work. Measure, etc.

2) The savings by passing by-ref outweigh the fact that it's a mutable
reference. In other words, beware that there's no const by-ref
mechanism.

There is no inherent reason why this could not be added to the language as a
compile time feature later on. It might be simpler to stick with the established
convention and simply make an [in] equivalent of an [out] parameter, instead of
using the somewhat less obvious [const]. There would be no reason to distinguish
between [in] by reference and [in] by value, they could all be passed by
reference, or anything larger than [int] could always be passed by reference.

It is good to know that aggregate data can be passed by reference without the
boxing and unboxing overhead, if need be.
 
Barry Kelly said:
[1] By "memory-safe", I mean that it's provably impossible to violate
the language's memory model. See e.g. type safety on Wikipedia for more
info:

http://en.wikipedia.org/wiki/Type_safety

A strongly typed language like C++ effectively prevents any accidental type
errors, so why bother with more than this?
 
Arne Vajhøj said:
That code is not equivalent to what we are discussing in C#.

In fact it does not really have any equivalent in C# (not using
unsafe code).

Arne

Couldn't there possibly be a way to create safe code that does not ever require
any extra runtime overhead? Couldn't all the safety checking somehow be done at
compile time?
 
Bruce Wood said:
That's because in C# (and Java) you _can't_ say:

int x = 3;
int *p = &x;

because the "&" operator simply doesn't exist. You can't take the
address of an arbitrary variable.

I still don't see any reason why a completely type safe language cannot be
constructed without the need for any runtime overhead. You could even allow
constructs such as the above, and still be completely type safe, and merely
disallow type casting.

Bruce Wood said:
Well, more to the point, a language that supports garbage collection
can't allow one to take addresses of arbitrary memory locations, as the
garbage collector could then never determine what objects were
referenced and which weren't (because an address into the midst of an
object would then be legal).

It could do this, but, then you have the issue of reference counting, more extra
overhead. You don't have this problem when data is simply passed by address with
no assignment to another pointer variable.

Bruce Wood said:
Well, sort of. The GC walks the stack and all static objects, looking
for references to objects on the heap. It then follows references
stored in those objects, etc, until it exhausts the network of
references. Any objects left thus unmarked are available for
collection.

Global data is disallowed?

Bruce Wood said:
Of course, it's rather more complex than that, but you get the idea. If
you allow references into the midst of objects, then it's much more
difficult to decide what is referenced and what isn't.

In C# and Java, every reference that you can directly manipulate in
code is to a valid object on the heap. That's why, if you want to treat
an int as an object (and thus have a reference to it) then the CLR has
to create an object wrapper for it and put it on the heap.

I still don't see any need for a wrapper. Do you mean for reference counting?
 
Peter said:
[1] By "memory-safe", I mean that it's provably impossible to violate
the language's memory model. See e.g. type safety on Wikipedia for more
info:

http://en.wikipedia.org/wiki/Type_safety

A strongly typed language like C++ effectively prevents any accidental type
errors, so why bother with more than this?

Because this also prevents *intentional* type errors, which is
important for running code in a sandbox. Your web browser can guarantee
that a Java applet embedded into a page won't crash your system or
delete all your files, because Java enforces type safety at all levels;
this is the same sort of thing.

Jesse
 
Peter said:
I still don't see any reason why a completely type safe language cannot be
constructed without the need for any runtime overhead. You could even allow
constructs such as the above, and still be completely type safe, and merely
disallow type casting.

If you disallow type casting, you neuter the language. You need to be
able to cast instances of derived classes to their bases and back. You
can do the first kind of cast without any runtime overhead, but you
need *some* runtime overhead to cast a base instance to its actual
derived class, even in C++ with dynamic_cast<>.

(The overhead in C++ isn't for performing the actual cast, but for
verifying that the cast is valid - that the object actually belongs to
the class you're casting it to. In C#, that's usually the case, but for
unboxing casts there's also overhead for copying the value out of its
box.)
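
A short sketch of the three kinds of cast mentioned above, assuming hypothetical classes Animal, Dog and Cat that are not from the thread: an upcast that needs no run-time check, a downcast that is checked at run time, and an unboxing cast.

```csharp
using System;

public class Animal { }
public class Dog : Animal { }
public class Cat : Animal { }

public static class CastDemo
{
    public static bool UpcastKeepsSameObject()
    {
        Dog dog = new Dog();
        Animal animal = dog;   // derived to base: always valid, no run-time check
        return object.ReferenceEquals(animal, dog);   // same object, nothing copied
    }

    public static bool BadDowncastThrows()
    {
        Animal animal = new Cat();
        try
        {
            Dog dog = (Dog)animal;   // base to derived: verified at run time
            return dog == null;      // never reached
        }
        catch (InvalidCastException)
        {
            return true;
        }
    }

    public static int UnboxingCast()
    {
        object boxed = 42;
        return (int)boxed;   // type check, then the value is copied out of the box
    }

    public static void Main()
    {
        Console.WriteLine(UpcastKeepsSameObject());   // prints True
        Console.WriteLine(BadDowncastThrows());       // prints True
        Console.WriteLine(UnboxingCast());            // prints 42
    }
}
```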

Peter said:
It could do this, but, then you have the issue of reference counting, more extra
overhead. You don't have this problem when data is simply passed by address with
no assignment to another pointer variable.

The desire to avoid that overhead (as well as other problems with
reference counting) is, presumably, why .NET uses a garbage collector
instead.

Peter said:
Global data is disallowed?

No, that's what "static objects" refers to. In C#, you typically only
store global data by putting it in the static fields of a class. (There
are a couple other types of global data used with C++/CLI: bare global
variables and gcroots.)

Peter said:
I still don't see any need for a wrapper. Do you mean for reference counting?

The wrapper is there so that the int on the heap can be treated like
any other object, with a type pointer, virtual methods, etc. If it were
just stored on the heap as a plain integer, there'd be no way for your
code (and more importantly, the garbage collector) to tell it apart
from a float or an object reference at runtime.

Boxing lets you write a method like this:

public static void PrintIt(object foo)
{
    Console.WriteLine("Thanks for this " + foo.GetType().Name + ": " +
        foo.ToString());
}

And then pass in *any* value, whether it's an integer, a structure, or
an object reference. An unboxed integer is just a number, with no type
information other than that stored in the compiler's internals; a boxed
integer is a full-fledged instance of a class derived from
System.Object.
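
A usage sketch for a method like PrintIt. To make the result easy to inspect, this variant (named Describe here, an assumption of this example) returns the string instead of printing it, and Temperature is a hypothetical struct.

```csharp
using System;

public struct Temperature
{
    public int Degrees;
    public override string ToString() { return Degrees + " degrees"; }
}

public static class PrintItDemo
{
    // Same idea as PrintIt: any argument arrives as an object, so a value
    // type passed here is boxed and carries its run-time type with it.
    public static string Describe(object foo)
    {
        return "Thanks for this " + foo.GetType().Name + ": " + foo.ToString();
    }

    public static void Main()
    {
        Temperature t = new Temperature { Degrees = 21 };

        Console.WriteLine(Describe(42));       // Thanks for this Int32: 42
        Console.WriteLine(Describe(t));        // Thanks for this Temperature: 21 degrees
        Console.WriteLine(Describe("hello"));  // Thanks for this String: hello
    }
}
```

Only the first two calls box anything; the string is already a reference type.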

Jesse
 
Peter said:
I still don't see any reason why a completely type safe language cannot be
constructed without the need for any runtime overhead. You could even allow
constructs such as the above, and still be completely type safe, and merely
disallow type casting.


It could do this, but, then you have the issue of reference counting, more extra
overhead. You don't have this problem when data is simply passed by address with
no assignment to another pointer variable.

C# supports pass-by-reference using the "ref" keyword.

However, I don't see how a language that allowed one to take the
address of arbitrary data could implement garbage collection. Even with
reference counting, the theory is that an _object_ counts references to
itself. An int, however, isn't an object. You're faced with the problem
of an object counting references to itself _or piece of data that it
holds_. How could you engineer a system whereby object A could keep
track of this sort of thing:

int *p = &(A.X);
int *q = p;

How does the object A now know that there are two references to it, p
and q, which point to a field inside A and not to A itself?

I don't see how you could automate this kind of reference counting,
even in C++, but then I'm no C++ guru.

Peter said:
Global data is disallowed?

No. Global data is allowed. That's what I meant by "static".

Peter said:
I still don't see any need for a wrapper. Do you mean for reference counting?

C# and Java don't do reference counting. They walk the network of
object references at garbage collection time. "Mark and sweep."

I guess a good summary would be to say that the more regular the
situation, the easier it is to write good code to deal with it. By
forcing every collectable object to be the same, and allowing
references only to objects on the heap (apart from pass-by-ref, which
doesn't enter into garbage collection), C# and Java make it easier on
the garbage collector, which allows the GC to be more efficient.

Once you open up the language to allow arbitrary addressing of objects
and the values within them, you create a nightmare situation for the
garbage collector. Not that a sufficiently clever team of people
couldn't do it, I suppose, but it adds a lot of additional complexity,
and one has to ask exactly what would be gained? Java has demonstrated
that you can write perfectly good code without the ability to take
arbitrary addresses, pointer arithmetic, and the other stuff that C and
C++ pointers provide. There are some domains where the power of C / C++
pointers is arguably a great boon, but for most programming problems it
isn't required. So, you don't lose very much, and you gain a much
simpler garbage collector and better run-time security.

And yes, in .NET 2.0 you can pretty much avoid boxing (and unboxing)
altogether. It was difficult in .NET 1.1 because all of the standard
collections were collections of Object, and so storing values in a
Hashtable or an ArrayList (aka vector in C++) meant incurring boxing
overhead. Even in .NET 1.1, however, you could roll your own
collections that didn't box or unbox, but they had to be type-specific.
.NET 2.0's generics (aka templates in C++) eliminate this problem. I
wouldn't say that boxing is a thing of the past, but more than 90% of
boxing in .NET 1.1 was in collections, and that's no longer necessary.

So the runtime penalty is almost non-existent, assuming that you use
appropriate language constructs.
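
A minimal side-by-side sketch of the collection case, using a throwaway class named CollectionDemo that is not from the thread. Both methods compute the same sum; only the ArrayList version boxes on the way in and unboxes on the way out.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public static class CollectionDemo
{
    // .NET 1.1 style: ArrayList stores object, so every int is boxed.
    public static int SumArrayList()
    {
        ArrayList list = new ArrayList();
        for (int i = 1; i <= 3; i++) list.Add(i);   // Add(object): boxes each int

        int sum = 0;
        foreach (object item in list) sum += (int)item;   // each read unboxes
        return sum;
    }

    // .NET 2.0 style: List<int> stores the ints directly.
    public static int SumGenericList()
    {
        List<int> list = new List<int>();
        for (int i = 1; i <= 3; i++) list.Add(i);   // Add(int): no boxing

        int sum = 0;
        foreach (int item in list) sum += item;     // no unboxing
        return sum;
    }

    public static void Main()
    {
        Console.WriteLine(SumArrayList());     // prints 6
        Console.WriteLine(SumGenericList());   // prints 6
    }
}
```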

Personally, I'm glad that arbitrary addressing was never put into Java
or C#. When I moved from C / C++ to Java I wondered how I would ever do
without the "&" operator, but I quickly realized that for the type of
software I write (business software) it really isn't needed. If,
however, I ever go back to writing real-time switching systems, I will
no doubt want C++ back again. Each tool has its uses, and C# is, in my
opinion, better suited to most day-to-day programming problems than is
C++. However, there are places that C# won't take you, where C++ is
much better suited.
 
