Out of memory with 2-dimensional float/double matrix

Guest

Hello,

I'm using a very large two-dimensional double matrix (45,000 × 45,000)
in my program.

At the initialization of the matrix:

double[,] matrix = new double[45000, 45000];

I am getting an out-of-memory error.

The program is running on a Windows 2003 Server with 8 GB of RAM
and 20 GB of virtual memory.

So I thought this must be enough.
As far as I know, a double value uses 8 bytes, so the matrix needs
45,000 × 45,000 × 8 bytes = 16.2 billion bytes ≈ 15.1 GB.

Is this right?

My first question: must a matrix in C# fit into main memory (8 GB),
or are parts of the matrix automatically paged out to virtual memory
if the matrix is too large?

In other words: can I handle a matrix which takes more than 8 GB of RAM?

I also tried to use float instead of double, but I am getting the same error.

Is there something smaller than a float which I could use?
I need 4 digits after the decimal point.


I am logged in as a normal user on the Windows 2003 Server.
Could I be running into some resource restrictions? It is a standard installation.

I would be happy if somebody has an idea how to track down this problem.


Regards,

Martin
 
Jani Järvinen [MVP]

Martin,

you didn't state whether your C# application is 32-bit or 64-bit, but if it
is 32-bit (which is the default, one could say), then you are trying to do
something that is not possible in 32-bit applications.

That is, each 32-bit application can only address 2^32 bytes = 4 gigabytes of
virtual address space, of which at least 1 GB (by default, 2 GB) is reserved
for the operating system. It doesn't matter how much memory in total your
server has; you can't allocate more than 2 or 3 GB of memory in a single
application. See here:

http://msdn2.microsoft.com/en-gb/library/aa366912.aspx

If your application is a 64-bit one, then your memory allocation should
succeed given that the server has enough free memory and you've specified
your application as being "large address aware". If you are in the 32-bit
world, then there's no other option than to divide your array into smaller
parts, or use for instance a file/database to store the data.

Hope this helps.

--
Regards,

Mr. Jani Järvinen
C# MVP
Helsinki, Finland
(e-mail address removed)
http://www.saunalahti.fi/janij/
 
Guest

Hello Jani,

thank you very much for your help.
Jani Järvinen said:
you didn't state whether your C# application is 32-bit or 64-bit, but if it
is 32-bit (which is the default, one could say), then you are trying to do
something that is not possible in 32-bit applications.

Can you tell me how to check this?

The server has 4 Intel Xeon 3.4 GHz CPUs, and the operating system is
Windows 2003 Enterprise with SP1.

I've read that the Xeon is a 32-bit CPU, but there are versions with the
EM64T extension.

How can I check whether my CPU has the EM64T extension or not?

And what about the Windows 2003 Enterprise OS?
How can I check whether it is a 64-bit version of Windows 2003 or not?


Regards,
Martin
 
Willy Denoyette [MVP]

Jani Järvinen said:
[...] If your application is a 64-bit one, then your memory allocation should
succeed given that the server has enough free memory and you've specified your
application as being "large address aware". If you are in the 32-bit world,
then there's no other option than to divide your array into smaller parts, or
use for instance a file/database to store the data.


Nope, even on 64-bit this application will fail: the maximum size of a single
object is limited to 2 GB in all current versions of the CLR. Sure, you can
allocate eight arrays of 2 GB each on 64-bit, but not a single array of 16 GB.
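
If you do go to 64-bit, splitting the matrix per row is enough to stay under
the per-object limit. A minimal sketch, assuming a 64-bit CLR and roughly
15 GB of available memory (the 45,000 dimension is from Martin's post):

using System;

class JaggedMatrixDemo
{
    static void Main()
    {
        const int n = 45000;

        // Each row is a separate object of ~360 KB, so no single
        // allocation comes near the CLR's 2 GB per-object limit.
        // The total is still ~15 GB, hence 64-bit only.
        double[][] matrix = new double[n][];
        for (int i = 0; i < n; i++)
            matrix[i] = new double[n];

        matrix[42][7] = 3.1415;   // indexed as matrix[row][col]
        Console.WriteLine(matrix[42][7]);
    }
}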

Willy.
 
rossum

Martin Pöpping said:
I'm using a very large two-dimensional double matrix (45,000 × 45,000) in my
program. At the initialization of the matrix I am getting an out-of-memory
error. [...]

How many of the positions in your array are non-zero? You might be
able to save memory by using a sparse matrix representation.

Do you need random access to the array or can you process it serially?
Serial processing might allow you to keep it in a file rather than in
memory and just retrieve it one row at a time.

You could use a database to store the numbers; 16 GB is not large for a
database. Just retrieve the numbers you need from the database as
required.
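
If most entries are zero, a dictionary keyed on the (row, column) pair is one
simple way to do a sparse matrix in C#. A minimal sketch (class and member
names are just illustrative):

using System;
using System.Collections.Generic;

class SparseMatrix
{
    private const int N = 45000;

    // Only non-zero entries are stored; everything else reads as 0.0.
    private readonly Dictionary<long, double> entries =
        new Dictionary<long, double>();

    private static long Key(int row, int col)
    {
        return (long)row * N + col;
    }

    public double this[int row, int col]
    {
        get
        {
            double value;
            return entries.TryGetValue(Key(row, col), out value) ? value : 0.0;
        }
        set
        {
            if (value != 0.0)
                entries[Key(row, col)] = value;
            else
                entries.Remove(Key(row, col));   // keep the table sparse
        }
    }
}

Each stored entry costs a few dozen bytes of dictionary overhead, so this only
pays off if the matrix really is sparse, say a few hundred meaningful
comparisons per document rather than all 45,000.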

rossum
 
Brian Gideon

Martin,

I'm just curious. Why do you need a matrix that big? Maybe I or
someone else can offer other options.

Brian
 
Martin Pöpping

Brian said:
I'm just curious. Why do you need a matrix that big? Maybe I or
someone else can offer other options.

Hi Brian,

I'm doing document "pre-clustering" with a graph algorithm
(connected components).

That's why I need a comparison of each document with every other document.

If I have 45,000 documents, for example, I need a 45,000 × 45,000 document matrix.

I know that I can reduce the size of the matrix by using an upper
triangular matrix, but according to my calculations it would still not
fit into memory.

So I really have to divide my matrix into several smaller matrices.


Regards,

Martin
 
Bruce Wood

Martin said:
If I have 45,000 documents, for example, I need a 45,000 × 45,000 document matrix.

Or you need some other structure that stores, for each document, its
relationship to every other document. Is there a reason that you need
the entire thing in memory all at once? Could you use some form of
indexed file to store the relationships? What kinds of operations are
you doing with the matrix elements, and how many operations do you need
to do? Is there some sort of performance requirement?

If all you need to do is store the results of comparing every document
with every other document, then I would think that a file would be a
better structure, given the amount of data you're talking about. If, on
the other hand, you need to perform complex traversals over the data
over and over again, you need things to be in memory.
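
To make the file idea concrete, here is a rough sketch that stores the matrix
as fixed-size records so any cell can be read with a single seek (the class
name and layout are illustrative, not a tuned implementation):

using System;
using System.IO;

class FileBackedMatrix : IDisposable
{
    private const int N = 45000;
    private readonly FileStream stream;
    private readonly BinaryReader reader;
    private readonly BinaryWriter writer;

    public FileBackedMatrix(string path)
    {
        stream = new FileStream(path, FileMode.OpenOrCreate, FileAccess.ReadWrite);
        // Pre-size to 45,000 * 45,000 * 8 bytes (~15 GB) so reads of
        // not-yet-written cells return zeros instead of hitting EOF.
        stream.SetLength((long)N * N * sizeof(double));
        reader = new BinaryReader(stream);
        writer = new BinaryWriter(stream);
    }

    private static long Offset(int row, int col)
    {
        return ((long)row * N + col) * sizeof(double);
    }

    public double Read(int row, int col)
    {
        stream.Seek(Offset(row, col), SeekOrigin.Begin);
        return reader.ReadDouble();
    }

    public void Write(int row, int col, double value)
    {
        stream.Seek(Offset(row, col), SeekOrigin.Begin);
        writer.Write(value);
    }

    public void Dispose()
    {
        stream.Close();
    }
}

Random access against disk is of course orders of magnitude slower than RAM,
so this fits a "fill once, read sequentially" pattern much better than
repeated traversals.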
 
Guest

Hi Martin,

There are also other ways to figure out which labels are equivalent in your
connected-components algorithm. One would be to have a single array of 45,000
items where each array element is a linked list containing all of the other
labels that are equivalent. After the first pass over your data you can then
choose one label for each item in your list and do a second pass through your
data to rename the labels. I have done this a few times when using this
algorithm in image processing, and it works very well and is very fast. Just
some other ideas instead of trying to build a matrix of all possible
combinations.
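
Roughly, the structure I mean looks like this, using List<int> in place of a
hand-rolled linked list (the label range 0..44,999 matches Martin's document
count; names are illustrative):

using System;
using System.Collections.Generic;

class LabelEquivalence
{
    private const int N = 45000;

    // One list per label, holding the labels found equivalent to it
    // during the first pass over the data.
    static List<int>[] equivalents = new List<int>[N];

    static void RecordEquivalence(int a, int b)
    {
        if (equivalents[a] == null) equivalents[a] = new List<int>();
        if (equivalents[b] == null) equivalents[b] = new List<int>();
        equivalents[a].Add(b);
        equivalents[b].Add(a);
    }

    // Second pass: pick one representative label per connected group
    // by walking the equivalence lists.
    static int[] ResolveLabels()
    {
        int[] final = new int[N];
        for (int i = 0; i < N; i++) final[i] = -1;

        for (int i = 0; i < N; i++)
        {
            if (final[i] != -1) continue;
            Stack<int> todo = new Stack<int>();
            todo.Push(i);
            while (todo.Count > 0)
            {
                int label = todo.Pop();
                if (final[label] != -1) continue;
                final[label] = i;   // i becomes the group's representative
                if (equivalents[label] == null) continue;
                foreach (int other in equivalents[label])
                    todo.Push(other);
            }
        }
        return final;
    }

    static void Main()
    {
        RecordEquivalence(3, 7);    // documents 3 and 7 matched
        RecordEquivalence(7, 12);   // documents 7 and 12 matched
        int[] final = ResolveLabels();
        Console.WriteLine(final[12]);   // prints 3: same group as 3 and 7
    }
}

Memory use is proportional to the number of recorded equivalences, not to the
square of the number of labels.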

Mark.
 
Lucian Wischik

Martin Pöpping said:
I'm doing document "pre-clustering" with a graph algorithm
(connected components).
That's why I need a comparison of each document with every other document.
If I have 45,000 documents, for example, I need a 45,000 × 45,000 document matrix.

Sorry, I've only been to a couple of lectures on the topic and don't
really know the field, but this seems like the time to start reading up on
algorithms and data structures for solving the problem, not the time to
look at language support for features.

What algorithm are you using? Your matrix approach is O(n^2), but a quick
glance through Google gives me the impression that people are using
O(n log n) or O(n) algorithms for their specialized clustering problems. One
page I saw reckoned they could cluster 100,000 documents of 10 variables in
under 50 minutes. And they say that an n^2 matrix would have taken 40 GB,
but they did it in only 40 MB.
http://www.clustan.com/clustering_large_datasets.html
 
Martin Pöpping

rossum said:
How many of the positions in your array are non-zero? You might be
able to save memory by using a sparse matrix representation.

Yes, that's what I was planning to do: using an upper triangular matrix [1].

You could use a database to store the numbers; 16 GB is not large for a
database. Just retrieve the numbers you need from the database as
required.

Yes, that is not a bad idea, thank you. Maybe I will try this as a last option.


Regards,
Martin


[1] http://en.wikipedia.org/wiki/Triangular_matrix
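
For reference, this is the packed indexing I mean for an upper triangular
matrix, illustrated with a small n (at n = 45,000 the packed array would
still hold about a billion doubles, ~7.5 GB, which is why it does not solve
my problem):

using System;

class UpperTriangular
{
    private readonly double[] data;
    private readonly int n;

    public UpperTriangular(int n)
    {
        this.n = n;
        // n * (n - 1) / 2 entries for the strict upper triangle. Note
        // this expression overflows int for n = 45,000, and ~7.5 GB
        // would exceed the 2 GB object limit anyway; this class only
        // illustrates the indexing.
        data = new double[n * (n - 1) / 2];
    }

    // Position of entry (i, j), with i < j, in the packed array.
    private int Index(int i, int j)
    {
        return i * (2 * n - i - 1) / 2 + (j - i - 1);
    }

    public double this[int i, int j]
    {
        get
        {
            if (i == j) return 0.0;   // diagonal is not stored
            return i < j ? data[Index(i, j)] : data[Index(j, i)];
        }
        set
        {
            if (i < j) data[Index(i, j)] = value;
            else if (i > j) data[Index(j, i)] = value;
        }
    }

    static void Main()
    {
        UpperTriangular m = new UpperTriangular(5);
        m[1, 3] = 0.25;
        Console.WriteLine(m[3, 1]);   // symmetric read: prints 0.25
    }
}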
 
Martin Pöpping

Bruce Wood said:
What kinds of operations are you doing with the matrix elements, and how many
operations do you need to do? Is there some sort of performance requirement?

With a threshold value I am converting the matrix to a binary matrix (0:
documents are not equal, 1: documents are (partially) equal).

After this I am using a graph algorithm to calculate the connected
components. So I think it would be better if there were a way to keep
the values in memory.
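
One thing I am now considering: since the thresholded matrix needs only one
bit per cell, 45,000 × 45,000 bits is only about 250 MB, which might fit even
in my 32-bit process. A rough sketch of what I mean (assuming I can threshold
each score as it is computed instead of storing the doubles):

using System;
using System.Collections;

class BinaryMatrix
{
    private const int N = 45000;

    // One bit per cell: N * N = 2,025,000,000 bits, about 250 MB.
    // That is a single object well under the CLR's 2 GB limit, and the
    // length still fits BitArray's int parameter.
    static BitArray similar = new BitArray(N * N);

    static void Set(int row, int col, bool value)
    {
        similar[row * N + col] = value;
    }

    static bool Get(int row, int col)
    {
        return similar[row * N + col];
    }

    static void Main()
    {
        double threshold = 0.8;   // illustrative values only
        double score = 0.93;      // say, the similarity of documents 3 and 7
        Set(3, 7, score >= threshold);
        Console.WriteLine(Get(3, 7));   // True
    }
}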


Regards,
Martin
 
Jani Järvinen [MVP]

Hello!
How can I check if my CPU has a EM64T Extension or not?

You would need a utility that is able to display this information, or you
would need to call the CPUID instruction in native code to get it. I once
wrote a Delphi Win32 DLL that does this call and then a C# managed wrapper
around that DLL, but there are freeware utilities as well.

For example, CPU-Z (www.cpuid.com) should be able to tell you.
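
As for checking your own process and OS from managed code, this quick sketch
works (IntPtr.Size is reliable for the process; the environment-variable test
for the OS is a common heuristic, not an official API):

using System;

class BitnessCheck
{
    static void Main()
    {
        // 4 = running as a 32-bit process, 8 = running as a 64-bit process.
        Console.WriteLine("Pointer size: {0} bytes", IntPtr.Size);

        // A 32-bit process on 64-bit Windows (WOW64) sees
        // PROCESSOR_ARCHITEW6432 set to the real architecture.
        string arch = Environment.GetEnvironmentVariable("PROCESSOR_ARCHITECTURE");
        string wow64 = Environment.GetEnvironmentVariable("PROCESSOR_ARCHITEW6432");
        bool os64 = arch == "AMD64" || wow64 == "AMD64";
        Console.WriteLine("64-bit OS: {0}", os64);
    }
}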

--
Regards,

Mr. Jani Järvinen
C# MVP
Helsinki, Finland
(e-mail address removed)
http://www.saunalahti.fi/janij/
 
Jani Järvinen [MVP]

Hello!
Willy Denoyette said:
Nope, even on 64-bit this application will fail: the maximum size of a single
object is limited to 2 GB in all current versions of the CLR.

Thanks for pointing this out, Willy; I was thinking in terms of what the
platform can do. So if you wanted to do this in C#, you'd need to resort to
using P/Invoke rather than managed code.

--
Regards,

Mr. Jani Järvinen
C# MVP
Helsinki, Finland
(e-mail address removed)
http://www.saunalahti.fi/janij/
 
chanmm

I agree with Lucian: a large two-dimensional array is quite unlikely to be
necessary. You should look at which collection classes can help you; if you
feel none of them fits, then I would suggest looking at unmanaged code to
build the algorithm and gain performance.

chanmm
 
Willy Denoyette [MVP]

Jani Järvinen said:
So if you wanted to do this in C#, you'd need to resort to using P/Invoke
rather than managed code.

Or C++/CLI in mixed mode.

Willy.
 
Martin Pöpping

chanmm said:
You should look at which collection classes can help you; if you feel none
of them fits, then I would suggest looking at unmanaged code to build the
algorithm and gain performance.

Unmanaged code means I would have to write a C++ DLL?

I tried it now with a single array and a linked list,
but I'm still getting the same out-of-memory error.

Regards,
Martin
 
Willy Denoyette [MVP]

Martin Pöpping said:
Unmanaged code means I would have to write a C++ DLL?

I tried it now with a single array and a linked list,
but I'm still getting the same out-of-memory error.

As has been said before in this thread, you can't allocate such huge arrays on
32-bit Windows, neither from managed code nor from unmanaged code. A 32-bit
process simply does not have the space available in the process heap to
allocate more than a total of 2 GB of data; worse, it has less than that
available as contiguous space. The amount of free contiguous space varies from
application to application and, even more importantly, varies at run time due
to heap fragmentation. Free *contiguous* heap space in general lies between
~1.2 and ~1.8 GB, but on a 32-bit OS you should never assume this is guaranteed.
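
If you want to see what your own process can get, you can probe for the
largest single allocation that succeeds; a throwaway sketch (the figure will
differ from run to run, exactly as described above):

using System;

class LargestAllocationProbe
{
    static void Main()
    {
        // Binary-search the largest byte[] the CLR will hand out in one piece.
        long lo = 0, hi = int.MaxValue;
        while (lo < hi)
        {
            long mid = (lo + hi + 1) / 2;
            try
            {
                byte[] block = new byte[mid];
                block[0] = 1;   // use the block; managed allocations are zero-filled
                lo = mid;       // success: try something bigger
            }
            catch (OutOfMemoryException)
            {
                hi = mid - 1;   // failure: try something smaller
            }
            GC.Collect();       // release the probe block before the next try
        }
        Console.WriteLine("Largest single allocation: ~{0} MB",
                          lo / (1024 * 1024));
    }
}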

Willy.
 
Brian Gideon

chanmm said:
I agree with Lucian: a large two-dimensional array is quite unlikely to be
necessary. You should look at which collection classes can help you; if you
feel none of them fits, then I would suggest looking at unmanaged code to
build the algorithm and gain performance.

I really hate to see someone move away from managed code because there
isn't a suitable collection class. And I just don't think the marginal
performance difference between managed and unmanaged code is going to
matter that much anyway.

Brian
 
chanmm

Would you mind posting your code here with some explanation, or emailing it
to me? No promises, but I can try to take a look at it.

chanmm
 
