List/Collection/Array Selection - Memory Issues

james · Jul 27, 2010

General Issue: I need to create some data classes/structs that can be
used efficiently with both limited memory use as well as good search
capabilities.

Details: I'm working on a C# VSTO add-in, and part of its tasks is to
match dimensions betyween OLAP cubes. I'm collecting all the
dimension members as List<CubeInfo>, and then adding that
List<CubeInfo> to CubeDimensions. The end result will need to match
the members of the lists and return them in a different structure. It
works fine from a functional perspective, but it eats memory, since
the members can number from 5 to 800,000, and there are approx. 20
dimesions.

Question: What data structures can I employ that would have a limited
memory footprint, as well as provide good search capabilities on
millions of 'rows'

Additonal: I've considered hashtables, as well as storing the
information in a SQL Db, since that will provide native matching
capabilities.

Exiting Data structures:

[Serializable]
public class CubeInfo
{
public string DimensionType;
public string UniqueName;
public string Name;
}

[Serializable]
public class CubeDimensions
{
public string DimenstionName;
public List<CubeInfo> DimenstionData = new List<CubeInfo> { };
}

Arne Vajhøj · Jul 27, 2010

General Issue: I need to create some data classes/structs that can be
used efficiently with both limited memory use as well as good search
capabilities.

Details: I'm working on a C# VSTO add-in, and part of its tasks is to
match dimensions betyween OLAP cubes. I'm collecting all the
dimension members as List<CubeInfo>, and then adding that
List<CubeInfo> to CubeDimensions. The end result will need to match
the members of the lists and return them in a different structure. It
works fine from a functional perspective, but it eats memory, since
the members can number from 5 to 800,000, and there are approx. 20
dimesions.

Question: What data structures can I employ that would have a limited
memory footprint, as well as provide good search capabilities on
millions of 'rows'

Additonal: I've considered hashtables, as well as storing the
information in a SQL Db, since that will provide native matching
capabilities.

Exiting Data structures:

[Serializable]
public class CubeInfo
{
public string DimensionType;
public string UniqueName;
public string Name;
}

[Serializable]
public class CubeDimensions
{
public string DimenstionName;
public List<CubeInfo> DimenstionData = new List<CubeInfo> { };
}

Maybe you can use Dictionary<int,CubeInfo> instead of
List<CubeInfo> !?

Arne

james · Jul 27, 2010

General Issue: I need to create some data classes/structs that can be
used efficiently with both limited memory use as well as good search
capabilities.

Click to expand...

Details: I'm working on a C# VSTO add-in, and part of its tasks is to
match dimensions betyween OLAP cubes. I'm collecting all the
dimension members as List<CubeInfo>, and then adding that
List<CubeInfo> to CubeDimensions. The end result will need to match
the members of the lists and return them in a different structure. It
works fine from a functional perspective, but it eats memory, since
the members can number from 5 to 800,000, and there are approx. 20
dimesions.

Click to expand...

Question: What data structures can I employ that would have a limited
memory footprint, as well as provide good search capabilities on
millions of 'rows'

Click to expand...

Additonal: I've considered hashtables, as well as storing the
information in a SQL Db, since that will provide native matching
capabilities.

Click to expand...

Exiting Data structures:

Click to expand...

[Serializable]
public class CubeInfo
{
public string DimensionType;
public string UniqueName;
public string Name;
}

Click to expand...

[Serializable]
public class CubeDimensions
{
public string DimenstionName;
public List<CubeInfo> DimenstionData = new List<CubeInfo> { };
}

Click to expand...

Maybe you can use Dictionary<int,CubeInfo> instead of
List<CubeInfo> !?

Arne- Hide quoted text -

- Show quoted text -

I need the first column as a key, since that will hold the dimension
name.

Arne Vajhøj · Jul 27, 2010

General Issue: I need to create some data classes/structs that can be
used efficiently with both limited memory use as well as good search
capabilities.

Click to expand...

Details: I'm working on a C# VSTO add-in, and part of its tasks is to
match dimensions betyween OLAP cubes. I'm collecting all the
dimension members as List<CubeInfo>, and then adding that
List<CubeInfo> to CubeDimensions. The end result will need to match
the members of the lists and return them in a different structure. It
works fine from a functional perspective, but it eats memory, since
the members can number from 5 to 800,000, and there are approx. 20
dimesions.

Click to expand...

Question: What data structures can I employ that would have a limited
memory footprint, as well as provide good search capabilities on
millions of 'rows'

Click to expand...

Additonal: I've considered hashtables, as well as storing the
information in a SQL Db, since that will provide native matching
capabilities.

Click to expand...

Exiting Data structures:

Click to expand...

[Serializable]
public class CubeInfo
{
public string DimensionType;
public string UniqueName;
public string Name;
}

Click to expand...

[Serializable]
public class CubeDimensions
{
public string DimenstionName;
public List<CubeInfo> DimenstionData = new List<CubeInfo> { };
}

Click to expand...

Maybe you can use Dictionary<int,CubeInfo> instead of
List<CubeInfo> !?

Click to expand...

I need the first column as a key, since that will hold the dimension
name.

[Serializable]
public class CubeDimensions
{
public string DimenstionName;
public Dictionary<int,CubeInfo> DimenstionData = new
Dictionary<int,CubeInfo>();
}

still has DimensionName.

Arne

james · Jul 28, 2010

james said:
james said:

General Issue: I need to create some data classes/structs that can be
used efficiently with both limited memory use as well as good search
capabilities.

Click to expand...

Details: I'm working on a C# VSTO add-in, and part of its tasks is to
match dimensions betyween OLAP cubes. I'm collecting all the
dimension members as List<CubeInfo>, and then adding that
List<CubeInfo> to CubeDimensions. The end result will need to match
the members of the lists and return them in a different structure. It
works fine from a functional perspective, but it eats memory, since
the members can number from 5 to 800,000, and there are approx. 20
dimesions.

Click to expand...

What is a "dimension member" here? Are these individual data elements
from your original data set (in which case, I don't see how the
"CubeInfo" class you posted relates to that)? Some specific OLAP cube
(in which case, how do you get to 800K of them)? Something else? What
kind of "matching" are you doing between "dimension members"? How does
the end result relate to the original, unorganized data input?

Question: What data structures can I employ that would have a limited
memory footprint, as well as provide good search capabilities on
millions of 'rows'

Click to expand...

Additonal: I've considered hashtables, as well as storing the
information in a SQL Db, since that will provide native matching
capabilities. [...]

Click to expand...

As Arne has said, a hashing data structure is typically the appropriate
first-choice for fast data retrieval.

Unfortunately, I don't find enough detail in your question to know
whether that will solve your problem or not (see above). It's not clear
what kind of searching it is you're trying to do, nor how the data
structures that you've shown us enable that searching.

Neither of the data structures you've shown appear to contain the actual
discrete data you might be interested in. Nor is it clear what the
"CubeInfo" data structure is used for at all; an "OLAP cube" would
typically have three or more dimensions, but I would not expect a given
dimension to be related to more than one cube.

Ignoring the data structure examples you've provided (since I don't
understand how they relate to OLAP), it seems like there are a couple of
obvious ways to organize the data.

For dimensions where the data are in discrete "bins" (for example,
product SKUs or items organized by color), a hash table such as Arne
suggested would work well for such a dimension. In this case, you would
store in a dictionary lists of individual data elements (e.g. "rows")
corresponding to the discrete values (used as keys for the dictionary).
For any given data key, the dictionary's "value" would be the list of
data elements that have that key.

For dimensions where the data are continuous (for example, annual sales,
unit price, population, etc.) instead you might sort the data in a list
and then use a binary search to find boundary points for analysis.
Here, there would be just a single list for the dimension, containing
the individual data elements, but sorted

Some data might fit into either category. For example, dates might be
useful both as discrete data points or continuous ranges. For that kind
of data, a SortedDictionary<TKey, TValue> or SortedList<TKey, TValue>
might be more appropriate.

Though, since both of those actually use O(log n) retrieval algorithms
(binary tree and binary search, respectively) you might find just
implementing it yourself is more appropriate. Especially since both of
those data structures require unique keys, just like any dictionary,
requiring you to use lists as the "value" type for the data structures.
If you implement a sorted list yourself, then you can avoid the
overhead of the wrapper lists, because you can allow duplicated values
in your sorted list.

Whether it's important to avoid that overhead depends on how many data
elements for a given key you expect to have, on average. If there are a
high number of collisions even for that discrete-and-continuous data,
then the overhead of the lists isn't likely to be a big deal. But, if
most of the individual data elements will have unique keys for that
dimension, then you'll have almost as many lists (most being
single-element lists) as you have data elements, which could be very bad
(especially as your data set approaches a million elements…though, even
at a million if you're running on 64-bit Windows, that would be fine).

If the above discussion doesn't seem helpful or applicable to the
problem you're trying to solve, then you should restate the problem in a
more specific way, so that it's clearer what problem you _are_ actually
trying to solve.

Pete

The information is appreciated. It distills much of what I've already
read.

List/Collection/Array Selection - Memory Issues

james

Arne Vajhøj

james

Arne Vajhøj

james