string dictionary and memory issue.

Alexandre Brisebois (www.pointnetsolutions.com)

Hi,

I am currently building a lexical analysis component to pull keywords
out of content. I have a functional first build, but I am having
problems since I am easily loading over 300,000 strings into memory.

When I am doing the actual analysis I can reach up to 400 MB of RAM
usage.

I have built my dictionary as a tree of nodes, each containing a
hashtable. Each node represents one letter of a string and carries a
flag marking the end of a string.
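
For concreteness, a minimal sketch of the kind of structure described
above: one hashtable per node, one node per letter, and an end-of-word
flag (the names are illustrative, not the actual code):

using System.Collections;

// One node per letter; children are keyed by character in a Hashtable.
class TrieNode
{
    public Hashtable Children = new Hashtable();
    public bool IsEndOfWord; // true when a stored string ends here

    public void Add(string word, int index)
    {
        if (index == word.Length)
        {
            IsEndOfWord = true;
            return;
        }
        TrieNode child = (TrieNode)Children[word[index]];
        if (child == null)
        {
            child = new TrieNode();
            Children[word[index]] = child;
        }
        child.Add(word, index + 1);
    }

    public bool Contains(string word, int index)
    {
        if (index == word.Length)
            return IsEndOfWord;
        TrieNode child = (TrieNode)Children[word[index]];
        return child != null && child.Contains(word, index + 1);
    }
}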

I chose this structure for speed. Reading and loading the tree are
remarkably fast, but the cost to the system is greater than I had
expected.

Could anyone suggest a solution to this? Should I store my node data
in compressed form?
Best Regards,
Alexandre Brisebois
 
Alexandre said:
I am currently building a lexical analysis component to pull keywords
out of content. I have a functional first build, but I am having
problems since I am easily loading over 300,000 strings into memory.

When I am doing the actual analysis I can reach up to 400 MB of RAM
usage.

I have built my dictionary as a tree of nodes, each containing a
hashtable. Each node represents one letter of a string and carries a
flag marking the end of a string.

How many entries are in each node? I wouldn't expect there'd be that
many, particularly if you're only dealing with ASCII characters. I'd
suggest using plain arrays: either a compact list of entries (for nodes
without many sub-entries) or an array with nulls in it, indexed by
character (for nodes where traversing a list would be expensive). You
could use ArrayList/List<T> instead of the straight arrays, but call
TrimToSize on all of them when you've finished loading the list of
words to avoid wastage.
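
A rough sketch of that array-based layout, assuming only lowercase
ASCII 'a' to 'z' (the names are made up, not actual code):

// One slot per letter: a child lookup is a single index operation and
// there is no per-node Hashtable overhead. Most slots simply stay null.
class ArrayTrieNode
{
    public ArrayTrieNode[] Children = new ArrayTrieNode[26];
    public bool IsEndOfWord;

    public void Add(string word)
    {
        ArrayTrieNode node = this;
        foreach (char c in word)
            node = node.GetOrAddChild(c);
        node.IsEndOfWord = true;
    }

    public ArrayTrieNode GetOrAddChild(char c)
    {
        int i = c - 'a'; // assumes lowercase 'a'..'z' input
        if (Children[i] == null)
            Children[i] = new ArrayTrieNode();
        return Children[i];
    }
}

For sparse nodes, a trimmed list of entries costs less than 26 mostly
null slots; ArrayList.TrimToSize() (List<T> calls it TrimExcess())
drops the spare capacity once loading is finished.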

Jon
 
Well, each node has between zero and the full number of possible characters.

I tried loading ~117,000 words into a straightforward hashtable;
a 4 MB XML file became 25 MB in memory...

But I still find that extremely large. I am trying to find a way to use
as little RAM as possible without completely hurting my speed.

I will try the ArrayList and List<T> generic collections.
Best Regards,
Alexandre Brisebois

If you have any other ideas please do let me know.
 
Alexandre said:
Well, each node has between zero and the full number of possible characters.

I tried loading ~117,000 words into a straightforward hashtable;
a 4 MB XML file became 25 MB in memory...

Is that the total memory of the process? If so, that's not particularly
surprising. The framework creates a fair amount of overhead (try
loading a virtually empty file to see what I mean) and the strings will
all be converted to Unicode, which will double the size of an ASCII
file.

But I still find that extremely large. I am trying to find a way to use
as little RAM as possible without completely hurting my speed.

Do you actually have a concrete requirement with respect to memory? On
most modern systems, 25MB is very little. One can waste a lot of time
going for the "best possible" performance instead of performance which
is "good enough".

Jon
 
Well, I want to try to have this run on a shared hosting machine.

So I do wish to find a way to reduce the memory use, though the more I
look into this, the more I understand that I will have to build a
distributed system.

I'm still not sure how I'm going to have to structure it all.

So far my dictionary files contain roughly 200,000 words and are
growing. I also built a thesaurus ADT, so we are talking about another
easy 200,000 words (interlinked by references, not copies).

The system also continuously identifies unknown words which need to be
reviewed.

All of this takes up a lot of resources once it is loaded.

I have never done lexical analysis before, so this is all new to me;
I have not found the best strategy yet and keep looking and trying
different things.

Using a database would make everything much simpler,
but I don't have that luxury.

That is the main reason I am looking for some kind of strategy for
building either partial or full thesaurus and dictionary ADTs.
I am currently looking at the partial-loading option, where I would
load a graph containing only the words needed for a particular
analysis and unload it afterwards.
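
A minimal sketch of that partial-loading idea, assuming a plain-text
word list (the file name, class, and method names are illustrative):
collect the distinct words of one document, keep only the dictionary
entries that match, and let the subset be garbage-collected afterwards.

using System;
using System.Collections.Generic;
using System.IO;

class PartialDictionary
{
    // Returns only the dictionary entries the document actually uses,
    // so the full word list never has to stay resident in memory.
    public static HashSet<string> LoadSubset(string dictionaryFile,
                                             IEnumerable<string> documentWords)
    {
        HashSet<string> wanted = new HashSet<string>(
            documentWords, StringComparer.OrdinalIgnoreCase);
        HashSet<string> subset = new HashSet<string>();

        foreach (string line in File.ReadLines(dictionaryFile))
        {
            string word = line.Trim();
            if (wanted.Contains(word))
                subset.Add(word);
        }
        return subset;
    }
}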

I am also looking at maybe merging the thesaurus with the dictionary,
but I am trying to find anything else that would prevent me from doing so.

Best Regards,
Alexandre Brisebois
 
