string dictionary and memory issue.

Alexandre Brisebois (www.pointnetsolutions.com)

Hi,

I am currently building a lexical analysis component to pull keywords
out of content. I have a functional first build, but I am having
problems since I am easily loading over 300,000 strings into memory.

When I am doing the actual analysis I can reach up to 400 MB of RAM
usage.

I have built my dictionary as a tree of nodes, each containing a
hashtable. Each node represents one letter of a string and carries a
flag marking the end of a string.
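
For concreteness, a minimal sketch of the kind of structure described
above: one hashtable per node, one node per letter, and an end-of-word
flag (the names are illustrative, not the actual code):

using System.Collections;

// One node per letter; children are keyed by character in a Hashtable.
class TrieNode
{
    public Hashtable Children = new Hashtable();
    public bool IsEndOfWord; // true when a stored string ends here

    public void Add(string word, int index)
    {
        if (index == word.Length)
        {
            IsEndOfWord = true;
            return;
        }
        TrieNode child = (TrieNode)Children[word[index]];
        if (child == null)
        {
            child = new TrieNode();
            Children[word[index]] = child;
        }
        child.Add(word, index + 1);
    }

    public bool Contains(string word, int index)
    {
        if (index == word.Length)
            return IsEndOfWord;
        TrieNode child = (TrieNode)Children[word[index]];
        return child != null && child.Contains(word, index + 1);
    }
}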

I chose this structure for speed. Reading and loading the tree are
remarkably fast, but the cost to the system is greater than I had
expected.

Could anyone suggest a solution to this? Should I store my node data
in compressed form?
Best Regards,
Alexandre Brisebois
 
Alexandre said:
I am currently building a lexical analysis component to pull keywords
out of content. I have a functional first build, but I am having
problems since I am easily loading over 300,000 strings into memory.

When I am doing the actual analysis I can reach up to 400 MB of RAM
usage.

I have built my dictionary as a tree of nodes, each containing a
hashtable. Each node represents one letter of a string and carries a
flag marking the end of a string.

How many entries are in each node? I wouldn't expect there'd be that
many, particularly if you're only dealing with ASCII characters. I'd
suggest using plain arrays: either a compact list of entries (for nodes
without many sub-entries) or an array with nulls in it, indexed by
character (for nodes where traversing a list would be expensive). You
could use ArrayList/List<T> instead of the straight arrays, but call
TrimToSize on all of them when you've finished loading the list of
words to avoid wastage.
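
A rough sketch of that array-based layout, assuming only lowercase
ASCII 'a' to 'z' (the names are made up, not actual code):

// One slot per letter: a child lookup is a single index operation and
// there is no per-node Hashtable overhead. Most slots simply stay null.
class ArrayTrieNode
{
    public ArrayTrieNode[] Children = new ArrayTrieNode[26];
    public bool IsEndOfWord;

    public void Add(string word)
    {
        ArrayTrieNode node = this;
        foreach (char c in word)
            node = node.GetOrAddChild(c);
        node.IsEndOfWord = true;
    }

    public ArrayTrieNode GetOrAddChild(char c)
    {
        int i = c - 'a'; // assumes lowercase 'a'..'z' input
        if (Children[i] == null)
            Children[i] = new ArrayTrieNode();
        return Children[i];
    }
}

For sparse nodes, a trimmed list of entries costs less than 26 mostly
null slots; ArrayList.TrimToSize() (List<T> calls it TrimExcess())
drops the spare capacity once loading is finished.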

Jon
 
Well, each node has between zero and the full number of possible characters.

I tried loading ~117,000 words into a straightforward hashtable;
a 4 MB XML file became 25 MB in memory...

But I still find that extremely large. I am trying to find a way to use
as little RAM as possible without completely hurting my speed.

I will try the ArrayList and List<T> generic collections.
Best Regards,
Alexandre Brisebois

If you have any other ideas please do let me know.
 
Alexandre said:
Well, each node has between zero and the full number of possible characters.

I tried loading ~117,000 words into a straightforward hashtable;
a 4 MB XML file became 25 MB in memory...

Is that the total memory of the process? If so, that's not particularly
surprising. The framework creates a fair amount of overhead (try
loading a virtually empty file to see what I mean) and the strings will
all be converted to Unicode, which will double the size of an ASCII
file.

But I still find that extremely large. I am trying to find a way to use
as little RAM as possible without completely hurting my speed.

Do you actually have a concrete requirement with respect to memory? On
most modern systems, 25MB is very little. One can waste a lot of time
going for the "best possible" performance instead of performance which
is "good enough".

Jon
 
Well, I want to try to have this run on a shared hosting machine.

So I do wish to find a way to reduce the memory use, though the more I
look into this, the more I understand that I will have to build a
distributed system.

I'm still not sure how I'm going to have to structure it all.

So far my dictionary files contain roughly 200,000 words and are
growing. I also built a thesaurus ADT, so we are talking about another
easy 200,000 words (interlinked by references, not copies).

The system also continuously identifies unknown words which need to be
reviewed.

All of this takes up a lot of resources once it is loaded.

I have never done lexical analysis before, so this is all new to me;
I have not found the best strategy yet and keep looking and trying
different things.

Using a database would make everything much simpler,
but I don't have that luxury.

That is the main reason I am looking for some kind of strategy for
building either partial or full thesaurus and dictionary ADTs.
I am currently looking at the partial-loading option, where I would
load a graph containing only the words needed for a particular
analysis and unload it afterwards.
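
A minimal sketch of that partial-loading idea, assuming a plain-text
word list (the file name, class, and method names are illustrative):
collect the distinct words of one document, keep only the dictionary
entries that match, and let the subset be garbage-collected afterwards.

using System;
using System.Collections.Generic;
using System.IO;

class PartialDictionary
{
    // Returns only the dictionary entries the document actually uses,
    // so the full word list never has to stay resident in memory.
    public static HashSet<string> LoadSubset(string dictionaryFile,
                                             IEnumerable<string> documentWords)
    {
        HashSet<string> wanted = new HashSet<string>(
            documentWords, StringComparer.OrdinalIgnoreCase);
        HashSet<string> subset = new HashSet<string>();

        foreach (string line in File.ReadLines(dictionaryFile))
        {
            string word = line.Trim();
            if (wanted.Contains(word))
                subset.Add(word);
        }
        return subset;
    }
}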

I am also looking at maybe merging the thesaurus with the dictionary,
but I am trying to find anything else that would prevent me from doing so.

Best Regards,
Alexandre Brisebois
 
