Data Storage for news client

Danny Tuppeny · Nov 17, 2005

Hi all,

I'm writing a news client (mainly to test out CAB & ClickOnce!), and trying
to decide on what to use for the storage of messages etc. SQL Express seems
like overkill (and is a hefty download for a < 1MB app!). Also, since there
could be thousands of messages (potentially binary), I'm not sure that
serializing my classes to disk would perform at all well.

What would other people use for a small app like this? And why?

Thanks,

Pete Davis · Nov 17, 2005

I would simply store them sequentially in a single file and then create an
index file which has some header information (perhaps subject, author, date,
etc) and an offset to the message's text in the main file. Similar to my
response to the message just a bit earlier under "squeeze few image file
into on binary file"
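
A minimal sketch of that layout, in Python rather than C# for brevity. The fixed-size record format, the 64-byte field widths, and the function names are assumptions for illustration, not anything taken from the earlier post:

```python
import os
import struct

# Hypothetical fixed-size index record: data offset (8 bytes), body length
# (4 bytes), then subject and author as null-padded 64-byte fields.
INDEX_RECORD = struct.Struct("<QI64s64s")


def append_message(data_path, index_path, subject, author, body):
    """Append one message body to the data file and one record to the index."""
    with open(data_path, "ab") as data:
        data.seek(0, os.SEEK_END)
        offset = data.tell()  # where this message starts in the data file
        data.write(body)
    record = INDEX_RECORD.pack(
        offset,
        len(body),
        subject.encode("utf-8")[:64],  # struct pads short values with nulls
        author.encode("utf-8")[:64],
    )
    with open(index_path, "ab") as index:
        index.write(record)


def read_message(data_path, index_path, n):
    """Fetch message n (0-based) with one seek in each file."""
    with open(index_path, "rb") as index:
        index.seek(n * INDEX_RECORD.size)
        offset, length, subject, author = INDEX_RECORD.unpack(
            index.read(INDEX_RECORD.size)
        )
    with open(data_path, "rb") as data:
        data.seek(offset)
        body = data.read(length)
    return (
        subject.rstrip(b"\0").decode("utf-8", "ignore"),
        author.rstrip(b"\0").decode("utf-8", "ignore"),
        body,
    )
```

Because every index record has the same size, record n sits at n times the record size, so neither file has to be scanned to reach a given message.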

You could also compress the text prior to storing it in the single file
(using SharpZipLib or 7Zip or something). I suspect it would compress well,
even messages with uuencoded or yEnc-encoded binaries.

I actually need to integrate a newsreader, at some point, into an app I'm
writing and I suspect this is the direction I'll take.

The advantage of this is that access is quick and it easily accommodates
thousands of messages. If you store the messages in separate files, you'll
soon find your directory getting large. Getting to the data in a single file
with an index, using Seek, will be much faster than having the file system
find a match for your file name in a directory with thousands of files.

It's also fairly easy to purge lots of contiguous messages (which is likely
how you'd want to handle purging from a newsreader) from the file. For
example, if you want to delete the first 1000 messages, simply find the
index to the 1001st message, then copy the data from there to the end to a
new file, delete the original file, and then rename the new one to the name
of the old. Do the same with the index file.
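
A rough sketch of that purge, in Python rather than C# for brevity. It assumes a hypothetical index made of (offset, length) pairs, and it makes explicit one detail that is easy to miss: the offsets left in the index have to be shifted back by the number of bytes removed from the data file.

```python
import os
import shutil
import struct

# Hypothetical index layout: one (offset, length) pair per message.
ENTRY = struct.Struct("<QI")


def purge_first(data_path, index_path, count):
    """Drop the first `count` messages by rewriting both files."""
    with open(index_path, "rb") as f:
        raw = f.read()
    entries = [ENTRY.unpack_from(raw, i) for i in range(0, len(raw), ENTRY.size)]
    kept = entries[count:]
    # Everything before the first kept message is discarded.
    cut = kept[0][0] if kept else os.path.getsize(data_path)

    # Copy the tail of the data file to a new file, then swap it in.
    with open(data_path, "rb") as src, open(data_path + ".new", "wb") as dst:
        src.seek(cut)
        shutil.copyfileobj(src, dst)
    os.replace(data_path + ".new", data_path)

    # Rewrite the index with every offset shifted back by the bytes removed.
    with open(index_path + ".new", "wb") as dst:
        for offset, length in kept:
            dst.write(ENTRY.pack(offset - cut, length))
    os.replace(index_path + ".new", index_path)
```

Here os.replace stands in for the delete-then-rename step, since it overwrites the old file in a single call.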

Pete

Danny Tuppeny · Nov 17, 2005

Hi Peter,

Interesting response. What about performance though? If the user opens a
folder that has 1,000 messages, either I have to load them all *very*
quickly (I need to display Sender, Subject, Date, etc.), or I fetch them as
the user scrolls (which could be pretty unresponsive if the user is dragging
the scrollbar).

What would you store in the index file? The user will be able to change the
sort order in the display, so unless I maintain a few indexes, it'd be
difficult to get a list in order. The message list will show the Sender,
Date, Subject etc., and so if I have to scan through the data file for
thousands of these things, surely it'll take an age? I've never done this
kind of processing before, so I've no idea of how it would perform. I don't
want to build it and find it's unacceptable, so any experiences anyone can
share would be much appreciated!

As for compression - again, without testing it, I wouldn't know - but
although compression would save tons of disk space, wouldn't the overhead of
the compression make it slower than reading more uncompressed data? I assume
compression would be variable, so it'd be difficult to seek within a
compressed stream. Any ideas?

Thanks,

Danny

Pete Davis · Nov 18, 2005

Danny Tuppeny said:
Hi Peter,
[snip]
Interesting response. What about performance though? If the user opens a
folder that has 1,000 messages, either I have to load them all *very*
quickly (I need to display Sender, Subject, Date, etc.), or I fetch them
as the user scrolls (which could be pretty unresponsive if the user is
dragging the scrollbar).

I suspect it will load much faster than you think.

Assuming in the index you store Sender, subject, date, message ID, offset in
main file, and a few other header items, I suspect you're looking at an
average of roughly 100-200 bytes per message. Let's say 200 bytes, but
that's probably on the high side. That works out to only 200K per thousand
messages or 5000 messages per megabyte. That will load into memory pretty
quickly.

What would you store in the index file? The user will be able to change
the sort order in the display, so unless I maintain a few indexes, it'd be
difficult to get a list in order. The message list will show the Sender,
Date, Subject etc., and so if I have to scan through the data file for
thousands of these things, surely it'll take an age? I've never done this
kind of processing before, so I've no idea of how it would perform. I
don't want to build it and find it's unacceptable, so any experiences
anyone can share would be much appreciated!

Well, if they're going to be able to sort them, then it makes sense to load
it all into memory, assuming that's feasible. Given the figures above, that
should be doable on most modern computers, assuming you're just loading
messages from a single group at a time. Load the messages into memory and
then sort them. Leave them sorted in the files however you want. It won't
make much difference.

I don't expect it to be lightning fast, but I think it will be much faster
than you think. If you implement the IComparer interface, sorting should be a
piece of cake, and the built-in sort algorithm is quicksort, I believe.
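
To illustrate sorting the in-memory index by whichever column the user picks, here is a small sketch in Python rather than C#. A key function plays the role that an IComparer would play in .NET, and Python's built-in sort is Timsort rather than quicksort. The IndexEntry fields and the SORT_KEYS table are hypothetical names chosen for the example:

```python
from dataclasses import dataclass
from datetime import datetime
from operator import attrgetter


@dataclass
class IndexEntry:
    sender: str
    subject: str
    date: datetime
    offset: int  # position of the message body in the data file


# Hypothetical mapping from a clicked column to a sort key.
SORT_KEYS = {
    "sender": lambda e: e.sender.lower(),
    "subject": lambda e: e.subject.lower(),
    "date": attrgetter("date"),
}


def sort_entries(entries, column, descending=False):
    """Return the in-memory index sorted by the chosen column."""
    return sorted(entries, key=SORT_KEYS[column], reverse=descending)


entries = [
    IndexEntry("Pete Davis", "Re: Data Storage", datetime(2005, 11, 18), 512),
    IndexEntry("Danny Tuppeny", "Data Storage", datetime(2005, 11, 17), 0),
    IndexEntry("Guest", "Re: Data Storage", datetime(2005, 11, 18), 2048),
]
by_sender = sort_entries(entries, "sender")
newest_first = sort_entries(entries, "date", descending=True)
```

Only the small index entries are sorted; the message bodies stay where they are in the data file and are reached through each entry's offset.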

As for compression - again, without testing it, I wouldn't know - but
although compression would save tons of disk space, wouldn't the overhead
of the compression make it slower than reading more uncompressed data? I
assume compression would be variable, so it'd be difficult to seek within
a compressed stream. Any ideas?

Compressing data is slow. Decompressing is generally quite fast. I suspect
it'll be faster to read due to the large amount of saved space, particularly
if data is located on a network drive.

Remember, 2 files: Index file and Data File. Leave the index file
uncompressed. Don't compress the entire data file, just compress the
individual messages. That way you have an offset to each compressed message
and just begin decompression at the beginning of the message. Again, look at
the message I posted earlier where I use a simple index file and store a
bunch of thumbnails in a single file. It easily loads 500 thumbnails (and
that includes jpeg decoding of the data) in a matter of maybe 2 seconds.
Without the jpeg decoding, it would be less than half a second, I'm sure.
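
A small sketch of per-message compression, in Python with zlib standing in for SharpZipLib or 7Zip. An in-memory buffer stands in for the data file and a plain list stands in for the index file; both, along with the function names, are assumptions made for the example:

```python
import io
import zlib


def append_compressed(data, body):
    """Compress one message on its own and return (offset, stored_length)."""
    data.seek(0, io.SEEK_END)
    offset = data.tell()
    packed = zlib.compress(body)
    data.write(packed)
    return offset, len(packed)


def read_compressed(data, offset, stored_length):
    """Seek straight to one message and decompress only that message."""
    data.seek(offset)
    return zlib.decompress(data.read(stored_length))


# In-memory stand-in for the data file; the list stands in for the index.
data_file = io.BytesIO()
index = [
    append_compressed(data_file, text)
    for text in (b"first message " * 50, b"second message " * 50)
]
second = read_compressed(data_file, *index[1])
```

Since each message is its own compressed stream, the index only needs the offset and the compressed length, and seeking never lands in the middle of a stream.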

Guest · Nov 18, 2005

Why don't you try the SQLite database engine? It's a single small DLL,
requires no installation, has an ADO.NET provider, and it's extremely fast.
There's now a 2.0 version as well. Check it out at Sourceforge.net
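
For comparison, a rough sketch of what the same store might look like on SQLite. It uses Python's built-in sqlite3 module rather than the ADO.NET provider, and the table and column names are made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a file path here would persist the store
conn.execute(
    """
    CREATE TABLE messages (
        id      INTEGER PRIMARY KEY,
        sender  TEXT NOT NULL,
        subject TEXT NOT NULL,
        posted  TEXT NOT NULL,  -- ISO 8601 date, which sorts correctly as text
        body    BLOB NOT NULL
    )
    """
)
conn.executemany(
    "INSERT INTO messages (sender, subject, posted, body) VALUES (?, ?, ?, ?)",
    [
        ("Danny Tuppeny", "Data Storage", "2005-11-17", b"first body"),
        ("Pete Davis", "Re: Data Storage", "2005-11-18", b"second body"),
        ("Guest", "Re: Data Storage", "2005-11-18", b"third body"),
    ],
)
conn.commit()

# The message list needs only the header columns, in whatever order is chosen.
headers = conn.execute(
    "SELECT id, sender, subject, posted FROM messages ORDER BY posted DESC, id"
).fetchall()

# A body is fetched only when a message is actually opened.
body = conn.execute("SELECT body FROM messages WHERE id = ?", (1,)).fetchone()[0]
```

Here the ORDER BY clause takes the place of a hand-maintained sort order in an index file.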
peter

Danny Tuppeny · Nov 18, 2005

Pete Davis said:
I suspect it will load much faster than you think.

After Googling a little more last night, I think you're right!

I ran this:

http://www.codeproject.com/csharp/FastBinaryFileInput.asp

Which didn't take long to create 10,000,000 structs in a binary file -
276MB of data

Well, if they're going to be able to sort them, then it makes sense to
load it all into memory, assuming that's feasible. Given the figures
above, that should be doable on most modern computers, assuming you're just
loading messages from a single group at a time. Load the messages into
memory and then sort them. Leave them sorted in the files however you
want. It won't make much difference.

I was thinking about this - if once loaded into memory, I let the user sort
(probably by clicking column headers), once they move to another folder (or
close the app), I can write the index back in this order - which persists
their sort order, but also means I don't ever have to load it and
immediately sort afterwards

Do you think the index file would perform well as normal Serialized objects?
The smaller things (Folders, user settings etc.) I was going to just
serialize as XML. Since the messages (indexes) won't be huge, I'm wondering
if they can be done the same way, or if I'd need to think about something
slightly different, like the message data..?

Remember, 2 files: Index file and Data File. Leave the index file
uncompressed. Don't compress the entire data file, just compress the
individual messages. That way you have an offset to each compressed
message and just begin decompression at the beginning of the message.
Again, look at the message I posted earlier where I use a simple index
file and store a bunch of thumbnails in a single file. It easily loads 500
thumbnails (and that includes jpeg decoding of the data) in a matter of
maybe 2 seconds. Without the jpeg decoding, it would be less than half a
second, I'm sure.

I forgot to look! Just looked now, and it looks very helpful - thanks!

Danny

Pete Davis · Nov 18, 2005

Danny Tuppeny said:
Pete Davis said:

I suspect it will load much faster than you think.

[snip]

I was thinking about this - if once loaded into memory, I let the user
sort (probably by clicking column headers), once they move to another
folder (or close the app), I can write the index back in this order -
which persists their sort order, but also means I don't ever have to load
it and immediately sort afterwards

Yes, it will. But again, I don't think sorting is going to be slow at all.

Do you think the index file would perform well as normal Serialized
objects? The smaller things (Folders, user settings etc.) I was going to
just serialize as XML. Since the messages (indexes) won't be huge, I'm
wondering if they can be done the same way, or if I'd need to think about
something slightly different, like the message data..?

XML serialization for the headers is probably fine.

Pete
