Database for C#?

Mark B

We have been working on an Outlook 2007 add-in in C#.

What's the best way to handle access to data in an XML file?

For example, if we want to refer to a list of the 80,000 zip codes and
associated states and pull the state name from a particular record, what are
the options?

I've been reading about the SelectSingleNode(String) method
(http://msdn.microsoft.com/en-us/library/fb63z0tw.aspx). However, would that
be slow because it's not indexed the way a SQL database is?
 
Stanimir Stoyanov

Yes, XML lookup would be slow because it is not indexed. SQL is a good
choice, but for an application such as an Outlook add-in XML might be a better
choice--one file, no additional libraries or software required.

However, make sure that at startup you cache the whole content into a suitable
Dictionary<key, value>. Loading might be a little slow, but once cached,
value lookups will be much faster (by the way, loop through all the XML nodes
sequentially when loading; don't select based on criteria). Of course, you can
also run tests to see how this affects the overall performance of the add-in.
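
For example, a minimal sketch of that startup caching might look like this
(the layout assumed here, one <ZipCode code="90210" state="CA"/> element per
record, and the ZipCache name are made up for illustration):

using System.Collections.Generic;
using System.Xml;

static class ZipCache
{
    public static Dictionary<string, string> Load(string path)
    {
        var map = new Dictionary<string, string>(100000);

        // Stream through the file once, node by node, rather than
        // running per-record XPath selections against the document.
        using (XmlReader reader = XmlReader.Create(path))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element &&
                    reader.Name == "ZipCode")
                {
                    map[reader.GetAttribute("code")] = reader.GetAttribute("state");
                }
            }
        }
        return map;
    }
}

// Once loaded at startup, each lookup is just a dictionary access:
//   var zipToState = ZipCache.Load(path);
//   string state = zipToState["90210"];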
 
Hillbilly

Have you read up on LINQ to XML? XPath is XPath, and personally speaking I
think an XML file with 80,000 elements is going to impose an observable
performance cost; the data should probably be shredded into SQL Server. BTW --
you wouldn't mind sending me a copy of that data, would you?
 
Mark B

That scenario of zip codes was just made up to explain the question :(.

Ours may go up to 30,000 records.

We have 3 fields:

LanguageCode (string)
LookupID (integer)
LookupText (string)

We search on LanguageCode (e.g. fr-CA) and then LookupID (e.g. 4235) to pull
back relevant text (in this case Canadian French text).

I had thought of using a two-dimensional string array:

GetText(LanguageCodeInteger, LookupID)

I guess that would be fast and use a bit of memory, but I am guessing a
Dictionary does the same thing (I am new to C#)?
 
Mark B

Thanks. I am new to C# but am guessing a Dictionary, loaded at startup, is the
way to go.
 
Arne Vajhøj

Mark said:
That scenario of zip codes was just made up to explain the question :(.

Always ask the real question. It is very difficult to
come up with a truly equivalent case.
Ours may go up to 30,000 records.

We have 3 fields:

LanguageCode (string)
LookupID (integer)
LookupText (string)

We search on LanguageCode (e.g. fr-CA) and then LookupID (e.g. 4235) to
pull back relevant text (in this case Canadian French text).

I had thought of using a two-dimensional string array:

GetText(LanguageCodeInteger, LookupID)

I guess that would be fast and use a bit of memory, but I am guessing a
Dictionary does the same thing (I am new to C#)?

Dictionary<string,Dictionary<int, string>>

then you can look up with:

d["fr-CA"][4235]

Arne
 
Mark B

Thanks.

 
Michael B. Trausch

Yes, XML lookup would be slow because it is not indexed. SQL is a good
choice, but for an application such as an Outlook add-in XML might be a
better choice--one file, no additional libraries or software required.

However, make sure that at startup you cache the whole content into a
suitable Dictionary<key, value>. Loading might be a little slow, but once
cached, value lookups will be much faster (by the way, loop through all the
XML nodes sequentially when loading; don't select based on criteria). Of
course, you can also run tests to see how this affects the overall
performance of the add-in.

Hrm.

This will consume at least 400,000 bytes of RAM (5 bytes times 80,000 ZIP
codes), and that isn't counting the city/state data if that is part of the
XML file.

It'd probably be better to use (if available) a DBF table and the
Windows DBF ODBC driver. DBFs can be indexed, and, if memory serves,
queried with a subset of SQL. However, in doing that, you'll incur the
penalty of also not being cross-platform.

You could create a primitive DBF library in under 3-4 KB of legible source
code---certainly smaller than a simple C reader that includes indexing
support. It won't include SQL capability, but you'll gain the ability to use
a DBF (and its index) and to use whatever query interface suits you best.
You won't incur the memory-usage penalty of having the entire thing loaded
in memory, unless you write the DBF library to load the entire thing into
RAM.

If memory consumption is not a concern and you expect people to always
have lots of RAM free, then reading the XML file isn't a problem.

There are programming specs for DBF/xBase files readily handy on the
Web. This link should be a sufficient starting point:

http://www.clicketyclick.dk/databases/xbase/format/index.html

--- Mike
 
Arne Vajhøj

Michael said:
Hrm.

This will consume at least 400,000 bytes of RAM (5 bytes times 80,000 ZIP
codes), and that isn't counting the city/state data if that is part of the
XML file.

It'd probably be better to use (if available) a DBF table and the
Windows DBF ODBC driver. DBFs can be indexed, and, if memory serves,
queried with a subset of SQL. However, in doing that, you'll incur the
penalty of also not being cross-platform.

You could create a primitive DBF library in under 3-4 KB of legible source
code---certainly smaller than a simple C reader that includes indexing
support. It won't include SQL capability, but you'll gain the ability to use
a DBF (and its index) and to use whatever query interface suits you best.
You won't incur the memory-usage penalty of having the entire thing loaded
in memory, unless you write the DBF library to load the entire thing into
RAM.

If memory consumption is not a concern and you expect people to always
have lots of RAM free, then reading the XML file isn't a problem.

Hm.

400 KB is nothing today.

And in-memory lookup will be a lot faster than a database.

And writing one's own database library seems like a complete
waste of resources.

And the cross-platform aspect does not seem to apply
much to an Outlook add-in written in C#.

Arne
 
Michael B. Trausch

Hm.

400 KB is nothing today.

It's close to half a megabyte. Systems with 256 MB of RAM are not
uncommon still, and 400K is a noticeable size when the system's memory
is nearly fully utilized and you're dipping into swap. Just because
virtual memory exists isn't really an excuse to act like one has
infinite resources. There is far too much out there that is bloated
and contains way too much overhead.
And in-memory lookup will be a lot faster than a database.

Only if the system isn't leaning heavily on swap. My system, it'd be
really fast---I have 8 gigs of RAM. But I know many people using
systems with memory sizes as small as 128 MB. We don't have to
count memory down to the bit anymore when we're measuring memory
consumption, but there is a middle road between those days and assuming
a machine with infinite memory resources. IOW, that road lies between
assuming only a very tiny space is available and optimizing
the crap out of software for size, and not doing anything at all to
think about the memory footprint of what we're doing. The middle road
seems reasonable when you're talking in terms that can be expressed in
tenths of a megabyte. If we were talking about 40K of data, I'd think
that a database was insane overhead. But a ZIP code database with any
meaningful information is going to be at least 2.1 MB for 80,000
records, not including the structure of the file it's stored in,
whether that be XML or DBF or even CSV. With 256 MB of RAM, only about
128 processes have to grab 2 MB each and you're dipping into swap
(oversimplified; really it's probably closer to 80 processes outside
the operating system, but you get the point).
And writing one's own database library seems like a complete
waste of resources.

A DBF reader is pretty trivial in any language, and functionality to
utilize the index is not much harder than that, really. The format
isn't that complex, it's just a structured flat file, which is easy to
write a library for. I just don't know if someone's written such a
lightweight thing in C# yet. If not, well, whatever---again, it's a
trivial file format to write for, and for a read-only database (from
the application's point of view) it's _very_ simple to implement. A
reasonably decent programmer should be able to implement a relatively
efficient, read-only DBF library in 2 hours (not counting the time it
takes to read the specs); 3 hours if you include an efficient method of
utilizing an index; add a few more hours if you want to write to it
with row-level locking. Subtract a little bit if you're not
implementing the Memo functionality, which is really not necessary
unless you're trying to import old DBase or VFP-based data from
applications that actually made use of the memo fields.
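
As a rough illustration of how small such a reader can be, a minimal sketch
for standard dBase III files might look like this (character data only; no
index, memo, or codepage handling, and untested, so treat it as a sketch):

using System.Collections.Generic;
using System.IO;
using System.Text;

static class DbfSketch
{
    // Returns each undeleted record as a field-name -> text dictionary.
    // Numbers and dates come back as their raw, space-padded text.
    public static List<Dictionary<string, string>> Read(string path)
    {
        var rows = new List<Dictionary<string, string>>();
        using (var br = new BinaryReader(File.OpenRead(path), Encoding.ASCII))
        {
            br.ReadBytes(4);                    // version byte + last-update date
            int recordCount = br.ReadInt32();   // offset 4: number of records
            short headerSize = br.ReadInt16();  // offset 8: total header size
            short recordSize = br.ReadInt16();  // offset 10: bytes per record
            br.ReadBytes(20);                   // reserved bytes up to offset 32

            // Field descriptors: 32 bytes each, up to the 0x0D terminator.
            var names = new List<string>();
            var lengths = new List<int>();
            int fieldCount = (headerSize - 32 - 1) / 32;
            for (int i = 0; i < fieldCount; i++)
            {
                byte[] desc = br.ReadBytes(32);
                names.Add(Encoding.ASCII.GetString(desc, 0, 11).TrimEnd('\0'));
                lengths.Add(desc[16]);          // field width in bytes
            }

            // Fixed-width records start immediately after the header.
            br.BaseStream.Seek(headerSize, SeekOrigin.Begin);
            for (int r = 0; r < recordCount; r++)
            {
                byte[] rec = br.ReadBytes(recordSize);
                if (rec.Length < recordSize || rec[0] == (byte)'*')
                    continue;                   // short read or deleted row
                var row = new Dictionary<string, string>();
                int offset = 1;                 // skip the deletion flag
                for (int f = 0; f < names.Count; f++)
                {
                    row[names[f]] =
                        Encoding.ASCII.GetString(rec, offset, lengths[f]).Trim();
                    offset += lengths[f];
                }
                rows.Add(row);
            }
        }
        return rows;
    }
}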

It would be a waste of resources to try to implement a fully functional
library that would handle row-level locking and all of that junk. In
any case, it was just an alternative.
And the cross-platform aspect does not seem to apply
much to an Outlook add-in written in C#.

Outlook addin? No. Generally good practice? Yes, since C# is
cross-platform these days. It's a good habit to get into. Being in
the habit of writing software that is close to cross-platform is
beneficial for many reasons, one of the most important being keeping an
open mind with regard to the implementation details of things.

--- Mike
 
Mark B

Thanks again guys.

I have two follow-up questions relating to this:

1) I want the XML file that .NET generates to have a separate node for each
language, so that we can selectively load only the text values for the
language that the user selects for the add-in. That will reduce memory use
and make loading quicker. Currently the code produces the XML in a flat
structure:

<XMLData>
  <LanguageCode>EN-US</LanguageCode>
  <LookupID>391</LookupID>
  <LanguageText>Hello.</LanguageText>
</XMLData>
<XMLData>
  <LanguageCode>EN-US</LanguageCode>
  <LookupID>392</LookupID>
  <LanguageText>Select a different email address.</LanguageText>
</XMLData>
<XMLData>
  <LanguageCode>EN-US</LanguageCode>
  <LookupID>393</LookupID>
  <LanguageText>Check that you have installed the correct driver.</LanguageText>
</XMLData>
<XMLData>
  <LanguageCode>EN-US</LanguageCode>
  <LookupID>394</LookupID>
  <LanguageText>Click OK to continue.</LanguageText>
</XMLData>
<XMLData>
  <LanguageCode>FR-FR</LanguageCode>
  <LookupID>391</LookupID>
  <LanguageText>Bonjour</LanguageText>
</XMLData>
....

This is the code that writes the result of a SQL stored procedure query out to XML:

try {
    sqlConnection1.Open();
    ds.DataSetName = "XMLData";
    ds.Load(cmd.ExecuteReader(), LoadOption.OverwriteChanges, "XMLData");

    // Dispose the stream so the file is flushed and closed promptly.
    using (var myFileStream = new System.IO.FileStream(filename,
        System.IO.FileMode.Create)) {
        ds.WriteXml(myFileStream, XmlWriteMode.WriteSchema);
    }
}
catch (SqlException) {
    throw;  // rethrow without resetting the stack trace
}
finally {
    sqlConnection1.Close();
    sqlConnection1.Dispose();
    cmd.Dispose();
}

How can we make it create a node for each language?



2) Then how can we load just one node (e.g. FR-FR) into the Dictionary?

 
Michael B. Trausch

1) I want the XML file that .NET generates to have a separate node for
each language, so that we can selectively load only the text values for the
language that the user selects for the add-in. That will reduce memory use
and make loading quicker. Currently the code produces the XML in a flat
structure:

You could query the master database, iterate through the records, and use
the XML API to create your own custom XML-formatted output. If you're
dealing with any sizeable number of records, though, this is going to be
wildly inefficient.
How can we make it create a node for each language?

Loop through the languages and create the nodes and subtrees yourself,
essentially. Unless your database server supports some pretty special
and non-standard stuff, you're not going to be able to issue a query
that returns a hierarchical result.
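
For example, a rough LINQ to XML sketch along those lines could look like
this (the output element names "Languages", "Language", and "Entry" are
picked purely for illustration, not something the DataSet emits for you):

using System.Linq;
using System.Xml.Linq;

static class XmlRegrouper
{
    // Reads the flat <XMLData> records and writes one <Language> node
    // per LanguageCode, each containing its <Entry> elements.
    public static void RegroupByLanguage(string flatFile, string groupedFile)
    {
        XDocument flat = XDocument.Load(flatFile);

        var grouped = new XElement("Languages",
            from rec in flat.Descendants("XMLData")
            group rec by (string)rec.Element("LanguageCode") into g
            select new XElement("Language",
                new XAttribute("code", g.Key),
                from rec in g
                select new XElement("Entry",
                    new XAttribute("id", (int)rec.Element("LookupID")),
                    (string)rec.Element("LanguageText"))));

        new XDocument(grouped).Save(groupedFile);
    }
}
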
2) Then how can we load just one node (e.g. FR-FR) into the
Dictionary?

Again, I'd recommend using something like xBase (or modified xBase)
tables---then, you can have one "table" per language, and just load the
correct table at run-time. The more details I see, the more it seems
that a container like xBase makes sense for you. The only modification
you really would need to make is to use UTF-8 encoding when storing
strings in your xBase files. This won't be supported by any *real*
implementations of xBase since most of them predated Unicode, but that
won't really matter.

If you're using a good deal of data, you may want to look at using the
DBase ODBC driver that comes installed with Windows for accessing your
data, since then you get the benefit of using some limited SQL, too.
But I don't know how heavy the ODBC driver is. It's certainly going to
be better than the burden of XML, especially for somewhat larger data
sets which are compartmentalized like you're talking about here.
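
For what it's worth, going through that ODBC driver from C# would look
roughly like this; the driver name, connection string, and the "lookups"
table name (i.e. lookups.dbf in the Dbq folder) are from memory and purely
illustrative, so verify them before relying on this:

using System.Data.Odbc;

static class DbfOdbcSketch
{
    public static string GetText(string folder, string languageCode, int lookupId)
    {
        // "Dbq" points at the folder; each .dbf file in it acts as a table.
        string connStr = "Driver={Microsoft dBASE Driver (*.dbf)};Dbq=" + folder + ";";

        using (var conn = new OdbcConnection(connStr))
        using (var cmd = new OdbcCommand(
            "SELECT LookupText FROM lookups WHERE LanguageCode = ? AND LookupID = ?",
            conn))
        {
            // ODBC uses positional '?' parameters; the names are only labels.
            cmd.Parameters.AddWithValue("@p1", languageCode);
            cmd.Parameters.AddWithValue("@p2", lookupId);
            conn.Open();
            return (string)cmd.ExecuteScalar();
        }
    }
}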

--- Mike

P.S. I am assuming that your clients are unable to use an existing
database server that is already present in the environment, or that
this is somehow explicitly undesirable. It'd probably be best to just
use the master database to begin with, though, if that's available to
the client software in the configuration you're expecting to use.
 
Mark B

The other quick way, I guess, is simply to get the code to create a separate
XML file for each language...
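
Either way (one grouped file or one file per language), loading a single
language's entries into a Dictionary could look roughly like this, reusing
the illustrative <Language code="..."> / <Entry id="..."> layout from the
sketch earlier in the thread:

using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class LanguageLoader
{
    public static Dictionary<int, string> Load(string file, string languageCode)
    {
        XDocument doc = XDocument.Load(file);

        // Pick out only the <Language> node we care about, e.g. "FR-FR".
        XElement language = doc.Descendants("Language")
            .FirstOrDefault(l => (string)l.Attribute("code") == languageCode);

        var texts = new Dictionary<int, string>();
        if (language != null)
        {
            foreach (XElement entry in language.Elements("Entry"))
            {
                texts[(int)entry.Attribute("id")] = entry.Value;
            }
        }
        return texts;
    }
}

// Usage: var fr = LanguageLoader.Load(path, "FR-FR");  // fr[391] == "Bonjour"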
 
