Database normalization

Jesse Aviles

I recently bought the Access 2002 Developer's Handbook set and started
reading it. So far, so good (so very good). Although I didn't know anything
about the normalization rules, most of my tables comply with 3NF. Since I
didn't know anything about normalization when I started building the mdb
and still got it to almost 3NF, I believe normalization is a solid way to
prevent redundant data and business losses (time and money). However, I've
also heard that denormalizing a database is better for performance. How
much of that is true? Also, what are 4NF and 5NF? I haven't found a site
that discusses those last two forms, and I would like to know about them
(curiosity). Thanks.

--
Jesse Avilés
(e-mail address removed)
http://home.coqui.net/monk (up around Dec 1; Access in Spanish, among
other things)
Reply Only To The Newsgroup
 

Arvin Meyer

You can find a good explanation of all 5 Normalization rules + BCNF
(Boyce-Codd Normal Form) here:

http://www.cs.jcu.edu.au/Subjects/cp1500/1998/Lecture_Notes/normalisation/

My general rule of database design is to normalize everything first, then
break normalization when necessary. That's a bit more work, but reduces
mistakes and allows you to think about, and document, why you are breaking
the rules.

De-Normalization typically improves performance when there is static data
that requires intensive calculations. An OLAP cube slice is an example of
that. I can think of only 2 cases where performance was improved by
denormalizing in the past 6 years of my work, and in both of those cases I
elected to store the data in a summary table.

One of those tables was a temporary table where the data was archived every
month. As 486's were replaced with Pentiums, the 50 - 60 second processing
time was reduced to less than 5 seconds and we no longer stored it. The
other case reduces about 80 to 100K rows of data every day to a 2K row
summary table, which is then archived about every 18 months.
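A minimal sketch of what a roll-up like that second case might look like in
Access VBA (all table and field names here are invented for illustration;
the actual schema isn't shown in the thread):

' Hypothetical daily roll-up: collapse one day's detail rows into a
' denormalized summary table. Table and field names are illustrative.
Public Sub AppendDailySummary(ByVal dtmDay As Date)
    Dim db As DAO.Database
    Dim strSQL As String

    Set db = CurrentDb()
    strSQL = "INSERT INTO tblDailySummary (SummaryDate, ProductID, TotalQty, TotalAmount) " & _
             "SELECT DateValue(SaleDate), ProductID, Sum(Quantity), Sum(Quantity * UnitPrice) " & _
             "FROM tblSaleDetails " & _
             "WHERE DateValue(SaleDate) = #" & Format$(dtmDay, "mm\/dd\/yyyy") & "# " & _
             "GROUP BY DateValue(SaleDate), ProductID;"
    db.Execute strSQL, dbFailOnError
    Set db = Nothing
End Sub

The detail rows stay fully normalized; only the summary table carries the
denormalized, pre-aggregated figures.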

The first 3 NFs are important; the last 2 (or 3, if you consider BCNF to be
different from 3NF) are really more for special cases and/or mathematical
purists. Most of the time that I break the rules is after I have explained
to the client that the "right way" is the normalized way and they elect to
be wrong. I do it their way (usually documenting my displeasure), then come
back and do it again when it returns to bite them.
--
Arvin Meyer, MCP, MVP
Microsoft Access
Free Access downloads:
http://www.datastrat.com
http://www.mvps.org/access
 

Allen Browne

There are some cases where denormalization is desirable for a particular
reason, e.g. precision or performance. You can break any rule provided you
know what you are getting yourself in for, e.g. the maintenance issues
associated with guaranteeing that the denormalized data is correct.

Here are a couple of examples where denormalization is commonly used. The
classic example is where you want to match people by an algorithm such as
Soundex, so you store (and index) the Soundex value of their Surname, even
though it is directly derived from their name. This is denormalization for
performance reasons.
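For anyone who hasn't met Soundex: Access/Jet has no built-in Soundex
function, so the usual approach is a small VBA function whose result is
stored in an indexed text field next to the Surname. A simplified sketch of
the classic algorithm (the function and field names are illustrative, and
the H/W refinement is omitted):

' Simplified Soundex: keep the first letter, then append the codes of
' the remaining letters, skipping vowels and consecutive duplicates.
Public Function SimpleSoundex(ByVal strName As String) As String
    Dim strResult As String
    Dim strDigit As String
    Dim strPrev As String
    Dim i As Integer

    strName = UCase$(Trim$(strName))
    If Len(strName) = 0 Then Exit Function

    strResult = Left$(strName, 1)
    strPrev = SoundexDigit(strResult)

    For i = 2 To Len(strName)
        strDigit = SoundexDigit(Mid$(strName, i, 1))
        If Len(strDigit) > 0 And strDigit <> strPrev Then
            strResult = strResult & strDigit
            If Len(strResult) = 4 Then Exit For
        End If
        strPrev = strDigit
    Next i

    ' Pad with zeros to the standard four characters, e.g. "R163".
    SimpleSoundex = Left$(strResult & "000", 4)
End Function

Private Function SoundexDigit(ByVal strChar As String) As String
    Select Case strChar
        Case "B", "F", "P", "V": SoundexDigit = "1"
        Case "C", "G", "J", "K", "Q", "S", "X", "Z": SoundexDigit = "2"
        Case "D", "T": SoundexDigit = "3"
        Case "L": SoundexDigit = "4"
        Case "M", "N": SoundexDigit = "5"
        Case "R": SoundexDigit = "6"
        Case Else: SoundexDigit = ""    ' vowels, H, W, Y, punctuation
    End Select
End Function

A search then becomes an indexed equality test against the stored
SurnameSoundex field instead of a recalculation over every row.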

Another example is where an InvoiceDetail table stores the amount of tax
on each row (so it can cope with some rows that are taxable, and others that
are tax exempt). Strictly, it should be the tax rate that is stored on each
row, since the amount of tax is dependent on the sale amount for the row. To
prevent any rounding errors, the fully normalized approach that copes with
varying tax rates per row (as some countries have) would have to be
something like this:
Round(CCur(Nz(Quantity * UnitPrice * TaxRate, 0)), 2)
There was a time when that kind of calculation on every row would cause
unacceptable performance (e.g. for calculating the quarterly summary
figures), so we used the partially denormalized approach of storing the
Quantity, UnitPrice, and TaxAmount instead of Quantity, UnitPrice and
TaxRate.
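To make the trade-off concrete, here is a sketch (table and field names are
invented for the example) of a quarterly summary under that partially
denormalized design: the stored TaxAmount is simply summed, instead of
re-running the rounded tax expression above on every row:

' Hypothetical quarterly summary over the partially denormalized
' InvoiceDetail table: TaxAmount is stored per row, so the report
' simply sums it instead of recalculating tax on every line.
Public Sub PrintQuarterlyTax()
    Dim rs As DAO.Recordset
    Dim strSQL As String

    strSQL = "SELECT Year(i.InvoiceDate) AS Yr, DatePart('q', i.InvoiceDate) AS Qtr, " & _
             "Sum(d.Quantity * d.UnitPrice) AS NetSales, Sum(d.TaxAmount) AS TaxCollected " & _
             "FROM Invoice AS i INNER JOIN InvoiceDetail AS d ON i.InvoiceID = d.InvoiceID " & _
             "GROUP BY Year(i.InvoiceDate), DatePart('q', i.InvoiceDate);"

    Set rs = CurrentDb.OpenRecordset(strSQL, dbOpenSnapshot)
    Do Until rs.EOF
        Debug.Print rs!Yr, rs!Qtr, Format$(rs!NetSales, "Currency"), Format$(rs!TaxCollected, "Currency")
        rs.MoveNext
    Loop
    rs.Close
    Set rs = Nothing
End Sub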

As processors become more powerful, the number of cases where
denormalization is justified shrinks. In the vast majority of cases where
you see denormalized data, it is just plain wrong.

When you do denormalize, you take on the responsibility of ensuring the data
is always right. It is all too easy to develop a system that gradually
becomes more and more wrong as time goes on. However, if you must
denormalize, it may be possible to build in checks so that the system
becomes self-healing. Earlier in the year I was asked to develop a
centralized stock control system for 5 mail order companies that shared
stock and were on the same network. A couple of them had 500,000 or more
clients, and their orders. It was not practical to calculate the quantity of
stock from all companies in real time based on the on-going sales, so we
developed a system where the workstations pushed their sales up to the
stock-control centre in real time, and the centre would pull through
the full summary from all the workstations of each company on a weekly
basis. In this way, the centralized system had real-time data stored locally
(denormalized), but it was self-correcting on a weekly basis. Then the
physical count of stock in the periodic stocktake would again correct the
true value, so the system was constantly becoming more "right" instead of
gradually becoming more "wrong" over time.
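A sketch of what that weekly self-correction step might look like (all
object names are invented; the real system isn't shown here): recompute the
totals into a staging table, then overwrite the denormalized central
quantities so any drift accumulated during the week is wiped out.

' Hypothetical weekly reconciliation for the "self-healing" stock cache.
Public Sub ReconcileStock()
    Dim db As DAO.Database
    Set db = CurrentDb()

    ' 1. Rebuild the staging table from the full sales detail pulled
    '    from each company's workstations.
    db.Execute "DELETE * FROM tblStockStaging;", dbFailOnError
    db.Execute _
        "INSERT INTO tblStockStaging (ProductID, QtyOnHand) " & _
        "SELECT ProductID, Sum(QtyReceived - QtySold) " & _
        "FROM tblStockMovements GROUP BY ProductID;", dbFailOnError

    ' 2. Overwrite the real-time (denormalized) figures so any drift
    '    accumulated during the week is wiped out.
    db.Execute _
        "UPDATE tblStock AS s INNER JOIN tblStockStaging AS t " & _
        "ON s.ProductID = t.ProductID " & _
        "SET s.QtyOnHand = t.QtyOnHand;", dbFailOnError

    Set db = Nothing
End Sub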
 
