Table design for simple time management

J

JMF

Here is an absolutely minimal spec for my little time management system:

1) A set of projects. Each project has a set of tasks.
2) Every day, I make a number of entries: "I worked x hours today on task y
of project z. I worked n hours on task q of project r." etc. whereby there
could be any number of entries on a single day, depending on how long you
worked on how many different tasks.

That's it. Simple as that.

Since I'm pretty new to Access, I want to be sure I get the table design
correctly. This is what I currently have:

Table "Projects." Just one field, projects, which is the primary key.

Project1
Project2
etc.

Table "Tasks." TaskId, Project, Task. TaskId is the primary key, Project is
linked to the foreign key in the projects table.

1 Project1 Task1
2 Project1 Task2
3 Project2 Task1
etc.

Table "Hours Worked." Date, Hours, TaskId, whereby TaskId is linked to the
TaskId in the Tasks table.

5 Feb 2006, 3 hours, TaskId = 1
5 Feb 2006, 2 hours, TaskId = 3
etc.

How does that seem as a minimal but adequate design? Here are a couple of
questions:

- It seems to me that in the Tasks table, you could even do away with the
TaskId field and directly use the combination of Project and Task as the
primary key, since the combination has to be unique. But then I don't
understand how that would fit in to a relationship to the Hours Worked
table. Is it possible or desirable to eliminate the TaskId primary key? Or
is it the best way to relate those two tables?

- Currently my Hours Worked table has no primary key, which makes me
nervous. On the other hand, I can't figure out what it would be good for if
I had it.

Any table design experts interested in commenting?

Thanks!

John
 
T

Tom Ellison

Dear John:

It is possible to eliminate all the surrogate keys columns, and build
relationships on the natural keys. I actually advocate this (and I have
gotten flack for it, as well) but I stick to it.

In your table structure, then there would be a unique natural key for
Projects. I'll call that ProjectName. That would be a unique name for each
project.

To advance my argument, there should be a unique ProjectName whether you use
an autonumber ID column in the table or not. On a form, you will
undoubtedly have a combo box in which to choose the correct project for some
entry. This will list all the projects. If two projects ever have the
exact same name, that name would appear twice. How would the user know how
to choose the exact one? Without displaying the ID values, and insisting
the user know which is which, you could not. Uniqueness of the natural key
is not really optional.

The Tasks of each project should be named uniquely within the context of
that project. There's no reason you couldn't have two tasks with the same
name as long as they are for different projects. So, the Primary Key for
the Task table is a two column natural key of ProjectName/TaskName. The
rest of the discussion is the same as the previous paragraph.

The HoursWorked table then has a 3 part composite key:
ProjectName/TaskName/WorkName. You probably wouldn't call this WorkName,
however. What you would use depends on how you want to identify this. It
might be a date/time indicating when you started working on that project. I
might call that WorkTime. It is natural that the composite key would be
unique. Indeed, WorkTime would be unique for the whole table.
Alternatively, you could just number all the instances of work for each
Project/Task. Then the key would be Project/Task/WorkNumber (or
WorkInstance).

For small to medium sized projects, this has some advantages.

Say you wanted to report all the work done in a week. You'd need to show
ProjectName, TaskName, and details of the work (start, end, notes, etc)
With autonumber IDs, you'd have to JOIN the query to the Task table and the
Project table just to get the TaskName and ProjectName. If you store those
natural keys in the HoursWorked table, there's no lookup needed. There's
less coding and better performance.

If you do use the autonumber identity keys in your tables and relationships,
you should also make unique indexes of all the natural keys, so they cannot
be duplicated (as already discussed). That means that, every time a row is
added, there are two indexes being appended, not just one. Again, this is
slightly slower.

There are some performance advantages to using the surrogate identity keys
as well. For small and moderate sized databases, these are small. With the
tools that come with Access, there is a 2 GB limit. This is well within
what I mean by moderate sized databases.

There is a debate that has, at times, raged over this. My approach is
empirical. I have tested the difference in performance between the two
approaches, again with respect to small and moderate sized databases. In
this case, there's no real performance issue worth talking about.

There is one very important issue to consider. That issue is based on the
stability of the natural key. If the natural keys ever need to change (you
mis-spell the name of a project and really need to fix it) then the change
in the Project table must "cascade" through the relationships into the Task
table and the HoursWorked, changing all the related table values for that
Project. Performance is not a big issue here either. For a moderate sized
database as I contemplate it, this may be a hundred updates or a bit more in
the extreme case. The cascade may take, oh, 3 seconds, maybe upward of 10.
It will occur maybe once a month. Not a very large proportion of the time
involved. If you contemplate having to correct such a thing every minute or
so, or even every hour, maybe this isn't so good. But I expect it is rare.

Now, setting up the cascade updates is not difficult. In very, very complex
table designs it falls apart. This occurs with both Access and SQL Server
(don't know about the most recent verison of SQL Server, but I've talked to
one of MS program managers about this at length one evening in Chicago). It
could be fixed. But you are unlikely to encounter this unless you are very
advanced in your table design requirements.

Maybe that's enough for now. Another book I've written. Well, I hope it
has been relevant, at least. Or did I waste all this typing?

Tom Ellison
 
J

JMF

Tom,

This is great! It's exactly the issues I was wondering about.
It is possible to eliminate all the surrogate keys columns, and build
relationships on the natural keys. I actually advocate this (and I have
gotten flack for it, as well) but I stick to it.

This is what I was thinking, but was wondering whether it was possible or
desirable. Nice to know you think it's desirable.
In your table structure, then there would be a unique natural key for
Projects. I'll call that ProjectName. That would be a unique name for
each project.
Right.

To advance my argument, there should be a unique ProjectName whether you
use an autonumber ID column in the table or not. On a form, you will
undoubtedly have a combo box in which to choose the correct project for
some entry. This will list all the projects. If two projects ever have
the exact same name, that name would appear twice. How would the user
know how to choose the exact one? Without displaying the ID values, and
insisting the user know which is which, you could not. Uniqueness of the
natural key is not really optional.

Agreed, the project names would have to be unique (and of course will be).
The Tasks of each project should be named uniquely within the context of
that project. There's no reason you couldn't have two tasks with the same
name as long as they are for different projects. So, the Primary Key for
the Task table is a two column natural key of ProjectName/TaskName. The
rest of the discussion is the same as the previous paragraph.

That's what I was thinking: ProjectName and NaskName are the natural key.
TaskNames will certainly be duplicated in different contexts (e.g.
"Management" task for several different projects). But always unique for a
project.
The HoursWorked table then has a 3 part composite key:
ProjectName/TaskName/WorkName. You probably wouldn't call this WorkName,
however. What you would use depends on how you want to identify this. It
might be a date/time indicating when you started working on that project.
I might call that WorkTime. It is natural that the composite key would be
unique. Indeed, WorkTime would be unique for the whole table.
Alternatively, you could just number all the instances of work for each
Project/Task. Then the key would be Project/Task/WorkNumber (or
WorkInstance).

That was the hardest part for me: finding a useful key for the HoursWorked
table. I'm still not 100% clear on what you wrote above, but I'm getting the
general idea.
For small to medium sized projects, this has some advantages.

Say you wanted to report all the work done in a week. You'd need to show
ProjectName, TaskName, and details of the work (start, end, notes, etc)
With autonumber IDs, you'd have to JOIN the query to the Task table and
the Project table just to get the TaskName and ProjectName. If you store
those natural keys in the HoursWorked table, there's no lookup needed.
There's less coding and better performance.

Yes, it seemed to me that by having the natural keys in there you could
"naturally" avoid joins. Not that joins are horrible things, but why not
avoid them when possible?
If you do use the autonumber identity keys in your tables and
relationships, you should also make unique indexes of all the natural
keys, so they cannot be duplicated (as already discussed). That means
that, every time a row is added, there are two indexes being appended, not
just one. Again, this is slightly slower.

There are some performance advantages to using the surrogate identity keys
as well. For small and moderate sized databases, these are small. With
the tools that come with Access, there is a 2 GB limit. This is well
within what I mean by moderate sized databases.

Mine will certainly be moderate sized.
There is a debate that has, at times, raged over this. My approach is
empirical. I have tested the difference in performance between the two
approaches, again with respect to small and moderate sized databases. In
this case, there's no real performance issue worth talking about.

I can see that, for small databases like mine.
There is one very important issue to consider. That issue is based on the
stability of the natural key. If the natural keys ever need to change
(you mis-spell the name of a project and really need to fix it) then the
change in the Project table must "cascade" through the relationships into
the Task table and the HoursWorked, changing all the related table values
for that Project. Performance is not a big issue here either. For a
moderate sized database as I contemplate it, this may be a hundred updates
or a bit more in the extreme case. The cascade may take, oh, 3 seconds,
maybe upward of 10. It will occur maybe once a month. Not a very large
proportion of the time involved. If you contemplate having to correct
such a thing every minute or so, or even every hour, maybe this isn't so
good. But I expect it is rare.

Understood. In fact my keys will be very stable, but I can understand from
what you write that it would be indeed a consideration if that were not the
case.
Now, setting up the cascade updates is not difficult. In very, very
complex table designs it falls apart. This occurs with both Access and
SQL Server (don't know about the most recent verison of SQL Server, but
I've talked to one of MS program managers about this at length one evening
in Chicago). It could be fixed. But you are unlikely to encounter this
unless you are very advanced in your table design requirements.

No advanced requirements yet!
Maybe that's enough for now. Another book I've written. Well, I hope it
has been relevant, at least. Or did I waste all this typing?

Thanks a million, Tom, that was just the kind of help I was looking for.
I'll save and study this.

John
 
T

Tom Ellison

Dear John:

Well, you seem to have absorbed a great deal. Likely your thinking was
already quite advanced.

Now, the thing about JOINs. This does nothing to "avoid joins". It just
changes them to be based on the natural keys. Being compound, the JOINs are
therefore more complex. You'll be JOINing on a longer list of columns than
you would need if you had a thorough set of identity surrogate keys. With
surrogate ID keys, you can build a ladder, with one identity forming the
JOIN downward, and another single identity key column added at each rung to
be used from tables above. Not everyone does this, but it certainly can
work that way. So then, the JOINs do not compound.

The result is a multiplier effect, with longer and longer composite natural
keys when there is a hierarchy of tables. I've had this extend to 6 levels
and more! Now, remembering exactly what that composite key should be for a
table would seem to be daunting. Somehow, for me, it has never been so.
Remembering just what the composite key is, even 6 levels in, has been
(again, for me) a perfectly natrual expression of exactly what I am doing.

The total length of such a composite key can approach 200 characters in
length (they're typically text, with a few date/time values thrown in toward
the twigs of the tree). So, many ask, how could it me that a 200 byte key
length doesn't result in a 50:1 performance disadvantage when compared with
an 4 byte autonumber?

The key is that the performance is based on the number of b-tree levels in
the index, which is to say, the number of hard drive accesses that must be
made to retrieve the "leaf" node, which is the row of data.

The math is not too bad to explain this. Given a 4K node size, there would
be just fewer than 20 entries per node for a 200 byte key (there's some
overhead, not just 4096/200). For the 4 byte long integer of the surrogate
autonumber ID, the index would allow perhaps 400 entries per node (the
overhead here is the same in absolute size, so it is larger in proportion,
and I've use 6 bytes per entry for my calculations). So the number of
levels of index, being exponential, would be:

Levels 200 4
1 20 400
2 400 160000
3 8000 64000000
4 160000
5 6400000

I don't give the figures for tables above the moderate size, as this is
beyond the scope of what we do in Access, and I have not peformed the
associated experiments for them. They are not my point at all.

The number of hard drive accesses to work through the index would typically
be one less, as the "root" node of the index will almost alwasy be cached,
but then you must still read the data row from the disk, so the access time
is pretty much according the what is shown. So, for a table with a million
rows, there is a 5:3 ratio of performance advantage to the shorter,
surrogate identity key. When you factor in all the times you may have to
perform this lookup just to find the natural key values (as I suggested
before) which are already stored in the dependent table if you use the
natural key, so that you don't even need to perform any "lookup" for it,
this advantage can disappear. That's why I say that, for small to moderate
sized tables, the performance issue pretty much disappears. I've done the
math, and I've also confirmed that it's pretty close experimentally. It's
not the 50:1 performance advantage that might seem to be the case when
viewed at the surface.

If you want to know what I'm talking about, you can study the structure of a
b-tree index. I've actually designed a database engine that constructs and
uses one. That gives some insight, for sure. Alternatively, you can test
it empirically. In the test of just executing a join on a 20 byte natural
key compared with a 4 byte surrogate ID, there's often no difference. At
200 bytes, there is a noticable difference, but less than 2:1 for tables
I've seen.

There is a moderate issue of disk storage space, as the composite key could
double the storage for a table with a 6 part composite key. This is still
not a very big issue for the size of database I propose. I really haven't
done the math for huge databases. They're outside the scope of this
discussion.

Actually, there's more I could say, a book perhaps, but I'm saving that for
my retirement days! But you now have half the first chapter in your hands.

I have also found there to be some communities of database professionals
where this debate also occurs, but in which the majority is on the side I
have proposed to you (and toward which you were already leaning I expect).
I can give evidence that, in the community where we are posting this, we're
a minority group. But the main thing is, and my experience and advice to
you, this won't trip you up. It definitely works either way. One thing I
try to advance is that the equipment is important. I prefer to have a bit
more memory in the computers, to cache a larger proportion of the indexes.
I don't know how much this affects a Jet based system, and the memory there
would need to be on every system. With MSDE, only the server would benefit
from this, so it is a very modest expense, and gives a good return. For a 1
gigabyte database, have at least a gigabtye of memory on the server. After
a few minutes of operation, nearly everything is cached, and the thing is a
rocket anyway. At processor speed, given enough cache, any differences
disappear.

Enjoy!

Tom Ellison
 
J

JMF

---- Original Message -----
From: "Tom Ellison" <[email protected]>
Newsgroups: microsoft.public.access.tablesdbdesign
Sent: Monday, February 06, 2006 1:51 AM
Subject: Re: Table design for simple time management

Dear John:

Well, you seem to have absorbed a great deal. Likely your thinking was
already quite advanced.

Now, the thing about JOINs. This does nothing to "avoid joins". It just
changes them to be based on the natural keys. Being compound, the JOINs
are therefore more complex. You'll be JOINing on a longer list of columns
than you would need if you had a thorough set of identity surrogate keys.
With surrogate ID keys, you can build a ladder, with one identity forming
the JOIN downward, and another single identity key column added at each
rung to be used from tables above. Not everyone does this, but it
certainly can work that way. So then, the JOINs do not compound.

Ah, I see. I had not understood that at all, although I should have realized
it by thinking a bit.
The result is a multiplier effect, with longer and longer composite
natural keys when there is a hierarchy of tables. I've had this extend to
6 levels and more! Now, remembering exactly what that composite key
should be for a table would seem to be daunting. Somehow, for me, it has
never been so. Remembering just what the composite key is, even 6 levels
in, has been (again, for me) a perfectly natrual expression of exactly
what I am doing.

The total length of such a composite key can approach 200 characters in
length (they're typically text, with a few date/time values thrown in
toward the twigs of the tree). So, many ask, how could it me that a 200
byte key length doesn't result in a 50:1 performance disadvantage when
compared with an 4 byte autonumber?

The key is that the performance is based on the number of b-tree levels in
the index, which is to say, the number of hard drive accesses that must be
made to retrieve the "leaf" node, which is the row of data.

The math is not too bad to explain this. Given a 4K node size, there
would be just fewer than 20 entries per node for a 200 byte key (there's
some overhead, not just 4096/200). For the 4 byte long integer of the
surrogate autonumber ID, the index would allow perhaps 400 entries per
node (the overhead here is the same in absolute size, so it is larger in
proportion, and I've use 6 bytes per entry for my calculations). So the
number of levels of index, being exponential, would be:

Levels 200 4
1 20 400
2 400 160000
3 8000 64000000
4 160000
5 6400000

I don't give the figures for tables above the moderate size, as this is
beyond the scope of what we do in Access, and I have not peformed the
associated experiments for them. They are not my point at all.

The number of hard drive accesses to work through the index would
typically be one less, as the "root" node of the index will almost alwasy
be cached, but then you must still read the data row from the disk, so the
access time is pretty much according the what is shown. So, for a table
with a million rows, there is a 5:3 ratio of performance advantage to the
shorter, surrogate identity key. When you factor in all the times you may
have to perform this lookup just to find the natural key values (as I
suggested before) which are already stored in the dependent table if you
use the natural key, so that you don't even need to perform any "lookup"
for it, this advantage can disappear. That's why I say that, for small to
moderate sized tables, the performance issue pretty much disappears. I've
done the math, and I've also confirmed that it's pretty close
experimentally. It's not the 50:1 performance advantage that might seem
to be the case when viewed at the surface.

If you want to know what I'm talking about, you can study the structure of
a b-tree index. I've actually designed a database engine that constructs
and uses one. That gives some insight, for sure. Alternatively, you can
test it empirically. In the test of just executing a join on a 20 byte
natural key compared with a 4 byte surrogate ID, there's often no
difference. At 200 bytes, there is a noticable difference, but less than
2:1 for tables I've seen.

Once, long ago, I did briefly study b-trees at least long enough to come to
the basic understanding that the key idea is the number of disk accesses
rather than other things that might seem to be determinant. So I can
basically follow the message you're elaborating here. And as an aside: in my
youth I briefly worked at the Technical University of Munich where "Mr. B"
himself (Rudolph Bayer) was actually my boss for a short while -- although I
never actually had anything to do with him and certainly didn't work in that
area at all.
There is a moderate issue of disk storage space, as the composite key
could double the storage for a table with a 6 part composite key. This is
still not a very big issue for the size of database I propose. I really
haven't done the math for huge databases. They're outside the scope of
this discussion.

Actually, there's more I could say, a book perhaps, but I'm saving that
for my retirement days! But you now have half the first chapter in your
hands.

I'll be the first in line for that book!
I have also found there to be some communities of database professionals
where this debate also occurs, but in which the majority is on the side I
have proposed to you (and toward which you were already leaning I expect).
I can give evidence that, in the community where we are posting this,
we're a minority group.

How strange. What exactly is it that others object to? They feel that
composite natural keys are too "heavy"?
But the main thing is, and my experience and advice to you, this won't
trip you up. It definitely works either way. One thing I try to advance
is that the equipment is important. I prefer to have a bit more memory in
the computers, to cache a larger proportion of the indexes. I don't know
how much this affects a Jet based system, and the memory there would need
to be on every system. With MSDE, only the server would benefit from
this, so it is a very modest expense, and gives a good return. For a 1
gigabyte database, have at least a gigabtye of memory on the server.
After a few minutes of operation, nearly everything is cached, and the
thing is a rocket anyway. At processor speed, given enough cache, any
differences disappear.

Certainly that won't happen quickly in my case. For now they'll be small
databases and so I'm more interested in straightforward, elegant designs
rather than having to worry about performance issues (yet).

Thanks again!

John
 
T

Tom Ellison

Dear John:

It has always been my understanding that the B in B-tree means "balanced"
because the path from the root to any leaf of the tree is identical. I had
not known of there being a Mr. B behind it. That would be news to me.

Tom Ellison
 
T

Tom Ellison

I mean identical in length!

Tom Ellison


Tom Ellison said:
Dear John:

It has always been my understanding that the B in B-tree means "balanced"
because the path from the root to any leaf of the tree is identical. I
had not known of there being a Mr. B behind it. That would be news to me.

Tom Ellison
 
J

JMF

Tom,

I just checked, just for fun, and at least as far as the Wikipedia is
concerned, it's shrouded in mystery:

"The B-tree's creator, Rudolf Bayer, has not explained what the B stands
for. The most common belief is that B stands for balanced, as all the leaf
nodes are at the same level in the tree. B may also stand for Bayer, or for
Boeing, because he was working for Boeing Scientific Research Labs."

John
 
T

Tom Ellison

Dear John:

You are a font of historical knowledge. Because of all this confusion, I'll
add another suggestion for the origin of the "B" - it stands for Babel.

Tom Ellison
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top