Strategy for combining lists

B

BruceM

In a training records database, each training session has one or more
instructors. Most of the training sessions are one-time events, perhaps to
learn about processing a new type of product or something like that In
other words, it's not a situation where there is a list of courses.
Sometimes the training is conducted by other employees such as supervisors
(who also attend training sessions). Sometimes the training is conducted by
people from outside the company. My approach in an early version of the
database was to just store the name. There is a combo box to select an
employee's name; if the instructor is not an employee, the combo box Limit
to List property is set to no, and the user can just type a name into the
box. I don't really object to doing it that way, although I realize it
violates some normalization principles. However, I'm starting to wonder
about that approach.
It seems to me that there will need to be a separate instructor table, maybe
with just InstructorID, FirstName, LastName, and Company. The employee
table also has FirstName and LastName fields. EmployeeID is a four-digit
number, so I could start the Instructor table with a number such as 1000000
and use DMax to increment by 1. Then I could combine the two tables into a
query that would be the row source for the Instructor combo box on the
Training Session form. By selecting a name I would store the ID number.
The thing I'm not quite sorting out in my head here is that the number will
relate to one of two tables (Employee and Instructor), so how would I go
about handling that relationship? Also, each instructor could be associated
with many training sessions, and each training session could have several
instructors, so it looks like there is a many-to-many there.
Somehow this is all starting to seem more complicated than is necessary. Am
I missing a more direct route?
 
T

tina

have you considered having a tblPeople? each person can be identified as
Employee, Instructor, or Both. don't use EmployeeID as the primary key,
because outside instructors won't have one. instead, use an Autonumber as
the primary key, and put the EmployeeID in its' own field if you need to
store it, allowing the field to be Null so you don't need to "make up" IDs
for outside instructors.

where you need to list employees (who attended a session, perhaps?), base a
combo box control on tblPeople, and restrict the RowSource to records where
the person is Employee, or Both. similarly, base an instructor combo box
control on tblPeople, restricting records to Instructor, or Both.

hth
 
J

Jamie Collins

In a training records database, each training session has one or more
instructors. Most of the training sessions are one-time events, perhaps to
learn about processing a new type of product or something like that In
other words, it's not a situation where there is a list of courses.
Sometimes the training is conducted by other employees such as supervisors
(who also attend training sessions). Sometimes the training is conducted by
people from outside the company. My approach in an early version of the
database was to just store the name. There is a combo box to select an
employee's name; if the instructor is not an employee, the combo box Limit
to List property is set to no, and the user can just type a name into the
box. I don't really object to doing it that way, although I realize it
violates some normalization principles.

Which normalization 'principles' were you concerned about? A person's
name is single a atomic fact which in a DBMS is usually split into
elements (e.g. first_name, middle_names, last_name) for practical
purposes, therefore normalization considerations are moot.

A person's name is a reasonable natural key in some circumstances e.g.
unique for all practical purposes in a company with few employees.
It seems to me that there will need to be a separate instructor table, maybe
with just InstructorID, FirstName, LastName, and Company.

Is InstructorID a relational key? Perhaps its trusted source is a
'closed-shop' association (i.e. compulsory membership) of instructors
for your region who ensure that the same instructor cannot be assigned
two different InstructorIDs? But if InstructorID is what I think it is
then this may violate some relational principles. Be wary of trading a
weak natural key for no key at all.
The employee
table also has FirstName and LastName fields. EmployeeID is a four-digit
number, so I could start the Instructor table with a number such as 1000000
and use DMax to increment by 1.

All you'd be doing is assigning a sequence number to non-employees;
that's not the same thing as uniquely identifying instructors. Would
you issue instructors with this sequence number to and ask them to
quote it in all related correspondence etc? A key must exist in the
reality being modelled therefore if you've invented it then you have
to expose it.

I suspect the natural key for external trainers in current use in your
organization is company + trainer name and is unique for all practical
purposes. Are you in a position to impose a new identifier on your
employees and their clients? It's great to be able to modify the
reality to fit your model but most of the time this isn't an option.
If you are in this fortunate position, do some research first because
an incrementing integer does not always make the best key (hint: think
of adding a check digit to reduce keying errors).

[Reminds me, about seven years I worked for a company where *every*
time I was asked to quote my employee number there was general
disbelieve that I really could be employee 2000 in the year 2000. I
wager you can deduce why I don't need to ask fellow current employees
their employee IDs when I tell mine is 'jamiec'.]
Then I could combine the two tables into a
query that would be the row source for the Instructor combo box on the
Training Session form. By selecting a name I would store the ID number.
The thing I'm not quite sorting out in my head here is that the number will
relate to one of two tables (Employee and Instructor), so how would I go
about handling that relationship? Also, each instructor could be associated
with many training sessions, and each training session could have several
instructors, so it looks like there is a many-to-many there.
Somehow this is all starting to seem more complicated than is necessary. Am
I missing a more direct route?

I think I'd keep things separate as far as possible i.e. internal
employee led training as a activity distinct from that of external
instructor led training and use UNION ALL when the results need to be
combined.

It seems to me that 'employees' and 'external instructors' are not of
the same 'domain' (modelling a superclass of 'people' might be a bit
too generic e.g. privacy and trust issues associated with reliable
identifiers) and it sounds like you are having problems combining them
anyhow so why force it? Issues of keys aside, what practical problem
are you trying to address?

Jamie.

--
 
B

BruceM

Thanks for giving some thought to this, and for taking the time to respond.
Yes, I knew all along that having a single table for everybody would be one
way. When I first put together the database I made a decision to use
EmployeeID as the PK. In retrospect I wish I had not, but now there are
about a thousand training session records and about 5000 individual attendee
records (one session may have many attendees, and one person may attend many
sessions, so the attendance information is stored in a junction table
between the Employee table and the Training Session table). If the need
arises I will switch to an autonumber PK, and then update the records. I
guess it wouldn't be that big a deal, but I would rather not undertake the
project.
Some outside instructors conduct several training sessions, but most conduct
just one. Some employees are designated as Instructors by means of a
Yes/No field in the employee table; theirs are the only names that appear in
the drop-down list for Instructors. It occurs to me that the outside person
who conducted one training session will continue to appear on the list
unless they are specifically designated as Inactive (another Yes/No field).
I could probably come up with criteria to avoid having the list be cluttered
with non-employee names that will probably not be used again, but it may
start to get more complex than is justified.
Another consideration is the users. As things stand, if the instructor is
an employee the name is selected from a list. If the instructor is from
outside the name is typed in. From the user's point of view that is about
as simple as it can get. The system I am contemplating would involve
opening an input box or a pop-up form to enter the new name, then having the
user select the name from the combo box (although I could probably come up
with a way of adding the new record to the combo box automatically upon
closing the form or input box). The potential complication is that users
will need to learn (and remember) what to do when the name does not appear
in the list.
I'm sort of thinking out loud here. In any case, you seem to be saying you
do not see a ready way of using one of two tables as I described in my OP.
 
B

BruceM

Responses inline.

Jamie Collins said:
Which normalization 'principles' were you concerned about? A person's
name is single a atomic fact which in a DBMS is usually split into
elements (e.g. first_name, middle_names, last_name) for practical
purposes, therefore normalization considerations are moot.

The normalization principle is storing the name redundantly. Joe Jones
appears in the Employee table (stored as LastName and FirstName). Then Joe
Jones is stored in the Instructor field as the full name each time he
conducts a training session. In truth, that doesn't bother me very much,
but I wonder if there is a better way.
A person's name is a reasonable natural key in some circumstances e.g.
unique for all practical purposes in a company with few employees.

EmployeeID is used in other company records, including payroll. For now the
databases are separate, and will likely remain so for some time to come, but
there may well be a time when the data need to be combined. I think it is a
reasonable single-field key, expecially considering that it is being used as
the key elsewhere in the company. I see that if I were storing three fields
(Last, First, Middle) as the FK in a table related to the Employee table
then there is no question of reduncancy, since I would be storing the full
name in every record related to the employee table. As I expect you know
from our previous discussions, I am inclined to use surrogate keys unless a
simple natural key presents itself. I have read the arguments on both
sides, and have concluded that I am on solid ground with a surrogate key. I
will continue to use one-field keys where I can.
Is InstructorID a relational key? Perhaps its trusted source is a
'closed-shop' association (i.e. compulsory membership) of instructors
for your region who ensure that the same instructor cannot be assigned
two different InstructorIDs? But if InstructorID is what I think it is
then this may violate some relational principles. Be wary of trading a
weak natural key for no key at all.

InstructorID would be an arbitrary number, incremented by 1 starting from
1000000, as I see it. It would not be part of correspondence, or in any
other way used in the real world. When somebody comes in to to HazMat or
First Aid training, we contact a qualified company, and they send an
instructor. The instructor is identified by name.
All you'd be doing is assigning a sequence number to non-employees;
that's not the same thing as uniquely identifying instructors. Would
you issue instructors with this sequence number to and ask them to
quote it in all related correspondence etc? A key must exist in the
reality being modelled therefore if you've invented it then you have
to expose it.

I don't see that there is a reason why the key "must" exist in the reality
being modelled, or that I have to expose the key. See above.
I suspect the natural key for external trainers in current use in your
organization is company + trainer name and is unique for all practical
purposes. Are you in a position to impose a new identifier on your
employees and their clients? It's great to be able to modify the
reality to fit your model but most of the time this isn't an option.
If you are in this fortunate position, do some research first because
an incrementing integer does not always make the best key (hint: think
of adding a check digit to reduce keying errors).

This has nothing to do with imposing a new identifier on anybody. The
identifier is behind the scenes. I am not at this time going to convert to
using natural keys at any cost. I will index as needed, and will not worry
about physical position on the disk or clustered indexes. The surrogate key
does what I need.
I'm not sure what you mean by the incremented number representing a
potential problem. It is assigned automatically. In a multi-user
environment I would simply assign it during the form's Before Update event
(with error handling in the very unlikely case that a problem arises from
two users creating a record at the same moment).
[Reminds me, about seven years I worked for a company where *every*
time I was asked to quote my employee number there was general
disbelieve that I really could be employee 2000 in the year 2000. I
wager you can deduce why I don't need to ask fellow current employees
their employee IDs when I tell mine is 'jamiec'.]
Then I could combine the two tables into a
query that would be the row source for the Instructor combo box on the
Training Session form. By selecting a name I would store the ID number.
The thing I'm not quite sorting out in my head here is that the number
will
relate to one of two tables (Employee and Instructor), so how would I go
about handling that relationship? Also, each instructor could be
associated
with many training sessions, and each training session could have several
instructors, so it looks like there is a many-to-many there.
Somehow this is all starting to seem more complicated than is necessary.
Am
I missing a more direct route?

I think I'd keep things separate as far as possible i.e. internal
employee led training as a activity distinct from that of external
instructor led training and use UNION ALL when the results need to be
combined.

It seems to me that 'employees' and 'external instructors' are not of
the same 'domain' (modelling a superclass of 'people' might be a bit
too generic e.g. privacy and trust issues associated with reliable
identifiers) and it sounds like you are having problems combining them
anyhow so why force it? Issues of keys aside, what practical problem
are you trying to address?

I agree that the Employee records and Outside Instructor records are
separate enough that they should be stored separately. A single generic
table has its own difficulties.
Let's say that I am using two tables, and combining them with a union query
as needed. Please accept for now that I have a single numeric field as the
identifier. How can I relate the instructor field in the Training Session
table to that key, which may be in either of two tables.
If I use a natural key consisting of First, Last, Company for the
instructor, and First, Middle, Last for the Employee, how do I handle the
relationship with the Instructor field (or combination of fields, since if I
am using a compond PK I also need a compound FK) in the Training Session
table?
In the end I may well decide to just store the name as I have in a single
field. As I explained to Tina, that is the simplest solution for the user.
In the end, this is about creating the best possible tool for the user.
Thanks for your thoughts and comments, Jamie. We may take a different
approach on some matters, but I always appreciate your thoughtful and
thorough postings.
 
T

tina

yes the primary key debate has been beaten to death in these newsgroups. the
clearest explanation, that i have ever read, of the purpose of a primary key
in a table is at
http://www.dbpd.com/vault/9805xtra.htm

i know it really clarified my thinking; and it validated my choice of
surrogate primary keys for most situations.


BruceM said:
Responses inline.

Jamie Collins said:
Which normalization 'principles' were you concerned about? A person's
name is single a atomic fact which in a DBMS is usually split into
elements (e.g. first_name, middle_names, last_name) for practical
purposes, therefore normalization considerations are moot.

The normalization principle is storing the name redundantly. Joe Jones
appears in the Employee table (stored as LastName and FirstName). Then Joe
Jones is stored in the Instructor field as the full name each time he
conducts a training session. In truth, that doesn't bother me very much,
but I wonder if there is a better way.
A person's name is a reasonable natural key in some circumstances e.g.
unique for all practical purposes in a company with few employees.

EmployeeID is used in other company records, including payroll. For now the
databases are separate, and will likely remain so for some time to come, but
there may well be a time when the data need to be combined. I think it is a
reasonable single-field key, expecially considering that it is being used as
the key elsewhere in the company. I see that if I were storing three fields
(Last, First, Middle) as the FK in a table related to the Employee table
then there is no question of reduncancy, since I would be storing the full
name in every record related to the employee table. As I expect you know
from our previous discussions, I am inclined to use surrogate keys unless a
simple natural key presents itself. I have read the arguments on both
sides, and have concluded that I am on solid ground with a surrogate key. I
will continue to use one-field keys where I can.
Is InstructorID a relational key? Perhaps its trusted source is a
'closed-shop' association (i.e. compulsory membership) of instructors
for your region who ensure that the same instructor cannot be assigned
two different InstructorIDs? But if InstructorID is what I think it is
then this may violate some relational principles. Be wary of trading a
weak natural key for no key at all.

InstructorID would be an arbitrary number, incremented by 1 starting from
1000000, as I see it. It would not be part of correspondence, or in any
other way used in the real world. When somebody comes in to to HazMat or
First Aid training, we contact a qualified company, and they send an
instructor. The instructor is identified by name.
All you'd be doing is assigning a sequence number to non-employees;
that's not the same thing as uniquely identifying instructors. Would
you issue instructors with this sequence number to and ask them to
quote it in all related correspondence etc? A key must exist in the
reality being modelled therefore if you've invented it then you have
to expose it.

I don't see that there is a reason why the key "must" exist in the reality
being modelled, or that I have to expose the key. See above.
I suspect the natural key for external trainers in current use in your
organization is company + trainer name and is unique for all practical
purposes. Are you in a position to impose a new identifier on your
employees and their clients? It's great to be able to modify the
reality to fit your model but most of the time this isn't an option.
If you are in this fortunate position, do some research first because
an incrementing integer does not always make the best key (hint: think
of adding a check digit to reduce keying errors).

This has nothing to do with imposing a new identifier on anybody. The
identifier is behind the scenes. I am not at this time going to convert to
using natural keys at any cost. I will index as needed, and will not worry
about physical position on the disk or clustered indexes. The surrogate key
does what I need.
I'm not sure what you mean by the incremented number representing a
potential problem. It is assigned automatically. In a multi-user
environment I would simply assign it during the form's Before Update event
(with error handling in the very unlikely case that a problem arises from
two users creating a record at the same moment).
[Reminds me, about seven years I worked for a company where *every*
time I was asked to quote my employee number there was general
disbelieve that I really could be employee 2000 in the year 2000. I
wager you can deduce why I don't need to ask fellow current employees
their employee IDs when I tell mine is 'jamiec'.]
Then I could combine the two tables into a
query that would be the row source for the Instructor combo box on the
Training Session form. By selecting a name I would store the ID number.
The thing I'm not quite sorting out in my head here is that the number
will
relate to one of two tables (Employee and Instructor), so how would I go
about handling that relationship? Also, each instructor could be
associated
with many training sessions, and each training session could have several
instructors, so it looks like there is a many-to-many there.
Somehow this is all starting to seem more complicated than is necessary.
Am
I missing a more direct route?

I think I'd keep things separate as far as possible i.e. internal
employee led training as a activity distinct from that of external
instructor led training and use UNION ALL when the results need to be
combined.

It seems to me that 'employees' and 'external instructors' are not of
the same 'domain' (modelling a superclass of 'people' might be a bit
too generic e.g. privacy and trust issues associated with reliable
identifiers) and it sounds like you are having problems combining them
anyhow so why force it? Issues of keys aside, what practical problem
are you trying to address?

I agree that the Employee records and Outside Instructor records are
separate enough that they should be stored separately. A single generic
table has its own difficulties.
Let's say that I am using two tables, and combining them with a union query
as needed. Please accept for now that I have a single numeric field as the
identifier. How can I relate the instructor field in the Training Session
table to that key, which may be in either of two tables.
If I use a natural key consisting of First, Last, Company for the
instructor, and First, Middle, Last for the Employee, how do I handle the
relationship with the Instructor field (or combination of fields, since if I
am using a compond PK I also need a compound FK) in the Training Session
table?
In the end I may well decide to just store the name as I have in a single
field. As I explained to Tina, that is the simplest solution for the user.
In the end, this is about creating the best possible tool for the user.
Thanks for your thoughts and comments, Jamie. We may take a different
approach on some matters, but I always appreciate your thoughtful and
thorough postings.
 
B

BruceM

Yup, I bookmarked that very article a while ago, and stopped worrying about
it (not that worrying about it was taking much of my time).

tina said:
yes the primary key debate has been beaten to death in these newsgroups.
the
clearest explanation, that i have ever read, of the purpose of a primary
key
in a table is at
http://www.dbpd.com/vault/9805xtra.htm

i know it really clarified my thinking; and it validated my choice of
surrogate primary keys for most situations.


BruceM said:
Responses inline.

Jamie Collins said:
In a training records database, each training session has one or more
instructors. Most of the training sessions are one-time events, perhaps
to
learn about processing a new type of product or something like that
In
other words, it's not a situation where there is a list of courses.
Sometimes the training is conducted by other employees such as
supervisors
(who also attend training sessions). Sometimes the training is conducted
by
people from outside the company. My approach in an early version of the
database was to just store the name. There is a combo box to select
an
employee's name; if the instructor is not an employee, the combo box
Limit
to List property is set to no, and the user can just type a name into the
box. I don't really object to doing it that way, although I realize
it
violates some normalization principles.

Which normalization 'principles' were you concerned about? A person's
name is single a atomic fact which in a DBMS is usually split into
elements (e.g. first_name, middle_names, last_name) for practical
purposes, therefore normalization considerations are moot.

The normalization principle is storing the name redundantly. Joe Jones
appears in the Employee table (stored as LastName and FirstName). Then Joe
Jones is stored in the Instructor field as the full name each time he
conducts a training session. In truth, that doesn't bother me very much,
but I wonder if there is a better way.
A person's name is a reasonable natural key in some circumstances e.g.
unique for all practical purposes in a company with few employees.

EmployeeID is used in other company records, including payroll. For now the
databases are separate, and will likely remain so for some time to come, but
there may well be a time when the data need to be combined. I think it
is a
reasonable single-field key, expecially considering that it is being used as
the key elsewhere in the company. I see that if I were storing three fields
(Last, First, Middle) as the FK in a table related to the Employee table
then there is no question of reduncancy, since I would be storing the
full
name in every record related to the employee table. As I expect you know
from our previous discussions, I am inclined to use surrogate keys unless a
simple natural key presents itself. I have read the arguments on both
sides, and have concluded that I am on solid ground with a surrogate key. I
will continue to use one-field keys where I can.
It seems to me that there will need to be a separate instructor table,
maybe
with just InstructorID, FirstName, LastName, and Company.

Is InstructorID a relational key? Perhaps its trusted source is a
'closed-shop' association (i.e. compulsory membership) of instructors
for your region who ensure that the same instructor cannot be assigned
two different InstructorIDs? But if InstructorID is what I think it is
then this may violate some relational principles. Be wary of trading a
weak natural key for no key at all.

InstructorID would be an arbitrary number, incremented by 1 starting from
1000000, as I see it. It would not be part of correspondence, or in any
other way used in the real world. When somebody comes in to to HazMat or
First Aid training, we contact a qualified company, and they send an
instructor. The instructor is identified by name.
The employee
table also has FirstName and LastName fields. EmployeeID is a four-digit
number, so I could start the Instructor table with a number such as
1000000
and use DMax to increment by 1.

All you'd be doing is assigning a sequence number to non-employees;
that's not the same thing as uniquely identifying instructors. Would
you issue instructors with this sequence number to and ask them to
quote it in all related correspondence etc? A key must exist in the
reality being modelled therefore if you've invented it then you have
to expose it.

I don't see that there is a reason why the key "must" exist in the
reality
being modelled, or that I have to expose the key. See above.
I suspect the natural key for external trainers in current use in your
organization is company + trainer name and is unique for all practical
purposes. Are you in a position to impose a new identifier on your
employees and their clients? It's great to be able to modify the
reality to fit your model but most of the time this isn't an option.
If you are in this fortunate position, do some research first because
an incrementing integer does not always make the best key (hint: think
of adding a check digit to reduce keying errors).

This has nothing to do with imposing a new identifier on anybody. The
identifier is behind the scenes. I am not at this time going to convert to
using natural keys at any cost. I will index as needed, and will not worry
about physical position on the disk or clustered indexes. The surrogate key
does what I need.
I'm not sure what you mean by the incremented number representing a
potential problem. It is assigned automatically. In a multi-user
environment I would simply assign it during the form's Before Update
event
(with error handling in the very unlikely case that a problem arises from
two users creating a record at the same moment).
[Reminds me, about seven years I worked for a company where *every*
time I was asked to quote my employee number there was general
disbelieve that I really could be employee 2000 in the year 2000. I
wager you can deduce why I don't need to ask fellow current employees
their employee IDs when I tell mine is 'jamiec'.]

Then I could combine the two tables into a
query that would be the row source for the Instructor combo box on the
Training Session form. By selecting a name I would store the ID number.
The thing I'm not quite sorting out in my head here is that the number
will
relate to one of two tables (Employee and Instructor), so how would I go
about handling that relationship? Also, each instructor could be
associated
with many training sessions, and each training session could have several
instructors, so it looks like there is a many-to-many there.
Somehow this is all starting to seem more complicated than is necessary.
Am
I missing a more direct route?

I think I'd keep things separate as far as possible i.e. internal
employee led training as a activity distinct from that of external
instructor led training and use UNION ALL when the results need to be
combined.

It seems to me that 'employees' and 'external instructors' are not of
the same 'domain' (modelling a superclass of 'people' might be a bit
too generic e.g. privacy and trust issues associated with reliable
identifiers) and it sounds like you are having problems combining them
anyhow so why force it? Issues of keys aside, what practical problem
are you trying to address?

I agree that the Employee records and Outside Instructor records are
separate enough that they should be stored separately. A single generic
table has its own difficulties.
Let's say that I am using two tables, and combining them with a union query
as needed. Please accept for now that I have a single numeric field as the
identifier. How can I relate the instructor field in the Training
Session
table to that key, which may be in either of two tables.
If I use a natural key consisting of First, Last, Company for the
instructor, and First, Middle, Last for the Employee, how do I handle the
relationship with the Instructor field (or combination of fields, since
if I
am using a compond PK I also need a compound FK) in the Training Session
table?
In the end I may well decide to just store the name as I have in a single
field. As I explained to Tina, that is the simplest solution for the user.
In the end, this is about creating the best possible tool for the user.
Thanks for your thoughts and comments, Jamie. We may take a different
approach on some matters, but I always appreciate your thoughtful and
thorough postings.
 
J

Jamie Collins

the
clearest explanation, that i have ever read, of the purpose of a primary key
in a table is athttp://www.dbpd.com/vault/9805xtra.htm

i know it really clarified my thinking; and it validated my choice of
surrogate primary keys for most situations.

I hope you haven't missed that article's point that in order to have a
surrogate ('physical key') you must first have a logical key: "The
logical unique key distinguishes entity occurrences from others to
avoid duplicates before entry into the database. That is why this key
is typically composed of attributes and relationships that the
business can recognize as distinguishing occurrences logically from
others. The primary key, however, is a physical design element that
uniquely identifies rows after they are in the database."

BruceM has told us he has a real life natural logical key of 'company
identifier' and trainer name'. He needs to constrain this natural key
in the DBMS to prevent duplicates in the DBMS. If a table's only
unique constraint is a PRIMARY KEY designation on an autonumber column
then there is nothing to prevent duplicates e.g.

INSERT INTO ExternalTrainers (company_duns, trainer_name)
VALUES ('12345678900', 'Joe Jones')
;
INSERT INTO ExternalTrainers (company_duns, trainer_name)
VALUES ('12345678900', 'Joe Jones')
;
INSERT INTO ExternalTrainers (company_duns, trainer_name)
VALUES ('12345678900', 'Joe Jones')
;

If there was an autonumber of the above table it would not be
preventing those duplicates.

Sure, add a surrogate to the table if you think it adds value (more
efficient table joins etc) but before you do ensure you've uniquely
constrained your logical key.

Jamie.

--
 
J

Jamie Collins

The normalization principle is storing the name redundantly. Joe Jones
appears in the Employee table (stored as LastName and FirstName). Then Joe
Jones is stored in the Instructor field as the full name each time he
conducts a training session. In truth, that doesn't bother me very much,
but I wonder if there is a better way.

If you've got a compound key of (first name, last name) then using it
as a foreign key is not redundancy. If you mean that two text columns
are 'bigger' than one numeric column then that's not redundancy
either.
I see that if I were storing three fields
(Last, First, Middle) as the FK in a table related to the Employee table
then there is no question of reduncancy, since I would be storing the full
name in every record related to the employee table.

So you seem to understand the point so why the contradiction above?
As I expect you know
from our previous discussions, I am inclined to use surrogate keys unless a
simple natural key presents itself.

You need a natural/logical key before you can use a surrogate. See my
reply elsewhere in this thread.
When somebody comes in to to HazMat or
First Aid training, we contact a qualified company, and they send an
instructor. The instructor is identified by name.

I don't see that there is a reason why the key "must" exist in the reality
being modelled, or that I have to expose the key.

The most common reason you get for not using (first name, last name)
as key is because they are commonly duplicated. Take you natural key
for external trainer, being 'company identifier' (don't know your
jurisdiction/business so let's say DUNS number) and trainer's name.
You could tell me that training company '12345678900' has two trainers
named 'Joe Jones'. There's a temptation to think an autonumber (ID)
will help you here because it allows you to do this:

INSERT INTO ExternalTrainers (ID, company_duns, trainer_name)
VALUES (1, '12345678900', 'Joe Jones')
;
INSERT INTO ExternalTrainers (ID, company_duns, trainer_name)
VALUES (2, '12345678900', 'Joe Jones')
;

You book both for a course. Joe Jones ID=1 goes down a storm but Joe
Jones ID=2 is most unpopular. Upon re-booking you ask the company,
"Send us Joe Jones but can we have someone else instead of Joe Jones,
please?" I don't think asking for Joe Jones ID=1 is going to clarify
matters either. It's the nature of identifiers: they actually have to
identify people.

My bank issues me with a number, embosses it on my card, encodes it
into the chip and give me a pin to remember. They invented these
identifiers, exposed them and I have to be in possession of them to
identify myself as the account holder.

Jamie.

--
 
B

BruceM

The point about having a natural key is clear. Each company in a vendor
database is distinct from each other company by virtue of easily recognized
attributes such as company name (and maybe address, although that need has
not arisen so far). At the same time, quite a few companies have changed
their names, become affiliates of other companies, and so forth. However,
all records associated with the old company name are still associated with
the new one without the need to do cascade updates or other machinations.
I never meant to suggest that I regard a surrogate PK (or a "natural"
numeric one such as EmployeeID) as a substitute for a unique record, but
merely as a way the database can identify that record without my
intervention.
 
B

BruceM

Jamie Collins said:
If you've got a compound key of (first name, last name) then using it
as a foreign key is not redundancy. If you mean that two text columns
are 'bigger' than one numeric column then that's not redundancy
either.

OK, I think I see your point, but it seems like semantics to me. In a large
company you would probably need five fields to assure that a person can be
uniquely identified, and one of those fields would probably need to be a
Comment field in case there is no "natural" way to tell them apart. By that
reasoning it seems that redundancy is almost irrelevant, since you are
always storing the natural key. By the way, do you use Cascade Updates or
Update queries or what exactly when somebody's name changes?
So you seem to understand the point so why the contradiction above?

I do not choose to store three FK fields where one will do. Furthermore,
two people could have the same name, so either that situation needs to be
anticipated, and a suffix field or something added to the key, or the
database will need to be redesigned in order to accomodate that situation.
You need a natural/logical key before you can use a surrogate. See my
reply elsewhere in this thread.

Acknowledged and commented upon in other posting.
The most common reason you get for not using (first name, last name)
as key is because they are commonly duplicated. Take you natural key
for external trainer, being 'company identifier' (don't know your
jurisdiction/business so let's say DUNS number) and trainer's name.
You could tell me that training company '12345678900' has two trainers
named 'Joe Jones'. There's a temptation to think an autonumber (ID)
will help you here because it allows you to do this:

INSERT INTO ExternalTrainers (ID, company_duns, trainer_name)
VALUES (1, '12345678900', 'Joe Jones')
;
INSERT INTO ExternalTrainers (ID, company_duns, trainer_name)
VALUES (2, '12345678900', 'Joe Jones')
;

You book both for a course. Joe Jones ID=1 goes down a storm but Joe
Jones ID=2 is most unpopular. Upon re-booking you ask the company,
"Send us Joe Jones but can we have someone else instead of Joe Jones,
please?" I don't think asking for Joe Jones ID=1 is going to clarify
matters either. It's the nature of identifiers: they actually have to
identify people.

This is not the model with which I am dealing. A company that has two
people with the same name will find a way to distinguish them if they need
to be paged or whatever. People have more sense than it seems you are
willing to acknowledge. I understand the hypothetical problem, but in the
real world people will find a way of distinguishing one person from another,
except maybe on the No-Fly list. ;-)
 
J

Jamie Collins

A company that has two
people with the same name will find a way to distinguish them if they need
to be paged or whatever. People have more sense than it seems you are
willing to acknowledge. I understand the hypothetical problem, but in the
real world people will find a way of distinguishing one person from another

But computers are pretty dumb. You need to tell them everything. You
can't expect them to verify personal data and identifiers for you. I'm
reminded of this article about what a DBMS does:

An Old Class of Errors
by Fabian Pascal
www.dbazine.com/ofinterest/oi-articles/pascal29

No DBMS can guarantee truth. If, for example, there is an EMPLOYEES
table in the database:

EMPLOYEES {EMP#, ENAME, DEPT#, SALARY}

that contains a row:

{E21,Pianka,A00,25000}

representing a proposition about an employee:

"Employee uniquely identified by employee number E21, has name
Pianka, works in department A00, earns salary 25,000"

there is no way a DBMS can tell whether the row represents a
proposition that is true or not in the real world (in fact, it does
not even know what an employee, a name, a department, or a salary is!)
The only thing it can, and should do, is to guarantee consistency with
the integrity constraints (and, therefore, the business rules) in
effect.
[Unquote]

Consider the EmployeeID *you* use. I assume the trusted source for
this identifier is your organization's personnel department, who are
responsible for verifying that the person exists, are who they say
they are, are fit for the job (health, qualifications, spent criminal
convictions, etc) and so on. Then, to make things easy for the
enterprise as a whole, the personnel department associate the person
with an EmployeeID, something I call a 'business key', and exposes it
e.g. prints it on the employee's payslip. The personnel department may
even have used positive monotonic integers and a computer could have
generated this for them (i.e. an autonumber) but the important thing
is that the job of identifying people is the responsibility if the
agency that exposes the identifier.

Yes, people can be distinguished by a human in the reality being
modelled. Yes, people entities within the table can be distinguished
by a human using the primary key, even if it's an autonumber. What the
human in question needs is a way to associate the two e.g. which Joe
Jones is autonumber ID=1. AFAIK there are two ways to do this: either
bring different/additional identifier data into the DBMS or expose the
autonumber to the reality e.g. keep different/additional identifier
plus the autonumber in a document management system (think filing
cabinet, one file per trainer, autonumber written on the front of each
file).

Now do you see why I don't think you should put the identifiers
EmployeeID and external_trainer_ID in the same coumn? They may be the
same data type, both identify professional people, etc but they are
not of the same domain and each have different trusted sources. If you
are tempted to do the latter, consider that autonumber advocates
(including the article tina linked to) say that the autonumber pk must
be meaningless outside the DBMS but exposing it means you are
attaching meaning; many autonumber advocates go as far as saying that
an autonumber should not be exposed. Also consider that a monotonic
integer may be good for the DBMS but does not make a user-friendly
identifier in reality, so why not come up with a well thought out
external_trainer_ID and, if you must, use an autonumber as a
'surrogate'?

FWIW I wonder if external_trainer_ID is actually required: I suspect
the most important fact is the contract with the training company and
companies are much easier to identify, even after companies have
merged etc. Maybe the person who gave the training on the day is
something for the 'notes' rather than being part of the identifier.
By the way, do you use Cascade Updates

Sometimes but on a strictly case-by-case basis.

Jamie.

--
 
B

BruceM

Jamie Collins said:
A company that has two
people with the same name will find a way to distinguish them if they
need
to be paged or whatever. People have more sense than it seems you are
willing to acknowledge. I understand the hypothetical problem, but in
the
real world people will find a way of distinguishing one person from
another

But computers are pretty dumb. You need to tell them everything. You
can't expect them to verify personal data and identifiers for you. I'm
reminded of this article about what a DBMS does:

An Old Class of Errors
by Fabian Pascal
www.dbazine.com/ofinterest/oi-articles/pascal29

No DBMS can guarantee truth. If, for example, there is an EMPLOYEES
table in the database:

EMPLOYEES {EMP#, ENAME, DEPT#, SALARY}

that contains a row:

{E21,Pianka,A00,25000}

representing a proposition about an employee:

"Employee uniquely identified by employee number E21, has name
Pianka, works in department A00, earns salary 25,000"

there is no way a DBMS can tell whether the row represents a
proposition that is true or not in the real world (in fact, it does
not even know what an employee, a name, a department, or a salary is!)
The only thing it can, and should do, is to guarantee consistency with
the integrity constraints (and, therefore, the business rules) in
effect.
[Unquote]

Consider the EmployeeID *you* use. I assume the trusted source for
this identifier is your organization's personnel department, who are
responsible for verifying that the person exists, are who they say
they are, are fit for the job (health, qualifications, spent criminal
convictions, etc) and so on. Then, to make things easy for the
enterprise as a whole, the personnel department associate the person
with an EmployeeID, something I call a 'business key', and exposes it
e.g. prints it on the employee's payslip. The personnel department may
even have used positive monotonic integers and a computer could have
generated this for them (i.e. an autonumber) but the important thing
is that the job of identifying people is the responsibility if the
agency that exposes the identifier.

Yes, people can be distinguished by a human in the reality being
modelled. Yes, people entities within the table can be distinguished
by a human using the primary key, even if it's an autonumber. What the
human in question needs is a way to associate the two e.g. which Joe
Jones is autonumber ID=1. AFAIK there are two ways to do this: either
bring different/additional identifier data into the DBMS or expose the
autonumber to the reality e.g. keep different/additional identifier
plus the autonumber in a document management system (think filing
cabinet, one file per trainer, autonumber written on the front of each
file).

I work in a facility that has fewer than 100 employees. Due to the nature
of the business, it isn't going to grow to the point of having thousands of
employees. The real-world environment in which this project occurs guides
my decisions. There is no realistic chance that we won't be able to find a
way to distinguish people from each other. If two people with the exact
same name end up working in the same department and have the same job title,
then we can expose the employee ID number or do something else, but until
then I will concern myself with real issues.
I am going to use surrogate keys. I am not going to read any more articles
on the subject. My database is not going to disintegrate into anarchy as a
result of my using surrogate keys. Go ahead and believe that I am wrong,
misguided, or whatever you wish. I don't care.
I have asked time and time again what you do if a person's name changes and
you are using a "natural" key that includes the name. You need to update
all of the five or six or whatever number of foreign key fields are needed
to resolve the relationship, yet you use cascade updates infrequently, I
gather.
Now do you see why I don't think you should put the identifiers
EmployeeID and external_trainer_ID in the same coumn? They may be the
same data type, both identify professional people, etc but they are
not of the same domain and each have different trusted sources. If you
are tempted to do the latter, consider that autonumber advocates
(including the article tina linked to) say that the autonumber pk must
be meaningless outside the DBMS but exposing it means you are
attaching meaning; many autonumber advocates go as far as saying that
an autonumber should not be exposed. Also consider that a monotonic
integer may be good for the DBMS but does not make a user-friendly
identifier in reality, so why not come up with a well thought out
external_trainer_ID and, if you must, use an autonumber as a
'surrogate'?

FWIW I wonder if external_trainer_ID is actually required: I suspect
the most important fact is the contract with the training company and
companies are much easier to identify, even after companies have
merged etc. Maybe the person who gave the training on the day is
something for the 'notes' rather than being part of the identifier.

Right now the users either select an instructor name from the list, or they
type in a name. Anything else would involve using an input box, at the
least, to enter a new outside instructor. I choose to make it as simple as
I can for the users. I will store the instructor name as I always have. If
the instructor is also an employee, I will store that name. If that
instructor's name changes one day, the old records will have the old name.
It doesn't matter. I have asked if it is possible to use a number as the
identifier. I can force the number in the Outside Instructors table to be
above 1000000 or whatever I want. Our employee ID numbers will not go above
four digits. If need be, I can pad the number with zeros or something, but
I won't worry about it for now.
I have asked if it is possible to create the relationship between one of two
tables (Employees and Instructors) and an Instructor field in the
TrainingSession table (actually, in a related table, since there can be
several instructors for a session). Whether I am using a surrogate key or a
six-field natural key, my question, still unanswered, is about that very
problem.
 
J

Jamie Collins

I have asked if it is possible to create the relationship between one of two
tables (Employees and Instructors) and an Instructor field in the
TrainingSession table (actually, in a related table, since there can be
several instructors for a session). Whether I am using a surrogate key or a
six-field natural key, my question, still unanswered, is about that very
problem.

A foreign key to references one table. To be able to reference 'one of
two tables' you would need to use a level of abstraction such as
subclassing e.g. (air code)

CREATE TABLE Trainers (
trainer_ID CHAR(10) NOT NULL UNIQUE,
trainer_type CHAR(8) NOT NULL,
CHECK (trainer_type IN ('Internal', 'External'))
)
;
CREATE TABLE InternalTrainers (
employee_ID INTEGER NOT NULL UNIQUE
REFERENCES Employees (employee_ID),
trainer_ID CHAR(10) NOT NULL UNIQUE,
trainer_type CHAR(8) NOT NULL,
CHECK (trainer_type = 'Internal'),
FOREIGN KEY (trainer_type, trainer_ID)
REFERENCES Trainers (trainer_type, trainer_ID),
<<columns specific to employee-as-trainer here>>
)
;
CREATE TABLE ExternalTrainers (
trainer_ID CHAR(10) NOT NULL UNIQUE,
trainer_type CHAR(8) NOT NULL,
CHECK (trainer_type = 'External'),
FOREIGN KEY (trainer_type, trainer_ID)
REFERENCES Trainers (trainer_type, trainer_ID),
<<columns specific to external trainer, including key, here>>
)
;
CREATE TABLE TrainingSessions (
trainer_ID CHAR(10) NOT NULL UNIQUE,
trainer_type CHAR(8) NOT NULL,
FOREIGN KEY (trainer_type, trainer_ID)
REFERENCES Trainers (trainer_type, trainer_ID),
<<columns specific to training session, including key, here>>
)
;

Of course, trainer_ID is completely artificial here. Perhaps the above
should be named 'superclassing' i.e. taking two real but distinct
domains and artificially combining them via a fabricated 'superclass'.

As I said earlier, for me the obvious answer is to keep them separate
i.e. one table for employee-as-trainer sessions and one for external
trainer-run sessions. If you are having trouble combining them, why
force the issue? What is the motivation for a combined table?
I have asked time and time again what you do if a person's name changes and
you are using a "natural" key that includes the name. You need to update
all of the five or six or whatever number of foreign key fields are needed
to resolve the relationship, yet you use cascade updates infrequently, I
gather.

Me personally? I rarely if ever use a person's name as a DBMS
identifier, for all the obvious reasons. I tend to use industry
standard keys (ISBN) and business keys (employee_ID) that have a
trusted source. I think I asked this upthread e.g. are trainers
(people) regulated by an authority, association, etc that can provide
an identifier? And I've also said I think the identifier should be for
the training company (i.e. the one with which there is a constract)
rather than individual people. Sorry, I'm not familiar with the
business training sector in your region of the world to give a
prescriptive answer but every region tends to have a governmental
bureaucracy (e.g. UK=Companies House Reference Number) or tax agency
(UK=HMRC Unique Tax Reference, ECON/SCON numbers, etc) that can
assist; DUNS is private sector and not universal but has the advantage
of being global.
I am going to use surrogate keys. I am not going to read any more articles
on the subject. My database is not going to disintegrate into anarchy as a
result of my using surrogate keys. Go ahead and believe that I am wrong,
misguided, or whatever you wish. I don't care.

I haven't been arguing against surrogates and I generally tend to
avoid doing so because I see it as one of those 'lifestyle choices'
i.e. it's not for me but I don't criticize you for choosing them
because I can see some benefit but I think the disadvantages outweigh.

Jamie.

--
 
B

BruceM

Jamie Collins said:
A foreign key to references one table. To be able to reference 'one of
two tables' you would need to use a level of abstraction such as
subclassing e.g. (air code)

CREATE TABLE Trainers (
trainer_ID CHAR(10) NOT NULL UNIQUE,
trainer_type CHAR(8) NOT NULL,
CHECK (trainer_type IN ('Internal', 'External'))
)
;
CREATE TABLE InternalTrainers (
employee_ID INTEGER NOT NULL UNIQUE
REFERENCES Employees (employee_ID),
trainer_ID CHAR(10) NOT NULL UNIQUE,
trainer_type CHAR(8) NOT NULL,
CHECK (trainer_type = 'Internal'),
FOREIGN KEY (trainer_type, trainer_ID)
REFERENCES Trainers (trainer_type, trainer_ID),
<<columns specific to employee-as-trainer here>>
)
;
CREATE TABLE ExternalTrainers (
trainer_ID CHAR(10) NOT NULL UNIQUE,
trainer_type CHAR(8) NOT NULL,
CHECK (trainer_type = 'External'),
FOREIGN KEY (trainer_type, trainer_ID)
REFERENCES Trainers (trainer_type, trainer_ID),
<<columns specific to external trainer, including key, here>>
)
;
CREATE TABLE TrainingSessions (
trainer_ID CHAR(10) NOT NULL UNIQUE,
trainer_type CHAR(8) NOT NULL,
FOREIGN KEY (trainer_type, trainer_ID)
REFERENCES Trainers (trainer_type, trainer_ID),
<<columns specific to training session, including key, here>>
)
;

I don't know how to read this code well enough either to understand just
what it means (what is CHAR(8), etc.), or to get it to produce a table. I
thought that I need to check the ANSI 92 compatible syntax box in Tools >
Options > Queries, but as I said I don't know how to make it work when it
doesn't produce a table. Or maybe that's not the box I need to check.
Anyhow, I'll take your word for it that I would need to use subclassing.
Of course, trainer_ID is completely artificial here. Perhaps the above
should be named 'superclassing' i.e. taking two real but distinct
domains and artificially combining them via a fabricated 'superclass'.

As I said earlier, for me the obvious answer is to keep them separate
i.e. one table for employee-as-trainer sessions and one for external
trainer-run sessions. If you are having trouble combining them, why
force the issue? What is the motivation for a combined table?

I got it into my head that I should try to avoid storing data redundantly.
To my way of thinking, storing a name over and over again would qualify as
redundancy. Now I don't know what to think, so I'm just going to go with
something that will work for my purposes.
Me personally? I rarely if ever use a person's name as a DBMS
identifier, for all the obvious reasons. I tend to use industry
standard keys (ISBN) and business keys (employee_ID) that have a
trusted source. I think I asked this upthread e.g. are trainers
(people) regulated by an authority, association, etc that can provide
an identifier? And I've also said I think the identifier should be for
the training company (i.e. the one with which there is a constract)
rather than individual people. Sorry, I'm not familiar with the
business training sector in your region of the world to give a
prescriptive answer but every region tends to have a governmental
bureaucracy (e.g. UK=Companies House Reference Number) or tax agency
(UK=HMRC Unique Tax Reference, ECON/SCON numbers, etc) that can
assist; DUNS is private sector and not universal but has the advantage
of being global.

Some trainers are individuals and some are from companies. Identifying
them by company will only work for a very few.

The only thing I'm trying to do is to assure that each person's record has
an unchanging unique identifier. If I am making a database for a volunteer
organization it is quite possible that the people won't have ID numbers, so
I will have to assign something arbitrary. I have no problem with doing
that in situations where people do have standard-issue ID numbers, because I
have already seen the ID number format change. The trainers are not
necessarily regulated in any particular way. HazMat trainers are very
different from trainers who represent a government agency, who are in turn
very different from those who represent our customers and who are providing
instruction on using a new kind of tooling or whatever. For some of these
people there is no particular numbering system at all, much less a single
system for identifying all of them without risk of duplication.
I haven't been arguing against surrogates and I generally tend to
avoid doing so because I see it as one of those 'lifestyle choices'
i.e. it's not for me but I don't criticize you for choosing them
because I can see some benefit but I think the disadvantages outweigh.
I had thought your arguments were in favor of exposing the key fields, which
is why I thought you were opposed to the use of surrogate keys. Frankly, I
don't follow all of what you are saying, as I don't have the same
vocabulary. Thanks for weighing in, though. :)
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top