Table/Query trouble

RipperT · Jun 12, 2006

InmateId In_Out HU CellNo Bunk MvDate
140060 In 4 127 B 18-Jan-06
140060 Out 8 18 18-Jan-06
497386 In 7 137 T 18-Jan-06
497386 Out 8 27 18-Jan-06
237480 In 8 1 17-Jan-06
237480 Out 7 245 B 17-Jan-06
536235 In C 75 B 17-Jan-06
536235 Out A 37 17-Jan-06
467801 In 5 123 T 16-Jan-06
467801 Out 4 158 B 16-Jan-06
258327 In 5 118 T 16-Jan-06
258327 Out 4 158 T 16-Jan-06
478494 In 8 22 15-Jan-06
478494 Out 6 117 B 15-Jan-06
229895 In 8 6 15-Jan-06
229895 Out 7 156 B 15-Jan-06

The above is a table that is programmatically written to each time an inmate moves from one bunk to another. The inspector wants to know who all lived in HU (Housing Unit) 4 on a specific date. Is it possible to extract this data from this table? IOW, who were the last inmates to move into HU 4 prior to the date in question?
Many thanx,

Rip

Tom Ellison · Jun 12, 2006

Dear Rip:

In one sense, the data is redundant. If Inmate 140060 moved out of HU 8 on 18-Jan-06 that fact is already represented by the fact he moved into HU 4 on that date. Presumably the Cell and Bunk from which he moved would be the same as the one to which he formerly moved.

Assuming an inmate can only be in one HU/Cell/Bunk at a time, the following is sufficient:

Inmate HU Cell Bunk MovedIn
140060 4 127 B 18-Jan-06
497386 7 137 T 18-Jan-06
237480 8 1 17-Jan-06
536235 C 75 B 17-Jan-06
467801 5 123 T 16-Jan-06
258327 5 118 T 16-Jan-06
478494 8 22 15-Jan-06
229895 8 6 15-Jan-06

The case where an inmate is discharged could be represented with HU value D (for discharged) if that has not other meaning, or with a NULL value for HU.

Why is this important? If you have an inmate moving out on 15-Jan-06 but not moving in until 18-Jan-06, then you have a real problem with tracking him. Likely this would be due to a user error. Asking for redundant information not only is a cause of such errors, but is a waste of time to enter the data and of storage space.

An important case is not represented in the data you give. You do not show any inmate having successive HU/Cell/Bunk assignments.

Now, to find a list of all inmates who are in HU 5 on 16-Jan-06, I propose a query:

SELECT InmateID
FROM YourTable T
WHERE HU = 5
AND MovedIn =
(SELECT MAX(MovedIn)
FROM YourTable T1
WHERE T1.InmateID = T.InmateID
AND T1.MovedIn >= #01/16/06#)

An inmate who moved into HU 5 on the 16th would presumably have been at some other HU/Cell/Bunk prior to that. The above does not count him as having been at both the previous HU/Cell/Bunk on the 165h, and at the new one. He could, of course, have been moved late in the day and would have been in the previous location for most of the day. The query could be adjusted for that if necessary.

It is not databases that are messy in this way, it is real live that is often less deterministic than the mathematics of set theory. It is our job to find such ambiguity and nail down the intent of those who would use the software.

Tom Ellison

"RipperT @comcast.net>" <rippertsmith<N_oS_pAm> wrote in message InmateId In_Out HU CellNo Bunk MvDate
140060 In 4 127 B 18-Jan-06
140060 Out 8 18 18-Jan-06
497386 In 7 137 T 18-Jan-06
497386 Out 8 27 18-Jan-06
237480 In 8 1 17-Jan-06
237480 Out 7 245 B 17-Jan-06
536235 In C 75 B 17-Jan-06
536235 Out A 37 17-Jan-06
467801 In 5 123 T 16-Jan-06
467801 Out 4 158 B 16-Jan-06
258327 In 5 118 T 16-Jan-06
258327 Out 4 158 T 16-Jan-06
478494 In 8 22 15-Jan-06
478494 Out 6 117 B 15-Jan-06
229895 In 8 6 15-Jan-06
229895 Out 7 156 B 15-Jan-06

The above is a table that is programmatically written to each time an inmate moves from one bunk to another. The inspector wants to know who all lived in HU (Housing Unit) 4 on a specific date. Is it possible to extract this data from this table? IOW, who were the last inmates to move into HU 4 prior to the date in question?
Many thanx,

Rip

Jamie Collins · Jun 12, 2006

Tom said:
In one sense, the data is redundant. If Inmate 140060 moved out of HU 8 on 18-Jan-06 that fact is already represented by the fact he moved into HU 4 on that date.
Why is this important? If you have an inmate moving out on 15-Jan-06 but not moving in until 18-Jan-06, then you have a real problem with tracking him. Likely this would be due to a user error. Asking for redundant information not only is a cause of such errors, but is a waste of time to enter the data and of storage space.

I think for this reason (plus other more fundamental reasons) there
should be two dates for each row i.e. a start_date and an end_date
pair. As you say, (assuming no time elements in dates) it is unlikely
that there would be non-contiguous periods.

With your proposed design the end_date is determined by the state of
the data in other rows.

Consider this data:

140060 127 2006-01-01
140060 137 2006-06-07
140060 123 2006-06-15

Let's say that some kind of user error or weird file corruption caused
the above 2006-06-07row to be lost. Its loss may go undetected and the
inmate erroneously assumed to be in cell 127 on 2006-06-08.

Now consider my proposed modification i.e. also capture the end date
(time elements would be useful here):

140060 127 2006-01-01 2006-06-06
140060 137 2006-06-07 2006-06-14
140060 123 2006-06-15 NULL

Notice that a NULL end_date immediately indicates that the inmate is
currently in cell 123; I don't have to do an (expensive) subquery to
get that information. Now, if the 2006-06-07 row is lost it is obvious
because there will be a hole in the data.

If I need an extra column to ensure data integrity then I don't
consider that to be redundant.

Jamie.

--

Tom Ellison · Jun 12, 2006

Dear Jamie:

I do not agree. Having both a begin and end date in each row is redundant,
causing both increased data entry and requiring that these entries be made
with perfect consistency. To a significant degree, this violates one of the
fundamental rules of normalization.

If you rely on the fact that successive records for the same Inmate must
have the ending date of the prior row equal to the beginning date of the
subsequent one, you will "detect" a missing row every time a user makes a
data entry error. Likely, this will occur quite frequently just from user
errors. When an inmate moves, the user must locate his prior location row
and enter the ending date there, then enter that same date in the new
location row.

Tom Ellison

on 18-Jan-06 that fact is already represented by the fact he moved into HU 4
on that date.but not moving in until 18-Jan-06, then you have a real problem with
tracking him. Likely this would be due to a user error. Asking for
redundant information not only is a cause of such errors, but is a waste of
time to enter the data and of storage space.

Jamie Collins · Jun 12, 2006

Tom said:
Having both a begin and end date in each row is redundant

I already covered this: if it's required for data integrity then by
definition it's not redundant. Think of DRI: you repeat the key column
value(s) in the FK column(s) but you don't think they are redundant, do
you?

To a significant degree, this violates one of the
fundamental rules of normalization.

Please give more details: which rule, to what degree etc.

If you rely on the fact that successive records for the same Inmate must
have the ending date of the prior row equal to the beginning date of the
subsequent one

Successive records, prior/subsequent row? Why, you are now violating

the 'rules' of RM (a set has no order...?) said:
requiring that these entries be made
with perfect consistency.

you will "detect" a missing row every time a user makes a
data entry error. Likely, this will occur quite frequently just from user
errors.

Good! I don't want bad, inconsistent data in my database, therefore I
write constraints to prevent it. Hopefully the front end application
developer will handle the bad user entry with grace. Perfect
consistency? Absolutely yes.

causing both increased data entry...
the user must locate his prior location row
and enter the ending date there, then enter that same date in the new
location row.

You've lost me now. As the database guy, I'll take responsibility for
data integrity. There's no reason why the front end guy couldn't get
just one date from the user to close off the end_date and open the new
start_date, with helper procedures provided by me. But maybe I'm not
seeing the problem here: please post an example scenario.

Jamie.

--

Tom Ellison · Jun 13, 2006

Jamie Collins said:
I already covered this: if it's required for data integrity then by
definition it's not redundant. Think of DRI: you repeat the key column
value(s) in the FK column(s) but you don't think they are redundant, do
you?

This is not required for data integrity. You posit a situation in which it
could be used to detect that data is missing due to corruption. Your
proposal does not provide a way of restoring the data (you would not know
the HU/Cell/Bunk for the missing row) nor would you know if more than one
row were missing. Perhaps if you created a duplicate of every row in the
table, with every column, you would then be able to restore all the data. I
commonly do this. It's called a backup. The backup can be used to detect
and also to fully repair the type of corruption you suggest. Just having
two dates saved will only detect that there is an error, not how many, and
cannot be used to repair the damage.

Third normal form. If the user needs to alter the date an imate moves, you
must update two rows. Redundancy, pure and simple.

Please give more details: which rule, to what degree etc.

Successive records, prior/subsequent row? Why, you are now violating
the 'rules' of RM (a set has no order...?) <g>.

Yes, the table "bag" has no order unless we posit to apply one. This refers
to the ordering of each Inmate by the MvDate.

Good! I don't want bad, inconsistent data in my database, therefore I
write constraints to prevent it. Hopefully the front end application
developer will handle the bad user entry with grace. Perfect
consistency? Absolutely yes.

Having just one date in each row does this the same way as you suggest.
Having the user enter it just once, then storing it twice is not much of a
solution, as I have covered.

You've lost me now. As the database guy, I'll take responsibility for
data integrity. There's no reason why the front end guy couldn't get
just one date from the user to close off the end_date and open the new
start_date, with helper procedures provided by me. But maybe I'm not
seeing the problem here: please post an example scenario.

Fine. You can have the user enter this once, then save it in two places.
Still not a good or proper solution to the problem, as I have tried to show.

Jamie.

In summary, the date an imate leaves can be derived from the date in the
subsequent row (in date order and for the same inmate). Storing data that
can be derived is redundancy.

This invokes all the common problems of storing redundant data:

If the date an imate arrives is changed, you must find and fix up the date
departed from another row.

If a row is deleted, then the process must find the date arrived from the
subsequent row and update the previous row.

If a new row is inserted between two existing rows, another "fix up" is
mandated.

These kinds of housekeeping nightmares are what we avoid by not having
redundant data. That's a primary meaning for the 3rd Normal Rule.

Tom Ellison

RipperT · Jun 13, 2006

Thanx to you both for valuable input. I don't pretend to understand all of
it, as you are both far and away more skilled in the art of DB design than
I.

The table, as I said, is written to programmatically in the course of the
user plugging inmate ID's into each row of a table called
tblLockAllocations. Each row of this table represents a bunk in the
institution. There is one, and only one, row in this table for each bunk in
the institution, so we can take care of the bigger business of counting
inmates at specific times of the day. As the user continually changes these
inmate ID's in this table (ave. 65/day), tblLockHistory is written to with
the data originally posted (I removed other fields for clarity).
tblLockHistory works well for eyeballing the lock history of single inmate,
but I hadn't anticipated the inspector's request.

What Jamie is proposing would require the user to go to tblLockHistory and
find the move-in date of a prisoner and fill in the move-out date of that
record when the prisoner moves out. The DB's main purpose, as I said, is for
counting inmates. My users don't even know that tblLockHistory exists. Also,
this DB is only used in a single institution. If an inmate shows an 'out'
record with no corresponding 'in' record in tblLockHistory, that means he
transferred to another institution and he is off our count, in which case we
won't worry about tracking him until he transfers back in again.

Hope this clears things up a bit. Thanks again for the help.

Rip

Jamie Collins · Jun 13, 2006

Tom said:
the date an imate leaves can be derived from the date in the
subsequent row (in date order and for the same inmate). Storing data that
can be derived

Here's your problem: you are not modelling exit dates. You are
effectively saying, if the inmate entered a new bunk then they must
have exited the last one on the previous day (previous second or
whatever is the smallest unit of time being modelled). That's data
modelling by implication which may be OK for some business applications
but I'm thinking that tracking inmates requires the data to be a bit
more explicit.

Think of the situation with your 'one date' approach where an insert
error alters the data e.g.

2005-06-01
2005-06-15 <<< error: actual date 2006-06-15
2005-06-30
2005-12-24

With your model this insert could well get inserted undetected because
the data 'recovers itself' erroneously. The exit date is being modelled
by implication only.

Using start- and end dates looks more like this:

2005-06-01 2005-06-29
2005-06-15 <<< error: actual date 2006-06-15
2005-06-30 2005-12-24

The keying error wouldn't get into the table: it would fail my check
constraint for non-overlapping periods.

Third normal form. If the user needs to alter the date an inmate moves, you
must update two rows. Redundancy, pure and simple.

That would be two facts: the date/time they exited the old bunk and the
date/time they entered the new. You are hard-wiring the assumption that
one automatically follows the other: what happens to your model when
that can no longer be assumed (bunk > AWOL > solitary) ...?

Anyhow, the issues are more fundamental than even 3NF. It is the
standard approach to model time, being a continuum, in SQL using
durations as start and end date pairs (on the same row). Google for
temporal database expert Rick Snodgrass.

Jamie.

--

Tom Ellison · Jun 14, 2006

Dear Jamie:

Previously, you stated:

"There's no reason why the front end guy couldn't get just one date from the
user to close off the end_date and open the new start_date"

Now you're saying the end_date of the prior occupancy need not be the same
as the start_date of the new occupancy.

Previously, you said the repetition of the same date ending one occupancy
and beginning the next could be used to detect a corruption in the data.

With your current assertion that the occupancy need not be continuous, this
can no longer be the case.

If you continuously make arguments that are contradictory and make
assumptions about the data that may not be appropriate, you will doubtless
come up with a combination that makes my suggestions inappropriate. Whether
this would have any bearing on the original question is doubtful.

Using the old axiom, "everybody has to be someplace" it is not an unlikely
assumption that each inmate is assigned to one HU/Cell/Bunk right up to the
moment he is assigned to another. I do not think this is unlikely. Only
the person who originally asked the question here can evaluate that.

When you shift your objections every time I respond, I can only see that
what you want is to argue with me, not to help with the original question.
That is not my goal, so I will not continue this.

Tom Ellison

Jamie Collins · Jun 14, 2006

Tom said:
I can only see that
what you want is to argue with me, not to help with the original question.

Sorry I sound contradictory and/or augmentative; that's not my
intention.

I admit I'm finding it hard to explain; these are complex matters for
my simple brain. If I were a better wordsmith I'd say the duration is
part of an atomic fact; you are trying to find your end point in
another row, not in the row where it belongs. The complete fact is the
duration, not just the start of it.
(http://groups.google.com/group/comp.databases.theory/browse_frm/thread/545c4a47d744b8f6/)

Bottom line: read the Snodgrass book. It's a free to download as a 5MB
PDF:

http://www.cs.arizona.edu/~rts/tdbbook.pdf

It contains Access/Jet specifics including code, huge but also free to
download (http://www.cs.arizona.edu/~rts/cdrom.zip).

It concludes that periods are best modelled in SQL (including
Access/Jet) using the closed-open representation via start date and end
date pairs, where the end_date (open) instant for the 'prior' row is
one granule (being one second in Access/Jet) before the start_date
(closed) instant in the 'subsequent' row ('prior' and 'subsequent'
being Tom Ellison's terms). For ease of reference, the relevant pages
of the PDF (not the book pages) are 113-114 and 132. The Access/Jet
specific recommendation is on page 125.

Jamie.

--