The Importance of Cleaning Up Data

D

David W. Fenton

The cutover for my SQL Server upsizing project is this weekend.
Things have gone well in terms of the app design, but one thing was
driven home to me by this upsizing project that I think it's good to
be reminded of:

Clean data is important not just to accuracy, but because clean data
makes your database function better.

For example:

I had a table with over 500K records that had a foreign key from a
5K table. There was no relationship enforced, and so many of the
records had no value there. But I needed to display the data from
both tables. That meant that I needed an outer join.

The cost of this in SQL Server was HUGE.

So, after speaking with the client, it was determined that there
really was no reason except for sloppy data entry that this field
should ever be Null. Thus, I was able to populate the field, make it
required (i.e., NULL not allowed) and change the join in views to
INNER JOIN. The performance improvement was drastic -- from over a
minute to pull up a filtered result on the view that was doing the
joining to instantaneous results (i.e., much less than a second).

I started looking around for other outer joins that could be
eliminated by cleaning up the data, and found several places where
bad data entry and failure to enforce relationships had led to null
values in foreign keys. After cleaning this up, a lot of performance
improvements resulted.

The main point:

Avoid Null foreign keys whenever you can. In Jet, that means making
your foreign keys required.

Obviously, the N:0 relationship requires allowing a Null FK, but if
you don't need that, it's best to not allow it. In the database I've
been working on, there were plenty of places where the only reason
the Null was allowed was to accomodate legacy data that didn't have
any values in the FK fields. Fortunately, there were obvious choices
for the parent records in all cases, so populating the FK fields was
very desirable.

So, performance can be a matter of schema design, and avoiding outer
joins is A GOOD THING WHENEVER POSSIBLE.
 
T

Tony Toews [MVP]

David W. Fenton said:
Avoid Null foreign keys whenever you can. In Jet, that means making
your foreign keys required.

I have many, many tables in many apps with null foreign keys. That's
quite standard. Relationships however are rigidly enforced.

Tony
--
Tony Toews, Microsoft Access MVP
Please respond only in the newsgroups so that others can
read the entire thread of messages.
Microsoft Access Links, Hints, Tips & Accounting Systems at
http://www.granite.ab.ca/accsmstr.htm
Tony's Microsoft Access Blog - http://msmvps.com/blogs/access/
 
T

Tom van Stiphout

On Fri, 26 Oct 2007 18:50:57 -0500, "David W. Fenton"

It's also important for the developer to allocate sufficient time in
his/her proposal for such cleanup. This can take up a good chunk of
time, and typically you don't know how bad the data is when you write
up your proposal.

-Tom.
 
D

David W. Fenton

I have many, many tables in many apps with null foreign keys.
That's quite standard. Relationships however are rigidly
enforced.

Um, isn't that what I said with "avoid Null FKs WHENEVER YOU CAN"
and with the sentence immediately following the sentences you quote,
as in "Obviously, the N:0 relationship requires allowing a Null FK"?

When you have to have N:0, then yes, you have to allow Null FKs. But
if you *don't* want N:0, then make sure your FK is Null. In the
database I was working with, the data should have disallowed N:0,
but somewhere in the 7 years of revisions I've been doing on this
app (which was not originally created by a professional developer),
I never fixed that problem. With a Jet back end, it worked OK. With
SQL Server, the outer join that was required when presenting selects
on the two tables was causing horrendous performance. Implementing
the correct design fixed the performance problem.

Yes, of course, it was a mistake in the design, but the whole point
was that Jet's efficiency let me get away with that mistake for 7
years, while SQL Server was very unforgiving.

That was entirely my point, that schema design can have a huge
effect on performance, not just data consistency.
 
D

David W. Fenton

It's also important for the developer to allocate sufficient time
in his/her proposal for such cleanup. This can take up a good
chunk of time, and typically you don't know how bad the data is
when you write up your proposal.

In this case, I gave them a fixed estimate for the things I knew
would require adjustments. Turns out, though, that the things I
expected to take a long time to adjust either didn't need adjustment
at all (one operation ran orders of magnitude faster with SQL Server
than on Jet!), or took very little time to adjust (because I'd
designed the code in such a way as to make it easy to replace with a
new approach). So, that left me time to do other things that weren't
expected, without having to exceed the estimate.

Of course, I did *not* include data import/massage in the 10 hours,
because I know perfectly well that it's impossible to estimate how
long it will take. In this case, I thought I knew the data very
well, but SQL Server flushed out a whole lot of issues that I either
didn't know about (the outer join issue), or didn't realize were a
problem (the invalid dates).

I guess the point I was trying to make is that we usually think of
programming and UI design and efficient data retrieval as the tasks
we address in trying to maximize performance. In this case, it was
the underlying schema that was leading to the performance hit, and
that's something I guess I'd forgotten about in my years of adapting
other people's apps.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top