Looking for dedup/ignore nulls suggestion

B

Bonnie

I would like to dedup a file but ignore nulls. So there are 15,000 records
with a IDnumber, state and reportNO - some of the IDnumbers are null. If I
make the table key for those fields it sees the nulls as a dup and doesn't
append. Any ideas?

Bonnie
 
A

a a r o n _ k e m p f

I frequently use ISNULL or COALESCE in SQL Server to treat nulls as if
they were something else.

-Aaron
 
J

Jerry Whittle

Instead of the primary key, you may be able to use the combination of those
three fields as a unique index. Set the Ignore Nulls property to No. If that
doesn't work, you could set Ignore Nulls to Yes.

For the primary key, I suggest an Autonumber field.
 
K

Ken Sheridan

Bonnie:

Is it just the IDnumber column which determines the duplication? If so then
firstly add an autonumber column (YourID in the example below) to both the
original table to enable you to distinguish between the duplicated values in
the original table, and to the new table into which you want to insert the
distinct rows. Then use a subquery to restrict the query by arbitrarily
selecting the MAX value of the autonumber column for each IDnumber value, so
that it returns one instance of each row with the same non-Null IDnumber
values, and a Boolean OR operation to additionally return all rows where
IDnumber is Null:

INSERT INTO YourNewTable
SELECT *
FROM YourTable AS T1
WHERE YourID =
(SELECT MAX(YourID)
FROM YourTable AS T2
WHERE T2.IDnumber = T1.IDnumber)
OR IDnumber IS NULL;

If its more than one column in combination which determines the duplication
then you'd just need to correlate the subquery with the outer query on all
the relevant columns rather than just on IDnumber, e.g. if duplication is
determined by all three columns being the same:

INSERT INTO YourNewTable
SELECT *
FROM YourTable AS T1
WHERE YourID =
(SELECT MAX(YourID)
FROM YourTable AS T2
WHERE T2.IDnumber = T1.IDnumber
AND T2.state = T1.state
AND T2.reportNO = T1.reportNO)
OR IDnumber IS NULL;


You'll need the YourID column as the key of the new table at least
temporarily as you cannot use IDnumber as the key due to the fact that a
primary key, or part of a primary key, cannot be Null. This is why its not
appending the Null IDnumber rows at present in fact, not that it regards the
Nulls as duplicates. A Null cannot be equal to anything else in fact, even
to another Null, so two rows with Nulls in the IDnumber column cannot be
'duplicates' as such. You might find the following, which is my standard
spiel on Null, helpful in getting your teeth around the concept of Null:


"The important thing to realize is that Null is not a value, but the absence
of a value. The Value property of a field, control etc can be Null, but Null
is not a value as such. About the nearest we can get to what it means is
'unknown', so it could be anything. When dealing with numbers for instance
it might be zero, it might be one million, or any other number; when we
encounter a Null we just don't know.

In arithmetic operations Null 'propagates'. If you think of Null as
'unknown' then this makes more sense. 123 + unknown for instance could be
123 + 0, 123 + 42, 123 +1,000,000, the possibilities are infinite, so the
result is also unknown, i.e. Null. The same goes for any other type of
arithmetic operation.

Similarly you cannot compare Null with something, e.g. in a query WHERE x =
NULL, as this is saying does x = unknown. Again because the 'unknown' might
be the same as the value of x, or it might not be the result is again
unknown, i.e. Null. This is even true of NULL = NULL as each 'unknown' might
or might not be the same as the other 'unknown', so again the result is
unknown, Null.

There are a number of practical implications of this and ways around them.
The Nz function returns the value of something if it has a value, or another
value if it is Null. Most often this is used to return a zero so that an
arithmetic operation doesn't result in Null, but it can be used to return any
other value you wish.

The same applies when using the Boolean AND or OR operators; TRUE OR NULL
results in TRUE as only one operand need be True, but TRUE AND NULL results
in NULL, not in True or in False as every operand in an AND operation must be
True for it to evaluate to True, and at least one must be False and none of
then Null for it to evaluate to False.

When comparing something with Null, as we've seen the usual equality
operator, = , can't be used. In VB, however, there is an IsNull function, so
you can use that, IsNull(x), or Not IsNull(x) to determine if x is Null or
not. IN a query there is a similar provision, in this case WHERE x IS NULL,
or WHERE x IS NOT NULL.

There are some situations in which a Null can cause problems which you might
not anticipate. An interesting one is when using the NOT IN predicate in a
query with a subquery, e.g.

SELECT *
FROM a
WHERE x NOT IN
(SELECT y
FROM b);

now lets say there are rows in table a where the value of x is 123 in one
row and in table b the column y contains 345, NULL, 456, 678

Now in plain English what we are asking here in connection with the row
containing 123 in the x column is does any row in b contain 123 in column y,
and if not return the row from a, i.e. the NOT IN predicate evaluates to
TRUE. At first sight that would look to be the case, but if we analyse it a
little its saying:

WHERE 123 <> 456 AND 123 <> NULL AND 123 <> 679

For this expression to evaluate to TRUE, and the row be returned all three
of the comparative operations would have to evaluate to TRUE, but as we've
seen 123 <> NULL results in NULL, so the expression as a whole evaluates to
NULL and the row isn't returned even though there is no row in b with123 in
the y column.

The same does not apply to the IN predicate, however as:

SELECT *
FROM a
WHERE x IN
(SELECT y
FROM b);

is, if we now assume that one row in b contains 123 in y, the same as saying:

WHERE 123 = 123 OR 123 = NULL OR 123 = 679

This time we only require any one of the three comparative operations to
evaluate to TRUE, which the first one does, so the NULL has no effect in this
case.

We can get round the problem with the NOT IN predicate by using the NOT
EXISTS predicate instead and correlating the subquery with the outer query:

SELECT *
FROM a
WHERE NOT EXISTS
(SELECT *
FROM b
WHERE b.y = a.x);

In fact the EXISTS or NOT EXISTS predicates will in most cases also give
better performance than the IN or NOT IN predicates.

Finally its worth looking at the question of whether fields should be
allowed to contain Nulls or not (setting the Required property of a field in
table design prohibits Nulls). For the point of view of the theoretical
database relational model the answer is probably that Nulls are not allowed.
This is because the model requires every column position in every row to
contain a legal value of the attribute type which the column represents (its
worded rather differently in relational theory using different terms, but I
don't want to confuse the issue so I'm deliberately being a little sloppy in
my wording here). In practice most people would say that Nulls are
permissible in a table to represent missing or inapplicable values.
Nevertheless, caution is necessary. Say you have a CreditLimit column in a
Customers table, what does a Null CreditLimit mean? That the customer is
allowed zero credit? That the customer is allowed unlimited credit? We
simply don't know from the Null, its entirely a matter of interpretation as
the Null is 'semantically ambiguous'. Most people would probably say its
means zero credit, at least pending a decision on the customer's
creditworthiness, but if so the value should be zero. With currency data in
fact its almost always the case that Nulls should be prohibited and a
DefaultValue property of zero given to the field. In other cases the same
considerations won't arise; if we don't currently know someone's phone number
for instance then leaving the field Null should cause no problems."

Ken Sheridan
Stafford, England
 
B

Bonnie

Thanks to everyone for the suggestions. Particular you Ken for such a
comprehensive explanation. It's a big help and I have printed it out for
reference.

Bonnie
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top