GROUP BY vs DISTINCT

M

Michel Walsh

Where? Here. You don't see it? Well, re-initialize your newsreader. Baz
reports it, and, since I was not believing its experimentation was right, I
decided to do it myself and got a similar result that I also reported.


Vanderghast, Access MVP.
 
M

Michel Walsh

I really don't know about what Jet does in the step "group" of its query
plan. MS SQL Server is more detailed and uses the index (if it exists, and
if the number of records is high enough) of the field to be DISTINCT/GROUP
BY.

If you implant that strategy yourself, in this case, and if the index
exists, since the reading of the records (note the query is covered by the
index) is done already sorted, so you can dump the result list onto a stack
rather than on a sorted list. Indeed, to find if a record is already in the
"selected/grouped" list, you only have to compare it with the top value on
the stack. If the index does not exist, you have to relay on stats (if they
are available): does the result will have much less rows than the initial
data, or not. If the number of row is much smaller in the result, may be
preferable to use a sorted list (faster to sort, a little bit longer to
search) than to sort the whole initial set first, then use a stack (longer
to sort the whole initial set, faster to 'search' for existence). Again, if
the index exists, the query being covered by the index, you don't need to
touch the data at all, just the index, so, you have less IO (since index
would be, generally, more compact, than the 'table', it will resides on less
'pages', so less requirements to hit the hard disk). You see, that is what
SQL is: it knows different way of doing the described set, and decide which
strategy it will take. If you want to do the same, clearly, you should
weight the similar alternatives, and 'hard code' multiple solutions, not
just one (in general).



Vanderghast, Access MVP
 
D

David W. Fenton

Most peoples cars could run faster if they opened the hood and
made a few adjustments. . There are a lot of Access users out
there that do not want to "open the hood", or know anything about
what SQL is or does. For those less skilled users learning how to
use Group By on the QBE grid will be far more productive in the
short to medium term than learning how to edit SQL to add
"DISTINCT".

Er, you don't have to edit SQL to do that.
Using my test query on an old slow PC I would need to run the
query over 20 times before it saved me time.

I agree that it's not really a significant difference for easy test
cases.
The other, more worrying, thing about distinct is that I can see
no indication that it is in the query when viewed from the grid
design window. Stranger still when I added a field x:1 to the
design grid the "Group By" version ran faster than Distinct. by a
factor around 8:7

DISTINCT shows 1 record for N duplicates.
GROUP BY shows 1 record for N duplicates

I think I have discovered, at last, why Microsoft did not feel the
need for a "DISTINCT" button.

I think you're imagining things.
 
D

David W. Fenton

Vanderghast, Access MVP



Because it is far from being optimal.

Well, when used WRONG, perhaps.

But have you tested it for its actual performance in aggregating
data? If Jet would aggregate data more slowly if it optimized GROUP
BY to DISTINCT, then I'm glad that it's slower when used WRONGLY as
the equivalent of DISTINCT.

But nobody has tested that.

I see no reason to, since I have nothing to prove here.
 
D

David W. Fenton

You are also missing that GROUP BY WITH AGGREGATE can use the
similar query plan than the one used by DISTINCT and GROUP BY
without aggregate (as you can check my 'claim' using MS SLQ
Server), where the aggregates are done for a small fraction (less
than half of one percent of the whole query execution). Jet is
inefficient in both cases, I mean, with or without aggregate.

Have you tested that Jet's GROUP BY for aggregates has a performance
problem?
I NEVER said that result that are identical implies the execution
plan should be the same. I said LOGICALLY EQIVALENT description
SHOULD be optimized the same way. And DISTINCT and GROUP BY
without aggregate ARE logically equivalent descriptions.

But they *aren't* logically equivalent.

I'm done here.
 
M

Michel Walsh

1- Yes, they are as slow as without aggregate, while they should run in a
time almost equivalent as DISTINCT, for simple aggregate (that is, not for
overly complex arithmetic expression, VBA function, sub-query, ... don't add
words I don't say).

2- They are equivalent, as mentioned by Keith Hare, Convenor of the
International SQL Standards. I try to find a public reference on the web at
the moment, no success yet, but they are logically equivalent. And you are
wrong.

3- Allen Browne reports a case when DISTINCT, in JET, is buggy, on the
reported case, and YOU HAVE TO use GROUP BY, without aggregate, to really
get a list of "distinct" values: http://allenbrowne.com/bug-12.html


Vanderghast, Access MVP
 
D

David Cox

David W. Fenton said:
Er, you don't have to edit SQL to do that.

My apologies to the group. It is possible to specify Unique records in the
properties window of a query, brought up by left clicking in the design
window.

I do not like this implementation, so I had never used it, and forgot it
existed.
a} This is quite a different query with the option selected, but it looks
the same in the design window.
b} It is potentially misleading. Unique means to me records that are not
duplicated, I would naturally tend to interpret distinct the same way. I
would prefer to see the word FIRST, or FIRSTOFGROUP used for this feature.
Given the way it is I would find it easier to explain to a newbie that
Access coceptually uses the first value it finds in a group by default, and
expect it to be actually optimised to work that way.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Similar Threads


Top