Getting TOP of Subsets of a Table

S

Smartin

/Preamble/
Dear Reader,
There have been many times I have started to write out a problem that
has me stumped, only to find that as I write the solution magically
materializes. Sometimes before I can finish my thoughts I have solved
the problem, and I cancel my message. If I end up pressing "send" on
this message, Dear Reader, please do not think I did so hastily. The
collective knowledge and helpfulness in this group has been nothing
short of amazing, and is highly prized. Kudos to you all.

/The Problem/

Given the following table

TAT_DATA
=========
PK
LOB
SOURCE
BNK
REPRICED
PROC_MONTH
JURIS
LAGTIME

I need to return all columns without aggregation, but filtered to return
the TOP 95 PERCENT of rows relative to LAGTIME (ORDERed BY ASC) within
each LOB and PROC_MONTH pair.

A little bit about the data and what I'm doing:

LOB can be one of three distinct values.

PROC_MONTH is a date column representing a month. Each month we will add
more rows to the table representing the previous month's activity.
Defining a list of possible values here is possible, but not desirable
from a maintenance standpoint. Each PROC_MONTH adds some 80K rows on
average to the table.

LAGTIME is a integer (in the mathematical sense). I want to return the
top 95% of these when ordered ascending (so as to discard the "worst"
5%). Additionally, there are a few outliers I need to exclude before the
TOP calculation is applied. I assume this will happen in a WHERE clause,
e.g.,
WHERE LAGTIME >=0 AND LAGTIME < 1000

I cannot deliver aggregated data because the detail is needed in further
analysis for which requirements vary. Various individuals will be
examining averages, counts and deviations by SOURCE, BNK, JURIS,
REPRICED etc., but always(?) within an LOB and PROC_MONTH. An Excel
Pivot table is supplied to the end-users for this step, no worries
there. I just need to supply the filtered detail as a data source.

The production platforms are A97 and E97.

*However* if you would like real data to play with I have created an
anonymized, 1001 row sample mdb in A2002 format (zipped to 36K) and made
it available here:

http://vfdrake.home.comcast.net/files/TAT2002.zip

Thanks, as always, for any suggestions you may have.
 
S

Smartin

Smartin said:
/Preamble/
Dear Reader,
There have been many times I have started to write out a problem that
has me stumped, only to find that as I write the solution magically
materializes. Sometimes before I can finish my thoughts I have solved
the problem, and I cancel my message. If I end up pressing "send" on
this message, Dear Reader, please do not think I did so hastily. The
collective knowledge and helpfulness in this group has been nothing
short of amazing, and is highly prized. Kudos to you all.

/The Problem/

Given the following table

TAT_DATA
=========
PK
LOB
SOURCE
BNK
REPRICED
PROC_MONTH
JURIS
LAGTIME

I need to return all columns without aggregation, but filtered to return
the TOP 95 PERCENT of rows relative to LAGTIME (ORDERed BY ASC) within
each LOB and PROC_MONTH pair.

A little bit about the data and what I'm doing:

LOB can be one of three distinct values.

PROC_MONTH is a date column representing a month. Each month we will add
more rows to the table representing the previous month's activity.
Defining a list of possible values here is possible, but not desirable
from a maintenance standpoint. Each PROC_MONTH adds some 80K rows on
average to the table.

LAGTIME is a integer (in the mathematical sense). I want to return the
top 95% of these when ordered ascending (so as to discard the "worst"
5%). Additionally, there are a few outliers I need to exclude before the
TOP calculation is applied. I assume this will happen in a WHERE clause,
e.g.,
WHERE LAGTIME >=0 AND LAGTIME < 1000

I cannot deliver aggregated data because the detail is needed in further
analysis for which requirements vary. Various individuals will be
examining averages, counts and deviations by SOURCE, BNK, JURIS,
REPRICED etc., but always(?) within an LOB and PROC_MONTH. An Excel
Pivot table is supplied to the end-users for this step, no worries
there. I just need to supply the filtered detail as a data source.

The production platforms are A97 and E97.

*However* if you would like real data to play with I have created an
anonymized, 1001 row sample mdb in A2002 format (zipped to 36K) and made
it available here:

http://vfdrake.home.comcast.net/files/TAT2002.zip

Thanks, as always, for any suggestions you may have.

Well, no action here huh?

OK I toyed with this idea... it's easy enough to extract existing
PROC_MONTH and LOB values into separate queries or tables. It /seems/
like I should be able to join these to the original table and limit the
SELECT TOP piece to these pairs. That's as far as I got today. After
that, my brain turns into something like asphalt.
 
S

Smartin

Smartin said:
Well, no action here huh?

OK I toyed with this idea... it's easy enough to extract existing
PROC_MONTH and LOB values into separate queries or tables. It /seems/
like I should be able to join these to the original table and limit the
SELECT TOP piece to these pairs. That's as far as I got today. After
that, my brain turns into something like asphalt.

I'm getting closer. I have something that works, kind of...

SELECT M1.*
FROM TAT_DATA AS M1
WHERE M1.PK IN (
SELECT TOP 95 PERCENT M2.PK
FROM TAT_DATA AS M2
WHERE M2.LOB = M1.LOB
AND M2.PROC_MONTH = M1.PROC_MONTH
AND M2.LAGTIME >= 0
AND M2.LAGTIME < 720
ORDER BY M2.LAGTIME);

I presume this is known as a Cartesian or cross-product, since every row
is compared to every row.

It works great against 1000 rows, but unfortunately my production data
is 300K rows, and the query doesn't exactly fly. I tried creating an
index on LOB and PROC_MONTH but the query still acts like it's going to
take days to run.

So to get my answer I need to feed PROC_MONTH and LOB to the subquery,
one combination at a time. I have a query that creates those
combinations on the fly, let's call it LOB_MONTH, but if I use that in
the outer select how can I get at the detail in TAT_DATA? Urrgh...
 
S

Smartin

Smartin said:
I'm getting closer. I have something that works, kind of...

SELECT M1.*
FROM TAT_DATA AS M1
WHERE M1.PK IN (
SELECT TOP 95 PERCENT M2.PK
FROM TAT_DATA AS M2
WHERE M2.LOB = M1.LOB
AND M2.PROC_MONTH = M1.PROC_MONTH
AND M2.LAGTIME >= 0
AND M2.LAGTIME < 720
ORDER BY M2.LAGTIME);

I presume this is known as a Cartesian or cross-product, since every row
is compared to every row.

It works great against 1000 rows, but unfortunately my production data
is 300K rows, and the query doesn't exactly fly. I tried creating an
index on LOB and PROC_MONTH but the query still acts like it's going to
take days to run.

So to get my answer I need to feed PROC_MONTH and LOB to the subquery,
one combination at a time. I have a query that creates those
combinations on the fly, let's call it LOB_MONTH, but if I use that in
the outer select how can I get at the detail in TAT_DATA? Urrgh...

I have a solution, albeit a kludgy one. It uses VBA that creates a
recordset of LOB & PROC_MONTH pairs and loops through the recordset
whilst feeding the values to a query that updates a second table.

It seems I should be able to accomplish this with straight SQL, and
indeed I would like to port the query to a DBMS that does not support
VB/VBA, hence the kludge designation. If anyone is still reading and has
any suggestions, feel free to suggest.

Thanks all,
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top