Calculate Percentile by grouping fields in Database

  • Thread starter Thread starter Guest
  • Start date Start date
G

Guest

Hi, i am having DATA like this in access table , i would like to calculate
Percentile 50% of sales by grouping data at Month Level, that means for each
month i ll get differnt percentile value , eg. 2007/01 - Percentile Value,
2007/02 another percentile value.... and so on, is it possible to calculate
this way.

Customer-Month - Sales
X-2007/01-50000
Y-2007/01-75000
Z-2007/01-80000
X-2007/02-50000
Z-2007/02-80000

Thanks
 
Max said:
Hi, i am having DATA like this in access table , i would like to calculate
Percentile 50% of sales by grouping data at Month Level, that means for
each
month i ll get differnt percentile value , eg. 2007/01 - Percentile
Value,
2007/02 another percentile value.... and so on, is it possible to
calculate
this way.

Customer-Month - Sales
X-2007/01-50000
Y-2007/01-75000
Z-2007/01-80000
X-2007/02-50000
Z-2007/02-80000
Hi Max,

This can be accomplished via crosstab query
if you are up to exploring that route.

Here is what I think I know
(with apologies to Steve Dassin
if I have errored somewhere):

I believe we are talking about:

Percentile = (Rank/TotalCount) * 100

The following solution requires that you
be able to use DCount to produce a rank
and a total count over the group.

My example changes [Month] to non-reserved
word [SaleMonth] and assumes it is a Date/Time.
I also assumed your example data was contained
in a table "tblSales."

Let's start out easy to see if your DCounts work...

SELECT
tblSales.Customer,
tblSales.SaleMonth,
tblSales.Sales,
DCount("*","tblSales","[SaleMonth]=#" & [SaleMonth] & "#
AND
( ([Sales]<" & [Sales] & ")
OR
([Sales] = " & [Sales] & "
AND [Customer] < '" & [Customer] & "'))")+1 AS SaleRank,
DCount("*","tblSales","[SaleMonth]=#" & [SaleMonth] & "#") AS TotCnt,
([SaleRank]/[TotCnt])*100 AS Percentile
FROM tblSales;

result:


Customer
SaleMonth
Sales
SaleRank
TotCnt
Percentile
X 1/1/2007 50000 1 3 33.3333333333333
Y 1/1/2007 75000 2 3 66.6666666666667
Z 1/1/2007 80000 3 3 100
X 2/1/2007 50000 1 2 50
Z 2/1/2007 80000 2 2 100

Our journey starts with changing above to crosstab...

for now we want to use SaleMonth, TotCnt, and SaleRank
as Row Headings (this SaleRank is here in RowHeading now
only to understand how this will work),

another instance of SaleRank as Column Heading (PIVOT),

and Sales as the Value -- with our grouping the Value
will be distinct sales values for each distinct rank column so aggregate
could be "anything" -- we will choose MAX, but could be FIRST, etc.

Here's how it might look in query grid:

Field: SaleMonth TotCnt: DCount(.. SaleRank: DCount(...
Table: tblSales
Total: Group By Group By Group By
Crosstab:Row Heading Row Heading Row Heading
Sort:
Criteria:

Field: Expr1: DCount(... S: Sales Sales
Table: tblSales tblSales
Total: Group By Max Where
Crosstab: Column Heading Value
Sort:
Criteria: IS NOT NULL

the SQL:

TRANSFORM Max(tblSales.Sales) AS S
SELECT
tblSales.SaleMonth,
DCount("*","tblSales","[SaleMonth]=#" & [SaleMonth] & "#") AS TotCnt,
DCount("*","tblSales","[SaleMonth]=#" & [SaleMonth] & "#
AND
( ([Sales]<" & [Sales] & ")
OR
([Sales] = " & [Sales] & "
AND [Customer] < '" & [Customer] & "'))")+1 AS SaleRank,
FROM tblSales
WHERE (((tblSales.Sales) Is Not Null))
GROUP BY
tblSales.SaleMonth,
DCount("*","tblSales","[SaleMonth]=#" & [SaleMonth] & "#
AND
( ([Sales]<" & [Sales] & ")
OR
([Sales] = " & [Sales] & "
AND [Customer] < '" & [Customer] & "'))")+1
PIVOT
DCount("*","tblSales","[SaleMonth]=#" & [SaleMonth] & "#
AND
( ([Sales]<" & [Sales] & ")
OR
([Sales] = " & [Sales] & "
AND [Customer] < '" & [Customer] & "'))")+1;

the results:

SaleMonth
TotCnt
SaleRank 1 2 3
1/1/2007 3 1 50000
1/1/2007 3 2 75000
1/1/2007 3 3 80000
2/1/2007 2 1 50000
2/1/2007 2 2 80000

Based on our formula above, if we have a target percentile (50),
then we should be able to choose the Sales(s) that give
us the 50th percentile (median) based on rank.

Percentile = (Rank/TotalCount) * 100

--> Rank = (TotalCnt * Percentile)/100

But Rank is integer and our calculation is probably float!

We handle that with Fix() function over calculation,
then add 1.

Rank = Fix((TotalCnt * Percentile)/100) + 1

i.e., for 1/1/2007,

Rank = Fix((3 * 50)/100) + 1 = 2
--> median = Sales where Rank = 2
--> median = 75000

BUT that does not work for 2/1/2007! We would think that

Rank = Fix((TotalCnt * Percentile)/100) + 1 = Fix((2*50)/100) + 1 = 2
--> median = Sales where Rank = 2
--> median = 80000
BUT that's not right!

It turns out in this case we need to get 2 Sales and average them,
and their ranks are determined by

Rank1 = Fix((TotalCnt * Percentile)/100) + 1 = 2
Rank2 = Fix((TotalCnt * Percentile)/100) = 1
--> median = Avg(Sales where rank = 2 and 1)
--> median = Avg(50000 + 80000) = 65000

How do we identify when we need to use 2 values?

(TotCnt * Percentile) MOD 100

if <> 0, use

Rank = Fix((TotalCnt * Percentile)/100) + 1

if = 0, use Rank1 and Rank2

Let's add some expressions to our query grid as Row Headings:

Field: Percentile: 50
Table:
Total: Expression
Crosstab: Row Heading
Sort:
Criteria:

Field:TotCntxPercentileMOD100: ([TotCnt]*[Percentile]) Mod 100
Table:
Total: Expression
Crosstab: Row Heading
Sort:
Criteria:

Field: R1: Fix(([TotCnt]*[Percentile])/100)+1
Table:
Total: Expression
Crosstab: Row Heading
Sort:
Criteria:

Field: R2: Fix(([TotCnt]*[Percentile])/100)
Table:
Total: Expression
Crosstab: Row Heading
Sort:
Criteria:


TRANSFORM Max(tblSales.Sales) AS S
SELECT
tblSales.SaleMonth,
DCount("*","tblSales","[SaleMonth]=#" & [SaleMonth] & "#
AND
( ([Sales]<" & [Sales] & ")
OR
([Sales] = " & [Sales] & "
AND [Customer] < '" & [Customer] & "'))")+1 AS SaleRank,
50 AS Percentile,
([TotCnt]*[Percentile]) Mod 100 AS TotCntxPercentileMOD100,
Fix(([TotCnt]*[Percentile])/100)+1 AS R1,
Fix(([TotCnt]*[Percentile])/100) AS R2
FROM tblSales
WHERE (((tblSales.Sales) Is Not Null))
GROUP BY
tblSales.SaleMonth,
DCount("*","tblSales","[SaleMonth]=#" & [SaleMonth] & "#
AND
( ([Sales]<" & [Sales] & ")
OR
([Sales] = " & [Sales] & "
AND [Customer] < '" & [Customer] & "'))")+1
PIVOT
DCount("*","tblSales","[SaleMonth]=#" & [SaleMonth] & "#
AND
( ([Sales]<" & [Sales] & ")
OR
([Sales] = " & [Sales] & "
AND [Customer] < '" & [Customer] & "'))")+1;



SaleMonth
TotCnt
SaleRank
Percentile
TotCntxPercentileMOD100
R1
R2 1 2 3
1/1/2007 3 1 50 50 2 1 50000
1/1/2007 3 2 50 50 2 1 75000
1/1/2007 3 3 50 50 2 1 80000
2/1/2007 2 1 50 0 2 1 50000
2/1/2007 2 2 50 0 2 1 80000

So now you know what will be happening "internally"
in our final query.

We only want to group by SaleMonth. All other row headings
that cause our result to "expand" beyond those groups
will go away to preserve the group, but will be used
within an AVG aggregate row heading where we sift out
what to use via SWITCH() functions.

Loosely defined, the SWITCH() functions works like:

SWITCH(cond1,do1,cond2,do2)

if condition1 is true, then do "do1"
else if condition2 is true, do "do2"
else return null

By giving the TRANSFORM Value an alias ("S"), we can
use S "internally as each group is being made," i.e.,
we can say "include S in the AVG if rank meets condition
described above using our SWITCH() functions."

See how the first SWITCH determines if "MOD100"
is not 0
--> return S where rank meets condition R1
or is 0
--> return S where rank in R1 or R2

Then what S's get returned for the group get averaged.

Avg(
Switch(
([TotCnt]*[Percentile]) Mod 100<>0,
Switch(
[SaleRank]=Fix(([TotCnt]*[Percentile])/100)+1,

),
True,
Switch(
[SaleRank] IN (
Fix(([TotCnt]*[Percentile])/100)+1,
Fix(([TotCnt]*[Percentile])/100)
),

)
)
) AS Median


So...the final crosstab:

TRANSFORM Max(tblSales.Sales) AS S
SELECT
tblSales.SaleMonth,
DCount("*","tblSales","[SaleMonth]=#" & [SaleMonth] & "#") AS TotCnt,
DCount("*","tblSales","[SaleMonth]=#" & [SaleMonth] & "# AND ( ([Sales]<" &
[Sales] & ") OR ([Sales] = " & [Sales] & " AND [Customer] < '" & [Customer]
& "'))")+1 AS

SaleRank,
50 AS Percentile,
Avg(
Switch(
([TotCnt]*[Percentile]) Mod 100<>0,
Switch(
[SaleRank]=Fix(([TotCnt]*[Percentile])/100)+1,

),
True,
Switch(
[SaleRank] IN (
Fix(([TotCnt]*[Percentile])/100)+1,
Fix(([TotCnt]*[Percentile])/100)
),

)
)
) AS Median

FROM tblSales
WHERE (((tblSales.Sales) Is Not Null))
GROUP BY
tblSales.SaleMonth
PIVOT
DCount("*","tblSales","[SaleMonth]=#" & [SaleMonth] & "# AND ( ([Sales]<" &
[Sales] & ") OR ([Sales] = " & [Sales] & " AND [Customer] < '" & [Customer]
& "'))")+1;


and the results are (where one might hide column SaleRank):


SaleMonth
TotCnt
SaleRank
Percentile
Median 1 2 3
1/1/2007 3 1 50 75000 50000 75000 80000
2/1/2007 2 1 50 65000 50000 80000

Good luck,

gary
 
yeah...after typing all that out, I gave wrong final SQL (argh)...

TRANSFORM Max(tblSales.Sales) AS S
SELECT
tblSales.SaleMonth,
DCount("*","tblSales","[SaleMonth]=#" & [SaleMonth] & "#") AS TotCnt,
DCount("*","tblSales",
"[SaleMonth]=#" & [SaleMonth] &
"# AND ( ([Sales]<" & [Sales] &
") OR
([Sales] = " & [Sales] &
" AND [Customer] < '" & [Customer] & "'))")+1
AS SaleRank,
50 AS Percentile,
Avg(Switch(
([TotCnt]*[Percentile]) Mod 100<>0,
Switch([SaleRank]=Fix(([TotCnt]*[Percentile])/100)+1,),
True,
Switch([SaleRank] In
(
Fix(([TotCnt]*[Percentile])/100)+1,
Fix(([TotCnt]*[Percentile])/100)
),)
)
) AS Median
FROM tblSales
WHERE (((tblSales.Sales) Is Not Null))
GROUP BY
tblSales.SaleMonth
PIVOT
DCount("*","tblSales",
"[SaleMonth]=#" & [SaleMonth] &
"# AND ( ([Sales]<" & [Sales] &
") OR
([Sales] = " & [Sales] &
" AND [Customer] < '" & [Customer] & "'))")+1;

your grid would look differently that what I gave previously
("expression" instead of "group by" for TotCnt & SaleRank):

Field: SaleMonth TotCnt: DCount(.. SaleRank: DCount(...
Table: tblSales
Total: Group By Expression Expression
Crosstab:Row Heading Row Heading Row Heading
Sort:
Criteria:

Please respond back if I was not clear about something.

gary
 
Hi Max,

First, maybe not explicitly referenced, but all credit for
this method (if I explained correctly) goes to Steve
Dassin who demoed this back in 1998 (Google Group
search of "Access Steve Dassin Percentiles" should pull it up).

Second, in construction of DCount/SalesRank, the field
Customer was used to handle ties in Sales, but would be
less efficient than if your data had a primary key to use
for these ties in Sales instead (DCount is so slow anyway).

good luck,

gary


"Max"wrote:
thanx Gary for this detailed information, I ll check out this solution :-)

Gary Walter said:
yeah...after typing all that out, I gave wrong final SQL (argh)...

TRANSFORM Max(tblSales.Sales) AS S
SELECT
tblSales.SaleMonth,
DCount("*","tblSales","[SaleMonth]=#" & [SaleMonth] & "#") AS TotCnt,
DCount("*","tblSales",
"[SaleMonth]=#" & [SaleMonth] &
"# AND ( ([Sales]<" & [Sales] &
") OR
([Sales] = " & [Sales] &
" AND [Customer] < '" & [Customer] & "'))")+1
AS SaleRank,
50 AS Percentile,
Avg(Switch(
([TotCnt]*[Percentile]) Mod 100<>0,
Switch([SaleRank]=Fix(([TotCnt]*[Percentile])/100)+1,),
True,
Switch([SaleRank] In
(
Fix(([TotCnt]*[Percentile])/100)+1,
Fix(([TotCnt]*[Percentile])/100)
),)
)
) AS Median
FROM tblSales
WHERE (((tblSales.Sales) Is Not Null))
GROUP BY
tblSales.SaleMonth
PIVOT
DCount("*","tblSales",
"[SaleMonth]=#" & [SaleMonth] &
"# AND ( ([Sales]<" & [Sales] &
") OR
([Sales] = " & [Sales] &
" AND [Customer] < '" & [Customer] & "'))")+1;

your grid would look differently that what I gave previously
("expression" instead of "group by" for TotCnt & SaleRank):

Field: SaleMonth TotCnt: DCount(.. SaleRank: DCount(...
Table: tblSales
Total: Group By Expression Expression
Crosstab:Row Heading Row Heading Row Heading
Sort:
Criteria:

Please respond back if I was not clear about something.

gary
 
Back
Top