Finding the 70th percentile

G

Guest

Hello,

I have a huge amount of data in a table. These are times taken to fly
somewhere.

I want to group this data by month and year, and for each month I want to
consider all the flight times in ascending order of duration. From this I
want to capture the 70th percentile data.

Take this example:
We have 200 flights and the durations are 300,301,302,303,...,499 mins
I want to order them (ascending) and pick out the value of the 70th
percentile. In this case, that would mean the 140th flight in the sorted
list, which would have a flight duration of 439 mins.

Is there any way to do this in Access?

I could even do it in Excel if it's not possible in Access - although I can
already do this as a one off, I would like to find a way that would be able
to consider multiple flights, months and years all at the same time - which
would be an absolute nightmare from what I've tried.... so I'd rather find a
way in Access.

Any help would be massively appreciated.

Basil
 
T

Tom Ellison

Dear Basil:

In order to find this, you first need a total count of the rows for each
flight (I'm assuming your table may have a Flight column with numerous rows
for FlightTime of each sample).

You will also need a "rank" for each flight. Likely there will be ties.
Consider a set of flight times of 300, 300, 300, and 301 minutes. The ranks
are 1, 1, 1, and 4, respectively.

The rank you want will be 0.7 * X where X is the total number of flights
mentioned in my first paragraph. For the 4 flights in my sample, you would
want 4 * .7 = 2.8. Do you want the 2nd or 3rd ranked in this case? Would
you round it off, 5 up 4 down, truncate or ruond up, or what? It makes a
difference, and you cannot leave this ambiguous. You must have specifics in
order to program this. Next, there is no rank 2 or 3. Do you want the
highest ranked duration less than or equal to the calculated percentile
rank, or the lowest ranked duration greater than or equal to the calculated
rank? In other words, for my sample, is the 70th percentile duration 1 or
4? Actually, this definition can, and usually does override the question
about rounding, and that's probably what you want.

The ranking query would be:

SELECT Flight, Duration,
(SELECT COUNT(*)
FROM YourTable T1
WHERE T1.Flight = T.Flight
AND T1.Duration < T.Duration)
+ 1) AS Rank
FROM YourTable T

The above uses a correlated subquery to count the number of rows with
shorter duration. Adding one changes the starting rank from 0 to 1 (that's
what humans are accustomed to seeing). T and T1 are aliases, which allows
the query to reference two independent copies of the same table.

Adding to this query, here's the percentiles of every row:

SELECT Flight, Duration,
(SELECT COUNT(*)
FROM YourTable T1
WHERE T1.Flight = T.Flight
AND T1.Duration < T.Duration)
+ 1) AS Rank,
(SELECT COUNT(*)
FROM YourTable T1
WHERE T1.Flight = T.Flight
AND T1.Duration < T.Duration)
+ 1) / (SELECT COUNT(*)
FROM YourTable T1
WHERE T1.Flight = T.Flight)
AS Percentile
FROM YourTable T

Be sure to change YourTable to the actual name of your table.

Please try this much and see how we're doing. I cannot test any of this, as
I do not have your table.

Tom Ellison
 
G

Guest

Start the SQL with: SELECT TOP 70 PERCENT

Make sure that you have the durations sorted in descending order and only
one month/year combination selected at a time.
 
G

Guest

Hi Tom,

Thank you so much for this. It is really good. (thanks also to Jerry - that
also worked - but I couldn't get it to handle the groupings).

I found a problem with the first query though and thought you might know why
it's not quite right. This is the SQL:

SELECT tblData.Flight, tblData.[Total Departure Taxy Time],
((SELECT Count(*)
FROM tblData AS tblData_1
WHERE tblData_1.Flight = tblData.Flight
AND tblData_1.[Total Departure Taxy Time] < tblData.[Total Departure Taxy
Time]) + 1) AS Rank
FROM tblData;

The problem lies in that it seems to rank to a higher number than the number
of rows that exist - eg. (sorted by rank)

Row Flt Duration Rank
1 BA1 5 1
2 BA1 6 2
3 BA1 6 2
4 BA1 7 4
5 BA1 7 4
.. . . .
20 BA1 7 4
21 BA1 8 22
.. . . .
.. . . .

Now surely row 21 should have been rank 21 not 22?

The bottom row (highest duration) is rank 2887 when there are only actually
2882 rows of data.

Have I got something wrong in the query or is this an inaccuracy I'm going
to have to live with?

Thanks,

Basil
 
T

Tom Ellison

Dear Basil:

Here's something that may help debug the problem:

SELECT Flight, Duration,
(SELECT COUNT(*)
FROM YourTable T1
WHERE T1.Flight = T.Flight
AND T1.Duration < T.Duration)
+ 1) AS Rank,
(SELECT COUNT(*)
FROM YourTable T1
WHERE T1.Flight = T.Flight
AND T1.Duration = T.Duration)
AS RepeatCount
FROM YourTable T

The new column "repeat count" tells you how many rows of the current values
for Flight and Duration there are. This number should add in to the current
Rank value when the next Rank turns up.

Also, try this:

SELECT Flight, Duration
FROM YourTable
WHERE Flight = "BA1"
AND Duration = 7

How many rows does it give? 17?

Also, check that the results you are viewing are sorted by Flight then
Duration. Otherwise, the Row count would not be meaningful. If the rows
were added to the table in almost sequential order, and you don't explicitly
sort your results when viewing them, then they may appear to be in nearly
sequential order, but not be exactly in order.

Tom Ellison


Basil said:
Hi Tom,

Thank you so much for this. It is really good. (thanks also to Jerry -
that
also worked - but I couldn't get it to handle the groupings).

I found a problem with the first query though and thought you might know
why
it's not quite right. This is the SQL:

SELECT tblData.Flight, tblData.[Total Departure Taxy Time],
((SELECT Count(*)
FROM tblData AS tblData_1
WHERE tblData_1.Flight = tblData.Flight
AND tblData_1.[Total Departure Taxy Time] < tblData.[Total Departure Taxy
Time]) + 1) AS Rank
FROM tblData;

The problem lies in that it seems to rank to a higher number than the
number
of rows that exist - eg. (sorted by rank)

Row Flt Duration Rank
1 BA1 5 1
2 BA1 6 2
3 BA1 6 2
4 BA1 7 4
5 BA1 7 4
. . . .
20 BA1 7 4
21 BA1 8 22
. . . .
. . . .

Now surely row 21 should have been rank 21 not 22?

The bottom row (highest duration) is rank 2887 when there are only
actually
2882 rows of data.

Have I got something wrong in the query or is this an inaccuracy I'm going
to have to live with?

Thanks,

Basil

Tom Ellison said:
Dear Basil:

In order to find this, you first need a total count of the rows for each
flight (I'm assuming your table may have a Flight column with numerous
rows
for FlightTime of each sample).

You will also need a "rank" for each flight. Likely there will be ties.
Consider a set of flight times of 300, 300, 300, and 301 minutes. The
ranks
are 1, 1, 1, and 4, respectively.

The rank you want will be 0.7 * X where X is the total number of flights
mentioned in my first paragraph. For the 4 flights in my sample, you
would
want 4 * .7 = 2.8. Do you want the 2nd or 3rd ranked in this case?
Would
you round it off, 5 up 4 down, truncate or ruond up, or what? It makes a
difference, and you cannot leave this ambiguous. You must have specifics
in
order to program this. Next, there is no rank 2 or 3. Do you want the
highest ranked duration less than or equal to the calculated percentile
rank, or the lowest ranked duration greater than or equal to the
calculated
rank? In other words, for my sample, is the 70th percentile duration 1
or
4? Actually, this definition can, and usually does override the question
about rounding, and that's probably what you want.

The ranking query would be:

SELECT Flight, Duration,
(SELECT COUNT(*)
FROM YourTable T1
WHERE T1.Flight = T.Flight
AND T1.Duration < T.Duration)
+ 1) AS Rank
FROM YourTable T

The above uses a correlated subquery to count the number of rows with
shorter duration. Adding one changes the starting rank from 0 to 1
(that's
what humans are accustomed to seeing). T and T1 are aliases, which
allows
the query to reference two independent copies of the same table.

Adding to this query, here's the percentiles of every row:

SELECT Flight, Duration,
(SELECT COUNT(*)
FROM YourTable T1
WHERE T1.Flight = T.Flight
AND T1.Duration < T.Duration)
+ 1) AS Rank,
(SELECT COUNT(*)
FROM YourTable T1
WHERE T1.Flight = T.Flight
AND T1.Duration < T.Duration)
+ 1) / (SELECT COUNT(*)
FROM YourTable T1
WHERE T1.Flight = T.Flight)
AS Percentile
FROM YourTable T

Be sure to change YourTable to the actual name of your table.

Please try this much and see how we're doing. I cannot test any of this,
as
I do not have your table.

Tom Ellison
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top