Removing statistical outliers

W

William

I have a large range of data (thousands of rows) and I was performing the
functions average and max on cells that had time format 37:30:55. I realized
in the data there were a few records that had an abnormally large value in
them and from discussion they were due to unavoidable circumstances. There
are a few records in each data set I am working with which are skewing my
numbers (very large values). Is there a way to tell these functions (or use a
different function that returns the same results) to exclude outliers or any
outlier larger/smaller then what I specify?

I know you can use the average function, select a range, skip over the
outlier and select another range after the value, and repeat, but now that
the data has so many recrds it is getting to be a tiresome task with the new
data (thousands of additional records) every month that I put on a new tab.
It takes a lot of time to go through thousands of rows to find the outlier
and not select them in the average/max function.
 
L

Luke M

One way to discard outlier is to use an array* function similar to this:

=AVERAGE(IF(A2:A2000<100,A2:A2000))

This formula is designed to take the average of all number sin the range
A2:A2000 that are less than 100.

*Array functions are confirmed using Ctrl+Shift+Enter, not just Enter.
 
W

William

Luke, how do I use time in that array? My time is format hh:mm:ss do I use
quotes, single quotes??
 
J

Jacob Skaria

24:00 equates to 1. So in the below formula you are filtering entries which
are less than 100 which means you are filtering down to hours which are less
than 2400:00:00 which should an ideal case to go with...

=MAX(IF(A2:A2000<100,A2:A2000))
for finding the max

If this post helps click Yes
 
W

William

That kind of sucks. I have dozens and dozens of data sets, each with a
different outlier some are wierd like < 434 hours or < 820 hours. It will be
a nightmare for me to try and caclulate what 434 equals if 24 is 1. No easier
way, like just using the real time value????
 
J

Jacob Skaria

I understand you are looking to find Average and max from large data sets...

Have you tried the formula Luke has given on average and the below max
formulas (array formulas)..

=MAX(IF(A2:A2000<100,A2:A2000))

It would be easy for us to suggest a solution if you can give more
information on how your data is arrranged....

If this post helps click Yes
 
W

William

I tried the formula, and it works, however my outliers are strange numbers
at times (i.e. 2224:18:40). My spreadsheet is pretty simple. Column A is the
date and time when a report was received in format mm/dd/yy hh:mm:ss AM/PM
and the column B is the time it was acknowledged in the same format. Column C
is the difference between column A & B in time format 37:30:55.

I have dozens of tabs and spreadsheets with similiar data (each one unique
to itself representing various things). Some of the records have a huge time
difference due to the fact that someone forgot about it for months and
months. It is always different for outliers one tab has that huge one over
2224 another is 50:34:56 and another is 6:46:20.

Even though the formula works great, I would have to take time out to find
what number to use in the formula for the outliers instead of using a default
formula in my spreadsheets I can reuse.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top