PC Review


Reply
 
 
Zone
Guest
Posts: n/a
 
      9th Jul 2007
I have a column of numbers. I want to eliminate all data in the
column that is greater than 2.5 standard deviations from the mean,
leaving only the data that is within 2.5 StdDevs. I have a lot of
data, so I need to do this as quickly as possible so I can then get
statistical quartiles for the data. Any help on one or both issues
greatly appreciated! James

 
Reply With Quote
 
 
 
 
johnhildreth@citynet.net
Guest
Posts: n/a
 
      9th Jul 2007
James,

There are many "sexy" ways to do it, but if you only have to do it
once or twice....

1) Calculate the mean and standard deviation
2) Calculate the upper and lower cutoff values
3) Sort the data
4) Manually delete those above/below the cutoff values

If this is something you will do on a regular basis, then it may be
worth pursing a more automated method.

HTH,
John

 
Reply With Quote
 
Zone
Guest
Posts: n/a
 
      9th Jul 2007
John, Thank you for your reply. I am well-versed in VBA programming,
but a complete dunce at advanced mathematics like this. The problem
is that the data is very "raw" -- about 12,000 datapoints ranging from
0.22 to over 100,000. You probably know that this is the kind of mess
one sometimes gets from clients who don't really understand the ratios
that are being asked for. The objective is to pare the data down to
the "reasonable" values and get the bulk of it for subsequent figuring
of quartiles. May I use the Reply to Author button to reply to you
directly? James

On Jul 9, 1:04?pm, johnhildr...@citynet.net wrote:
> James,
>
> There are many "sexy" ways to do it, but if you only have to do it
> once or twice....
>
> 1) Calculate the mean and standard deviation
> 2) Calculate the upper and lower cutoff values
> 3) Sort the data
> 4) Manually delete those above/below the cutoff values
>
> If this is something you will do on a regular basis, then it may be
> worth pursing a more automated method.
>
> HTH,
> John



 
Reply With Quote
 
johnhildreth@citynet.net
Guest
Posts: n/a
 
      9th Jul 2007
James,

Sorry, that email doesn't exist anymore. The programming is fairly
straight forward. I'll do what I can to help you, but if it isn't
something you are going to do on a regular basis....manual is the way
to go.

John


 
Reply With Quote
 
Zone
Guest
Posts: n/a
 
      9th Jul 2007
John, I understand. I'll plow away at the problem in here in hopes
that you will stick with me for a while and not get irritated with my
lack of mathematical expertise. I would like to use built-in Excel
functions as much as possible to speed the calculation. Your first
step, "Calculate the mean". Would I use Excel's AVERAGE function for
this? Say column M contains 12,410 numbers, starting on row 2. Would
I then use a formula such as =AVERAGE(m2:m12411) to get the mean? A
friend has told me that what is really wanted is the "threshold to the
second quartile", so I'm confused. James

On Jul 9, 1:48?pm, johnhildr...@citynet.net wrote:
> James,
>
> Sorry, that email doesn't exist anymore. The programming is fairly
> straight forward. I'll do what I can to help you, but if it isn't
> something you are going to do on a regular basis....manual is the way
> to go.
>
> John



 
Reply With Quote
 
=?Utf-8?B?VG9tIE9naWx2eQ==?=
Guest
Posts: n/a
 
      9th Jul 2007
What is the basis for excluding data that is beyond 2.5 Std Dev from the mean.

Why not use the built in quartiles worksheet function against the raw data
to get the quartiles from the raw data. If the data is incorrect, throwing
out the data you suggest doesn't seem to have much justification to me. You
need to get correct data. Nonetheless:

Yes
=Average($m$2:$M$121411)

would give you the average

You could use an adjacent empty column to do

Assume the data points are in M as you show, in R2 for example

=if(Abs(M2-Average($M$2:$M$12411))>2.5*Stdev($m$2:$M$121411),"Delete","Keep")

then do an autofilter on column R and select Delete in the dropdown. Select
all the data except the first row and do Edit=>Delete and select entirerow.
Note that once you delete the rows (and remove the filter), the formulas will
recalculate and new mean and standard deviation and more rows will show up as
Delete, but you can ignore these.

--
Regards,
Tom Ogilvy


then drag fill down to



"Zone" wrote:

> John, I understand. I'll plow away at the problem in here in hopes
> that you will stick with me for a while and not get irritated with my
> lack of mathematical expertise. I would like to use built-in Excel
> functions as much as possible to speed the calculation. Your first
> step, "Calculate the mean". Would I use Excel's AVERAGE function for
> this? Say column M contains 12,410 numbers, starting on row 2. Would
> I then use a formula such as =AVERAGE(m2:m12411) to get the mean? A
> friend has told me that what is really wanted is the "threshold to the
> second quartile", so I'm confused. James
>
> On Jul 9, 1:48?pm, johnhildr...@citynet.net wrote:
> > James,
> >
> > Sorry, that email doesn't exist anymore. The programming is fairly
> > straight forward. I'll do what I can to help you, but if it isn't
> > something you are going to do on a regular basis....manual is the way
> > to go.
> >
> > John

>
>
>

 
Reply With Quote
 
johnhildreth@citynet.net
Guest
Posts: n/a
 
      9th Jul 2007
James,

I'm thinking like Tom....what is it exactly you are trying to get to?

Are you trying to eliminate outliers from your dataset?....all those
beyond 2.5 std devs. Or are you trying to find the 25th and 75th
quartiles?

John

 
Reply With Quote
 
Zone
Guest
Posts: n/a
 
      9th Jul 2007
Tom and John,
I am paying very close attention to everything you say. The
problem is that for a particular ratio in a particular industry (such
as supply expense per client per day), it is known and published by
the experts that the figure should be in the range of a few hundred
dollars. Depending on other factors (which are captured), this could
vary between $200 or so and maybe $700. Problem is, the incoming data
includes figures such as 22 cents and $125,000, both of which are
obviously incorrect. Alas, we have no control over the incoming data,
so the best we can do is to try to grab the "reasonable" figures that
are the bulk of the data and ignore the "unreasonable" figures that
have resulted from data input errors, misunderstanding of what is
being asked, etc.
What is wanted is Top, Second and Third Quartiles for each
indicator. These cannot be figured unless the ridiculous data is
excluded. Tom, I presume that when you say the AVERAGE function will
get me the average, this is the same as the mean, since neither
StdDevs nor quartiles can be figured without the mean, to my limited
mathematical knowledge.
I acually wasn't aware that there were a spreadsheet function for
quartiles, so that is interesting. I take it that this reveals true
quartiles, not just counting down 3 in a set of 12 data, for instance.
I am grateful to both of you for your interest in my problem. I
readily admit to being mathematically obtuse. I have hundreds of
indicators to run, so I'm looking for a fast and efficient way to
handle this. James
On Jul 9, 2:56?pm, johnhildr...@citynet.net wrote:
> James,
>
> I'm thinking like Tom....what is it exactly you are trying to get to?
>
> Are you trying to eliminate outliers from your dataset?....all those
> beyond 2.5 std devs. Or are you trying to find the 25th and 75th
> quartiles?
>
> John



 
Reply With Quote
 
=?Utf-8?B?VG9tIE9naWx2eQ==?=
Guest
Posts: n/a
 
      9th Jul 2007
quartile returns the quartile value, not the position of the quartile. if
you had 12 random numbers between 100 and 1000, quartile(A1:A12,1) would
return the appropriate number between 100 and 1000, not 3.75 which would be
the position of the first quartile. See Excel help for details.

Average returns the arithmetic mean.

stdev returns the sample standard deviation. stdevp returns the population
standard deviation. Neither requires you to calculate the arithmetic mean
separately, but of course you would need that value to determine the distance
your datapoint is from the mean; but I already provided that information to
you.

--
Regards,
Tom Ogilvy




"Zone" wrote:

> Tom and John,
> I am paying very close attention to everything you say. The
> problem is that for a particular ratio in a particular industry (such
> as supply expense per client per day), it is known and published by
> the experts that the figure should be in the range of a few hundred
> dollars. Depending on other factors (which are captured), this could
> vary between $200 or so and maybe $700. Problem is, the incoming data
> includes figures such as 22 cents and $125,000, both of which are
> obviously incorrect. Alas, we have no control over the incoming data,
> so the best we can do is to try to grab the "reasonable" figures that
> are the bulk of the data and ignore the "unreasonable" figures that
> have resulted from data input errors, misunderstanding of what is
> being asked, etc.
> What is wanted is Top, Second and Third Quartiles for each
> indicator. These cannot be figured unless the ridiculous data is
> excluded. Tom, I presume that when you say the AVERAGE function will
> get me the average, this is the same as the mean, since neither
> StdDevs nor quartiles can be figured without the mean, to my limited
> mathematical knowledge.
> I acually wasn't aware that there were a spreadsheet function for
> quartiles, so that is interesting. I take it that this reveals true
> quartiles, not just counting down 3 in a set of 12 data, for instance.
> I am grateful to both of you for your interest in my problem. I
> readily admit to being mathematically obtuse. I have hundreds of
> indicators to run, so I'm looking for a fast and efficient way to
> handle this. James
> On Jul 9, 2:56?pm, johnhildr...@citynet.net wrote:
> > James,
> >
> > I'm thinking like Tom....what is it exactly you are trying to get to?
> >
> > Are you trying to eliminate outliers from your dataset?....all those
> > beyond 2.5 std devs. Or are you trying to find the 25th and 75th
> > quartiles?
> >
> > John

>
>
>

 
Reply With Quote
 
gimme_this_gimme_that@yahoo.com
Guest
Posts: n/a
 
      9th Jul 2007
Hi Zone,

What makes you certain that 2.5 standard deviations is the right tool
for the job? I fathom you've been given a "left handed monkey wrench".
Tip: there is no such thing as a "left handed monkey wrench".

To put things in perspective:

Without qualifying a particular probablity density function, a
standard deviation is associated a normal curve - something that is
equally skewed to the left and right and looks like a single hump with
the middle of the distribution correspoding to both the mean and
median.The height of the hump and thickness of the data from the mean
can vary.

So what does 2.5 standard deviations correspond to?

Assuming a normal curve of mean 1 and STD 1:

Data outside of 1 STD corresponds to (assuming you're looking at both
the left and right tail) 33% (about 16% on each side)

Data outside of 2 STDs corresponds to 2.5%

Data outside of 2.5 STDs so closely corresponds to zero that it's not
within the tolerance of error computation. You're talking 1 out of
100,000 (That's my guess). This probablity is so small that you won't
find the value a published table on the normal curve.

In your case it's unlikely the data is distributed normally.

Also, supposed that you get 100,000 observations, that means that on
average you'll only be excluding 1 outlier. Most of the time you won't
be excluding any values.

Also, suppose there are 1000 outliers in 100,000 observations, that
means that, if you use the 2.5 STD approach you're going to end up
keeping most of the hosed observations.

In other words 2.5 STDs isn't the tool for this job.

What you need to do is filter the data, just as you described. Set
limits on minimum and maximum values and exclude anything smaller or
larger.

IF you're working with statisticians and you want to talk the jargon,
you should compute the skew statistic, approximate the statistical
distribution (that might a a Pareto or Beta distribution). At the
least you should confirm your assumption that the data is normally
distributed. And you should analyze whether assuming a normal
distribution is going to do the job.

As to fast ways of dealing with the data ... What is it that is so
time consuming? Just filter the data. What's time consuming about
that?


 
Reply With Quote
 
 
 
Reply

Thread Tools
Rate This Thread
Rate This Thread:

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are Off


Similar Threads
Thread Thread Starter Forum Replies Last Post
More on 2.5 StdDevs Zone Microsoft Excel Programming 0 11th Jul 2007 09:23 PM


Features
 

Advertising
 

Newsgroups
 


All times are GMT +1. The time now is 12:53 PM.