multivariate regression


John

I am trying to teach myself the multivariate regression function in
tools>data analysis>regression. The following dummy variables I made up are
'significant' by the F-test but...what is a layman's explanation for the
Coefficients and Residuals you get? Is there a good Excel page that explains
to 'us' dummies the Summary Output page?

AGE (X1)   PACKS/MONTH (X2)   LUNG CANCER RISK (Y)
30         25                 0.4
40         30                 1.2
50         25                 1.7
60         37                 3.5
70         42                 6.7
80         39                 9.9

Thanks!!!
 

Jerry W. Lewis

A quibble with the terminology in your subject line: "multivariate
regression" is usually taken to mean regression with multiple response (y)
variables per observation. Your example with a single response variable but
multiple predictors is more commonly called "multiple regression" or
"multivariable regression".

The coefficients section says that your model estimates that
Y = -6.92404 + 0.176712*X1 + 0.033481*X2

Note that this model predicts a negative risk of cancer for non-smoking
babies; you should always beware of extrapolating beyond your data!
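
For anyone who wants to check the coefficients outside Excel, here is a minimal sketch in plain Python that solves the two-predictor least-squares problem directly from the six data rows in the question. The function names (fit_two_predictors, predict) are made up for illustration, and the closed-form solution used here only applies to the case of exactly two predictors plus an intercept.

```python
# Data from the question: age, packs per month, lung cancer risk.
age = [30, 40, 50, 60, 70, 80]
packs = [25, 30, 25, 37, 42, 39]
risk = [0.4, 1.2, 1.7, 3.5, 6.7, 9.9]


def fit_two_predictors(x1, x2, y):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2 (hypothetical helper)."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    # Sums of squares and cross-products about the means.
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    # Solve the 2x2 normal equations for the slopes, then get the intercept.
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2


b0, b1, b2 = fit_two_predictors(age, packs, risk)
print(f"Y = {b0:.5f} + {b1:.6f}*X1 + {b2:.6f}*X2")


def predict(x1, x2):
    """Model prediction for a given age and packs/month."""
    return b0 + b1 * x1 + b2 * x2


# Extrapolating to a non-smoking newborn just returns the intercept,
# which is negative: a nonsensical "risk" far outside the data range.
print("Predicted risk at age 0, 0 packs:", predict(0, 0))
```

The intercept is simply what the fitted plane does at X1 = 0 and X2 = 0, a point nowhere near any of the observations, which is why it need not make physical sense.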

Next, notice the p-values for these individual coefficients. While the
overall model is mildly significant, no single predictor is, although age
shows the strongest evidence for inclusion in the model. It is possible for
observational data to have correlated predictors that make it difficult to
separate which variables are responsible for the response. Your data set is
small, though, and regression on age alone is clearly significant while
regression on smoking level alone is only borderline (p of roughly 0.04).
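
To see how entangled the two predictors are in this particular data set, the following rough sketch (plain Python, with a made-up helper named pearson_r) computes the correlation between age and packs/month, plus the R-squared of each single-predictor regression. For a straight-line fit with one predictor, R-squared is just the squared correlation between that predictor and the response.

```python
from math import sqrt

# Data from the question: age, packs per month, lung cancer risk.
age = [30, 40, 50, 60, 70, 80]
packs = [25, 30, 25, 37, 42, 39]
risk = [0.4, 1.2, 1.7, 3.5, 6.7, 9.9]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Correlation between the two predictors themselves.
r_predictors = pearson_r(age, packs)

# R-squared of each one-predictor regression (squared correlation with Y).
r2_age_only = pearson_r(age, risk) ** 2
r2_packs_only = pearson_r(packs, risk) ** 2

print(f"corr(age, packs)      = {r_predictors:.3f}")
print(f"R^2, risk on age only   = {r2_age_only:.3f}")
print(f"R^2, risk on packs only = {r2_packs_only:.3f}")
```

When the predictors are this strongly correlated, the older subjects are also the heavier smokers, so the data cannot cleanly attribute the rise in risk to one variable or the other.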

In particular, once age is in the model, your data gives no evidence of a dose-response relationship
between the amount smoked and the risk of lung cancer. This could be due to
a number of reasons including
- There may be no dose response relationship (unlikely given external
evidence).
- Your data set is too small relative to the inherent variability to show
much or avoid artifacts.
- All of your data is from relatively heavy smokers (~ 1 pack per day or
more) where a little more or less may not make that much difference. Note
especially that prediction for light smokers or non-smokers of any age would be
extrapolation beyond the data.
- There may be other important predictor variables that are not constant
for these subjects but are not provided in the data set.
....

A residual is the discrepancy between an actual value and the value the model
predicts for it (actual minus predicted).
If you had more data, then you would want to plot them against each of your
predictors. If the model were adequate, you should not see any obvious
patterns in those plots.
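
As a small illustration of what the Residuals column contains, this sketch recomputes the residuals in plain Python using the rounded coefficients quoted above, so the last digits may differ slightly from Excel's output. The variable names are arbitrary.

```python
# Data from the question: age, packs per month, lung cancer risk.
age = [30, 40, 50, 60, 70, 80]
packs = [25, 30, 25, 37, 42, 39]
risk = [0.4, 1.2, 1.7, 3.5, 6.7, 9.9]

# Rounded coefficients from the fitted model quoted earlier in this reply.
b0, b1, b2 = -6.92404, 0.176712, 0.033481

# Predicted risk for each observation, then residual = actual - predicted.
predicted = [b0 + b1 * a + b2 * p for a, p in zip(age, packs)]
residuals = [y - yhat for y, yhat in zip(risk, predicted)]

print("age  packs  actual  predicted  residual")
for a, p, y, yhat, e in zip(age, packs, risk, predicted, residuals):
    print(f"{a:3d}  {p:5d}  {y:6.1f}  {yhat:9.3f}  {e:8.3f}")

# With an intercept in the model, least-squares residuals sum to zero
# (up to the rounding of the coefficients).
print("sum of residuals:", round(sum(residuals), 4))
```

Plotting the residual column against age and against packs/month is the check described above; with only six points, any pattern you see should be read cautiously.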

This barely scratches the surface of the topic. I highly recommend taking a
class or at least reading a good text on regression, such as "Applied
Regression Analysis" by Draper and Smith.

Jerry
 