Example of best subsets regression

Total heat flux is measured as part of a solar thermal energy test. You wish to see how total heat flux is predicted by other variables: insolation, the position of the focal points in the east, south, and north directions, and the time of day. Data are from Montgomery and Peck [31], page 486.

1    Open the worksheet EXH_REGR.MTW.

2    Choose Stat > Regression > Regression > Best Subsets.

3    In Response, enter HeatFlux.

4    In Free Predictors, enter Insolation-Time. Click OK.

Session window output

Best Subsets Regression: HeatFlux versus Insolation, East, ...

Response is HeatFlux

             R-Sq    R-Sq  Mallows
Vars  R-Sq  (adj)  (pred)       Cp       S  Insolation  East  South  North  Time
   1  72.1   71.0    66.9     38.5  12.328                             X
   1  39.4   37.1    26.3    112.7  18.154      X
   2  85.9   84.8    81.4      9.1  8.9321                      X      X
   2  82.0   80.6    74.2     17.8  10.076                             X     X
   3  87.4   85.9    79.0      7.6  8.5978               X      X      X
   3  86.5   84.9    81.4      9.7  8.9110      X               X      X
   4  89.1   87.3    80.6      5.8  8.1698      X        X      X      X
   4  88.0   86.0    79.3      8.2  8.5550      X               X      X     X
   5  89.9   87.7    78.8      6.0  8.0390      X        X      X      X     X

Interpreting the results

Each line of the output represents a different model. Vars is the number of predictors in the model. R-squared and adjusted R-squared are expressed as percentages. An X marks each predictor that is present in the model.
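To make the columns concrete, the following Python sketch enumerates every subset of predictors and computes R-Sq, R-Sq (adj), Mallows' Cp, and S for each one, keeping the two best models of each size. It is an illustration only, not the procedure the software uses: the data are synthetic (not EXH_REGR.MTW), the function names sse_of_fit and best_subsets are hypothetical, and the Cp formula is the common textbook form, assumed here.

```python
from itertools import combinations

import numpy as np


def sse_of_fit(X, y):
    """Residual sum of squares for a least-squares fit with a constant term."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)


def best_subsets(X, y, names, keep=2):
    """Return the `keep` highest R-Sq models for each number of predictors."""
    n, k = X.shape
    sst = float(((y - y.mean()) ** 2).sum())
    # Mean square error of the model that contains every predictor.
    mse_full = sse_of_fit(X, y) / (n - k - 1)
    rows = []
    for size in range(1, k + 1):
        fits = []
        for cols in combinations(range(k), size):
            sse = sse_of_fit(X[:, list(cols)], y)
            p = size + 1  # number of parameters, counting the constant
            fits.append({
                "vars": size,
                "r_sq": 100 * (1 - sse / sst),
                "r_sq_adj": 100 * (1 - (sse / (n - p)) / (sst / (n - 1))),
                "cp": sse / mse_full - (n - 2 * p),  # assumed textbook form
                "s": (sse / (n - p)) ** 0.5,
                "predictors": [names[c] for c in cols],
            })
        fits.sort(key=lambda f: f["r_sq"], reverse=True)
        rows.extend(fits[:keep])
    return rows


# Synthetic data for illustration only (not the EXH_REGR.MTW worksheet).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=30)
rows = best_subsets(X, y, names=["A", "B", "C"])

for r in rows:
    print(f'{r["vars"]:>4} {r["r_sq"]:6.1f} {r["r_sq_adj"]:6.1f} '
          f'{r["cp"]:8.1f} {r["s"]:8.4f}  {" ".join(r["predictors"])}')
```

Exhaustive enumeration fits 2^k - 1 models, so this approach is practical only for a small number of predictors.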

In this example, it isn't clear which model fits the data best. The model with all the variables has the highest adjusted R-squared (87.7%), a low Mallows' Cp value (6.0), and the lowest S value (8.0390). The four-predictor model with all variables except Time has a lower Cp value (5.8), although S is slightly higher (8.1698) and adjusted R-squared is slightly lower (87.3%). The best three-predictor model includes North, South, and East, with a slightly higher Cp value (7.6) and a lower adjusted R-squared (85.9%).
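As background for reading the Cp column, here is a commonly used definition of Mallows' Cp, assumed (not confirmed by this example) to match the one behind the output. The symbols are introduced here for illustration: n is the number of observations, p is the number of parameters in the candidate model including the constant, and m is the number of parameters in the model with all predictors.

```latex
% Mallows' C_p for a candidate model with p parameters
C_p = \frac{SSE_p}{MSE_m} - (n - 2p)

% For the model with all predictors, p = m and SSE_m = (n - m)\,MSE_m, so
C_m = (n - m) - (n - 2m) = m
```

Under this definition, the model with all five predictors plus the constant has Cp = 6, which agrees with the 6.0 shown in the last line of the output. A Cp value close to p is usually read as a sign of little bias; the four-predictor model without Time has p = 5 and Cp = 5.8.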

The best two-predictor model includes North and South and is tied with the three-predictor model that includes Insolation, South, and North for the highest predicted R-squared (81.4%). This suggests that the models with additional predictors may be overfitting the data. Overfit models appear to explain the relationship between the predictor and response variables for the data set used to fit the model, but fail to provide valid predictions for new observations. If you are mainly interested in predictions for new observations, this two-predictor model may be the best choice, and you will need to measure data for only two predictors. Further, the multiple regression example indicates that adding the variable East does not improve the fit of the model.
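The sketch below shows one way predicted R-squared can be computed, assuming the usual definition 1 - PRESS/SST, where PRESS is the sum of squared leave-one-out prediction errors. The function name r_squared_and_predicted and the synthetic data are hypothetical, and the leave-one-out errors are obtained from the leverages rather than by refitting the model n times.

```python
import numpy as np


def r_squared_and_predicted(X, y):
    """Return (R-Sq, predicted R-Sq) in percent for a fit with a constant."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta

    # Leverages are the diagonal of the hat matrix; with A = QR (reduced QR),
    # they equal the row sums of squares of Q.
    Q, _ = np.linalg.qr(A)
    leverage = (Q ** 2).sum(axis=1)

    # Leave-one-out prediction error for row i is resid_i / (1 - h_i).
    press = float(((resid / (1.0 - leverage)) ** 2).sum())
    sse = float(resid @ resid)
    sst = float(((y - y.mean()) ** 2).sum())
    return 100 * (1 - sse / sst), 100 * (1 - press / sst)


# Synthetic data for illustration only: just the first column affects y,
# so the remaining four columns are noise predictors.
rng = np.random.default_rng(1)
n = 25
X = rng.normal(size=(n, 5))
y = 3.0 * X[:, 0] + rng.normal(size=n)

r_sq_small, pred_small = r_squared_and_predicted(X[:, :1], y)
r_sq_full, pred_full = r_squared_and_predicted(X, y)
print(f"one predictor : R-Sq {r_sq_small:.1f}  R-Sq(pred) {pred_small:.1f}")
print(f"all predictors: R-Sq {r_sq_full:.1f}  R-Sq(pred) {pred_full:.1f}")
```

Ordinary R-squared can never decrease when predictors are added, whereas predicted R-squared can, which is why a drop in R-Sq (pred) from a smaller model to a larger one is read as a warning sign of overfitting.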

Before choosing a model, always use residual plots and other diagnostic tests to check whether the candidate models violate any regression assumptions. See Checking your model.