Identifying Outliers

Outliers are observations whose response or predictor values are unusually large or small relative to the rest of the data. Minitab provides several ways to identify outliers, including residual plots and three stored statistics: leverages, Cook's distance, and DFITS, which are described below. It is important to identify outliers because they can strongly influence your model and produce misleading or incorrect results. If you identify an outlier in your data, examine the observation to understand why it is unusual and identify an appropriate remedy.

·    Leverage values provide information about whether an observation has unusual predictor values compared to the rest of the data. Leverages are a measure of the distance between the x-values for an observation and the mean of x-values for all observations. A large leverage value indicates that the x-values of an observation are far from the center of x-values for all observations. Observations with large leverage may exert considerable influence on the fitted value, and thus the regression model.

Leverage values fall between 0 and 1. A leverage value greater than 2p/n (or, under a stricter cutoff, 3p/n), where p is the number of predictors plus the constant and n is the number of observations, is considered large and should be examined. Minitab marks observations with leverage over 3p/n or 0.99, whichever is smaller, with an X in the table of unusual observations.

·    Cook's distance, or D, is an overall measure of the combined impact of each observation on the fitted values. Because D is calculated using leverage values and standardized residuals, it considers whether an observation is unusual with respect to both x- and y-values. Geometrically, Cook's distance is a measure of the distance between the fitted values calculated with and without the ith observation. Large values, which signify unusual observations, can occur because the observation has 1) a large residual and moderate leverage, 2) a large leverage and moderate residual, or 3) both a large residual and a large leverage. Some statisticians recommend comparing D to the F-distribution with (p, n-p) degrees of freedom. If D is greater than the 50th percentile of that distribution, then D is considered extreme and the observation should be examined. Other statisticians recommend comparing the D statistics to one another, identifying values that are extremely large relative to the others. An easy way to compare D values is to graph them using Graph > Time Series, where the x-axis represents the observations, not an index or time period.

·    DFITS provides another measure to determine whether an observation is unusual. It uses the leverage and deleted (Studentized) residual to calculate the difference between the fitted value calculated with and without the ith observation. DFITS represents roughly the number of estimated standard deviations that the fitted value changes when the ith observation is removed from the data. Some statisticians suggest that an observation whose DFITS value is greater in absolute value than 2*sqrt(p/n) is influential. Other statisticians recommend comparing DFITS values to one another, identifying values that are extremely large relative to the others. An easy way to compare DFITS values is to graph them using Graph > Time Series, where the x-axis represents the observations, not an index or time period.
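
To make the three statistics concrete, the following Python sketch computes leverage, Cook's distance, and DFITS by hand for a least-squares fit with a single predictor (so p = 2). It is an illustration only, not Minitab's implementation: the function name influence_stats and the data values are made up for this example, and the formulas are the standard textbook closed forms for simple linear regression. The leverage flag applies the "3p/n or 0.99, whichever is smaller" rule described above; Cook's D and DFITS are simply printed so they can be compared with one another.

```python
import math

def influence_stats(x, y):
    """Return (leverage, Cook's D, DFITS) per observation for a
    one-predictor least-squares fit. Hypothetical helper for illustration."""
    n = len(x)
    p = 2  # one predictor plus the constant
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    intercept = y_mean - slope * x_mean
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e * e for e in resid) / (n - p)

    rows = []
    for xi, e in zip(x, resid):
        # Leverage: distance of this x-value from the mean of all x-values.
        h = 1 / n + (xi - x_mean) ** 2 / sxx
        # Cook's D combines the residual and the leverage.
        cooks_d = e * e / (p * mse) * h / (1 - h) ** 2
        # Error variance estimated with this observation deleted.
        s2_deleted = ((n - p) * mse - e * e / (1 - h)) / (n - p - 1)
        # Deleted (Studentized) residual, then DFITS.
        t_deleted = e / math.sqrt(s2_deleted * (1 - h))
        dfits = t_deleted * math.sqrt(h / (1 - h))
        rows.append((h, cooks_d, dfits))
    return rows, p, n

# Made-up data: the last observation has an x-value far from the others.
x = [1, 2, 3, 4, 5, 6, 7, 20]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 25.0]

stats, p, n = influence_stats(x, y)
leverage_cutoff = min(3 * p / n, 0.99)

# Observation numbers (1-based) whose leverage exceeds the cutoff.
flagged = [i + 1 for i, (h, _, _) in enumerate(stats) if h > leverage_cutoff]

for i, (h, d, dfits) in enumerate(stats, start=1):
    mark = "X" if i in flagged else ""
    print(f"{i:>3}  leverage={h:.3f}  D={d:.3f}  DFITS={dfits:+.3f}  {mark}")
```

For models with more than one predictor, the leverages are the diagonal elements of the hat matrix rather than the closed form used here; the Cook's D and DFITS formulas are otherwise unchanged.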