Identifying Outliers
Outliers are observations whose response or predictor values are unusual
compared with the rest of the data. Minitab provides several ways to
identify outliers, including residual plots and three stored statistics:
leverages, Cook's distance, and DFITS, which are described below. It is
important to identify outliers because they can significantly influence
your model and produce misleading or incorrect results. If you identify
an outlier in your data, examine the observation to understand why it is
unusual and identify an appropriate remedy.
· Leverage values provide information about whether an observation has unusual predictor
values compared to the rest of the data. Leverages are a measure of the
distance between the x-values for an observation and the mean of x-values
for all observations. A large leverage value indicates that the x-values
of an observation are far from the center of x-values for all observations.
Observations with large leverage may exert considerable influence on the
fitted value, and thus the regression model.
Leverage values fall between 0 and 1. A leverage value
greater than 2p/n or 3p/n, where p is the number of predictors plus the
constant and n is the number of observations, is considered large and
should be examined. Minitab marks observations whose leverage exceeds
3p/n or 0.99, whichever is smaller, with an X in the table of unusual observations.
· Cook's distance, or D, is an overall measure of the combined impact of each observation
on the fitted values. Because D is calculated using leverage values and
standardized residuals, it considers whether an observation is unusual
with respect to both x- and y-values. Geometrically, Cook's distance is
a measure of the distance between the fitted values calculated with and
without the ith observation. Large values, which signify unusual
observations, can occur because the observation has 1) a large residual
and moderate leverage, 2) a large leverage and moderate residual, or
3) both a large residual and a large leverage. Some statisticians
recommend comparing D to the F-distribution with (p, n-p) degrees of
freedom. If D is greater than the 50th percentile of that distribution,
then D is considered extreme and should be examined. Other statisticians
recommend comparing the D statistics to one another, identifying values
that are extremely large relative to the others. An easy way to compare
D values is to graph them using Graph > Time Series, where the x-axis
represents the observations, not an index or time period.
· DFITS provides another measure to determine whether an observation is unusual.
It uses the leverage and deleted (Studentized) residual to calculate the
difference between the fitted value calculated with and without the ith observation.
DFITS represents roughly the number of estimated standard deviations that
the fitted value changes when the ith
observation is removed from the data. Some statisticians suggest that
an observation with an absolute DFITS value greater than 2*sqrt(p/n) is influential.
Other statisticians recommend comparing DFITS values to one another, identifying
values that are extremely large relative to the others. An easy way to
compare DFITS is to graph the DFITS values using Graph > Time Series,
where the x-axis represents the observations, not an index or time period.
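
For readers who want to see how the three statistics relate to one another, the following is a minimal Python sketch that computes them with the standard textbook formulas for an ordinary least squares fit with a constant. It is not Minitab's implementation. The function name influence_statistics and the small data set are hypothetical and invented purely for illustration. The sketch applies the leverage rule described above (3p/n or 0.99, whichever is smaller) and then ranks observations by Cook's distance and by absolute DFITS, so the values can be compared to one another.

```python
import numpy as np

def influence_statistics(x, y):
    """Leverages, Cook's distance, and DFITS for an OLS fit with a constant.

    Hypothetical helper for illustration; uses the usual textbook formulas.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    n = x.shape[0]
    A = np.column_stack([np.ones(n), x])  # design matrix: constant + predictors
    p = A.shape[1]                        # number of predictors plus the constant

    # Leverages are the diagonal of the hat matrix; with A = QR,
    # h_i is the sum of squares of row i of Q.
    Q, _ = np.linalg.qr(A)
    h = np.sum(Q ** 2, axis=1)

    beta = np.linalg.lstsq(A, y, rcond=None)[0]
    e = y - A @ beta                      # ordinary residuals
    sse = e @ e
    mse = sse / (n - p)

    # Cook's distance from standardized residuals and leverages.
    r = e / np.sqrt(mse * (1.0 - h))
    cooks_d = r ** 2 * h / (p * (1.0 - h))

    # DFITS from deleted (Studentized) residuals and leverages.
    s2_deleted = (sse - e ** 2 / (1.0 - h)) / (n - p - 1)
    t = e / np.sqrt(s2_deleted * (1.0 - h))
    dfits = t * np.sqrt(h / (1.0 - h))
    return h, cooks_d, dfits, p

# Made-up data: the last observation has an x-value far from the others.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 20.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.2, 12.0])

h, cooks_d, dfits, p = influence_statistics(x, y)
n = len(y)

# Leverage rule from the text: over 3p/n or 0.99, whichever is smaller.
flagged = h > min(3.0 * p / n, 0.99)

# Rank observations so unusually large values stand out relative to the rest.
by_cooks = np.argsort(-cooks_d)
by_dfits = np.argsort(-np.abs(dfits))

print("large leverage at rows:", np.flatnonzero(flagged))
print("rows ordered by Cook's D:", by_cooks)
print("rows ordered by |DFITS|:", by_dfits)
```

If SciPy is available, the 50th percentile of the F-distribution used in the Cook's distance comparison could be obtained with scipy.stats.f.ppf(0.5, p, n - p). It is left out here so that the sketch depends only on NumPy.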