Wednesday, March 30, 2011

Data Homogeneity - An excerpt from "Data Analysis with Minitab"

The following is an excerpt from my Data Analysis with Minitab course. I thought this topic was too important not to share, because it's ignored far too often. For more information, see Davis Balestracci's Data Sanity (the paper or the book), or Don Wheeler's The Six Sigma Practitioner's Guide to Data Analysis.

Shape, Center and Spread: Histograms

 

We have discussed simple graphical analysis using histograms. Remember, a histogram lets us get a visual feel for the shape, center and spread of a set of data. Adding the specification limits to a histogram allows us to see performance in relation to the specifications, and any outliers might show up as well.

Important to note: A histogram is a snapshot in time. It shows how the data are “piled.” If the process is not stable, we can’t make any assumptions about the distribution. So, while a histogram is a very useful tool, it’s more useful when used in conjunction with a time-series plot. The following scenarios, adapted from Davis Balestracci’s Data Sanity, illustrate the importance of looking at process data over time.

These scenarios depict the percentage of calls answered within 2 minutes for three different clinics in a metropolitan area. All three sets of data were collected during the same 60-day time period.

What can you say about the performance of the clinics, based on the histograms and data summaries?





The summaries presented in the histograms all show unimodal, fairly symmetrical, bell-shaped piles of data. The p-values for the Anderson-Darling tests for normality are all high, indicating no significant departures from a normal distribution. There are no apparent outliers. The mean percentage for each clinic is a little over 85%, and the standard deviations are all around 2.5%.

The histogram, though, is a snapshot. It only reveals how the data piled up at a particular point in time. The graphic, and its associated summary statistics, can only represent what’s happening at the clinics if the data are homogeneous. These data were gathered over time: what would a picture of the data over time reveal?
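A minimal Python sketch can make this point concrete. It uses made-up data (60 simulated values centered near 85, not the clinics' actual figures), and `summary` is a hypothetical helper. The sketch compares a series in random order with the very same values sorted into a steady upward trend.

```python
import random
from collections import Counter
from statistics import mean, stdev

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Made-up data: 60 daily percentages scattered around 85.
stable = [rng.gauss(85, 2.5) for _ in range(60)]

# The same 60 values, rearranged into a steadily rising sequence.
trending = sorted(stable)

def summary(values):
    """Mean, standard deviation and histogram counts (1-point-wide bins)."""
    bins = Counter(int(v) for v in values)
    return round(mean(values), 6), round(stdev(values), 6), sorted(bins.items())

# The histogram and its statistics ignore time order entirely.
print("same summary:", summary(stable) == summary(trending))
print("same sequence:", stable == trending)
```

The two series produce identical histograms and summary statistics even though one is stable and the other is trending, so only a plot in time order can tell them apart.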

The control chart for Clinic A is below. Although the histogram showed a bell-shaped pattern and a high p-value for the normality test, you can easily see that it can’t represent the data for Clinic A. We caught the process in an overall upward trend, so a histogram of the next sixty days will no doubt look very different from the histogram of the first sixty days.
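For readers who want to see the arithmetic behind such a chart, here is a rough Python sketch of individuals-chart limits. It assumes the common moving-range method (center line plus or minus 2.66 times the average moving range, where 2.66 = 3 / 1.128) and checks only for points beyond the limits. The function names and the trending series are hypothetical, and this is not necessarily the exact calculation behind the chart shown here.

```python
from statistics import mean

def individuals_limits(values):
    """Return (center, lower, upper) limits for an individuals chart."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    # Moving range: absolute difference between consecutive observations.
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    center = mean(values)
    # 2.66 = 3 / 1.128, the usual scaling for moving ranges of two points.
    half_width = 2.66 * mean(moving_ranges)
    return center, center - half_width, center + half_width

def points_outside(values):
    """Indices of observations beyond the control limits."""
    _, lower, upper = individuals_limits(values)
    return [i for i, v in enumerate(values) if v < lower or v > upper]

# Hypothetical 60-day series: a steady rise plus a small alternating wiggle.
trending = [80 + 0.17 * day + (0.5 if day % 2 == 0 else -0.5)
            for day in range(60)]

center, lower, upper = individuals_limits(trending)
print(f"center={center:.2f}  limits=({lower:.2f}, {upper:.2f})")
print("days outside limits:", points_outside(trending))
```

In this made-up series the earliest days fall below the lower limit and the latest days fall above the upper limit, which is the kind of signal a trend leaves on the chart.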


Likewise, the control chart for Clinic B…



This chart shows that what we are actually looking at is three different processes, the data for which just appear to stack up to a single, not-different-from-normal distribution. In fact, by slicing the chart at the shifts, we can see that there are three distinct time periods when the variation is in control:
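Slicing at the shifts amounts to computing separate limits for each time period. The Python sketch below illustrates the idea under stated assumptions: the series is made up (three stable 20-day stretches at different levels), the shift locations are assumed to be known, the helper names are hypothetical, and the limits use the common moving-range method (2.66 times the average moving range).

```python
from statistics import mean

def individuals_limits(values):
    """Return (center, lower, upper) limits for an individuals chart."""
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    center = mean(values)
    half_width = 2.66 * mean(moving_ranges)  # 2.66 = 3 / 1.128
    return center, center - half_width, center + half_width

def count_outside(values):
    """Number of observations beyond the limits computed from values."""
    _, lower, upper = individuals_limits(values)
    return sum(1 for v in values if v < lower or v > upper)

def split_at(values, breakpoints):
    """Split values into consecutive segments starting at each breakpoint."""
    edges = [0, *breakpoints, len(values)]
    return [values[a:b] for a, b in zip(edges, edges[1:])]

# Hypothetical series: three stable 20-day stretches at different levels.
levels = [82, 85, 88]
series = [level + (0.5 if day % 2 == 0 else -0.5)
          for level in levels for day in range(20)]

print("outside overall limits:", count_outside(series))
for segment in split_at(series, [20, 40]):  # assumed shift locations
    center, lower, upper = individuals_limits(segment)
    print(f"center={center:.1f}  limits=({lower:.2f}, {upper:.2f})  "
          f"outside={count_outside(segment)}")
```

With one set of limits for all 60 days, points in the first and last stretches are flagged. With limits computed per stretch, each period is in control around its own center.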



The only one of the three clinics with a stable process is Clinic C. Looking at Clinic C’s plot over time, we see a random pattern of variation within the control limits. We can now expect that the histogram will not change shape significantly over time and that the parameters will all remain about the same, so our assumptions about the distribution will be valid and useful.