SOCI208 Module 3 - Summarizing Distributions

1.  Introduction

Much of data analysis consists in summarizing distributions of variables, and then comparing summary measures.
Summary measures are used to describe 3 main aspects of distributions: their location (central tendency), their variability (spread), and their skewness (lack of symmetry).
Exhibit: Descriptive statistics for variables in GRAD data set [m3011.htm]

2.  Measures of Location

1.  The Mean X.

Note: I use the notation X. instead of "X bar" for the mean for typographical reasons, because the bar is not available in HTML character sets.
The formula for the mean of a set of observations Xi is
X. = (1/n) Σ_{i=1}^{n} X_i
Useful Properties of the Mean
1.  The sum of the observations is equal to n times the mean so that
Σ_{i=1}^{n} X_i = n X.
2.  The sum of the deviations of the observations from the mean equals 0 so that
Σ_{i=1}^{n} (X_i - X.) = 0
3.  The expression
Σ_{i=1}^{n} (X_i - A)^2
(where A represents a fixed value) is a minimum when A = X..  In other words X. is the ordinary least squares (OLS) estimate of the central tendency of the observations. 

4.  When the observations are subdivided into k subsets such that f_i is the number of observations in subset i, X_i. is the mean of X in subset i, and n is the total number of observations (n = Σ_{i=1}^{k} f_i), then the mean X. is the weighted sum of the means of the subsets, so that

X. = (1/n)(f_1 X_1. + f_2 X_2. + ... + f_k X_k.) = (1/n) Σ_{i=1}^{k} f_i X_i.
Q - Can you show properties 1, 2, and 4?  How about 3?
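As a quick check, the sketch below verifies properties 1, 2, and 4 numerically; the data values and the split into subsets are hypothetical, chosen only for illustration.

# Python sketch: numerical check of properties 1, 2, and 4 of the mean
# (hypothetical data, for illustration only)
x = [4.0, 7.0, 7.0, 10.0, 12.0, 14.0]           # six made-up observations
n = len(x)
xbar = sum(x) / n                                # the mean X.

# Property 1: the sum of the observations equals n times the mean
print(sum(x), n * xbar)                          # both 54.0

# Property 2: deviations from the mean sum to 0 (up to rounding error)
print(sum(xi - xbar for xi in x))                # ~0.0

# Property 4: the mean is the weighted combination of the subset means
subsets = [[4.0, 7.0], [7.0, 10.0, 12.0, 14.0]]  # k = 2 subsets of the data
weighted = sum(len(s) * (sum(s) / len(s)) for s in subsets) / n
print(xbar, weighted)                            # both 9.0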

2.  The Trimmed Mean

The T% trimmed mean is the mean of the T% "central" observations, i.e. the mean calculated after removing the (1/2)(100-T)% smallest and (1/2)(100-T)% largest observations.
Example: The following data set contains the incomes of 12 households in a fictitious town, in ascending order (from Koopmans 1987)
 
Household   Annual Income (x $1000)   Remark
    1                  0              Unemployed
    2                  2
    3                  2
    4                  5
    5                  5
    6                  7
    7                  7
    8                  7
    9                  8
   10                 12
   11                 46
   12               1110              Retired billionaire living on dividends

The (ordinary) mean annual income is $100,917.
The 67% trimmed mean (removing observations 1, 2, 11, and 12) is $6,625.

Q - Why would $6,625 be "better" than $100,917 as an estimate of the "center" of the distribution of annual income?
The trimmed mean is used to avoid the impact of extreme observations (aka outliers).  The trimmed mean is an example of a robust statistic, i.e. a statistic that is insensitive to the presence of outlying observations.  The concern for robustness is the hallmark of the modern approach to data analysis, as exemplified by the Exploratory Data Analysis movement.
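The sketch below reproduces the two figures above from the household income data (values in $1000), applying the trimming rule as defined earlier.

# Python sketch: mean vs. 67% trimmed mean for the 12 household incomes (x $1000)
incomes = [0, 2, 2, 5, 5, 7, 7, 7, 8, 12, 46, 1110]

mean = sum(incomes) / len(incomes)
print(round(mean, 3))                 # 100.917  -> $100,917

# 67% trimmed mean: drop the (1/2)(100-67)% smallest and largest observations,
# i.e. 2 observations from each end of the ordered data
ordered = sorted(incomes)
central = ordered[2:-2]               # observations 3 through 10
trimmed_mean = sum(central) / len(central)
print(trimmed_mean)                   # 6.625   -> $6,625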

3.  The Median Md

To calculate the median Md
  1. rank the observations from smallest (rank 1) to largest (rank n)
  2. then calculate the median location (n+1)/2
  3. if n is odd then Md is the value of the observation with rank equal to the median location
  4. if n is even (so the median location is not an integer value) then calculate Md as the average of the two observations with rank on either side of the median location
    Exhibit: Median and percentiles for graduation rates data (GRAD data) [m3001.htm]
    Exhibit: Median and quartiles read-off from stem & leaf display of graduation rates data (GRAD data) [m3012.htm]
Remark: Md belongs to the category of order statistics because it is based on a ranking of the observations.
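A minimal sketch of the median-location rule described in the steps above, applied to two small hypothetical data sets (one with odd n, one with even n):

# Python sketch: median via the median location (n+1)/2
def median(values):
    ranked = sorted(values)                     # step 1: rank the observations
    n = len(ranked)
    loc = (n + 1) / 2                           # step 2: median location
    if n % 2 == 1:                              # step 3: odd n -> single middle value
        return ranked[int(loc) - 1]             # ranks are 1-based
    lower, upper = ranked[n // 2 - 1], ranked[n // 2]
    return (lower + upper) / 2                  # step 4: even n -> average the two middle values

print(median([7, 1, 5, 3, 9]))      # odd n:  3rd ranked value -> 5
print(median([7, 1, 5, 3, 9, 11]))  # even n: average of 5 and 7 -> 6.0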
 
Useful Properties of the Median
1.  In large data sets where observations are not repeated extensively about 50% of the observations are smaller, and about 50% larger, than the median
Q - Why this careful phrasing?
2.  The expression
Σ_{i=1}^{n} |X_i - A|
(where A represents a fixed value) is a minimum when A = Md.  In other words the median is the estimate of central tendency that minimizes the sum of absolute deviations from the observations.

3.  Md is not affected by outlying observations.

Q - Thus Md is called a r_____ s_______.  (Fill in the blanks.)
4.  Md is a special case of the trimmed mean.
Q - How so?
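As a numerical illustration of property 2 (and of property 3 of the mean), the hedged sketch below evaluates the sum of absolute deviations and the sum of squared deviations over a grid of candidate values A for a small hypothetical data set.

# Python sketch: the median minimizes the sum of |Xi - A|,
# while the mean minimizes the sum of (Xi - A)^2 (hypothetical data)
x = [1, 2, 2, 5, 7, 7, 8, 8, 12]

def sum_abs(a):  return sum(abs(xi - a) for xi in x)
def sum_sq(a):   return sum((xi - a) ** 2 for xi in x)

grid = [a / 10 for a in range(0, 131)]          # candidate values 0.0, 0.1, ..., 13.0
best_abs = min(grid, key=sum_abs)
best_sq = min(grid, key=sum_sq)

print(best_abs)   # 7.0: the median of x
print(best_sq)    # 5.8: the grid value closest to the mean of x (52/9, about 5.78)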

4.  The Mode

The concept of mode is only meaningful in the context of either
  1. observations classified in a frequency distribution with equal class intervals (as represented e.g., in a histogram or frequency polygon), or
  2. a graph of the density of the distribution estimated e.g. with a kernel estimator
In case (1) the modal class is the class with the largest frequency; the mode is the middle value of the class interval.
In case (2) the mode is the value of the variable corresponding to the highest estimated density.
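A hedged sketch of case (1), using numpy to bin a made-up sample into equal class intervals and report the modal class and its middle value:

# Python sketch: modal class and mode from a frequency distribution
# with equal class intervals (hypothetical data)
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=50, scale=10, size=500)      # made-up unimodal sample

counts, edges = np.histogram(x, bins=10)        # equal-width class intervals
modal = int(np.argmax(counts))                  # index of the modal class
mode = (edges[modal] + edges[modal + 1]) / 2    # middle value of the modal class

print("modal class:", (edges[modal], edges[modal + 1]))
print("mode:", mode)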
If the distribution has a single peak the distribution is called unimodal.
If the distribution has two peaks the distribution is called bimodal.
Exhibit: Distribution of crude birth rate (V207) - Histogram and kernel density estimator (WORLD209 data)
Exhibit: Joint distribution of crude birth rate (V207) and crude death rate (V213) - Kernel density estimator in 3-D display (WORLD209 data)
Exhibit: Joint distribution of crude birth rate (V207) and crude death rate (V213) - Kernel density estimator in contour (geodesic) display (WORLD209 data)
A bimodal distribution may indicate a mixture of two populations.
Exhibit: Bimodal distribution resulting from a mixture of populations (NWW Figure 3.3 p. 77)

5.  Percentiles

The P-percentile is the value of a variable such that P% of the observations are at or below this value.
One may want to find the value of the variable corresponding to an a-priori percentile; the percentages used most often are the quartiles (the 25th, 50th, and 75th percentiles).  One may also want to find the percentile corresponding to a given observation.

To calculate the P-percentile

  1. rank the observations from smallest (rank 1) to largest (rank n) -- Q - Thus percentiles are o____ s________ .  (Fill in the blanks.)
  2. express the rank in percentile form by calculating 100*(rank/n)
  3. identify the observation associated with the percent interval in which the percentile falls
Exhibit: Finding percentiles in small data set (NWW Figure 3.4 p. 78) [m3003.gif]
Exhibit (repeat): Median and percentiles in graduation rates data (GRAD data) [m3001.htm]
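One way to turn the three steps above into code is sketched below (hypothetical data; the rule follows the rank-based definition given here rather than an interpolating formula):

# Python sketch: rank-based P-percentile following the three steps above
def percentile(values, p):
    ranked = sorted(values)                  # step 1: rank from smallest to largest
    n = len(ranked)
    for rank, value in enumerate(ranked, start=1):
        if 100 * rank / n >= p:              # steps 2-3: first observation whose
            return value                     # percent position reaches P
    return ranked[-1]

data = [12, 5, 7, 9, 15, 3, 11, 8, 10, 6]    # made-up observations
print(percentile(data, 25))   # 6
print(percentile(data, 50))   # 8
print(percentile(data, 75))   # 11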

3.  Measures of Variability

1.  Range

The range is the difference between the largest (maximum) and smallest (minimum) observations in a data set.
The range is conceptually straightforward and intuitively appealing as a measure of variability, but it has drawbacks: it depends only on the two most extreme observations, so it is very sensitive to outliers, and it tends to increase with the number of observations in the data set.

2.  Interquartile Range IQR

The interquartile range (IQR) is the difference between the third and the first quartiles (75th and 25th percentiles) of the data set.
The IQR is a robust estimate of variability that is very useful when a data set contains extreme observations (outliers).
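A minimal sketch contrasting the range and the IQR on the household income data used earlier (values in $1000); note that numpy's percentile function interpolates between observations, so its quartiles may differ slightly from the rank-based rule described above.

# Python sketch: range vs. IQR for the 12 household incomes (x $1000)
import numpy as np

incomes = np.array([0, 2, 2, 5, 5, 7, 7, 7, 8, 12, 46, 1110])

data_range = incomes.max() - incomes.min()
q1, q3 = np.percentile(incomes, [25, 75])     # numpy's default (interpolated) quartiles
iqr = q3 - q1

print("range:", data_range)                   # 1110: dominated by the outlier
print("IQR:", iqr)                            # 4.75: unaffected by the extreme incomes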

3.  Variance s^2

The variance s^2 and its square root s (the standard deviation, see next) are the most commonly used measures of variability in statistical analysis.
The variance s^2 of a set of observations X_1, X_2, ..., X_n is defined as
s^2 = (1/(n-1)) Σ_{i=1}^{n} (X_i - X.)^2
or, alternatively, as
s^2 = (1/(n-1)) [Σ_{i=1}^{n} X_i^2 - (Σ_{i=1}^{n} X_i)^2 / n]
(The second formula may be computationally more efficient as it requires only one pass through the data, while the first formula requires two passes: one pass to calculate the mean, and then a second pass to calculate the sum of squared deviations from the mean.  To see this think how you, or a computer, would go about calculating the variance using either formula.)
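A hedged sketch checking that the definitional (two-pass) and computational (one-pass) formulas agree on a made-up sample:

# Python sketch: two-pass vs. one-pass computation of the variance s^2
# (hypothetical data)
x = [22.0, 25.0, 19.0, 30.0, 27.0, 23.0]
n = len(x)

# Two-pass formula: first the mean, then the squared deviations
xbar = sum(x) / n
s2_two_pass = sum((xi - xbar) ** 2 for xi in x) / (n - 1)

# One-pass formula: accumulate sum(X) and sum(X^2) in a single loop
sum_x = sum_x2 = 0.0
for xi in x:
    sum_x += xi
    sum_x2 += xi ** 2
s2_one_pass = (sum_x2 - sum_x ** 2 / n) / (n - 1)

print(s2_two_pass, s2_one_pass)   # identical up to rounding error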
The formula for the variance may be viewed as an "average" of the squared deviations of the observations from the mean X., except that the sum is divided by n-1 instead of n.  Thus, looking at the following exhibit, one sees that the positive and negative deviations for the April shipment tend to be smaller (in absolute value) than for the May shipment; this difference in the deviations will be reflected in the smaller variance for the April shipment (487.5) than for the May shipment (1161.1).
Exhibit: Deviations from the mean in the April and May shipments (NWW Figure 3.5 p. 81) [m3004.gif]
The Mystery of n-1
Why divide by n-1 instead of n to estimate the variance?
One approach emphasizes that s^2 is an estimate of the population variance on the basis of a sample (the data set).  If we knew the population mean of X, we would estimate the population variance by dividing the sum of squared deviations from that mean by n.  But we don't know the population mean, so we estimate it as the sample mean X..  Since X. is estimated from the observations in the data set, each X_i is a little bit closer to X. (since it contributed to estimating it) than it would be to a fixed number, such as the population mean.  Thus the sum of squared deviations from X. is a little bit smaller than the sum of squared deviations from the population mean would be, so dividing the sum by n would yield a variance estimate that is slightly biased downward.  It is to correct this small downward bias that one divides by n-1 rather than n.
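The hedged simulation below illustrates the argument: for many samples drawn from a population with known variance, dividing the sum of squared deviations from the sample mean by n underestimates the population variance on average, while dividing by n-1 does not (the population and sample size are arbitrary choices for the sketch).

# Python sketch: the small downward bias of dividing by n instead of n-1
import random

random.seed(1)
n = 5                  # small samples make the bias easy to see
reps = 100_000         # true population variance is 3^2 = 9

sum_div_n = sum_div_n_minus_1 = 0.0
for _ in range(reps):
    sample = [random.gauss(0, 3) for _ in range(n)]
    xbar = sum(sample) / n
    ss = sum((xi - xbar) ** 2 for xi in sample)   # squared deviations from the *sample* mean
    sum_div_n += ss / n
    sum_div_n_minus_1 += ss / (n - 1)

print(sum_div_n / reps)            # about 7.2 = ((n-1)/n) * 9: biased downward
print(sum_div_n_minus_1 / reps)    # about 9.0: roughly unbiased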

Note that the variance s^2 is expressed in units that are the square of the units used to measure the variable under study.  Thus the variance does not relate directly to the scale representing the values of the observations.  For example, one cannot represent the variance on the graph of the last exhibit.  The standard deviation, however, does relate directly to the scale of the observations.

4.  Standard Deviation s

The standard deviation s (aka SD) is the positive square root of the variance s^2, so that
s = (s^2)^{1/2}
The standard deviation, unlike the variance, is expressed in the original units of the variable.  For example, the standard deviations of filament melting point for the April shipment (square root of 487.5 = 22.1) and the May shipment (square root of 1161.1 = 34.1) can be plotted on the same scale as the deviations from the mean.
Exhibit (repeat): Deviations from the mean in the April and May shipments (NWW Figure 3.5 p. 81) [m3004.gif]

5.  Coefficient of Variation c

The coefficient of variation c is the ratio of the standard deviation to the mean, expressed as a percentage, so that
c = 100(s/X.)
Since both s and X. are in the same units, c is dimensionless.  Thus c can be used to compare the variation of variables expressed in different units.
<Example - comparison of 2 variables in different units>
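As a hedged, made-up illustration of this use (not the author's example), the sketch below compares the variability of a weight variable measured in kilograms with an income variable measured in dollars; the raw standard deviations are not comparable, but the coefficients of variation are.

# Python sketch: comparing variability across different units with c = 100*(s / mean)
# (hypothetical data, for illustration only)
import statistics as st

weight_kg = [61, 70, 74, 68, 80, 66, 72]            # made-up body weights
income_usd = [28000, 41000, 35000, 90000, 52000]    # made-up annual incomes

def coef_var(values):
    return 100 * st.stdev(values) / st.mean(values)  # stdev uses the n-1 divisor

print(round(coef_var(weight_kg), 1))    # weights: relatively small c
print(round(coef_var(income_usd), 1))   # incomes: much larger c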

4.  Skewness

1.  Visual Assessment of Skewness

A data set is skewed when observations are not symmetrically distributed.
Skewness (i.e., lack of symmetry) can be assessed visually from a depiction of the distribution such as a histogram, frequency polygon, stem-and-leaf display, or box plot (see later).
Terms to describe skewness refer to the direction in which the tail of the distribution points on the real line (negative numbers to the left, zero in the middle, positive numbers to the right), so that a distribution with a long tail toward the small (left) values is called skewed to the left (or negatively skewed), and a distribution with a long tail toward the large (right) values is called skewed to the right (or positively skewed).

2.  Relative Positions of Mean, Median, & Mode

Skewness determines the relative positions of mean, median, and mode as shown in the next exhibit.
Exhibit: Positions of mean, median, and mode in skewed distributions (NWW Figure 3.6 p. 85) [m3005.gif]
The mechanisms involved are that the mean is pulled in the direction of the long tail by the extreme observations, the median is affected much less, and the mode remains at the peak of the distribution.  Thus in a distribution skewed to the right the mode tends to be smallest, the median intermediate, and the mean largest; the order is reversed in a distribution skewed to the left.
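A hedged sketch of this pattern on a made-up right-skewed sample (the mode is taken as the middle of the modal class, following the histogram definition given in the section on the mode):

# Python sketch: mode < median < mean in a right-skewed sample (hypothetical data)
import numpy as np

rng = np.random.default_rng(42)
x = rng.exponential(scale=10, size=2000)        # made-up right-skewed sample

mean = x.mean()
median = np.median(x)
counts, edges = np.histogram(x, bins=20)        # equal class intervals
modal = int(np.argmax(counts))
mode = (edges[modal] + edges[modal + 1]) / 2    # middle of the modal class

print(mode, median, mean)                       # mode < median < mean for this sample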

3.  Standardized Skewness Measure

The standardized skewness measure is based on the third moment about the mean of the data, denoted m_3, and calculated from the third powers of the deviations of the observations from the mean
m_3 = (1/(n-1)) Σ_{i=1}^{n} (X_i - X.)^3
Since cubing preserves the sign of the deviation of an observation from the mean, and large deviations dominate the sum (because cubing "amplifies" them), it follows that m_3 is positive when the distribution is skewed to the right, negative when it is skewed to the left, and close to zero when the distribution is symmetric.
The standardized skewness measure, denoted m_3', is m_3 divided by the cube of the standard deviation s, so that
m_3' = m_3 / s^3
Notes:
  • The term moment is a general term meaning the sum of deviations (e.g., from the mean, or from the origin) raised to a given power.  Thus m_2 is the 2nd moment, the sum of squared deviations about the mean; m_4 is the 4th moment, the sum of deviations about the mean raised to the 4th power; etc.  Moments are the basis of important statistics: the variance is based on the 2nd moment, the skewness measure on the 3rd moment, and the kurtosis measure (which describes the heaviness of the tails) on the 4th moment.
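A hedged sketch computing m_3 and the standardized measure m_3' exactly as defined above (with the n-1 divisor used in this module) for a made-up right-skewed sample:

# Python sketch: third moment m_3 and standardized skewness m_3' = m_3 / s^3
# (hypothetical right-skewed data)
x = [2, 3, 3, 4, 4, 5, 5, 6, 9, 19]
n = len(x)
xbar = sum(x) / n

m3 = sum((xi - xbar) ** 3 for xi in x) / (n - 1)         # third moment about the mean
s = (sum((xi - xbar) ** 2 for xi in x) / (n - 1)) ** 0.5
m3_std = m3 / s ** 3                                     # standardized skewness measure

print(m3_std)    # clearly positive: the long right tail drives the cubed deviations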

5.  Standardized Observations

Standardized observations Z_1, ..., Z_n corresponding to observations X_1, ..., X_n are defined as
Z_i = (X_i - X.)/s   for i = 1, ..., n
Z_i is also called a "Z-score".  Z_i measures the distance of observation X_i from the mean X. in units of the standard deviation s.
It follows that X_i can be recovered from Z_i with the formula
X_i = X. + s Z_i   for i = 1, ..., n
There are two common misconceptions about standardized observations.
Example: calculating the Z-score of the high school graduation rate for NC.  The information needed is the value of GRAD for NC, the mean X. of GRAD, and the standard deviation s of GRAD across the states; the Z-score of GRAD for NC is then obtained from the formula above.
Exhibit: distributions of high school graduation rate, raw variable and Z-scores [m3013.htm]
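A minimal sketch of the Z-score computation and of the recovery formula, on made-up data:

# Python sketch: Z-scores and recovering the original observations (hypothetical data)
import statistics as st

x = [62.0, 70.5, 75.0, 68.0, 81.5, 66.0]
xbar = st.mean(x)
s = st.stdev(x)                         # n-1 divisor, as in the variance section

z = [(xi - xbar) / s for xi in x]       # Z_i = (X_i - X.)/s
x_back = [xbar + s * zi for zi in z]    # X_i = X. + s*Z_i

print([round(zi, 2) for zi in z])       # distances from the mean in SD units
print(x_back)                           # the original observations, recovered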

6.  The Box Plot

The box plot is also called the box-and-whiskers plot.
The box plot is a graphical summary of the distribution of a variable originally developed by John Tukey (Tukey 1977; see also the Sygraph manual, Wilkinson 1990:164-171).  The construction of the box plot is shown in the next exhibit.
Exhibit: Construction of the box plot [m3006.gif]
The basic elements of the box plot are as follows
  1. a box spanning the first and third quartiles, so that the length of the box is the IQR
  2. a line across the box marking the median
  3. whiskers extending from the ends of the box to the most extreme observations that are not flagged as outliers (commonly those within 1.5 IQR of the box)
  4. individual markers for the outlying observations beyond the whiskers

The box plots in the next two exhibits, with the corresponding stem and leaf plots, illustrate two situations.  Female life expectancy, on the one hand, has a more or less compact and symmetric distribution, unlikely to cause problems in statistical analysis; energy consumption per capita, on the other, is characterized by severe skew to the right and the presence of major outliers.

Exhibit: Box plot and stem & leaf display - Female life expectancy, 1975 (V195, World Handbook data) [m3017.htm]
Exhibit: Box plot and stem & leaf display - Energy consumption per capita, 1975 (V120, World Handbook data) [m3016.htm]
Q - Does the box plot pick up bimodality?  (Hint: Look at V195.)
Indentations, or notches, are an optional feature of the box plot.  The notches mark the confidence intervals for the median developed by McGill, Tukey, and Larsen (1978).  In comparing the boxplots for two populations along the same scale the two population medians can be considered different with about 95 percent confidence if the intervals around the two medians do not overlap.  A comparison of box plots is called a schematic diagram.
Exhibit: Box plots of income for males and females (SURVEY2 data) [m3018.jpg]
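A hedged sketch computing the numerical ingredients of a box plot (quartiles, median, whisker limits, outliers) for the household income data used earlier; the 1.5*IQR fence is the common Tukey convention and is an assumption here, since the construction in the exhibit may use a slightly different rule.

# Python sketch: elements of a box plot for the 12 household incomes (x $1000),
# using the common 1.5*IQR fence rule (an assumption; conventions vary)
import numpy as np

incomes = np.array([0, 2, 2, 5, 5, 7, 7, 7, 8, 12, 46, 1110])

q1, med, q3 = np.percentile(incomes, [25, 50, 75])
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

inside = incomes[(incomes >= lower_fence) & (incomes <= upper_fence)]
whiskers = (inside.min(), inside.max())                   # whiskers stop at the fences
outliers = incomes[(incomes < lower_fence) | (incomes > upper_fence)]

print("box:", (q1, med, q3))        # the box and the median line
print("whiskers:", whiskers)
print("outliers:", outliers)        # 46 and 1110 are flagged as outliers here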




    Last modified 27 Aug 2002