SOCI208 Module 3 - Summarizing Distributions
1. Introduction
Much of data analysis consists of summarizing distributions of variables,
and then comparing summary measures.
Summary measures are used to describe 3 main aspects of distributions:
- position (aka location, central tendency)
- variability (aka dispersion)
- skewness (aka asymmetry)
Exhibit: Descriptive statistics for variables
in GRAD data set [m3011.htm]
2. Measures of Location
1. The Mean X.
Note: I use the notation X. instead of "X bar" for the mean, and S for
the summation sign (capital sigma), for typographical reasons, because these
symbols are not available in HTML character sets.
The formula for the mean of a set of observations Xi is
X. = (1/n) S(i=1 to n) Xi
Useful Properties of the Mean
1. The sum of the observations is equal to n times the mean so
that
S(i=1 to n) Xi = nX.
2. The sum of the deviations of the observations from the mean equals
0 so that
S(i=1 to n) (Xi - X.) = 0
3. The expression
S(i=1 to n) (Xi - A)2
(where A represents a fixed value) is a minimum when A = X..
In other words X. is the ordinary least squares (OLS)
estimate of the central tendency of the observations.
4. When the observations are subdivided into k subsets such that fi
is the number of observations in subset i, Xi. is the mean of
X in subset i, and n is the total number of observations (n = S(i=1 to k) fi),
then the mean X. is the weighted sum of the means of the subsets, so that
X. = (1/n)(f1X1. + f2X2. + ... + fkXk.) = (1/n) S(i=1 to k) fiXi.
Q - Can you show properties 1, 2, and 4? How about 3?
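As a numerical check on the four properties above, here is a minimal Python sketch; the data set and its partition into subsets are made up for illustration only.

```python
# A quick numerical check of the four properties of the mean,
# using a small made-up data set.
data = [2.0, 4.0, 4.0, 6.0, 9.0]
n = len(data)
mean = sum(data) / n

# Property 1: the sum of the observations equals n times the mean.
assert abs(sum(data) - n * mean) < 1e-9

# Property 2: the deviations from the mean sum to 0.
assert abs(sum(x - mean for x in data)) < 1e-9

# Property 3: the sum of squared deviations from a fixed value A
# is smallest when A is the mean (the OLS property).
def ssq(a):
    return sum((x - a) ** 2 for x in data)
assert all(ssq(mean) <= ssq(a) for a in [0.0, 3.0, 5.0, 10.0])

# Property 4: the overall mean is the frequency-weighted sum of subset means.
subsets = [[2.0, 4.0, 4.0], [6.0, 9.0]]   # a partition of the data
weighted = sum(len(s) * (sum(s) / len(s)) for s in subsets) / n
assert abs(weighted - mean) < 1e-9
print("all four properties hold")
```

A numerical check like this is not a proof, of course, but it makes the properties concrete before attempting the algebra.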
2. The Trimmed Mean
The T% trimmed mean is the mean of the T% "central" observations,
i.e. the mean calculated after removing the (1/2)(100-T)% smallest and
(1/2)(100-T)% largest observations.
Example: The following data set contains the incomes of 12 households
in a fictitious town, in ascending order (from Koopmans 1987)
Household | Annual Income (x$1000) | Remark
1  |    0 | Unemployed
2  |    2 |
3  |    2 |
4  |    5 |
5  |    5 |
6  |    7 |
7  |    7 |
8  |    7 |
9  |    8 |
10 |   12 |
11 |   46 |
12 | 1110 | Retired billionaire living on dividends
The (ordinary) mean annual income is $100,917.
The 67% trimmed mean (removing observations 1, 2, 11, and 12) is $6,625.
Q - Why would $6,625 be "better" than $100,917 as an estimate
of the "center" of the distribution of annual income?
The trimmed mean is used to avoid the impact of extreme observations (aka outliers).
The trimmed mean is an example of a robust statistic, i.e.
a statistic that is insensitive to the presence of outlying observations.
The concern for robustness is the hallmark of the modern approach to data
analysis, as exemplified by the Exploratory Data Analysis movement.
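The trimmed-mean calculation above can be sketched in Python; the rounding rule for how many observations to drop at each end is one simple convention, and statistical packages may differ slightly.

```python
# Sketch of the T% trimmed mean, applied to the Koopmans income data above.
def trimmed_mean(values, t_percent):
    """Mean of the central t_percent% of the sorted observations."""
    n = len(values)
    # number of observations to drop at EACH end: (1/2)(100-T)% of n
    k = round(n * (100 - t_percent) / 200)
    central = sorted(values)[k:n - k]
    return sum(central) / len(central)

incomes = [0, 2, 2, 5, 5, 7, 7, 7, 8, 12, 46, 1110]  # x $1000
print(sum(incomes) / len(incomes))   # ordinary mean: about 100.917
print(trimmed_mean(incomes, 67))     # 67% trimmed mean: 6.625
```

With T = 67 and n = 12, two observations are dropped at each end (households 1, 2, 11, and 12), reproducing the $6,625 figure in the example.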
3. The Median Md
To calculate the median Md
- rank the observations from smallest (rank 1) to largest (rank n)
- then calculate the median location (n+1)/2
- if n is odd, Md is the value of the observation with rank equal to the median location
- if n is even (so the median location is not an integer value), calculate Md as the average of the two observations with rank on either side of the median location
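The procedure above can be written out directly as a short Python sketch (Python's own statistics.median implements the same rule):

```python
# The median procedure: rank, find the median location, then read off
# one observation (odd n) or average two (even n).
def median(values):
    ranked = sorted(values)          # rank from smallest (1) to largest (n)
    n = len(ranked)
    loc = (n + 1) / 2                # the median location
    if n % 2 == 1:                   # odd n: the observation at the location
        return ranked[int(loc) - 1]
    # even n: average the two observations on either side of the location
    return (ranked[n // 2 - 1] + ranked[n // 2]) / 2

print(median([7, 1, 5]))       # odd n  -> 5
print(median([7, 1, 5, 3]))    # even n -> 4.0
```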
Exhibit: Median and percentiles for graduation
rates data (GRAD data) [m3001.htm]
Exhibit: Median and quartiles read-off from stem
& leaf display of graduation rates data (GRAD data) [m3012.htm]
Remark: Md belongs to the category of order statistics
because it is based on a ranking of the observations.
Useful Properties of the Median
1. In large data sets where observations are not repeated extensively,
about 50% of the observations are smaller, and about 50% larger, than the
median.
Q - Why this careful phrasing?
2. The expression
S(i=1 to n) |Xi - A|
(where A represents a fixed value) is a minimum when A = Md.
In other words the median is the estimate of central tendency that minimizes
the sum of absolute deviations from the observations.
3. Md is not affected by outlying observations.
Q - Thus Md is called a r_____ s_______. (Fill
in the blanks.)
4. Md is a special case of the trimmed mean.
Q - How so?
4. The Mode
The concept of mode is only meaningful in the context of either
- observations classified in a frequency distribution with equal class intervals (as represented e.g. in a histogram or frequency polygon), or
- a graph of the density of the distribution estimated e.g. with a kernel estimator
In case (1) the modal class is the class with the largest frequency; the
mode is the middle value of the class interval.
In case (2) the mode is the value of the variable corresponding to
the highest estimated density.
If the distribution has a single peak the distribution is called
unimodal.
If the distribution has two peaks the distribution is called bimodal.
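For case (1), finding the modal class can be sketched in a few lines of Python; the data values and class width below are made up for illustration.

```python
# Finding the modal class from a frequency distribution with equal class
# intervals, then taking the middle of that interval as the mode.
from collections import Counter

data = [12, 14, 15, 17, 18, 18, 19, 22, 23, 31]   # illustrative values
width = 5
# classify each observation into the class [k*width, (k+1)*width)
freq = Counter(x // width for x in data)
modal_k = max(freq, key=freq.get)          # class with the largest frequency
lower, upper = modal_k * width, (modal_k + 1) * width
mode = (lower + upper) / 2                 # middle value of the class interval
print(f"modal class [{lower}, {upper}), mode = {mode}")
```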
Exhibit: Distribution of crude birth rate
(V207) - Histogram and kernel density estimator (WORLD209 data)
Exhibit: Joint distribution of crude birth rate
(V207) and crude death rate (V213) - Kernel density estimator in 3-D display
(WORLD209 data)
Exhibit: Joint distribution of crude birth rate
(V207) and crude death rate (V213) - Kernel density estimator in contour
(geodesic) display (WORLD209 data)
A bimodal distribution may indicate a mixture of two populations.
Exhibit: Bimodal distribution resulting
from a mixture of populations (NWW Figure 3.3 p. 77)
5. Percentiles
The P-percentile is the value of a variable such that P % of the observations
are at or below this value.
One may want to find the value of the variable corresponding to an
a-priori percentile; then the percentages used most often are
- quartiles (25th, 50th, 75th percentiles); Md is the 2d quartile or 50th percentile
- quintiles (20th, 40th, 60th, 80th percentiles)
- deciles (10th, 20th, ..., 90th percentiles)
One may also want to find the percentile corresponding to a given observation.
To calculate the P-percentile
- rank the observations from smallest (rank 1) to largest (rank n) -- Q - Thus percentiles are o____ s________. (Fill in the blanks.)
- express the rank in percentile form by calculating 100*(rank/n)
- identify the observation associated with the percent interval in which the percentile falls
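The steps above, for finding the percentile of a given observation, can be sketched as follows; note that statistical packages use a variety of slightly different percentile conventions, and this is only the simple 100*(rank/n) rule from the text. The grades data set is made up for illustration.

```python
# Percentile of a given observation: rank it, then express the rank
# as 100 * (rank / n).
def percentile_of(values, x):
    ranked = sorted(values)              # step 1: rank the observations
    rank = ranked.index(x) + 1           # 1-based rank of the observation
    return 100 * rank / len(ranked)      # step 2: percentile form

grades = [55, 60, 64, 70, 75, 80, 88, 90, 95, 99]
print(percentile_of(grades, 75))   # 50.0: half the observations at or below 75
```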
Exhibit: Finding percentiles in small data
set (NWW Figure 3.4 p. 78) [m3003.gif]
Exhibit (repeat): Median and percentiles in graduation
rates data (GRAD data) [m3001.htm]
3. Measures of Variability
1. Range
The range is the difference between the largest (maximum)
and smallest (minimum) observations in a data set.
The range is conceptually straightforward and intuitively appealing
as a measure of variability, but it has drawbacks:
- the range depends only on the largest and smallest observations, ignoring the rest of the observations
- the range is extremely sensitive to outliers
- the range is affected by the number of observations: as the sample size increases, the range tends to increase
2. Interquartile Range IQR
The interquartile range (IQR) is the difference between the third and the
first quartiles (75th and 25th percentiles) of the data set.
The IQR is a robust estimate of variability that is very useful when
a data set contains extreme observations (outliers).
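The contrast between the range and the IQR shows up clearly on the income data from the trimmed-mean example. The sketch below computes quartiles with the simple median-of-halves rule, which can differ slightly from other software conventions.

```python
# Range versus IQR on the Koopmans income data: the range is driven
# entirely by the billionaire, while the IQR ignores the tails.
def med(v):
    v = sorted(v)
    n = len(v)
    return v[n // 2] if n % 2 else (v[n // 2 - 1] + v[n // 2]) / 2

incomes = [0, 2, 2, 5, 5, 7, 7, 7, 8, 12, 46, 1110]  # x $1000
lower, upper = incomes[:6], incomes[6:]   # already sorted; split in halves
print("range =", max(incomes) - min(incomes))   # 1110
print("IQR   =", med(upper) - med(lower))       # 10.0 - 3.5 = 6.5
```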
3. Variance s2
The variance s2 and its square root s (the standard deviation,
see next) are the most commonly used measures of variability in statistical
analysis.
The variance s2 of a set of observations X1,
X2, ..., Xn is defined as
s2 = (1/(n-1)) S(i=1 to n) (Xi - X.)2
or, alternatively, as
s2 = (1/(n-1)) (S(i=1 to n) Xi2 - (S(i=1 to n) Xi)2/n)
(The second formula may be computationally more efficient as it requires
only one pass through the data, while the first formula requires two passes:
one pass to calculate the mean, and then a second pass to calculate the
sum of squared deviations from the mean. To see this think how you,
or a computer, would go about calculating the variance using either formula.)
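The two formulas can be checked against each other in a short Python sketch; the data values below are made up for illustration (think of them as melting points).

```python
# The definitional (two-pass) and computational (one-pass) variance
# formulas, checked against each other on made-up data.
data = [420, 450, 470, 480, 490]   # illustrative values
n = len(data)
mean = sum(data) / n               # first pass over the data

# Two-pass formula: second pass computes squared deviations from the mean.
var_two_pass = sum((x - mean) ** 2 for x in data) / (n - 1)

# One-pass formula: accumulate the sum and the sum of squares together.
s1 = sum(data)
s2 = sum(x * x for x in data)
var_one_pass = (s2 - s1 ** 2 / n) / (n - 1)

assert abs(var_two_pass - var_one_pass) < 1e-9
print(var_two_pass)
```

(One caveat worth knowing: with very large values the one-pass formula can lose precision to floating-point cancellation, which is why careful numerical libraries use other one-pass updates.)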
The formula for the variance may be viewed as an "average" of the squared
deviations of the observations from the mean X., except that the sum is
divided by n-1 instead of n. Thus, looking at the following exhibit,
one sees that the positive and negative deviations for the April shipment
tend to be smaller (in absolute value) than for the May shipment; this
difference in the deviations will be reflected in the smaller variance
for the April shipment (487.5) than for the May shipment (1161.1).
Exhibit: Deviations from the mean in the
April and May shipments (NWW Figure 3.5 p. 81) [m3004.gif]
The Mystery of n-1
Why divide by n-1 instead of n to estimate the variance?
One approach emphasizes that s2 is an estimate of the population
variance on the basis of a sample (the data set). If we knew the
population mean of X, we would estimate s2 by dividing the sum
of squared deviations from the mean by n. But we don't know the population
mean, so we estimate it as the sample mean X.. Since X.
is estimated from the observations in the data set, each Xi
is a little bit closer to X. (since it contributed to estimate
it) than it would be from a fixed number, such as the population mean.
Thus the sum of squared deviations from X. is a little bit smaller
than the sum of squared deviations from the population mean would be, so
dividing the sum by n would yield a variance estimate that is slightly
biased downward. It is to correct this small downward bias that one
divides by n-1 rather than n.
Note that the variance s2 is expressed in units that are
the square of the units used to measure the variable under study.
Thus the variance does not relate directly to the scale representing the
values of the observations. For example, one cannot represent the
variance on the graph of the last exhibit. The standard deviation,
however, does relate directly to the scale of the observations.
4. Standard Deviation s
The standard deviation s (aka SD) is the positive square root of the variance
s2 so that
s = (s2)1/2
The standard deviation, unlike the variance, is expressed in the original
units of the variable. For example, the standard deviation of filament
melting point for the April shipment (square root of 487.5 = 22.1) and
the May shipment (square root of 1161.1 = 34.1) can be plotted on the same
scale as the deviations from the mean.
Exhibit (repeat): Deviations from the mean
in the April and May shipments (NWW Figure 3.5 p. 81) [m3004.gif]
5. Coefficient of Variation c
The coefficient of variation c is the ratio of the standard
deviation to the mean, expressed as a percentage, so that
c = 100(s/X.)
Since both s and X. are in the same units, c is dimensionless.
Thus c can be used to compare the variation of variables expressed in different
units.
<Example - comparison of 2 variables in different units>
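As a stand-in for such an example, here is a sketch comparing the variability of two variables measured in different units; the height and weight figures are made up for illustration.

```python
# Coefficient of variation: standard deviation as a percentage of the mean.
def coef_var(values):
    n = len(values)
    mean = sum(values) / n
    s = (sum((x - mean) ** 2 for x in values) / (n - 1)) ** 0.5
    return 100 * s / mean

heights_cm = [160, 165, 170, 175, 180]   # illustrative values
weights_kg = [55, 62, 70, 80, 95]        # illustrative values
# cm and kg cannot be compared directly, but c is dimensionless,
# so the two coefficients of variation can.
print(coef_var(heights_cm))
print(coef_var(weights_kg))
```

Here the weights vary far more, relative to their mean, than the heights do, even though the raw units are incommensurable.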
4. Skewness
1. Visual Assessment of Skewness
A data set is skewed when observations are not symmetrically distributed.
Skewness (i.e., lack of symmetry) can be assessed visually from a depiction
of the distribution such as a histogram, frequency polygon, stem-and-leaf
display, or box plot (see later).
Terms to describe skewness refer to the direction where the tail
of the distribution points on the real line with negative numbers to
the left, zero in the middle, and positive numbers to the right, so that
- skewed to the left or skewed negatively means the tail points to the left (toward negative numbers)
- skewed to the right or skewed positively means the tail points to the right (toward positive numbers)
2. Relative Positions of Mean, Median, & Mode
Skewness determines the relative positions of mean, median, and mode as
shown in the next exhibit.
Exhibit: Positions of mean, median, and
mode in skewed distributions (NWW Figure 3.6 p. 85) [m3005.gif]
The mechanisms involved are:
- the mean is "attracted" toward extreme values in the long tail more than the median
- the median falls between the mean and the mode
- a large difference between mean and median suggests the presence of skewness
3. Standardized Skewness Measure
The standardized skewness measure is based on the third moment about
the mean of the data, denoted m3, and
calculated as the sum of 3d powers of deviations of the observations from
the mean
m3 = (1/(n-1)) S(i=1 to n) (Xi - X.)3
Since cubing preserves the sign of the deviation of an observation from
the mean, and large deviations dominate the sum (because cubing "amplifies"
them), it follows that
- if the largest deviations from the mean are positive (i.e., the distribution is skewed right) m3 will be positive and, conversely
- if the largest deviations from the mean are negative (i.e., the distribution is skewed left) m3 will be negative
The standardized skewness measure, denoted m3',
is m3 divided by the cube of the standard deviation s as
m3' = m3/s3
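Computed on the income data from the trimmed-mean example, the measure behaves as the text predicts; the sketch below uses n-1 in all divisions, matching the formulas here (software using the G1 convention will give slightly different values).

```python
# Standardized skewness m3' = m3 / s^3, with n-1 divisors as in the text.
def skewness(values):
    n = len(values)
    mean = sum(values) / n
    s = (sum((x - mean) ** 2 for x in values) / (n - 1)) ** 0.5
    m3 = sum((x - mean) ** 3 for x in values) / (n - 1)
    return m3 / s ** 3

incomes = [0, 2, 2, 5, 5, 7, 7, 7, 8, 12, 46, 1110]
print(skewness(incomes))       # large positive: severely skewed right
print(skewness([1, 2, 3, 4, 5]))   # 0.0: a symmetric distribution
```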
Notes:
The term moment is a general term meaning the sum of deviations
(e.g., from the mean, or from the origin) raised to a given power.
Thus m2 is the 2d moment, the sum of the deviations about the
mean squared; m4 is the 4th moment, the sum of deviations about
the mean raised to the 4th power, etc. Moments are the basis of important
statistics, so that
- m2 is the basis of the variance
- m3 is the basis of the standardized skewness measure m3'
- m4 is the basis of a measure of kurtosis or "peakedness" of a distribution that is rarely used (see NWW pp. 87-88)
- Some computer programs use a measure of skewness called G1 that has a slightly different value than m3'. G1 uses divisions by n instead of n-1.
5. Standardized Observations
Standardized observations Z1, ..., Zn corresponding
to observations X1, ..., Xn are defined as
Zi = (Xi - X.)/s for i=1, ..., n
Zi is also called a "Z-score". Zi measures
the distance of observation Xi from the mean X. in
units of the standard deviation s.
It follows that
- the mean of Z is 0
- the standard deviation of Z is 1
Xi can be recovered from Zi with the formula
Xi = X. + sZi for i=1, ..., n
Two common misconceptions about standardized observations are that
- Z-scores are limited to normally distributed data. No. Any set of observations can be standardized!
- Standardizing the observations makes the distribution of the observations normal. No. Standardizing does not change the shape of the distribution of a variable at all!
Example: calculating the Z-score of the high school graduation rate for
NC. Information needed is
- value of GRAD for NC = 69.3
- mean GRAD (X.) = 73.5
- standard deviation of GRAD (s) = 8.0
Thus the Z-score of GRAD for NC is (69.3 - 73.5)/8.0 = -0.525.
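The NC example can be checked in two lines of Python, including the recovery of Xi from Zi:

```python
# Z-score of the NC graduation rate, using the values given in the text.
def z_score(x, mean, s):
    return (x - mean) / s

z_nc = z_score(69.3, 73.5, 8.0)
print(round(z_nc, 3))          # -0.525: NC is about half an SD below the mean
x_back = 73.5 + 8.0 * z_nc     # recovering Xi from Zi: Xi = X. + s*Zi
assert abs(x_back - 69.3) < 1e-9
```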
Exhibit: distributions of high school graduation
rate, raw variable and Z-scores [m3013.htm]
6. The Box Plot
The box plot is also called the box-and-whiskers
plot.
The box plot is a graphical summary of the distribution of a variable
originally developed by John Tukey (Tukey 1977; see also the Sygraph manual,
Wilkinson 1990:164-171). The construction of the box plot is shown
in the next exhibit.
Exhibit: Construction of the box plot [m3006.gif]
The basic elements of the box plot are as follows
- the vertical line near the center of the box corresponds to the median of the distribution
- the left and right edges of the box correspond to the 25th percentile (first quartile) and 75th percentile (third quartile), respectively; the 25th and 75th percentiles are also termed lower hinge and upper hinge, respectively
- the length of the box therefore corresponds to the interquartile range (IQR), a measure of dispersion computed as the third quartile minus the first quartile (see above)
- the horizontal lines drawn from the sides of the box, called whiskers, extend to the most outlying value within 1.5 IQR from the sides
- observations that lie beyond 1.5 IQR from either side of the box are represented individually; observations lying between 1.5 IQR and 3 IQR from the sides are marked with stars and termed minor outliers
- observations with values beyond 3 IQR from either side of the box are marked with circles and termed major outliers
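The fence rules above can be sketched as a small classifier, here applied to the income data from the trimmed-mean example. The quartiles use the simple median-of-halves rule; software packages differ in their exact hinge conventions, so boundary cases may vary.

```python
# Classify observations against the 1.5*IQR (whisker/minor) and 3*IQR
# (minor/major) fences of the box plot.
def med(v):
    v = sorted(v)
    n = len(v)
    return v[n // 2] if n % 2 else (v[n // 2 - 1] + v[n // 2]) / 2

def classify(values):
    v = sorted(values)
    q1, q3 = med(v[:len(v) // 2]), med(v[(len(v) + 1) // 2:])
    iqr = q3 - q1
    out = {"inside": [], "minor": [], "major": []}
    for x in v:
        if q1 - 1.5 * iqr <= x <= q3 + 1.5 * iqr:
            out["inside"].append(x)    # reached by the whiskers
        elif q1 - 3 * iqr <= x <= q3 + 3 * iqr:
            out["minor"].append(x)     # minor outlier (star)
        else:
            out["major"].append(x)     # major outlier (circle)
    return out

print(classify([0, 2, 2, 5, 5, 7, 7, 7, 8, 12, 46, 1110]))
```

On these data both the $46,000 and the $1,110,000 incomes fall beyond the 3 IQR fence, so a box plot would mark both as major outliers.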
The box plots in the next two exhibits, with the corresponding stem
and leaf plots, illustrate two situations. Female life expectancy,
on the one hand, has a more or less compact and symmetric distribution,
unlikely to cause problems in statistical analysis; energy consumption
per capita, on the other, is characterized by severe skew to the right
and the presence of major outliers.
Exhibit: Box plot and stem & leaf display
- Female life expectancy, 1975 (V195, World Handbook data) [m3017.htm]
Exhibit: Box plot and stem & leaf display -
Energy consumption per capita, 1975 (V120, World Handbook data) [m3016.htm]
Q - Does the box plot pick up bimodality? (Hint: Look
at V195.)
Indentations, or notches, are an optional feature of the
box plot. The notches mark the confidence intervals for the median
developed by McGill, Tukey, and Larsen (1978). In comparing the boxplots
for two populations along the same scale the two population medians can
be considered different with about 95 percent confidence if the intervals
around the two medians do not overlap. A comparison of box plots
is called a schematic diagram.
Exhibit: Box plots of income for males
and females (SURVEY2 data) [m3018.jpg]
Last modified 27 Aug 2002