SOCI208 Module 3 - Summarizing Distributions
1. Introduction
Much of data analysis consists of summarizing distributions of variables,
and then comparing summary measures.
Summary measures are used to describe 3 main aspects of distributions:
- position (aka location, central tendency)
- variability (aka dispersion)
- skewness (aka asymmetry)
Exhibit: Descriptive statistics for variables
in GRAD data set [m3011.htm]
2. Measures of Location
1. The Mean X.
Note: I use the notation X. instead of "X bar" for the mean, and S for
the summation sign (capital sigma), for typographical reasons, because these
symbols are not available in HTML character sets.
The formula for the mean of a set of observations Xi is
X. = (1/n) S(i=1 to n) Xi
Useful Properties of the Mean
1. The sum of the observations is equal to n times the mean so
that
S(i=1 to n) Xi = nX.
2. The sum of the deviations of the observations from the mean equals
0 so that
S(i=1 to n) (Xi - X.) = 0
3. The expression
S(i=1 to n) (Xi - A)2
(where A represents a fixed value) is a minimum when A = X..
In other words X. is the ordinary least squares (OLS)
estimate of the central tendency of the observations.
4. When the observations are subdivided into k subsets such that fi
is the number of observations in subset i, Xi. is the mean of
X in subset i, and n is the total number of observations (n = S(i=1 to k) fi),
then the mean X. is the weighted sum of the means of the subsets, so that
X. = (1/n)(f1X1. + f2X2. + ... + fkXk.) = (1/n) S(i=1 to k) fiXi.
Q - Can you show properties 1, 2, and 4? How about 3?
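As a numerical check on the four properties above, here is a minimal Python sketch; the data set and its partition into subsets are made up for illustration only.

```python
# A quick numerical check of the four properties of the mean,
# using a small made-up data set.
data = [2.0, 4.0, 4.0, 6.0, 9.0]
n = len(data)
mean = sum(data) / n

# Property 1: the sum of the observations equals n times the mean.
assert abs(sum(data) - n * mean) < 1e-9

# Property 2: the deviations from the mean sum to 0.
assert abs(sum(x - mean for x in data)) < 1e-9

# Property 3: the sum of squared deviations from a fixed value A
# is smallest when A is the mean (the OLS property).
def ssq(a):
    return sum((x - a) ** 2 for x in data)
assert all(ssq(mean) <= ssq(a) for a in [0.0, 3.0, 5.0, 10.0])

# Property 4: the overall mean is the frequency-weighted sum of subset means.
subsets = [[2.0, 4.0, 4.0], [6.0, 9.0]]   # a partition of the data
weighted = sum(len(s) * (sum(s) / len(s)) for s in subsets) / n
assert abs(weighted - mean) < 1e-9
print("all four properties hold")
```

A numerical check like this is not a proof, of course, but it makes the properties concrete before attempting the algebra.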
2. The Trimmed Mean
The T% trimmed mean is the mean of the T% "central" observations,
i.e. the mean calculated after removing the (1/2)(100-T)% smallest and
(1/2)(100-T)% largest observations.
Example: The following data set contains the incomes of 12 households
in a fictitious town, in ascending order (from Koopmans 1987)
Household | Annual Income (x$1000) | Remark
1  |    0 | Unemployed
2  |    2 |
3  |    2 |
4  |    5 |
5  |    5 |
6  |    7 |
7  |    7 |
8  |    7 |
9  |    8 |
10 |   12 |
11 |   46 |
12 | 1110 | Retired billionaire living on dividends
The (ordinary) mean annual income is $100,917.
The 67% trimmed mean (removing observations 1, 2, 11, and 12) is $6,625.
Q - Why would $6,625 be "better" than $100,917 as an estimate
of the "center" of the distribution of annual income?
The trimmed mean is used to avoid the impact of extreme observations (aka outliers).
The trimmed mean is an example of a robust statistic, i.e.
a statistic that is insensitive to the presence of outlying observations.
The concern for robustness is the hallmark of the modern approach to data
analysis, as exemplified by the Exploratory Data Analysis movement.
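The trimmed-mean calculation above can be sketched in Python; the rounding rule for how many observations to drop at each end is one simple convention, and statistical packages may differ slightly.

```python
# Sketch of the T% trimmed mean, applied to the Koopmans income data above.
def trimmed_mean(values, t_percent):
    """Mean of the central t_percent% of the sorted observations."""
    n = len(values)
    # number of observations to drop at EACH end: (1/2)(100-T)% of n
    k = round(n * (100 - t_percent) / 200)
    central = sorted(values)[k:n - k]
    return sum(central) / len(central)

incomes = [0, 2, 2, 5, 5, 7, 7, 7, 8, 12, 46, 1110]  # x $1000
print(sum(incomes) / len(incomes))   # ordinary mean: about 100.917
print(trimmed_mean(incomes, 67))     # 67% trimmed mean: 6.625
```

With T = 67 and n = 12, two observations are dropped at each end (households 1, 2, 11, and 12), reproducing the $6,625 figure in the example.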
3. The Median Md
To calculate the median Md
- rank the observations from smallest (rank 1) to largest (rank n)
- then calculate the median location (n+1)/2
- if n is odd, Md is the value of the observation with rank equal to the median location
- if n is even (so the median location is not an integer value), calculate Md as the average of the two observations with rank on either side of the median location
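The procedure above can be written out directly as a short Python sketch (Python's own statistics.median implements the same rule):

```python
# The median procedure: rank, find the median location, then read off
# one observation (odd n) or average two (even n).
def median(values):
    ranked = sorted(values)          # rank from smallest (1) to largest (n)
    n = len(ranked)
    loc = (n + 1) / 2                # the median location
    if n % 2 == 1:                   # odd n: the observation at the location
        return ranked[int(loc) - 1]
    # even n: average the two observations on either side of the location
    return (ranked[n // 2 - 1] + ranked[n // 2]) / 2

print(median([7, 1, 5]))       # odd n  -> 5
print(median([7, 1, 5, 3]))    # even n -> 4.0
```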
Exhibit: Median and percentiles for graduation
rates data (GRAD data) [m3001.htm]
Exhibit: Median and quartiles read-off from stem
& leaf display of graduation rates data (GRAD data) [m3012.htm]
Remark: Md belongs to the category of order statistics
because it is based on a ranking of the observations.
Useful Properties of the Median
1. In large data sets where observations are not repeated extensively,
about 50% of the observations are smaller, and about 50% larger, than the
median.
Q - Why this careful phrasing?
2. The expression
S(i=1 to n) |Xi - A|
(where A represents a fixed value) is a minimum when A = Md.
In other words the median is the estimate of central tendency that minimizes
the sum of absolute deviations from the observations.
3. Md is not affected by outlying observations.
Q - Thus Md is called a r_____ s_______. (Fill
in the blanks.)
4. Md is a special case of the trimmed mean.
Q - How so?
4. The Mode
The concept of mode is only meaningful in the context of either
- observations classified in a frequency distribution with equal class intervals (as represented e.g. in a histogram or frequency polygon), or
- a graph of the density of the distribution estimated e.g. with a kernel estimator
In case (1) the modal class is the class with the largest frequency; the
mode is the middle value of the class interval.
In case (2) the mode is the value of the variable corresponding to
the highest estimated density.
If the distribution has a single peak the distribution is called
unimodal.
If the distribution has two peaks the distribution is called bimodal.
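For case (1), finding the modal class can be sketched in a few lines of Python; the data values and class width below are made up for illustration.

```python
# Finding the modal class from a frequency distribution with equal class
# intervals, then taking the middle of that interval as the mode.
from collections import Counter

data = [12, 14, 15, 17, 18, 18, 19, 22, 23, 31]   # illustrative values
width = 5
# classify each observation into the class [k*width, (k+1)*width)
freq = Counter(x // width for x in data)
modal_k = max(freq, key=freq.get)          # class with the largest frequency
lower, upper = modal_k * width, (modal_k + 1) * width
mode = (lower + upper) / 2                 # middle value of the class interval
print(f"modal class [{lower}, {upper}), mode = {mode}")
```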
Exhibit: Distribution of crude birth rate
(V207) - Histogram and kernel density estimator (WORLD209 data)
Exhibit: Joint distribution of crude birth rate
(V207) and crude death rate (V213) - Kernel density estimator in 3-D display
(WORLD209 data)
Exhibit: Joint distribution of crude birth rate
(V207) and crude death rate (V213) - Kernel density estimator in contour
(geodesic) display (WORLD209 data)
A bimodal distribution may indicate a mixture of two populations.
Exhibit: Bimodal distribution resulting
from a mixture of populations (NWW Figure 3.3 p. 77)
5. Percentiles
The P-percentile is the value of a variable such that P % of the observations
are at or below this value.
One may want to find the value of the variable corresponding to an
a-priori percentile; then the percentages used most often are
- quartiles (25th, 50th, 75th percentiles); Md is the 2d quartile or 50th percentile
- quintiles (20th, 40th, 60th, 80th percentiles)
- deciles (10th, 20th, ..., 90th percentiles)
One may also want to find the percentile corresponding to a given observation.
To calculate the P-percentile
- rank the observations from smallest (rank 1) to largest (rank n) -- Q - Thus percentiles are o____ s________. (Fill in the blanks.)
- express the rank in percentile form by calculating 100*(rank/n)
- identify the observation associated with the percent interval in which the percentile falls
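The steps above, for finding the percentile of a given observation, can be sketched as follows; note that statistical packages use a variety of slightly different percentile conventions, and this is only the simple 100*(rank/n) rule from the text. The grades data set is made up for illustration.

```python
# Percentile of a given observation: rank it, then express the rank
# as 100 * (rank / n).
def percentile_of(values, x):
    ranked = sorted(values)              # step 1: rank the observations
    rank = ranked.index(x) + 1           # 1-based rank of the observation
    return 100 * rank / len(ranked)      # step 2: percentile form

grades = [55, 60, 64, 70, 75, 80, 88, 90, 95, 99]
print(percentile_of(grades, 75))   # 50.0: half the observations at or below 75
```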
Exhibit: Finding percentiles in small data
set (NWW Figure 3.4 p. 78) [m3003.gif]
Exhibit (repeat): Median and percentiles in graduation
rates data (GRAD data) [m3001.htm]
3. Measures of Variability
1. Range
The range is the difference between the largest (maximum)
and smallest (minimum) observations in a data set.
The range is conceptually straightforward and intuitively appealing
as a measure of variability, but it has drawbacks:
- the range depends only on the largest and smallest observations, ignoring the rest of the observations
- the range is extremely sensitive to outliers
- the range is affected by the number of observations: as the sample size increases, the range tends to increase
2. Interquartile Range IQR
The interquartile range (IQR) is the difference between the third and the
first quartiles (75th and 25th percentiles) of the data set.
The IQR is a robust estimate of variability that is very useful when
a data set contains extreme observations (outliers).
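The contrast between the range and the IQR shows up clearly on the income data from the trimmed-mean example. The sketch below computes quartiles with the simple median-of-halves rule, which can differ slightly from other software conventions.

```python
# Range versus IQR on the Koopmans income data: the range is driven
# entirely by the billionaire, while the IQR ignores the tails.
def med(v):
    v = sorted(v)
    n = len(v)
    return v[n // 2] if n % 2 else (v[n // 2 - 1] + v[n // 2]) / 2

incomes = [0, 2, 2, 5, 5, 7, 7, 7, 8, 12, 46, 1110]  # x $1000
lower, upper = incomes[:6], incomes[6:]   # already sorted; split in halves
print("range =", max(incomes) - min(incomes))   # 1110
print("IQR   =", med(upper) - med(lower))       # 10.0 - 3.5 = 6.5
```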
3. Variance s2
The variance s2 and its square root s (the standard deviation,
see next) are the most commonly used measures of variability in statistical
analysis.
The variance s2 of a set of observations X1,
X2, ..., Xn is defined as
s2 = (1/(n-1)) S(i=1 to n) (Xi - X.)2
or, alternatively, as
s2 = (1/(n-1)) (S(i=1 to n) Xi2 - (S(i=1 to n) Xi)2/n)
(The second formula may be computationally more efficient as it requires
only one pass through the data, while the first formula requires two passes:
one pass to calculate the mean, and then a second pass to calculate the
sum of squared deviations from the mean. To see this think how you,
or a computer, would go about calculating the variance using either formula.)
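The two formulas can be checked against each other in a short Python sketch; the data values below are made up for illustration (think of them as melting points).

```python
# The definitional (two-pass) and computational (one-pass) variance
# formulas, checked against each other on made-up data.
data = [420, 450, 470, 480, 490]   # illustrative values
n = len(data)
mean = sum(data) / n               # first pass over the data

# Two-pass formula: second pass computes squared deviations from the mean.
var_two_pass = sum((x - mean) ** 2 for x in data) / (n - 1)

# One-pass formula: accumulate the sum and the sum of squares together.
s1 = sum(data)
s2 = sum(x * x for x in data)
var_one_pass = (s2 - s1 ** 2 / n) / (n - 1)

assert abs(var_two_pass - var_one_pass) < 1e-9
print(var_two_pass)
```

(One caveat worth knowing: with very large values the one-pass formula can lose precision to floating-point cancellation, which is why careful numerical libraries use other one-pass updates.)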
The formula for the variance may be viewed as an "average" of the squared
deviations of the observations from the mean X., except that the sum is
divided by n-1 instead of n. Thus, looking at the following exhibit,
one sees that the positive and negative deviations for the April shipment
tend to be smaller (in absolute value) than for the May shipment; this
difference in the deviations will be reflected in the smaller variance
for the April shipment (487.5) than for the May shipment (1161.1).
Exhibit: Deviations from the mean in the
April and May shipments (NWW Figure 3.5 p. 81) [m3004.gif]
The Mystery of n-1
Why divide by n-1 instead of n to estimate the variance?
One approach emphasizes that s2 is an estimate of the population
variance on the basis of a sample (the data set). If we knew the
population mean of X, we would estimate s2 by dividing the sum
of squared deviations from the mean by n. But we don't know the population
mean, so we estimate it as the sample mean X.. Since X.
is estimated from the observations in the data set, each Xi
is a little bit closer to X. (since it contributed to estimate
it) than it would be from a fixed number, such as the population mean.
Thus the sum of squared deviations from X. is a little bit smaller
than the sum of squared deviations from the population mean would be, so
dividing the sum by n would yield a variance estimate that is slightly
biased downward. It is to correct this small downward bias that one
divides by n-1 rather than n.
Note that the variance s2 is expressed in units that are
the square of the units used to measure the variable under study.
Thus the variance does not relate directly to the scale representing the
values of the observations. For example, one cannot represent the
variance on the graph of the last exhibit. The standard deviation,
however, does relate directly to the scale of the observations.
4. Standard Deviation s
The standard deviation s (aka SD) is the positive square root of the variance
s2 so that
s = (s2)1/2
The standard deviation, unlike the variance, is expressed in the original
units of the variable. For example, the standard deviation of filament
melting point for the April shipment (square root of 487.5 = 22.1) and
the May shipment (square root of 1161.1 = 34.1) can be plotted on the same
scale as the deviations from the mean.
Exhibit (repeat): Deviations from the mean
in the April and May shipments (NWW Figure 3.5 p. 81) [m3004.gif]
5. Coefficient of Variation c
The coefficient of variation c is the ratio of the standard
deviation to the mean, expressed as a percentage, so that
c = 100(s/X.)
Since both s and X. are in the same units, c is dimensionless.
Thus c can be used to compare the variation of variables expressed in different
units.
<Example - comparison of 2 variables in different units>
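As a stand-in for such an example, here is a sketch comparing the variability of two variables measured in different units; the height and weight figures are made up for illustration.

```python
# Coefficient of variation: standard deviation as a percentage of the mean.
def coef_var(values):
    n = len(values)
    mean = sum(values) / n
    s = (sum((x - mean) ** 2 for x in values) / (n - 1)) ** 0.5
    return 100 * s / mean

heights_cm = [160, 165, 170, 175, 180]   # illustrative values
weights_kg = [55, 62, 70, 80, 95]        # illustrative values
# cm and kg cannot be compared directly, but c is dimensionless,
# so the two coefficients of variation can.
print(coef_var(heights_cm))
print(coef_var(weights_kg))
```

Here the weights vary far more, relative to their mean, than the heights do, even though the raw units are incommensurable.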
4. Skewness
1. Visual Assessment of Skewness
A data set is skewed when observations are not symmetrically distributed.
Skewness (i.e., lack of symmetry) can be assessed visually from a depiction
of the distribution such as a histogram, frequency polygon, stem-and-leaf
display, or box plot (see later).
Terms to describe skewness refer to the direction where the tail
of the distribution points on the real line with negative numbers to
the left, zero in the middle, and positive numbers to the right, so that
- skewed to the left or skewed negatively means the tail points to the left (toward negative numbers)
- skewed to the right or skewed positively means the tail points to the right (toward positive numbers)
2. Relative Positions of Mean, Median, & Mode
Skewness determines the relative positions of mean, median, and mode as
shown in the next exhibit.
Exhibit: Positions of mean, median, and
mode in skewed distributions (NWW Figure 3.6 p. 85) [m3005.gif]
The mechanisms involved are:
- the mean is "attracted" toward extreme values in the long tail more than the median
- the median falls between the mean and the mode
- a large difference between mean and median suggests the presence of skewness
3. Standardized Skewness Measure
The standardized skewness measure is based on the third moment about
the mean of the data, denoted m3, and
calculated as the sum of 3d powers of deviations of the observations from
the mean
m3 = (1/(n-1)) S(i=1 to n) (Xi - X.)3
Since cubing preserves the sign of the deviation of an observation from
the mean, and large deviations dominate the sum (because cubing "amplifies"
them), it follows that
- if the largest deviations from the mean are positive (i.e., the distribution is skewed right) m3 will be positive and, conversely
- if the largest deviations from the mean are negative (i.e., the distribution is skewed left) m3 will be negative
The standardized skewness measure, denoted m3',
is m3 divided by the cube of the standard deviation s as
m3' = m3/s3
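Computed on the income data from the trimmed-mean example, the measure behaves as the text predicts; the sketch below uses n-1 in all divisions, matching the formulas here (software using the G1 convention will give slightly different values).

```python
# Standardized skewness m3' = m3 / s^3, with n-1 divisors as in the text.
def skewness(values):
    n = len(values)
    mean = sum(values) / n
    s = (sum((x - mean) ** 2 for x in values) / (n - 1)) ** 0.5
    m3 = sum((x - mean) ** 3 for x in values) / (n - 1)
    return m3 / s ** 3

incomes = [0, 2, 2, 5, 5, 7, 7, 7, 8, 12, 46, 1110]
print(skewness(incomes))       # large positive: severely skewed right
print(skewness([1, 2, 3, 4, 5]))   # 0.0: a symmetric distribution
```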
Notes:
The term moment is a general term meaning the sum of deviations
(e.g., from the mean, or from the origin) raised to a given power.
Thus m2 is the 2d moment, the sum of the deviations about the
mean squared; m4 is the 4th moment, the sum of deviations about
the mean raised to the 4th power, etc. Moments are the basis of important
statistics, so that
- m2 is the basis of the variance
- m3 is the basis of the standardized skewness measure m3'
- m4 is the basis of a measure of kurtosis or "peakedness" of a distribution that is rarely used (see NWW pp. 87-88)
- Some computer programs use a measure of skewness called G1 that has a slightly different value than m3'. G1 uses divisions by n instead of n-1.
5. Standardized Observations
Standardized observations Z1, ..., Zn corresponding
to observations X1, ..., Xn are defined as
Zi = (Xi - X.)/s for i=1, ..., n
Zi is also called a "Z-score". Zi measures
the distance of observation Xi from the mean X. in
units of the standard deviation s.
It follows that
- the mean of Z is 0
- the standard deviation of Z is 1
Xi can be recovered from Zi with the formula
Xi = X. + sZi for i=1, ..., n
Two common misconceptions about standardized observations are that
- Z-scores are limited to normally distributed data. No. Any set of observations can be standardized!
- Standardizing the observations makes the distribution of the observations normal. No. Standardizing does not change the shape of the distribution of a variable at all!
Example: calculating the Z-score of the high school graduation rate for
NC. Information needed is
- value of GRAD for NC = 69.3
- mean GRAD (X.) = 73.5
- standard deviation of GRAD (s) = 8.0
Thus the Z-score of GRAD for NC is (69.3 - 73.5)/8.0 = -0.525.
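The NC example can be checked in two lines of Python, including the recovery of Xi from Zi:

```python
# Z-score of the NC graduation rate, using the values given in the text.
def z_score(x, mean, s):
    return (x - mean) / s

z_nc = z_score(69.3, 73.5, 8.0)
print(round(z_nc, 3))          # -0.525: NC is about half an SD below the mean
x_back = 73.5 + 8.0 * z_nc     # recovering Xi from Zi: Xi = X. + s*Zi
assert abs(x_back - 69.3) < 1e-9
```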
Exhibit: distributions of high school graduation
rate, raw variable and Z-scores [m3013.htm]
6. The Box Plot
The box plot is also called the box-and-whiskers
plot.
The box plot is a graphical summary of the distribution of a variable
originally developed by John Tukey (Tukey 1977; see also the Sygraph manual,
Wilkinson 1990:164-171). The construction of the box plot is shown
in the next exhibit.
Exhibit: Construction of the box plot [m3006.gif]
The basic elements of the box plot are as follows
- the vertical line near the center of the box corresponds to the median of the distribution
- the left and right edges of the box correspond to the 25th percentile (first quartile) and 75th percentile (third quartile), respectively; the 25th and 75th percentiles are also termed lower hinge and upper hinge, respectively
- the length of the box therefore corresponds to the interquartile range (IQR), a measure of dispersion computed as the third quartile minus the first quartile (see above)
- the horizontal lines drawn from the sides of the box, called whiskers, extend to the most outlying value within 1.5 IQR from the sides
- observations that lie beyond 1.5 IQR from either side of the box are represented individually; observations lying between 1.5 IQR and 3 IQR from the sides are marked with stars and termed minor outliers
- observations with values beyond 3 IQR from either side of the box are marked with circles and termed major outliers
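The fence rules above can be sketched as a small classifier, here applied to the income data from the trimmed-mean example. The quartiles use the simple median-of-halves rule; software packages differ in their exact hinge conventions, so boundary cases may vary.

```python
# Classify observations against the 1.5*IQR (whisker/minor) and 3*IQR
# (minor/major) fences of the box plot.
def med(v):
    v = sorted(v)
    n = len(v)
    return v[n // 2] if n % 2 else (v[n // 2 - 1] + v[n // 2]) / 2

def classify(values):
    v = sorted(values)
    q1, q3 = med(v[:len(v) // 2]), med(v[(len(v) + 1) // 2:])
    iqr = q3 - q1
    out = {"inside": [], "minor": [], "major": []}
    for x in v:
        if q1 - 1.5 * iqr <= x <= q3 + 1.5 * iqr:
            out["inside"].append(x)    # reached by the whiskers
        elif q1 - 3 * iqr <= x <= q3 + 3 * iqr:
            out["minor"].append(x)     # minor outlier (star)
        else:
            out["major"].append(x)     # major outlier (circle)
    return out

print(classify([0, 2, 2, 5, 5, 7, 7, 7, 8, 12, 46, 1110]))
```

On these data both the $46,000 and the $1,110,000 incomes fall beyond the 3 IQR fence, so a box plot would mark both as major outliers.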
The box plots in the next two exhibits, with the corresponding stem
and leaf plots, illustrate two situations. Female life expectancy,
on the one hand, has a more or less compact and symmetric distribution,
unlikely to cause problems in statistical analysis; energy consumption
per capita, on the other, is characterized by severe skew to the right
and the presence of major outliers.
Exhibit: Box plot and stem & leaf display
- Female life expectancy, 1975 (V195, World Handbook data) [m3017.htm]
Exhibit: Box plot and stem & leaf display -
Energy consumption per capita, 1975 (V120, World Handbook data) [m3016.htm]
Q - Does the box plot pick up bimodality? (Hint: Look
at V195.)
Indentations, or notches, are an optional feature of the
box plot. The notches mark the confidence intervals for the median
developed by McGill, Tukey, and Larsen (1978). In comparing the boxplots
for two populations along the same scale the two population medians can
be considered different with about 95 percent confidence if the intervals
around the two medians do not overlap. A comparison of box plots
is called a schematic diagram.
Exhibit: Box plots of income for males
and females (SURVEY2 data) [m3018.jpg]
Last modified 27 Aug 2002