Statistical functions in Common Lisp.
This documentation is copyright (c) 2000 by Larry Hunter
(Larry.Hunter@ucdenver.edu). It is released under the GNU General Public
License. See the file statistics.copyright-notices for license details.
1. Introduction.
This code is intended to be used as an internal library to provide
statistical functions for GPL'd common lisp code. It may not be used
in proprietary software. Also, this code has NO WARRANTEE; see the
accompanying file statistics.copyright-notices for more information.
The code was written for a biostatistics course I teach at the
University of Colorado Health Sciences Center. The course web site
may also be useful in understanding the code here. See:
http://porsche.ucdenver.edu/pharm/Courses/PHCL7610
In addition, the course textbook, Bernard Rosner's "Fundamentals of
Biostatistics," 5th edition, Duxbury, CA, ISBN 0-534-37068-3 is an
excellent introduction to biostatistics. The formulae used in this
code are taken largely from this book, and specific page number
references are included in many of the functions.
A few of the numeric functions were adopted from the CLASP package, by
Paul Cohen. Please see the web site
http://eksl-www.cs.umass.edu/clasp.html and the attached
copyright-notices file for more information about that package.
I am interested in bug reports, corrections, and other relevant
feedback. Please send any comments to Larry.Hunter@ucdenver.edu. Thanks to Paul Cohen for the original CLASP package. Thanks also to bug
reports from Rob St. Amant and Lee Ayres, and several bug fixes from
Robert Goldman.
2. Naming conventions
Most of the functions and variable names are spelled out fully, with
hyphens in place of spaces. In the interests of brevity, the
following abreviations used in function and variable names:
ci = confidence interval
cdf = cumulative density function
ge = greater than or equal to
le = less than or equal to
pdf = probability density function
sd = standard deviation
rxc = rows by columns
sse = sample size estimate
3. The functions.
This section is a list of the functions provided, the arguments that
the functions take, and the values that they return, along with other
useful commentary. There are some additional comments in the source
files.
3.1 Descriptive statistics
The following functions take a sequence of numbers as an argument and
return the named descriptive statistic:
MEAN
MEDIAN
MODE
GEOMETRIC-MEAN
RANGE
VARIANCE
STANDARD-DEVIATION
COEFFICIENT-OF-VARIATION
STANDARD-ERROR-OF-THE-MEAN
The function SD is a synonym for standard-deviation.
The function PERCENTILE takes a sequence and a percent (a number
between 0 and 100) as arguments and returns the percentile.
3.2 Distributional functions
The following distributional functions compute densities and
cumulative probabilities for binomial, Poisson, Normal and Chi-square
distributions.
BINOMIAL-PROBABILITY takes three arguments, N, K and p, and returns
the probability of seeing exactly K out of N events under the binomial
distribution with probability p. N and K must be fixnums K <= N, and
p must be a number between 0 and 1. The calculation is exact unless
N>100 and p<0.01, in which case the Poisson approximation is used.
BINOMIAL-CUMULATIVE-PROBABILITY takes three arguments N, K and p, and
returns the probability of seeing fewer than K events in N trials
under the binomial distribution with probability p.
BINOMIAL-GE-PROBABILITY takes three arguments N, K and p, and returns
the probability of seeing K or more events in N trials under the
binomial distribution with probability p.
BINOMIAL-LE-PROBABILITY takes three arguments N, K and p, and returns
the probability of seeing K or fewer events in N trials under the
binomial distribution with probability p. It is the cumulative
probability plus the binomial probability of K.
POISSON-PROBABILITY takes two arguments mu and K, and returns the
probability of seeing k events over a time period under a Poisson
distribution with an expected number of events mu. Mu must be
positive number and K must be a positive integer.
POISSON-CUMULATIVE-PROBABILITY is the probability of seeing fewer than
K events over a time period under a Poisson distribution with an
expected number of events over that time period of mu. Mu must be
positive number, and K must be a positive integer.
POISSON-GT-PROBABILITY is the probability of seeing K or more events
over a time period under a Poisson distribution with an expected
number of events over that time period of mu. Mu must be positive
number, and K must be a positive integer.
NORMAL-PDF takes three arguments X, mu and sigma, and returns the
probability density for a normal distribution with mean mu and
standard deviation sigma at point X. Mu and X must be numbers and
sigma must be a positive number.
CONVERT-TO-STANDARD-NORMAL takes three arguments, X, mu and sigma and
converts the values of the random variable X from a normal distribution
with mean mu and standard deviation sigma to a random variable
distributed as the standard Normal (mean 0, sd 1).
PHI is the cumulative density function of the Normal distribution. It
takes one argument, X, and returns the probability that a standard normal
variable is less than X.
Z is the inverse Normal function. It takes one argument, percentile, and
returns the value X such that Phi(x) = percentile. Percentile must be
between 0 and 1.
T-DISTRIBUTION takes two arguments, degrees of freedom and a percentile,
and returns the point which is the indicated percentile in the t
distribution with dof degrees of freedom.
CHI-SQUARE takes two arguments, degrees of freedom and a percentile, and
reutnrs the point which is the indicated percentile in the chi square
distribution with dof degrees of freedom.
CHI-SQUARE-CDF takes two arguments, degrees of freedom and X, and returns
the probability that a Chi-square distributed variable with dof degrees
of freedom is less than X. This is the left hand tail area under the chi
square distribution with dof degrees of freedom up to X, and is generally
used to calculate the significance of a chi square test.
3.3 Confidence interval functions
The following functions calculate confidence intervals around various
estimates of parameters in the binomial, Poisson and Normal
distributions. The all end in "-CI"
BINOMIAL-PROBABILITY-CI takes 3 arguments, N, p and alpha and returns the
upper and lower bounds on the 1-alpha confidence interval on the p
parameter of the binomial distribution. N is the number of trials, p is
the observed probability, and alpha defines the confidence interval.
This function approximates using the Normal theory when N*p*(1-p) >= 10,
unless the keyword argument :exact? is non-nil. N must be a positive
integer, and p and alpha must be numbers between 0-1.
POISSON-MU-CI takes two arguments, an observed number of events X in a
unit of time, and alpha, and returns the upper and lower bounds on the
1-alpha confidence interval on the mu parameter of a Poisson
distribution. Mu must be a positive number and alpha must be between
0-1.
NORMAL-MEAN-CI takes four arguments, mean, standard deviation, N and
alpha, and returns the upper and lower bounds of the 1-alpha confidence
interval on the mean of the Normal distribution with parameters mean and
standard deviation. Mean must be a number, standard deviation a positive
number, N a positive integer and alpha between 0-1.
NORMAL-MEAN-CI-ON-SEQUENCE takes two arguments, a sequence of numbers and
alpha, and returns the 1-alpha confidence interval on the mean of the
sequence, assuming that it was drawn on a Normal distribution.
NORMAL-VARIANCE-CI takes three arguments, a variance, N and alpha, and
returns the 1-alpha confidence interval on the variance of a normal
distribution. Variance must be a positive number, N a positive integer
and alpha between 0-1.
NORMAL-VARIANCE-CI-ON-SEQUENCE takes two arguments, a sequence of numbers
and alpha, and returns the 1-alpha confidence interval on the variance of
the sequence, assuming that it was drawn on a Normal distribution.
NORMAL-SD-CI and NORMAL-SD-CI-ON-SEQUENCE are similar to the
NORMAL-VARIANCE functions, but return the confidence interval on the
standard deviation instead of the variance.
3.4 Parametric Hypothesis tests.
These functions generally take a :tails keyword argument, which
determines whether they test for any difference, or a specifically
positive (or specifically negative) difference. One sample tests
look at a set of observations and test whether they are compatible
with some given parameters. Two sample tests compare two different
sets of observations. The "-ON-SEQUENCE(s)" suffix take sequence
arguments and compute the necessary test statistics. Without the
suffix, the user supplies the test statistic.
Z-TEST takes two required arguments, x-bar and N, and three optional
keyword arguments, mu, sigma and tails; it returns the p value which is
the significance of a one sample test for the mean of a normal
distribution with known variance. Mu is the null hypothesis mean
(default 0), sigma is the standard deviation (default 1), and N is the
number of observations. If tails is :both, the p value reported is the
significance of any difference between x-bar and mu; if tails is
:positive, the significance is of a test that x-bar is greater than mu,
and if negative, the significance of a test that x-bar is less than mu.
Z-TEST-ON-SEQUENCE takes one required argument, sequence, and three
optional keyword arguments, mu, sigma, and tails; it returns the p value
which is the significance of a one sample test for the mean of a normal
distribution with known variance. This is the same as Z-TEST, but
calculates x-bar and N from the sequence.
T-TEST-ONE-SAMPLE takes four arguments, x-bar, sd, n, mu and one optional
keyword argument, tails. It returns the p value which is the
significance of a one sample t test for the mean of a normal distribution
with unknown variance. X-bar is the observed mean, sd is the observed
standard deviation, N is the number of observations, and mu is the test
mean. If tails is :both, the p value reported is the significance of any
difference between x-bar and mu; if tails is :positive, the significance
is of a test that x-bar is greater than mu, and if negative, the
significance of a test that x-bar is less than mu.
T-TEST-ONE-SAMPLE-ON-SEQUENCE takes two arguments, a sequence and a test
mean mu, and an optional keyword argument, tails. It returns the p value
which is the significance of a one sample t test for the mean of a normal
distribution with unknown variance. This is the same as
T-TEST-ONE-SAMPLE, but calculates the x-bar, sd and N from the sequence.
T-TEST-PAIRED takes three arguments, d-bar, sd and N, and one optional
keyword argument, tails. It returns the p value which is the
significance of a paried t test for the means of two normal distributions
in a longitudinal study. D-bar is the mean difference between the before
and after conditions, sd is the standard deviation of the differences, N
is the number of observations. If tails is :both, the p value reported
is the significance of any difference between the means; if tails is
:positive, the significance is of a test that the difference is positive,
and if negative, the significance of a test that difference is negative.
T-TEST-PAIRED-ON-SEQUENCES takes two arguments, before, a sequence of
numbers, and after, an equal length sequence of numbers, such that the
first element of before corresponds to the first element of after, etc.
It returns the p value which is the significance of a paired t test on
the two sequences. This is the same as the above, but the difference
mean, sd and N are calculated from the sequences.
T-TEST-TWO-SAMPLE takes six required arguments, x-bar1, sd1, N1, x-bar2,
sd2 and N2, which are the mean, standard deviation and size of the first
and second sample respectively. Optional keyword arguments include
variances-equal? variance-significance-cutoff and tails. This function
returns the p value which is the significance of the test of whether the
sample means are equal or not. The test done depends on whether the
sample variances are equal. The variance variances-equal? can be t (or
:yes or :equal), nil (or :no or :unequal) or :test. If it is :test, an F
test is conducted on the variances, and if the significance of the F test
is less or equal to the value of the variance-significance-cutoff
(default 0.05), then the variances are assumed to be different. If the
variances are equal, the exact two sample t test is computed. If the
variances are not equal, the Satterthwaite method is used to approximate
the p value. This method has good type I error properties (at the loss
of some power). The tails variable determines if the test is for any
difference, for the mean of the first sample being greater than the
second (:positive tails) or the mean of the first sample being less than
the second (:negative tails). X-bar1 and X-bar2 must be numbers, sd1 and
sd2 positive numbers, N1 and N2 positive integers, variances-equal? must
be one of (t, :yes, :equal, nil, :no, :unequal, or :test), and
variance-significance-cutoff must be a number between 0 and 1.
T-TEST-TWO-SAMPLE-ON-SEQUENCES takes two required variables, sequence1
and sequence2, and optional variances as T-TEST-TWO-SAMPLE. It returns
the p value which is the significance of the test of whether the two
sample means are equal. The sequences must be sequences of numbers.
F-TEST takes four required arguments, variance1, N1, variance2 and N2,
and one optional keyword argument, tails. It returns the p value which
is the significance of a test for the equality of the two variances. If
tails is both, the p value is the significance of any difference. If
tails is :positive, it is the significance of a test that the first
variance is larger than the second; if :negative, that the first variance
is smaller than the second. Variances must be positive numbers and Ns
must be positive integers.
CHI-SQUARE-TEST-ONE-SAMPLE takes three arguments, a sample variance, N
and the test sigma-squared, and one optional keyword argument, tails. It
returns the p value which is the significance of a one sample chi-square
test for the variance of a normal distribution (i.e. whether the observed
sample variance is equal to sigma-squared). Tails specifies whether the
test is for any difference (:both), for variance being less than
sigma-squared (:negative) or for the variance being greater than
sigma-squared (:positive).
BINOMIAL-TEST-ONE-SAMPLE takes three required arguments, p-har, N and p,
and two optional keyword arguments, tails and exact?. It returns the
significance of a one sample test for the equality of an observed
probability (p-hat) to an expected probability (p) under a binomial
distribution with N observations. It uses an exact test of either the
:exact? argument is true or if N*p*(1-p) <= 10, and the Normal theory
approximation otherwise. P-hat and p must be probabilities, and N must
be a positive integer. If tails is :both, than the p value is for any
difference; if it is :positive, it if for the observed probability being
greater than the expected, and if it is :negative, then it is for the
observed probability being less than expected.
BINOMIAL-TEST-TWO-SAMPLE takes four required arguments: p-hat1, N1,
p-hat2 and N2, and the optional keyword arguments tails and exact?. It
returns the p value which is the significance of the test for equality of
the two observed probabilities in the specified number of trials. The
Normal theory approximation will be used when the products of
n*p-hat*(1-p-hat) are greater than 5 for both samples and the value of
the exact? keyword is not t; otherwise, Fisher's exact test is used. The
tails variable specifies whether the test is for any difference (:both),
the first probability being larger than the second (:positive) or the
first being smaller than the second (:negative).
FISHER-EXACT-TEST takes one required arguments, a 2x2 array of numbers,
and returns the probability of observing a different matrix given that
the row and column margins are fixed. Optional :tails argument determines
whether the probability is for any different arrangement (both),
different arrangments where the probability of the least frequently
occurring cell is less than the observed (:positive) or greater than the
observed (:negative). Used to calculate exact p values for the two
sample binomial test.
MCNEMARS-TEST is a two sample test for correlated proportions, used in
longitudinal studies. It ignores the number of concordant pairs (same
outcome), and looks only at the difference between the number of
A-discordant pairs (pairs where A is preferred) vs. B-discordant pairs.
Takes two required arguments, counts of the a- and b- discordant pairs,
respectively, and and optional :exact? argument, and returns the p value
which is the significance of the difference. If N is greater than 20,
and the exact? flag is not true, then we use a chi-square approximation.
Otherwise, we use the exact binomial calculation.
POISSON-TEST-ONE-SAMPLE takes two arguments, a number of observed events
in a period of time, and mu, the number of expected events in the period
of time, and returns the p value which is the significance of difference
between the expected and observed under the Poisson distribution. Tails
determines whether the test will be for any difference (:both), observed
being greater than expected (:positive), or observed being less than
expected (:negative). There is a mediocre normal theory approximation
for large values, but it's not very good, so it is only used when the
:approximation? flag is true.
3.5 Non-parametric hypothesis tests
The following tests do not require underlying distributional
assumptions. Generally, they are less powerful than their
distributional counterparts.
SIGN-TEST takes two required arguments, a plus-count and a minus-count,
and returns the p value of the significance of the difference between
them. This is really just a special case of the binomial one sample test
(with p = 1/2). If the optional keyword argument tail is :both, then the
p value is for any difference between the positive and negative counts.
If it is :positive, it is for a test if the positive-count is larger, and
if :negative, for the positive-count to be smaller. If N is >= 20 and
the optional keyword argument exact? is not true, it will use a Normal
theory approximation.
SIGN-TEST-ON-SEQUENCES takes two equal length sequences, and returns the
p-value which is the significance of the signs of the differences between
their elements. It is the same as the above, but it computes the
differences between each element of the sequences and counts the number
of positive and negative differences.
WILCOXON-SIGNED-RANK-TEST takes a list of difference values, and returns
the p value of the significance of the differences. The values of the
differences are ranked (i.e. are the negative differences smaller in
magnitude than the positive differences?), so it assumes a sontinuous and
symmetric distribution of differences in the null, although not a Normal
distribution. It is equivalent to the Mann-Whitney test. This version
is a Normal theory approximation, and is only valid when N > 15, but does
handle tied ranks correctly.
WILCOXON-SIGNED-RANK-TEST-ON-SEQUENCES takes two equal length sequences
and performs the Wilcoxon signed rank test, as above.
CHI-SQUARE-TEST-RXC takes one required argument, a RxC contingency table,
and returns the p value which is the significance of relationship between
the row and column variable. Any difference in proportion will cause
this test to be significant. Consider using the
chi-square-test-for-trend instead if you are looking for a consistant
change. This is a Normal theory approximation, and requires that no
expected value be less than 1, and that no more than 1/5 of the expected
values be less than 5.
CHI-SQUARE-TEST-FOR-TREND takes two equal sized sequences of numbers
(representing a 2xK table) and assesses if there is an increasing or
decreasing trend in the differences, returning the p value of the
significance. Optionally, provide a list of scores, which represent
numeric relationships among the columns; if not provided, scores default
to be 1 to k.
3.6 Sample size estimates. These functions all take (as keyword
arguments) a desired significance alpha (default 0.05) power (1-beta,
default .95) and tails (default :both), as well as required estimates
of the parameters of a distribution, and return the number of subjects
necessary to find those parameters from a sample.
T-TEST-ONE-SAMPLE-SSE takes three arguments, the estimated experimental
mean (mu), the null mean (mu-null) and the null variance (variance), and
returns the number of subjects necessary to find a difference with
significance alpha and power 1-beta in a one sample t test.
T-TEST-TWO-SAMPLE-SSE takes four arguments, the estimated means and
variances for two normally distributions, and returns the number of
samples necessary to distinguish them with the alpha 1-beta and tails
specified. It is optionally possible to define sample-ratio, which
allows for unequal sized samples (SS2 = sample-ratio * SS1).
T-TEST-PAIRED-SSE takes two arguments, a difference mean and variance,
and returns the number of samples necessary to distinguish the
differences from zero, with alpha, 1-beta and tails as specified.
BINOMIAL-TEST-ONE-SAMPLE-SSE takes two arguments, the estimated p and the
null p, and returns the number osamples needed to test whether they are
different under the binomial distribution with alpha, 1-beta and tails as
specified.
BINOMIAL-TEST-TWO-SAMPLE-SSE takes two estimated probabilities as
arguments, and returns the number of samples needed to distinguish
between them (assuming binomial distributions) with the specified alpha,
1-beta and tails. It is optionally possible to define sample-ratio,
which allows for unequal sized samples (SS2 = sample-ratio * SS1).
BINOMIAL-TEST-PAIRED-SSE takes two arguments, the expected proportion of
discordant pairs (pd) and the expected number of type A discordand pairs
(pa), and returns the number of samples (that is, twice the number of
pairs) needed to find a difference between type A and type B discordant
pairs in McNemar's test parameters alpha, 1-beta and tails.
CORRELATION-SSE takes one argument, rho, the expected correlation
coefficient, and returns the number of samples necessary to distinguish
rho from zero with parameters alpha, 1-beta and tails.
3.7 Correlation and regression
LINEAR-REGRESSION takes a list of points (e.g. ((x1 y1) ... (xk yk))) and
computes the regression equation for a least squares fit of a line. It
returns five values, the intercept, slope, correlation coefficient, R^2,
and p value which is the significance of the difference of the slope from
zero.
CORRELATION-COEFFICIENT takes a list of points and returns just r from
above (also called the Pearson correlation).
CORRELATION-TEST-TWO-SAMPLE takes two correlation coefficients, and two
sample sizes, and returns the p value which is the significance of the
difference between the coefficients. This uses fisher's Z test.
CORRELATION-TEST-TWO-SAMPLE-ON-SEQUENCES takes two lists of points, and
returns the p value which is the significance of the difference between
the correlation coefficients of the two sequences. As above, but
calculates the correlation coefficients and Ns from the lists of points.
SPEARMAN-RANK-CORRELATION takes a list of points (where one or both
variables are ordinal, or have a distribution that is far from normal)
and returnes two values, the spearman correlation coefficient and the p
value which is the significance of its difference from zero.
3.8 Significance functions. These functions take test statistics (and
sometimes other parameters, like degrees of freedom) and return p
values. Chi square and Normal (Gaussian) significance are not separate
functions, but should be calculated from their respective CDF
functions.
T-SIGNIFICANCE takes a t statistic and degrees of freedom (and optionally
tails, default :both), and returns the significance. This is the
computational equivalent of a T table.
F-SIGNIFICANCE takes an f statistic, numerator- and denominator- degrees
of freedom, and returns the significance.
4 Utilities. Some of these functions may be of potential external use,
and are exported.
RANDOM-SAMPLE takes a sequence as an argument, and returns a random
element of that sequence.
BIN-AND-COUNT takes a sequence and N, creates N equal width bins (between
the minimum and maximum of the sequence), and returns a list of length N
containing the number of elements of the sequence in each of the bins
(smallest to largest).
FISHER-Z-TRANSFORM takes a correlation coefficient and returns the Fisher
Z transform of it, which is approximately distributed as the Standard
Normal.
PERMUTATIONS takes two arguments, N & K, and returns the number of ways
to take N things K at a time, when order matters.
CHOOSE takes two arguments, N & K, and returns the number of ways to take
N thinks K at a time, when order does not matter.
MEAN-SD-N takes a sequence of numbers and returns three values, the mean,
standard deviation and size of the sequence.
SQUARE takes a numeric argument and returns its square.
ROUND-FLOAT takes a numeric argument and an optional precision (default
5) and rounds the number to precision significant digits.