There are two main contributions from machine learning to statistical
inference thus far:
* The importance of polynomial time search through a space of possible
hypotheses
* Kernel (or nearest neighbor) methods, which propose a useful set of
techniques for avoiding the "curse of dimensionality"
This note covers the first (stay tuned for more about the second).
--
* Machine Learning, or Inference as Search
An important contribution of machine learning has been to look at
inference as a search through a space of possible hypotheses for one
(or more) that is well supported by a set of data. The idea of a
hypothesis space is not novel in statistics -- robustness analysis
uses the idea -- but the computational considerations associated with
search are not generally considered in statistical analysis.
Take a typical discrimination problem. Imagine there is a boolean
feature vector F of size k (i.e. k input boolean variables) and we are
trying to learn a method to map from instances of F to a boolean output
(the dependent variable) from a set of N examples. A classical
statistical approach would be to use multiple logistic regression or
contingency table methods. If there are many input variables and they
are not independent, this problem is generally considered intractable,
since we have to compute the entire covariance matrix for the input
variables. It is also often the case that we are not sure if each input
variable is actually relevant to the discrimination, so some kind of
dimensionality reduction might be necessary as well.
One way to handle this kind of problem is what statisticians generally
disdain as "fishing expeditions." We try looking at various subsets of
the variables and the interaction terms, hoping to find a combination
that gives good discrimination performance.
* Multiple testing
The reason that "fishing expeditions" are so discouraged is that they
require testing many alternate hypotheses. When we do that, we violate
an assumption underlying most hypothesis-testing methods, and our
estimates of the probability of Type I errors (incorrectly rejecting
the null) come out too low. The traditional statistical remedy is
known as the Bonferroni correction, which in effect imposes an
exponential penalty. If A is the desired family-wise significance
level and we are going to test K different hypotheses, then each
individual test must be held to a stricter level A*. When the tests
are independent, A* satisfies 1 - A = (1 - A*)^K exactly; the simpler
additive threshold A* = A/K follows from the union bound on the
per-test Type I error probabilities, and is tight precisely in the
worst case where the Type I errors are mutually exclusive among the
hypotheses in the space, given the data that we have. Unfortunately,
either form of the correction makes it very difficult to find
significant results when one tests more than a few hypotheses. It is
important to remember that the Bonferroni correction is really an
*upper bound* derived from a very unusual worst case.
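Both thresholds are easy to compute. A minimal sketch (the function names are my own; the exponential form is often called the Šidák correction in the statistics literature):

```python
# Per-test significance thresholds under multiple testing, assuming a
# desired family-wise level A over K tests.

def bonferroni(A, K):
    """Conservative additive threshold from the union bound."""
    return A / K

def sidak(A, K):
    """Exact threshold when the K tests are independent:
    solves 1 - A = (1 - A_star)**K for A_star."""
    return 1.0 - (1.0 - A) ** (1.0 / K)

A, K = 0.05, 20
print(bonferroni(A, K))  # about 0.0025
print(sidak(A, K))       # about 0.00256, very slightly less strict
```

Note how close the two are for small A: the additive bound is nearly tight even under independence, which is why the distinction rarely matters in practice.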
For example, the Bonferroni correction is almost certainly too strict
in our discrimination problem example. The hypotheses that we are
testing form a closely related family; they just involve adding
interaction terms for variables we were already
considering. Intuitively, if the data were such that we would make a
Type I error on one of them, it is quite likely we would also make a
Type I error on at least one other. The fact that Type I errors are
not mutually exclusive among the hypotheses means that the Bonferroni
correction is overly conservative here.
Unfortunately, there is no good model that allows us to quantify the
dependencies among a set of related hypotheses, so we do not have a
good model-based method for determining the correct significance of a
hypothesis found by search. On the other hand, cross validation and
randomization testing do allow us to make unbiased estimates of the
accuracy of a hypothesis discovered by a search and its significance.
Cross-validation and randomization testing are at this point well
known statistical methods -- so what is really new and interesting and
different about machine learning?
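To make randomization testing concrete, here is a hedged sketch of a permutation test for a difference in group means; the data and function names are invented for illustration:

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def permutation_test(group_a, group_b, n_perm=10000, seed=0):
    """Estimate the probability of seeing a difference in means at
    least as large as the observed one, under the null hypothesis
    that the group labels are exchangeable."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)          # relabel the observations at random
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    return hits / n_perm             # estimated p-value

a = [2.1, 2.5, 2.3, 2.7, 2.4]        # made-up data
b = [1.1, 1.4, 1.2, 1.5, 1.3]
print(permutation_test(a, b))        # small: difference unlikely by chance
```

The same recipe applies to a searched-for hypothesis: shuffle the dependent variable, rerun the entire search, and see how often chance alone produces a hypothesis as predictive as the one actually found.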
* Searching hypothesis spaces
One modest contribution is defining the idea of a hypothesis space, or
a complete set of related hypotheses. We can make various inferences
about such hypothesis spaces, including their size, how subsets of
such spaces can be identified by the observations that are compatible
with them, etc. However, it is not just the idea of a hypothesis space
that is new in machine learning (although Tom Mitchell's 1980's
"version space" conception of it is particularly clear and useful).
The real contribution of many of these machine learning/data mining
tools is the *search* component.
In order to understand the importance of search, we have to look at a
basic consideration in computer science that is absent (generally) from
statistical inference: computational complexity, or the length of time
necessary to perform a calculation.
* Computational Complexity
One of the most basic concepts in computer science is the idea of
computational complexity. That is, what is the number of fundamental
computational operations (e.g. comparisons, additions or
multiplications) that must be performed in order to accomplish a
particular task in the worst case? This complexity is generally
formulated in terms of the size of the program's input and stated in
"order" terms, ignoring specific coefficients and focusing on broad
classes of computational complexity, such as linear, polynomial, or
exponential. Since the time it takes to
accomplish an elementary operation is generally fixed in most computers
(that's what the "megahertz" rating of a PC measures), we can talk about
the number of operations and the time of execution interchangeably.
For example, what is the worst case computational complexity of a program
to count the number of occurrences of a particular number in a list of
numbers? Since we are doing a worst case analysis, it is clear that the
program must at the very least compare the target to each element of the
input, so the number of elementary operations to accomplish this task
must be at least linear in the size of the input. It is not difficult to
show that the obvious program to step through the list and increment a
counter when it finds an element which is equal to the target takes time
linear in the size of its input. Our example program provides a linear
method, and our lower bound says no algorithm can improve on it by
more than a constant factor, so we can say that we have an
asymptotically ("big O") optimal algorithm. Asymptotic analysis
ignores the actual coefficients, so there may still be an algorithm
that is faster by a constant factor.
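The counting example can be written out directly; the single loop makes the linear bound visible:

```python
# Count occurrences of a target in a list: one comparison per element,
# so the running time is linear in the length of the list.

def count_occurrences(target, numbers):
    count = 0
    for x in numbers:        # exactly len(numbers) comparisons
        if x == target:
            count += 1
    return count

print(count_occurrences(3, [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]))  # 2
```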
A more interesting example is the problem of Boolean satisfiability. This
problem takes as input a boolean formula, and asks if there is any
assignment of its variables to true/false such that the formula as a
whole is true. For example, the boolean formula "A and not B" is
satisfied only when A is assigned true and B is assigned false. As far
as we know, this problem requires time exponential in the size of the
input to solve. No one has found an algorithm that, in the worst case,
does fundamentally better than trying each possible assignment of
truth values to variables, and the number of such assignments is 2^k,
where k is the number of variables in the
formula. Because this takes time proportional to some number raised to
the power of the size of the input, it is called an exponential
algorithm, and is considered intractable.
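A brute-force satisfiability checker makes the exponential cost concrete. This sketch represents a formula as a Python function of booleans, a simplification of real SAT encodings:

```python
from itertools import product

def is_satisfiable(formula, k):
    """Try every assignment of k boolean variables: 2**k cases."""
    for assignment in product([False, True], repeat=k):
        if formula(*assignment):
            return True
    return False

# "A and not B" from the text: satisfiable (A=True, B=False).
print(is_satisfiable(lambda a, b: a and not b, 2))   # True
# "A and not A": no assignment works.
print(is_satisfiable(lambda a: a and not a, 1))      # False
```

Adding one variable doubles the number of assignments, which is exactly why this approach stops being practical at a few dozen variables.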
In general, tractable algorithms must have run time that is polynomial
in the size of their inputs. For example, one can prove that any
sorting algorithm that works by comparing pairs of items must take
time proportional to n times log n, where n is the number of items to
sort.
Many problems have an interesting property: although it takes an
exponential time to find a solution to the problem, if one is provided
with a possible solution, it takes only polynomial time to verify that
the provided solution is indeed a solution. This class of problems is
called NP, for non-deterministic polynomial. Boolean satisfiability is
an NP problem: if I give you a satisfying assignment, verifying it
takes polynomial (in fact, linear) time. The terms NP-hard and
NP-complete, which you may have heard, are related: the NP-complete
problems are the hardest problems within NP, while NP-hard problems
are at least that hard but need not themselves belong to NP.
* Inference as polynomial search
Now, let's think about that discrimination problem as a machine learning
person would. First, we'd like to define a hypothesis space which
includes either the correct explanation for the observations or a
reasonable approximation to it. Then, we'd like to describe a method for
searching that space so that given some observations (the data) we have a
good probability of finding that (approximately) correct hypothesis in
the space in a polynomial amount of time. Finally, we'd like to generate
an unbiased estimate of the accuracy of our induced hypothesis and the
probability of having found it by chance.
Let's take these in reverse order. First, we can use cross validation and
randomization testing to develop unbiased estimates of the accuracy of
our induced hypothesis and the chance of having found a hypothesis at
least as predictive as the one we did find simply by chance.
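For instance, k-fold cross-validation can be sketched in a few lines. The majority-class "learner" below is a deliberately trivial stand-in for any induction algorithm, and the data are invented:

```python
def fit_majority(examples):
    """'Learn' the most common label; returns a constant predictor."""
    labels = [y for _, y in examples]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

def k_fold_accuracy(examples, fit, k=5):
    """Hold out each of k folds in turn; train on the rest, test on it."""
    folds = [examples[i::k] for i in range(k)]    # deterministic split
    correct = total = 0
    for i in range(k):
        train = [ex for j, f in enumerate(folds) if j != i for ex in f]
        model = fit(train)
        for x, y in folds[i]:
            correct += (model(x) == y)
            total += 1
    return correct / total

data = [(x, x >= 3) for x in range(10)]           # toy, made-up data
print(k_fold_accuracy(data, fit_majority))        # 0.7
```

Because every prediction is made on examples the model never saw, the resulting accuracy estimate is unbiased in the way that in-sample fit is not.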
Second, we can use the PAC (probably approximately correct) framework to
prove things about the ability of algorithms to find reasonably good
hypotheses in exponentially large search spaces in polynomial time. What
we try to prove is that an algorithm is able to guarantee that with some
large probability p it is likely to find a hypothesis that is within some
small error e of the correct one, in polynomial time. The computational
complexity is generally dependent on p and e so that as you make them
smaller and smaller, the complexity increases. You can think of these as
approximation algorithms, which are polynomial time algorithms that offer
results that are guaranteed to come within some distance of the optimal
solution. Generally, approximation bounds are on the quality of the
solution (e); in machine learning, we usually have to also introduce a
term describing the probability of finding a solution of that quality
(p). It is fairly hard to prove such things, but there is a growing
collection of demonstrably good induction algorithms out there. There
are also lots of polynomial time algorithms which don't have provable PAC
properties, but have shown themselves to be of significant empirical
value on the basis of cross validated accuracy (and significance) of the
hypotheses that they find.
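One classical PAC-style result, for a finite hypothesis space H and a learner that outputs a hypothesis consistent with the training data, says that m >= (1/e) * (ln|H| + ln(1/d)) examples suffice for true error at most e with probability at least 1 - d. A small sketch (the conjunction example is my own illustration):

```python
import math

def pac_sample_size(hypothesis_space_size, epsilon, delta):
    """Examples sufficient for a consistent learner over a finite space
    to be probably (1 - delta) approximately (error <= epsilon) correct."""
    return math.ceil((math.log(hypothesis_space_size)
                      + math.log(1.0 / delta)) / epsilon)

# Conjunctions over k boolean features: |H| = 3**k (each variable is
# required true, required false, or absent from the conjunction).
for k in (10, 20, 40):
    print(k, pac_sample_size(3 ** k, epsilon=0.05, delta=0.05))
```

The point to notice is that m grows only logarithmically in |H|, so an exponentially large hypothesis space demands only a linear increase in data.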
The final issue, defining the hypothesis space itself, remains as
difficult for machine learning as it is for statistical inference. Human
intuition and knowledge must play the same kind of role in defining
hypothesis spaces as it does in defining individual hypotheses to
test. Searching larger collections removes some of the pressure on
intuition to identify particular hypotheses in advance, but that does
not mean that the entire universe of possible hypotheses is
searchable.
In fact, there is an interesting set of results called the "No free
lunch" theorems (Wolpert), which demonstrate that all induction methods,
when applied to the entire universe of possible induction problems,
perform exactly the same: they do not generalize at all. Fortunately, we
don't care very much about the entire universe of possible induction
problems. For example, most of us are only interested in induction
problems whose solutions are compatible with the laws of physics.
At any rate, although we cannot guarantee our induction methods will
generalize any better than traditional ones, what we are trying to do in
machine learning is to substitute large amounts of data and computational
power for at least some human intuition. We need the additional data to
avoid the multiple testing problem, and we need the computation to search
through exponentially large spaces of hypotheses. If we have these, then
we can reduce the demands on our intuition from defining specific
hypotheses to test to defining large classes of hypotheses to test.
I'm going to focus now on algorithms that are not generally proven in the
PAC framework, but which have instead shown good practical utility. I'm
going to talk about three different general search methods, and their
applications to inductive inference. The search methods are: greedy
search, beam search, and gradient descent.
Greedy algorithms use an approximate decision measure to shrink the
problem one step at a time. Perhaps the best
performing general class of machine learning algorithms are those that
use greedy search based on information gain. These algorithms were
pioneered by Ross Quinlan (the algorithms known as ID3 and C4.5), and
have gained currency in the statistical community through CART
(Breiman et al.), which extends the same approach to
regression. The basic idea in these algorithms is to
build a discrimination tree so that after making a set of decisions based
on the values of the input variables, the algorithm would reach a leaf
node that is fairly "pure" in terms of the output class. Such a tree is
built in greedy fashion: The first split is made by picking the input
variable that gives the "purest" split, using information gain as the
measure of purity. The algorithm is then applied recursively to each
side of the split until the resulting splits are either completely
pure, or the number of observations is so small that further splits
cannot be justified.
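The greedy split choice can be sketched directly. The entropy and information-gain functions below follow the standard definitions; the data set and feature names are invented for illustration:

```python
import math

def entropy(labels):
    """Entropy of a list of boolean labels, in bits."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_gain(examples, feature):
    """examples: list of (feature_dict, boolean_label) pairs."""
    labels = [y for _, y in examples]
    on = [y for f, y in examples if f[feature]]
    off = [y for f, y in examples if not f[feature]]
    remainder = (len(on) * entropy(on)
                 + len(off) * entropy(off)) / len(examples)
    return entropy(labels) - remainder

# Toy data: the label is a copy of feature "a"; "b" is pure noise.
data = [({"a": a, "b": b}, a) for a in (False, True) for b in (False, True)]
best = max(["a", "b"], key=lambda f: information_gain(data, f))
print(best)  # "a": the perfectly predictive feature wins the split
```

A full tree builder would make this split, partition the data, and recurse on each side; only the greedy choice at one node is shown here.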
Note that this approach has a problem shared by all greedy methods: it
may well be the case that a better tree could be constructed if all of
the splits were considered at once, instead of just one at a time. Of
course, there are an exponential number of trees, so it is not possible
to build a polynomial time algorithm that considers them all. Fixed
amounts of "look ahead" can be included in polynomial time (e.g. looking
at pairs of splits rather than single ones), but it is always possible
that a greedy algorithm will miss a good combination. For this reason,
there are no good PAC results for such algorithms, although their real
world performance has been excellent.
Another variation on the idea of greedy search is called "beam search."
In beam search, we are also looking to solve an exponential search
problem by looking at a series of simplified decisions, but instead of
committing to one particular decision at each step, we keep track of some
number of the best solutions seen so far. The number we keep is the
width of the beam, and is generally fairly small. Genetic algorithm
approaches to inductive inference are conducting beam searches through
hypothesis space, with a beam equal to the size of the population.
Although beam search also has yet to get any very interesting PAC
results, genetic algorithms have also shown pretty good empirical
results.
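A generic beam search might be sketched as follows; the scoring function, which simply counts agreement with a fixed "true" hypothesis, is a made-up stand-in for fit to data:

```python
def beam_search(k, score, width=3):
    """Fix one boolean variable per step, keeping only the `width`
    highest-scoring partial hypotheses at each step."""
    beam = [()]                                   # one empty hypothesis
    for _ in range(k):
        candidates = [h + (bit,) for h in beam for bit in (False, True)]
        candidates.sort(key=score, reverse=True)
        beam = candidates[:width]                 # prune to the beam width
    return beam[0]

# Toy target: score a partial hypothesis by agreement with the target.
target = (True, False, True, True, False)
score = lambda h: sum(a == b for a, b in zip(h, target))
print(beam_search(len(target), score))  # recovers `target`
```

With width 1 this degenerates to greedy search; widening the beam buys some protection against early wrong commitments at a linear cost in width.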
A final type of search that we might use is familiar from numeric
optimization methods: gradient descent. If we can define a measure on
our hypothesis space with suitable properties (e.g., being
differentiable everywhere), we can use that measure to direct
a search for the global optimum, or at least for a good local
optimum. Neural networks define spaces of non-linear combinations of
their inputs with an error measure based on the training data, and then
use gradient descent techniques to find a particular hypothesis that
minimizes the error function. The different types of neural networks
and training techniques are all variations that define different
hypothesis spaces and optimization techniques, but they all depend on the
same basic idea.
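The idea can be seen in miniature with a single weight. This sketch fits y = w*x by squared error, the same procedure neural-network training applies to millions of weights; the data and learning rate are invented:

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]          # generated by y = 2x

def error(w):
    """Squared error of the hypothesis y = w * x on the data."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys))

def gradient(w):                    # d(error)/dw, computed analytically
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys))

w, rate = 0.0, 0.01
for _ in range(100):
    w -= rate * gradient(w)         # step downhill on the error surface
print(round(w, 6))                  # 2.0
```

On this convex surface the global optimum is always reached; the non-linear error surfaces of real neural networks have local optima, which is why initialization and training schedules matter there.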
In summary, one important set of contributions of machine learning to
statistical inference is the idea of a polynomial time search through an
exponential hypothesis space. If we are clever at how we define our
hypothesis spaces, and careful about making unbiased estimates of the
accuracy and significance of our results, we can use these techniques to
trade data and compute power for fewer demands on our human intuitions.