This type of metalearning is called stacking and is the brainchild of David Wolpert, whom we met in Chapter 3 as the author of the “no free lunch” theorem. An even simpler metalearner is bagging, invented by the statistician Leo Breiman. Bagging generates random variations of the training set by resampling, applies the same learner to each one, and combines the results by voting. The reason to do this is that it reduces variance: the combined model is much less sensitive to the vagaries of the data than any single one, making this a remarkably easy way to improve accuracy. If the models are decision trees and we further vary them by withholding a random subset of the attributes from consideration at each node, the result is a so-called random forest. Random forests are some of the most accurate classifiers around. Microsoft’s Kinect uses them to figure out what you’re doing, and they regularly win machine-learning competitions.
One of the cleverest metalearners is boosting, created by two learning theorists, Yoav Freund and Rob Schapire. Instead of combining different learners, boosting repeatedly applies the same classifier to the data, using each new model to correct the previous ones’ mistakes. It does this by assigning weights to the training examples; the weight of each misclassified example is increased after each round of learning, causing later rounds to focus more on it. The name boosting comes from the notion that this process can boost a classifier that’s only slightly better than random guessing, but consistently so, into one that’s almost perfect.
Metalearning is remarkably successful, but it’s not a very deep way to combine models. It’s also expensive, requiring as it does many runs of learning, and the combined models can be quite opaque. (“I believe you have prostate cancer because the decision tree, the genetic algorithm, and Naïve Bayes say so, although the multilayer perceptron and the SVM disagree.”) Moreover, all the combined models are really just one big, messy model. Can’t we have a single learner that does the same job? Yes we can.
The Master Algorithm
Our unified learner is perhaps best introduced through an extended allegory. If machine learning is a continent divided into the territories of the five tribes, the Master Algorithm is its capital city, standing on the unique spot where the five territories meet. As you approach it from a distance, you can see that the city is made up of three concentric circles, each bounded by a wall. The outer and by far widest circle is Optimization Town. Each house here is an algorithm, and they come in all shapes and sizes. Some are under construction, the locals busy around them; some are gleaming new; and some look old and abandoned. Higher up the hill lies the Citadel of Evaluation. From its mansions and palaces orders issue continuously to the algorithms below. Above all, silhouetted against the sky, rise the Towers of Representation. Here live the rulers of the city. Their immutable laws set forth what can and cannot be done not just in the city but throughout the continent. Atop the central, tallest tower flies the flag of the Master Algorithm, red and black, with a five-pointed star surrounding an inscription that you cannot yet make out.
The city is divided into five sectors, each belonging to one of the five tribes. Each sector stretches down from its Tower of Representation to the city’s outer walls, encompassing the tower, a clutch of palaces in the Citadel of Evaluation, and the streets and houses in Optimization Town they overlook. The five sectors and three rings divide the city into fifteen districts, fifteen shapes, fifteen pieces of the puzzle you need to solve:
You gaze intently at the map, trying to decipher its secret. The fifteen pieces all match quite precisely, but you need to figure out how they combine to form just three: the representation, evaluation, and optimization components of the Master Algorithm. Every learner has these three elements, but they vary from tribe to tribe.
Representation is the formal language in which the learner expresses its models. The symbolists’ formal language is logic, of which rules and decision trees are special cases. The connectionists’ is neural networks. The evolutionaries’ is genetic programs, including classifier systems. The Bayesians’ is graphical models, an umbrella term for Bayesian networks and Markov networks. The analogizers’ is specific instances, possibly with weights, as in an SVM.
The evaluation component is a scoring function that says how good a model is. Symbolists use accuracy or information gain. Connectionists use a continuous error measure, such as squared error, which is the sum of the squares of the differences between the predicted values and the true ones. Bayesians use the posterior probability. Analogizers (at least of the SVM stripe) use the margin. In addition to how well the model fits the data, all tribes take into account other desirable properties, such as the model’s simplicity.
Optimization is the algorithm that searches for the highest-scoring model and returns it. The symbolists’ characteristic search algorithm is inverse deduction. The connectionists’ is gradient descent. The evolutionaries’ is genetic search, including crossover and mutation. The Bayesians are unusual in this regard: they don’t just look for the best model, but average over all models, weighted by how probable they are. To do the weighting efficiently, they use probabilistic inference algorithms like MCMC. The analogizers (or more precisely, the SVM mavens) use constrained optimization to find the best model.
After a long day’s journey, the sun is rapidly nearing the horizon, and you need to hurry before it gets dark. The city’s outer wall has five massive gates, each controlled by one of the tribes and leading to its district in Optimization Town. Let us enter through the Gradient Descent Gate, after whispering the watchword-“deep learning”-to the guard, and spiral in toward the Towers of Representation. From the gate the street ascends steeply up the hill to the citadel’s Squared Error Gate, but instead you turn left toward the evolutionary sector. The houses in the gradient descent district are all smooth curves and densely intertwined patterns, almost more like a jungle than a city. But when gradient descent gives way to genetic search, the picture changes abruptly. Here the houses rise higher, with structure piled on structure, but the structures are spare, almost vacant, as if waiting to be filled in by gradient descent’s curves. That’s it: the way to combine the two is to use genetic search to find the structure of the model and let gradient descent fill in its parameters. This is what nature does: evolution creates brain structures, and individual experience modulates them.