Выбрать главу

Whenever the learner’s “retina” sees a new image, that signal propagates forward through the network until it produces an output. Comparing this output with the desired one yields an error signal, which then propagates back through the layers until it reaches the retina. Based on this returning signal and on the inputs it had received during the forward pass, each neuron adjusts its weights. As the network sees more and more images of your grandmother and other people, the weights gradually converge to values that let it discriminate between the two. Backpropagation, as this algorithm is known, is phenomenally more powerful than the perceptron algorithm. A single neuron could only learn straight lines. Given enough hidden neurons, a multilayer perceptron, as it’s called, can represent arbitrarily convoluted frontiers. This makes backpropagation-or simply backprop-the connectionists’ master algorithm.

Backprop is an instance of a strategy that is very common in both nature and technology: if you’re in a hurry to get to the top of the mountain, climb the steepest slope you can find. The technical term for this is gradient ascent (if you want to get to the top) or gradient descent (if you’re looking for the valley bottom). Bacteria can find food by swimming up the concentration gradient of, say, glucose molecules, and they can flee from poisons by swimming down their gradient. All sorts of things, from aircraft wings to antenna arrays, can be optimized by gradient descent. Backprop is an efficient way to do it in a multilayer perceptron: keep tweaking the weights so as to lower the error, and stop when all tweaks fail. With backprop, you don’t have to figure out how to tweak each neuron’s weights from scratch, which would be too slow; you can do it layer by layer, tweaking each neuron based on how you tweaked the neurons it connects to. If you had to throw out your entire machine-learning toolkit in an emergency save for one tool, gradient descent is probably the one you’d want to hold on to.

So does backprop solve the machine-learning problem? Can we just throw together a big pile of neurons, wait for it to do its magic, and on the way to the bank collect a Nobel Prize for figuring out how the brain works? Alas, life is not that easy. Suppose your network has only one weight, and this is the graph of the error as a function of it:

The optimal weight, where the error is lowest, is 2.0. If the network starts out with a weight of 0.75, for example, backprop will get to the optimum in a few steps, like a ball rolling downhill. But if it starts at 5.5, on the other hand, backprop will roll down to 7.0 and remain stuck there. Backprop, with its incremental weight changes, doesn’t know how to find the global error minimum, and local ones can be arbitrarily bad, like mistaking your grandmother for a hat. With one weight, you could try every possible value at increments of 0.01 and find the optimum that way. But with thousands of weights, let alone millions or billions, this is not an option because the number of points on the grid goes up exponentially with the number of weights. The global minimum is hidden somewhere in the unfathomable vastness of hyperspace-and good luck finding it.

Imagine you’ve been kidnapped and left blindfolded somewhere in the Himalayas. Your head is throbbing, and your memory is not too good, either. All you know is you need to get to the top of Mount Everest. What do you do? You take a step forward and nearly slide into a ravine. After catching your breath, you decide to be a bit more systematic. You carefully feel around with your foot until you find the highest point you can and step gingerly to that point. Then you do the same again. Little by little, you get higher and higher. After a while, every step you can take is down, and you stop. That’s gradient ascent. If the Himalayas were just Mount Everest, and Everest was a perfect cone, it would work like a charm. But more likely, when you get to a place where every step is down, you’re still very far from the top. You’re just standing on a foothill somewhere, and you’re stuck. That’s what happens to backprop, except it climbs mountains in hyperspace instead of 3-D. If your network has a single neuron, just climbing to better weights one step at a time will get you to the top. But with a multilayer perceptron, the landscape is very rugged; good luck finding the highest peak.

This was part of the reason Minsky, Papert, and others couldn’t see how to learn multilayer perceptrons. They could imagine replacing step functions by S curves and doing gradient descent, but then they were faced with the problem of local minima of the error. In those days researchers didn’t trust computer simulations; they demanded mathematical proof that an algorithm would work, and there’s no such proof for backprop. But what we’ve come to realize is that most of the time a local minimum is fine. The error surface often looks like the quills of a porcupine, with many steep peaks and troughs, but it doesn’t really matter if we find the absolute lowest trough; any one will do. Better still, a local minimum may in fact be preferable because it’s less likely to prove to have overfit our data than the global one.

Hyperspace is a double-edged sword. On the one hand, the higher dimensional the space, the more room it has for highly convoluted surfaces and local optima. On the other hand, to be stuck in a local optimum you have to be stuck in every dimension, so it’s more difficult to get stuck in many dimensions than it is in three. In hyperspace there are mountain passes all over the (hyper) place. So, with a little help from a human sherpa, backprop can often find its way to a perfectly good set of weights. It may be only the mystical valley of Shangri-La, not the sea, but why complain if in hyperspace there are millions of Shangri-Las, each with billions of mountain passes leading to it?

Beware of attaching too much meaning to the weights backprop finds, however. Remember that there are probably many very different ones that are just as good. Learning in multilayer perceptrons is a chaotic process in the sense that starting in slightly different places can cause you to wind up at very different solutions. The phenomenon is the same whether the slight difference is in the initial weights or the training data and manifests itself in all powerful learners, not just backprop.

We could do away with the problem of local optima by taking out the S curves and just letting each neuron output the weighted sum of its inputs. That would make the error surface very smooth, leaving only one minimum-the global one. The problem, though, is that a linear function of linear functions is still just a linear function, so a network of linear neurons is no better than a single neuron. A linear brain, no matter how large, is dumber than a roundworm. S curves are a nice halfway house between the dumbness of linear functions and the hardness of step functions.

The perceptron’s revenge

Backprop was invented in 1986 by David Rumelhart, a psychologist at the University of California, San Diego, with the help of Geoff Hinton and Ronald Williams. Among other things, they showed that backprop can learn XOR, enabling connectionists to thumb their noses at Minsky and Papert. Recall the Nike example: young men and middle-aged women are the most likely buyers of Nike shoes. We can represent this with a network of three neurons: one that fires when it sees a young male, another that fires when it sees a middle-aged female, and another that fires when either of those does. And with backprop we can learn the appropriate weights, resulting in a successful Nike prospect detector. (So there, Marvin.)