In an early demonstration of the power of backprop, Terry Sejnowski and Charles Rosenberg trained a multilayer perceptron to read aloud. Their NETtalk system scanned the text, selected the correct phonemes according to context, and fed them to a speech synthesizer. NETtalk not only generalized accurately to new words, which knowledge-based systems could not, but it learned to speak in a remarkably human-like way. Sejnowski used to mesmerize audiences at research meetings by playing a tape of NETtalk’s progress: babbling at first, then starting to make sense, then speaking smoothly with only the occasional error. (You can find samples on YouTube by typing “sejnowski nettalk.”)
Neural networks’ first big success was in predicting the stock market. Because they could detect small nonlinearities in very noisy data, they beat the linear models then prevalent in finance and their use spread. A typical investment fund would train a separate network for each of a large number of stocks, let the networks pick the most promising ones, and then have human analysts decide which of those to invest in. A few funds, however, went all the way and let the learners themselves buy and sell. Exactly how all these fared is a closely guarded secret, but it’s probably not an accident that machine learners keep disappearing into hedge funds at an alarming rate.
Nonlinear models are important far beyond the stock market. Scientists everywhere use linear regression because that’s what they know, but more often than not the phenomena they study are nonlinear, and a multilayer perceptron can model them. Linear models are blind to phase transitions; neural networks soak them up like a sponge.
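To make the contrast concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than anything from the applications above: the data are a synthetic "phase transition" in which the output jumps from 0 to 1 as the input crosses zero, and the nonlinear model is a single sigmoid unit, the simplest building block of a multilayer perceptron rather than a full network. The unit is trained by plain gradient descent on the log loss, and both models are then scored by mean squared error.

```python
import math

# Synthetic "phase transition" (an assumption for illustration):
# the output jumps from 0 to 1 as x crosses zero.
xs = [(i - 9.5) / 10 for i in range(20)]      # -0.95, -0.85, ..., 0.95
ys = [1.0 if x > 0 else 0.0 for x in xs]

def mse(predict):
    """Mean squared error of a prediction function on the data."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Least-squares straight line, in closed form.
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# One sigmoid unit, trained by gradient descent on the log loss.
w, b = 0.0, 0.0
learning_rate = 1.0
for _ in range(2000):
    grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

mse_line = mse(lambda x: slope * x + intercept)
mse_unit = mse(lambda x: sigmoid(w * x + b))
print(f"straight line: {mse_line:.4f}   sigmoid unit: {mse_unit:.4f}")
```

The straight line has to smear the jump across the whole input range, so its error stays well above that of the sigmoid unit, which can sharpen its slope around the transition as training proceeds.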
Another notable early success of neural networks was learning to drive a car. Driverless cars first broke into the public consciousness with the DARPA Grand Challenges in 2004 and 2005, but over a decade earlier, researchers at Carnegie Mellon had already successfully trained a multilayer perceptron to drive a car by detecting the road in video images and appropriately turning the steering wheel. Carnegie Mellon’s car managed to drive coast to coast across America with very blurry vision (thirty by thirty-two pixels), a brain smaller than a worm’s, and only a few assists from the human copilot. (The project was dubbed “No Hands Across America.”) It may not have been the first truly self-driving car, but it did compare favorably with most teenage drivers.
Backprop’s applications are now too many to count. As its fame has grown, more of its history has come to light. It turns out that, as is often the case in science, backprop was invented more than once. Yann LeCun in France and others hit on it at around the same time as Rumelhart. A paper on backprop was rejected by the leading AI conference in the early 1980s because, according to the reviewers, Minsky and Papert had already proved that perceptrons don’t work. In fact, Rumelhart is credited with inventing backprop by the Columbus test: Columbus was not the first person to discover America, but the last. Paul Werbos, a graduate student at Harvard, had proposed a similar algorithm in his PhD thesis in 1974. And in a supreme irony, Arthur Bryson and Yu-Chi Ho, two control theorists, had done the same even earlier: in 1969, the same year that Minsky and Papert published Perceptrons! Indeed, the history of machine learning itself shows why we need learning algorithms. If algorithms that automatically find related papers in the scientific literature had existed in 1969, they might have helped avoid decades of wasted time and accelerated who knows what discoveries.
Among the many ironies of the history of the perceptron, perhaps the saddest is that Frank Rosenblatt died in a boating accident in Chesapeake Bay in 1971 and never lived to see the second act of his creation.
A complete model of a cell
A living cell is a quintessential example of a nonlinear system. The cell performs all of its functions by turning raw materials into end products through a complex web of chemical reactions. We can discover the structure of this network using symbolist methods like inverse deduction, as we saw in the last chapter, but to build a complete model of a cell we need to get quantitative, learning the parameters that couple the expression levels of different genes, relate environmental variables to internal ones, and so on. This is difficult because there is no simple linear relationship between these quantities. Rather, the cell maintains its stability through interlocking feedback loops, leading to very complex behavior. Backpropagation is well suited to this problem because of its ability to efficiently learn nonlinear functions. If we had a complete map of the cell’s metabolic pathways and enough observations of all the relevant variables, backprop could in principle learn a detailed model of the cell, with a multilayer perceptron to predict each variable as a function of its immediate causes.
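As a rough sketch of what "a multilayer perceptron to predict each variable as a function of its immediate causes" could look like, the Python below trains one tiny network with backprop. The setup is entirely hypothetical: the target variable is a product level that depends on an activator level a and a repressor level r through the made-up rule a × (1 − r), and the network size, learning rate, and number of epochs are arbitrary choices, not taken from any real cell model.

```python
import math
import random

random.seed(0)

# Hypothetical data: product level as a nonlinear function of its two
# "immediate causes", an activator a and a repressor r, both in [0, 1].
levels = [0.0, 0.25, 0.5, 0.75, 1.0]
data = [((a, r), a * (1.0 - r)) for a in levels for r in levels]

N_HIDDEN = 4
# w1[j] = [weight from a, weight from r, bias] for hidden unit j.
w1 = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(N_HIDDEN)]
# w2 = hidden-to-output weights, followed by the output bias.
w2 = [random.uniform(-0.5, 0.5) for _ in range(N_HIDDEN + 1)]

def forward(x):
    """Return the hidden activations and the network's output for input x."""
    hidden = [math.tanh(w[0] * x[0] + w[1] * x[1] + w[2]) for w in w1]
    output = sum(w2[j] * hidden[j] for j in range(N_HIDDEN)) + w2[N_HIDDEN]
    return hidden, output

def mean_squared_error():
    return sum((forward(x)[1] - t) ** 2 for x, t in data) / len(data)

def train_epoch(lr):
    """One full-batch gradient descent step on half the mean squared error."""
    g1 = [[0.0] * 3 for _ in range(N_HIDDEN)]
    g2 = [0.0] * (N_HIDDEN + 1)
    for x, t in data:
        hidden, output = forward(x)
        delta_out = output - t                 # error signal at the output
        for j in range(N_HIDDEN):
            g2[j] += delta_out * hidden[j]
            # Backpropagate through tanh, whose derivative is 1 - h^2.
            delta_h = delta_out * w2[j] * (1.0 - hidden[j] ** 2)
            g1[j][0] += delta_h * x[0]
            g1[j][1] += delta_h * x[1]
            g1[j][2] += delta_h
        g2[N_HIDDEN] += delta_out
    n = len(data)
    for j in range(N_HIDDEN):
        w2[j] -= lr * g2[j] / n
        for k in range(3):
            w1[j][k] -= lr * g1[j][k] / n
    w2[N_HIDDEN] -= lr * g2[N_HIDDEN] / n

initial_loss = mean_squared_error()
for _ in range(2000):
    train_epoch(lr=0.1)
final_loss = mean_squared_error()
print(f"MSE before training: {initial_loss:.4f}, after: {final_loss:.4f}")
```

A model of a whole cell along these lines would need one such network per variable, each wired only to that variable's immediate causes, which is why the map of the metabolic pathways has to come first.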
For the foreseeable future, however, we’ll have only partial knowledge of cells’ metabolic networks and be able to observe only a fraction of the variables we’d like to. Learning useful models despite all this missing information, and despite all the inevitable inconsistencies in the information that is available, calls for Bayesian methods, which we’ll delve into in Chapter 6. The same goes for making predictions for a particular patient, model in hand: the evidence available is necessarily noisy and incomplete, and Bayesian inference makes the best of it. It helps that, if the goal is to cure cancer, we don’t necessarily need to understand all the details of how tumor cells work, only enough to disable them without harming normal cells. In Chapter 6, we’ll also see how to orient learning toward the goal while steering clear of the things we don’t know and don’t need to know.
More immediately, we know we can use inverse deduction to infer the structure of the cell’s networks from data and previous knowledge, but there’s a combinatorial explosion of ways to apply it, and we need a strategy. Since metabolic networks were designed by evolution, perhaps simulating it in our learning algorithms is the way to go. In the next chapter, we’ll see how to do just that.
Deeper into the brain
When backprop first hit the streets, connectionists had visions of quickly learning larger and larger networks until, hardware permitting, they amounted to artificial brains. It didn’t turn out that way. Learning networks with one hidden layer was fine, but after that things soon got very difficult. Networks with a few layers worked only if they were carefully designed for the application (character recognition, say). Beyond that, backprop broke down. As we add layers, the error signal becomes more and more diffuse, like a river branching into smaller and smaller tributaries, until we’re down to individual raindrops that just don’t register. Learning with dozens or hundreds of hidden layers, like the brain, remained a distant dream, and by the mid-1990s, the excitement for multilayer perceptrons had petered out. A hard core of connectionists soldiered on, but by and large the attention of the machine-learning field moved elsewhere. (We’ll survey those lands in Chapters 6 and 7.)
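One common way to see the error signal fade is to follow the chain rule through a stack of sigmoid units. The sketch below is a deliberately stripped-down assumption: one unit per layer, every weight fixed at 1.0, and no training at all. It only computes how strongly a change at the input would move the output, which is the same product of per-layer factors that scales the error signal on its way back down.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def input_gradient(depth, weight=1.0, x=0.5):
    """Return d(output)/d(input) for a chain of `depth` sigmoid units."""
    activation = x
    gradient = 1.0
    for _ in range(depth):
        activation = sigmoid(weight * activation)
        # Chain rule: each layer multiplies the gradient by
        # weight * sigmoid'(z), where sigmoid'(z) = a * (1 - a) <= 0.25.
        gradient *= weight * activation * (1.0 - activation)
    return gradient

for depth in (1, 2, 5, 10, 20):
    print(f"{depth:2d} layers: gradient = {input_gradient(depth):.3e}")
```

Because the sigmoid's slope never exceeds 0.25, each extra layer in this toy chain shrinks the signal by at least a factor of four, so by twenty layers almost nothing is left for the early weights to learn from.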
Today, however, connectionism is resurgent. We’re learning deeper networks than ever before, and they’re setting new standards in vision, speech recognition, drug discovery, and other areas. The new field of deep learning is on the front page of the New York Times. Look under the hood, and… surprise: it’s the trusty old backprop engine, still humming. What changed? Nothing much, say the critics: just faster computers and bigger data. To which Hinton and others reply: exactly, we were right all along!