Most of all, the goal of machine learning is to find the best possible learning algorithm, by any means available, and evolution and the brain are unlikely to provide it. The products of evolution have many obvious faults. For example, the mammalian optic nerve attaches to the front of the retina instead of the back, causing an unnecessary (and egregious) blind spot right next to the fovea, the area of sharpest vision.
The molecular biology of living cells is such a mess that molecular biologists often quip that only people who don’t know any of it could believe in intelligent design. The architecture of the brain may well have similar faults (the brain has many constraints that computers don’t, like very limited short-term memory), and there’s no reason to stay within them. Moreover, we know of many situations where humans seem to consistently do the wrong thing, as Daniel Kahneman illustrates at length in his book Thinking, Fast and Slow.
In contrast to the connectionists and evolutionaries, symbolists and Bayesians do not believe in emulating nature. Rather, they want to figure out from first principles what learners should do, and that includes us humans. If we want to learn to diagnose cancer, for example, it’s not enough to say “this is how nature learns; let’s do the same.” There’s too much at stake. Errors cost lives. Doctors should diagnose in the most foolproof way they can, with methods similar to those mathematicians use to prove theorems, or as close to that as they can manage, given that it’s seldom possible to be that rigorous. They need to weigh the evidence to minimize the chances of a wrong diagnosis; or more precisely, so that the costlier an error is, the less likely they are to make it. (For example, failing to find a tumor that’s really there is potentially much worse than inferring one that isn’t.) They need to make optimal decisions, not just decisions that seem good.
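One way to read “the costlier an error is, the less likely they are to make it” is as a rule that picks the action with the lowest expected cost. The sketch below is only an illustration under stated assumptions: the function name, the two actions, and all the cost and probability values are hypothetical and do not come from the text or from any clinical source.

```python
def best_action(p_tumor, cost_miss, cost_false_alarm):
    """Pick the action with the lower expected cost.

    p_tumor          -- believed probability that a tumor is present
    cost_miss        -- hypothetical cost of dismissing a real tumor
    cost_false_alarm -- hypothetical cost of following up when there is none
    """
    # Dismissing only hurts if the tumor is really there.
    expected_cost_dismiss = p_tumor * cost_miss
    # Following up only hurts if there is no tumor.
    expected_cost_follow_up = (1.0 - p_tumor) * cost_false_alarm
    if expected_cost_follow_up < expected_cost_dismiss:
        return "follow up"
    return "dismiss"


# With symmetric costs, a 5 percent chance of a tumor is not enough to act on.
print(best_action(p_tumor=0.05, cost_miss=1.0, cost_false_alarm=1.0))
# When a miss is assumed to be 100 times worse, the same 5 percent is enough.
print(best_action(p_tumor=0.05, cost_miss=100.0, cost_false_alarm=1.0))
```

The evidence is the same in both calls; only the assumed costs differ, and that alone changes the decision.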
This is an instance of a tension that runs throughout much of science and philosophy: the split between descriptive and normative theories, between “this is how it is” and “this is how it should be.” Symbolists and Bayesians like to point out, however, that figuring out how we should learn can also help us to understand how we do learn because the two are presumably not entirely unrelated; far from it. In particular, behaviors that are important for survival and have had a long time to evolve should not be far from optimal. We’re not very good at answering written questions about probabilities, but we are very good at instantly choosing hand and arm movements to hit a target. Many psychologists have used symbolist or Bayesian models to explain aspects of human behavior. Symbolists dominated the first few decades of cognitive psychology. In the 1980s and 1990s, connectionists held sway, but now Bayesians are on the rise.
For the hardest problems (the ones we really want to solve but haven’t been able to, like curing cancer), pure nature-inspired approaches are probably too uninformed to succeed, even given massive amounts of data. We can in principle learn a complete model of a cell’s metabolic networks by a combination of structure search, with or without crossover, and parameter learning via backpropagation, but there are too many bad local optima to get stuck in. We need to reason with larger chunks, assembling and reassembling them as needed and using inverse deduction to fill in the gaps. And we need our learning to be guided by the goal of optimally diagnosing cancer and finding the best drugs to cure it.
Optimal learning is the Bayesians’ central goal, and they are in no doubt that they’ve figured out how to reach it. This way, please…
CHAPTER SIX: In the Church of the Reverend Bayes
The dark hulk of the cathedral rises from the night. Light pours from its stained-glass windows, projecting intricate equations onto the streets and buildings beyond. As you approach, you can hear chanting inside. It seems to be Latin, or perhaps math, but the Babel fish in your ear translates it into English: “Turn the crank! Turn the crank!” Just as you enter, the chant dissolves into an “Aaaah!” of satisfaction, and a murmur of “The posterior! The posterior!” You peek through the crowd. A massive stone tablet towers above the altar with a formula engraved on it in ten-foot letters:
P(A|B) = P(A) P(B|A) / P(B)
As you stare uncomprehendingly at it, your Google Glass helpfully flashes: “Bayes’ theorem.” Now the crowd starts to chant “More data! More data!” A stream of sacrificial victims is being inexorably pushed toward the altar. Suddenly, you realize that you’re in the middle of it. Too late. As the crank looms over you, you scream, “No! I don’t want to be a data point! Let me gooooo!”
You wake up in a cold sweat. Lying on your lap is a book entitled The Master Algorithm. Shaking off the nightmare, you resume reading where you had left off.
The theorem that runs the world
The path to optimal learning begins with a formula that many people have heard of: Bayes’ theorem. But here we’ll see it in a whole new light and realize that it’s vastly more powerful than you’d guess from its everyday uses. At heart, Bayes’ theorem is just a simple rule for updating your degree of belief in a hypothesis when you receive new evidence: if the evidence is consistent with the hypothesis, the probability of the hypothesis goes up; if not, it goes down. For example, if you test positive for AIDS, your probability of having it goes up. Things get more interesting when you have many pieces of evidence, such as the results of multiple tests. To combine them all without suffering a combinatorial explosion, we need to make simplifying assumptions. Things get even more interesting when we consider many hypotheses at once, such as all the different possible diagnoses for a patient. Computing the probability of each disease from the patient’s symptoms in a reasonable amount of time can take a lot of smarts. Once we know how to do all these things, we’ll be ready to learn the Bayesian way. For Bayesians, learning is “just” another application of Bayes’ theorem, with whole models as the hypotheses and the data as the evidence: as you see more data, some models become more likely and some less, until ideally one model stands out as the clear winner. Bayesians have invented fiendishly clever kinds of models. So let’s get started.
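To make the updating rule concrete, here is a minimal Python sketch of a single application of Bayes’ theorem, followed by a second one that folds in another test result. Everything in it is an assumption made for illustration: the function name, the prior, and the test’s error rates are hypothetical numbers, not figures from the text. Feeding the first posterior back in as the next prior also assumes that the two test results are independent given the patient’s true condition, which is one example of the kind of simplifying assumption mentioned above.

```python
def bayes_update(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Return P(H | E) from P(H), P(E | H), and P(E | not H)."""
    # Numerator of Bayes' theorem: P(H) * P(E | H).
    numerator = prior * p_evidence_given_h
    # Denominator P(E), expanded over the two cases H and not H.
    evidence = numerator + (1.0 - prior) * p_evidence_given_not_h
    return numerator / evidence


# Hypothetical numbers, chosen only to illustrate the mechanics.
prior = 0.003                # assumed P(disease) before any test
sensitivity = 0.99           # assumed P(positive | disease)
false_positive_rate = 0.01   # assumed P(positive | no disease)

# Belief after one positive test.
after_one = bayes_update(prior, sensitivity, false_positive_rate)

# Belief after a second positive test, reusing the first posterior as the
# new prior (valid only under the conditional independence assumption).
after_two = bayes_update(after_one, sensitivity, false_positive_rate)

print(f"prior:               {prior:.4f}")
print(f"after one positive:  {after_one:.4f}")
print(f"after two positives: {after_two:.4f}")
```

Each positive result raises the probability of the hypothesis, as the rule says it should. With a rare condition, though, a single positive from even a fairly accurate test can leave the posterior well short of certainty, which is why the prior matters as much as the evidence.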
Thomas Bayes was an eighteenth-century English clergyman who, without realizing it, became the center of a new religion. You may well ask how that could happen, until you notice that it happened to Jesus, too: Christianity as we know it was invented by Saint Paul, while Jesus saw himself as the pinnacle of the Jewish faith. Similarly, Bayesianism as we know it was invented by Pierre-Simon de Laplace, a Frenchman who was born five decades after Bayes. Bayes was the preacher who first described a new way to think about chance, but it was Laplace who codified those insights into the theorem that bears Bayes’s name.