You’re not the only one in dire straits-so are we. We’ve only just set out on our road to the Master Algorithm and already we seem to have run into an insurmountable obstacle. Is there any way to learn something from the past that we can be confident will apply in the future? And if there isn’t, isn’t machine learning a hopeless enterprise? For that matter, isn’t all of science, even all of human knowledge, on rather shaky ground?
It’s not like big data would solve the problem. You could be super-Casanova and have dated millions of women thousands of times each, but your master database still wouldn’t answer the question of what this woman is going to say this time. Even if today is exactly like some previous occasion when she said yes-same day of week, same type of date, same weather, and same shows on TV-that still doesn’t mean that this time she will say yes. For all you know, her answer is determined by some factor that you didn’t think of or don’t have access to. Or maybe there’s no rhyme or reason to her answers: they’re random, and you’re just spinning your wheels trying to find a pattern in them.
Philosophers have debated Hume’s problem of induction ever since he posed it, but no one has come up with a satisfactory answer. Bertrand Russell liked to illustrate the problem with the story of the inductivist turkey. On his first morning at the farm, the turkey was fed at 9:00 a.m., but being a good inductivist, he didn’t jump to conclusions. He first collected many observations on many different days under many different circumstances. Having been fed consistently at 9:00 a.m. for many consecutive days, he finally concluded that yes, he would always be fed at 9:00 a.m. Then came the morning of Christmas eve, and his throat was cut.
It would be nice if Hume’s problem was just a cute philosophical conundrum we could ignore, but we can’t. For example, Google’s business is based on guessing which web pages you’re looking for when you type some keywords into the search box. Their key asset is massive logs of search queries people have entered in the past and the links they clicked on in the corresponding results pages. But what do you do if someone types in a combination of keywords that’s not in the log? And even if it is, how can you be confident that the current user wants the same pages as the previous ones?
How about we just assume that the future will be like the past? This is certainly a risky assumption. (It didn’t work for the inductivist turkey.) On the other hand, without it all knowledge is impossible, and so is life. We’d rather stay alive, even if precariously. Unfortunately, even with that assumption we’re not out of the woods. It takes care of the “trivial” cases: If I’m a doctor and patient B has exactly the same symptoms as patient A, I assume that the diagnosis is the same. But if patient B’s symptoms don’t exactly match anyone else’s, I’m still in the dark. This is the machine-learning problem: generalizing to cases that we haven’t seen before.
But perhaps that’s not such a big deal? With enough data, won’t most cases be in the “trivial” category? No. We saw in the previous chapter why memorization won’t work as a universal learner, but now we can make it more quantitative. Suppose you have a database with a trillion records, each with a thousand Boolean fields (i.e., each field is the answer to a yes/no question). That’s pretty big. What fraction of the possible cases have you seen? (Take a guess before you read on.) Well, the number of possible answers is two for each question, so for two questions it’s two times two (yes-yes, yes-no, no-yes, and no-no), for three questions it’s two cubed (2 × 2 × 2 = 23), and for a thousand questions it’s two raised to the power of a thousand (21000). The trillion records in your database are one-gazillionth of 1 percent of 21000, where “gazillionth” means “zero point 286 zeros followed by 1.” Bottom line: no matter how much data you have-tera- or peta- or exa- or zetta- or yottabytes-you’ve basically seen nothing. The chances that the new case you need to make a decision on is already in the database are so vanishingly small that, without generalization, you won’t even get off the ground.
If this all sounds a bit abstract, suppose you’re a major e-mail provider, and you need to label each incoming e-mail as spam or not spam. You may have a database of a trillion past e-mails, each already labeled as spam or not, but that won’t save you, since the chances that every new e-mail will be an exact copy of a previous one are just about zero. You have no choice but to try to figure out at a more general level what distinguishes spam from nonspam. And, according to Hume, there’s no way to do that.
The “no free lunch” theorem
Two hundred and fifty years after Hume set off his bombshell, it was given elegant mathematical form by David Wolpert, a physicist turned machine learner. His result, known as the “no free lunch” theorem, sets a limit on how good a learner can be. The limit is pretty low: no learner can be better than random guessing! OK, we can go home: the Master Algorithm is just flipping coins. Seriously, though, how is it that no learner can beat coin flipping? And if that’s so, how come the world is full of highly successful learners, from spam filters to (any day now) self-driving cars?
The “no free lunch” theorem is a lot like the reason Pascal’s wager fails. In his Pensées, published in 1669, Pascal said we should believe in the Christian God because if he exists that gains us eternal life, and if he doesn’t we lose very little. This was a remarkably sophisticated argument for the time, but as Diderot pointed out, an imam could make the same argument for believing in Allah. And if you pick the wrong god, the price you pay is eternal hell. On balance, considering the wide variety of possible gods, you’re no better off picking a particular one to believe in than you are picking any other. For every god that says “do this,” there’s another that says “no, do that.” You may as well just forget about god and enjoy life without religious constraints.
Replace “god” with “learning algorithm” and “eternal life” with “accurate prediction,” and you have the “no free lunch” theorem. Pick your favorite learner. (We’ll see many in this book.) For every world where it does better than random guessing, I, the devil’s advocate, will deviously construct one where it does worse by the same amount. All I have to do is flip the labels of all unseen instances. Since the labels of the observed ones agree, there’s no way your learner can distinguish between the world and the antiworld. On average over the two, it’s as good as random guessing. And therefore, on average over all possible worlds, pairing each world with its antiworld, your learner is equivalent to flipping coins.
Don’t give up on machine learning or the Master Algorithm just yet, though. We don’t care about all possible worlds, only the one we live in. If we know something about the world and incorporate it into our learner, it now has an advantage over random guessing. To this Hume would reply that that knowledge must itself have come from induction and is therefore fallible. That’s true, even if the knowledge was encoded into our brains by evolution, but it’s a risk we’ll have to take. We can also ask whether there’s a nugget of knowledge so incontestable, so fundamental, that we can build all induction on top of it. (Something like Descartes’ “I think, therefore I am,” although it’s hard to see how to turn that one into a learning algorithm.) I think the answer is yes, and we’ll see what that nugget is in Chapter 9.