In the meantime, the practical consequence of the “no free lunch” theorem is that there’s no such thing as learning without knowledge. Data alone is not enough. Starting from scratch will only get you to scratch. Machine learning is a kind of knowledge pump: we can use it to extract a lot of knowledge from data, but first we have to prime the pump.
Machine learning is what mathematicians call an ill-posed problem: it doesn’t have a unique solution. Here’s a simple ill-posed problem: Which two numbers add up to 1,000? Assuming the numbers are positive, there are five hundred possible answers: 1 and 999, 2 and 998, and so on. The only way to solve an ill-posed problem is to introduce additional assumptions. If I tell you the second number is triple the first, bingo: the answer is 250 and 750.
Tom Mitchell, a leading symbolist, calls it “the futility of bias-free learning.” In ordinary life, bias is a pejorative word: preconceived notions are bad. But in machine learning, preconceived notions are indispensable; you can’t learn without them. In fact, preconceived notions are also indispensable to human cognition, but they’re hardwired into the brain, and we take them for granted. It’s biases over and beyond those that are questionable.
Aristotle said that there is nothing in the intellect that was not first in the senses. Leibniz added, “Except the intellect itself.” The human brain is not a blank slate because it’s not a slate. A slate is passive, something you write on, but the brain actively processes the information it receives. Memory is the slate it writes on, and it does start out blank. On the other hand, a computer is a blank slate until you program it; the active process itself has to be written into memory before anything can happen. Our goal is to figure out the simplest program we can write such that it will continue to write itself by reading data, without limit, until it knows everything there is to know.
Machine learning has an unavoidable element of gambling. In the first Dirty Harry movie, Clint Eastwood chases a bank robber, repeatedly firing at him. Finally, the robber is lying next to a loaded gun, unsure whether to spring for it. Did Harry fire six shots or only five? Harry sympathizes (so to speak): “You’ve got to ask yourself one question: ‘Do I feel lucky?’ Well, do you, punk?” That’s the question machine learners have to ask themselves every day when they go to work: Do I feel lucky today? Just like evolution, machine learning doesn’t get it right every time; in fact, errors are the rule, not the exception. But it’s OK, because we discard the misses and build on the hits, and the cumulative result is what matters. Once we acquire a new piece of knowledge, it becomes a basis for inducing yet more knowledge. The only question is where to begin.
Priming the knowledge pump
In the Principia, along with his three laws of motion, Newton enunciates four rules of induction. Although these are much less well known than the physical laws, they are arguably as important. The key rule is the third one, which we can paraphrase thus:
Newton’s Principle: Whatever is true of everything we’ve seen is true of everything in the universe.
It’s not an exaggeration to say that this innocuous-sounding statement is at the heart of the Newtonian revolution and of modern science. Kepler’s laws applied to exactly six entities: the planets of the solar system known in his time. Newton’s laws apply to every last speck of matter in the universe. The leap in generality between the two is staggering, and it’s a direct consequence of Newton’s principle. This one principle is all by itself a knowledge pump of phenomenal power. Without it there would be no laws of nature, only a forever incomplete patchwork of small regularities.
Newton’s principle is the first unwritten rule of machine learning. We induce the most widely applicable rules we can and reduce their scope only when the data forces us to. At first sight this may seem ridiculously overconfident, but it’s been working for science for over three hundred years. It’s certainly possible to imagine a universe so varied and capricious that Newton’s principle would systematically fail, but that’s not our universe.
Newton’s principle is only the first step, however. We still need to figure out what is true of everything we’ve seen-how to extract the regularities from the raw data. The standard solution is to assume we know the form of the truth, and the learner’s job is to flesh it out. For example, in the dating problem you could assume that your friend’s answer is determined by a single factor, in which case learning just consists of checking each known factor (day of week, type of date, weather, and TV programming) to see if it correctly predicts her answer every time. The problem, of course, is that none of them do! You gambled and failed. So you relax your assumptions a bit. What if your friend’s answer is determined by a conjunction of two factors? With four factors, each with two possible values, there are twenty-four possibilities to check (six pairs of factors to pick from times two choices for each factor’s value). Now we have an embarrassment of riches: four conjunctions of two factors correctly predict the outcome! What to do? If you’re feeling lucky, you can just pick one of them and hope for the best. A more sensible option, though, is democracy: let them vote, and pick the winning prediction.
If all conjunctions of two factors fail, you can try all conjunctions of any number of factors. Machine learners and psychologists call these “conjunctive concepts.” Dictionary definitions are conjunctive concepts: a chair has a seat and a back and some number of legs. Remove any of these and it’s no longer a chair. A conjunctive concept is what Tolstoy had in mind when he wrote the opening sentence of Anna Karenina: “All happy families are alike; each unhappy family is unhappy in its own way.” The same is true of individuals. To be happy, you need health, love, friends, money, a job you like, and so on. Take any of these away, and misery ensues.
In machine learning, examples of a concept are called positive examples, and counterexamples are called negative examples. If you’re trying to learn to recognize cats in images, images of cats are positive examples and images of dogs are negative ones. If you compiled a database of families from the world’s literature, the Karenins would be a negative example of a happy family, and there would be precious few positive examples.
Starting with restrictive assumptions and gradually relaxing them if they fail to explain the data is typical of machine learning, and the process is usually carried out automatically by the learner, without any help from you. First, it tries all single factors, then all conjunctions of two factors, then all conjunctions of three, and so on. But now we run into a problem: there are a lot of conjunctive concepts and not enough time to try them all out.
The dating example is a little deceptive because it’s very small (four variables and four examples). But suppose now that you run an online dating service and you need to figure out which couples to match. If each user of your system has filled out a questionnaire with answers to fifty yes/no questions, each potential match is characterized by one hundred attributes, fifty from each member of the prospective couple. Based on the couples that have gone on a date and reported the outcome, can you find a conjunctive definition for the concept of a “good match”? There are 3100 possible definitions to try. (The three options for each attribute are yes, no, and not part of the concept.) Even with the fastest computer in the world, the couples will all be long gone-and your company bankrupt-by the time you’re done, unless you’re lucky and a very short definition hits the jackpot. So many rules, so little time. We need to do something smarter.