Learning an MLN means discovering formulas that are true in the world more often than random chance would predict, and figuring out the weights for those formulas that cause their predicted probabilities to match their observed frequencies. Once we’ve learned an MLN, we can use it to answer questions like “What is the probability that Bob has the flu, given that he’s friends with Alice and she has the flu?” And guess what? It turns out that the probability is given by an S curve applied to the weighted sum of features, much as in a multilayer perceptron. And an MLN with long chains of rules can represent a deep neural network, with one layer per link in the chain.
Of course, don’t be deceived by the simple MLN above for predicting the spread of flu. Picture instead an MLN for diagnosing and curing cancer. The MLN represents a probability distribution over the states of a cell. Every part of the cell, every organelle, every metabolic pathway, every gene and protein is an entity in the MLN, and the MLN’s formulas encode the dependencies between them. We can ask the MLN, “Is this cell cancerous?” and probe it with different drugs and see what happens. We don’t have an MLN like this yet, but later in this chapter I’ll envisage how it might come about.
To recap: the unified learner we’ve arrived at uses MLNs as the representation, posterior probability as the evaluation function, and genetic search coupled with gradient descent as the optimizer. If we want, we can easily replace the posterior by some other accuracy measure, or genetic search by hill climbing. We’ve ascended a high peak, and now we can enjoy the view. I wouldn’t be so rash as to call this learner the Master Algorithm, however. For one, the proof of the pudding is in the eating, and although over the last decade this algorithm (or variations of it) has been successfully applied in many areas, there are many more to which it hasn’t, and so it’s not yet clear just how general purpose it is. Second, there are some important problems that it doesn’t solve. But before we look at them, let’s look at what it can do.
From Hume to your housebot
You can download the learner I’ve just described from alchemy.cs.washington.edu. We christened it Alchemy to remind ourselves that, despite all its successes, machine learning is still in the alchemy stage of science. If you do download it, you’ll see that it includes a lot more than the basic algorithm I’ve described but also that it is still missing a few things I said the universal learner ought to have, like crossover. Nevertheless, let’s use the name Alchemy to refer to our candidate universal learner for simplicity.
Alchemy addresses Hume’s original question by having another input besides the data: your initial knowledge, in the form of a set of logical formulas, with or without weights. The formulas can be inconsistent, incomplete, or even just plain wrong; the learning and probabilistic reasoning will take care of that. The key point is that Alchemy doesn’t have to learn from scratch. In fact, we can even tell Alchemy to keep the formulas unchanged and learn only the weights. In this case, giving Alchemy the appropriate formulas can turn it into a Boltzmann machine, a Bayesian network, an instance-based learner, and many other models. This explains why we can have a universal learner despite the “no free lunch” theorem. Rather, Alchemy is like an inductive Turing machine, which we can program to behave as a very powerful or a very restricted learner; it’s up to us. Alchemy provides a unifier for machine learning in the same way that the Internet provides one for computer networks, the relational model for databases, or the graphical user interface for everyday applications.
Of course, even if you use Alchemy with no initial formulas (and you can), that doesn’t make it knowledge-free. The choice of formal language, score function, and optimizer implicitly encodes assumptions about the world. So it’s natural to ask whether we can have an even more general learner than Alchemy. What did evolution assume when it began its long journey from the first bacteria to all the life-forms around today? I think there’s a simple assumption from which all else follows: the learner is part of the world. This means that the learner as a physical system obeys the same laws as its environment, whatever they are, and therefore already “knows” them implicitly and is primed to discover them. In the next section, we’ll see what this can mean concretely and how to embody it in Alchemy. But for the moment, let’s note that it’s perhaps the best answer we can ever give to Hume’s question. On the one hand, assuming the learner is part of the world is an assumption-in principle, the learner could obey different laws from those the world obeys-so it satisfies Hume’s dictum that learning is only possible with prior knowledge. On the other hand, it’s an assumption so basic and hard to disagree with that perhaps it’s all we need for this world.
At the other extreme, knowledge engineers-the most determined critics of machine learning-have good reason to like Alchemy. Instead of a basic model structure or a few rough guesses, Alchemy can input a large, lovingly assembled knowledge base, if it’s available. Because probabilistic rules can interact in much richer ways than deterministic ones, manually encoded knowledge goes a longer way in Markov logic. And since knowledge bases in Markov logic don’t have to be self-consistent, they can be very large and accommodate many different contributors without falling apart-a goal that has so far eluded knowledge engineers.
Most of all, though, Alchemy addresses the problems that each of the five tribes of machine learning has worked on for so long. Let’s look at each of them in turn.
Symbolists combine different pieces of knowledge on the fly, in the same way that mathematicians combine axioms to prove theorems. This contrasts sharply with neural networks and other models with a fixed structure. Alchemy does it using logic, as symbolists do, but with a twist. To prove a theorem in logic, you need to find only one sequence of axiom applications that produces it. Because Alchemy reasons probabilistically, it does more: it finds multiple sequences of formulas that lead to the theorem or its negation and weighs them to compute the theorem’s probability of being true. This way it can reason not just about mathematical universals, but about whether “the president” in a news story means “Barack Obama,” or what folder an e-mail should be filed in. The symbolists’ master algorithm, inverse deduction, postulates new logical rules needed to serve as steps between the data and a desired conclusion. Alchemy introduces new rules by hill climbing, starting with the initial rules and constructing rules that, combined with the initial ones and the data, make the conclusions more likely.
Connectionists’ models are inspired by the brain, with networks of S curves that correspond to neurons and weighted connections between them corresponding to synapses. In Alchemy, two variables are connected if they appear together in some formula, and the probability of a variable given its neighbors is an S curve. (Although I won’t show why, it’s a direct consequence of the master equation we saw in the previous section.) The connectionists’ master algorithm is backpropagation, which they use to figure out which neurons are responsible for which errors and adjust their weights accordingly. Backpropagation is a form of gradient descent, which Alchemy uses to optimize the weights of a Markov logic network.