You did it! Or did you? Absorbing SVMs into neural networks and neural networks into graphical models: that worked. So did absorbing genetic programs into logic. But combining logic and graphical models? Something is amiss there. Belatedly, you see the problem: logic has a dimension that graphical models lack and vice versa. The sculptures in the five chambers matched because they were simple allegories, but the reality doesn’t. Graphical models don’t let us represent rules involving more than one object, like Friends of friends are friends; all their variables have to be properties of the same object. They also can’t represent arbitrary programs, which pass sets of variables from one subroutine to another. Logic can easily do both of these things, but on the other hand it can’t represent uncertainty, ambiguity, or degrees of similarity. And without a representation that can do all of these things, you don’t have a universal learner.
You rack your brains for a solution, but the more you try, the harder it gets. Perhaps unifying logic and probability is just beyond human ability. Exhausted, you fall asleep. A deep growl jolts you awake. The hydra-headed complexity monster pounces on you, jaws snapping, but you duck at the last moment. Slashing desperately at the monster with the sword of learning, the only one that can slay it, you finally succeed in cutting off all its heads. Before it can grow new ones, you run up the stairs.
After an arduous climb, you reach the top. A wedding is in progress. Praedicatus, First Lord of Logic, ruler of the symbolic realm and Protector of the Programs, says to Markovia, Princess of Probability, Empress of Networks: “Let us unite our realms. To my rules thou shalt add weights, begetting a new representation that will spread far across the land.” The princess says, “And we shall call our progeny Markov logic networks.”
Your head is spinning. You go outside to the balcony. The sun has risen over the city. You gaze out over the rooftops to the countryside beyond. Forests of servers stretch away in all directions, humming quietly, waiting for the Master Algorithm. Convoys move along the roads, carrying gold from the data mines. Far to the west, the land gives way to a sea of information, dotted with ships. You look up at the flag of the Master Algorithm. You can now clearly see the inscription inside the five-pointed star:
P = e w•n / Z
What could this mean, you wonder?
Markov logic networks
In 2003, I started thinking about the problem of how to unify logic and probability, together with my student Matt Richardson. At first we made little progress because we were trying to do it with Bayesian networks, and their rigid form-a strict order on variables, conditional distributions of children given parents-is incompatible with the flexibility of logic. But the day before Christmas Eve, I realized there was a much better way. If we switched to Markov networks, we could use any logical formula as a template for Markov network features, and that would unify logic and graphical models. Let’s see how.
Recall that a Markov network is defined by a weighted sum of features, much like a perceptron. Suppose we have a collection of photos of people. We pick a random one and compute features of it like The person has gray hair, The person is old, The person is a woman, and so on. In a perceptron, we pass the weighted sum of these features through a threshold to decide whether, say, the person is your grandmother or not. In a Markov network, we do something very different (at least at first sight): we exponentiate the weighted sum, turning it into a product of factors, and this product is the probability of choosing that particular picture from the collection, regardless of whether your grandmother is in it. If you have many pictures of old people, the weight of that feature goes up. If most of them are of men, the weight of The person is a woman goes down. The features can be anything we want, making Markov networks a remarkably flexible way to represent probability distributions.
Actually, I lied: the product of factors is not yet a probability because the probabilities of all pictures must add up to one, and there’s no guarantee that the products of factors for all pictures will do so. We need to normalize them, meaning divide each product by the sum of all of them. The sum of all the normalized products is then guaranteed to be one because it’s just a number divided by itself. The probability of a picture is thus the weighted sum of its features, exponentiated and normalized. If you look back at the equation in the five-pointed star, you’ll probably start to get an inkling of what it means. P is a probability, w is a vector of weights (notice it’s in boldface), n is a vector of numbers, and their dot product • is exponentiated and divided by Z, the sum of all products. If we let the first component of n be one if the first feature of the image is true and zero otherwise, and so on, w•n is just a shorthand for the weighted sum of features we’ve been talking about all along.
So the equation gives the probability of an image (or whatever) according to a Markov network. But it’s more general than that because it’s not just the equation of a Markov network; rather, it’s the equation of a Markov logic network, as we call it. In a Markov logic network, or MLN for short, the numbers in n don’t have to be just zero or one, and they don’t refer to features-they refer to logical formulas. At the end of Chapter 8, we saw how we can go beyond Markov networks to relational models, which are defined in terms of feature templates, not just features. Alice and Bob both have the flu is a feature specific to Alice and Bob. X and Y both have the flu is a feature template, which can be instantiated with Alice and Bob, Alice and Chris, and any other two people. A feature template is a powerful thing because it can summarize billions of features or more in a single short expression. But we need a formal language to define feature templates, and we have one readily available: logic.
An MLN is just a set of logical formulas and their weights. When applied to a particular set of entities, it defines a Markov network over their possible states. For example, if the entities are Alice and Bob, a possible state is that Alice and Bob are friends, Alice has the flu, and so does Bob. Let’s suppose the MLN has two formulas: Everyone has the flu and If someone has the flu, so do their friends. In standard logic, this would be a pretty useless pair of statements: the first would rule out any state with even a single healthy person, and the second would be redundant. But in an MLN, the first formula just means that there’s a feature X has the flu for every person X, with the same weight as the formula. If people are likely to have the flu, the formula will have a high weight, and so will the corresponding features. A state with many healthy people is less probable than one with few, but not impossible. And because of the second formula, a state where someone has the flu and their friends don’t is less probable than one where healthy and infected people fall into separate clusters of friends.
At this point you can probably guess what the n in the master equation is: its first component is the number of true instances of the first formula in the state, the second is the number of true instances of the second formula, and so on. If we’re looking at a group of ten friends and seven of them have the flu, the first component of n is seven, and so on. (Shouldn’t the probability be different if seven out of twenty instead of seven out of ten friends have the flu? Yes, and it is, because of Z.) In the limit, if we let all the weights go to infinity, Markov logic reduces to standard logic because violating a single instance of a formula then causes the probability to collapse to zero, making the state impossible. On the probabilistic side, an MLN reduces to a Markov network when all the formulas talk about a single object. So Markov logic includes both logic and Markov networks as special cases, and it’s the unification we were looking for.