Читать онлайн "The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World" - Domingos Pedro - RuLit

In retrospect, we can see that Naïve Bayes, Markov chains, and HMMs are all special cases of Bayesian networks. The structure of Naïve Bayes is:

Markov chains encode the assumption that the future is conditionally independent of the past given the present. HMMs assume in addition that each observation depends only on the corresponding state. Bayesian networks are for Bayesians what logic is for symbolists: a lingua franca that allows us to elegantly encode a dizzying variety of situations and devise algorithms that work uniformly in all of them.

We can think of a Bayesian network as a “generative model,” a recipe for probabilistically generating a state of the world: first decide independently whether there’s a burglary and/or an earthquake, then based on that decide whether the alarm goes off, and then based on that whether Bob and Claire call. A Bayesian network tells a story: A happened, and it led to B; at the same time, C also happened, and B and C together caused D. To compute the probability of a particular story, we just multiply the probabilities of all of its different strands.

One of the most exciting applications of Bayesian networks is modeling how genes regulate each other in living cells. Billions of dollars have been spent trying to discover pairwise correlations between individual genes and specific diseases, but the yield has been disappointingly low. In retrospect, this is not so surprising: a cell’s behavior is the result of complex interactions among genes and the environment, and a single gene has limited predictive power. But with Bayesian networks, we can uncover these interactions, provided we have the requisite data, and with the spread of DNA microarrays, we increasingly do.

After pioneering the application of machine learning to spam filtering, David Heckerman turned to using Bayesian networks in the fight against AIDS. The AIDS virus is a tough adversary because it mutates rapidly, making it difficult for any one vaccine or drug to pin it down for long. Heckerman noticed that this is the same cat-and-mouse game that spam filters play with spam and decided to apply a lesson he had learned there: attack the weakest link. In the case of spam, weak links include the URLs you have to use to take payment from the customer. In the case of HIV, they’re small regions of the virus protein that can’t change without hurting the virus. If he could train the immune system to recognize these regions and attack the cells displaying them, he just might have an AIDS vaccine. Heckerman and coworkers used a Bayesian network to help identify the vulnerable regions and developed a vaccine delivery mechanism that could teach the immune system to attack just those regions. The delivery mechanism worked in mice, and clinical trials are now in preparation.

It often happens that, even after we take all conditional independences into account, some nodes in a Bayesian network still have too many parents. Some networks are so dense with arrows that when we print them, the page turns solid black. (The physicist Mark Newman calls them “ridiculograms.”) A doctor needs to simultaneously diagnose all the possible diseases a patient could have, not just one, and every disease is a parent of many different symptoms. A fever could be caused by any number of conditions besides the flu, but it’s hopeless to try to predict its probability given every possible combination of conditions. All is not lost. Instead of a table specifying the node’s conditional probability for every state of its parents, we can learn a simpler distribution. The most popular choice is a probabilistic version of the logical OR operation: any cause alone can provoke a fever, but each cause has a certain probability of failing to do so, even if it’s usually sufficient. Heckerman and others have learned Bayesian networks that diagnose hundreds of infectious diseases in this way. Google uses a giant Bayesian network of this type in its AdSense system for automatically choosing ads to place on web pages. The network relates a million content variables to each other and to twelve million words and phrases via over three hundred million arrows, all learned from a hundred billion text snippets and search queries.

On a lighter note, Microsoft’s Xbox Live uses a Bayesian network to rate players and match players of similar skill. The outcome of a game is a probabilistic function of the opponents’ skill levels, and using Bayes’ theorem we can infer a player’s skill from the outcomes of his games.

The inference problem

There’s a big snag in all of this, unfortunately. Just because a Bayesian network lets us compactly represent a probability distribution doesn’t mean we can also reason efficiently with it. Suppose you want to compute P(Burglary | Bob called, Claire didn’t). By Bayes’ theorem, you know this is just P(Burglary) P(Bob called, Claire didn’t | Burglary) / P(Bob called, Claire didn’t), or equivalently, P(Burglary, Bob called, Claire didn’t) / P(Bob called, Claire didn’t). If you had the full table with the probabilities of all states, you could obtain both of these probabilities by adding up the corresponding lines in the table. For example, P(Bob called, Claire didn’t) is the sum of the probabilities of all the lines where Bob calls and Claire doesn’t. But the Bayesian network doesn’t give you the full table. You could always construct it from the individual tables, but that takes exponential time and space. What we really want is to compute P(Burglary | Bob called, Claire didn’t) without building the full table. That, in a nutshell, is the problem of inference in Bayesian networks.

In many cases we can do this and avoid the exponential blowup. Suppose you’re leading a platoon in single file through enemy territory in the dead of night, and you want to make sure that all your soldiers are still with you. You could stop and count them yourself, but that wastes too much time. A cleverer solution is to just ask the first soldier behind you: “How many soldiers are behind you?” Each soldier asks the next the same question, until the last one says “None.” The next-to-last soldier can now say “One,” and so on all the way back to the first soldier, with each soldier adding one to the number of soldiers behind him. Now you know how many soldiers are still with you, and you didn’t even have to stop.

Siri uses the same idea to compute the probability that you just said, “Call the police” from the sounds it picked up from the microphone. Think of “Call the police” as a platoon of words marching across the page in single file. Police wants to know its probability, but for that it needs to know the probability of the; and the in turn needs to know the probability of call. So call computes its probability and passes it on to the, which does the same and passes the result to police. Now police knows its probability, duly influenced by every word in the sentence, but we never had to construct the full table of eight possibilities (the first word is call or isn’t, the second is the or isn’t, and the third is police or isn’t). In reality, Siri considers all words that could appear in each position, not just whether the first word is call or not and so on, but the algorithm is the same. Perhaps Siri thinks, based on the sounds, that the first word was either call or tell, the second was the or her, and the third was police or please. Individually, perhaps the most likely words are call, the, and please. But that forms the nonsensical sentence “Call the please,” so taking the other words into account, Siri concludes that the sentence is really “Call the police.” It makes the call, and with luck the police get to your house in time to catch the burglar.