In machine learning, as elsewhere in computer science, there’s nothing better than getting such a combinatorial explosion to work for you instead of against you. What’s clever about genetic algorithms is that each string implicitly contains an exponential number of building blocks, known as schemas, and so the search is a lot more efficient than it seems. This is because every subset of the string’s bits is a schema, representing some potentially fit combination of properties, and a string has an exponential number of subsets. We can represent a schema by replacing the bits in the string that aren’t part of it with *. For example, the string 110 contains the schemas ***, **0, *1*, 1**, *10, 11*, 1*0, and 110. We get a different schema for every different choice of bits to include; since we have two choices for each bit (include/don’t include), we have 2 n schemas. Conversely, a particular schema may be represented in many different strings in a population, and is implicitly evaluated every time they are. Suppose that a hypothesis’s probability of surviving into the next generation is proportional to its fitness. Holland showed that, in this case, the fitter a schema’s representatives in one generation are compared to the average, the more of them we can expect to see in the next generation. So, while the genetic algorithm explicitly manipulates strings, it implicitly searches the much larger space of schemas. Over time, fitter schemas come to dominate the population, and so unlike the drunkard, the genetic algorithm finds its way home.
One of the most important problems in machine learning-and life-is the exploration-exploitation dilemma. If you’ve found something that works, should you just keep doing it? Or is it better to try new things, knowing it could be a waste of time but also might lead to a better solution? Would you rather be a cowboy or a farmer? Start a company or run an existing one? Go steady or play the field? A midlife crisis is the yearning to explore after many years spent exploiting. On an impulse, you fly to Vegas, ready to gamble away your life’s savings on the chance of becoming a millionaire. You enter the first casino and face a row of slot machines. The one to play is the one that gives you the best payoff on average, but you don’t know which that is. You have to try each one enough times to figure it out. But if you do this for too long, you waste your money on losing machines. Conversely, if you jump the gun and pick a machine that looked good by chance on the first few turns but is in fact not the best one, you waste your money playing it for the rest of the night. That’s the exploration-exploitation dilemma. Each time you play, you have to choose between repeating the best move you’ve found so far, which gives you the best payoff, or trying other moves, which gather information that may lead to even better payoffs. With two slot machines, Holland showed that the optimal strategy is to flip a biased coin each time, where the coin becomes exponentially more biased as you go along. (Don’t sue me if it doesn’t work for you, though. Remember the house always wins in the end.) The better a slot machine looks, the more you should play it, but never completely give up on the other one, in case it turns out to be the best one after all.
A genetic algorithm is like the ringleader of a group of gamblers, playing slot machines in every casino in town at the same time. Two schemas compete with each other if they include the same bits and differ in at least one of them, like *10 and *11, and n competing schemas are like n slot machines. Every set of competing schemas is a casino, and the genetic algorithm simultaneously figures out the winning machine in every casino, following the optimal strategy of playing the better-seeming machines with exponentially increasing frequency. Pretty smart.
In The Hitchhiker’s Guide to the Galaxy, an alien race builds a massive supercomputer to answer the ultimate question, and after a long time the computer spits out “42.” But the computer also points out that the aliens don’t know what the question is, so they build an even bigger computer to figure that out. This computer-otherwise known as planet Earth-is unfortunately destroyed to make way for a space freeway minutes before finishing its multimillion-year computation. We can only guess at the question now, but perhaps it was: Which slot machine should you play?
Survival of the fittest programs
For the first few decades, the genetic algorithms community consisted mainly of John Holland, his students, and their students. Circa 1983, the biggest problem genetic algorithms had been able to solve was learning to control gas pipeline systems. But then, at around the same time neural networks were making their comeback, interest in evolutionary computation took off. The first international conference on genetic algorithms was held in Pittsburgh in 1985, and a Cambrian explosion of genetic algorithm variants was under way. Some of these tried to model evolution more closely-the basic genetic algorithm was only a very crude approximation, after all-and others radiated in very different directions, crossing over evolutionary ideas with computer science concepts that would have bemused Darwin.
One of Holland’s more remarkable students was John Koza. In 1987, while flying back to California from a conference in Italy, he had a lightbulb moment. Instead of evolving comparatively simple things like If… then… rules and gas pipeline controllers, why not evolve full-blown computer programs? And if that’s the goal, why stick with bit strings as the representation? A program is really a tree of subroutine calls, so better to directly cross over those subtrees than to shoehorn them into bit strings and run the risk of destroying perfectly good subroutines when you cross them over at a random point.
For example, suppose you want to evolve a program to compute the duration of a planet’s year, T, from its average distance to the sun, D. According to Kepler’s third law, T is the square root of D cubed, times a constant C that depends on the units you use for time and distance. A genetic algorithm should be able to discover this by looking at Tycho Brahe’s data on planetary motions like Kepler did. In Koza’s approach, D and C are the leaves of a program tree, and the operations that combine them, like multiplication and taking the square root, are the internal nodes. The following program tree correctly computes T:
In genetic programming, as Koza called his method, we cross over two program trees by randomly swapping two of their subtrees. For example, crossing over these two trees at the highlighted nodes yields the correct program for computing T as one of the children:
We can measure a program’s fitness (or lack thereof) by the distance between its output and the correct one on the training data. For example, if the program says an Earth year is three hundred days, that would subtract sixty-five points from its fitness. Starting with a population of random program trees, genetic programming uses crossover, mutation, and survival to gradually evolve better programs until it’s satisfied.