Since our goal is to produce the best spam filter we can, as opposed to faithfully simulating real natural selection, we can cheat liberally by modifying the algorithm to fit our needs. One way in which genetic algorithms routinely cheat is by allowing immortality. (Too bad we can’t do that in real life.) That way, a highly fit individual doesn’t simply compete to reproduce within its own generation, but also with its children, and then its grandchildren, great-grandchildren, and so on, as long as it remains one of the fittest individuals in the population. In contrast, in the real world the best a highly fit individual can do is pass on half its genes to many children, each of which will probably be less fit because of the genes it inherited from its other parent. Immortality avoids this backsliding and with any luck, lets the algorithm reach the desired fitness sooner. Of course, since the fittest humans in history as measured by number of descendants are the likes of Genghis Khan-ancestor to one in two hundred men alive today-perhaps it’s not so bad that in real life immortality is verboten.
If we want to evolve a whole set of spam-filtering rules, not just one, we can represent a candidate set of n rules by a string of n × 20,000 bits (20,000 for each rule, assuming ten thousand different words in the data, as before). Rules containing 00 for some word effectively disappear from the rule set, since they don’t match any e-mails, as we saw before. If an e-mail matches any rule in the set, it’s classified as spam; otherwise it’s legit. We can still let fitness be the percentage of correctly classified e-mails, but to combat overfitting, we’ll probably want to subtract from it a penalty proportional to the total number of active conditions in the rule set.
We can get even fancier by allowing rules for intermediate concepts to evolve, and then chaining these rules at performance time. For example, we could evolve the rules If the e-mail contains the word loan then it’s a scam and If the e-mail is a scam then it’s spam. Since a rule’s consequent is no longer always spam, this requires introducing additional bits in rule strings to represent their consequents. Of course, the computer doesn’t literally use the word scam; it just comes up with some arbitrary bit string to represent the concept, but that’s good enough for our purposes. Sets of rules like this, which Holland called classifier systems, are one of the workhorses of the machine-learning tribe he founded: the evolutionaries. Like multilayer perceptrons, classifier systems face the credit-assignment problem-what is the fitness of rules for intermediate concepts?-and Holland devised the so-called bucket brigade algorithm to solve it. Nevertheless, classifier systems are much less widely used than multilayer perceptrons.
Compared to the simple model in Fisher’s book, genetic algorithms are quite a leap forward. Darwin lamented his lack of mathematical ability, but if he had lived a century later he probably would have yearned for programming prowess instead. Indeed, capturing natural selection by a set of equations is extremely difficult, but expressing it as an algorithm is another matter, and can shed light on many otherwise vexing questions. Why do species appear suddenly in the fossil record? Where’s the evidence that they evolved gradually from earlier species? In 1972, Niles Eldredge and Stephen Jay Gould proposed that evolution consists of a series of “punctuated equilibria,” alternating long periods of stasis with short bursts of rapid change, like the Cambrian explosion. This sparked a heated debate, with critics of the theory nicknaming it “evolution by jerks” and Eldredge and Gould retorting that gradualism is “evolution by creeps.” Experience with genetic algorithms lends support to the jerks. If you run a genetic algorithm for one hundred thousand generations and observe the population at one-thousand-generation intervals, the graph of fitness against time will probably look like an uneven staircase, with sudden improvements followed by flat periods that tend to become longer over time. It’s also not hard to see why. Once the algorithm reaches a local maximum of fitness-a peak in the fitness landscape-it will stay there for a long time until a lucky mutation or crossover lands an individual on the slope to a higher peak, at which point that individual will multiply and climb up the slope with each passing generation. And the higher the current peak, the longer before that happens. Of course, natural evolution is more complicated than this: for one, the environment may change, either physically or because other organisms have themselves evolved, and an organism that was on a fitness peak may suddenly find itself under pressure to evolve again. So, while helpful, current genetic algorithms are far from the end of the story.
The exploration-exploitation dilemma
Notice how much genetic algorithms differ from multilayer perceptrons. Backprop entertains a single hypothesis at any given time, and the hypothesis changes gradually until it settles into a local optimum. Genetic algorithms consider an entire population of hypotheses at each step, and these can make big jumps from one generation to the next, thanks to crossover. Backprop proceeds deterministically after setting the initial weights to small random values. Genetic algorithms, in contrast, are full of random choices: which hypotheses to keep alive and cross over (with fitter hypotheses being more likely candidates), where to cross two strings, which bits to mutate. Backprop learns weights for a predefined network architecture; denser networks are more flexible but also harder to learn. Genetic algorithms make no a priori assumptions about the structures they will learn, other than their general form.
Because of all this, genetic algorithms are much less likely than backprop to get stuck in a local optimum and in principle better able to come up with something truly new. But they are also much more difficult to analyze. How do we know a genetic algorithm will get somewhere meaningful instead of randomly walking around like the proverbial drunkard? The key is to think in terms of building blocks. Every subset of a string’s bits potentially encodes a useful building block, and when we cross over two strings, those building blocks come together into a larger one, which in turn becomes grist for the mill. Holland likes to use police sketches to illustrate the power of building blocks. In the days before computers, a police artist could quickly put together a portrait of a suspect from eyewitness interviews by selecting a mouth from a set of paper strips depicting typical mouth shapes and doing the same for the eyes, nose, chin, and so on. With only ten building blocks and ten options for each, this system would allow for ten billion different faces, more than there are people on Earth.