Also keep in mind that the hierarchy shown above is a hierarchy of concepts. These recognizers are not physically placed above each other; because of the thin construction of the neocortex, it is physically only one pattern recognizer high. The conceptual hierarchy is created by the connections between the individual pattern recognizers.

An important attribute of the PRTM is how the recognitions are made inside each pattern recognition module. Stored in the module is a weight for each input dendrite indicating how important that input is to the recognition. The pattern recognizer has a threshold for firing (which indicates that this pattern recognizer has successfully recognized the pattern it is responsible for). Not every input pattern has to be present for a recognizer to fire. The recognizer may still fire if an input with a low weight is missing, but it is less likely to fire if a high-importance input is missing. When it fires, a pattern recognizer is basically saying, “The pattern I am responsible for is probably present.”
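The weight-and-threshold behavior described above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `PatternRecognizer` class, the input names `a`, `b`, and `c`, the weights, and the threshold value do not come from the text. The sketch also uses a hard cutoff, whereas the text describes firing as more or less likely.

```python
from dataclasses import dataclass


@dataclass
class PatternRecognizer:
    """Toy recognizer: fires when enough weighted input evidence is present."""
    weights: dict[str, float]  # importance of each input (hypothetical values)
    threshold: float           # fraction of total weight needed to fire (assumed)

    def score(self, active_inputs: set[str]) -> float:
        """Fraction of the total importance weight carried by the active inputs."""
        total = sum(self.weights.values())
        present = sum(w for name, w in self.weights.items() if name in active_inputs)
        return present / total

    def fires(self, active_inputs: set[str]) -> bool:
        """True when the weighted evidence reaches the firing threshold."""
        return self.score(active_inputs) >= self.threshold


recognizer = PatternRecognizer(weights={"a": 3.0, "b": 2.0, "c": 1.0}, threshold=0.8)

print(recognizer.fires({"a", "b", "c"}))  # all inputs present
print(recognizer.fires({"a", "b"}))       # low-weight input "c" missing
print(recognizer.fires({"b", "c"}))       # high-weight input "a" missing
```

With these made-up numbers the recognizer still fires when only the low-weight input is absent, but not when the high-weight input is absent.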

Successful recognition by a module of its pattern goes beyond just counting the input signals that are activated (even a count weighted by the importance parameter). The size (of each input) matters. There is another parameter (for each input) indicating the expected size of the input, and yet another indicating how variable that size is. To appreciate how this works, suppose we have a pattern recognizer that is responsible for recognizing the spoken word “steep.” This spoken word has four sounds: [s], [t], [E], and [p]. The [t] phoneme is what is known as a “dental consonant,” meaning that it is produced when the tongue releases a burst of noise as air breaks its contact with the upper teeth. It is essentially impossible to articulate the [t] phoneme slowly. The [p] phoneme is considered a “plosive consonant” or “oral occlusive,” meaning that it is created when the vocal tract is suddenly blocked (by the lips in the case of [p]) so that air no longer passes. It is also necessarily quick. The [E] vowel is caused by resonances of the vocal cords and open mouth. It is considered a “long vowel,” meaning that it persists for a much longer period of time than consonants such as [t] and [p]; however, its duration can be quite variable. The [s] phoneme is known as a “sibilant consonant,” and is caused by the passage of air against the edges of the teeth, which are held close together. Its duration is typically shorter than that of a long vowel such as [E], but it is also variable (in other words, the [s] can be said quickly or you can drag it out).

In our work in speech recognition, we found that it is necessary to encode this type of information in order to recognize speech patterns. For example, the words “step” and “steep” are very similar. Although the [e] phoneme in “step” and the [E] in “steep” are somewhat different vowel sounds (in that they have different resonant frequencies), it is not reliable to distinguish these two words based on these often confusable vowel sounds. It is much more reliable to consider the observation that the [e] in “step” is relatively brief compared with the [E] in “steep.”

We can encode this type of information with two numbers for each input: the expected size and the degree of variability of that size. In our “steep” example, [t] and [p] would both have a very short expected duration as well as a small expected variability (that is, we do not expect to hear long t’s and p’s). The [s] sound would have a short expected duration but a larger variability because it is possible to drag it out. The [E] sound has a long expected duration as well as a high degree of variability.
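As a rough sketch of this two-number encoding, the example below stores an expected duration and a variability for each sound and scores an utterance against "steep" and "step." The millisecond values are invented for illustration and are not measurements from this chapter. Treating each duration as Gaussian is also an assumption, as are the names `STEEP`, `STEP`, `log_size_likelihood`, and `word_score`.

```python
import math

# Hypothetical (expected duration, variability) pairs in milliseconds.
# These numbers are illustrative only, not measured values.
STEEP = {"s": (100.0, 40.0), "t": (30.0, 10.0), "E": (200.0, 60.0), "p": (30.0, 10.0)}
STEP = {"s": (100.0, 40.0), "t": (30.0, 10.0), "e": (80.0, 25.0), "p": (30.0, 10.0)}


def log_size_likelihood(observed, expected, variability):
    """Log density of a Gaussian with mean `expected` and std dev `variability`."""
    z = (observed - expected) / variability
    return -0.5 * z * z - math.log(variability) - 0.5 * math.log(2.0 * math.pi)


def word_score(model, durations):
    """Sum the per-sound log likelihoods; sounds are matched by position."""
    return sum(
        log_size_likelihood(obs, expected, variability)
        for obs, (expected, variability) in zip(durations, model.values())
    )


long_vowel = [110.0, 30.0, 210.0, 30.0]   # utterance with a drawn-out vowel
short_vowel = [110.0, 30.0, 70.0, 30.0]   # utterance with a brief vowel

print(word_score(STEEP, long_vowel) > word_score(STEP, long_vowel))
print(word_score(STEP, short_vowel) > word_score(STEEP, short_vowel))
```

Under these assumed parameters both comparisons print `True`: the vowel's duration alone is enough to separate the two words, even though the other three sounds are scored identically.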

In our speech examples, the “size” parameter refers to duration, but time is only one possible dimension. In our work in character recognition, we found that comparable spatial information was important in order to recognize printed letters (for example, the dot over the letter “i” is expected to be much smaller than the portion under the dot). At much higher levels of abstraction, the neocortex will deal with patterns with all sorts of continuums, such as levels of attractiveness, irony, happiness, frustration, and myriad others. We can draw similarities across rather diverse continuums, as Darwin did when he related the physical size of geological canyons to the amount of differentiation among species.

In a biological brain, these parameters come from the brain’s own experience. We are not born with an innate knowledge of phonemes; indeed different languages have very different sets of them. This implies that multiple examples of a pattern are encoded in the learned parameters of each pattern recognizer (as it requires multiple instances of a pattern to ascertain the expected distribution of magnitudes of the inputs to the pattern). In some AI systems, these types of parameters are hand-coded by experts (for example, linguists who can tell us the expected durations of different phonemes, as I articulated above). In my own work, we found that having an AI system discover these parameters on its own from training data (similar to the way the brain does it) was a superior approach. Sometimes we used a hybrid approach; that is, we primed the system with the intuition of human experts (for the initial settings of the parameters) and then had the AI system automatically refine these estimates using a learning process from real examples of speech.
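One simple way to picture learning these parameters from examples is to take the sample mean as the expected size and the sample standard deviation as the variability. The sketch below does that, and also shows one possible form of the hybrid approach, in which an expert's initial guess is treated as if it were a number of prior observations. This is an illustrative assumption, not the learning procedure used in the work described above. The function names, the pseudo-count blending rule, and all numbers are hypothetical.

```python
from statistics import mean, stdev


def estimate_size_parameters(observed_sizes):
    """Return (expected size, variability) as the sample mean and std dev."""
    return mean(observed_sizes), stdev(observed_sizes)


def refine_expected_size(expert_guess, expert_confidence, observed_sizes):
    """Blend an expert's initial estimate with the observed examples.

    `expert_confidence` acts as a pseudo-count (an assumption of this sketch):
    the guess is weighted as if it had been observed that many times.
    """
    n = len(observed_sizes)
    blended_total = expert_confidence * expert_guess + n * mean(observed_sizes)
    return blended_total / (expert_confidence + n)


# Hypothetical vowel durations in milliseconds from three training utterances.
durations = [190.0, 210.0, 200.0]

expected, variability = estimate_size_parameters(durations)
print(expected, variability)

# Expert initially guessed 150 ms, weighted like three observations.
print(refine_expected_size(150.0, 3.0, durations))
```

As more real examples arrive, `n` grows and the expert's initial guess has less and less influence on the refined estimate.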

What the pattern recognition module is doing is computing the probability (that is, the likelihood based on all of its previous experience) that the pattern that it is responsible for recognizing is in fact currently represented by its active inputs. Each particular input to the module is active if the corresponding lower-level pattern recognizer is firing (meaning that that lower-level pattern was recognized). Each input also encodes the observed size (on some appropriate dimension such as temporal duration or physical magnitude or some other continuum) so that the size can be compared (with the stored size parameters for each input) by the module in computing the overall probability of the pattern.
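To make the ingredients concrete, here is a minimal sketch that combines which inputs are active, the observed size of each, the stored size parameters, and the importance weights into a single score between 0 and 1. It is a simple weighted average chosen for illustration, not a calibrated probability and not the method used in the author's systems. The `InputParams` class, the function names, and every number are assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class InputParams:
    weight: float         # importance of this input (hypothetical)
    expected_size: float  # expected size, here a duration in ms (hypothetical)
    variability: float    # how much that size is expected to vary (hypothetical)


def size_match(observed, params):
    """1.0 when the observed size equals the expected size, falling toward 0."""
    z = (observed - params.expected_size) / params.variability
    return math.exp(-0.5 * z * z)


def recognition_score(params, observed):
    """Importance-weighted average of size matches.

    `observed` maps each active input to its observed size; an input that is
    not firing is simply absent from the dict and contributes no evidence.
    """
    total = sum(p.weight for p in params.values())
    evidence = sum(
        p.weight * size_match(observed[name], p)
        for name, p in params.items()
        if name in observed
    )
    return evidence / total


steep = {
    "s": InputParams(1.0, 100.0, 40.0),
    "t": InputParams(1.0, 30.0, 10.0),
    "E": InputParams(2.0, 200.0, 60.0),
    "p": InputParams(1.0, 30.0, 10.0),
}

print(recognition_score(steep, {"s": 100.0, "t": 30.0, "E": 200.0, "p": 30.0}))
print(recognition_score(steep, {"t": 30.0, "E": 200.0, "p": 30.0}))  # [s] missing
print(recognition_score(steep, {"s": 100.0, "t": 30.0, "p": 30.0}))  # [E] missing
```

In this sketch a missing high-weight input lowers the score more than a missing low-weight one, and an input that is present but far from its expected size counts for little.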

How does the brain (and how can an AI system) compute the overall probability that the pattern (that the module is responsible for recognizing) is present given (1) the inputs (each with an observed size), (2) the stored parameters on size (the expected size and the variability of size) for each input, and (3) the parameters of the importance of each input? In the 1980s and 1990s, I and others pioneered a mathematical method called hierarchical hidden Markov models for learning these parameters and then using them to recognize hierarchical patterns. We used this technique in the recognition of human speech as well as the understanding of natural language. I describe this approach further in chapter 7.

Getting back to the flow of recognition from one level of pattern recognizers to the next, in the above example we see the information flow up the conceptual hierarchy from basic letter features to letters to words. Recognitions will continue to flow up from there to phrases and then more complex language structures. If we go up several dozen more levels, we get to higher-level concepts like irony and envy. Even though every pattern recognizer is working simultaneously, it does take time for recognitions to move upward in this conceptual hierarchy. Traversing each level takes from a few hundredths to a few tenths of a second. Experiments have shown that recognizing a moderately high-level pattern such as a face takes at least a tenth of a second. It can take as long as an entire second if there are significant distortions. If the brain were sequential (like conventional computers) and were performing each pattern recognition in sequence, it would have to consider every possible low-level pattern before moving on to the next level. Thus it would take many millions of cycles just to go through each level. That is exactly what happens when we simulate these processes on a computer. Keep in mind, however, that computers process millions of times faster than our biological circuits.