The 4ESS switch used its new software to monitor its fellow switches as they recovered from faults. As other switches came back on line after recovery, they would send their "OK" signals to the switch. The switch would make a little note to that effect in its "status map," recognizing that the fellow switch was back and ready to go, and should be sent some calls and put back to regular work. Unfortunately, while it was busy bookkeeping with the status map, the tiny flaw in the brand-new software came into play. The flaw caused the 4ESS switch to interact, subtly but drastically, with incoming telephone calls from human users. If -- and only if -- two incoming phone-calls happened to hit the switch within a hundredth of a second, then a small patch of data would be garbled by the flaw. But the switch had been programmed to monitor itself constantly for any possible damage to its data. When the switch perceived that its data had been somehow garbled, then it too would go down, for swift repairs to its software. It would signal its fellow switches not to send any more work. It would go into the fault-recovery mode for four to six seconds. And then the switch would be fine again, and would send out its "OK, ready for work" signal.
However, the "OK, ready for work" signal was the *very thing that had caused the switch to go down in the first place.* And *all* the System 7 switches had the same flaw in their status-map software. As soon as they stopped to make the bookkeeping note that their fellow switch was "OK," then they too would become vulnerable to the slight chance that two phone-calls would hit them within a hundredth of a second. At approximately 2:25 p.m. EST on Monday, January 15, one of AT&T's 4ESS toll switching systems in New York City had an actual, legitimate, minor problem. It went into fault recovery routines, announced "I'm going down," then announced, "I'm back, I'm OK." And this cheery message then blasted throughout the network to many of its fellow 4ESS switches.
Many of the switches, at first, completely escaped trouble. These lucky switches were not hit by the coincidence of two phone calls within a hundredth of a second. Their software did not fail -- at first. But three switches -- in Atlanta, St. Louis, and Detroit -- were unlucky, and were caught with their hands full. And they went down. And they came back up, almost immediately. And they too began to broadcast the lethal message that they, too, were "OK" again, activating the lurking software bug in yet other switches. As more and more switches did have that bit of bad luck and collapsed, the call-traffic became more and more densely packed in the remaining switches, which were groaning to keep up with the load. And of course, as the calls became more densely packed, the switches were *much more likely* to be hit twice within a hundredth of a second.
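The dynamics of that chain reaction are simple enough to sketch in a few lines of code. The toy simulation below is my own illustration, not AT&T's software: it assumes a hundred identical switches, a two-percent baseline chance that two calls collide within the vulnerable hundredth of a second, and a four-to-six-second recovery time, with the collision odds rising as traffic packs into whichever switches remain up. Every number in it is invented for the sketch; only the feedback loop -- "OK" broadcast, garbled status map, crash, recovery, fresh "OK" broadcast -- follows the account above.

    import random

    class Switch:
        def __init__(self):
            self.recovery_left = 0   # seconds remaining in fault-recovery mode

        def hear_ok(self, collision_chance):
            # A switch is vulnerable only while up. If two calls happen to
            # collide within the hundredth-of-a-second window while it is
            # noting the "OK" in its status map, its data is garbled and it
            # takes itself down for four to six seconds of fault recovery.
            if self.recovery_left == 0 and random.random() < collision_chance:
                self.recovery_left = random.randint(4, 6)

    def simulate(n_switches=100, seconds=600, base_chance=0.02):
        switches = [Switch() for _ in range(n_switches)]
        ok_broadcasts = 1   # the first, legitimate "I'm back, I'm OK" signal
        for tick in range(seconds):
            up = [s for s in switches if s.recovery_left == 0]
            # As switches drop out, calls pack into the survivors, so the
            # odds of two calls within a hundredth of a second go up.
            chance = min(1.0, base_chance * n_switches / max(1, len(up)))
            for s in up:
                for _ in range(ok_broadcasts):
                    s.hear_ok(chance)
            ok_broadcasts = 0
            for s in switches:
                if s.recovery_left > 0:
                    s.recovery_left -= 1
                    if s.recovery_left == 0:
                        ok_broadcasts += 1   # recovered switch announces "OK"
            if tick % 60 == 0:
                down = sum(1 for s in switches if s.recovery_left > 0)
                print(f"minute {tick // 60}: {down} of {n_switches} switches down")

    simulate()

Run it and the count of crippled switches climbs and then oscillates, never settling at zero: each recovery manufactures the very broadcast that knocks the next switch over, which is the whole pathology in miniature.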
It only took four seconds for a switch to get well. There was no *physical* damage of any kind to the switches, after all. Physically, they were working perfectly. This situation was "only" a software problem. But the 4ESS switches were leaping up and down every four to six seconds, in a virulent spreading wave all over America, in utter, manic, mechanical stupidity. They kept *knocking* one another down with their contagious "OK" messages. It took about ten minutes for the chain reaction to cripple the network. Even then, switches would periodically luck-out and manage to resume their normal work. Many calls -- millions of them -- were managing to get through. But millions weren't. The switching stations that used System 6 were not directly affected. Thanks to these old-fashioned switches, AT&T's national system avoided complete collapse. This fact also made it clear to engineers that System 7 was at fault.
Bell Labs engineers, working feverishly in New Jersey, Illinois, and Ohio, first tried their entire repertoire of standard network remedies on the malfunctioning System 7. None of the remedies worked, of course, because nothing like this had ever happened to any phone system before.
By cutting out the backup safety network entirely, they were able to reduce the frenzy of "OK" messages by about half. The system then began to recover, as the chain reaction slowed. By 11:30 p.m. on Monday, January 15, sweating engineers on the midnight shift breathed a sigh of relief as the last switch cleared up.
By Tuesday they were pulling all the brand-new 4ESS software and replacing it with an earlier version of System 7. If these had been human operators, rather than computers at work, someone would simply have eventually stopped screaming. It would have been *obvious* that the situation was not "OK," and common sense would have kicked in. Humans possess common sense -- at least to some extent. Computers simply don't.
On the other hand, computers can handle hundreds of calls per second. Humans simply can't. If every single human being in America worked for the phone company, we couldn't match the performance of digital switches: direct-dialling, three-way calling, speed-calling, call-waiting, Caller ID, all the rest of the cornucopia of digital bounty. Replacing computers with operators is simply not an option any more.
And yet we still, anachronistically, expect humans to be running our phone system. It is hard for us to understand that we have sacrificed huge amounts of initiative and control to senseless yet powerful machines. When the phones fail, we want somebody to be responsible. We want somebody to blame. When the Crash of January 15 happened, the American populace was simply not prepared to understand that enormous landslides in cyberspace, like the Crash itself, can happen, and can be nobody's fault in particular. It was easier to believe, maybe even in some odd way more reassuring to believe, that some evil person, or evil group, had done this to us. "Hackers" had done it. With a virus. A trojan horse. A software bomb. A dirty plot of some kind. People believed this, responsible people. In 1990, they were looking hard for evidence to confirm their heartfelt suspicions.
And they would look in a lot of places.
Come 1991, however, the outlines of an apparent new reality would begin to emerge from the fog. On July 1 and 2, 1991, computer-software collapses in telephone switching stations disrupted service in Washington DC, Pittsburgh, Los Angeles and San Francisco. Once again, seemingly minor maintenance problems had crippled the digital System 7. About twelve million people were affected in the Crash of July 1, 1991. Said the New York Times Service: "Telephone company executives and federal regulators said they were not ruling out the possibility of sabotage by computer hackers, but most seemed to think the problems stemmed from some unknown defect in the software running the networks."
And sure enough, within the week, a red-faced software company, DSC Communications Corporation of Plano, Texas, owned up to "glitches" in the "signal transfer point" software that DSC had designed for Bell Atlantic and Pacific Bell. The immediate cause of the July 1 Crash was a single mistyped character: one tiny typographical flaw in one single line of the software. One mistyped letter, in one single line, had deprived the nation's capital of phone service. It was not particularly surprising that this tiny flaw had escaped attention: a typical System 7 station requires *ten million* lines of code.

On Tuesday, September 17, 1991, came the most spectacular outage yet. This case had nothing to do with software failures -- at least, not directly. Instead, a group of AT&T's switching stations in New York City had simply run out of electrical power and shut down cold. Their back-up batteries had failed. Automatic warning systems were supposed to warn of the loss of battery power, but those automatic systems had failed as well.