I'm Feeling Lucky: The Confessions of Google Employee Number 59 - Douglas Edwards
By lunchtime, traffic had subsided enough that Larry and Sergey gave the okay to turn Google.com back on. Schwim and Jim returned to Exodus to finish installing the last of the servers, and within four hours they had brought an additional three hundred machines online, ending the immediate crisis.
As Jim and Schwim left the controlled environment of the data center and headed out into the warm evening air, they received another call. Netscape's engineering team was at the Tied House Brewery in Mountain View, celebrating the partnership, and they wanted Google's tech team to join them.
"They threw us a great post-launch party," Jim remembers. "And the thing that came up over and over again was, 'I can't believe you guys shut down your own site just to serve our traffic.'" The Googlers in attendance noted well that their sacrifice had paid off handsomely. The deal with Netscape promised to blossom into a beautiful friendship. Google gained not only trust, but also access to a whole new set of data in Netscape's query stream—data we could analyze and compare with our own traffic. Most important, the company's first major crisis battle-hardened it. Larry and Sergey would never again underestimate the challenges of occupying new territory. Though it seemed epic at the time, the battle of Netscape would go down as a minor skirmish once Google fully engaged the major players in the war for search supremacy.
That day was coming.
What's Going Down?
A little after midnight one Saturday night in the fall of 1999, Jim's phone interrupted his sleep again. Again it was Sergey.
"The site's down. What's up?" he wanted to know.
"Not me," Jim replied with a yawn. "You woke me."
A circuit breaker at Exodus had flipped, taking down Google's main switch, an inexpensive little piece of Hewlett-Packard hardware through which all of Google's traffic flowed. Exodus had set up the switch before Google moved the first racks into its cage, and had done it in a hurry. The device had been placed on the floor under one of the racks and was cabled in such a way that it had to stay there. It was known to all the techs by the designation "Switch on the ground." There was no backup, and when it crashed Google went offline until someone did something about it.
"Sergey had been at a party. He came home and noticed we were down," recalls Jim, who logged in, figured out the problem, and had Exodus turn the circuit on again. Google was offline for about half an hour.
"We should probably be monitoring our site, huh?" said Sergey when Jim called to let him know it was back up.
Jim spent the rest of Saturday night and Sunday morning writing a script to monitor Google. His script checked the site every five seconds to make sure it was operational and called a phone number if something went wrong. The next week everyone in operations got a pager.
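The watchdog Jim describes is simple to picture: probe the site on a fixed interval, and raise an alarm the moment a probe fails. A minimal sketch in modern Python follows; the URL, function names, and injectable alert hook are illustrative (his actual script dialed a phone number, and predated these libraries):

```python
import time
import urllib.request
import urllib.error

CHECK_INTERVAL = 5  # seconds between probes, per Jim's script


def site_is_up(url, timeout=4, opener=urllib.request.urlopen):
    """Return True if the URL answers with HTTP 200 before the timeout."""
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def monitor(url, alert, probe=site_is_up,
            interval=CHECK_INTERVAL, max_checks=None):
    """Poll the site every `interval` seconds; call alert() on each failure.

    `alert` stands in for the phone call Jim's script placed -- in practice
    it would be wired to a pager gateway or anything else that wakes
    someone up. `max_checks` bounds the loop for testing; leave it as
    None to run forever, as a real watchdog would.
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        if not probe(url):
            alert(f"{url} is not responding")
        checks += 1
        time.sleep(interval)
```

Injecting the probe and the alert as parameters keeps the loop testable without touching the network, which matters for a script whose whole job is to run unattended at 3 a.m.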
Google had gone dark for a second time, but no tempers flared and no heads rolled. "If Larry and Sergey were upset about anything," Jim told me, "it was, Why didn't any of us think of that? We're a bunch of bright people here and none of us even thought to monitor our own site."
The pager alert system created problems of its own. "Claus,"* a logs engineer, was one of the first to be hooked up, and he watched carefully as our traffic numbers kept redlining, threatening to crash the logs system. The logs were money—we billed advertisers on the basis of the data they contained—so he set up his own scripts to crunch the numbers and to call his pager when they were done. That happened about three times an hour, every hour, all day long. According to engineer Chad Lester, Claus "kept Google alive in the early days. He'd be sleeping at his desk in twenty-minute intervals between pages. One month he got a pager bill in the thousands of dollars."
Google renegotiated its pager service contract but never compromised