A couple of weeks ago at Container World I had a discussion with Steve Newman, who these days is the founder and CEO of Scalyr, a log management tool startup that helps IT folks make sense of the reams of information created by server logs. Those with a long memory might associate him with another startup that was acquired by Google more than a dozen years ago, called Writely.
This was somewhere around 2004, in the early days of the Web 2.0 phenomenon. In addition to Newman, the founding trio included Sam Schillace, who's now VP of engineering with Google Maps, and Claudia Carpenter, who works alongside Newman at Scalyr as a software developer and UI designer.
"We were friends and we'd gotten together to build a startup, we just didn't know what," Newman explained. "We'd been prototyping a different idea, a terrible idea that I won't even bother describing, and one of my co-founders came into the office one day, which was actually my attic, said 'I have an idea' and basically described Google Docs."
Considering the success that Docs was to eventually have, you might be excused for thinking that the trio immediately dropped the bad idea on which they were working to go with a sure thing. Not so. As they say, hindsight is 20/20. Foresight is a little more cloudy.
"It took about two weeks for him to convince the other two of us that it was actually a good idea," Newman remembered.
"We put the first prototype together in about 100 days, which I remember because I had just read some article that said any new thing should launch in 90 days because if you take longer you're just scared to launch, so that was the goal. We missed 90. It took 100 days to put that first prototype together. We did a very soft launch -- no press, no anything -- but just a few people were finding it here and there."
The launch was in 2005, before software-as-a-service had become much of a thing and a full five years before Microsoft released an online version of Word as part of Office 365. "The problem we thought we were solving was emailing Microsoft Word files as attachments," Newman said.
On Sept. 1 of that year, about six weeks after Writely's soft launch, Michael Arrington wrote about the nascent browser-based word processor on TechCrunch, a "little" site he'd founded only a few months earlier. Newman remembers that the article had something of what was in those days called "the Slashdot effect."
"One night before I went to bed I checked and we had about 90 registered users on the site, many of whom I knew personally," he said. "When I woke up, we had about 1,000 registered users. What happened overnight was the TechCrunch article. From that point it was just a mad scramble to keep up. It was one of those classic internet rocketship rides. Pretty soon we got an email from Google interested in acquiring us, which we almost deleted as spam. One thing led to another and I think 10 months after 'I've got an idea' we'd been acquired."
The acquisition moved Newman and his team out of the attic office and into Google, where they set to work turning the little homegrown web-based word processor into something much larger. Google had acquired another startup that had built a web-based spreadsheet, and the Writely crew worked with that team to integrate the two projects into the full-fledged office collaboration suite Google Docs.
"Getting to be at Google in 2006 and in subsequent years, launching Google Docs, building it out and scaling it up, was just a fascinating ride," Newman said. "Google was a wonderful place to learn how to do things at large scale, how to ship something to millions of users, how to keep it working when millions of people are using it.
"When you're operating at that scale, all kinds of strange, unusual things that normally you wouldn't have to worry about suddenly become things you do have to worry about. People used to say at Google, if something is a one in a billion chance it'll happen every day. It's really true. You start to have to be a lot more careful about engineering, about the product design, about how everything works together, because the fluke event is a routine occurrence."
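The arithmetic behind the "one in a billion" saying can be sketched in a few lines of Python. The daily event counts below are hypothetical round numbers chosen for illustration, not figures from Newman or Google, and the sketch assumes each event is independent.

```python
import math

ONE_IN_A_BILLION = 1e-9


def expected_daily_occurrences(events_per_day: int, probability: float) -> float:
    """Average number of times a rare event fires in one day."""
    return events_per_day * probability


def chance_of_at_least_one(events_per_day: int, probability: float) -> float:
    """Probability the rare event fires at least once in a day.

    Computes 1 - (1 - p)^n, using log1p/expm1 so tiny probabilities
    do not get lost to floating-point rounding.
    """
    return -math.expm1(events_per_day * math.log1p(-probability))


# Hypothetical daily volumes (requests, keystrokes, saves, and so on).
for events in (1_000_000, 1_000_000_000, 10_000_000_000):
    expected = expected_daily_occurrences(events, ONE_IN_A_BILLION)
    chance = chance_of_at_least_one(events, ONE_IN_A_BILLION)
    print(f"{events:>14,} events/day: expected {expected:.3f}, "
          f"chance of at least one {chance:.1%}")
```

Under these assumptions the fluke is rare at a million events a day, more likely than not at a billion, and close to certain at ten billion.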
Newman and his team soon learned that keeping everything working together went far beyond their own project. Their code had to work within Google's preexisting infrastructure.
"I don't think we were using the word at the time, but really we were working in the cloud, the internal cloud Google had, so you didn't just build and run your own system," he explained. "There were a lot of other things going on at Google that you could rely on, such as the data storage, authenticating who your users are, directing the traffic as millions of people are coming to your site -- all those things were built by other teams at Google. We just had to do our little part and then we could rely on all these other pieces, which was great but it made things more complicated.
"Before we were acquired we had about 200,000 registered users and we were struggling to keep up. At Google, we quickly had millions and millions of registered users and we'd just sort of push a button every time we'd need more servers and the whole thing just scaled up so easily. It was an incredible platform for running at large scale, but also it was just this very complicated environment."
Complications that could come from the most unexpected places.
"Not long after we had relaunched Writely as Google Docs we had a 20-minute outage because Brazil was playing in the semifinals of the World Cup and at halftime everyone in Brazil went and logged on to their social network to gossip about the game," he said. "If it's 2006 and you live in Brazil, then your social network is Orkut, which was a Google property. The surge in traffic overloaded the network in Orkut's data center, which also happened to be the data center that Google Docs was using. That sort of gives you the flavor of how complicated things can get.
"There's always something going on. It's not usually halftime at the World Cup, but someone does something, or some team releases a new version of something, and it somehow interacts with what you're doing and now you have another problem to investigate. 'Why is the spellchecker suddenly broken? Why is the site slow this afternoon? I know it's not the same reason it was slow yesterday afternoon, because I already checked that. It's some new reason.' All of these things going on."
Oddly, the experience of adjusting to scale in complex environments at Google is what eventually led Newman to create the log management startup he helms today.
"Trying to sift through all the data, the logs and the other data that we were gathering, to try to figure these things out, was a huge frustration," he said. "After I left Google, I would talk to other people around the industry and would hear a lot of similar stories. Especially now, as everyone is moving to the cloud, which is really the same kind of environment.
"When you have a complicated system you have to do these long investigations, and the investigations mean sifting through server logs and application logs and other data that you're constantly collecting from these systems. When you're running at large scale, that's an enormous amount of data if you're trying to track down when did this error message originate or let's look at the graph of how long it takes the site to load and let's run that graph back for a week so I can see when this started to get worse, or whatever the specific question you have that's going to hopefully get one step closer to the origin of your problem."
Although a thorough search of server logs almost always led eventually to the cause of a particular problem, Newman found that sifting through the reams of data was almost prohibitively time-consuming.
"It might take five minutes for you to ask a question such as show me when this error started surfacing or when the site got slow and it takes five minutes to get an answer. The answer just leads to your next question, because you're going to have to follow eight steps to get to the root cause and you're going to have 10 guesses at each of those steps before you get on the right guess. So now you're asking 80 questions and each question takes five minutes to answer. Five minutes is long enough to lose your train of thought, go check your email or get a cup of coffee. Eighty cups of coffee by the time you get to the end is a problem. This is the sort of thing we were experiencing at Google."
It was problems like these that Newman set out to solve six years ago when he founded the log management platform Scalyr.
"The original idea for Scalyr was very simple," he said. "It was let's turn that five minutes into one second."
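Newman's back-of-the-envelope numbers (eight steps, 10 guesses per step, five minutes per question) can be worked through in a short Python sketch. The function name and structure are illustrative only; the inputs come from his example above, and the one-second figure is the stated goal rather than a measured result.

```python
def investigation_cost(steps: int, guesses_per_step: int, seconds_per_query: int):
    """Return (total questions asked, total seconds spent waiting on answers)."""
    questions = steps * guesses_per_step
    return questions, questions * seconds_per_query


# Figures from Newman's example: 8 steps, 10 guesses each.
questions, slow_seconds = investigation_cost(8, 10, 5 * 60)  # five minutes per answer
_, fast_seconds = investigation_cost(8, 10, 1)               # one second per answer

print(f"Questions asked: {questions}")
print(f"At five minutes each: {slow_seconds // 60} minutes of waiting")
print(f"At one second each: {fast_seconds} seconds of waiting")
```

By this arithmetic, the same 80 questions cost 400 minutes of waiting at five minutes apiece, and 80 seconds at one second apiece.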