Microsoft's David Campbell Discusses Big Data

The database industry is in an evolutionary phase

Near the end of 2012 on the Microsoft campus in Redmond, Washington, I got a chance to talk with David Campbell, a technical fellow at Microsoft. Campbell has been instrumental in shaping the direction of SQL Server since the release of SQL Server 7.0. The database industry is in an evolutionary phase: the growth of new technologies such as Big Data and in-memory databases is significantly changing the role of enterprise database platforms like SQL Server.

To help a bit with the background of our discussion, you should know a little about Big Data and today's in-memory database implementations. Big Data typically describes unstructured data that arrives in great volume and variety, or data that changes at high velocity. Hadoop is an open-source technology that has emerged to deal with Big Data; Hadoop clusters partition data processing across multiple nodes using a programming model called MapReduce. Microsoft, in conjunction with the company Hortonworks, created a Hadoop distribution named HDInsight that runs on Windows Server as well as Windows Azure. Another key technology that we discussed is code-named Hekaton, an in-memory database technology that Microsoft is working on for future versions of SQL Server. In this interview, Campbell talks about how SQL Server is evolving to integrate seamlessly with these new data technologies and where the product is headed.

Michael Otey: Tell us a little bit about your background. How long have you been at Microsoft? What did you do before you started there?

David Campbell: I've been at Microsoft a long, long time. I started in August 1994, along with a whole bunch of other people from what would be considered enterprise computing companies. I came from Digital Equipment Corporation. The curious thing about my background is that I actually have a degree in robotics, not computer science.

Otey: Do you find that there's a connection between the two?

Campbell: Actually, I do. The interesting connection is that folks in the robotics curriculum have more training in control theory and feedback, and in some of the mathematics around engineering in different forms—linear algebra and whatnot. What's so interesting is how this played out several times in my career. Control theory actually played a role in what we did with SQL Server 7.0, when we took the product from having something like 120 different "knobs" that people would set down to 20. What we did was have the system tune itself using adaptive feedback. Now with the whole push toward Big Data and machine learning, I'm pulling the linear algebra textbooks off the shelf and dusting them off.

Otey: You obviously were instrumental in getting the original SQL Server 7.0 version launched. Since then, what kind of projects have you worked on with the SQL Server team?

Campbell: In 2006, I was part of a small group that started doing the redevelopment process for SQL Server. During that phase, we kicked off a number of incubations, one of which became Velocity, the distributed coherent cache. We also went through several different iterations for what ultimately became PowerPivot. Another one was the genesis for Hekaton, which was announced at PASS.

Otey: You've been working on the development side, too, right?

Campbell: Yes. I was part of a small group that said, "OK, with respect to the Entity Framework, let's think about it not only for object relational mapping, but other ways that we can leverage it." One of the things that came out of that was OData. Frankly, OData has taken off and is now wildly successful within Microsoft and elsewhere.

Otey: What's the relationship between OData and SQL Server? How can businesses use OData?

Campbell: Today you'll find a lot of people using OData in Web services. There's been some back and forth on how to simplify it. We started early on with what looked like RPC over XML and SOAP, and then WCF. It then sort of morphed into HTTP and REST, which is what everyone has settled on. OData makes it easy for people to encode logic and calls of various forms in the uniform interface of HTTP and REST. It's quite difficult for people to be able to query data in that form, so people wind up basically encoding parameterized queries as calls. What OData brings is a uniform interface, which is faithful to REST but allows tooling. You can navigate a catalog. You can issue queries. And for a large class of queries, you can specify simple SELECT, JOIN, and other T-SQL statements directly as HTTP against a server.
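Campbell's point about a uniform, queryable interface is easiest to see in a concrete request. The sketch below shows what OData query options look like on the wire—roughly the SELECT, filter, and join-like operations he mentions, expressed directly in a URL. The service and entity names here are hypothetical:

```
GET /myservice/Customers?$filter=Country eq 'USA'&$orderby=CompanyName&$top=10
GET /myservice/Orders?$select=OrderID,Freight&$expand=Customer
```

Because the query options ($filter, $select, $orderby, $expand, $top) are part of the protocol rather than ad hoc parameters, tools can discover the catalog and compose queries without a custom API per service.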

Otey: What kinds of projects are you working on these days? What's your current focus?

Campbell: I'm responsible for all of the developers in the division across our suite of technologies in SQL Server, which includes the relational engine, SQL Server.XE, and our BI technologies. It also includes the area called Information Services, which is where a lot of our higher-level services—OData is one example—sit. We're certainly focusing a lot on Big Data.

Otey: What sort of things have you been working on with Big Data? I remember at the PASS Summit there was an announcement about PolyBase. Would you like to explain a little bit about what that is?

Campbell: An interesting way to start that conversation is to note how people in the Big Data community and the traditional-scale, data-warehousing community are beginning to recognize the value of both sides of the equation. If you start from the data warehousing side, it's sometimes hard for people to appreciate where the value is on the scale of the MapReduce side. One of the ways I've come to describe it is that it's not so much about building a formal, logical model only used for a data warehouse. Rather, it's more about saying, "I can project computation on this data that I've just collected," and then basically produce information in a variety of forms from it.

For example, for a couple of years, I've been carrying a little GPS data logger that collects just telemetry data. What I can do is run computations over the GPS telemetry data and answer interesting questions. I can basically really quickly compute where I spent every night just by determining my average latitude-longitude between midnight and 5:00 a.m. local time to determine where I am on the globe. So, I'm taking information from one domain—a trail of GPS coordinates—and turning it into information in a different domain—the states where I was every night.
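If the logger's fixes were loaded into a table, the "where did I spend the night" computation Campbell describes could be expressed as a simple aggregation. This is only a sketch—the table and column names are hypothetical:

```sql
-- Hypothetical log table: one row per GPS fix
-- CREATE TABLE dbo.GpsLog (FixTime DATETIME2, Latitude FLOAT, Longitude FLOAT);

SELECT
    CAST(FixTime AS DATE) AS NightOf,
    AVG(Latitude)         AS AvgLat,   -- naive averaging; fine away from the antimeridian
    AVG(Longitude)        AS AvgLon
FROM dbo.GpsLog
WHERE DATEPART(HOUR, FixTime) < 5     -- fixes between midnight and 5:00 a.m. local time
GROUP BY CAST(FixTime AS DATE)
ORDER BY NightOf;
```

The averaged coordinate for each night can then be reverse-geocoded into a state or city—turning raw telemetry from one domain into information in another, exactly as Campbell describes.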

Otey: What do you think are some of the driving forces that are making Big Data important today and making it possible for organizations to really get into it?

Campbell: There are two significant, fundamental drivers. The first driver is that the cost of data acquisition has gone to zero.

Otey: Zero?

Campbell: Yes. For example, the GPS data logger produces data for free. I don't need to type the GPS coordinates in. If you go back 30 or 40 years, most of the data in information systems came as a result of human fingertips on keyboards. You can still pay people in the U.S. to type stuff in for you. Now whether it's manuscript typing or keypad data entry, it's roughly a dollar per kilobyte. If you scale that out, it works out to a billion dollars per terabyte or a trillion dollars per petabyte. These days, much of the data—even analog phenomena like your voice traveling in the digital network or pictures—is born digital.

Otey: That makes perfect sense. I had never really considered the cost of data created manually versus automatically. That's very interesting.

Campbell: The second driver is the cost of raw storage itself. How many hard drives do I need to store a terabyte of data, and what are they going to cost me? If you go back 30 years and scale the cost based on inflation, it would be $660 million to get enough disk drives to hold a terabyte. Today, it's somewhere between $50 and $100, depending on how you want to slice it.

Otey: As I understand it, PolyBase is a sort of bridge between SQL Server and Big Data. Can you explain a little bit about how that works?

Campbell: Sure, I'll explain it in the context of the example I gave earlier. I could run a job over the telemetry data that recognizes when I drive to work from home, so I can produce data that answers the question "What's my average commute time as a function of when I leave the house?" PolyBase can produce data that is more structured, like the example I just gave. By mapping the results to a SQL table, you can start issuing SQL queries. There's a tremendous amount of value in there. That's what PolyBase is all about.
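PolyBase's syntax wasn't public at the time of this interview, so the following is only a sketch using the external-table DDL that later shipped in SQL Server; every object name is hypothetical. The idea is the one Campbell describes: map files in HDFS to a table, then query them with ordinary T-SQL:

```sql
-- Point SQL Server at the Hadoop cluster (hypothetical address)
CREATE EXTERNAL DATA SOURCE HadoopCluster WITH (
    TYPE = HADOOP,
    LOCATION = 'hdfs://namenode:8020'
);

CREATE EXTERNAL FILE FORMAT CsvFormat WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);

-- Map the files a MapReduce job produced to a relational schema
CREATE EXTERNAL TABLE dbo.Commutes (
    DepartTime     TIME NOT NULL,
    CommuteMinutes INT  NOT NULL
) WITH (
    LOCATION    = '/output/commutes/',
    DATA_SOURCE = HadoopCluster,
    FILE_FORMAT = CsvFormat
);

-- From here on, it's ordinary T-SQL
SELECT DATEPART(HOUR, DepartTime) AS LeaveHour,
       AVG(CommuteMinutes)        AS AvgCommute
FROM dbo.Commutes
GROUP BY DATEPART(HOUR, DepartTime)
ORDER BY LeaveHour;
```

The external table behaves like any other table in queries, so existing SQL tooling and skills carry over to data that physically lives in HDFS.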

Stepping back just a little bit, you might be wondering why we got involved with Hadoop in the first place. Two or three years ago, as early adopters started to see this phenomenon play out, people would ask us what our Hadoop integration story was during data warehousing bids. The market had already chosen Hadoop, but there wasn't a great port from Hadoop to Windows. So we said, "Let's start there."

Although the early adopters will put up with "crossing the chasm" sorts of models, when it goes into the mainstream, people want solutions that are manageable. So, we worked with the community to address important issues. How do we architect this so that we can build great solutions for everyone who participates in the ecosystem? For people who choose to use Hadoop on Windows, how can we do a great job with System Center integration, a great job with Virtual Machine Manager, a great job with Active Directory integration, and such? I sometimes use the phrase "domesticating Hadoop."

Otey: So, you're simplifying the management of Hadoop, making it fit into the existing Windows infrastructure management so that it's easy for organizations to digest. I can certainly see that. It'll be very foreign to most SQL Server shops. Certainly HDInsight seems to be the thing that helps Hadoop blend into the management infrastructure. What is the timing for HDInsight? When is it expected to be available for customers?

Campbell: We have a preview of HDInsight on Azure available now. We're in the process of validating the preview with customers. It will ship once we get all the capacity and capability in place. I hope you'll see it sometime next year.

Otey: So you're in an early-adopter phase?

Campbell: Yes. In fact, I was in this room on Friday with some folks from a company that's using HDInsight for a transportation management solution. They said that one of the things that's amazing for them is its elasticity. Using the preview program, they can spin up a cluster to solve a complex scheduling problem for transportation management, even if the customer wants it in an hour. They pointed out that, with most of the existing solutions, customers can really only do one run a day. With HDInsight, if it makes sense for a customer to do five, six, or even eight runs a day, they can just scale up the cluster to the appropriate size to pull that off. The elasticity that's being demonstrated in the preview right now is ultimately going to be really, really valuable.

Otey: As I understand it, the initial implementation of PolyBase is going to be with Parallel Data Warehouse. Is that right?

Campbell: Yes, it will be in the next native version of PDW.

Otey: Can you explain how SQL Server, HDInsight, and Big Data all fit together?

Campbell: In Parallel Data Warehouse, we have scaled, distributed, reliable storage. Its built-in Data Movement Service lets us move data around very efficiently on the cluster. And we have the ability to factor and run queries over all of that.

What we're doing, basically, is overlaying Hadoop on top of what we've done for PDW. This allows us to do several things. If you have a very complex MapReduce job that someone has hand authored, you could deposit the results of that job in an HDFS table, map that table to a SQL view, and then run queries against the view.

Take, for example, someone who's doing machine learning, such as the Parkinson's Voice Initiative. I don't know if the Parkinson's Voice Initiative is using Hadoop per se, but you can imagine it. They're detecting Parkinson's using tremors in recorded voices. I wouldn't know how to express the SQL query for that, but if you did know how to author and run that job, you could run it over a bunch of files that contain voices, take the output, and then run a number of queries against that output. So you can think of it ultimately as the MapReduce piece being synthesized and generated for a whole class of problems.

Otey: Does it matter if PolyBase is running on the on-premises version of HDInsight or if it's running on HDInsight Services for Windows Azure?

Campbell: There are a couple of different answers here. I'll start with on-premises. You could do that. One of the challenges you have going there is bandwidth. What level of bandwidth is required? How sensitive are you to latency and such? Another question people ask quite often is "I already have a Linux Hadoop cluster. Can I go against that?" Our goal is to interoperate across those clusters as well. To answer the other part of the question, "Will it all come together on Azure?" Ultimately, it will.

Otey: If you have an existing Hadoop cluster that's all open source and not HDInsight, will PolyBase work with it as well?

Campbell: To a degree. As we do more and more in terms of synthesis of queries, it's a matter of being able to understand what's there to a deeper degree to do a better job. But certainly we want to make this whole suite of interoperability tools available to people who have an existing Hadoop cluster.

Otey: Can you explain what Hekaton is about?

Campbell: Let me step back. As I mentioned earlier, back in 2006, I was part of a small group that started doing the redevelopment process for SQL Server. Some of the thinking we started back then ultimately led to Hekaton. The original conversation was actually kind of interesting because we said, "How do we make the most of modern hardware as modern hardware evolves?" When we entertained that discussion in 2006, SQL Server was a major product. And so people said, "Well, what does it mean for compatibility? What does it mean for this and that?" At that point we said, "We don't know. But we have to get going on figuring out what it means for us."

Otey: So, the idea with the next generation was to take advantage of some of the advances in technology and to bring SQL Server to where it needed to be for the next generation of the software?

Campbell: Yes. But here's where it gets to an interesting angle right back at Hekaton again. We could have done one of two things. We could have said, "Well, to maximize these new capabilities—the row-based transactional processing system and in-memory transactional system—we're going to have to throw out SQL Server and do something completely new." Instead, we said, "It makes sense for us to put them in the context of SQL Server—still have it be SQL Server and let all that value shine through."

And frankly, there's a ton of hard engineering work in there to make it be SQL Server. But if you look at the customers that we showed at PASS, such as bwin Games, you'll see that they got speedups way beyond what they ever would have hoped for. And it's still SQL Server.

As you can well imagine, these customers have major implementations that have to be scalable. Everything needs to come together in one place, and it needs to be consistent. So they had done many, many, many transactional processing tricks to get SQL Server to go really, really fast. We co-designed, if you will, Hekaton with them, and then we gave them versions to work on. They were hoping to get 2x performance improvements. They were amazed to get 10x improvements right away with no other changes. And the beautiful thing for them is that it's still SQL Server.

Otey: As I understand it, with Hekaton, you're taking certain tables that are hot tables or stored procedures, and moving them into memory?

Campbell: Yes.

Otey: But SQL Server still sees them as normal tables?

Campbell: For the people who know SQL Server, this is the part that a lot of them don't understand. If you get under the covers, what does that really mean? Well, if you have enough memory, you can have your database in memory, right? But the fact is that you can a) be assured that it's in memory and b) use a set of programming techniques to make maximum use of the modern hardware. With Hekaton, we've had to change the concurrency control system, but still make it be SQL Server. There's a whole class of data structures, known as lock-free data structures, to get things to scale on modern hardware. So it's not just bringing it into memory. On the stored procedure side, we have an interpreter inside SQL Server that executes stored procedures. They're compiled, but they're compiled into a form that the interpreter runs. In the case of Hekaton stored procedures, we actually generate native code and optimize it specifically for that particular query. Again, it's still SQL Server because I still write a T-SQL stored procedure. But it gets compiled into code that works alongside the optimized data structures to get the speedup.
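As a rough illustration of what Campbell describes—hedged, since Hekaton's final syntax hadn't shipped when this interview took place—a memory-optimized table and a natively compiled procedure later looked roughly like this in SQL Server 2014. All names here are hypothetical, and the database needs a MEMORY_OPTIMIZED_DATA filegroup first:

```sql
-- A memory-optimized table: lock-free, hash-indexed, always in memory
CREATE TABLE dbo.HotSessions (
    SessionId INT NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1000000),
    LastSeen  DATETIME2 NOT NULL
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);

-- A natively compiled stored procedure: still plain T-SQL to write,
-- but compiled to machine code instead of run by the interpreter
CREATE PROCEDURE dbo.TouchSession @SessionId INT
WITH NATIVE_COMPILATION, SCHEMABINDING, EXECUTE AS OWNER
AS
BEGIN ATOMIC WITH (TRANSACTION ISOLATION LEVEL = SNAPSHOT, LANGUAGE = N'us_english')
    UPDATE dbo.HotSessions
    SET LastSeen = SYSUTCDATETIME()
    WHERE SessionId = @SessionId;
END;
```

From the application's point of view, nothing changes: it calls a T-SQL procedure against a SQL Server table, which is exactly the "it's still SQL Server" compatibility point Campbell emphasizes.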

Otey: This is just standard T-SQL code—the developers don't have to use anything different, right?

Campbell: That's right. If you look at the BI side of the house, we took a little bit more of a radical approach with the tabular model and PowerPivot. But on the SQL Server side, we said, "Okay, we can make Hekaton and still have it be SQL Server." That degree of compatibility offers a ton of value. It also helps you make the migration. You don't have to rewrite your application, and you don't need to go do a whole bunch of other things. You don't even have to move it to a different storage platform. You basically walk up to it with this tool, decide what's the best in terms of optimizing first, and away you go.

Otey: As I understand it, you can even use Hekaton with existing hardware, provided the hardware has the necessary capabilities. So, you could put it right on what you're using today and get performance advantages.

Campbell: Exactly.

Otey: That sounds pretty awesome. It's a good way for customers to take advantage of these new technologies. So, what do you see as the future direction in the database world? Certainly Big Data is part of it. The move to in-memory and higher performance is another part of it. Are there any other significant trends going on?

Campbell: I think one of the most significant trends is "How do I take this ambient data—this stuff that is just around—and turn it into insight really easily?" People say, "Why do I want a spreadsheet that can have millions of rows of data?" They think that's crazy. But being able to slice and dice and transform hundreds of millions of rows of data on a laptop or desktop machine allows you to get to the insight really, really quickly. What can you learn from this data without spending several months designing a data model, loading it, staging it, and tuning it? After you know where the insights are, you can do a variety of things to scale up. One of the things that ships with SQL Server 2012 is the ability to take a PowerPivot and VertiPaq model you built in Excel and upsize it to Analysis Services so you can throw more data at it.

Understanding what's there and where the value is in this ambient data is something we made really approachable. Then taking the friction out of "How do I make it into a production solution?" from that point is another piece of the solution.

So, to answer your question about the new frontiers, I think it's realizing more and more value from accessible data. Accessible data not only includes things that I have in my business but also what's available publicly or from others.

In many of the Big Data talks I give, I start with this story: About three years ago, we were talking about transactional processing systems with one of the large commercial U.S. airlines. In the middle of this conversation, one of the guys stopped and said, "You know what? Our businesses are just killing each other. Everyone's just racing for the bottom—cheaper and cheaper flights. Nobody cares about anything we can do other than having cheap seats." He said, "We've come to realize that the only way that we're going to survive is to do a better job of yield management, a better job of fuel-price hedging, and a better job of upselling stuff to our customers than our competitors." He then paused for a second and said, "And that all requires us to do new things with data that we don't know how to do today."

It's interesting because they went from a transactional processing discussion to "Where are we going to get data that will give us insight that's going to keep our business afloat?" in the span of a couple minutes. They started asking questions such as "Where are we going to get the fuel futures pricing data to be able to hedge our fuel purchases?" and "Where are we going to get the meteorological models to be able to determine whether we should move the planes from Logan to JFK tomorrow because a storm is coming?" This is the work we're doing. We have a team in Boston that recently did a demo at the 2012 Supercomputing Conference. They actually built a predictive model that we ultimately might offer to people. They gathered airline data and a lot of meteorological data, and brought it all together to produce insights. For example, if you're planning on taking a flight from Seattle through Detroit to Boston in November, what are your chances of being delayed due to weather, historically? That's sort of this new world of data that we talk about—bringing together your data and the world's data to get new insights and new value.

Otey: So businesses are turning data that they probably couldn't have used before into information that they can use to solve fundamental business problems. And they're using familiar tools, such as Excel and PowerPivot, to get insights, which means they don't have the hurdles of learning an arcane new BI tool. I can see that it's helping to bring information more into the hands of the users and making everything more accessible. There are certainly a lot of interesting trends.

Campbell: One question that I've gotten over the years as we make these shifts is "How will it affect my job?" For example, when we did the work in SQL Server 7.0 to take the 100+ "knobs" down to 20, there were a lot of DBAs who were concerned about their jobs. But that shift brought new opportunities in the BI and OLAP space. In this new world, there will also be a ton of opportunities for people in and around the product—both existing opportunities, where they can still retain their skills, like with Hekaton, and new opportunities in the Big Data and insight space. It's a great time to be a data guy or gal.

Otey: So they shouldn't be worried about how Big Data is going to affect their jobs because it's not a replacement for SQL Server at all? Instead, it opens up other possibilities?

Campbell: It does. The last three years have, without question, been the most exciting time of my career—and I've been in the database space in one form or another for over 25 years. There's just so much happening.

Otey: That's true. I would say that, in technology in general, there have been more changes in the past few years than I've probably ever seen. The rate of change is faster than ever before.

Campbell: Yep, there's a lot of change happening. And there's a ton of opportunity for producing value, no matter where you sit. So being in the data space at this point is a great thing.
