Examining the Hype Around NoSQL

NoSQL is increasingly being hailed within development circles as a next-generation database that fixes all the performance, scalability, and complexity problems that many organizations encounter when using relational databases. Facebook, Google, and now Twitter have started using NoSQL, so it's easy to see how many businesses might think that NoSQL is the solution to very real and costly problems associated with relational databases. But NoSQL isn't magical, nor is it without limitations.

While NoSQL delivers powerful capabilities, it requires a number of very serious compromises that can be detrimental for overall business use. Most of the scalability, performance, and complexity problems associated with relational databases are rooted in a lack of understanding, and NoSQL's approach, essentially throwing the baby out with the bathwater, means that organizations looking to NoSQL for painless solutions will be sadly mistaken. In this article, I will identify some of the core trade-offs and limitations that businesses must address when considering NoSQL.

How NoSQL Works
The key to understanding NoSQL is to realize that it isn't a product. It's a paradigm, or an approach to storing data. Currently, over 20 different NoSQL implementations are available. When most people talk about NoSQL, they're usually discussing the more common implementations, such as Cassandra, Hadoop, CouchDB, MemcacheDB, MongoDB, Google's Bigtable, and Voldemort. Consequently, any discussion of how NoSQL works has to be prefaced by the caveat that implementation details and architectural considerations vary widely from one option to the next. For the same reason, comparing all NoSQL implementations to relational databases is difficult, as NoSQL solutions take many different approaches to data storage.

At a high level, NoSQL implementations share the goal of increased performance and scalability through jettisoning what some consider unwanted and unnecessary capabilities found in today's relational databases. The problem, of course, is that jettisoning those features comes at a high cost, at least for normal business considerations.

For example, a key tenet of most NoSQL databases is that they throw out atomicity, consistency, isolation, durability (ACID) in favor of Basically Available data with Soft state that becomes Eventually consistent (BASE). On the plus side, developers are freed from issues of managing locking and blocking. However, on the negative side, consistency and durability issues risk causing problems with end-user interactions. Many IT professionals with real-world experience almost completely dismiss the use of NoSQL because of its inability to manage complex business needs, workflows, or interactions (more on this shortly).
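To make the consistency trade-off concrete, here is a deliberately simplified sketch of eventual consistency. It is a toy model, not the behavior of any particular NoSQL product: the class names, the two-replica layout, and the explicit sync() step are all assumptions made for illustration.

```python
class Replica:
    """One node holding a copy of the data (hypothetical, for illustration)."""

    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    """Toy model: writes land on one replica and reach the other only on sync()."""

    def __init__(self):
        self.primary = Replica()
        self.secondary = Replica()
        self.pending = []  # writes not yet propagated to the secondary

    def write(self, key, value):
        # The write is acknowledged as soon as the primary has it.
        self.primary.data[key] = value
        self.pending.append((key, value))

    def read_from_secondary(self, key):
        # A reader routed to the secondary may see stale or missing data.
        return self.secondary.data.get(key)

    def sync(self):
        # Replication catches up at some later point ("eventually").
        for key, value in self.pending:
            self.secondary.data[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("cart_total", 100)
stale = store.read_from_secondary("cart_total")   # read before replication
store.sync()
fresh = store.read_from_secondary("cart_total")   # read after replication
```

In this sketch the first read returns nothing even though the write has already been acknowledged, which is the kind of window that can surface as confusing behavior for end users.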

Likewise, NoSQL gets rid of schema and works directly with data through APIs that developers typically find easier to use because they don't have to use SQL for Create, Read, Update, Delete (CRUD) operations. NoSQL also gets rid of SQL JOINs, a change that helps developers eliminate impedance mismatch. These benefits mean that developers can typically create solutions in less time because there is less complexity to manage. The downside, though, is that applications typically lose complex filtering options and aggregates along with ad hoc and analytical reporting capabilities. This means that NoSQL solutions really don't fit the bill for applications where businesses need to regularly analyze data.

Finally, a key to NoSQL's performance and scalability is that most NoSQL implementations store data exclusively in RAM. By sharing RAM across multiple servers, NoSQL picks up the ability for easy scale-out operations while also benefiting from increased redundancy or higher-availability through fault tolerance. NoSQL's scale-out strategy also picks up additional cost benefits because most implementations are open source. Licensing considerations are significantly less than those involving relational databases attempting scale-out architectures. The problem is that when businesses focus only on the benefits of scalability and performance, they can miss just how expensive those benefits really are. Let's start by addressing performance and scalability, and then we'll look at the other considerations.

Performance and Scalability: RDBMS vs. NoSQL
I've spent over a decade largely focused on performance-related issues for relational databases and working with a wide variety of platforms, including MySQL, Oracle, and SQL Server. I know how critical performance is to business, and I'm not oblivious to the performance problems that relational databases encounter. However, except in an extremely narrow set of highly specialized circumstances, NoSQL is not the answer to those problems. Rather, NoSQL's performance and scalability strengths come at too high a cost for NoSQL to be considered for general business use.

NoSQL performance and scalability. Putting data into RAM via document, graph, or key-value architectures is a fantastic idea. Effectively, it amounts to keeping highly selective data directly in memory where it can be quickly accessed. Likewise, spreading data over multiple machines to keep it in RAM (instead of dropping it to disk and incurring I/O or paging overhead) is another highly efficient approach to data storage. Not only does it allow for cost-efficient, scale-out capabilities that can readily grow with the need to handle huge amounts of data, but it can also be a great way to achieve fault tolerance and availability. In fact, this approach to high-performance data retrieval makes so much sense that it is commonly used in the form of caching tiers that sit atop relational databases. In this way, relational databases achieve most of the benefits of NoSQL implementations without the costly tradeoffs.
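A caching tier of this kind is often built with a cache-aside pattern. The sketch below is one minimal way to picture it, using Python's built-in sqlite3 module as a stand-in for the relational database and a plain dictionary as a stand-in for the in-memory tier. The products table, the function names, and the read counter are hypothetical and exist only for illustration.

```python
import sqlite3

# Hypothetical schema and data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO products (id, name) VALUES (1, 'Widget')")
conn.commit()

cache = {}                 # stands in for an in-memory caching tier
stats = {"db_reads": 0}    # counts how often the database is actually queried


def get_product_name(product_id):
    """Cache-aside read: check memory first, fall back to the database."""
    if product_id in cache:
        return cache[product_id]
    stats["db_reads"] += 1
    row = conn.execute(
        "SELECT name FROM products WHERE id = ?", (product_id,)
    ).fetchone()
    if row is None:
        return None
    cache[product_id] = row[0]
    return row[0]


def rename_product(product_id, new_name):
    """Write to the database first, then drop the stale cached copy."""
    conn.execute(
        "UPDATE products SET name = ? WHERE id = ?", (new_name, product_id)
    )
    conn.commit()
    cache.pop(product_id, None)
```

Repeated reads of the same product are served from memory, while the database remains the single source of truth for writes, reporting, and transactions.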

RDBMS performance and scalability. As one vocal critic of NoSQL, Ted Dziuba, succinctly stated, "You are not Google." An unhealthy fixation on scalability at the cost of everything else is just an expensive distraction that gets in the way of work. While relational databases do not scale in the same way as NoSQL solutions do, that's not to say that they can't scale. NASDAQ, Walmart, and plenty of other large organizations still manage to use relational databases for massive amounts of data storage and retrieval under highly performant circumstances. Just because Twitter is able to store 140-character tweets in a scalable manner doesn't mean that NoSQL is what your business needs. (Making fun of Twitter's simplicity may sound like a cheap shot, but it underlines how different Twitter's storage needs are from those of most businesses.)

Typically, most organizations run into relational database performance problems when the amount of data being stored exceeds the amount of RAM available. This doesn't mean that relational databases don't perform well. With successfully implemented indexing strategies, relational databases can keep pace with NoSQL performance, especially when combined with intelligently managed caching.
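As a small illustration of what an indexing strategy changes, the sketch below asks SQLite (through Python's built-in sqlite3 module) how it plans to run the same lookup before and after an index exists. The orders table and the index name are hypothetical, and other database engines expose query plans through their own tools.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(1000)],
)


def query_plan(sql):
    """Return SQLite's plan description for a statement as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


lookup = "SELECT total FROM orders WHERE customer_id = 42"

plan_before = query_plan(lookup)   # no usable index: a full table scan
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(lookup)    # the planner can now seek via the index

print(plan_before)
print(plan_after)
```

The exact wording of the plan text varies by SQLite version, but the first plan reports a scan of the whole table and the second reports a search that uses idx_orders_customer.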

Given NoSQL's heavy fixation on simple CRUD operations and its serious problems related to filtering and aggregating data, I'll go on record as saying that relational databases can outperform NoSQL databases from an overall business perspective. Figuratively speaking, NoSQL is a steroid-ridden, muscle-bound bodybuilder who can lift weights like crazy but just can't keep up with an agile, conditioned, well-rounded triathlete when it comes to reporting, ad hoc queries, analysis, or anything other than simple CRUD operations and massive scalability.

Of course relational databases can suffer performance problems. But businesses need to address those problems with knowledge, understanding, and expertise rather than hastily turning to the nuclear NoSQL option in hopes that it will obliterate their problems without any negative consequences.

Complexity: RDBMS vs. NoSQL
Proponents of NoSQL regularly cite how NoSQL reduces complexity, both in terms of architecture and coding. But that's primarily because NoSQL throws out most of the core features, benefits, and strengths of relational databases. For example, when it comes to storage or architectural complexity, NoSQL doesn't get rid of complexity. It just reorganizes that complexity and abstracts it away from developers and administrators. One common way NoSQL does this is by hosting services in the cloud. Make no mistake, that's a great approach to solving architectural complexity and scalability as it puts the onus for delivery in someone else's lap.

On the other hand, even when cloud-based service providers have significant hardware and infrastructure to handle typical workloads, most of them ensure uptime by limiting what their customers can do on their massive, multi-tenant systems. For example, most cloud providers terminate any query taking longer than 10 seconds. This policy keeps everything snappy and responsive for day-to-day operations, but it also means that reporting and analytics are crippled in NoSQL solutions. Even when NoSQL is hosted locally, where policies don't prohibit reporting queries, developers still have to address efficient ways to query large amounts of data for analytical processing. Even NoSQL proponents admit this issue is difficult and expensive to address.

Coding Complexity
As a developer, I can see why other developers would like to shy away from the nuances of SQL. It is an old language, and there are definite issues with the impedance mismatch that is so frequently cited in terms of converting data from set-based tuples into objects for use by developers. However, object-relational mapping does a great job of overcoming that mismatch. I also realize that if I'm not comfortable with JavaScript, XML, or the other languages that I'm using to interact with some key facet of my application, trying to bypass the need for those additional language skills typically comes at too high a cost. In other words, if developers lack the ability to properly write unit tests, the answer is not to throw out unit testing and assume that the code is flawless. Instead, the approach most businesses take is to invest additional resources to make sure that their developers are as efficient as possible. Consequently, I fail to see why this same approach wouldn't make sense when it comes to working with SQL.

Many developers legitimately argue that straddling multiple languages just to interact with data is time consuming and expensive, and it results in extra work. But storage complexity doesn't go away with NoSQL. That complexity just shifts around a bit. For example, in the same way that a junior-level developer can write an SQL query from hell, he or she is also capable of making expensive mistakes when trying to iterate over multi-dimensional arrays. Granted, that's not a perfect argument for the complexities of SQL, but plenty of businesses do just fine with SQL even with all its liabilities and complexities. These same businesses also manage to exploit SQL's benefits and capabilities as well. So arguing that businesses are better off without SQL just doesn't add up.

Business Considerations
NoSQL is all about trade-offs, as all technologies are. In most cases, the trade-offs required to meet NoSQL's narrow set of goals make it non-viable for most business considerations. However, in some highly specialized cases, NoSQL will be the right tool to use. Even so, the following are just a few considerations that businesses will want to address in order to better make sure that they're using the right storage options.

Standards compliance. Most businesses today probably don't care about the ANSI compatibility of the relational databases that store their business information. Nor do they care about OLE DB, ODBC, and/or JDBC accessibility standards for their data. But that doesn't change the fact that a standard for relational databases does exist and that vendors compete with each other to achieve greater and greater compliance with those standards. More importantly, these interfaces and standards have made possible a plethora of third-party business solutions and integrations vendors who can offer additional solutions and capabilities that enable businesses to extend the importance, reach, and value of their data.

NoSQL, on the other hand, takes such a highly focused and specialized approach to the data it stores that access to that data from other endpoints really isn't even an option. This limitation brings substantial negative consequences for businesses looking to extend their infrastructure and data through the use of third-party offerings or capabilities, to say nothing of hampering internal initiatives as well. In effect, NoSQL data risks becoming heavily siloed, a fact that carries tremendous business implications.

Disaster recovery, archiving, auditing, and compliance. Many NoSQL implementations address high availability (HA) through fault tolerance and redundancy. However, HA isn't the same as disaster recovery, since having bad data available in multiple nodes still represents a disaster for most organizations. NoSQL comes up short compared to relational databases, which can use their transactionally consistent log files to roll back (or roll out) bad data caused by software glitches or user error.

Similarly, since relational databases are much more mature when it comes to addressing common business needs, they typically have built-in or extensive third-party support for key business considerations, such as data archiving, auditing, and regulatory compliance. That doesn't mean that NoSQL can't be configured to meet some of these needs, but out-of-the-box support is effectively nonexistent.

Business Intelligence vs. Fixed-Format Reporting
Business intelligence (BI) initiatives continue to be beneficial because they bypass costly and time-consuming development cycles tasked with creating fixed-layout reports. By aggregating data into data warehouses or marts, BI solutions then expose that data to information workers and management in ways that permit them to freely iterate over data to find patterns, trends, and potential problems in what amounts to real time.

NoSQL, on the other hand, turns all these advances on their head. Not only are simple ad hoc queries against NoSQL databases almost impossible, but all reports against NoSQL data need to be run through costly development cycles, either to create fixed-format reports or to expose data for Extract, Transform, and Load (ETL) operations.

This is also true when it comes to ancillary cases where NoSQL is used in a supporting role for things like storing analytics or logging; any data worth collecting is data that management will eventually want to report against. And without existing standards or support for key considerations like incremental updates or standard (OLE DB or ODBC) extraction connectivity, data stored in NoSQL databases ends up being heavily siloed and inaccessible without significant additional effort.

Development Considerations
While NoSQL looks promising to many developers, assuring them that they can kick clunky SQL language considerations to the curb, NoSQL is missing too many features to make it viable for most business needs. Here are a few development considerations that rarely get mentioned in the hype surrounding NoSQL's benefits:

Versioning. Managing complex applications is difficult, especially when business needs are constantly shifting and developers have to roll out changes to code and data structures on systems that require HA. By properly separating physical schema from logical schema (i.e., using views and sprocs), developers working with relational databases can execute phased rollouts in even complex web-farm environments, which allow multiple versions of data and code to exist at the same time as web servers are pulled out of rotation and updated. This process is non-trivial but possible.
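One way to picture the separation of physical and logical schema is a view that keeps presenting the old shape of the data after the underlying table has changed. The sketch below uses Python's built-in sqlite3 module. The table, the view, and the scenario (a single full_name column split into two columns) are hypothetical and chosen only to illustrate the idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# New physical layout (hypothetical): the name is now split into two columns.
conn.execute(
    "CREATE TABLE customer_data ("
    "id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
)
conn.execute("INSERT INTO customer_data VALUES (1, 'Ada', 'Lovelace')")

# Logical layer: a view that still exposes the old single-column shape, so
# web servers that have not been upgraded yet keep working unchanged.
conn.execute(
    """
    CREATE VIEW customers_v1 AS
    SELECT id, first_name || ' ' || last_name AS full_name
    FROM customer_data
    """
)

# Old application code reads through the view...
old_app_row = conn.execute(
    "SELECT id, full_name FROM customers_v1 WHERE id = 1"
).fetchone()

# ...while upgraded code reads the new columns directly.
new_app_row = conn.execute(
    "SELECT id, first_name, last_name FROM customer_data WHERE id = 1"
).fetchone()
```

Both versions of the application see consistent data during the rollout, and the view can be dropped once the last server has been updated.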

With NoSQL, any indirection needed to handle versioning requirements needs to be architected and implemented directly by developers. (For example, something as simple as using different stored procedures won't work.) More importantly, while relational databases have to maintain transactional consistency during schema modifications, NoSQL implementations typically have very poor support for changes to underlying structure. In most cases, NoSQL services have to be restarted in order to facilitate any changes to underlying data structure. NoSQL's inability to deftly handle versioning requirements means that it just won't meet the constantly changing needs of most business applications today.

No transactions. NoSQL proponents are correct to point out that transactional complexity (especially in terms of locking and blocking considerations) can be tricky to manage. However, most business applications require complexity in order to manage typical workflows. For example, when it comes to payment processing or order placement, developers commonly need to perform operation A, then B, and then C. But they need to be able to roll back all operations if operation C fails or encounters an error. Without the ability to easily enlist in transactions facilitated by the underlying data storage platform, NoSQL developers are on their own to ensure that complex operations complete or roll back as required.
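The A, B, C scenario above can be sketched with Python's built-in sqlite3 module, where a connection used as a context manager commits on success and rolls back if an exception escapes. The tables, CHECK constraints, and the place_order function are hypothetical and stand in for a real order-placement workflow.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (sku TEXT PRIMARY KEY, qty INTEGER CHECK (qty >= 0))"
)
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, sku TEXT, qty INTEGER)")
conn.execute(
    "CREATE TABLE payments (order_id INTEGER, amount REAL CHECK (amount > 0))"
)
conn.execute("INSERT INTO inventory VALUES ('widget', 5)")
conn.commit()


def place_order(sku, qty, amount):
    """Run steps A, B, and C as one unit; a failure in any step undoes all."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            # A: record the order
            cur = conn.execute(
                "INSERT INTO orders (sku, qty) VALUES (?, ?)", (sku, qty)
            )
            # B: reserve stock (the CHECK constraint rejects overselling)
            conn.execute(
                "UPDATE inventory SET qty = qty - ? WHERE sku = ?", (qty, sku)
            )
            # C: record the payment (the CHECK constraint rejects bad amounts)
            conn.execute(
                "INSERT INTO payments VALUES (?, ?)", (cur.lastrowid, amount)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def count(table):
    """Row count helper for the hypothetical tables above."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```

If step C fails, the order row from step A and the stock change from step B disappear with it, and the application code never has to write compensating logic by hand.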

Consequently, developers in business environments are better off learning about transactional overhead, locking, and blocking and putting these features to work for them rather than taking the NoSQL approach where transactions aren't available.

Aggregates and normalization. NoSQL implementations are architected to provide highly performant CRUD operations against objects, documents, or graphs. This makes normal day-to-day operations very performant and scalable for end users of highly specialized applications. But if management suddenly discovers the need to determine the total number of orders placed by customers referred by affiliates in a certain state to evaluate tax implications, NoSQL runs into problems.

For starters, NoSQL support for aggregates or the ability to do SUM(), MAX(), AVG(), or GROUP BY is virtually nonexistent. Instead, NoSQL developers are left with two choices. They can iterate over data and calculate aggregate information with custom code. This choice tends to be very expensive to implement and means that developers have to learn specifics about the most efficient ways to traverse large amounts of NoSQL data without causing performance problems. Or, if developers know in advance that they'll need aggregated information, they can roll this information into their objects as needed. So, for example on a social networking site, a FriendsCount property for a User object could just be incremented or decremented as needed, as opposed to taking the relational approach of doing a COUNT() operation.
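The difference between hand-rolled aggregation and a set-based aggregate can be sketched as follows. The order documents are hypothetical and are held in a plain Python list rather than a real document store, and Python's built-in sqlite3 module stands in for the relational side.

```python
import sqlite3
from collections import defaultdict

# Hypothetical order documents, shaped the way a document store might hold them.
order_docs = [
    {"id": 1, "state": "WA", "total": 20.0},
    {"id": 2, "state": "OR", "total": 15.0},
    {"id": 3, "state": "WA", "total": 5.0},
]


def totals_by_state_manual(docs):
    """Custom-code aggregation: visit every document and keep running sums."""
    sums = defaultdict(float)
    for doc in docs:
        sums[doc["state"]] += doc["total"]
    return dict(sums)


# The same data in a relational table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, state TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (:id, :state, :total)", order_docs)


def totals_by_state_sql():
    """Set-based aggregation: the database engine does the grouping."""
    rows = conn.execute("SELECT state, SUM(total) FROM orders GROUP BY state")
    return dict(rows.fetchall())
```

Both functions return the same totals for this tiny data set. The manual version looks harmless here, but it has to be written, tested, and tuned again for every new question, while the SQL version changes by editing one statement.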

The problem, of course, is that developers won't always know what kinds of aggregations their applications will need in the future. And when they do, managing those details manually without transactional support becomes tedious. Moreover, since NoSQL implementations are object, graph, or document based instead of set based, they fail to deliver on ad hoc requests for aggregated details. Granted, relational databases may need complex queries or additional, even temporary, indexes to be able to address these needs, but they're much better suited to answer the kinds of questions that people couldn't even conceive of when the system was created.
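The affiliate question raised earlier (orders placed by customers who were referred by affiliates in a given state) is the kind of ad hoc request a set-based engine can answer without new application code. The sketch below uses Python's built-in sqlite3 module. The three tables, their columns, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE affiliates (id INTEGER PRIMARY KEY, state TEXT);
    CREATE TABLE customers (
        id INTEGER PRIMARY KEY,
        affiliate_id INTEGER REFERENCES affiliates(id)  -- NULL if not referred
    );
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id)
    );
    INSERT INTO affiliates VALUES (1, 'WA'), (2, 'OR');
    INSERT INTO customers VALUES (10, 1), (11, 2), (12, NULL);
    INSERT INTO orders VALUES (100, 10), (101, 10), (102, 11), (103, 12);
    """
)


def orders_from_affiliate_state(state):
    """Count orders whose customer was referred by an affiliate in `state`."""
    row = conn.execute(
        """
        SELECT COUNT(*)
        FROM orders AS o
        JOIN customers AS c ON c.id = o.customer_id
        JOIN affiliates AS a ON a.id = c.affiliate_id
        WHERE a.state = ?
        """,
        (state,),
    ).fetchone()
    return row[0]
```

Nothing in the schema anticipated this question, yet it is answered with a single query. On large tables, an index on the join columns may be needed to keep it fast.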

Two Conclusions About NoSQL
In its quest for high-end performance and scalability, NoSQL requires too many costly trade-offs to be used in regular business solutions. I have addressed only a few of those trade-offs at a very high level. Of course, this doesn't mean NoSQL is merely a passing fad.

Instead, when it comes to analyzing NoSQL, I can easily draw two conclusions: First, there is no silver bullet for businesses with relational database performance and scalability problems. They will be better served by addressing those issues instead of hoping NoSQL will magically make those problems disappear (through expensive rewrites). Second, NoSQL's existence will help drive relational database vendors toward addressing additional performance and scalability considerations. As it stands, relational database performance and scalability are not only good enough, but they can also be fantastic when correctly exploited. Relational databases aren't perfect, and NoSQL does address some very real edge cases where relational databases could be better.

Eventually, I think that relational databases will address some of the use cases that NoSQL tackles. In fact, it's arguable that Microsoft's Windows Azure or SQL Server Parallel Data Warehouse appliances along with Oracle's Grid computing (or Transportable Tablespaces) are examples of ways that key vendors are starting to address these issues today. One thing is for sure, though: Assuming that the benefits NoSQL provides for highly specialized workloads will translate over to normal business needs without expensive and potentially grave consequences is a serious mistake.

Organizations looking to use NoSQL or relational databases need to understand their storage requirements and the long-term implications before deciding which platform to use.

For a different perspective on NoSQL, and information to help you get started using it, see Mohammad Azam's article, Exploring the MongoDB Document Database: A Primer for .NET Developers.

Michael K. Campbell is a contributing editor for SQL Server Magazine and a consultant with years of SQL Server DBA and developer experience. He enjoys consulting, development, and creating free videos for www.sqlservervideos.com.
