Troubleshooting Performance Problems in Entity Framework Applications

The ADO.NET Entity Framework has been widely adopted by developers because it allows them to interact with SQL Server databases through abstractions, which provides two primary advantages. First, the abstractions let developers avoid the tedium of having to wire up database calls. Second, the abstractions shield developers from needing knowledge about the size, cardinality, distribution, and indexing of the data with which they're working. However, shielding developers from this complexity can produce problems. Some operations are nontrivial from a performance perspective. In many cases, these nontrivial operations can't be hidden without incurring some stiff performance penalties. Stated differently, a cardinal rule of performance tuning is that bad code typically equates to bad performance. Because the Entity Framework generates large amounts of frequently complex code, it would be naive to assume that it can properly optimize the T-SQL code it generates for every operation.

Related: The Entity Framework in Action

When it comes to addressing Entity Framework performance problems, there are a few key items that DBAs must recognize. First and foremost, DBAs need to understand that the Entity Framework's core concept of entities differs fundamentally from SQL Server projections. DBAs also need to understand what techniques are available to address Entity Framework performance problems. These techniques include:

Preemptively sizing and testing new Entity Framework applications
Using plan guides to tune queries in existing applications
Using stored procedures to tackle performance problems in existing applications
Using the hourglass approach within stored procedures to work with problematic queries that have a lot of aggregations and a substantial amount of filtering
Using stored procedures to address concurrency problems

Understanding Entities

Architecturally, the Entity Framework was designed to help developers overcome a phenomenon commonly referred to as impedance mismatch. In other words, it helps developers overcome the high degree of friction that results when translating set-based data to and from programming objects (i.e., entities) with complex serialization graphs. Although SQL Server tables can cleanly store data that can be efficiently mapped to programming objects through the Entity Framework's intrinsic mapping capabilities, there's a fundamental difference between the two systems: the Entity Framework thinks in terms of entities, whereas SQL Server thinks in terms of projections. The best way to demonstrate this difference is with an example.

One of SQL Server's greatest strengths is projection, which is the ability to return only the columns (or portions of columns) needed to fulfill a particular query. Figure 1 shows a simple example of projection, in which a query returns the top 10 users based on the number of comments they've made. Even though this query is joining a table with six columns and a table with eight columns, the query returns only two columns-the bare amount of data needed to answer the question "Which users have the most comments?" Just in terms of network chatter alone, projection provides huge performance benefits.

Figure 1: Example of SQL Server projection

The Entity Framework, however, commonly takes a much different approach to answering the same question because it isn't interested in merely answering the question. Instead, it's designed to return full details about the entities (i.e., entire rows from tables) that satisfy the question. So, for example, the query generated by the Entity Framework would likely retrieve all 14 columns (i.e., all the columns from both tables) so that developers could work with the results in terms of entities rather than projections. Although this approach might initially seem inefficient, developers commonly need to iterate over returned results for additional processing, outputting information to a screen, and so on. Therefore, it's arguable that getting everything in one pass avoids multiple roundtrips.

The key takeaway for DBAs is that the Entity Framework thinks in terms of entire rows rather than in terms of projections. As such, it's essential they realize that a covering index will lose much of its power and efficiency when dealing with Entity Framework applications and databases. Granted, covering indexes can still address commonly executed Entity Framework queries with multiple predicates, but a lookup is almost always going to be required to retrieve entire rows instead of allowing mere projections from the columns covered in a particular covering index.

In this way, the Entity Framework effectively shunts impedance mismatch onto SQL Server and the underlying hardware in order to streamline developer productivity. This means that the Entity Framework effectively offloads the "pain" associated with impedance mismatch to DBAs. Thus, as Entity Framework applications deal with more and more data and more and more users, they're likely to exhibit performance penalties-just as other non-optimized and untuned workloads typically do as they get larger and more heavily used.

In Entity Framework solutions, scalability isn't about size or growth. Instead, scalability represents a solution's ability to proportionately match response times with growth rates. Think of a scale model, for example. Size isn't as important of a concern as proportion. In this sense, Entity Framework databases can grow without problem. However, when the ratio of how long a query takes to execute doesn't scale with the degree to which the underlying data has grown, there's a scalability problem.

In this regard, an Entity Framework application is really no different than most other applications. Most applications perform just fine when their underlying databases are small and there aren't many users, but as concurrency and database size increase (especially to the point where the database can no longer be entirely contained in RAM), performance problems start to appear with increasing frequency. Left unchecked, they can escalate into performance nightmares.

Preemptively Sizing and Testing Applications

Because performance problems might not materialize until Entity Framework applications accrete larger amounts of data, one powerful technique that I often recommend is preemptively sizing and testing the applications' databases during development. (This technique works only with new projects.) Artificially bloating Entity Framework databases with large amounts of test data is a great way to help spot performance problems early on. A tool that I commonly recommend for this purpose is Red Gate Software's SQL Data Generator, which I reviewed in "SQL Data Generator." Using this tool, it's easy to insert gobs of synthesized data into test systems to simulate the kinds of distributions, sizes, and cardinalities of data anticipated through accretion. More important, the process of preemptively sizing and testing Entity Framework databases (and all new development projects for that matter) reminds developers that the 3MB databases that "fly" on their desktops behave much differently when they become 10GB databases running on heavily used servers.

Using Plan Guides

When DBAs can't preemptively size and test applications because they're already in production, they need to look at options for tuning unruly queries generated by the Entity Framework. Because the Entity Framework relies on the sp_executesql stored procedure, problematic queries are easy to identify. In some cases, DBAs can tune the problematic queries using Plan Guides.

However, Plan Guides tend to increase complexity and make databases more brittle, so they should really only be used as a last resort. Instead, wherever possible, I've found that the use of stored procedures typically provides much better results when it comes to tackling performance problems and problems associated with high degrees of concurrency.

Using Stored Procedures for Performance Problems

To deal with performance problems, DBAs can replace poorly performing queries (which are typically SELECT statements) with stored procedures. However, doing so can bring about other problems and complexities. For starters, many developers use the Entity Framework specifically to get rid of the need to use stored procedures. Consequently, proposing that developers use stored procedures as a means to address performance problems can be met with reactions ranging from incredulity to outright hostility. Thus, I'm not advocating that DBAs require developers to use stored procedures when using the Entity Framework. Instead, my goal is to point out that stored procedures can be a powerful tool for dealing with performance problems.

A bigger problem than culture, however, is that while the Entity Framework supports using stored procedures, leveraging them as a means to address performance problems isn't as simple as telling developers to swap out some Language-Integrated Query (LINQ) operations with calls to stored procedures. The biggest problem with using stored procedures is the need to address mapping requirements that preserve the interfaces and functionality that developers require within their Entity Framework applications. For example, if you replace a poorly performing Entity Framework-generated SELECT statement with a stored procedure that uses a common table expression (CTE), the stored procedure is useless on its own because the Entity Framework-generated SELECT statement was doing much more than just retrieving entire rows and mapping them to objects. It was also mapping a data context-a transparent set of change-tracking capabilities that developers love. The data context allows the Entity Framework to save any changes developers might make to data (or to entities) while in memory. As a result, in addition to swapping out the poorly performing Entity Framework-generated SELECT statement, you need to "swap in" corresponding stored procedures or interfaces that the Entity Framework can map for use in persisting any changes made to that data while in memory.

For a great overview of how the Entity Framework supports stored procedures through mapping, I strongly recommend Chapter 7 of Julia Lerman's Programming the Entity Framework (O'Reilly Media, 2009.) It details the ins and outs of properly managing (or mapping) context when working with stored procedures.

Using the Hourglass Approach

In my experience, performance problems are most pronounced in queries written for ad-hoc reports or operations. In these kinds of scenarios, it's not uncommon to find a lot of aggregations and a substantial amount of filtering occurring in the same query. Consequently, in those rare cases when the query optimizer makes less-than-ideal execution plan choices, DBAs can use what I commonly call the hourglass approach in the stored procedures they'll be using to replace problematic LINQ queries.

In the hourglass approach, you optimize expensive aggregate and reporting queries by breaking operations into smaller chunks (or distinct operations) so that the optimizer is able to use fewer resources and make better performance choices. For example, you can use a CTE or temporary table to reduce the number of aggregations and amount of filtering needed. For most DBAs, this approach is really nothing new. If you're unfamiliar with it, I outline the basics in the DevProConnections article "Generating High-Performance SQL Server Query Results."

When using the hourglass approach, you need to be aware of one caveat: After you address the core performance problem, you still have to make sure that the stored procedure's output is wide enough to feed the entities being generated from these queries. You also need to make sure that the Entity Framework application is able to map the data context if needed. So, once again, the issue of projection versus entities is a consideration in performance tuning.

Using Stored Procedures for Concurrency Problems

Like any other type of application, Entity Framework applications are prone to locking, blocking, and deadlocks when large numbers of concurrent operations need to work with large amounts of data. As such, I won't address those concerns here, other than to again point out that preemptively sizing and testing Entity Framework applications during development can help highlight potential problems before they occur in production.

Instead, I'll discuss another problem that Entity Framework applications will encounter if they use SQL Server tables as queues (i.e., as common storage for multiple concurrent threads, operations, or processes). To be fair, this particular problem is tough whether the Entity Framework is involved or not because it's actually not a performance problem per se. Instead, it's actually an architectural problem that typically stems from a misunderstanding of concurrency and isolation.

The problem is that the Entity Framework simply can't facilitate queue tables without heavy coordination between DBAs and developers. That's because the proper technique for allowing multiple threads or processes to simultaneously mark rows for processing can't be done without the simultaneous use of both READPAST and UPDLOCK table hints. (Snapshot isolation can also be used in some scenarios to address this problem, but most developers aren't even aware of this option.)

Unfortunately, I occasionally encounter nightmare scenarios where developers have "hacked" the Entity Framework to use READ UNCOMMITTED (i.e., NOLOCK) as a way to allow "free" concurrency to queue tables, only to find that enabling dirty-reads allows multiple processes or queue-readers to simultaneously process the same rows, usually with disastrous consequences. Although changing or toggling to a different isolation level is possible, it's the proper use of locking hints that facilitates concurrency in queue tables-not blindly tinkering with isolation levels.

The proper way to facilitate concurrency in queue tables is to use:

Stored procedures that utilize UPDLOCK hints for INSERT operations against queue tables.
Stored procedures that utilize UPDLOCK and READPAST locks when marking rows during processing. These locks let other threads, operations, or processes know that they need to keep their hands off these particular rows.
Stored procedures that are mapped against UPDATE operations to handle .SaveChanges() when called from within Entity Framework applications that have pulled in rows through the stored procedures utilizing UPDLOCK and READPAST locks.

Consequently, the only real way to properly establish functioning queue tables within Entity Framework applications is for DBAs and developers to work together to effectively replace any references to the queue tables with stored procedure calls (in terms of INSERT, SELECT, UPDATE, and DELETE operations). Anything less can result in locking and blocking problems that will lead to timeouts and errors when under significant load.

Making Peace with the Entity Framework

The Entity Framework has given me a few belly laughs in terms of some of the convoluted queries that I've seen it generate. Consequently, I think many DBAs will be skeptical of Entity Framework applications when they begin to encounter them. However, while the Entity Framework does present some unique performance-tuning challenges (which might require nontrivial fixes), I've found that it also addresses a host of other problems (e.g., logic bugs, data-coercion problems) and worst practices (e.g., using dynamic SQL that's susceptible to performance problems and SQL injection) that commonly occur when developers who aren't SQL Server savvy tackle database interactions on their own. Armed with knowing that the Entity Framework thinks in terms of entities rather than projections and armed with the five tuning techniques I presented, I'm guessing that many DBAs will learn to make some sort of peace with the Entity Framework.

Comments

Plain text