DATA TALES #11: The Case of the Ballooning Tables

Over recent months I have written a series of articles on how the overall size of a financial services database was tamed by applying table compression, XML compression, and PDF size reduction. I have applied this approach at many sites, but recently came across one where the outcome kept getting worse rather than better: every change we made to improve the situation seemed to make things worse. Let’s discuss why.

Transforming the Database

I’ve been writing about how important table compression is and the outcomes that we’ve achieved with it. One of the first steps I always take before applying table compression, though, is to work out which tables (and, if partitioned, which partitions) should have ROW compression and which should have PAGE compression. In the third article of the Database on a Diet series, I described code that can be used to determine an appropriate mix. Using ROW compression for every table (or partition), or PAGE compression for every table, are both inappropriate choices, so it’s important to apply those lessons. Because the decisions that my code makes are based on DMVs that are reset on each system restart, though, it should only be run on systems that have been in use for some time without a restart.
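The full logic is in that article, but as a rough sketch of the idea: the index operational-stats DMV shows whether each partition is scan-heavy (usually a good PAGE candidate) or update-heavy (often better served by ROW). The threshold below is purely illustrative:

-- Rough sketch only; the scripts in the Database on a Diet article implement the full logic.
-- Scan-heavy partitions generally benefit most from PAGE compression, while
-- update-heavy partitions are often better served by ROW compression.
SELECT OBJECT_SCHEMA_NAME(ios.object_id) AS schema_name,
       OBJECT_NAME(ios.object_id) AS table_name,
       ios.index_id,
       ios.partition_number,
       ios.range_scan_count,
       ios.leaf_update_count,
       CASE WHEN ios.range_scan_count > ios.leaf_update_count * 3   -- illustrative threshold only
            THEN 'PAGE' ELSE 'ROW'
       END AS suggested_compression
FROM sys.dm_db_index_operational_stats(DB_ID(), NULL, NULL, NULL) AS ios
WHERE OBJECT_SCHEMA_NAME(ios.object_id) <> N'sys'
ORDER BY table_name, ios.index_id, ios.partition_number;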
For this database at a large financial organization, all the signs were good. The server had been running non-stop for over seven months, and the tables were large enough to be of interest. I ran the scripts to work out a good mix of PAGE and ROW compression. Because of the size of the tables, though, I knew that I couldn’t perform all the required rebuilds of the tables and indexes during a single outage, so we planned the rebuilds across a number of weekends.
Before doing anything like this on large production systems, it’s important to make sure that it will work the first time. We had a copy of their production database to run tests on. It was identical to the real production database, except that the personal details of clients had been masked.

Masking Personally-Identifiable or Sensitive Data

Before continuing with this story, I feel the need to make a few comments about “masking” production data to remove personally-identifiable or sensitive information. It is very hard to organize suitable databases for developers and testers to work with. Generally, I see organizations try one of the following:

  • Just using a copy of the production database. I’m hoping that this is uncommon now, as it usually breaches any concept of protecting sensitive data. Unfortunately, I still see it happening at far too many companies.
  • Starting with a database that only has reference data and letting developers and testers enter the other data that’s needed. The outcome is invariably a database that is nothing like the actual production database, and out-of-the-ordinary options are rarely exercised. For example, you won’t find odd parsing issues in application code unless special characters are used in places where you didn’t immediately expect them.
  • Using generated data. This is a good option, particularly for greatly expanding the range of values being tested. For example, it’s one thing to use an nvarchar column to hold someone’s name, but an entirely different thing to use a wide variety of Unicode characters in that column during testing. One problem with this method is that the data often doesn’t look real to the developers and testers. Some tools can be extended so that you can add your own generators; for example, I could create a US phone number generator or an Australian phone number generator. The choice of tooling in this area, though, isn’t great.
  • Using a “masked” copy of the production database. While this generally provides a database that’s quite like the original, the masking process can be fraught with issues. For example, I was recently given a masked database to work with but found all the client details in the metadata of the PDF images that were stored within the database. Ironically, the PDFs themselves had been replaced.

When it comes to compression though, masking the data in a database can lead to bizarre outcomes. For example, if many values are changed to a single value, you might see compression ratios in test that are completely unrealistic.
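One way to catch that sort of distortion early is to run the same compression estimate on both the production system and the masked copy, and compare the numbers. Something along these lines works; dbo.Transactions is just a placeholder table name:

-- Estimate PAGE compression savings for one table. Running this on production
-- and again on the masked copy quickly shows whether the masking has changed
-- the compression characteristics of the data.
EXEC sys.sp_estimate_data_compression_savings
    @schema_name = N'dbo',
    @object_name = N'Transactions',   -- placeholder table name
    @index_id = NULL,
    @partition_number = NULL,
    @data_compression = N'PAGE';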

The Ballooning Database

We had tested the compression and rebuild process on the masked database that we had been provided. We worked out roughly how long each rebuild would take and, given that our hardware was lower spec than the client’s, presumed that the rebuilds on site would be no slower than those in our test system.

So when the first weekend arrived, a few tables were rebuilt. And that’s where the fun began. A rebuild that took at most two hours on our system took over six hours on the client’s system, which led to longer outages than planned. We were using offline rebuilds rather than online ones, both to minimize the time taken and because we had an outage window arranged.
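For reference, the rebuilds were of this general form; the table and index names here are placeholders for the actual objects:

-- Offline rebuild of all indexes on one table, applying PAGE compression.
-- dbo.Transactions is a placeholder table name.
ALTER INDEX ALL ON dbo.Transactions
REBUILD WITH (DATA_COMPRESSION = PAGE, ONLINE = OFF);

-- A single partition of one index can also be rebuilt on its own if rebuilding
-- the whole table won't fit into one outage window.
ALTER INDEX IX_Transactions_TransactionDate ON dbo.Transactions   -- placeholder index name
REBUILD PARTITION = 3
WITH (DATA_COMPRESSION = PAGE, ONLINE = OFF);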

Later, we checked and double-checked the scripts and deployment options and could find nothing wrong.

There is always a small chance of compressed data ending up larger than the same data uncompressed, but we had never seen that in any of our testing of SQL Server table compression. At first, we had not checked the overall size of the compressed table; we just presumed that it would be much smaller, matching the outcomes from our testing. That seemed reasonable, as this table was not masked and so should have been identical in production and in our test environment.

The next weekend we tried smaller tables. Once again, the time taken was much longer than expected. And this time we also noticed that the overall size of the database had increased, which wasn’t something we would have expected. So we started investigating the individual tables. The initial large table that had been rebuilt wasn’t 12% of its original size, as we had expected; it was now around double its original size.
On investigation, we found that the other tables that had been compressed were also larger (not smaller) than they had been originally.
At that point we realized that something odd was going on. The nature and layout of the data could not account for this.

Each partition of a table or index can have its own compression setting, so we queried sys.partitions to check that the data was in fact compressed.
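The check was along these lines (dbo.Transactions again stands in for the actual table name):

-- Show the compression setting for every partition of the table's indexes.
SELECT p.index_id,
       p.partition_number,
       p.rows,
       p.data_compression_desc
FROM sys.partitions AS p
WHERE p.object_id = OBJECT_ID(N'dbo.Transactions')   -- placeholder table name
ORDER BY p.index_id, p.partition_number;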

Sure enough, it was page compressed as expected.
It was only when we checked sys.indexes that we happened to notice that the fill_factor was 10. We were more than just surprised.
Again, we double-checked our scripts and realized that we had not specified a FILLFACTOR value.
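The sys.indexes check is simple enough (again with a placeholder table name):

-- Show the fill factor recorded for each index on the table.
SELECT i.name AS index_name,
       i.type_desc,
       i.fill_factor
FROM sys.indexes AS i
WHERE i.object_id = OBJECT_ID(N'dbo.Transactions');   -- placeholder table name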

Fill Factors

The default fill factor on a SQL Server instance is 0. This is the value that’s used when no other value has been specified, and functionally it’s the same as a value of 100: the aim is to fill pages. Generally we like to see pages completely full whenever possible, apart from some scenarios where inserts occur throughout a table, and we try to minimize the situations where that occurs.
When you create a table or an index, the default FILLFACTOR is also 0. That simply means you haven’t changed it; it will behave the same as if you had specified 100, so long as the default value at the server level hasn’t been changed.
It’s important to understand that changing a FILLFACTOR is a trade-off between out-of-order INSERT performance and SELECT performance. The lower the FILLFACTOR value, the faster you can insert data into the middle of a table (rather than at the end of the table). 
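If a particular fill factor matters to you, it’s worth stating it explicitly when you create or rebuild an index, so the instance-wide default can’t surprise you. A sketch, again with a placeholder table name:

-- Stating FILLFACTOR explicitly means the rebuild cannot silently pick up
-- an unexpected instance-wide default.
ALTER INDEX ALL ON dbo.Transactions   -- placeholder table name
REBUILD WITH (FILLFACTOR = 100, DATA_COMPRESSION = PAGE, ONLINE = OFF);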

FILLFACTOR and SELECT Performance

The downside of a lower FILLFACTOR is that you now need to read more pages to get the same amount of data. For example, if you had a value of 50, you now need to read twice as many pages to get the same number of rows.
It’s important to consider very carefully how much data your system reads as opposed to how much it writes. Whenever I have discussions with clients about what percentage of their I/O operations are writes, they tend to guess values like 20% or 30%. Every time I measure it (on typical OLTP systems), I find the real value is more like 1% or 2%.
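If you’d rather measure than guess, the cumulative file statistics give a rough indication. This counts I/O operations at the file level (including the transaction log), so treat it as an approximation:

-- Rough measure of write I/O as a share of total I/O for the current database,
-- based on cumulative file statistics since the last restart.
SELECT SUM(vfs.num_of_reads)  AS total_reads,
       SUM(vfs.num_of_writes) AS total_writes,
       CAST(100.0 * SUM(vfs.num_of_writes)
            / NULLIF(SUM(vfs.num_of_reads) + SUM(vfs.num_of_writes), 0)
            AS decimal(5, 2)) AS write_percentage
FROM sys.dm_io_virtual_file_stats(DB_ID(), NULL) AS vfs;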
This means that if I’m going to make a design choice that makes writes faster but slows down reads, there had better be a very good reason for doing so. In general, I want to optimize for the 98% use case, not for the 2% use case. There are exceptions though where the time to perform those INSERT operations is critical. In those cases however, I’d be looking at a table redesign to try to have appended INSERT operations at several positions within a table, rather than random inserts over the table.
One of the things that I don’t like about the design of SharePoint is that it defaults every index to a 70% FILLFACTOR. It does that, presumably, because it uses clustered uniqueidentifier (GUID) keys on its tables. That seems a poor design choice that we could discuss another day, but the 70% value is there to work around the resulting INSERT performance hit. In doing so, however, it immediately throws away a large percentage of SELECT performance.

The Culprit

In this case, we checked the default fill factor at the server level, and there was our answer.
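A query along these lines (or sp_configure, or the server properties dialog in Management Studio) shows the instance-wide setting:

-- Check the instance-wide default fill factor.
SELECT c.name,
       c.value_in_use
FROM sys.configurations AS c
WHERE c.name = N'fill factor (%)';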


Inexplicably, someone had set the default fill factor to 10 at the server level. A value of 10 means that SQL Server builds the leaf pages of an index only 10% full, leaving 90% of each page as free space.

I struggle to think of any realistic scenario where that setting would make sense.
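If you do find an instance configured this way, one way to return it to the documented default is via sp_configure. Note that fill factor is an advanced option and isn’t dynamic, so the change only takes effect after the instance is restarted:

-- Return the instance-wide default fill factor to 0 (i.e. fill pages).
EXEC sys.sp_configure N'show advanced options', 1;
RECONFIGURE;
EXEC sys.sp_configure N'fill factor (%)', 0;
RECONFIGURE;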

But it forced us to learn a very important lesson (again):

Even if you are working with a near-identical database in your testing, the outcomes can be completely different if the server-level settings aren’t the same.

This doesn’t only apply to settings like these; it applies to less obvious settings like trace flags. It’s important that your testing reflects the production environment.

And it was time to have a “chat” with whoever at the client site decided that setting the default fill factor to 10 was a good idea.
