In a typical content database, documents—their metadata and the binary large object (BLOB) that contains their content—tend to consume significant real estate. The content database becomes bloated by the BLOBs it stores.
Relating this to SharePoint specifically, Microsoft says, "Typically, as much as 80 percent of data for an enterprise-scale deployment of SharePoint Foundation consists of file-based data streams that are stored as BLOB data. These BLOB objects comprise data associated with SharePoint files."
This "80 percent" estimate doesn't tell the full story, however. It fails to show the scale of the impact of BLOB storage. In a typical collaborative environment, you might find that the storage required for a single document is significantly more than you would expect: in fact, many multiples of the document's size.
Few writers thoroughly examine the total impact of a document and its BLOB storage through the entire lifecycle of the BLOB. What I hope to do this week is to create a new "pivot"—a new perspective—on the question of storage, capacity, and planning.
And so, I present to you (thanks in no small part to my colleague and fellow MVP, Randy Williams, who contributed greatly to this while traveling to an event in Holland, no less!): The Life and Times of a Document and Its Impact on Storage.
Document and Metadata Storage
SharePoint supports documents up to 2GB in size, a software boundary that results from a 32-bit pointer used in SQL Server. There is no way to exceed that limit and to store larger documents in the content database—with or without BLOB externalization.
SharePoint includes a file size upload limitation that's configurable per web application. The default maximum upload size is 50 MB—considerably smaller than the 2 GB hard limit.
This lower limit reflects practical concerns including network performance, the performance of transferring large files over HTTP, and user expectations for performance of file transfer. Many organizations retain this default upload size or raise the limit slowly and with careful testing.
Each document in a SharePoint library has metadata associated with it. Some metadata is user-configured, such as columns in the library.
Other metadata is used internally by SharePoint. The amount of metadata associated with a document will vary based primarily upon user-configured metadata.
Scenarios with larger documents necessarily see a higher ratio of BLOB-to-metadata storage, while scenarios with smaller documents and more metadata see lower ratios. The 80 percent estimate is based on an average across multiple SharePoint environments.
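To make the ratio concrete, here is a minimal Python sketch. The function name and the sample sizes are hypothetical, chosen only for illustration; they are not measurements from a SharePoint environment.

```python
def blob_share(blob_bytes: int, metadata_bytes: int) -> float:
    """Return the fraction of an item's storage that is BLOB data."""
    total = blob_bytes + metadata_bytes
    if total <= 0:
        raise ValueError("total size must be positive")
    return blob_bytes / total


KB = 1024

# Hypothetical sizes: a large file with a little metadata...
large = blob_share(blob_bytes=4096 * KB, metadata_bytes=16 * KB)
# ...and a small file carrying the same amount of metadata.
small = blob_share(blob_bytes=64 * KB, metadata_bytes=16 * KB)

print(f"large document: {large:.1%} BLOB")
print(f"small document: {small:.1%} BLOB")
```

With the same metadata, the larger file pushes the BLOB share up and the smaller file pulls it down, which is why a single average figure hides a lot of variation.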
But here’s the rub: a document is rarely if ever stored only once.
When version history is enabled for a document library, any change to the document or its metadata results in additional storage utilization. Two points are often misunderstood and have a significant impact on storage.
1. No differential compression is used within SharePoint. So, when a new version is saved, the amount of storage represents the entire size of the file—not just the “differences” between versions. Conceptually, two versions of a document with minor changes will occupy 2 x (document size + metadata) of storage.
2. A version is created if the document is modified or if metadata is modified. So if a document is uploaded to a library and is never changed, but the metadata associated with that document is changed five times over the course of a month, the library holds six versions (the original plus five metadata changes), and the storage occupied by that document is approximately 6 x (document size + metadata).
When versioning is enabled, the impact of a document on storage is multiplied by the number of versions of that document.
Therefore, it's critical to enforce limits to version history—unlimited version retention can lead to significant database bloat.
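The multiplication described above can be sketched in a few lines of Python. This is only an approximation under the assumption stated in the text, that every version is stored as a full copy; the function and parameter names, including the version_limit cap, are hypothetical and do not correspond to a SharePoint API.

```python
from typing import Optional


def versioned_storage(doc_bytes: int, metadata_bytes: int,
                      versions: int,
                      version_limit: Optional[int] = None) -> int:
    """Approximate bytes used by one document when each version is a full copy.

    versions counts every stored version, including the initial upload.
    version_limit is a hypothetical cap on retained versions;
    None models unlimited retention.
    """
    if versions < 1:
        raise ValueError("a stored document has at least one version")
    retained = versions if version_limit is None else min(versions, version_limit)
    return retained * (doc_bytes + metadata_bytes)


MB = 1024 * 1024
KB = 1024

# A 5 MB document with an assumed 10 KB of metadata and 50 versions.
unlimited = versioned_storage(5 * MB, 10 * KB, versions=50)
capped = versioned_storage(5 * MB, 10 * KB, versions=50, version_limit=10)

print(f"unlimited retention: {unlimited / MB:.1f} MB")
print(f"capped at 10 versions: {capped / MB:.1f} MB")
```

Comparing the two results shows how a version limit bounds the storage that a frequently edited document can consume.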
Recycle Bin Contents
When a document is deleted, the document and its versions are retained based on the web application’s settings for the SharePoint Recycle Bin. A user can restore a deleted document from her Recycle Bin.
When a user empties the Recycle Bin, the document and its versions continue to be retained, and can be restored by site collection administrators, from what is referred to as the second-stage Recycle Bin.
Each site collection has a Recycle Bin. However, the Recycle Bin has two configurable settings, both scoped to a web application. These settings apply to all site collection Recycle Bins in the web application.
The first Recycle Bin setting specifies the total number of days that a deleted document will be retained by the Recycle Bin. This setting applies from the moment the document is deleted.
It doesn't matter whether the document is in the user Recycle Bin or the second-stage Recycle Bin. X days after a document was originally deleted by the user, it's deleted from the Recycle Bin and the document is removed from the content database.
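As a rough illustration of this rule, the following Python sketch computes when a deleted document would leave the content database. The function name is hypothetical, and the 30-day retention value is just an example input, not a statement about how any particular farm is configured.

```python
from datetime import date, timedelta


def purge_date(deleted_on: date, retention_days: int) -> date:
    """Date on which a deleted item is removed, regardless of Recycle Bin stage.

    The clock starts at the original deletion, so moving the item to the
    second-stage Recycle Bin does not extend its lifetime.
    """
    return deleted_on + timedelta(days=retention_days)


# Example: deleted on 1 January with an assumed 30-day retention setting.
removed = purge_date(date(2024, 1, 1), retention_days=30)
print(removed.isoformat())
```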
The second setting applies a storage quota to the second-stage Recycle Bin. When items are moved to the second-stage Recycle Bin, they count against this quota. When the quota is reached, the oldest items in the second-stage Recycle Bin are removed to make room for newly-deleted items.
The quota is configured as relative to the quota of the site collection. So if a site collection is subject to a 50GB quota, and the second-stage Recycle Bin is limited to 50 percent of the quota, then the second-stage Recycle Bin for that site collection is effectively capped at 25GB.
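The quota arithmetic and the oldest-first removal can be sketched as follows. This is a simplified model with hypothetical names, not SharePoint's actual implementation: items are tracked as (name, size) pairs in deletion order, and the oldest are evicted until the bin fits under the cap.

```python
from collections import deque
from typing import Deque, List, Tuple


def second_stage_cap_gb(site_quota_gb: float, recycle_percent: float) -> float:
    """Cap of the second-stage Recycle Bin, relative to the site quota."""
    return site_quota_gb * recycle_percent / 100


def add_to_second_stage(bin_items: Deque[Tuple[str, float]],
                        new_item: Tuple[str, float],
                        cap_gb: float) -> List[str]:
    """Add (name, size_gb) to the bin and evict the oldest items if over cap.

    Returns the names of evicted items, oldest first.
    """
    bin_items.append(new_item)
    evicted = []
    # Keep the newest item even if it alone exceeds the cap (a simplification).
    while sum(size for _, size in bin_items) > cap_gb and len(bin_items) > 1:
        name, _ = bin_items.popleft()
        evicted.append(name)
    return evicted


cap = second_stage_cap_gb(site_quota_gb=50, recycle_percent=50)
items = deque([("report-v1.docx", 10.0), ("archive.zip", 10.0)])
removed = add_to_second_stage(items, ("video.mp4", 10.0), cap)
print(f"cap: {cap} GB, evicted: {removed}")
```

The first function reproduces the example in the text: a 50GB site quota with a 50 percent setting yields a 25GB cap.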
Therefore, the total storage impact of a document on a content database must account for the fact that, until the document is purged from the second-stage Recycle Bin, its BLOB and metadata, and those of all its versions, continue to occupy space in the content database.
Audited activities generate entries in the audit log. The amount of storage required for auditing can be significant, particularly if you are auditing view activities. However, audit entry size and the size of audit logs are not related to document size, or to whether BLOBs are stored in SQL Server or are externalized. Therefore, while you should consider auditing when estimating total storage requirements for a content database, we will not examine auditing in more depth in this white paper.
Office Web Apps Cache
To improve performance of SharePoint when the Microsoft Word web app and Microsoft PowerPoint web app are used, the web apps create renditions of a document in a cache called the Office Web Apps cache.
When a document is rendered, it can be pulled from the cache. A document is re-rendered only if it doesn't exist in the cache, or if the document has changed after the rendition in the cache was created. A timer job removes documents from the cache after a configurable expiration period.
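The re-render rule described above amounts to a simple check, sketched here in Python. The cache is modeled as a plain dictionary from a document identifier to the time its rendition was created; this structure and the names are assumptions for illustration, not how Office Web Apps stores its cache.

```python
from datetime import datetime
from typing import Dict


def needs_rendering(cache: Dict[str, datetime],
                    doc_id: str,
                    modified_at: datetime) -> bool:
    """True if the document must be rendered again before it can be served.

    A rendition is reused only when it exists and is at least as new as
    the document's last modification.
    """
    rendered_at = cache.get(doc_id)
    return rendered_at is None or modified_at > rendered_at


cache = {"plan.docx": datetime(2024, 3, 1, 9, 0)}

# Unchanged since the rendition was created: served from the cache.
print(needs_rendering(cache, "plan.docx", datetime(2024, 2, 28, 17, 0)))
# Edited after the rendition was created: must be rendered again.
print(needs_rendering(cache, "plan.docx", datetime(2024, 3, 2, 8, 30)))
# Never rendered: not in the cache at all.
print(needs_rendering(cache, "budget.pptx", datetime(2024, 1, 15, 12, 0)))
```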
If a web application is associated with the Microsoft Word or PowerPoint web apps, one content database will contain the cache for all content in the web application. In a document-heavy web application, the cache can grow quite large.
By default, the cache is capped at 100GB. It's best practice to configure Office Web Apps to use a separate, dedicated content database in a SharePoint web application, and to manage the size of the cache to optimize performance and storage. You can learn more about this at the Microsoft TechNet site.
The size of the Office Web Apps cache isn't dependent on whether BLOBs are stored in SQL Server or are externalized. It's based purely upon the number and size of documents, frequency of access to those documents, and on administrator configuration.
So while the Office Web Apps cache should be considered as part of the estimate of storage required for a web application, it will not, if in a dedicated content database, affect the storage required for other content databases in a web application.
A document indirectly affects the storage required by service applications. For example, access to a document might be tracked by the Web Analytics service application.
Tagging, commenting and rating activities consume approximately 9KB per entry in the social tagging database of the User Profile service application.
Such data are relatively negligible, aren't dependent on whether a document’s BLOB is stored in SQL Server or is externalized, and aren't directly dependent on the document’s size.
However, the Search service application is affected directly by both the number of documents and their size. The crawl database, properties database, and index partitions each have a relationship to the number and size of documents.
Search capacity planning is both a science and an art, but very rough estimates from typical implementations fall around 20 percent of the total size of indexed content (the corpus).
So if you are indexing 1TB of typical content, you can expect approximately 200GB of storage utilization by search-related databases and the index. For more information, see Microsoft's take on this.
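That rule of thumb is easy to express in code. The sketch below uses the rough 20 percent figure from the text as a default; the function name is hypothetical, and the result is a very coarse planning number rather than a sizing guarantee.

```python
def search_storage_gb(corpus_gb: float, percent_of_corpus: float = 20) -> float:
    """Rough estimate of storage used by search databases and the index."""
    if corpus_gb < 0:
        raise ValueError("corpus size cannot be negative")
    return corpus_gb * percent_of_corpus / 100


# 1TB of indexed content, expressed as 1,000GB.
estimate = search_storage_gb(1000)
print(f"estimated search storage: {estimate:.0f} GB")
```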
Search and other service database sizes aren't dependent on whether BLOBs are stored in SQL Server or are externalized.
SQL Server logs all activity to the transaction log for a database before committing the transaction to the data portion of the database. Transaction logs grow until a log backup occurs, at which point the space used by the log is cleared, but the file size does not shrink.
You can shrink a SharePoint transaction log manually, which can be helpful if a transaction log has grown out of control, but it's best practice to manage transaction log size by managing transaction log backups.
When a document is uploaded or modified, the document BLOB and metadata are first written to the transaction log. Then the transaction is committed to the appropriate tables in the content database itself.
Therefore, the true impact of a document on total content database size, including the transaction log, can be approximated as document size x (creation + modifications) x 2 during the window between log backups.
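Here is that approximation as a small Python sketch. It simply restates the formula from the text; the function and parameter names are hypothetical, and it deliberately ignores metadata, version limits, and anything else that would refine the estimate.

```python
def footprint_between_log_backups(doc_bytes: int, modifications: int) -> int:
    """Approximate bytes a document occupies in the data file plus the log.

    Each write (the initial creation plus every modification) lands once in
    the transaction log and once in the content database, hence the factor
    of 2. The log portion is reclaimed only after a log backup.
    """
    if modifications < 0:
        raise ValueError("modifications cannot be negative")
    writes = 1 + modifications  # creation + modifications
    return doc_bytes * writes * 2


MB = 1024 * 1024

# A 10 MB document created and then modified four times before a log backup.
total = footprint_between_log_backups(10 * MB, modifications=4)
print(f"approximate footprint: {total / MB:.0f} MB")
```

In this example the five writes of a 10 MB document account for roughly 100 MB across the data file and the log until the next log backup runs.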
The transaction log size is directly related to BLOB storage. If BLOBs are externalized, and aren't stored in the content database, then the BLOB is also not written to the transaction log.
As you can see, the storage required for just one document can vary greatly, based on version retention, modification of the document or its associated metadata, web application settings such as Recycle Bin, auditing settings, the use of Office Web Apps and other service applications, and even backup policies.
In a typical, highly collaborative scenario, an active document may be consuming storage equivalent to many multiples of the document’s actual size.
Obviously, this sets the stage for a discussion of how to most effectively manage and optimize your SharePoint storage. We’ll tackle that in the next couple of weeks!