Does Single-Instance Storage Matter Anymore?

Larger, faster, cheaper systems lessen its importance

Single-instance storage was a hyped feature of Exchange Server 4.0 when it shipped in March 1996. Single-instance storage is a system's ability to keep one copy of content that multiple users share. In the case of Exchange, single-instance storage holds one copy of message contents and attachments within the Store. Individual mailboxes access the content through a set of database pointers (i.e., references) and other properties, such as message senders and their recipients, that let the content take different identities. For example, a message can be filed in the Inbox folder in one mailbox and in a different folder in another mailbox.

Within a mailbox store, one large database table (the message table) holds message content. The pointers to the messages reside in message-folder tables, and Exchange maintains a separate message-folder table for every folder in the store. For example, the details of your Inbox folder reside in a message-folder table. When you want to read a message, Exchange takes the properties of the message from the message-folder table and the content from the message table and provides this information as a single data stream to the client, which then displays the result.

Single-instance storage keeps an access or usage count for each piece of content and increases this count by one for each mailbox that shares the content. Single-instance storage decreases the count as content is removed from mailboxes; when the count reaches zero, single-instance storage deletes the content from the Store.
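
The table layout and usage count described above can be illustrated with a small Python sketch. It is a toy model, not Exchange's actual implementation: the class name, method names, and the use of a simple message ID as the lookup key are all assumptions made for illustration.

```python
class SingleInstanceStore:
    """Toy model of one mailbox store (hypothetical, for illustration only)."""

    def __init__(self):
        self.message_table = {}   # message_id -> content, stored once
        self.ref_counts = {}      # message_id -> number of folder pointers
        self.folder_tables = {}   # (mailbox, folder) -> set of message_ids

    def deliver(self, message_id, content, mailboxes, folder="Inbox"):
        """Store the content once and add one pointer per receiving mailbox."""
        if not mailboxes:
            return
        self.message_table.setdefault(message_id, content)
        for mailbox in mailboxes:
            pointers = self.folder_tables.setdefault((mailbox, folder), set())
            if message_id not in pointers:
                pointers.add(message_id)
                self.ref_counts[message_id] = self.ref_counts.get(message_id, 0) + 1

    def read(self, mailbox, folder, message_id):
        """Follow the folder pointer to the single shared copy of the content."""
        if message_id not in self.folder_tables.get((mailbox, folder), set()):
            raise KeyError(f"{message_id} is not in {mailbox}/{folder}")
        return self.message_table[message_id]

    def remove(self, mailbox, folder, message_id):
        """Drop one pointer; delete the content when no pointers remain."""
        pointers = self.folder_tables.get((mailbox, folder), set())
        if message_id not in pointers:
            return
        pointers.remove(message_id)
        self.ref_counts[message_id] -= 1
        if self.ref_counts[message_id] == 0:
            del self.ref_counts[message_id]
            del self.message_table[message_id]


store = SingleInstanceStore()
store.deliver("m1", "Quarterly report", ["anna", "ben", "carla"])
print(len(store.message_table), store.ref_counts["m1"])  # prints: 1 3
```

Three mailboxes hold pointers, but the content exists once. Only when the last of the three mailboxes removes the message does the content leave the message table.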

Exchange isn't the first email system to use single-instance storage. ALL-IN-1, a corporate messaging system that Digital Equipment has sold since 1984, uses a similar scheme. The major difference between the implementations is that Exchange holds everything (content, pointers, and item properties) within one database, whereas ALL-IN-1 uses a database for the pointers and properties and individual files for messages and attachments. In both cases, single-instance storage is designed into the architecture to reduce the demand for disk space and eliminate redundancy. PC LAN-based systems such as Microsoft Mail (MS Mail) and Lotus cc:Mail usually deliver separate copies of messages to each mailbox, an approach that's perfectly adequate when a server never has to process more than 50 copies of a message. However, as servers scale up to support hundreds or thousands of mailboxes, creating individual copies imposes a huge drain on system resources and can swamp the capacity of the I/O subsystem to handle the workload. Matters only become worse as messages and attachments become larger.

The Evolution of Single-Instance Storage
At one point, anyone who created an Exchange implementation plan had to take single-instance storage into consideration. Single-instance storage was considered a major bonus of the Exchange architecture, especially as servers took on the load of the multiple PC LAN post offices that they replaced. However, I think that the need to conserve hardware resources was a major influence on Microsoft's decision to incorporate single-instance storage into Exchange.

Consider that the systems in use from 1996 to 1997 were much smaller than today's systems; disk space was more restricted and a lot more expensive, and you had to conserve network bandwidth. Now, systems are a lot faster and come equipped with more memory, the software makes better use of features such as multiple CPUs, copious disk space is available, and network bandwidth is cheaper. I haven't met many administrators or system designers recently who think about single-instance storage when they assess system design. The world has changed, and even Exchange has now undermined the feature through the introduction of multiple mailbox stores in Exchange 2000 Enterprise Server.

Remember that multiple recipients can share a message only if all their mailboxes reside in one mailbox store. When you create multiple stores, messages have to be delivered to every store that hosts a mailbox on the recipient list, which results in content duplication (because messages are stored in multiple databases) and additional I/O traffic (because data has to be committed to multiple databases).
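
To make the duplication concrete, here is a minimal Python sketch that groups a recipient list by mailbox store and counts the content copies that must be written. The function name, the store names, and the mailbox-to-store mapping are hypothetical.

```python
from collections import defaultdict


def copies_needed(recipients, store_of):
    """Group recipients by store; one content copy is written per store."""
    by_store = defaultdict(list)
    for recipient in recipients:
        by_store[store_of[recipient]].append(recipient)
    return dict(by_store)


recipients = ["anna", "ben", "carla", "dev"]

# All four mailboxes in one store: a single copy serves everyone.
one_store = {name: "Store1" for name in recipients}
single = copies_needed(recipients, one_store)

# The same mailboxes split across two stores: the content is written twice.
two_stores = {"anna": "Store1", "ben": "Store1", "carla": "Store2", "dev": "Store2"}
split = copies_needed(recipients, two_stores)

print(len(single), len(split))  # prints: 1 2
```

Each key in the returned dictionary is a database that has to commit its own copy of the message, so the number of keys is also a rough proxy for the extra I/O.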

Monitoring the Sharing Ratio
Simply put, the sharing ratio reflects how many mailboxes reference messages in a mailbox store. Some messages have a high sharing ratio—for example, a message sent to 10 mailboxes in the same store has a sharing ratio of 11:1 (10 for the other recipients plus 1 for the sender). Some messages have low sharing ratios; the best example is a message sent to an external recipient, such as an SMTP address. In this case, the sharing ratio is 1:1 because only one mailbox (the sender) references the message. When I discuss sharing ratio, I'm not interested in the ratio for any individual message. Instead, I look at the overall sharing ratio for a mailbox store. Thus, the sharing ratio is the average number of references to a message in the store.
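
As a worked example of that average, the following Python sketch computes a store-wide sharing ratio from per-message reference counts. The function name and the sample counts are assumptions chosen for illustration; they are not measurements from a real server.

```python
def store_sharing_ratio(reference_counts):
    """Average number of mailbox references per stored message."""
    if not reference_counts:
        return 0.0
    return sum(reference_counts) / len(reference_counts)


# One message shared by a sender and 10 recipients in the same store (11 references),
# plus four messages sent to external addresses (1 reference each, the sender's copy).
counts = [11, 1, 1, 1, 1]
ratio = store_sharing_ratio(counts)
print(ratio)  # prints: 3.0
```

A single widely shared message lifts the average noticeably, while a steady flow of externally addressed mail pulls it back toward 1.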

What factors influence the single-instance storage or sharing ratio that you'll see on a server? Here are a few factors that come to mind:

  • Messages sent to many users or large distribution lists (DLs) increase the ratio because more users share a single copy of a message. If you can arrange for users who tend to send messages to one another to share a server, you'll have a higher sharing ratio. Apart from achieving a higher sharing ratio, keeping messages on one server whenever possible reduces network traffic and speeds message delivery.

  • For much the same reason, the sharing ratio tends to be higher on larger servers than on smaller servers.

  • Fewer people tend to share messages sent to external Internet recipients than those sent to internal recipients. This statement is a generalization, but if you think of the messages you send to Internet recipients, you'll probably find that you address most messages to one recipient. Incoming Internet messages are often addressed to one recipient on the target server, which further reduces the sharing ratio.

  • Mailboxes that you transfer between servers by using the standard Move Mailbox option (in both Exchange 2000 Server and Exchange 5.5) preserve single-instance storage as much as possible. Note that sharing is not preserved when you move mailboxes from an Exchange 5.5 server to an Exchange 2000 server or vice versa. In single-instance storage, Exchange uses message properties to check whether a message already exists in the store on a target server. If the message exists, Exchange creates only a new pointer. However, if the message doesn't exist (because Exchange never delivered it to a mailbox on the target server or single-instance storage has since deleted all copies), Exchange creates a new copy of the content. An exception occurs when you use the Exmerge utility to transfer mailboxes. This utility always creates a new copy of message content. Exmerge, which is a simple export-import utility that's designed only to extract or import data from mailboxes, doesn't perform a check.

  • Exchange 2000 servers that run multiple mailbox stores have lower sharing ratios than those with one mailbox store. The reason is simple. As soon as you split mailboxes across stores, you increase the potential that Exchange must deliver a message to multiple databases. The more stores you have on a server, the lower the overall sharing ratio is. The implementation of multiple stores offsets the higher sharing ratio that you see with large servers.
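
The difference between a move that checks for existing content and an export-import that doesn't, as described in the list above, can be sketched in Python. This is a simplified model under stated assumptions: a plain message ID stands in for the message properties that Exchange compares, and both function names are hypothetical.

```python
def move_with_check(target_messages, target_refs, mailbox_items):
    """Move-style transfer: reuse content that the target store already holds."""
    for message_id, content in mailbox_items.items():
        if message_id not in target_messages:
            target_messages[message_id] = content  # no existing copy, so create one
        # In either case, add one more reference to the shared content.
        target_refs[message_id] = target_refs.get(message_id, 0) + 1


def import_without_check(target_messages, target_refs, mailbox_items, mailbox):
    """Export-import-style transfer: always create a private copy."""
    for message_id, content in mailbox_items.items():
        private_id = f"{mailbox}:{message_id}"
        target_messages[private_id] = content
        target_refs[private_id] = 1


mailbox_items = {"m1": "Budget", "m2": "Agenda"}

# The target store already holds m1, referenced by two mailboxes.
moved_messages, moved_refs = {"m1": "Budget"}, {"m1": 2}
move_with_check(moved_messages, moved_refs, mailbox_items)

imported_messages, imported_refs = {"m1": "Budget"}, {"m1": 2}
import_without_check(imported_messages, imported_refs, mailbox_items, "anna")

print(len(moved_messages), len(imported_messages))  # prints: 2 3
```

With the check, the target store ends up with two pieces of content and m1 gains a third reference; without it, the store holds three pieces of content and the existing copy of m1 is never reused.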

To check the sharing ratio on your server, you use the performance counters for the MSExchangeIS Mailbox object. Both a separate counter for each mailbox store and a total counter are available. Performance Monitor calculates the sharing ratio by dividing the total number of message pointers held in the message-folder tables by the total number of entries in the message table. One row in the message table represents a message no matter how many folders the message appears in. A message-folder table exists for every folder in every mailbox. For example, if a message is delivered to 20 users whose mailboxes are in the same store, Exchange creates one entry in the message table. Message-folder tables already exist for the 20 Inboxes, and Exchange updates them with a pointer to the new message. Gradually, as users delete messages, Exchange first moves the pointers to the message-folder table for the Deleted Items folder, then removes them when users empty the Deleted Items folder.

I monitored single-instance storage on an Exchange 2000 server with two mailbox stores, which Figure 1 shows. The highlighted row at the bottom of the window indicates a high sharing ratio (5.655), which means that on average, each message in the Store has 5.655 references to it. This ratio is extraordinarily high; it indicates that many messages go to many mailboxes on that server. Alternatively, a high ratio can also mean that your users are human pack rats who don't delete messages as often as they should. Sharing ratios I've seen across hundreds of Exchange servers deployed at Compaq range from approximately 1.2 to the value you see in Figure 1. Anecdotal evidence that I've gathered at conferences or through discussions suggests that a range of 1.5 to 2.5 is considered usual.

Unlike other performance counters, such as CPU utilization, these counters don't monitor realtime activity. Exchange understandably doesn't dedicate valuable resources to constant monitoring of the number of messages and folders in a store because the values are unlikely to vary dramatically. If you want to keep track of the sharing ratio on a server, record the value at a regular interval (weekly or monthly is enough).
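
One simple way to keep such a record is to append each reading to a CSV file. The Python sketch below assumes you read the counter value yourself (for example, from Performance Monitor) and pass it in; the function name, file name, and column names are hypothetical.

```python
import csv
from datetime import date
from pathlib import Path


def record_sharing_ratio(log_path, store_name, ratio, when=None):
    """Append one dated sharing-ratio reading to a CSV log."""
    when = when or date.today()
    path = Path(log_path)
    is_new = not path.exists()
    with path.open("a", newline="") as handle:
        writer = csv.writer(handle)
        if is_new:
            # Write the header only when the log file is first created.
            writer.writerow(["date", "store", "sharing_ratio"])
        writer.writerow([when.isoformat(), store_name, f"{ratio:.3f}"])


# Example call (writes to the current directory):
# record_sharing_ratio("sharing_ratio.csv", "Mailbox Store 1", 5.655)
```

Run weekly or monthly, a log like this is enough to show whether the ratio drifts after you add stores or move mailboxes.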

Exchange also provides a Sharing Instance Ratio counter for items in the public store. On my Exchange 2000 server, the counter reported a figure of 22, which implies that each item in the store was referenced 22 times. Mailboxes don't exist in a public store, so imagining how one item can be referenced more than once is hard unless that item resides in multiple public folders. However, attaining an average sharing ratio of 22 means that there's a lot of cross-indexing in public folders. Only 123 instances of public folders exist in the public store, so I can't work out how this result occurred.

Other counters don't make sense either. For example, an average sharing ratio of 0.000 appeared following a cluster transition when I moved the Information Store service between two physical nodes, so I assume the value is the result of a glitch in the cluster transition code.

Does Anyone Care Anymore?
Except for people like me—dinosaurs who remember the days when disks were expensive and paying attention to how you organized data was important—no one seems to care about sharing ratios anymore. The introduction of multiple mailbox stores in Exchange 2000 seems to imply that single-instance storage is now one of the esoteric backwater features of Exchange, part of the basic architecture that has become less important over time.

Microsoft had to do something to address the problem that Exchange databases had just become too big (when you pass 100GB, a database becomes harder and harder to maintain), and splitting the load across multiple stores is a good solution. In addition, increased server performance, including the all-important ability to process I/Os efficiently, makes it possible for servers to support multiple stores. However, Exchange has lost some of the beauty of its original design. Unfortunately, the architecture, utilities, and management disciplines don't exist to enable single-instance storage to last much longer.
