
New Ways to Enable High Availability for File Shares

Windows Server 2012 File and Storage Services and SMB 3.0 change how file sharing works

What's the coolest feature in Windows Server? My guess is that file-sharing services didn't make your top five. But that might change with Windows Server 2012. File and Storage Services combined with the new Server Message Block (SMB) 3.0 protocol (formerly known as SMB 2.2) introduce some truly great new features that completely change how file sharing works and that can be used in a highly available configuration. In this article, I'll focus on two new capabilities of file services in a failover cluster: SMB Transparent Failover and SMB Scale-Out. I'll show how you can use these capabilities together to provide a file services environment that can be used for the most demanding workloads, including hosting Microsoft SQL Server databases and Hyper-V virtual machines (VMs).

File Services in a Failover Cluster Environment

Before I focus on the new features, I want to quickly describe how file services work in a failover cluster environment, which allows highly available file servers and, more specifically, file shares. A Server 2012 failover cluster consists of as many as 64 servers (up from 16 in Windows Server 2008 R2) that have the Failover Clustering feature installed and are configured to share a common set of storage and services.
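If you want to build a small lab cluster to follow along, the FailoverClusters PowerShell module covers the basics. Here's a minimal sketch, assuming two lab nodes named Node1 and Node2 and a spare cluster IP address of 192.168.1.50 (all placeholder values; this isn't a full deployment guide):

# Install the Failover Clustering feature and its management tools (run on each node)
Install-WindowsFeature -Name Failover-Clustering -IncludeManagementTools

# Validate the configuration, then create the cluster (placeholder names and address)
Test-Cluster -Node Node1, Node2
New-Cluster -Name LabCluster -Node Node1, Node2 -StaticAddress 192.168.1.50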

The services that are defined in a cluster can be moved between the servers (aka nodes) in the cluster. These services consist of various resources, such as IP address, network name, storage, and the actual service, such as a file server, print server, VM, Microsoft Exchange Server mailbox server, and so on. Services can be moved between nodes in the cluster in a planned situation or in an unplanned scenario, such as a server failure. In the latter case, services that ran on the failed server are automatically redistributed among the remaining nodes in the cluster.
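For a planned move, you can drag the role to another node in Failover Cluster Manager or use PowerShell. A quick sketch, assuming a clustered file server role (group) named FileServer1 and a target node named Node2 (both placeholder names):

# Show the clustered roles (groups) and the node each one currently runs on
Get-ClusterGroup

# Move the file server role to another node as a planned operation
Move-ClusterGroup -Name "FileServer1" -Node "Node2"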

Figure 1 shows a four-node cluster and a file server resource. The file server offers a single file share, which stores its content on an NTFS-formatted LUN. The LUN is a block of space from the shared storage, to which all the nodes in the cluster can connect. The file server, and thus the file share, is initially online on the third node of the cluster, which also mounts the LUN containing the file server content. If that node fails, the file server moves to the fourth node, which must mount the LUN to offer the share content. The LUN must be mounted by whichever node is offering the corresponding file server because NTFS is a shared-nothing file system and can't be accessed concurrently by more than one node. Therefore, when a file server moves to another node, the LUN must move with it. A file server is online on only one node at a time.

Figure 1: Basic failover cluster with a service moving between the nodes

SMB Transparent Failover

The previous example involves challenges to using a file share that is moved between nodes in the cluster in planned and unplanned scenarios. First, when a file on a file share is used by an application, handles are typically created to allow the application to access the file and potentially to lock it, preventing another application from writing to it at the same time. In addition, the handle defines how data is accessed and specifically whether data can be buffered on the file server, which might help to enhance performance. With Server 2008 R2 and earlier, any handles and locks are lost when the file server moves to another node. In general, this behavior doesn't cause a huge problem for regular users accessing Microsoft Word documents. However, that wouldn't be true if the file were a database used by SQL Server.

The second challenge involves the time that is needed for a file server client to recognize that a file server is no longer available and to start taking recovery steps. TCP/IP timeout values can typically cause an interruption of about 40 seconds -- unacceptable when server applications store data on file shares. For those 40 seconds, all activity that requires file I/O to the share pauses -- an event commonly known as a brownout. Removing these challenges is vital for SMB. If server applications such as SQL Server and Hyper-V are going to use SMB file shares, they can't lose data handles or suffer 40-second pauses in I/O!

The new SMB Transparent Failover feature addresses both issues. The feature enables continuously available file shares for SMB 3.0 clients, removing the loss of handles during a failover and reducing the time needed to detect that a file server has moved to another node, thus reducing brownouts.

Keeping file shares available. SMB Transparent Failover consists of several configuration changes and new technologies. One benefit that file servers traditionally offer clients is buffering of data writes to disk. Buffering provides faster acknowledgments to client write requests: the file server caches the write operation in its volatile memory (meaning that if the server loses power, it loses the data), tells the client that the data is written so that the client can carry on with its work, then performs the write in the most optimal way. Certain applications always open handles with this caching disabled, through the use of the FILE_FLAG_WRITE_THROUGH attribute when creating the handle, ensuring that data is always written to the actual disk before the application receives acknowledgment and avoiding any volatile cache. SMB Transparent Failover sets FILE_FLAG_WRITE_THROUGH as the default for all created handles, eliminating the use of the volatile memory cache. There might be some slight performance implications because the cache is no longer used, but the assurance of data integrity is a good trade for the possibility of a slight performance degradation.
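If you want to see what a write-through handle looks like from the client side, the .NET FileOptions.WriteThrough flag maps to the Win32 FILE_FLAG_WRITE_THROUGH attribute described above. The following PowerShell sketch is purely illustrative (the UNC path is a placeholder); with SMB Transparent Failover, the server enforces this behavior for you:

# Placeholder path on a continuously available file share
$path = '\\WIN8FS\AppData\test.dat'

# FileOptions.WriteThrough corresponds to FILE_FLAG_WRITE_THROUGH: the write is
# acknowledged only after it reaches the disk, not a volatile cache
$ctorArgs = @(
    $path,
    [System.IO.FileMode]::OpenOrCreate,
    [System.IO.FileAccess]::ReadWrite,
    [System.IO.FileShare]::None,
    4096,                                 # buffer size
    [System.IO.FileOptions]::WriteThrough
)
$stream = New-Object -TypeName System.IO.FileStream -ArgumentList $ctorArgs

$bytes = [System.Text.Encoding]::UTF8.GetBytes('hello')
$stream.Write($bytes, 0, $bytes.Length)
$stream.Dispose()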

The second change that SMB Transparent Failover makes is how the OS manages file handles. File handles typically are stored in the memory of the file server. However, if a node fails and the file server moves to another node in the cluster, the handles are lost -- bad news for the application using them. In addition to storing the handle state in memory, SMB Transparent Failover backs up the handle state in the Resume Key Database, in the System Volume Information folder of the disk on which the referenced file resides. Storing the handle information on disk maintains the handle state when the file server moves between nodes in the cluster. However, because disk access is orders of magnitude slower than memory access, heavy metadata-generating workloads such as creating, deleting, renaming, extending, opening, and closing files cause additional I/O in the Resume Key Database, removing available I/O from normal disk usage. But again, this tradeoff is acceptable to ensure that handles are maintained when moving file servers between nodes. (See the sidebar "What About Performance?" for my rationale on this exchange.)

Reducing brownouts. To meet the second challenge and reduce the time that an SMB client takes to realize that its TCP connection has died, the cluster must be proactive. The cluster must notify SMB clients that connect to a cluster-hosted share whenever the hosting file server moves to another node. That way, the client can more quickly reconnect. Enter the new SMB Witness capability, which operates something like this:

SMB Client: "I want to connect to this share on your ServerA."

SMB ServerA: "OK, you are connected. This share is hosted on a cluster; let your SMB Witness process know."

SMB Client Witness: "Great! Tell me about all the nodes in the cluster."

SMB ServerA: "Here is a list of all the nodes in the cluster: ServerA, ServerB, ServerC . . ."

SMB Client Witness: "Hey, ServerB. I am connecting to this share with this IP address on ServerA. I want to register with you so that you can tell me if something happens to ServerA or if the file server moves."

SMB ServerB: "Sure, I'll let you know."

After this exchange, if anything happens to that file server in the cluster, the SMB client is notified proactively via its SMB Witness process and can reconnect far more quickly than TCP/IP timeouts would allow. The new time to detect and react to a failure or file server move is likely in the range of 5 to 7 seconds instead of 40 seconds.

To enable SMB Transparent Failover, you don't need to do a thing. When you use the Failover Cluster Manager, Server Manager, or Windows PowerShell to create a file share on a Server 2012 cluster file server, SMB Transparent Failover is enabled by default on that share. (Note that this isn't the case when you create the share by using Explorer or the Net Share command, neither of which understand SMB Transparent Failover.) Windows 8 or Server 2012 clients, which are SMB 3.0-compatible, will then use the SMB Witness capability and will open sessions to use write-through handles.
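If you prefer PowerShell, the setting is exposed directly on the New-SmbShare and Set-SmbShare cmdlets. A minimal sketch, run on the cluster node that currently owns the file server role; the share name, folder path, and account are placeholders:

# Create a share with SMB Transparent Failover (continuous availability) enabled
New-SmbShare -Name AppData -Path 'E:\Shares\AppData' `
    -FullAccess 'CONTOSO\SQLSvc' -ContinuouslyAvailable $true

# Or enable it on an existing share
Set-SmbShare -Name AppData -ContinuouslyAvailable $true -Force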

You can use PowerShell to confirm that this process is happening. In my lab, I have two nodes in a cluster with a file server resource and a share. I connected from my client machine, and from an elevated PowerShell window I executed the following command on a node in the cluster:

PS C:\> get-smbwitnessclient | select clientname, fileservernodename, witnessnodename

clientname  fileservernodename witnessnodename
----------  ------------------ ---------------
savdalwks08 WIN8FC01           WIN8FC02

As you can see, the output shows the name of my client computer (savdalwks08), the file server to which the client is connected (Win8FC01), and the node with which it has registered for notification (the witness, Win8FC02). (Another option is to use the Get-SmbOpenFile PowerShell cmdlet and look at the ContinuouslyAvailable property.)
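For that second option, a quick sketch run on the node that currently hosts the file server might look like this (which properties you select is up to you):

PS C:\> Get-SmbOpenFile | Select-Object ClientComputerName, Path, ContinuouslyAvailable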

To view a list of all the administrator-created shares and to determine whether they are configured for continuous availability, use the following PowerShell code:

PS C:\> Get-SmbShare | Where {$_.Scoped -eq "true" -and $_.Special -ne "True"} | Sort ClusterType | Format-Table Name, ScopeName, ClusterType, ContinuouslyAvailable, Path

Name       ScopeName   ClusterType ContinuouslyAvailable Path
----       ---------   ----------- --------------------- ----
NonCSVData WIN8FSTRAD  Traditional                  True E:\Shares\NonCSVData
DataCSV    WIN8FSSCOUT ScaleOut                     True C:\ClusterStorage\Vo...

SMB Scale-Out

The way file servers work in a cluster hasn't changed fundamentally since failover clustering was introduced. Only one node in a cluster can mount and host shares for a particular NTFS-formatted LUN at any one time. This single-node model can limit scalability and introduce delays because LUNs must be dismounted, moved, and mounted when the file server resource moves. This necessity has led storage and file services architects to make some sub-optimal design decisions when planning their clusters, just to avoid nodes sitting idle.

Consider an organization that wants to share one NTFS volume but requires the share to be highly available. This scenario requires at least two hosts in a cluster, but only one host can actually offer the share. To avoid this active/passive situation in which one host does nothing, the storage administrators divide the storage into two LUNs, create two NTFS volumes (one on each LUN), then create two file servers in the cluster, each with its own share. This setup allows each node to offer one share and to host the other node's share during a failure. This way, both hosts are working -- but the storage is now divided in ways the organization might not want. In addition, if you don't divide the content correctly, one share might get more traffic than the other, causing an imbalance and potentially forcing you to move data around. And this is with just two nodes. Now imagine that you have four nodes, as Figure 2 shows, or eight nodes; that's a lot of separate LUNs, NTFS volumes, and shares just to keep all the nodes in the cluster busy.

Figure 2: Required compromise with traditional clustered file servers

The root of the problem is that an NTFS volume can't be shared and can't be used by more than one node simultaneously. This issue was partially solved in Server 2008 R2, which introduced Cluster Shared Volumes (CSVs). I wrote about CSVs in "Introduction to Cluster Shared Volumes," so I'm not going to discuss them in detail here. Basically, CSV enables a single NTFS-formatted LUN to be written to and read from by all nodes in the cluster simultaneously, through some clever behind-the-scenes mechanics. CSVs in Server 2008 R2 were supported only for the storage of Hyper-V VMs running on the Hyper-V hosts in the cluster that contained the CSVs.
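Turning an available clustered disk into a CSV is a one-liner in PowerShell. A sketch, assuming a disk resource named "Cluster Disk 2" (a placeholder name):

# List the physical disk resources known to the cluster
Get-ClusterResource | Where-Object { $_.ResourceType.Name -eq 'Physical Disk' }

# Add one of them to Cluster Shared Volumes; it then appears under
# C:\ClusterStorage on every node in the cluster
Add-ClusterSharedVolume -Name 'Cluster Disk 2'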

Server 2012 expands the use of CSV to a new type of cluster file server, namely the new SMB Scale-Out file server. The file server type -- Scale-Out or Traditional (i.e., the existing file server model) -- is selected at the time of creation. When you create a new file server of the Scale-Out type, you must create the shares on folders that are stored on CSV volumes. In Server 2012, NTFS volumes that have been CSV-enabled show as file system type CSVFS instead of NTFS. In reality, the file system is still NTFS, but the change in file-system labeling makes it easy to distinguish between volumes on disks that are CSV-enabled (i.e., CSVFS) and those that are not (i.e., NTFS). Remember that a CSV is available to all nodes in the cluster simultaneously, so a share created this way can be offered by all the nodes in the cluster at the same time, and all the nodes can get to the content. When creating a Scale-Out file server, you don't need to specify an IP address. The IP addresses for the interfaces that are configured for client access on the cluster nodes are used; all nodes offer the service.
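To stand up a Scale-Out file server with PowerShell, you add the role and then create the share on a folder that lives on a CSV. A sketch using the WIN8FSSCOUT and DataCSV names from my lab output; the folder path and the account granted access are placeholders, and the folder must already exist:

# Create the Scale-Out file server role; its name is the access point clients connect to
Add-ClusterScaleOutFileServerRole -Name WIN8FSSCOUT

# Create a continuously available share on a CSV path, scoped to the Scale-Out name
New-SmbShare -Name DataCSV -Path 'C:\ClusterStorage\Volume1\Shares\DataCSV' `
    -ScopeName WIN8FSSCOUT -FullAccess 'CONTOSO\HyperVHosts' -ContinuouslyAvailable $true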

Another great feature is the ability to use SMB Transparent Failover to move a client from one node that offers a Scale-Out file server to another node, without any access interruption. Suppose, for example, that you want to place a node in maintenance mode. The following command moves a specific SMB client from one node to another; you can easily use PowerShell to run it for every client that uses a specific node in the cluster, as the sketch after this walkthrough shows.

First, I determine which server an SMB client is using (we used this command previously):

PS C:\> get-smbwitnessclient | select clientname, fileservernodename, witnessnodename

clientname  fileservernodename witnessnodename
----------  ------------------ ---------------
savdalwks08 WIN8FC01           WIN8FC02

Now, I move that client to my other server:

PS C:\> Move-SmbWitnessClient -ClientName savdalwks08 -DestinationNode Win8FC02 

To verify that the move happened, I rerun my command. I see that the client has moved to the other node in my cluster, and the witness is now my original server. (The file server and the witness can't be the same server; that wouldn't be useful!)

PS C:\> get-smbwitnessclient | select clientname, fileservernodename, witnessnodename

clientname  fileservernodename witnessnodename
----------  ------------------ ---------------
savdalwks08 WIN8FC02           WIN8FC01
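To drain every SMB client off a node before maintenance, as I mentioned earlier, you can wrap the same cmdlet in a loop. A sketch using my lab's node names:

# Move every SMB client currently served by WIN8FC01 over to WIN8FC02
Get-SmbWitnessClient |
    Where-Object { $_.FileServerNodeName -eq 'WIN8FC01' } |
    ForEach-Object {
        Move-SmbWitnessClient -ClientName $_.ClientName -DestinationNode WIN8FC02 -Confirm:$false
    }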

What does all this mean? Refer again to Figure 2. You can now create that single big LUN that you wanted, with one NTFS volume that all four nodes share simultaneously. (Microsoft supports as many as eight nodes offering one SMB Scale-Out file server.) This capability simplifies management, eliminating the need to associate numerous separate LUNs, shares, and IP addresses with each file server. So why does the traditional file server type still exist? Why would you ever use it?

As I mentioned previously, CSV performs some clever mechanics to enable one NTFS volume to be written to and read from by all nodes in the cluster simultaneously. One of the cleverest parts is handling metadata writes to NTFS volumes, which is the biggest problem with multiple computers concurrently using one NTFS volume: two servers writing metadata at the same time is likely to cause corruption. CSV solves this problem by having a coordinator node for each CSV disk. This node mounts the disk locally and performs all metadata activity on behalf of the other cluster nodes, which send metadata writes over the cluster network to the coordinator. (These other nodes can still directly access the disk for standard data I/O.) This metadata redirection over the network can cause latency in operations. That's why the SMB Scale-Out file server is targeted at key application server workloads such as SQL Server and Hyper-V, which are very light on metadata activity and focus on data I/O. Contrast that with the I/O characteristics of a typical information worker using Microsoft Office documents, which are typically 60 to 70 percent metadata operations -- a lot of data being redirected. I'm not saying that an SMB Scale-Out file server in such a scenario won't work or will perform badly if architected correctly, but it's certainly something to consider. At this time, the Scale-Out file server is recommended only for server applications such as SQL Server and Hyper-V.
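You can see which node is currently the coordinator (the owner) for each CSV, and move that role if you want to rebalance, with a couple of cmdlets. A minimal sketch with placeholder names:

# Show each CSV and the node that currently coordinates its metadata
Get-ClusterSharedVolume | Select-Object Name, OwnerNode

# Optionally move coordination of a CSV to another node
Move-ClusterSharedVolume -Name 'Cluster Disk 2' -Node WIN8FC02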

There is another reason that the Scale-Out file server is unsuitable for storing Office documents and other user data. The Windows file server platform is used in many situations because of features such as quotas, file screening, file classification, BranchCache, and (in Server 2012) data de-duplication. None of these features are available on a Scale-Out file server. Server applications don't care about such features.

Closing Thoughts

When you combine the Scale-Out file server with the SMB Transparent Failover feature (which works for traditional and SMB Scale-Out file servers), you get a file services platform that allows multiple servers to serve the same share with the same content. The result is great scalability for clients and a resiliency that was previously impossible. Although Scale-Out focuses mainly on SQL Server and Hyper-V workloads, expect more types of workloads to be tested and recommended over time, offering customers many new options in their storage and overall IT architectures.


Sidebar: What About Performance?

I've talked about how the changes that SMB Transparent Failover makes could introduce a slight performance penalty, because it bypasses the write cache and increases I/O for metadata-heavy operations. This penalty might sound fairly off-putting. But in reality, many key server applications that would benefit from this technology, such as Microsoft SQL Server and Hyper-V, specify FILE_FLAG_WRITE_THROUGH to bypass the write cache anyway. Also, such applications perform very few metadata operations. Rather, they read and write the data of the file, so they won't be much affected by the disk-based Resume Key Database. These changes are more likely to have an effect on user workloads, such as opening Microsoft Office documents. Such workloads aren't the focus of this feature.
