Enabling the Next-Generation File Server with SMB 3.0

Every organization uses Server Message Block (SMB) in some form to access storage. It might be to access logon scripts, to access and use software-installation media, or for users to access their documents and MP3 collections. But what SMB hasn't been used as is a file-level protocol (in which the client doesn't directly access the disk blocks but instead is served files) for enterprise applications to access remote storage. When it comes to communicating with storage for enterprise workloads, block-level technologies (in which the server can communicate directly with disk blocks) such as iSCSI and Fibre Channel have been top of the list, with the file-level NFS protocol sometimes used for non-Windows workloads.

Consider the difference between a user accessing a document on a file share and an enterprise application storing its database on a file share. For a user editing a Microsoft PowerPoint document from an SMB share, portions of the document are cached locally, and occasionally the user clicks Save. If the SMB file server experiences a problem such as rebooting, or if it's clustered and the file share moves to another node in the cluster, the user loses the handle and lock to the file, but without any real impact. The next time the user clicks Save, everything is re-established and no harm is done. Now consider Hyper-V storing a virtual machine (VM) on an SMB file share that experiences a problem. The file share moves to another node in the cluster. First, the Hyper-V server waits for the TCP timeout before realizing that the original connection is gone. This could mean a 30-second pause for the VM. But Hyper-V has also now lost its handles and locks on the virtual hard disk (VHD), which is a major problem. Whereas user documents might be used for a few hours, enterprise services such as a VM or database expect handles on files to be available for months without interruption.

Fortunately, SMB 3.0 addresses this issue, and many more. For Windows Server 2012, Microsoft wanted to make SMB a file-level storage protocol that could be used for crucial enterprise workloads such as Microsoft Hyper-V and SQL Server. To make this shift, some major changes to the SMB protocol were required.

Enabling Transparent Failover

If SMB is being used to house enterprise data such as VMs and SQL Server databases, then it's unlikely to be used on a standalone file server. Rather, it will be part of a cluster, to provide high availability. For a clustered file service, a single cluster node typically mounts the LUN that contains the shared file system and offers the share to SMB clients. If that node fails, then another node in the cluster mounts the LUN and offers the file share. However, the SMB client then loses its handles and locks.

SMB Transparent Failover provides protection from a node failure. It does so by enabling a share to move between nodes in a manner that is completely transparent to the SMB clients, maintaining any locks and handles that exist as well as maintaining the state of the SMB connection.

The state of the SMB connection is maintained over three entities: the SMB client, the SMB server, and the disk that holds the data. SMB Transparent Failover ensures that enough context exists to bring the SMB connection state back to an alternate node if a node fails, allowing SMB activities to continue without the risk of error.

However, even with SMB Transparent Failover, there can still be a pause to I/O. The LUN must be mounted on a new node in the cluster. But the Failover Clustering team has done a huge amount of work around optimizing the dismount and mount of a LUN to ensure that it never takes more than 25 seconds. That sounds like a long time, but it's the absolute worst-case scenario, involving large numbers of LUNs and tens of thousands of handles. For most common scenarios, the time would be only a couple of seconds. And enterprise services such as Hyper-V and SQL Server can handle an I/O pause of 25 seconds without error.

Another possible cause of interruption to I/O is the SMB client noticing that the SMB server is unavailable. In a typical planned scenario (e.g., a node rebooting because it's being patched), the server notifies clients, which can then take the appropriate actions. But if a node crashes, there is no client notification. Rather, the client sits and waits for TCP timeout before taking action to re-establish connectivity—a waste of resources. Although an SMB client might have no idea that the node it's talking to in the cluster has crashed, the other nodes in the cluster know within a second, thanks to the various IsAlive messages that are sent between nodes.

This knowledge is leveraged by the new Witness Service, available in Windows Server 2012. The Witness Service essentially allows another node in the cluster to act as a witness for the SMB client. If the node that the client is talking to fails, the witness node notifies the SMB client, allowing the client to connect to another node and minimizing the service interruption to a couple of seconds. The conversation looks something like the following (but in 1s and 0s and with less personality):

     SMB Client to Server A: "I want to establish a connection, Server A."
     Server A: "The connection is established. Also, I am part of a cluster.
     Servers B, C, and D are also in the cluster."
     SMB Client to Server B: "Server B, I have established an SMB connection to Server A.
     Can you watch Server A and notify me if it fails?"
     Server B: "Yes. Have a nice day."
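
To make that exchange concrete, here is a minimal Python sketch of the idea. It is a toy model, not the real Witness protocol: the ClusterNode and SmbClient classes, their method names, and the way a witness and a surviving node are chosen are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class ClusterNode:
    """Hypothetical stand-in for one node of a file server cluster."""
    name: str
    alive: bool = True
    # Client callbacks, keyed by the name of the node being watched.
    watchers: Dict[str, List[Callable[[str], None]]] = field(default_factory=dict)

    def register_witness(self, watched: str, notify: Callable[[str], None]) -> None:
        """A client asks this node to watch `watched` on its behalf."""
        self.watchers.setdefault(watched, []).append(notify)

    def on_node_failed(self, failed: str) -> None:
        """Called when the cluster's own health checks report `failed` is down."""
        for notify in self.watchers.pop(failed, []):
            notify(failed)


class SmbClient:
    """Hypothetical client that reconnects as soon as its witness calls back."""

    def __init__(self, cluster: Dict[str, ClusterNode]) -> None:
        self.cluster = cluster
        self.connected_to: Optional[str] = None

    def connect(self, server: str) -> None:
        self.connected_to = server
        # Assumption: pick the first other live node as the witness.
        witness = next(n for n in self.cluster.values()
                       if n.name != server and n.alive)
        witness.register_witness(server, self._on_failure)

    def _on_failure(self, failed: str) -> None:
        # Reconnect right away instead of waiting for a TCP timeout.
        survivor = next(n for n in self.cluster.values()
                        if n.alive and n.name != failed)
        self.connect(survivor.name)


cluster = {name: ClusterNode(name) for name in "ABCD"}
client = SmbClient(cluster)
client.connect("A")            # Server B becomes the witness for Server A

cluster["A"].alive = False     # Server A crashes without telling the client
for node in cluster.values():  # the surviving nodes learn of the failure
    if node.alive:
        node.on_node_failed("A")

print(client.connected_to)     # the client has moved to a surviving node
```

The point of the sketch is the direction of the notification: the client never has to discover the failure itself, because a node that already knows about it makes the call.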

The good news is that you don't need to do anything to take advantage of SMB Transparent Failover or the Witness Service. When you create a new share on a Windows Server 2012 failover cluster, SMB Transparent Failover is enabled automatically. A wizard guides the process of creating a new share in a Windows Server 2012 file server cluster. The first decision is which type of share you are creating. The answer simply helps to set some default options for the file share, as shown in Figure 1.

Figure 1: Creating Supported Share Types

But for all SMB Share types, the Enable continuous availability setting is enabled, as shown in Figure 2.

SMB Active/Active Configuration

Earlier, I discussed the necessity of a brief I/O pause as the shared LUN is moved between nodes. You might be familiar with this as a challenge for Windows Server 2008 Hyper-V when moving VMs between nodes. The problem stems from the fact that NTFS is a shared-nothing file system and can't be accessed concurrently by multiple OS instances without the risk of corruption. This problem was solved with the introduction of Cluster Shared Volumes (CSV) support in Windows Server 2008 R2. CSV allows all nodes in a cluster to read and write to a set of LUNs simultaneously, using some clever techniques and removing the need to dismount and mount LUNs between nodes.

Windows Server 2012 extends the use of CSVs to a specific type of file server, namely the new Scale-Out File Server. This new option is targeted for use only when sharing application data such as SQL Server databases and Hyper-V VMs. The traditional style of a general-use file server is still available for non-application data, as shown in Figure 3.

Figure 3: Creating a Scale-Out File Server on a CSV

When you choose the option to create a Scale-Out File Server, you must also choose a CSV to use as storage when shares are subsequently created within the file server. Because this storage is available to all nodes in the cluster, all those nodes also host the file share. Therefore, SMB client connections are distributed over all the nodes instead of just one. If a node fails, no work is involved in moving the LUNs, offering an even better experience and reducing interruption in operations to almost zero. This reduction is crucial for the application-server workloads at which this Scale-Out File Server is targeted.

The use of Scale-Out File Servers offers an additional benefit. Typically, when a general-use File Server is created, you must give the new cluster file server a NetBIOS name and unique IP address as part of the configuration. That IP address must be hosted by whichever cluster node is currently hosting the file server. With Scale-Out File Servers, all nodes in the cluster offer the file service. Therefore, no additional IP addresses are required. Instead, the IP addresses of the nodes in the cluster are used via the configured Distributed Network Name (DNN).

I should point out that although all nodes in the cluster offer the same file service (and therefore the same shares) with the Scale-Out File Server, any single SMB client will connect to only one node at any one time. Essentially, when the SMB client initiates connections, it initially receives a list of all the IP addresses for the hosts in the cluster. The client picks one with which to initiate the SMB session and then uses only that node, unless the node experiences a problem. If that happens, the client moves to an alternate node. When the Witness Service is in use, the client is told about the failure rather than having to discover it by waiting for a timeout.

Protecting Against Connection Failure: SMB Multichannel

SMB Transparent Failover and SMB active/active configuration are great technologies that help protect against interruptions caused by a node failure. But there are other types of failure, such as a connection failure. With block-level storage, you can counteract this type of issue by using technologies such as Microsoft Multipath I/O (MPIO), which provides multiple paths from server to storage. For SMB, version 3.0 introduces SMB Multichannel, which allows an SMB client to establish multiple connections for a single session, providing protection from a single connection failure and boosting performance.

Like most SMB 3.0 features, SMB Multichannel happens automatically. After the initial SMB connection is established, the SMB client looks for additional paths to the SMB server. If multiple network connections are present, those additional paths are used. The use of SMB Multichannel is apparent when monitoring a file copy operation, because only one connection's worth of bandwidth is used initially but doubles as the second connection is established, continues to increase with the third connection, and so on. If a connection fails, other connections continue the SMB channel without interruption.
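
The bandwidth behavior described here can be sketched in a few lines of Python. This is an idealized model under stated assumptions: the channel names and the 1Gbps-per-channel figure are made up, and real throughput will not scale perfectly linearly.

```python
from typing import Dict, FrozenSet, List


def aggregate_bandwidth_gbps(channels: Dict[str, float],
                             failed: FrozenSet[str] = frozenset()) -> float:
    """Sum the bandwidth of every channel that is still healthy."""
    return sum(bw for name, bw in channels.items() if name not in failed)


def session_alive(channels: Dict[str, float],
                  failed: FrozenSet[str] = frozenset()) -> bool:
    """The session continues as long as at least one channel remains."""
    return any(name not in failed for name in channels)


# Hypothetical client with three 1Gbps network connections to the server.
channels: Dict[str, float] = {}
throughput: List[float] = []
for nic in ("nic1", "nic2", "nic3"):
    channels[nic] = 1.0  # another path is discovered and added to the session
    throughput.append(aggregate_bandwidth_gbps(channels))

print(throughput)  # bandwidth grows as each connection is established

# One connection fails: the session carries on over the remaining channels.
lost = frozenset({"nic2"})
after_failure = aggregate_bandwidth_gbps(channels, lost)
print(after_failure, session_alive(channels, lost))
```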

To determine whether SMB Multichannel is in effect on a server, use the Get-SmbConnection Windows PowerShell cmdlet, which shows the SMB connections to an SMB share. In the output that Figure 4 shows, I can see that I have only one connection to my server.

Figure 4: Listing All the Current SMB Connections

This output indicates that there is only one usable path between the SMB client and the SMB server. If I run the Get-SmbMultichannelConnection cmdlet from the client, the output shows all the possible paths over which the server can accept connections, as shown in Figure 5.

Figure 5: Identifying Possible Paths for SMB Multichannel

However, this list is generated by a "lazy" check and does not mean that a path can actually be created between the client and server IP addresses 10.1.3.1 and 10.1.2.1.

To confirm which path is actually being used between the client and the server, I can look at the TCP connections to remote port 445, which is used for SMB. This confirms that I am using the one path that can be used: remote address 192.168.1.30, as Figure 6 shows.

Figure 6: Finding Actual Connections Used for SMB

A common question, if your SMB client connects to an SMB share that is hosted on an active/active cluster, is whether those multiple connections occur to different nodes in the cluster. The answer is no. The SMB client receives a single IP address for one node in the cluster, and all connections are to that node. All SMB sessions for that cluster from one SMB client will always go to the same node in the cluster. Remember, this isn't a problem because a highly available cluster typically has hundreds if not thousands of connecting SMB clients. The load will be distributed fairly evenly throughout the cluster.
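
A quick simulation shows why this evens out. The sketch below assumes that each client independently picks one node at random, which is a simplification on my part; the node names, client count, and seed are likewise hypothetical.

```python
import random
from collections import Counter
from typing import List


def distribute_clients(nodes: List[str], client_count: int, seed: int = 0) -> Counter:
    """Each client picks exactly one node and sends all of its sessions there."""
    rng = random.Random(seed)  # seeded so that repeated runs give the same result
    return Counter(rng.choice(nodes) for _ in range(client_count))


nodes = ["node1", "node2", "node3", "node4"]
load = distribute_clients(nodes, 1000)

for node in nodes:
    print(node, load[node])  # each node ends up with roughly a quarter of the clients
```

With only a handful of clients the split can be lopsided, which is why the argument depends on the cluster serving hundreds or thousands of them.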

Maximizing Bandwidth: Receive Side Scaling and Remote Direct Memory Access

The final aspect of SMB 3.0 that I want to focus on relates to the larger network-connection pipes in today's data center. Many data centers have shifted from 1Gbps to 10Gbps. But as data centers adopt 10Gbps, the processor in the server becomes a performance bottleneck. A single TCP connection can be processed by only one processor core, which can't handle 10Gbps and typically restricts the bandwidth. This is where Receive Side Scaling (RSS) comes into play. With RSS, a single network interface is split into multiple receiving connections, each of which can be serviced by a separate processing core. Therefore, the full bandwidth can be utilized. Most modern server network adapters automatically support RSS. To determine whether your hardware supports RSS, run the Get-SmbMultichannelConnection cmdlet, as shown in Figure 7.

Figure 7: Viewing the SMB Multichannel Configuration

Note that this output shows the number "4" for both CurrentChannels and MaxChannels. This is the default for Windows Server 2012 when leveraging RSS-capable network cards.

If you then look at the SMB connections from the server, which Figure 8 shows, you'll see that four separate connections are established for the IP address that SMB uses, confirming that RSS is in action.

Figure 8: Identifying Current SMB Client Connections on a Server

You might wonder why an RSS-capable network interface is split into four connections by default. (You can confirm this default by using the Get-SmbClientConfiguration PowerShell cmdlet to look at the SMB configuration. The first line of the output shows the connection count per RSS network interface.) You can change this value, but the number wasn't picked randomly. Microsoft did extensive testing on 10Gbps connections and found that four connections produce the most gain; more than four connections bring diminishing returns. However, if you have connections larger than 10Gbps, then increasing this value might benefit you.
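
To see why several connections matter for RSS, consider this rough Python sketch. It assumes that the adapter hashes each connection's addresses and ports to choose a core. The CRC32 hash, the core count, and the addresses are stand-ins chosen for illustration and are not what a real adapter uses.

```python
import zlib
from typing import Iterable, Set, Tuple

# (source IP, source port, destination IP, destination port)
Flow = Tuple[str, int, str, int]


def core_for_flow(flow: Flow, core_count: int) -> int:
    """Map a connection to one core; the same connection always lands on the same core."""
    key = "|".join(str(part) for part in flow).encode()
    return zlib.crc32(key) % core_count  # stand-in hash, not the adapter's real one


def cores_used(flows: Iterable[Flow], core_count: int) -> Set[int]:
    """Return the set of cores that end up servicing the given connections."""
    return {core_for_flow(flow, core_count) for flow in flows}


# Hypothetical addresses; port 445 is the SMB port.
single = [("10.0.0.5", 50000, "10.0.0.9", 445)]
four = [("10.0.0.5", 50000 + i, "10.0.0.9", 445) for i in range(4)]

print(len(cores_used(single, 8)))  # one connection can occupy only one core
print(len(cores_used(four, 8)))    # four connections can spread over up to four cores
```

A single connection is pinned to one core no matter how many cores the server has, so opening several connections per interface is what gives RSS something to spread.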

Remote Direct Memory Access (RDMA) is another technology that brings high throughput performance and minimizes server load. Network adapters that support RDMA can bypass most of the network stack to communicate directly, avoiding load on the host servers. The Get-SmbMultichannelConnection cmdlet that I referred to earlier will show whether the network adapter supports RDMA. During the initial SMB connection initialization, a check is performed to determine whether both ends of the connection support RDMA. If they do, the connection switches to RDMA. Again, no manual setup is required.

A Powerful Solution

SMB 3.0 is used only between OSs that support SMB 3.0, namely Windows Server 2012 and Windows 8. For other OSs, a negotiation is performed and the highest common version of SMB supported is used. For example, if a Windows 7 machine connects to a Windows Server 2012 file server, then SMB 2.1 is used because that's the highest version that Windows 7 supports.
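
The "highest common version" rule is simple enough to sketch in Python. The table below is a simplification: the Windows 7, Windows 8, and Windows Server 2012 entries come from the paragraph above, the Windows Vista entry is my own addition from general knowledge, and the function name and data layout are hypothetical.

```python
from typing import Dict, List

# SMB versions from oldest to newest.
DIALECT_ORDER: List[str] = ["1.0", "2.0", "2.1", "3.0"]

# Highest SMB version each OS supports.
MAX_DIALECT: Dict[str, str] = {
    "Windows Vista": "2.0",  # not from this article; added for illustration
    "Windows 7": "2.1",
    "Windows 8": "3.0",
    "Windows Server 2012": "3.0",
}


def negotiate(client_os: str, server_os: str) -> str:
    """Return the highest SMB version that both ends support."""
    client_max = DIALECT_ORDER.index(MAX_DIALECT[client_os])
    server_max = DIALECT_ORDER.index(MAX_DIALECT[server_os])
    # Each OS supports every version up to its maximum, so the highest
    # common version is the lower of the two maximums.
    return DIALECT_ORDER[min(client_max, server_max)]


print(negotiate("Windows 7", "Windows Server 2012"))  # 2.1
print(negotiate("Windows 8", "Windows Server 2012"))  # 3.0
```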

The primary driver for most of the changes in SMB 3.0 was the desire to make SMB an enterprise-application protocol. That is certainly where you'll see the biggest benefit of SMB 3.0. But there are still benefits for regular clients, such as Windows 8 clients. For example, the new SMB encryption capability protects data in transit without the need for a complicated public key infrastructure (PKI). SMB 3.0, along with many other Windows Server 2012 storage changes, puts the new OS on the map as a powerful storage solution and gives customers even more choice.
