Nimble Storage NimbleOS3 VAAI XCOPY Testing

By Matt | January 18, 2017

Introduction

Nimble Storage arrays are one of my favourite SAN arrays to work with at the moment. However, until NimbleOS3, which went GA on 31 August 2016, there was no support for the XCOPY VAAI primitive. This meant that when performing actions such as a storage vMotion between datastores on the same array, or cloning a virtual machine, the ESXi host had to read the data up from the array and then write it back down to the same array, creating additional load on the ESXi host and extra traffic on the storage adapters and network.

For arrays that support the XCOPY VAAI primitive, the ESXi host instead sends a command telling the array to move the data, and the array does the heavy lifting. Thankfully NimbleOS3 brings support for the XCOPY VAAI primitive, and I’m lucky enough to have access to an environment currently running NimbleOS2 that will be upgrading to NimbleOS3 shortly, so I took the opportunity to run a few tests.

There is a blog on the Nimble Connect forums here that goes into this in more detail and includes an explanation of pre-XCOPY support (NimbleOS2 and below) and post-XCOPY support (NimbleOS3).

I was also interested to see whether offloading the work to the array gave any speed increase for storage vMotion or virtual machine cloning. My findings are in the summary.

I ran the tests with the array on NimbleOS2 and then repeated them after upgrading the array to NimbleOS3, making no other changes to the configuration. As noted in the Nimble blog, the data mover transfer size can be changed from 4MB to 16MB. I did get time to test with the setting at 16MB, though in fairness the tests I was running were never going to benefit much from that change; I’ve included those results anyway in the table in the summary. All tests were run at least three times, at different times on different days, to get an average.
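The transfer size mentioned above corresponds to the ESXi DataMover.MaxHWTransferSize advanced setting, which is expressed in KB (4096 by default). As a rough illustration only, here is how that setting could be checked, and optionally raised, with pyVmomi; the hostname, credentials and single-host assumption are placeholders for this lab, not the way the change has to be made.

```python
import ssl
from pyVim.connect import SmartConnect
from pyVmomi import vim

# Connect to the host (hostname and credentials are placeholders)
si = SmartConnect(host="esxi-r730.lab.local", user="root", pwd="********",
                  sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

# Grab the (single) ESXi host and its advanced-options manager
host_view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.HostSystem], True)
host = host_view.view[0]
opt_mgr = host.configManager.advancedOption

# The data mover transfer size is reported in KB: 4096 = 4MB, 16384 = 16MB
current = opt_mgr.QueryOptions("DataMover.MaxHWTransferSize")[0]
print("DataMover.MaxHWTransferSize:", current.value, "KB")

# Uncomment to raise it to 16MB, as tested later in this post
# opt_mgr.UpdateOptions(changedValue=[
#     vim.option.OptionValue(key="DataMover.MaxHWTransferSize", value=16384)])
```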

VAAI XCOPY and Windows Powered On Machines

In my testing, I was using a virtual machine running Windows Server 2012R2. After the upgrade to NimbleOS3, I did some testing for storage vMotion and cloning. The storage vMotion was being offloaded successfully, but the VM clone was not. After a puzzling few hours, I came across the following post from Cody Hosterman at Pure Storage.

Note: Cody does mention that both storage vMotion and clone operations are not offloaded for powered-on machines. In my testing, the storage vMotion was being offloaded fine, but the VM clone was not.

VAAI XCOPY not being used with Powered-On Windows VM

In short, by default when cloning a powered-on Windows virtual machine with VMware Tools installed, the clone operation will not be offloaded to the array using VAAI. This is because the advanced option disk.EnableUUID is set to true by default, which allows VMware Tools to quiesce the virtual disks but also prevents VAAI XCOPY from being used. Check out Cody’s post for additional information.
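If you want to see how a particular VM is configured before cloning it powered on, disk.EnableUUID lives in the VM’s extraConfig. A minimal sketch with pyVmomi follows, reusing the `content` connection from the earlier snippet; the VM name is a placeholder.

```python
from pyVmomi import vim

# Find the VM by name (placeholder name for this lab)
vm_view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.VirtualMachine], True)
vm = next(v for v in vm_view.view if v.name == "win2012r2-test")

# The VMX advanced options live in extraConfig as key/value pairs
uuid_opts = [o for o in vm.config.extraConfig if o.key.lower() == "disk.enableuuid"]
print("disk.EnableUUID:", uuid_opts[0].value if uuid_opts else "not set")

# Setting it to FALSE would let a powered-on clone use XCOPY, at the cost of
# application-consistent (quiesced) snapshots. Uncomment only if that trade-off
# is acceptable in your environment:
# spec = vim.vm.ConfigSpec(extraConfig=[
#     vim.option.OptionValue(key="disk.EnableUUID", value="FALSE")])
# vm.ReconfigVM_Task(spec=spec)
```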

When I did the clone test with the array running NimbleOS3, the machine was powered off, allowing VAAI XCOPY to be used.

Environment

  • Nimble Storage CS300 array
  • Dell R730 ESXi host running ESXi 6.0 U2
  • iSCSI running over 10Gb networking
  • 2 dedicated NICs on the R730 (vmnic1 and vmnic5) using the iSCSI software initiator and Nimble Connection Manager
  • Virtual machine running Windows Server 2012R2 with two disks: 70GB for the OS and 110GB for the data drive, both thick provisioned (eager zeroed) on the same LSI Logic SAS controller. The data drive has 100GB of actual data stored on it

Test A (Storage vMotion)

With this test I performed a storage vMotion of the virtual machine from datastore 1 to datastore 2, both hosted on the same CS300 array.
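If you would rather drive the same move from a script than from the Web Client, a storage vMotion is essentially a relocate with a target datastore. A rough pyVmomi sketch follows, reusing `content` from the first snippet; the VM and datastore names are placeholders.

```python
from pyVmomi import vim

def find_obj(content, vimtype, name):
    """Return the first managed object of the given type with a matching name."""
    view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
    return next(obj for obj in view.view if obj.name == name)

vm = find_obj(content, vim.VirtualMachine, "win2012r2-test")
target_ds = find_obj(content, vim.Datastore, "nimble-datastore-2")

# A relocate spec that only names a datastore results in a storage vMotion
spec = vim.vm.RelocateSpec(datastore=target_ds)
task = vm.RelocateVM_Task(spec=spec)
```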

NimbleOS2

While the array was running NimbleOS2, XCOPY offload was not available, so we can see the traffic traversing the network adapters on the ESXi host as it reads the data from the array and then writes it back to the same array.

Here’s a screenshot of the traffic on the vmnic physical NICs on the server during the move:

And here’s a screenshot of the array performance dashboard:

NimbleOS3

After the array was upgraded to NimbleOS3, XCOPY offload support is now enabled. When doing the storage vMotion, there is no traffic increase on either the ESXi host or the Nimble array; as you can see below, the data move has been offloaded to the array. Because it was not obvious, I’ve highlighted the timestamp in the screenshots where the storage vMotion was actioned. You’ll notice the low data rates in both the host and storage screenshots.
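Before looking at the screenshots, a quick way to confirm the offload path really is available after the upgrade is to check the host’s hardware-accelerated-move setting and the per-device VAAI flag. This sketch reuses `host` and `opt_mgr` from the first snippet.

```python
from pyVmomi import vim

# 1 means hardware-accelerated moves (XCOPY) are enabled on the host
move_opt = opt_mgr.QueryOptions("DataMover.HardwareAcceleratedMove")[0]
print("DataMover.HardwareAcceleratedMove:", move_opt.value)

# Each Nimble volume should report itself as VAAI ("vStorage") capable
for lun in host.config.storageDevice.scsiLun:
    if isinstance(lun, vim.host.ScsiDisk):
        print(lun.canonicalName, lun.vStorageSupport)
```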

Here’s a screenshot of the traffic on the vmnic physical NICs on the server during the move:

And here’s a screenshot of the array performance dashboard:

Test B (VM Clone)

With this test I performed a clone of the virtual machine from datastore 1 to datastore 2, both hosted on the same CS300 array.
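As with the storage vMotion, the clone can be scripted. The sketch below reuses `content` and the `find_obj` helper from the earlier snippets; the names are placeholders, and the clone is left powered off so XCOPY can be used.

```python
from pyVmomi import vim

vm = find_obj(content, vim.VirtualMachine, "win2012r2-test")
target_ds = find_obj(content, vim.Datastore, "nimble-datastore-2")
dest_folder = vm.parent  # keep the clone alongside the source VM

clone_spec = vim.vm.CloneSpec(
    location=vim.vm.RelocateSpec(datastore=target_ds),
    powerOn=False,
    template=False)

task = vm.CloneVM_Task(folder=dest_folder, name="win2012r2-clone", spec=clone_spec)
```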

NimbleOS2

As a reminder, when the array was running NimbleOS2, XCOPY offload was not available, so we can see the traffic traversing the network adapters on the ESXi host as it reads the data from the array and then writes it back to the same array during the clone operation.

Here’s a screenshot of the traffic on the vmnic physical NICs on the server during the clone:

And here’s a screenshot of the array performance dashboard:

NimbleOS3

Again, after the NimbleOS upgrade XCOPY offload is enabled, so the clone operation is simply an instruction sent from the ESXi host to the array, and the array does the heavy lifting. As before, I’ve highlighted the timestamps in the screenshots that show the duration of the VM clone operation.

Note: Just prior to the highlight below you’ll see a burst of traffic. This was when I tested the VM clone with a powered-on machine. The highlighted section was when the machine was being cloned while powered off, and therefore using VAAI XCOPY.

Here’s a screenshot of the traffic on the vmnic physical NICs on the server during the clone:

And here’s a screenshot of the array performance dashboard:

Summary

As mentioned in the introduction, I performed a number of tests and timed the average time to complete a storage vMotion and a VM clone when running both NimbleOS2 and NimbleOS3. My findings are in the table below:

| Activity | OS Version | Average Time |
| --- | --- | --- |
| Storage vMotion | NimbleOS2 | 11 minutes 41 seconds |
| Storage vMotion (4MB) | NimbleOS3 | 7 minutes 19 seconds |
| Storage vMotion (16MB) | NimbleOS3 | 7 minutes 13 seconds |
| VM Clone | NimbleOS2 | 14 minutes 9 seconds |
| VM Clone (4MB) | NimbleOS3 | 9 minutes 2 seconds |
| VM Clone (16MB) | NimbleOS3 | 9 minutes 12 seconds |

As you can see, storage vMotion operations were on average 4 minutes and 22 seconds faster and VM clones were on average 5 minutes and 7 seconds faster when running NimbleOS3 and making use of VAAI. As well as being faster, there is also the benefit of no CPU overhead on the ESXi host and no additional traffic traversing the storage adapters and storage network.

Overall I think this is a welcome addition to the capabilities of the Nimble fleet and is a good reason on its own to upgrade your arrays from NimbleOS2 to NimbleOS3. There are of course some other features of NimbleOS3 that might be appealing to you if you are still running NimbleOS2.

3 thoughts on “Nimble Storage NimbleOS3 VAAI XCOPY Testing”

  1. Eric Egolf

    Thanks for posting this Matt. This is helpful to understand how to ensure that XCOPY is working on the Nimble. It is interesting that the Nimble WebUI is blind to the IOP increase during the XCOPY; however, you can see a slight increase in latency, which implies the CPU and disks must be taxed equally in both non-XCOPY and XCOPY scenarios. If this is correct, it seems the major benefit is offloading IOPs from the host standpoint, as the XCOPY still uses the SAN’s resources equally. Is this your understanding as well? The decrease in time to clone is nice, but I can’t say that 11 minutes to 7 minutes matters much for us; its value is really about avoiding noisy neighbor situations?

    We have a Nimble CS460 and an AF5000 array and are excited for XCOPY to be released with “deduplication awareness.” The general concept is that the array should be able to implement the XCOPY in record time by just creating some pointers. Right now, even on the AFA, it seems to read all the data and then write it again to NVRAM, putting it through the deduplication engine. Since we are using Citrix MCS with multiple gold images and multiple updates a day, this deduplication-aware XCOPY would be extremely quick and use very little CPU. I am writing this in part to ensure that Nimble puts this high on their priority queue for desired features with the AFA.

    Thanks again for a great post.

    P.S. I notice a huge void in articles on the internet discussing XCOPY. I found Cody’s article and a few others here and there, but not much else. I wonder why that is?

    1. Matt (post author)

      Hi Eric,

      Thanks for reading and thanks for the comment.

      In addition to monitoring the ESXi host network resources in the UI, you can also monitor VAAI stats using esxtop. See Duncan’s post – http://www.yellow-bricks.com/2012/12/20/using-esxtop-to-check-vaai-primitive-stats/

      I believe that currently, internal to the array, it is still moving / copying the physical blocks of data, so there is still an increased workload on the storage when actioning a task that is offloaded using VAAI. The major benefit from my perspective of using VAAI XCOPY is that it offloads the task from the ESXi host and the storage network, meaning that the CPU in my ESXi host is not processing the move / clone, the NICs on the host are not sending / receiving the data, and the storage network is not being used to perform the clone / move.

      In regards to your AFA, are you running NimbleOS 3.x? I wasn’t aware that it didn’t use pointers when dedupe was enabled. We’ve got some customers using other storage solutions that make use of this, and it is cool to see a deployment from template / clone take a second or two to complete 🙂 Have Nimble made any reference as to when this will be included? I’ll reach out to one of the Nimble SEs here in Australia to see if he has any information on this.

      Lastly, I found the same over the last six months or so when looking for XCOPY information – there doesn’t seem to be a huge amount out there about it. Maybe it isn’t an exciting topic, or it’s one of those things that is just expected to work.

      Cheers, Matt.

      1. Eric Egolf

        Matt, we are running 3.4.1, and VMware cloning, which is leveraged by Citrix MCS, went from ~15 minutes on the CS460 to ~5 minutes on the AFA. The performance increase seems to be due to faster NVRAM rather than “dedupe aware” XCOPY. On 3.4.1 Nimble does not do the pointer trickery that something like a Pure Storage does. We are anxiously awaiting this feature, along with post-processing dedupe.
