
A tale of two firmware upgrades…

On this fine Friday afternoon, I thought I’d have a little fun comparing and contrasting the firmware upgrade process on two different storage solutions. We recently bought some Nutanix 8035 nodes to replace the existing storage platform. While I wouldn’t necessarily call Nutanix “just” a storage platform, the topic of this discussion will be the storage side of the house. For the sake of anonymity, we’ll call our existing storage platform the “CME XNV 0035”.

One of the biggest factors in choosing the Nutanix platform for our new compute and storage solution was “ease of use”. And there’s a reason for that – the amount of administrative effort required for the care and feeding of the “CME XNV 0035” was, in my opinion, far too high. Even a “simple” firmware upgrade took days or weeks of pre-planning and scheduling, plus 8 hours to complete during our maintenance window. Now that I’ve been through a firmware upgrade on our Nutanix platform, I thought a compare and contrast was in order.

First, let me take you through the firmware upgrade on the “CME XNV 0035”:

  1. Reach out to CME Support, open a ticket, and request a firmware update. They might have reached out to you proactively if a major bug or stability issue had been found. (Their product support structure is methodical and thorough, I will give them that.) (30 minutes)
  2. An upgrade engineer was scheduled to perform the “pre-upgrade health check” at a later date.
  3. The “pre-upgrade health check” occurs; logs and support data are gathered for later analysis. Eventually this happened often enough that I’d just gather and upload the data on my own and attach it to the ticket. (1 hour)
  4. A few hours to a few days later, we’d get the green light from the support analysis that we were “go for upgrade”. In the meantime, the actual upgrade was scheduled with an upgrade engineer for a later date during our maintenance window…typically a week or so after the “pre-upgrade health check” happened.
  5. Day of the upgrade – hop on a Webex with the upgrade engineer and begin the upgrade process. Logs were gathered again and reviewed. This was a “unified” XNV 0035, though we weren’t using the file side (I’m not sure why file was even bought at all, but I digress), which meant we still had to upgrade the data movers and THEN move on to the block side. One storage processor was upgraded and rebooted (about an hour), and then the other storage processor was upgraded and rebooted (another hour). Support logs were gathered again and reviewed by the upgrade engineer, and as long as there were no outstanding issues, the “green light” was given. (6-8 hours)

Whew… 7.5 to 9.5 hours of my life down the drain…


Now, let’s review the firmware upgrade process on the Nutanix cluster

  1. Log into Prism and click “Upgrade Software” (10 seconds)
  2. Click “Download” if it hasn’t done so automatically (1 minute, longer if you’re still on dial-up)
  3. Click “Upgrade”, then click the “Yes, I really really do want to upgrade” button (I paraphrase) (5 seconds)
  4. Play “2048”, drink a beer or coffee, etc. (30 minutes)
  5. Run a “Nutanix Cluster Check (NCC)” to confirm the cluster is healthy – see the example command just after this list
  6. Done
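
For those curious what step 5 actually involves: the Nutanix Cluster Check can also be kicked off from any Controller VM’s command line instead of Prism. Here’s a minimal sketch assuming SSH access to a CVM – the prompt is illustrative, and these are the standard NCC/cluster utilities:

```
# Run the full Nutanix Cluster Check suite to verify cluster health
# after the upgrade completes.
nutanix@cvm$ ncc health_checks run_all

# Optionally, confirm that all cluster services are up on every CVM.
nutanix@cvm$ cluster status
```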

There you have it, 31 minutes and 15 seconds later, you’re running on the latest firmware.  Nutanix touts “One click upgrades”, but I counted four, technically.  I can live with that.

Yes, this post is rather tongue in cheek, but it is reflective of the actual upgrade process for each solution.  Aside from the initial “four clicks”, Nutanix handles everything else for you and the firmware upgrade occurs completely non-disruptively.



EMC VNX Pool LUNs + VMware vSphere + VAAI = Storage DEATH

**CliffsNotes version – a bug in the VNX OE causes massive storage latency when using vSphere with VAAI enabled; disabling VAAI fixes the issue**

Hello, and welcome to my very first blog post! I’ve owned this domain and WordPress subscription for nearly a year and a half and am finally getting around to posting something on it. Considering I’ve spent the last 3 years focused on end user computing, and the majority of that being done with Citrix products, I always figured my first post would be in that domain…but alas, that was not the case.

The problem…

I recently started a new gig, and one of the first orders of business was untangling some storage and performance issues in a vSphere 5.5 environment running on top of a Gen 1 EMC VNX 5300.  It was reported that certain operations, like a Storage vMotion or deploying a new VM from a template, caused very high storage latency, often resulting in LUNs being disconnected from the hosts.

After a general review of the environment I was able to rule out any glaringly obvious misconfiguration, so I turned to a couple of useful performance monitoring tools – esxtop and Unisphere Analyzer.  While I am by no means an expert with either tool, with a little bit of Google-fu and the assistance of a couple of great blogs (which I’ll link to later in this post), I was able to get the info I needed to verify my theory – a bug involving VAAI that was supposed to be addressed in the latest VNX Operating Environment (which at the time of this posting is 5.32.000.5.215) still exists.

I started out by doing some performance baselining with VisualEsxtop (https://labs.vmware.com/flings/visualesxtop) so I could get a picture of what the hosts were seeing during operations that involve VAAI (Storage vMotion, deploy-from-template/clone, etc.).  As you can see in the screenshot below, the VNX is quite pissed off.  The “DAVG” value represents disk latency (in milliseconds) that is likely storage processor or array related.  The “KAVG” value represents disk latency (in milliseconds) associated with the VMkernel.  Obviously, the latency on either side of the equation is nowhere near a reasonable number.  Duncan Epping has a great overview of esxtop (http://www.yellow-bricks.com/esxtop), and I highly recommend you give it a read if you’re newer to the tool like I am.

[Screenshot 1: VisualEsxtop during a Storage vMotion with VAAI enabled, showing very high DAVG and KAVG latency]
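
If you want to gather the same host-side view yourself, here’s a rough sketch of how I’d capture it – the interactive keystroke and batch-mode flags are standard esxtop usage, but the sample interval, count, and file name below are just examples I picked:

```
# Interactive: launch esxtop on the ESXi host, press 'u' for the disk
# device view, and watch the DAVG/cmd, KAVG/cmd, and GAVG/cmd columns
# while the Storage vMotion runs.
esxtop

# Batch mode: record all counters every 5 seconds for 60 samples (~5 minutes)
# so the latency spike can be reviewed later in VisualEsxtop or perfmon.
esxtop -b -d 5 -n 60 > svmotion-vaai-capture.csv
```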

The next step was to use EMC’s Unisphere Analyzer to get a picture of what was occurring on the storage side during these operations.  If you’re not familiar with Unisphere Analyzer, an EMC employee created a brief video on how to capture and review data with it (http://www.youtube.com/watch?v=yCMZ_N7-p7A) – it’s a relatively simple tool that you can garner a lot of valuable information from.  I used it to capture storage side performance metrics during the two following tests.

[Screenshot 2: Unisphere Analyzer LUN response time graph covering both Storage vMotion tests]

The first test consisted of a Storage vMotion of a VM with VAAI enabled on the host (1 Gb iSCSI to the VNX).  This test moved the VM from LUN_0 to LUN_6, starting at 9:46:44 AM and finishing at 10:01:53 AM.  If you look at the corresponding time period on the Unisphere Analyzer graph you’ll see that response time is through the roof.  While it did not occur during this test, the hosts would often lose their connection to the LUNs during these periods of high latency…not good, obviously.

These warnings always show up in the vSphere Client when this issue occurs (yeah yeah, I’m not using the Web Client for this):

[Screenshot 3: the vSphere Client warnings that appear when the issue occurs]

The second test consisted of a Storage vMotion of the same VM with VAAI disabled.  This test moved the VM back to LUN_0 from LUN_6, starting at 10:04:13 AM and finishing at 10:14:03 AM.  This time, the Unisphere Analyzer data looks MUCH better.

[Screenshot 2, again: the Unisphere Analyzer graph, showing much lower response times during the VAAI-disabled test]

Here is an example of what esxtop looked like during the test with VAAI disabled:

[Screenshot 4: esxtop during the VAAI-disabled Storage vMotion, showing roughly 1,400 read/write IOPS on the LUNs involved with reasonable latency]

The LUNs with ~1,400 read/write IOPS are obviously the ones involved in the Storage vMotion…notice the lack of “SAN choking”.  I re-ran this test multiple times using other LUNs, with identical results…it was obvious at this point that there was still an issue with VAAI on this VNX OE.  Fortunately, our production datacenter uses 10 GbE for the iSCSI network, so Storage vMotions finish in just a minute or two.  I could see this flaw being particularly problematic in larger environments where Storage vMotion is frequent, or in something like VDI where VMs are frequently spun up, torn down, or updated.

The solution…

Obviously, disabling VAAI in vSphere is a guaranteed “workaround”.  I wouldn’t necessarily call it a “fix”, as the VAAI feature becomes unusable, but it will stop the high latency and disconnects when vSphere tries to offload certain storage tasks to the array.  Once I had some hard evidence in hand, I opened a ticket with EMC, and the support engineer was able to confirm this was indeed still a bug and had not been addressed by the latest OE version.

This VMware KB article details the process of disabling VAAI (http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1033665).  I found that only the “DataMover.HardwareAcceleratedMove” parameter from the article had to be disabled.  The EMC support engineer also mentioned they had seen some success increasing the “MaxHWTransferSize” parameter while leaving VAAI enabled, but that it hadn’t worked for everyone.
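
For anyone who prefers the command line to the vSphere Client steps in the KB, here’s a minimal sketch of the relevant esxcli advanced-settings commands, run per host (the MaxHWTransferSize value is purely illustrative – check with EMC support before touching it):

```
# See which devices currently claim VAAI support.
esxcli storage core device vaai status get

# Disable the XCOPY / Full Copy primitive (the one used by Storage vMotion
# and clone offload) by setting DataMover.HardwareAcceleratedMove to 0.
esxcli system settings advanced set --int-value 0 --option /DataMover/HardwareAcceleratedMove

# Verify the change took effect.
esxcli system settings advanced list --option /DataMover/HardwareAcceleratedMove

# The alternative EMC mentioned: leave VAAI enabled but adjust the XCOPY
# transfer size (default is 4096 KB; 16384 here is only an example).
esxcli system settings advanced set --int-value 16384 --option /DataMover/MaxHWTransferSize
```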

You can find more information in this EMC KB article (https://support.emc.com/kb/191685 – you may need an active support account, I had to log in to view the page).  I decided to just disable VAAI and call it a day until a valid bug fix was released in some future OE version.

***Update 11/06/15*** – it has come to my attention that the preceding EMC KB191685 can no longer be accessed at the supplied link.  I searched through the support portal and could not find a replacement, so I don’t know if they pulled the KB documenting this issue entirely or if it’s been merged into another.  I did, however, find a support bulletin from June 2015 saying that the VAAI improvements had been added to the .217 firmware.  At one point I did request the .217 firmware, only to find out they’d pulled it due to some issue it was causing.  I can only assume the VAAI improvements would’ve been added to some subsequent firmware version, but I no longer have my VNXs in production, nor are they under support, so I won’t be able to test personally.

Hopefully this information will be beneficial to someone out there…luckily I found my way through the rabbit hole, but there wasn’t a whole lot publicly available about this issue when I was initially seeking a cause.