Month: September 2015

Veeam + Nutanix: “Active snapshots limit reached for datastore”

Last night I ran into an interesting “quirk” using Veeam v8 to back up my virtual machines that live on a Nutanix cluster.  We’d just moved the majority of our production workload over to the new Nutanix hardware this past weekend and last night marked the first round of backups using Veeam on it.

We ended up deploying a new Veeam backup server and proxy set on the Nutanix cluster in parallel to our existing environment.  When multiple jobs were running concurrently overnight, many of them sat in a “0% completion” state, and the individual VMs that make up the jobs had a “Resource not ready: Active snapshots limit reached for datastore” message on them.

[Screenshot: Veeam jobs showing the “Resource not ready: Active snapshots limit reached for datastore” warning]

I turned to the all-knowing Google and happened across a Veeam forum post that sounded very similar to the issue I was experiencing.  Since the forum post in question referenced Veeam v7, I decided to open a ticket with Veeam support, and the support engineer confirmed that there is indeed a self-imposed limit of 4 active snapshots per datastore – a “protection method” of sorts to avoid filling up a datastore.  On our previous platform, the VMs were spread across 10+ volumes and we never ran into this issue.  However, our Nutanix cluster is configured with a single storage pool and a single container with all VMs living on it, so we hit that limit quickly with concurrent backup jobs.
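
To make the math concrete, here’s a toy sketch in Python – emphatically not Veeam’s actual code, and the container name is just a placeholder – of the throttling behavior the support engineer described: a VM’s snapshot only starts if its target datastore currently has fewer than four active snapshots; otherwise the job sits in the “Resource not ready” state.

# Toy illustration of a per-datastore active-snapshot throttle.
# This is NOT Veeam's implementation, just a sketch of why a single
# large container hits the default limit of 4 so much faster than
# a dozen smaller volumes would.
from collections import defaultdict

MAX_ACTIVE = 4                     # default self-imposed limit per datastore
active = defaultdict(int)          # datastore name -> active snapshot count

def try_start_snapshot(datastore):
    """Return True if the snapshot may start, False = 'Resource not ready'."""
    if active[datastore] >= MAX_ACTIVE:
        return False
    active[datastore] += 1
    return True

# Eight concurrent VM snapshots, all landing on the same container:
for vm in range(1, 9):
    ok = try_start_snapshot("nutanix-container-1")
    print("VM %d: %s" % (vm, "snapshot started" if ok else "Resource not ready"))

Spread those same eight snapshots across 10+ volumes and no single datastore ever reaches the limit; put them all on one container and everything past the fourth VM queues up.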

The default limit of 4 active snapshots per datastore can be modified by creating a registry DWORD value called MaxSnapshotsPerDatastore in ‘HKEY_LOCAL_MACHINE\SOFTWARE\Veeam\Veeam Backup and Replication\’ and setting it to the appropriate hex or decimal value.  I started off with ‘20’ but will move it up or down as necessary.  We have plenty of capacity at this time and I’m not worried at all about filling up the storage container.  However, caveat emptor here, because that is still a possibility.
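
If you’d rather script the change than click through regedit, here’s a minimal sketch using Python’s standard winreg module.  The key path and value name are the ones Veeam support provided above; the value of 20 is simply my starting point, and a restart of the Veeam backup services may be needed before it takes effect.  Run it as Administrator on the Veeam backup server.

# Create/set the MaxSnapshotsPerDatastore DWORD on the Veeam backup server.
# Equivalent to adding the value by hand in regedit.
import winreg

KEY_PATH = r"SOFTWARE\Veeam\Veeam Backup and Replication"

# KEY_WOW64_64KEY avoids registry redirection if this runs under 32-bit Python.
with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0,
                        winreg.KEY_SET_VALUE | winreg.KEY_WOW64_64KEY) as key:
    winreg.SetValueEx(key, "MaxSnapshotsPerDatastore", 0,
                      winreg.REG_DWORD, 20)  # decimal 20 = hex 0x14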

This “issue” isn’t anything specific to Nutanix at all; it just becomes increasingly likely on any platform with a scale-out file system that can store hundreds or thousands of virtual machines on a single container.


NetScaler 10.5 – fsck commands have changed

Just a heads up in case anyone else runs into this: the process for verifying the file system integrity of a NetScaler appliance has apparently changed in the 10.5 firmware, but Citrix’s documentation does not (yet) reflect this.

We had a “surprise network outage” over the weekend which took down iSCSI connectivity, and with it, my NetScaler VPX HA pair.


Shortly after everything came back up, we began receiving alerts from NetScaler Command Center that “hardDiskDriveErrors” were seen on the system.

[Screenshot: NetScaler Command Center “hardDiskDriveErrors” alert]

Alright, fine, let’s verify file system integrity and see what there is to see.  Citrix article CTX122845 outlines the process…or so I thought.  I’d never had to do this before, so I had no past experience to compare against.

In Step 5, you are told to press Enter after booting into single user mode, at which point the command prompt of the appliance is supposed to change to “\u@\h$” – but it doesn’t; it only shows “\u@”.  Then, in Step 6, you are directed to enter the “/sbin/fsck /dev/######” command.  However, that command didn’t work either.

I opened a ticket with Citrix support to rule out a PEBKAC error (quite likely if I’m involved), and they confirmed that these commands indeed do not work – they’ve actually changed in the NS 10.5 firmware.

In this screenshot, we see the error returned when attempting to enter the command according to Step 6 in CTX122845:

[Screenshot: error returned by the Step 6 “/sbin/fsck /dev/######” command from CTX122845]

Here is an example of the correct command, “fsck -t ufs /dev/######” (the leading “\u@” in the screenshot is just the prompt):

[Screenshot: the corrected “fsck -t ufs” command on the NS 10.5 shell]

Also, be sure to check what your actual device names are according to CTX121853; they vary by model and, in the case of the VPX, differ from the example used in CTX122845.

Hopefully Citrix will update their documentation soon, but if not, this might be your workaround.

A tale of two firmware upgrades…

On this fine Friday afternoon, I thought I’d have a little fun comparing and contrasting the firmware upgrade process on two different storage solutions. We recently bought some Nutanix 8035 nodes to replace the existing storage platform. While I wouldn’t necessarily call Nutanix “just” a storage platform, the topic of this discussion will be the storage side of the house. For the sake of anonymity, we’ll call our existing storage platform the “CME XNV 0035”.

One of the biggest factors in choosing the Nutanix platform for our new compute and storage solution was “ease of use”. And there’s a reason for that – the amount of administrative effort required for the care and feeding of the “CME XNV 0035” was far too high, in my opinion. Even a “simple” firmware upgrade took days or weeks of pre-planning and scheduling, plus 8 hours to complete during our maintenance window. Now that I’ve been through a firmware upgrade on our Nutanix platform, I thought a compare and contrast was in order.

First, let me take you through the firmware upgrade on the “CME XNV 0035”:

  1. Reach out to CME Support, open a ticket, and request a firmware update. They might have reached out proactively if a major bug/stability issue was found. (Their product support structure is methodical and thorough, I will give them that.) – 30 minutes
  2. An upgrade engineer was scheduled to do the “pre upgrade health check” at a later date.
  3. The “pre upgrade health check” occurs, and logs and support data are gathered for later analysis. Eventually this happened frequently enough that I’d just go ahead and gather and upload the data myself and attach it to the ticket. – 1 hour
  4. A few hours to a few days later, we’d get the green light from the support analysis that we were “go for upgrade”. In the meantime, the actual upgrade was scheduled with an upgrade engineer for a later date during our maintenance window…typically a week or so after the “pre upgrade health check” happened.
  5. Day of the upgrade – hop on a Webex with the upgrade engineer and begin the upgrade process.  Logs were gathered again and reviewed.  This was a “unified” XNV 0035, though we weren’t using the file side…I’m…not sure why file was even bought at all, but I digress…which meant we still had to upgrade the data movers and THEN move on to the block side.  One storage processor was upgraded and rebooted (about an hour), then the other storage processor was upgraded and rebooted (another hour).  Support logs were gathered again and reviewed by the upgrade engineer, and as long as there were no outstanding issues, the “green light” was given. – 6-8 hours

Whew…7.5 – 9.5 hours of my life down the drain…


Now, let’s review the firmware upgrade process on the Nutanix cluster:

  1. Log into Prism and click “Upgrade Software” – 10 seconds
  2. Click “Download” if it hasn’t done so automatically – 1 minute (longer if you’re still on dial-up)
  3. Click “Upgrade”, then click the “Yes, I really really do want to upgrade” button (I paraphrase) – 5 seconds
  4. Play “2048”, drink a beer or coffee, etc. – 30 minutes
  5. Run a Nutanix Cluster Check (NCC) to confirm everything is healthy
  6. Done

There you have it: 31 minutes and 15 seconds later, you’re running on the latest firmware.  Nutanix touts “One click upgrades”, but I counted four, technically.  I can live with that.

Yes, this post is rather tongue-in-cheek, but it reflects the actual upgrade process for each solution.  Aside from the initial “four clicks”, Nutanix handles everything else for you, and the firmware upgrade occurs completely non-disruptively.
