vSphere

Oh noes! I’ve lost my vCenter appliance root and/or grub password, halp!

I recently encountered a situation where an issue with a vCenter Server Appliance 6.0 required logging into the shell as the “root” user, but either the password was recorded incorrectly, or the password which was set was typed incorrectly (twice).  Regardless, it was not possible to log in as root, nor was the grub password known (most likely the same password as root when the appliance was initially configured), so we were stuck between a rock and a hard place.

halp

VMware has a KB article that details how to reset the VCSA root password; unfortunately, this requires entering the grub boot loader password to edit the boot file, so it was a bit of a “chicken and egg” scenario.  Luckily, I found a blog post on UnixArena.com that detailed using a Red Hat Enterprise Linux ISO to boot into recovery and gain access to the file that allows you to bypass the grub password, which in turn allowed me to change the root password.  However, once the root password was changed, the grub boot loader was still unprotected by a password, which is no bueno.  With some assistance from VMware Support, I was able to set a new grub boot loader password on the VCSA and all was good with the world again.

This post aggregates information from several different sources, and I’ve added some material of my own to tie the whole process together and make it a little easier to follow.  Thanks to UnixArena.com, VMware Support, and Tecmint.com for the resources.

Now, on to the good stuff…

First, you need to download a Red Hat Enterprise Linux .ISO – you are required to create an account to request an evaluation, which allows you to download the .ISO.  The version I used for this post was RHEL 7.4.

Upload the .ISO to a vSphere datastore and mount it in the CD-ROM drive of your VCSA.  Power down the VCSA, take a snapshot of it, and then edit the “boot options” to “Force BIOS Setup” so that you can enter the VCSA’s BIOS and modify the boot order.

vcsa1

Once you’re in the BIOS, change to the “Boot” tab and use the “+” key to move “CD-ROM Drive” to the top of the boot order list.  Use the “right arrow” key to move to the “Exit” tab and choose “Exit Saving Changes”.  The VCSA should reboot and boot to the RHEL .ISO.

vcsa2vcsa3

Use the “down arrow” to select “Troubleshooting”.

vcsa4

Use the “down arrow” key or “R” to select the “Rescue a Red Hat Enterprise Linux system” line, then press “Enter”.

vcsa5

The next screen will prompt you to mount the file system in “read-write mode” by selecting option 1.

vcsa6

When prompted, press “Enter” (or Return) to get a shell.

vcsa7

Once the shell is loaded, you should see this:

vcsa8

Change to the “/mnt/sysimage/boot” directory (cd /mnt/sysimage/boot) and view the contents (ls -lrt); you should see the “grub” folder.

vcsa9

Change to the “grub” folder (cd grub) and view the contents (ls -lrt); you should see a “menu.lst” file.

vcsa10

This next step is optional – if you’ve taken a snapshot of the VCSA before making any changes (which I hope you did), you could always just roll back – but I like to make a backup of the file I’m about to modify, which in this case is “menu.lst”.  Enter the command “cp menu.lst menu.lst.bak” and a copy of the “menu.lst” file named “menu.lst.bak” will be created, which can be used to recover the original if you make a mistake in the next step.

vcsa11

Use the “vi” editor to modify the “menu.lst” file by entering the command “vi menu.lst”.

The hashed grub password is highlighted below – use the “down arrow” key to move to the line beginning with “password” and type “dd” to remove the line.  Then, enter the command “:wq” to exit and save the file.

vcsa12

Note that the “password” line is removed.

vcsa13
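If you prefer a non-interactive edit, the backup and the removal of the “password” line can also be done in one step with sed – a minimal sketch, assuming the GNU sed included in the RHEL rescue environment:

    cd /mnt/sysimage/boot/grub
    # -i.bak edits menu.lst in place and saves the original as menu.lst.bak
    sed -i.bak '/^password/d' menu.lst

Either way, confirm the “password” line is gone before moving on.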

Exit the shell by entering the commands “cd” and then “exit”.  Be sure to unmount the RHEL .ISO or you will boot back into it.

When the grub boot menu appears, press “space”.  Now that the grub password has been removed, you should see that the instructions to enter “p” to “unlock additional options” are no longer present, and you can proceed to edit mode immediately.

Make sure that the “SUSE Linux Enterprise Server…” line is selected, and press “e”.

vcsa14

On the next screen, select the line beginning with “kernel” and press “e” again to edit the boot command.

vcsa15

Append “init=/bin/bash” to the line below, and then press “Enter”.

vcsa16
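For reference, the edited line should end up looking roughly like the sketch below – the kernel version, root device, and other options are placeholders and will differ on your appliance; the only change you are making is appending “init=/bin/bash” to the end:

    # illustrative only - keep everything already on your kernel line and just add init=/bin/bash
    kernel /vmlinuz-<version> root=<existing root device> <existing options> init=/bin/bash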

With the below line highlighted, press “b” to boot into the shell.

vcsa17

When you get a shell, type “passwd root” to change the root password.

vcsa18

Once you’ve entered a matching password twice, you should get a success message that the password has been changed.  Apparently it thinks the password I used is too simple, but whatever – it’s a lab.

vcsa19

Reboot the VCSA by issuing the “reboot” command or “Power > Reset” VM option.

When you see the boot menu again, you should notice that there is still no grub password set, meaning that anyone who gains access to the console of the VM and can reboot it can change the root password.  Obviously, if someone is crafty enough to mount your RHEL image and go through the process we just followed, they could still remove the grub password and then change root, so it’s important to have “least privileged” role-based access, shield your management network from user-facing subnets, and that sort of thing.

The next portion of this post will focus on putting a grub password back in place.

Once the VCSA has booted, press “Alt + F1” to gain console access, then enter the “root” username  and your recently set root password.

vcsa20

Once you’ve authenticated, enter the command “ssh.get” to check whether SSH is currently enabled.  If the status returned is “True”, skip to the next section.  If the status returned is “False”, enter the command “ssh.set --enabled true” to enable SSH.  Verify that SSH is now enabled by entering the “ssh.get” command again.
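For reference, the exchange in the appliance shell looks roughly like this – a sketch only, as the exact output format can vary slightly between VCSA builds:

    Command> ssh.get
    Enabled: False
    Command> ssh.set --enabled true
    Command> ssh.get
    Enabled: True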

Alternatively, you can use a web browser to access the VCSA’s “VAMI page” by going to https://vcenterhostnameorip:5480, logging in as root, selecting “Access” from the navigation menu, and enabling SSH and the bash shell.  Since I already had the VCSA VM console open, I did it there.

vcsa21

For the next steps we are going to use an SSH client like PuTTY instead of the VCSA VM console so that we can copy and paste more easily, which will help ensure the MD5 hash gets entered correctly.  Connect to your VCSA using SSH and log in using the root account and the newly set password.

Once logged in to the SSH session, enable the shell by entering the commands “shell.set --enabled true” and then “shell”.

vcsa22

At the shell, enter the command “grub-md5-crypt”, and then correctly enter a matching grub password twice.  You will need to copy the MD5 hash to the clipboard, so highlight it and then paste it into Notepad or another text editor for safekeeping.

vcsa23

Next, we need to modify the “menu.lst” file from which we removed the previous hashed password earlier.  Edit the menu.lst file by entering the command “vi /boot/grub/menu.lst”.

Once in the text editor, press the “Insert” key, “down arrow” to the line underneath “timeout”, and press “Enter”.  “Up arrow” once to the newly created blank line, type “password --md5 ” (with a space after --md5), and then paste in the copied MD5 hash.
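When you’re done, the top of the file should look roughly like the sketch below – the default/timeout values are placeholders and the hash shown is obviously not a real one, so paste in the hash you generated with grub-md5-crypt:

    default 0
    timeout 8
    password --md5 $1$XXXXXXXX$XXXXXXXXXXXXXXXXXXXXXX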

vcsa24

Once the hash has been pasted into the “menu.lst” file, press “Esc” to exit “edit mode”, then enter “:wq” to save and quit.  Reboot the system and verify that the grub boot loader is once again password protected.  You should see it prompt for “p” instead of “e” if the menu.lst modifications were successful.  If “e” is still displayed, verify the contents of the menu.lst file are correct and there aren’t any missing characters or anything like that.

Press “p” and enter your new grub password to ensure everything is good to go – if it unlocks the option to edit boot commands, your job is done.  Don’t forget to remove the VM snapshot once you’ve determined your changes are successful.

vcsa25

whew

Resources:

http://www.unixarena.com/2016/04/reset-grub-root-password-vcsa-6-0.html

https://kb.vmware.com/s/article/2069041

https://www.tecmint.com/password-protect-grub-in-linux/


Nutanix Announces Xtract for VMs, Simplifying the Migration to AHV

Today Nutanix announced a new product called “Xtract for VMs”, which is a tool to simplify the migration from other hypervisors (currently ESXi only) to Nutanix AHV.

While several options currently exist for migrating from ESXi to AHV – such as in-place cluster conversions, Cross Hypervisor DR (if both source and destination are Nutanix clusters), or a more manual svMotion/import – Xtract for VMs facilitates a prescriptive and controlled approach for moving workloads from ESXi to AHV.

Xtract for VMs uses a simple “one click” wizard approach to target one or more VMs for migration.  To migrate a VM, a “Migration Plan” is created using a simple wizard and the following criteria are configured:

  • Select one or more VMs (batch processing for efficiency)
  • Specify guest operating system credentials (to install AHV device drivers)
  • Specify network mappings to retain network configuration (correlate the source network in vSphere to destination network in AHV)
  • Specify a migration schedule, if required, to seed data in advance

xtract2

When a VM is configured for migration, a copy of the source ESXi VM is created in AHV, and then any changes to the source VM are synchronized to the destination VM up until the point of cutover.  Downtime for the cutover is minimized and only incurred when the source ESXi VM is powered down and the destination AHV VM is powered up.  The source VM is left on the ESXi host in an unmodified state so that it can be reverted to if an issue is encountered during testing.  Migration can be paused, resumed, or canceled at any time.

xtract1

Xtract for VMs is available at no additional charge to all Nutanix customers.  However, there are some caveats and requirements for use, which can be found on Nutanix’s support site, including but not limited to:

  • Source node must be running ESXi 5.5 or higher
  • vCenter 5.5 or higher must be present and used for migration
    • migration directly from ESXi hosts is not possible
  • Certain disk configurations aren’t supported, such as:
    • Independent disks
    • Raw Device Mappings (RDM’s)
    • Multi writer (shared) disks
  • Guest OS must be supported by AHV

Nutanix has built an increasingly compelling argument for migrating to AHV from other hypervisors, as acquisitions (Calm, for automation and orchestration) and product enhancements (such as network visualization, micro-segmentation, a self-service portal, AFS/ABS services, etc.) have made their solution more than “just another hypervisor”, and they now have an answer for most any use case or requirement.

Customers who were hesitant to adopt a “relatively new” hypervisor a couple of years ago, or who had a particular use case (such as micro-segmentation via VMware NSX tying them to vSphere), may now have a viable alternative, and I suspect that more customers will, at a minimum, be investigating the possibility of migrating away from their existing hypervisor.

If you’re like me, you like options and flexibility in a solution.  Competition is good, and if nothing else, maybe your next VMware renewal will be a little bit cheaper 😉  Easy-to-use in-box tools such as Xtract for VMs that simplify the migration process and increase its probability of success make it an easier “sell” aside from the “dollars and cents” argument.

Read more about Xtract for VMs on the Nutanix Blog, download it HERE, or read the user guide HERE.

 

 

Planning Firewall Rule Sets for Micro-segmentation

I recently gave a short presentation at the local VMUG UserCon on “my journey to micro-segmentation” and thought I’d adapt part of my slide deck into a blog post on how I go about planning and implementing firewall rules for applications or tenants using the VMware NSX Distributed Firewall.

The more I work with the Distributed Firewall, the more I realize there really isn’t one “right way” to secure your VMs with it…it’s less about how the sausage is made and more about the end result.  However, there are a few steps I see as a requirement when laying down firewall rules over an existing production environment, to ensure rules are scoped accurately and necessary traffic isn’t blocked, which would cause operational issues.

Step 1 – Gather Information

It is important to gather as much information about the traffic needs of the application or tenant upfront.  There are plenty of well known protocols/ports that we can rattle off from memory for various things…a web server will probably need HTTP/S over ports 80/443, a file server will probably be talking SMB over port 445, SQL Server on port 1433, LDAP over 389 or 636, etc.  These are the easy ones.

However, there are inevitably going to be in-house developed apps or industry specific apps that use custom ports or protocols that are in an application tier or within a tenant that may be harder to identify.  The vendor may have good documentation for all ports and directions of communication, but I think this is more the exception than the rule.

If you go to your application owner or support team and ask “what does each tier of your application talk to and over what protocols and ports” you will probably get a reaction pretty close to this:

fwrdevelopment1

If you do get something back from them, there’s a good chance it’s not right…or at least not the full picture.  This is not a knock on app teams, it’s just the reality of the situation that servers and applications are far chattier than many people realize or give thought to.  Being able to block traffic at the virtual NIC level is extremely powerful, and when you have to take into consideration east-west traffic flows that were typically never firewalled in a “physical world”, there is a lot more work that needs to be done.

Step 2 – Trust…But Verify

To have any legitimate chance for success at laying down DFW rules on an existing application or tenant without breaking things you need a traffic monitoring tool.  That statement should probably be in CAPS, italicized, and the color red.  You will inevitably block traffic that you were unaware of if you do not monitor the actual traffic flows to and from the VM.  It will happen.  Even if you do somehow know all ports and protocols that are in use by a particular system, it’s not uncommon to find misconfigurations in the guest OS such as pointing to incorrect DNS servers that could be problematic when firewalling a VM.

There are lots of solutions out there that can get you this data.  You could pull it from syslog, maybe your network team already has some sort of NetFlow aggregator, or some in house developed solution.  Whatever you choose, the important part is that you are able to pull relevant, accurate, and easy to consume information from it that allows you to be actionable.  Manually parsing 20,000 lines in a .CSV is none of those things.

For this, we chose vRealize Network Insight.  Not only does it log all flows from the vSphere Distributed Switch, there are tons of “NSX’y” things it does above and beyond traffic flows, such as environmental health checks, alerting, and visualization of the network in both the logical and physical realms.  I wish VMware would consider bundling in some of the vRNI features at certain tiers of NSX licensing – the easier you make it to consume the solution, the more of it people will buy.  VMware did pay a pretty penny for the Arkin acquisition, so I also understand the need to monetize it.  Regardless, we found it fairly reasonably priced, and when you take into consideration the potential financial loss from causing an application outage, it was a no-brainer.

The following screenshot is an example of the information you can extract from vRNI.  In my opinion, the most powerful piece of vRNI is the “search” feature.  It’s extremely intuitive to query and that’s how I generate all of my flow data used in firewall rule development.

fwrdevelopment2

I will typically use a query like “show flows where src ip = 192.168.1.0/24” or “show flows where vm name like [vm name]”.  What I’m after is showing both sides of the communication flow…I want to know all traffic inbound to the app / VM / tenant as well as all communication outbound from it, and I adjust my queries as such.
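A few more example queries along those lines – a sketch only, since the exact field names and operators can vary a bit between vRNI versions, so lean on the search auto-complete to confirm the syntax:

    show flows where src ip = 192.168.1.0/24
    show flows where dst ip = 192.168.1.0/24
    show flows where vm name like 'app-server-01'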

Once I’ve gotten the data I want from vRNI, I will export the reports to a .CSV file and further massage the data…possibly removing traffic I know will not be required, removing rows based on the IP or subnet, etc.  This is one of those areas that has to be left open to interpretation, as there are a ton of different ways to scope and filter the data to be applicable to your specific environment.  If you’ve done your job well, at this point you should have identified probably 95% of the required traffic and can begin creating rules in the Distributed Firewall.

Step 3 – Proceed With Caution

With great power comes great responsibility – as mentioned previously, the ability to apply firewall rules at the vNIC level, before it’d ever hit physical media let alone route through a physical firewall, is extremely powerful.  It introduces all sorts of new ways you can really screw up your day if you don’t take a few precautionary steps.

fwrdevelopment7

I’m going to give a brief overview of my approach to micro-segmentation with the DFW so that the rest will make sense.  When I deploy NSX into an environment, I do so from a position of extreme caution.  I want to avoid an administrator making a mistake, a firewall rule being too broadly scoped, or a bug/issue with the solution itself from creating an outage.  Again, there are lots of ways to go about this – I’m not saying my way is THE way, but hear me out.

  1. The first thing I do is ensure that the “default rule” in the NSX DFW is set to allow.  In earlier versions of NSX, you were given the choice during deployment of NSX Manager whether or not you wanted your default rule to “allow” or “deny” traffic.

    If you chose “deny” you’d probably end up “islanding” your environment, as there are no DFW rules yet, so of course traffic will be blocked.  However, the default rule is set to “allow” by default during installation in current versions of NSX, and I don’t think you are even given the option to choose otherwise.  Anyone (myself included) who has played around with NSX in their lab has at one point or another probably blocked all their traffic by mistake and had to issue a REST API call to remove it (see the hedged API sketch after this list).

    Instead of having a “global” default rule set to deny, I’ve been doing it on a per-app or per-tenant basis – at least during the rollout phase where the majority of the systems in the environment are not being firewalled yet.  Having the default rule set to “allow” ended up being highly prescient, which I’ll touch on in the next section.

  2. The next thing I do is place any VM not actively being firewalled in the “exclusion list” on the NSX Manager.  This prevents any of that VM’s traffic from being processed for filtering by the DFW.

    By doing this I’m hoping to avoid an issue where I have VMs that don’t have DFW rules created for them yet, and somehow the default rule gets flipped to “block” or a rule is too broadly scoped and ends up causing me problems.

    There is actually a bug in one of the recent versions of NSX where a condition occurs that causes all VM’s to be removed from the exclusion list by mistake, suddenly opening them up to DFW filtering.  If your default rule is set to “block” and you don’t have rules in place allowing the necessary traffic, you now have an outage on your hands.  Thanks VMware.  This did actually happen to me and I suddenly felt very glad I had left my “global” default rule as “allow”, therefore an outage was avoided.

  3. I use the “applied to” field in the Distributed Firewall to limit the scope of systems considered for processing of that particular firewall rule.  The default setting applies a newly created firewall rule to the “distributed firewall”, so any VM not on the exclusion list is checked against it for processing.  In a large environment, that’s going to be hundreds or even thousands of rules being checked that have nothing at all to do with that system.  There have been times I was troubleshooting an issue by showing what rules were applied to the vNIC, and if literally every rule in the environment showed up in the list, it would have greatly complicated troubleshooting.

    If the firewall rule set applies to an entire tenant, I’ll create a Security Group that contains that tenant’s VM’s and have all the rules in that section have their “applied to” field configured for that Security Group.  If a firewall rule set applies to a single VM or tier of VM’s, I may select the individual VM or possibly a Security Group in the “applied to” field.

    Having “applied to” configured limits the scope and failure domain that a misconfigured rule may impact…instead of it applying to the entire environment, maybe you just block traffic on a handful of VM’s and the damage is much smaller in scope.
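Speaking of that REST API call – if you ever do lock yourself out with a bad default rule, the Distributed Firewall configuration can be reset back to its default (allow) state directly against the NSX Manager API.  A hedged sketch, assuming NSX for vSphere 6.x; verify the exact call against the API guide for your version before you need it in anger:

    # reset the DFW to its default configuration (default allow rule), bypassing vCenter / the Web Client
    # <nsx-manager> and the admin password are placeholders for your environment
    curl -k -u 'admin:<nsx-manager-password>' -X DELETE \
      'https://<nsx-manager>/api/4.0/firewall/globalroot-0/config'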

Step 4 – You’ve Got the Data, Now Do Something With It

By now you’ve probably (or not) talked with your application owners about how their application communicates with the environment, you’ve generated flow data from vRealize Network Insight, and massaged the .CSV output to further refine the data.

It’s now time to take that output and create your initial firewall rule set in the DFW.  The below screenshot depicts a sample firewall rule set for a tenant.  There are multiple applications within this tenant, and its VM’s have been placed into Security Groups by app or function.  The flow data from vRNI was used to allow the appropriate traffic in or out bound.  A Security Group containing all the tenant’s VM’s was used as the scoping object in the “Applied To” field, for the reasons mentioned previously.

At the end of this rule set you will see the two “default” rules for this tenant.  The “outbound” default rule has a source of [tenant security group], a destination of “any”, and a service of “any”…with the source and destination reversed for the “inbound” default rule.  The “Action” is currently set to “Allow” during the analysis phase, so that logging can be enabled on the “default rules” to see if any traffic that didn’t match one of the previous rules in the rule set registers as a “hit”.  These “hits” will obviously result in blocked traffic once the default rules get set to “block”.

fwrdevelopment3

With logging enabled, we can go to the host(s) that contain the app or tenant’s VM’s and parse the “dfwpktlogs.log” log file to see if any traffic that wasn’t accounted for is hitting the default rule.  This is kind of the “last chance” to rectify any missed traffic – there may be things legitimately blocked and logged here that you do not need to be concerned about…outbound web traffic to Microsoft on a Windows server for example.  It’s the “other” we are concerned about now.

To parse the “dfwpktlogs.log” file, open an SSH session to your host(s) and enter the following commands:

  • cd /var/log
  • cat dfwpktlogs.log | grep 1125 | grep 192.168.1 | grep 2017-05-16

fwrdevelopment4

The above command parses the dfwpktlogs.log file, filtering by rule ID 1125 (the outbound “default rule”), by IP subnet, and by date (to avoid returning flows from days when logging was enabled previously).
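If the default rule is getting a lot of hits, a slightly longer pipeline can help summarize them instead of scrolling line by line.  A rough sketch, assuming each log line records the flow as a single “srcip/port->dstip/port” token – check a line of your own output first to confirm the format:

    cd /var/log
    # pull the flow field out of every default-rule hit and count the unique src/dst pairs
    grep 1125 dfwpktlogs.log | grep 192.168.1 | grep 2017-05-16 \
      | awk '{for (i = 1; i <= NF; i++) if ($i ~ /->/) print $i}' \
      | sort | uniq -c | sort -rn | head -20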

Enabling logging on a “default rule” can generate a large amount of data, so it’s recommended to only leave logging enabled temporarily – a time period measured in hours or maybe a day.  Enable it during times that would represent “normal” business function or during a time that some core process runs for that app/tenant to give yourself the most valid logging data.

If you see traffic that you believe should be allowed but is instead hitting the default rule, either modify an existing rule to include the traffic or create a new rule within the rule set to allow it.  Once the logs are clean, or you’re only seeing traffic you expect to be blocked (i.e. outbound internet traffic to Microsoft from a Windows server) then you’re ready to flip the “default rules” to “block”.

Step 5 – Ongoing Operations

So you’ve planned out all your micro-segmentation rules, you’ve created the initial rule set, you’ve monitored the dfwpktlogs.log files to make sure you didn’t miss anything and adjusted the DFW rules where necessary, and you’ve switched your “default rule” to block, and everything went well…….now what?

First thing – pat yourself on the back.  While not overly difficult, properly planning the micro-segmentation of an application or tenant can be quite time consuming to account for all necessary traffic to avoid issues.

fwrdevelopment5

OK, now that that’s out of the way…you’ll inevitably get a panicked email from an application owner saying “we performed an upgrade and now the application won’t start”…the upgrade changed or added some ports, they weren’t in the original rule set created for the app, and now they’re blocked.

While you can certainly generate some new flow data from vRNI, I’ve found the quickest and easiest place to check for a blocked flow is the “Flow Monitoring” section in the NSX management GUI.  The time window to show flow data for is completely configurable – if you’re like me, you rarely find out about an issue shortly after it happens…most likely you’ll be going back several days to find that needle in the haystack.  By using an appropriate time window, selecting from the “Blocked Flows” tab, and using the “filter” mechanism, you should be able to find the issue with little effort.

fwrdevelopment6

Summary

Hopefully you found this post helpful.  As mentioned several times already, the NSX Distributed Firewall is extremely powerful and gives you great flexibility in how to accomplish an increased security posture in your environment.  This methodology is not necessarily THE way, but it’s my way and it has worked out pretty well so far.  As always, I’m open to hearing about new and better ways if you have a different way of doing it.

Installing Nutanix NFS VAAI .vib on ESXi Lab Hosts

This post covers the installation of a Nutanix NFS VAAI .vib on some “non-Nutanix” lab hosts.

Why would one do this?  Several months ago I stood up a three-node lab environment accessing “shared” storage using a Nutanix filesystem whitelist (which allows defined external clients to access the Nutanix filesystem via NFS).  While the Nutanix VAAI plugin for NFS would normally be installed on the host as part of a Nutanix deployment, it obviously was not there on my vanilla ESXi 6.0 Dell R720 servers accessing the whitelist…which made things like deploying VMs from template and other tasks normally offloaded to the storage unnecessarily slow.

Since Nutanix just released “Acropolis Block Services / ABS” GA in AOS 4.7 (read more about it at the Nutanix blog), there’s probably less of a reason to use filesystem whitelists for this purpose now, but alas, maybe someone will find it useful.  (*edit* – it’s worth noting that ABS doesn’t currently support ESXi.  I haven’t tried to see if it actually works yet, but needless to say, don’t do it in a production environment and expect Nutanix to help you.  *edit 1/27/17* – as of AOS 5.0, released earlier this month, ESXi is supported with ABS.)  At the time of this blog post, Windows 2008 R2/2012 R2, Microsoft SQL and Exchange, Red Hat Enterprise Linux 6+, and Oracle RAC are supported.  NFS whitelists aren’t supported by Nutanix for the purpose of running VMs, either.

  1. The first step is to SCP the Nutanix NFS VAAI .vib from one of your existing CVM’s.  Point your favorite SCP client to the CVM’s IP, enter the appropriate credentials, and browse to the following directory: /home/nutanix/data/installer/%version_of_software%/pkg.  Copy the “nfs-vaai-plugin.vib” file to your workstation so that it can be uploaded to storage connected to your ESXi hosts using the vSphere Client.
  2. Once the .vib is uploaded to storage accessible by all ESXi hosts, SSH to the first host to begin installation.  You may need to enable SSH access on the host as it’s disabled by default.  This can be done by starting the SSH service in %host% > Configuration > Security Profile > Services “Properties” in the vSphere Client.
  3. Once logged in to your ESXi host, we can verify that the NFS VAAI .vib is missing by issuing the “esxcli software vib list” command.  If the .vib were present, we’d see it at the top of the list.
  4. Now we need to get the exact path to the location where you placed the .vib on your storage.  This can be done by issuing the “esxcli storage filesystem list” command.  You will be presented with a list of all storage accessible to the host, the mount point, the volume name, and the UUID.  Highlight the “mount point” of the appropriate storage volume so that it can be pasted into the next command.  Alternatively, you could use the “volume name” in place of the UUID in the mount point path, but this was easier for me.
  5. Next, we will install the .vib file using the “esxcli software vib install -v “/vmfs/volumes/%UUID_or_volume_name%/%subdir_name%/nfs-vaai-plugin.vib”” command.  I created a subdirectory called “VIBs” and placed the nfs-vaai-plugin.vib file in it.  Be careful, as the path to the file is case sensitive.  If the install was successful, you should see a message indicating it completed successfully and that a reboot is required for it to take effect.  Assuming your host is in maintenance mode and has no running VMs on it, go ahead and reboot now.
  6. Once the host has rebooted and is back online, start a new SSH session and issue the “esxcli software vib list” command again, and you should see the new .vib at the top of the list.  Voila!  You can now deploy VMs from template in seconds.  (A consolidated sketch of the commands used in this list follows below.)
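For convenience, here is the whole sequence from the list above in one place – a sketch only, with the datastore and “VIBs” subdirectory names being the examples I used, so adjust the path for your environment:

    esxcli software vib list | grep -i nfs     # confirm the VAAI plugin is not installed yet
    esxcli storage filesystem list             # find the mount point of the datastore holding the .vib
    esxcli software vib install -v "/vmfs/volumes/<datastore>/VIBs/nfs-vaai-plugin.vib"
    reboot                                     # required for the plugin to take effect
    esxcli software vib list | grep -i nfs     # after the reboot, verify the plugin is now listed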

Veeam + Nutanix: “Active snapshots limit reached for datastore”

Last night I ran into an interesting “quirk” using Veeam v8 to back up my virtual machines that live on a Nutanix cluster.  We’d just moved the majority of our production workload over to the new Nutanix hardware this past weekend and last night marked the first round of backups using Veeam on it.

We ended up deploying a new Veeam backup server and proxy set on the Nutanix cluster in parallel to our existing environment.  When there were multiple jobs running concurrently overnight, many of them were in a “0% completion” state, and the individual VM’s that make up the jobs had a “Resource not ready: Active snapshots limit reached for datastore” message on them.

veeam 1

I turned to the all-knowing Google and happened across a Veeam forum post that sounded very similar to the issue I was experiencing.  I decided to open up a ticket with Veeam support since the forum post in question referenced Veeam v7, and the support engineer confirmed that there was indeed a self-imposed limit of 4 active snapshots per datastore – a “protection method” of sorts to avoid filling up a datastore.  On our previous platform, the VM’s were spread across 10+ volumes and this issue was never experienced.  However, our Nutanix cluster is configured with a single storage pool and a single container with all VM’s living on it, so we hit that limit quickly with concurrent backup jobs.

The default limit of 4 active snapshots per datastore can be modified by creating a registry DWORD value named MaxSnapshotsPerDatastore in ‘HKEY_LOCAL_MACHINE\SOFTWARE\Veeam\Veeam Backup and Replication\’ and setting the appropriate hex or decimal value.  I started off with ’20’ but will move up or down as necessary.  We have plenty of capacity at this time and I’m not worried at all about filling up the storage container.  However, caveat emptor here because it is still a possibility.
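If you prefer to script the change rather than click through regedit, a .reg import along these lines should do it – a sketch based on the key path above, with 0x14 being 20 in decimal; adjust the value to taste, and note that the Veeam Backup Service will likely need a restart to pick it up:

    Windows Registry Editor Version 5.00

    [HKEY_LOCAL_MACHINE\SOFTWARE\Veeam\Veeam Backup and Replication]
    "MaxSnapshotsPerDatastore"=dword:00000014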

This “issue” wasn’t anything specific to Nutanix at all, but is increasingly likely with any platform that uses a scale-out file system that can store hundreds or thousands of virtual machines on a single container.

VMware NSX Lab Environment – Part 2: Prepare Hosts and Deploy NSX Controllers

Introduction

In Part 2 of this series I will cover preparing the ESXi hosts for NSX and deploying an NSX Controller cluster.  As mentioned in the first part of this series “Part 1:  Import and Configure NSX Manager“, the NSX Manager facilitates the deployment of the Controller clusters and ESXi host preparation (among other things), so needless to say having it up and functioning is a prerequisite for this phase.

At the completion of this post, the NSX environment should be mostly configured and we will be able to start doing fun stuff like deploying logical switches, setting up distributed routing, and playing with distributed firewall rules.  I’m pretty excited. /geekout

As you may have gathered from their name, the NSX Controllers reside in the “Control plane”, while the NSX Manager resides in the “Management plane”.  Services such as logical switches and distributed logical routers / firewalls are hypervisor kernel modules and reside in the “Data plane”.  The NSX Edge, a virtual appliance (or several), also resides in the data plane.  The focus of this post will be the Control plane.

This diagram, courtesy of the VMware NSX 6 Design Guide, depicts the various “planes” in which NSX components reside.

Controller Deployment 2

Deploying the NSX Controllers

1. Now that NSX Manager is running and linked to your vCenter server, the next step in the process would be to enter your NSX licenses.  However, since this is a lab and I am running in evaluation mode for 60 days, I have nothing to enter here.  If you do own the product, or are lucky enough to have a license key for perpetual lab purposes, you’d just go to “Administration > Licensing > Licenses” in the vSphere Web Client and enter the appropriate license keys.

Controller Deployment 1

2. After you’re licensed (or running in eval mode), it’s time to deploy the NSX Controllers.  Under “Networking and Security”, click “Installation”.

Controller Deployment 3

3. Click the green “+” sign underneath “NSX Controller nodes”…

Controller Deployment 4

…and you will be asked to enter vSphere cluster, storage, and networking info.

Controller Deployment 5

Click “Select” next to “IP Pool”.  If you’ve already created one you’d like to use, select it and then click “OK”.  Otherwise, click the green “+” and add a new IP Pool.  Once you’ve entered the details appropriate to your environment, click “OK” on the “Add IP Pool” window, select the radio button next to your new IP Pool, and then click “OK” again on the “Select IP Pool” window.

I’ve created a 10 address IP pool for the NSX Controllers to use – technically you’d only need as many IP addresses as you’d have NSX Controllers, but I get kind of OCD about that sort of thing so I’ve allocated a block of 10.  IP’s are one of the few things I have an abundance of in the lab environment.

Controller Deployment 6

4. Click “OK” on the “Add Controller” window.

** It’s worth noting – the NSX Controller password requirements are fairly strict.  My “lab password” I’ve been re-using throughout the environment did not meet the length requirement and it complained…as you see here.  I have a feeling this is one of those passwords you don’t want to lose so be sure to keep it somewhere safe.

Controller Deployment 7

Assuming everything is good with your configuration, NSX Manager should begin deploying the Controllers to your vSphere environment using the credentials supplied when linking it to vCenter (as you can see in the “Initiated by” column under “Recent Tasks”).  It’s also worth noting that the NSX Controllers get a unique identification string appended to their names – avoid the urge to modify this – it’s by design.  (And yes, I did just throw a screenshot of the “fat”/C# client in here…I do catch myself flipping back to it from time to time.  It’s a habit I’m working on breaking 😛 )

Controller Deployment 8

5. Once the Controller deployment has completed, it should show a “Normal” status in the vSphere Web Client window.  If that’s the case, it’s fairly safe to say subsequent Controller deployments will be successful, so you can now repeat this process 2 more times in order to meet the “3 Controller node minimum” recommendation.  NSX Controllers should be deployed in odd numbers so that a “majority vote” can occur for electing a Master controller.

**You will not be prompted to enter a Controller password again on subsequent Controller deployments – this is shared between all NSX Controllers in the Controller cluster.

Controller Deployment 9

The Controllers are clustered?  Yes, they are.

Without going into too much detail (the NSX Design Guide does a great job explaining it), the “responsibilities” of an NSX Controller get distributed among all members of the Controller cluster.  A Master Controller is responsible for determining when a Controller node has failed and to which nodes the “slices” of the roles it held should be transitioned.

This image, courtesy of the VMware NSX Design Guide, shows the number of nodes that can fail for a particular NSX Controller cluster count.

2015-04-27 13_25_03-NSX 6 Design Guide.pdf - Adobe Reader

6. In this step, we will create “anti-affinity” rules for the NSX Controllers to ensure no two Controllers ever reside on the same host.  This is an important step in mitigating impact to the NSX environment if an ESXi host fails.  For a lab environment it’s probably not a big deal but I felt it was important to show, as I frequently see vSphere environments with no DRS rules setup when they should probably be used for resiliency of redundant guest VM’s.  As has been commented on by others, I’m kind of surprised that the creation of the anti-affinity rule isn’t done by NSX Manager automatically when deploying two or more Controllers.  I believe other components, such as NSX Edges, do have a rule created by default…perhaps I’m mistaken and will find out shortly.

Anyhow…

In the vSphere Web Client, navigate to the “Hosts and Clusters” view, select the applicable vSphere cluster, then click the “Manage” tab.  Select “VM/Host” Rules and then click “Add”.

Controller Deployment 10

7. Give your DRS Rule a name – I usually like to get descriptive about the nature of it (i.e. add “Anti-Affinity”) in the name.  Click the “Add” button, select your NSX Controllers, and click “OK”.  Click “OK” once more on the “Add VM/Host Rule” window.

DRS Rules 2

8. Now we will prepare the ESXi hosts for NSX.  Click on the “Host Preparation” tab and then select “Install” next to the appropriate cluster(s).  When prompted, click “Yes” to continue with the install.

Controller Deployment 12

If the host preparation was successful, you should see a green checkmark underneath the “Installation Status” and “Firewall” columns.

Controller Deployment 13

9. The next step is to configure VXLAN on our NSX-enabled cluster.  On the “Host Preparation” tab, under the “VXLAN” column, select “Configure”.  You will need to select a Distributed Virtual Switch for VXLAN traffic (I’ve created one dedicated for that purpose with a single vNIC uplink…hey, it’s a lab), enter the appropriate VLAN, and set your MTU size.  It’s not recommended to go below 1600 due to the ~50 byte VXLAN header addition, so make sure your underlying physical (or in this case, virtual) network is configured for jumbo frames.

I’m going to create an IP pool dedicated for VTEP’s so I’ve selected “New IP Pool…” from the drop down box.

Controller Deployment 14

Enter the appropriate IP Pool information here, then click “OK”.  Like the IP Pool I created for the NSX Controllers, this one has 10 IP addresses in it.  You will need a pool large enough to provide IP addresses for each VTEP (VXLAN Tunnel End Point) interface on each host in that IP space.

Controller Deployment 15

Click “OK” in the “Configure VXLAN networking” window.

At this point, we should have green checkmarks across the board on our “Host Preparation” tab.

Controller Deployment 16

** There are considerable design decisions that must be made when choosing your VMKNic Teaming Policy – the layout of your physical networking and Distributed Virtual Switch uplink configuration could dictate which options are viable.  The VMware NSX Design Guide goes over this in great detail (beginning around page 73) and is worth the read.

This table courtesy of the VMware NSX Design Guide shows the teaming and failover modes available based on the uplink type.

Controller Deployment 17

10. The next step is to create a Segment ID (which I believe is also sometimes called a VXLAN Network Identifier/VNI) Pool.  I like to think of Segment ID’s as special VLAN’s inside of the NSX environment – they are used to differentiate the various logical network segments just like VLAN’s on a physical switch logically segment the traffic.  NSX lets you specify a range from 5,000 to 16,777,216…so roughly 16 million possibilities.  The range you specify in your Segment ID Pool will dictate the maximum number of logical switches available to your NSX environment.

Under the “Installation” section, click the “Logical Network Preparation” tab and select the “Segment ID” section.  Click “Edit” next to “Segment IDs & Multicast Address allocation…”

Controller Deployment 18

11. Specify your Segment ID Pool range.  I chose 5000-5999, which gives me 1000 possible network segments…far greater than I’d ever need in a lab, but hey why not?

Controller Deployment 19

** I’ve not checked the option to “Enable multicast addressing” and am relying on Unicast for my BUM traffic (Broadcast, Unknown unicast, and Multicast).  Not to sound like a broken record, but there are considerable design decisions you’d make to determine whether or not to use Multicast, Unicast, or Hybrid mode.  Page 25 of the VMware NSX Design Guide goes into detail about the pros and cons of each, when to use or not use them, etc.  This is not something I ever had to give much thought to in my “server centric” world prior to starting down the NSX wormhole, and I found it one of the harder concepts to grasp and remember when studying for the VCP-NV exam.  This is an area I am still shoring up because it’s directly related to the way NSX propagates information throughout the environment, so it’s obviously a critical piece.

12. The final piece is to configure our Transport Zone.  What is a Transport Zone you ask?  Well, per VMware, it quite literally “defines a collection of ESXi hosts that can communicate with each other across a physical network infrastructure.”  In other words, it determines which cluster(s) participate in the NSX environment.

Click on the “Transport Zones” section, then the green “+” sign.

Controller Deployment 20

Give the Transport Zone a name, select the appropriate replication mode, and the cluster(s) you wish to be included in the Transport Zone.  Click “OK”.

Controller Deployment 21

13. At this point, all the NSX Controllers and supporting configuration should be in place.  Review each tab under “Networking and Security > Installation > Logical Network Preparation” to ensure everything looks correct.

Controller Deployment 22

Controller Deployment 23

Controller Deployment 24

If so, we’re ready to do all the fun stuff you really wanted to deploy NSX for.  The next post in this series will cover the configuration of logical switching, distributed routing, and Edge Services.  A couple of the big things I’m looking to demonstrate are isolation of mock customers in a multi-tenant environment and securing a VDI deployment with NSX.

Being my first go-round installing NSX, if you see any inaccuracies or a better way of doing things, please let me know.  And as always, thanks for reading!

 

VMware NSX Lab Environment – Part 1: Import and Configure NSX Manager

Introduction

This blog series will cover the installation and configuration of VMware NSX 6.1.3 in a lab environment.  As such, there are certain design considerations I am overlooking because this is not a production deployment and lab resources are somewhat limited.  It goes without saying a production deployment may and should look a little different 😉

An example of this would be to distribute the vSphere and NSX infrastructure into multiple clusters – a “Management” cluster which might house vCenter, NSX Manager, and the NSX Controllers; an “Edge” cluster which might house things like the NSX Edge Services Gateway and the Distributed Logical Router (DLR) / DLR Control VM, which control the flow of L2 (NSX Bridging) or L3 traffic (NSX Logical Routing) into and out of the NSX logical networking environment; and a “Compute” cluster where the bulk of your server and/or desktop virtualization workload would live.  In a production deployment, these clusters may even span two or more racks to provide resiliency of all workloads in the event of a rack loss.

Example of “dispersed clusters” courtesy of VMware Hands on Labs (Management/Edge cluster + Compute cluster):

NSX Clusters

However, my purpose for deploying NSX in my lab is to get more hands on exposure to the product and as a “proof of concept” for the various capabilities of NSX.  While important to know all the design considerations, it’s not particularly important in THIS instance.

I highly recommend familiarizing yourself with the NSX Design Guide – it’s a great piece of reference material.  I read it start to finish as part of studying for the VCP-NV exam and largely attribute its content, in combination with the VMware Hands on Labs, for passing the exam.  The NSX Design Guide can be found HERE.

The Lab Environment

I will also not be covering the installation and configuration of the various vSphere 6.0 infrastructure that all the NSX components ride upon.  There are many great blogs and white papers covering this, and let’s be honest – if you’re looking to lab out NSX, you probably already have vSphere configuration down pat.

This NSX lab environment exists on a nested ESXi cluster running vSphere 6.0.  There are three ESXi 6.0 virtual machines, each with 24 GB of RAM, 4 vCPU, and ~200 GB of “local” storage across multiple datastores (a couple of which will be used for VSAN at a later date).  A vCenter Server Appliance 6.0 was imported into the nested environment and runs on top of the virtualized ESXi hosts.  Running a nested hypervisor gives you a lot of flexibility on the hardware which it runs on (and flexibility for isolation if desired, to avoid screwing important stuff up 😛 ), so this could just as easily run on top of a couple “home lab” type boxes without issue.

William Lam (@lamw on Twitter) has a GREAT series of posts on his blog virtuallyGhetto.com detailing the requirements and configuration for running nested ESXi – I highly recommend checking it out.  In fact, the 3-node ESXi cluster I am using for this blog post was deployed from an .OVF file he’s made available to the community.  Check out his Nested ESXi Series here and VSAN .OVF Template series here.  I’ve configured this lab environment to be 100% isolated to the outside world from a network, storage, and hardware perspective…which gives me some freedom to make changes and mistakes without breaking anything I care about.

There’s probably many (and better) blogs covering this same subject, but hey, I need the blogging practice anyway so maybe someone will find it useful…so thanks in advance for reading.

And without further ado, importing and configuring NSX Manager.

Import the NSX Manager

The first step in getting NSX running in your environment is to install and configure the NSX Manager.  The NSX Manager is a virtual appliance that is responsible for deploying all the other components of NSX such as the NSX Controllers, Edge Gateways, etc.

There is a 1:1 relationship between a NSX Manager and a vCenter server – one NSX Manager serves a single vCenter Server environment.

** As stated in the NSX Installation Guide, the NSX Manager virtual machine installation includes VMware Tools.  Do not attempt to upgrade or install VMware Tools on the NSX Manager.  One of the first things I noticed (and felt compelled to fix) once the NSX Manager was running was the angry yellow “VMware Tools is outdated on this virtual machine” warning on the NSX Manager VM Summary tab.  Fight the urge and leave it at the included VMware Tools level.

NSX Manager 1

1. Deploying the NSX Manager must be done through the vSphere Web Client instead of the C# client due to some “extra configuration options” that are only present in the Web Client. Right click on your vCenter Server and select “Deploy OVF Template”.

** You must have the Client Integration Plugin installed in order to deploy an OVF through the vSphere Web Client.  I’ve had issues with the plugin and Internet Explorer 11, so I recommend running this through Chrome or Firefox until/if those issues are resolved.

NSX Manager 2

2. Browse to your .OVA file.  Mine is on an .ISO mounted in the virtual DVD drive of the virtual “jump box” workstation I access my lab environment from.  Click “Next”.

NSX Manager 3

NSX Manager 4

3. On the “Review details” screen, select the “Accept extra configuration options” check box (this option wouldn’t be presented in the C# client, and then you’d have issues with your NSX Manager).

NSX Manager 5

4. Accept the EULA, blah blah blah, then click “Next”.

NSX Manager 6

5. Select a name for the NSX Manager – I got super creative with mine and left it “NSX Manager” – select a folder/location, then click “Next”.

NSX Manager 7

6. Select a cluster or host, then click “Next”.

NSX Manager 8

7. Select a disk format.  I chose “Thin Provision” since this is a lab and storage is at a premium right now.  If necessary, change your VM Storage Policy.  While this lab will have VSAN configured in it eventually, it does not now, and I’m installing on “local” storage (in quotes because it’s actually a .VMDK file on a SAN LUN) of my ESXi host, so I’ve left it at “Datastore Default”.  Click “Next” once you’ve selected the appropriate storage settings for your environment.

** Another shout out to William Lam (@lamw) and his blog www.virtuallyghetto.com for the super handy nested ESXi VSAN OVF templates from which this lab cluster was deployed.

NSX Manager 9

8. Now it’s time to select a network for your NSX Manager.  Right now I’m using the default “VM Network” that was created when I installed ESXi, but I’m essentially treating it as my management network for management/vMotion VMkernel interfaces.  I’ll have some other network interfaces for VXLAN traffic etc. which will be setup at a later time.  Once you have selected the appropriate network, click “Next”.

NSX Manager 10

9. The “Customize template” window has quite a bit for you to fill in – there are some passwords for various purposes on the NSX Manager, IP/DNS/network settings, etc.  Fill in the info as it applies to your environment, then click “Next”.

NSX Manager 11

10. On the “Ready to complete” screen, verify all your information is correct before clicking finish.  If everything looks good, click “Finish”.  The NSX Manager virtual appliance will now be deployed and powered on automatically (if selected).

NSX Manager 12

Configure the NSX Manager

1. Once the NSX Manager appliance has finished booting and is online, log into it to resume configuration.  Browse to the IP address you specified in Step 9 – for example, https://172.16.99.150.  The credentials used for login will also have been specified in Step 9: username “admin” and the password you defined.  If you did not define a password during installation, it should be “default”.

NSX Manager Home

2. The first thing to do is check that all the necessary services are running…if they’re not, you won’t get much further.  Click the “View Summary” button to be taken to the appliance summary page.  All services should show “Running”, with the possible exception of the “SSH Service”.

NSX Summary

 

NSX Summary 2

3. Now we will register NSX Manager with your vCenter Server.  Click the “Manage” tab, and then click “NSX Management Service” under the “Components” menu section.

NSX Manager 14

Click “Edit” to enter your vCenter Server details

NSX Manager 15

4. Enter the appropriate credentials to connect to your vCenter Server.  Yeah yeah yeah, I used my personal lab account instead of a service account…so what?!  Click “OK” and you should be prompted to trust the vCenter Server certificate.  Click “Yes” on this window.

NSX Manager 16

NSX Manager 17

 

**Update**

I ended up ditching using my “personal” lab account for the vCenter connection…not exactly sure why yet, but whenever I logged into vCenter it showed a “No NSX Managers available” error.  I switched to a new “service” account that also has Administrator rights in vCenter and the NSX Manager populated correctly.  It even showed my “personal” account listed as an NSX Enterprise Administrator but I could not see anything.  After logging into vCenter with the “nsxservice” service account, I went to Networking and Security > NSX Managers > Manage > Users and deleted my personal account and re-added it as an Enterprise Administrator.  Once I did that, I was able to log into vCenter with my personal account and see the NSX Manager fine.  Who knows…perhaps I did something wrong during the initial vCenter linking.  So just a heads up in case you run into a similar issue.

nsx users 1

5. If the connection to your vCenter Server was successful, you should see a green circle next to the “Status” field.

NSX Manager 18

 

6. Next, the Lookup Service will be configured.  Click the “Edit” button in the “Lookup Service” area.

NSX Lookup Service

Enter your Lookup Service details – it’s worth noting that in vSphere 6.0, the Lookup Service port is now 443.  I had assumed it was still 7444 and ran into some errors applying the configuration.  Luckily I ran across Chris Wahl’s (@ChrisWahl) blog with a quick explanation.

NSX Lookup Service 2

As when linking your vCenter server, trust the certificate.

NSX Lookup Service 3

And you should now see a “Connected” status under the lookup service.

NSX Lookup Service 4

7. The final step for configuring NSX Manager in the lab is to make a backup of the configuration.  It is recommended to do a backup while NSX Manager is in a “clean state” so that it can be rolled back to in the event of an issue during subsequent changes.

Click “Backup and Restore” in the “Settings” menu.

NSX Manager 19

Click “Change” next to “FTP Server Settings”, then enter your FTP/SFTP server details here.  I actually don’t have an FTP server in my lab yet, so I’m going to forego a backup at this point.  I like to live on the edge anyway.  If you’d like to setup scheduled backups, there is an option to do so on this page as well…probably recommended for a production deployment.

NSX Manager 20

8. At this point, configuration of NSX Manager is complete and it has been linked to your vCenter Server.  You should now be able to log into the vSphere Web Client and continue with deployment of the remaining NSX infrastructure pieces.

If you’re currently logged into the Web Client, log out and then log back in and you should see a new “Networking and Security” panel.  Click on “Networking and Security”.

NSX Manager 21

It is from here that most of the NSX configuration will be managed and further NSX services and components deployed.  The next post in this series will cover deploying and configuring the NSX Controllers.  Stay tuned and thanks for reading!

NSX Manager 22

 

Part 2:  Prepare Hosts and Deploy NSX Controllers is up!