VMware NSX Distributed Firewall Rules – Scoping and Direction Matter

Like many of you, I’m sure, I was not traditionally a firewall or security admin prior to adding VMware NSX to my vSphere environment.  As such, there’s been a bit of a learning curve for me in reconciling what I knew [or thought I knew] about physical firewalls with how that translates [or doesn’t] to the NSX Distributed Firewall (DFW).

As I’ve been rolling out NSX DFW rules to various types of systems with different accessibility requirements, I ran across some unexpected behavior when scoping the rules.

Let’s look at an example two-tier application consisting of a “web server” and an “app server”.  If this were a traditional physical firewall setup, the web server would probably be in the DMZ, or at least on a different subnet from the app server.  Traffic would route through the firewall, and rules would be applied to allow or restrict it.

nsx-firewall-blog-v2

As a theoretical example, for our web tier, we’re allowing HTTP/HTTPS/FTP inbound to the web server from “any” source (presumably, any number of public networks), letting FTP back outbound to “any” destination, DNS outbound to our internal DNS servers, and SMB traffic to the app server where files are stored.  We make the assumption that while FTP traffic may be allowed outbound to any destination, it’s only going to reach that destination if it allows FTP inbound.  Everything else is denied by default.  Pretty straightforward.

For the app server, we’re allowing SMB inbound from “any” source (maybe there are several hundred internal VLANs that users could access the server from and it is not accessible externally), RDP is allowed inbound from “any” source, we have various Active Directory / LDAP related ports open for domain membership, pings are allowed outbound to “any” due to a monitoring application hosted on the server, and DNS is allowed outbound to our DNS servers.  Everything else is denied by default.

Based on these firewall rules, when comparing what traffic is allowed in or out of each server, there is really only one traffic pattern which should match between the two, which is SMB from the web server to the app server (highlighted).

However – everything is not as it seems…

At this point, I have created DFW rules functionally identical to the first diagram in this post.  Let’s go through some various connectivity checks…

dfw-rules-3

dfw-rules-4

From the web server, we can access file shares on the app server, thanks to a combination of firewall rule 4 allowing SMB traffic outbound from the web server to the app server, and firewall rule 5, allowing SMB traffic inbound to the app server from “any” source.

dfw-rules-6

From a user workstation, we can pull up the default website on the web server, thanks to firewall rule 1 allowing inbound HTTP traffic from “any” source.  So far so good.

dfw-rules-5

Let’s try to ping it from the same workstation…no dice, and as expected, since ICMP is not allowed anywhere in the rule set “Web Tier” (rules 1 through 4).

dfw-rules-7

Now let’s try the same tests from the app server itself…wait – that’s strange…both ICMP and SMB traffic are allowed from the app tier to the web tier, even though no rule applied to the security group containing the web server specifically allows that traffic in.  Is such a thing even possible?

dfw-rules-meme-2

dfw-rules-8v2

The “problem”…

Let’s use the “Apply Filter” option in the Distributed Firewall to determine which rule(s) are to blame.  I specified the “Source” as the app VM, the “Destination” as the web VM, changed the action to “Allow” (this could also be handy to see what rule was blocking traffic you thought should be allowed by choosing the “Block” option), and then selected ICMP as the “Protocol”.

dfw-rules-9v2

And now we can see that Rule 1038 that allows the Security Group containing the app VM to send ICMP traffic to “any” destination has matched the filter.

dfw-rules-10

When I think of firewall rules in the “traditional” manner, I wouldn’t expect that allowing outbound ICMP from our application server to a destination of “any” would also imply that ALL VMs in my NSX environment should allow that traffic inbound.  The whole point of “zero trust” and “default deny” is that unless traffic is explicitly allowed, it should be denied.  Perhaps to someone who comes from a network/security background and has used many different firewalls, this would be seen as expected behavior in certain scenarios – but that is not intuitive to this virtualization guy.

In a nutshell, there are a couple things in play here…

  1. Scoping matters.  By selecting a destination of “Any”, NSX truly means ANY.  Even though you may not have allowed a particular traffic type inbound on some unrelated system, because we have this “Any” rule, our application server can talk to it over that protocol.  I can see this being particularly problematic in a multi-tenant environment, or maybe some kind of PCI environment where you have to prove a definitive dividing line between different systems.  One improperly scoped rule later and you have unintended consequences.
  2. Direction matters.  The column titled “Direction” is hidden by default when creating a new firewall rule, and its default value is “In/Out”, which is the root of our problem here.  If we’d configured Rule 1038’s “Direction” value as “Out”, it wouldn’t have been implied that the traffic should be allowed “In” on the web server.  In my opinion, VMware should not hide this column by default, and an administrator should have to choose a direction on the rule without a value being pre-populated.  In addition, I could find no way to manipulate the “Direction” value when using “Service Composer” – the default value is In/Out and there’s no way (at least in the GUI) to change it.
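
To convince myself of how the two points above interact, here is a toy Python model of the behavior.  It is only a sketch of my mental model, not how NSX is implemented: it assumes every rule is evaluated at every VM’s vNIC, once as the packet leaves the source (“out”) and once as it enters the destination (“in”), with default deny.  The names “Rule”, “rule_matches”, “allowed_at_vnic”, and “flow_allowed” are made up for illustration.

```python
from dataclasses import dataclass

ANY = "any"


@dataclass(frozen=True)
class Rule:
    """A simplified allow rule; field names are illustrative, not NSX API names."""
    name: str
    source: str        # a VM name or ANY
    destination: str   # a VM name or ANY
    service: str       # e.g. "icmp", "smb"
    direction: str = "in/out"  # "in", "out", or "in/out" (the hidden default)


def rule_matches(rule, src, dst, service, packet_direction):
    """True if the rule allows this packet in the given direction at a vNIC."""
    return (
        rule.source in (ANY, src)
        and rule.destination in (ANY, dst)
        and rule.service == service
        and rule.direction in ("in/out", packet_direction)
    )


def allowed_at_vnic(rules, src, dst, service, packet_direction):
    """Default deny: the packet passes only if some allow rule matches."""
    return any(rule_matches(r, src, dst, service, packet_direction) for r in rules)


def flow_allowed(rules, src, dst, service):
    """The packet must be allowed out of the source vNIC and into the destination vNIC."""
    leaves_source = allowed_at_vnic(rules, src, dst, service, "out")
    enters_destination = allowed_at_vnic(rules, src, dst, service, "in")
    return leaves_source and enters_destination


# Same rule as in the post: app VM may send ICMP to "any".
ping_any_in_out = [Rule("app-icmp-any", "app", ANY, "icmp")]
ping_any_out_only = [Rule("app-icmp-any", "app", ANY, "icmp", direction="out")]

print(flow_allowed(ping_any_in_out, "app", "web", "icmp"))    # True
print(flow_allowed(ping_any_out_only, "app", "web", "icmp"))  # False
```

With the default “In/Out” direction, the single app-tier rule also satisfies the inbound check at the web VM, which is exactly the surprise described above.  Scoping it to “Out” leaves the web VM’s inbound side to fall through to default deny.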

The “fix”…

The first way to “fix” this issue is to always assign the appropriate directional value to each firewall rule.  Through a combination of “In” and “Out” rules, your traffic should be allowed in the direction you expect without any “unintended consequences”.  The rules are still stateful, meaning that if we allow ICMP out to “Any” from the app VM (but only in the “Out” direction), the reply traffic is allowed back to the app VM without requiring a second rule stating so.

Add the “Direction” column to your view

dfw-rules-11

Then, click the “Edit” icon next to the “Direction” value

dfw-rules-12

Then select the appropriate value from the “Direction” drop down menu

dfw-rules-13

Let’s go ahead and modify these DFW rules with the appropriate “Direction” and test again.

dfw-rules-16

As you can see here, HTTP, SMB, and ICMP from the app VM to the web VM, which previously worked, are now blocked.

dfw-rules-17

Scoping matters…

The other important thing to consider is the rule scoping – in the example above, the web server allows HTTP/HTTPS traffic inbound from “Any” source.  Perhaps in this case the web server is publicly facing and there’s no real need for internal systems to access it directly.  In such a scenario, an IP Set allowing only public IP addresses to communicate with it could be used.

Here I’ve created two “IP Sets” on my NSX Manager.  One, which I’ve called “ipset_all-networks”, contains “all subnets” with a range of 1.1.1.1-254.254.254.254, and the other, called “ipset_all-private-networks”, has the three private IP spaces specified (if you only use a small part of one private IP space, you could certainly get that granular, too).

dfw-rules-14

Then, I created a Security Group called “sg_all-public-networks”, added my IP Set “ipset_all-networks” as a static member, then created an exclusion using my IP Set “ipset_all-private-networks” to block any internal IP address from matching the rule.  I could use this Security Group in place of the “Any” scoping object on my publicly facing web server, or even invert it so that no public IPs are allowed when scoping a rule.
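
For anyone who wants to sanity check which addresses would land in a group built this way, here is a small Python sketch of the same “include everything, exclude private” logic.  It assumes the “three private IP spaces” are the RFC 1918 blocks (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16), and the function name “in_all_public_networks” is just something I made up for the example.  It doesn’t talk to NSX at all.

```python
import ipaddress

# Include range, mirroring "ipset_all-networks" from the post.
ALL_START = ipaddress.IPv4Address("1.1.1.1")
ALL_END = ipaddress.IPv4Address("254.254.254.254")

# Exclusion list, mirroring "ipset_all-private-networks".
# Assumption: the three private spaces are the RFC 1918 blocks.
PRIVATE_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def in_all_public_networks(address: str) -> bool:
    """True if the address is in the include range and not in any excluded network."""
    ip = ipaddress.IPv4Address(address)
    included = ALL_START <= ip <= ALL_END
    excluded = any(ip in network for network in PRIVATE_NETWORKS)
    return included and not excluded


for candidate in ("8.8.8.8", "10.1.2.3", "172.16.99.150", "172.32.0.1"):
    print(candidate, in_all_public_networks(candidate))
```

Note that 172.32.0.1 counts as public here, since the 172.16.0.0/12 block only runs through 172.31.255.255.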

dfw-rules-15

Obviously, there are many ways to, as they say, “skin a cat” with the Distributed Firewall, but as I found out…direction and scoping matter.

Got a better or more efficient way to manage the NSX Distributed Firewall rules?  I’m all ears!  😛

Installing Nutanix NFS VAAI .vib on ESXi Lab Hosts

This post covers the installation of a Nutanix NFS VAAI .vib on some “non-Nutanix” lab hosts.

Why would one do this?  Several months ago I stood up a three node lab environment accessing “shared” storage using a Nutanix filesystem whitelist (allows defined external clients to access the Nutanix filesystem via NFS).  While the Nutanix VAAI plugin for NFS would normally be installed on the host as part of the Nutanix deployment, it obviously was not there on my vanilla ESXi 6.0 Dell R720 servers accessing the whitelist…which made things like deploying VMs from template and other tasks normally offloaded to the storage unnecessarily slow.

Since Nutanix just released “Acropolis Block Services / ABS” GA in AOS 4.7 (read more about it at the Nutanix blog), there’s probably less of a reason to use filesystem whitelists for this purpose now, but alas, maybe someone will find it useful.  (*edit* – it’s worth noting that ABS doesn’t currently support ESXi.  I haven’t tried to see if it actually works yet, but needless to say, don’t do it from a production environment and expect Nutanix to help you.  *edit 1/27/17* – as of AOS 5.0, released earlier this month, ESXi is supported using ABS.)  At the time of this blog post, Windows 2008 R2/2012 R2, Microsoft SQL and Exchange, Red Hat Enterprise Linux 6+, and Oracle RAC are supported.  NFS whitelists aren’t supported by Nutanix for the purpose of running VMs, either.

  1. The first step is to SCP the Nutanix NFS VAAI .vib from one of your existing CVMs.  Point your favorite SCP client to the CVM’s IP, enter the appropriate credentials, and browse to the following directory: /home/nutanix/data/installer/%version_of_software%/pkg  Copy the “nfs-vaai-plugin.vib” file to your workstation so that it can be uploaded to storage connected to your ESXi hosts using the vSphere Client.
  2. Once the .vib is uploaded to storage accessible by all ESXi hosts, SSH to the first host to begin installation.  You may need to enable SSH access on the host as it’s disabled by default.  This can be done by starting the SSH service in %host% > Configuration > Security Profile > Services “Properties” in the vSphere Client.
  3. Once logged in to your ESXi host, we can verify that the NFS VAAI .vib is missing by issuing the “esxcli software vib list” command.  If the .vib were present, we’d see it at the top of the list.
  4. Now we need to get the exact path to the location where you placed the .vib on your storage.  This can be done by issuing the “esxcli storage filesystem list” command.  You will be presented with a list of all storage accessible to the host, the mount point, the volume name, and UUID.  Highlight the “mount point” of the appropriate storage volume so that we can paste it into the next command.  Alternatively, you could use the “volume name” in place of the UUID in the mount point path, but this was easier for me.
  5. Next, we will install the .vib file using the “esxcli software vib install -v "/vmfs/volumes/%UUID_or_volume_name%/%subdir_name%/nfs-vaai-plugin.vib"” command.  I created a subdirectory called “VIBs” and placed the nfs-vaai-plugin.vib file in it.  Be careful as the path to the file is case sensitive.  If the install was successful, you should see a message indicating it completed successfully and a reboot is required for it to take effect.  Assuming your host is in maintenance mode and has no running VMs on it, go ahead and reboot now.
  6. Once the host has rebooted and is back online, start a new SSH session and issue the “esxcli software vib list” command again and you should see the new .vib at the top of the list.  Voila!  You can now deploy VMs from template in seconds.
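
If you have more than a couple of hosts to do, it can help to generate the install command from step 5 rather than retyping the path each time.  Here’s a minimal Python sketch that only builds the command string (it doesn’t connect to anything).  The helper name “build_vib_install_command” and the example volume name “datastore1” are hypothetical; the “VIBs” subdirectory and the file name come from the steps above.

```python
import shlex
from pathlib import PurePosixPath


def build_vib_install_command(volume: str, subdir: str,
                              vib_name: str = "nfs-vaai-plugin.vib") -> str:
    """Build the esxcli install command for a .vib stored under /vmfs/volumes.

    `volume` is either the datastore UUID or its volume name, exactly as shown
    by "esxcli storage filesystem list" (the path is case sensitive).
    """
    vib_path = PurePosixPath("/vmfs/volumes") / volume / subdir / vib_name
    # shlex.quote only adds quotes when the path needs them (e.g. spaces).
    return f"esxcli software vib install -v {shlex.quote(str(vib_path))}"


# "datastore1" is a made-up volume name for illustration.
print(build_vib_install_command("datastore1", "VIBs"))
```

Paste the printed command into the SSH session on each host, swapping in that host’s own volume name or UUID.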

Veeam + Nutanix: “Active snapshots limit reached for datastore”

Last night I ran into an interesting “quirk” using Veeam v8 to back up my virtual machines that live on a Nutanix cluster.  We’d just moved the majority of our production workload over to the new Nutanix hardware this past weekend and last night marked the first round of backups using Veeam on it.

We ended up deploying a new Veeam backup server and proxy set on the Nutanix cluster in parallel to our existing environment.  When there were multiple jobs running concurrently overnight, many of them were in a “0% completion” state, and the individual VM’s that make up the jobs had a “Resource not ready: Active snapshots limit reached for datastore” message on them.

veeam 1

I turned to the all-knowing Google and happened across a Veeam forum post that sounded very similar to the issue I was experiencing.  I decided to open up a ticket with Veeam support since the forum post in question referenced Veeam v7, and the support engineer confirmed that there was indeed a self-imposed limit of 4 active snapshots per datastore – a “protection method” of sorts to avoid filling up a datastore.  On our previous platform, the VM’s were spread across 10+ volumes and this issue was never experienced.  However, our Nutanix cluster is configured with a single storage pool and a single container with all VM’s living on it, so we hit that limit quickly with concurrent backup jobs.

The default value of 4 active snapshots per datastore can be modified by creating a registry DWORD value called MaxSnapshotsPerDatastore in ‘HKEY_LOCAL_MACHINE\SOFTWARE\Veeam\Veeam Backup and Replication\’ and setting it to the appropriate hex or decimal value.  I started off with ’20’ but will move up or down as necessary.  We have plenty of capacity at this time and I’m not worried at all about filling up the storage container.  However, caveat emptor here because it is still a possibility.
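
Since regedit shows DWORDs in hex by default, it’s easy to type “20” and actually end up with 32.  As a sketch, this bit of Python renders a .reg file for the key and value named above so the hex conversion is done for you (decimal 20 is hex 14).  The function name “veeam_snapshot_limit_reg” is my own invention, and you’d still import the resulting file on the Veeam server yourself.

```python
def veeam_snapshot_limit_reg(limit: int) -> str:
    """Render .reg file text that sets MaxSnapshotsPerDatastore to `limit`."""
    if not 0 <= limit <= 0xFFFFFFFF:
        raise ValueError("a DWORD must fit in 32 bits")
    key = r"HKEY_LOCAL_MACHINE\SOFTWARE\Veeam\Veeam Backup and Replication"
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{key}]",
        # .reg files store DWORDs as eight hex digits.
        f'"MaxSnapshotsPerDatastore"=dword:{limit:08x}',
        "",
    ]
    return "\n".join(lines)


print(veeam_snapshot_limit_reg(20))
```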

This “issue” wasn’t anything specific to Nutanix at all, but is increasingly likely with any platform that uses a scale-out file system that can store hundreds or thousands of virtual machines on a single container.

VMware NSX Lab Environment – Part 2: Prepare Hosts and Deploy NSX Controllers

Introduction

In Part 2 of this series I will cover preparing the ESXi hosts for NSX and deploying an NSX Controller cluster.  As mentioned in the first part of this series, “Part 1:  Import and Configure NSX Manager”, the NSX Manager facilitates the deployment of the Controller clusters and ESXi host preparation (among other things), so needless to say, having it up and functioning is a prerequisite for this phase.

At the completion of this post, the NSX environment should be mostly configured and we will be able to start doing fun stuff like deploying logical switches, setting up distributed routing, and playing with distributed firewall rules.  I’m pretty excited. /geekout

As you may have gathered by their name, the NSX Controllers reside in the “Control plane” while the NSX Manager resides in the “Management plane”.  Services such as logical switches and distributed logical routers / firewalls are all hypervisor kernel modules and reside in the “Data plane”.  NSX Edge, which is deployed as one or more virtual appliances, also resides in the data plane.  The focus of this post will be the Control plane.

This diagram, courtesy of the VMware NSX 6 Design Guide, depicts the various “planes” in which NSX components reside.

Controller Deployment 2

Deploying the NSX Controllers

1. Now that NSX Manager is running and linked to your vCenter server, the next step in the process would be to enter your NSX licenses.  However, since this is a lab and I am running it in evaluation mode for 60 days, I have nothing to enter here.  If you do own the product, or are lucky enough to have a license key for perpetual lab purposes, you’d just go to “Administration > Licensing > Licenses” in the vSphere Web Client and enter the appropriate license keys.

Controller Deployment 1

2. After you’re licensed (or running in eval mode), it’s time to deploy the NSX Controllers.  Under “Networking and Security”, click “Installation”.

Controller Deployment 3

3. Click the green “+” sign underneath “NSX Controller nodes”…

Controller Deployment 4

…and you will be asked to enter vSphere cluster, storage, and networking info.

Controller Deployment 5

Click “Select” next to “IP Pool”.  If you’ve already created one you’d like to use, select it and then click “OK”.  Otherwise, click the green “+” and add a new IP Pool.  Once you’ve entered the details appropriate to your environment, click “OK” on the “Add IP Pool” window, select the radio button next to your new IP Pool, and then click “OK” again on the “Select IP Pool” window.

I’ve created a 10 address IP pool for the NSX Controllers to use – technically you’d only need as many IP addresses as you’d have NSX Controllers, but I get kind of OCD about that sort of thing so I’ve allocated a block of 10.  IP’s are one of the few things I have an abundance of in the lab environment.

Controller Deployment 6

4. Click “OK” on the “Add Controller” window.

** It’s worth noting – the NSX Controller password requirements are fairly strict.  My “lab password” I’ve been re-using throughout the environment did not meet the length requirement and it complained…as you see here.  I have a feeling this is one of those passwords you don’t want to lose so be sure to keep it somewhere safe.

Controller Deployment 7

Assuming everything is good with your configuration, NSX Manager should begin deploying the Controllers to your vSphere environment using the supplied credentials linking it to vCenter (as you can see in the “Initiated by” column under “Recent Tasks”).  It’s also worth noting that the NSX Controllers get a unique identification string appended to them – avoid the urge to modify this – it’s by design.  (And yes, I did just throw a screenshot of the “fat”/C# client in here…I do catch myself flipping back to it from time to time.  It’s a habit I’m working on breaking 😛 )

Controller Deployment 8

5. Once the Controller deployment has completed, it should show a “Normal” status in the vSphere Web Client window.  If that’s the case, it’s fairly safe to say subsequent Controller deployments will be successful, so you can now repeat this process 2 more times in order to meet the “3 Controller node minimum” recommendation.  NSX Controllers should be deployed in odd numbers so that a “majority vote” can occur for electing a Master controller.

**You will not be prompted to enter a Controller password again on subsequent Controller deployments – this is shared between all NSX Controllers in the Controller cluster.

Controller Deployment 9

The Controllers are clustered?  Yes, they are.

Without going into too much detail (the NSX Design Guide does a great job explaining it), the “responsibilities” of an NSX Controller get distributed among all members of the Controller cluster.  A Master Controller is responsible for determining when a Controller node has failed and where the “slices” of a particular role it held should be transitioned to.

This image, courtesy of the VMware NSX Design Guide, shows the number of nodes that can fail for a particular NSX Controller cluster count.

2015-04-27 13_25_03-NSX 6 Design Guide.pdf - Adobe Reader
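
The reasoning behind “odd numbers” is just majority arithmetic, which is easy to work out yourself.  This Python sketch is generic majority-vote math, not something copied from the Design Guide table, and the function names are my own.

```python
def majority(node_count: int) -> int:
    """Smallest number of nodes that is more than half of the cluster."""
    return node_count // 2 + 1


def tolerable_failures(node_count: int) -> int:
    """How many nodes can fail while a majority is still up."""
    return node_count - majority(node_count)


for nodes in range(1, 6):
    print(f"{nodes} node(s): majority={majority(nodes)}, "
          f"can lose {tolerable_failures(nodes)}")
```

Going from 3 to 4 nodes doesn’t let you lose any more Controllers (still 1), which is why even counts don’t buy you anything.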

6. In this step, we will create “anti-affinity” rules for the NSX Controllers to ensure no two Controllers ever reside on the same host.  This is an important step in mitigating impact to the NSX environment if an ESXi host fails.  For a lab environment it’s probably not a big deal but I felt it was important to show, as I frequently see vSphere environments with no DRS rules set up when they should probably be used for resiliency of redundant guest VMs.  As has been commented on by others, I’m kind of surprised that the creation of the anti-affinity rule isn’t done by NSX Manager automatically when deploying two or more Controllers.  I believe other components, such as NSX Edges, do have a rule created by default…perhaps I’m mistaken and will find out shortly.

Anyhow…

In the vSphere Web Client, navigate to the “Hosts and Clusters” view, select the applicable vSphere cluster, then click the “Manage” tab.  Select “VM/Host Rules” and then click “Add”.

Controller Deployment 10

7. Give your DRS Rule a name – I usually like to get descriptive about the nature of it in the name (e.g. by adding “Anti-Affinity”).  Click the “Add” button, select your NSX Controllers, and click “OK”.  Click “OK” once more on the “Add VM/Host Rule” window.

DRS Rules 2

8. Now we will prepare the ESXi hosts for NSX.  Click on the “Host Preparation” tab and then select “Install” next to the appropriate cluster(s).  When prompted, click “Yes” to continue with the install.

Controller Deployment 12

If the host preparation was successful, you should see a green checkmark underneath the “Installation Status” and “Firewall” columns.

Controller Deployment 13

9. The next step is to configure VXLAN on our NSX enabled cluster.  On the “Host Preparation” tab, under the “VXLAN” column, select “Configure”.  You will need to select a Distributed Virtual Switch for VXLAN traffic (I’ve created one dedicated for that purpose with a single vNIC uplink…hey, it’s a lab), enter the appropriate VLAN, and set your MTU size.  It’s not recommended to go below 1600 due to the ~50 byte VXLAN header addition, so make sure your underlying physical (or in this case, virtual) network is configured for jumbo frames.
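
If you’re wondering where the ~50 bytes comes from, here’s the arithmetic as a short Python sketch.  The header sizes assume an untagged inner Ethernet frame and an IPv4 outer header; a VLAN tag or IPv6 would add more, which is part of why 1600 leaves some headroom.  The constant and function names are mine.

```python
# Bytes added to each guest packet once it is wrapped for the transport network.
INNER_ETHERNET_HEADER = 14  # the guest's own frame header rides inside the tunnel
VXLAN_HEADER = 8
OUTER_UDP_HEADER = 8
OUTER_IPV4_HEADER = 20


def vxlan_overhead() -> int:
    """Total encapsulation overhead in bytes under the assumptions above."""
    return (INNER_ETHERNET_HEADER + VXLAN_HEADER
            + OUTER_UDP_HEADER + OUTER_IPV4_HEADER)


def required_transport_mtu(guest_mtu: int = 1500) -> int:
    """Smallest transport MTU that carries a full-size guest packet unfragmented."""
    return guest_mtu + vxlan_overhead()


print(vxlan_overhead())              # 50
print(required_transport_mtu(1500))  # 1550
```

So a standard 1500 byte guest MTU needs at least 1550 on the transport network, and 1600 covers that with room to spare.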

I’m going to create an IP pool dedicated for VTEP’s so I’ve selected “New IP Pool…” from the drop down box.

Controller Deployment 14

Enter the appropriate IP Pool information here, then click “OK”.  Like the IP Pool I created for the NSX Controllers, this one has 10 IP addresses in it.  You will need a pool large enough to provide IP addresses for each VTEP (VXLAN Tunnel End Point) interface on each host in that IP space.

Controller Deployment 15

Click “OK” in the “Configure VXLAN networking” window.

At this point, we should have green checkmarks across the board on our “Host Preparation” tab.

Controller Deployment 16

** There are considerable design decisions that must be made when choosing your VMKNic Teaming Policy – the layout of your physical networking and Distributed Virtual Switch uplink configuration could dictate which options are viable.  The VMware NSX Design Guide goes over this in great detail (beginning around page 73) and is worth the read.

This table courtesy of the VMware NSX Design Guide shows the teaming and failover modes available based on the uplink type.

Controller Deployment 17

10. The next step is to create a Segment ID (which I believe is also sometimes called a VXLAN Network Identifier/VNI) Pool.  I like to think of Segment IDs like special VLANs inside of the NSX environment – they are used to differentiate the various logical network segments just like VLANs on a physical switch logically segment the traffic.  NSX lets you specify a range of 5,000 to 16,777,215 (the VNI is a 24-bit value)…so roughly 16 million possibilities.  The range you specify in your Segment ID Pool will dictate the maximum number of logical switches available to your NSX environment.

Under the “Installation” section, click the “Logical Network Preparation” tab and select the “Segment ID” section.  Click “Edit” next to “Segment IDs & Multicast Address allocation…”

Controller Deployment 18

11. Specify your Segment ID Pool range.  I chose 5000-5999, which gives me 1000 possible network segments…far greater than I’d ever need in a lab, but hey why not?
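
The pool size math is simple, but the range is inclusive on both ends, so it’s easy to be off by one.  A quick Python sketch, assuming the lowest usable ID is 5000 (as above) and the highest is the top of the 24-bit VNI space; the names here are made up for the example.

```python
SEGMENT_ID_MIN = 5000          # lowest ID NSX accepts, per the step above
SEGMENT_ID_MAX = 2**24 - 1     # assumption: top of the 24-bit VNI space


def segment_pool_size(start: int, end: int) -> int:
    """Number of logical switches an inclusive start-end pool can back."""
    if not SEGMENT_ID_MIN <= start <= end <= SEGMENT_ID_MAX:
        raise ValueError("range must fall within the usable Segment ID space")
    return end - start + 1


print(segment_pool_size(5000, 5999))  # 1000
```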

Controller Deployment 19

** I’ve not checked the option to “Enable multicast addressing” and am relying on Unicast for my BUM traffic (Broadcast, Unknown unicast, and Multicast).  Not to sound like a broken record, but there are considerable design decisions you’d make to determine whether or not to use Multicast, Unicast, or Hybrid modes.  Page 25 of the VMware NSX Design Guide goes into detail about the pros and cons of each, when to use or not use, etc.  This is not something I ever had to give much thought to in my “server centric” world prior to starting down the NSX wormhole, and I found it one of the harder concepts to grasp and remember when studying for the VCP-NV exam.  This is an area I am still shoring up because it’s directly related to the way NSX propagates information throughout the environment, so it’s obviously a critical piece.

12. The final piece is to configure our Transport Zone.  What is a Transport Zone you ask?  Well, per VMware, it quite literally “defines a collection of ESXi hosts that can communicate with each other across a physical network infrastructure.”  In other words, it determines which cluster(s) participate in the NSX environment.

Click on the “Transport Zones” section, then the green “+” sign.

Controller Deployment 20

Give the Transport Zone a name, select the appropriate replication mode, and the cluster(s) you wish to be included in the Transport Zone.  Click “OK”.

Controller Deployment 21

13. At this point, all the NSX Controllers and supporting configuration should be in place.  Review each tab under “Networking and Security > Installation > Logical Network Preparation” to ensure everything looks correct.

Controller Deployment 22

Controller Deployment 23

Controller Deployment 24

If so, we’re ready to do all the fun stuff you really wanted to deploy NSX for.  The next post in this series will handle the configuration of logical switching, distributed routing, and Edge Services.  A couple of the big things I’m looking to demonstrate are isolation of mock customers in a multi-tenant environment and securing a VDI deployment with NSX.

Since this is my first go-round installing NSX, if you see any inaccuracies or a better way of doing things, please let me know.  And as always, thanks for reading!

VMware NSX Lab Environment – Part 1: Import and Configure NSX Manager

Introduction

This blog series will cover the installation and configuration of VMware NSX 6.1.3 in a lab environment.  As such, there are certain design considerations I am overlooking because this is not a production deployment and lab resources are somewhat limited.  It goes without saying a production deployment may and should look a little different 😉

An example of this would be to distribute the vSphere and NSX infrastructure into multiple clusters – a “Management” cluster which might house vCenter, NSX Manager, and the NSX Controllers; an “Edge” cluster which might house things like the NSX Edge Services Gateway and the Distributed Logical Router (DLR) / DLR Control VM, which control the flow of L2 (NSX Bridging) or L3 traffic (NSX Logical Routing) into and out of the NSX logical networking environment; and a “Compute” cluster where the bulk of your server and/or desktop virtualization workload would live.  In a production deployment, these clusters may even span two or more racks to provide resiliency of all workloads in the event of a rack loss.

Example of “dispersed clusters” courtesy of VMware Hands on Labs (Management/Edge cluster + Compute cluster):

NSX Clusters

However, my purpose for deploying NSX in my lab is to get more hands on exposure to the product and as a “proof of concept” for the various capabilities of NSX.  While it’s important to know all the design considerations, they’re not particularly important in THIS instance.

I highly recommend familiarizing yourself with the NSX Design Guide – it’s a great piece of reference material.  I read it start to finish as part of studying for the VCP-NV exam and largely attribute its content, in combination with the VMware Hands on Labs, for passing the exam.  The NSX Design Guide can be found HERE.

The Lab Environment

I will also not be covering the installation and configuration of the various vSphere 6.0 infrastructure that all the NSX components ride upon.  There are many great blogs and white papers covering this, and let’s be honest – if you’re looking to lab out NSX, you probably already have vSphere configuration down pat.

This NSX lab environment exists on a nested ESXi cluster running vSphere 6.0.  There are three ESXi 6.0 virtual machines, each with 24 GB of RAM, 4 vCPU, and ~200 GB of “local” storage across multiple datastores (a couple of which will be used for VSAN at a later date).  A vCenter Server Appliance 6.0 was imported into the nested environment and runs on top of the virtualized ESXi hosts.  Running a nested hypervisor gives you a lot of flexibility on the hardware which it runs on (and flexibility for isolation if desired, to avoid screwing important stuff up 😛 ), so this could just as easily run on top of a couple “home lab” type boxes without issue.

William Lam (@lamw on Twitter) has a GREAT series of posts on his blog virtuallyGhetto.com detailing the requirements and configuration for running nested ESXi – I highly recommend checking it out.  In fact, the 3-node ESXi cluster I am using for this blog post was deployed from an .OVF file he’s made available to the community.  Check out his Nested ESXi Series here and VSAN .OVF Template series here.  I’ve configured this lab environment to be 100% isolated from the outside world from a network, storage, and hardware perspective…which gives me some freedom to make changes and mistakes without breaking anything I care about.

There are probably many (and better) blogs covering this same subject, but hey, I need the blogging practice anyway so maybe someone will find it useful…so thanks in advance for reading.

And without further ado, importing and configuring NSX Manager.

Import the NSX Manager

The first step in getting NSX running in your environment is to install and configure the NSX Manager.  The NSX Manager is a virtual appliance that is responsible for deploying all the other components of NSX such as the NSX Controllers, Edge Gateways, etc.

There is a 1:1 relationship between an NSX Manager and a vCenter server – one NSX Manager serves a single vCenter Server environment.

** As stated in the NSX Installation Guide, the NSX Manager virtual machine installation includes VMware Tools.  Do not attempt to upgrade or install VMware Tools on the NSX Manager.  One of the first things I noticed (and felt compelled to do) once the NSX Manager was running is to get rid of the angry yellow “VMware Tools is outdated on this virtual machine” warning on the NSX Manager VM Summary tab.  Fight the urge and leave it at the included VMware Tools level.

NSX Manager 1

1. Deploying the NSX Manager must be done through the vSphere Web Client instead of the C# client due to some “extra configuration options” that are only present in the Web Client. Right click on your vCenter Server and select “Deploy OVF Template”.

** You must have the Client Integration Plugin installed in order to deploy an OVF through the vSphere Web Client.  I’ve had issues with the plugin and Internet Explorer 11, so I recommend running this through Chrome or Firefox until/if those issues are resolved.

NSX Manager 2

2. Browse to your .OVA file.  Mine is on an .ISO mounted in the virtual DVD drive of the virtual “jump box” workstation I access my lab environment from.  Click “Next”.

NSX Manager 3

NSX Manager 4

3. On the “Review details” screen, select the “Accept extra configuration options” check box (this option wouldn’t be presented in the C# client, and then you’d have issues with your NSX Manager).

NSX Manager 5

4. Accept the EULA, blah blah blah, then click “Next”.

NSX Manager 6

5. Select a name for the NSX Manager – I got super creative with mine and left it “NSX Manager” – select a folder/location, then click “Next”.

NSX Manager 7

6. Select a cluster or host, then click “Next”.

NSX Manager 8

7. Select a disk format.  I chose “Thin Provision” since this is a lab and storage is at a premium right now.  If necessary, change your VM Storage Policy.  While this lab will have VSAN configured in it eventually, it does not now, and I’m installing on “local” storage (in quotes because it’s actually a .VMDK file on a SAN LUN) of my ESXi host, so I’ve left it at “Datastore Default”.  Click “Next” once you’ve selected the appropriate storage settings for your environment.

** Another shout out to William Lam (@lamw) and his blog www.virtuallyghetto.com for the super handy nested ESXi VSAN OVF templates from which this lab cluster was deployed.

NSX Manager 9

8. Now it’s time to select a network for your NSX Manager.  Right now I’m using the default “VM Network” that was created when I installed ESXi, but I’m essentially treating it as my management network for management/vMotion VMkernel interfaces.  I’ll have some other network interfaces for VXLAN traffic etc. which will be set up at a later time.  Once you have selected the appropriate network, click “Next”.

NSX Manager 10

9. The “Customize template” window has quite a bit for you to fill in – there are some passwords for various purposes on the NSX Manager, IP/DNS/network settings, etc.  Fill in the info as it applies to your environment, then click “Next”.

NSX Manager 11

10. On the “Ready to complete” screen, verify all your information is correct, then click “Finish”.  The NSX Manager virtual appliance will now be deployed and powered on automatically (if selected).

NSX Manager 12
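A typo in the Step 9 network settings is an easy way to end up with an appliance you can't reach, so it can be worth sanity-checking the values before clicking “Finish”.  The snippet below is only a rough sketch using Python's standard `ipaddress` module; the function name `check_appliance_network` is made up for illustration, and while 172.16.99.150 is the example address used later in this post, the netmask and gateway are placeholder assumptions.

```python
import ipaddress


def check_appliance_network(ip, netmask, gateway):
    """Return a list of problems found in the planned appliance settings."""
    problems = []
    # strict=False lets us pass a host address and derive its network.
    network = ipaddress.ip_network(f"{ip}/{netmask}", strict=False)
    address = ipaddress.ip_address(ip)
    gw = ipaddress.ip_address(gateway)

    if gw not in network:
        problems.append(f"gateway {gw} is outside {network}")
    if address in (network.network_address, network.broadcast_address):
        problems.append(f"{address} is the network or broadcast address of {network}")
    if address == gw:
        problems.append("appliance IP and gateway are identical")
    return problems


# The netmask and gateway here are hypothetical placeholders.
print(check_appliance_network("172.16.99.150", "255.255.255.0", "172.16.99.1"))
```

An empty list means none of these basic checks failed; it says nothing about DNS, VLANs, or whether the address is already in use.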

Configure the NSX Manager

1. Once the NSX Manager appliance has finished booting and is online, log into the NSX Manager appliance to resume configuration.  You will have specified this IP address in Step 9.  Example https://172.16.99.150.  The credentials used for login will also have been specified in Step 9.  Username: admin Password: [user defined].  If you did not define a password during installation, it should be “default”.

NSX Manager Home

2. The first thing to do is check that all the necessary services are running…if they’re not…you won’t get much further.  Click the “View Summary” button to be taken to the appliance summary page.  All services should show “Running”, with the possible exception of the “SSH Service”.

NSX Summary

 

NSX Summary 2

3. Now we will register NSX Manager with your vCenter Server.  Click the “Manage” tab, and then click “NSX Management Service” under the “Components” menu section.

NSX Manager 14

Click “Edit” to enter your vCenter Server details

NSX Manager 15

4. Enter the appropriate credentials to connect to your vCenter Server.  Yeah yeah yeah, I used my personal lab account instead of a service account…so what?!  Click “OK” and you should be prompted to trust the vCenter Server certificate.  Click “Yes” on this window.

NSX Manager 16

NSX Manager 17

 

**Update**

I ended up ditching using my “personal” lab account for the vCenter connection…not exactly sure why yet, but whenever I logged into vCenter it showed a “No NSX Managers available” error.  I switched to a new “service” account that also has Administrator rights in vCenter and the NSX Manager populated correctly.  It even showed my “personal” account listed as an NSX Enterprise Administrator but I could not see anything.  After logging into vCenter with the “nsxservice” service account, I went to Networking and Security > NSX Managers > Manage > Users and deleted my personal account and re-added it as an Enterprise Administrator.  Once I did that, I was able to log into vCenter with my personal account and see the NSX Manager fine.  Who knows…perhaps I did something wrong during the initial vCenter linking.  So just a heads up in case you run into a similar issue.

nsx users 1

5. If the connection to your vCenter Server was successful, you should see a green circle next to the “Status” field.

NSX Manager 18

 

6. Next, the Lookup Service will be configured.  Click the “Edit” button in the “Lookup Service” area.

NSX Lookup Service

Enter your Lookup Service details – it’s worth noting that in vSphere 6.0, the Lookup Service port is now 443.  I had assumed it was still 7444 and ran into some errors applying the configuration.  Luckily I ran across Chris Wahl’s (@ChrisWahl) blog with a quick explanation.

NSX Lookup Service 2

As when linking your vCenter server, trust the certificate.

NSX Lookup Service 3

And you should now see a “Connected” status under the lookup service.

NSX Lookup Service 4
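The port mix-up in step 6 boils down to a simple rule: 443 for vSphere 6.0, 7444 for the 5.x releases before it.  Here is a minimal sketch of that rule; the function name is hypothetical and this is just an illustration, not anything NSX itself runs.

```python
def lookup_service_port(vsphere_version):
    """Pick the Lookup Service port for a vSphere version string like '6.0'."""
    major, minor = (int(part) for part in vsphere_version.split(".")[:2])
    # 443 from vSphere 6.0 onward, 7444 on the 5.x releases before it.
    return 443 if (major, minor) >= (6, 0) else 7444


for version in ("5.5", "6.0"):
    print(f"vSphere {version}: Lookup Service port {lookup_service_port(version)}")
```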

7. The final step for configuring NSX Manager in the lab is to make a backup of the configuration.  It is recommended to do a backup while NSX Manager is in a “clean state” so that it can be rolled back to in the event of an issue during subsequent changes.

Click “Backup and Restore” in the “Settings” menu.

NSX Manager 19

Click “Change” next to “FTP Server Settings”, then enter your FTP/SFTP server details here.  I actually don’t have an FTP server in my lab yet, so I’m going to forgo a backup at this point.  I like to live on the edge anyway.  If you’d like to set up scheduled backups, there is an option to do so on this page as well…probably recommended for a production deployment.

NSX Manager 20

8. At this point, configuration of NSX Manager is complete and it has been linked to your vCenter Server.  You should now be able to log into the vSphere Web Client and continue with deployment of the remaining NSX infrastructure pieces.

If you’re currently logged into the Web Client, log out and then log back in and you should see a new “Networking and Security” panel.  Click on “Networking and Security”.

NSX Manager 21

It is from here that most of the NSX configuration will be managed and further NSX services and components deployed.  The next post in this series will cover deploying and configuring the NSX Controllers.  Stay tuned and thanks for reading!

NSX Manager 22

 

Part 2:  Prepare Hosts and Deploy NSX Controllers is up!

A Tale of Two Vendors…

The topic of this blog post is a comparison of two vendors and how I feel they embrace their user (customer) communities in very different ways.

To give you some background, I spent the previous 3 or so years as a consultant focusing on desktop and application virtualization with Citrix and Microsoft products – primarily Citrix XenApp and XenDesktop on the EUC side and Microsoft Hyper-V on the hypervisor side.    However, I “cut my teeth” in virtualization back in the VMware GSX / ESX 2.5 days and was primarily interested in datacenter technologies…I guess you could say VMware was my “first love”.  Fast forward to the present, I’ve taken a new position and find myself getting reacquainted with some of VMware’s newer software that I’d neglected learning much about.  As I was trying to brush up on my skills by setting up labs, reading white papers, researching user group meetings, etc. I noticed something – VMware has all sorts of great “customer enablement” offerings that Citrix doesn’t really have a parallel solution for.

Now let me preface this by saying my intent is absolutely not to vendor bash – I like to think I’m fairly vendor agnostic.  I like Citrix, I like their products, and I had a great time implementing them (well, most of the time, at least)….but I believe they’re really missing out on the user / customer enablement side of things in several areas.

  1. User groups – I don’t think there’s much to say here that hasn’t already been said…but VMware seems to have a much more mature, organic, and community oriented user group.  As a partner, I didn’t really “get” to participate in the Citrix user group events much so perhaps my perspective is skewed…but maybe not based on comments by others.  This leads me to my second point.
  2. VMUG Advantage (http://www.vmug.com/Advantage) – $200 buys you a one year membership to VMUG Advantage, which among other things gets you 20% off certification exams (which could pay for a big chunk of the membership after a test or three), $100 discount on VMworld admission, VMware Fusion and Workstation (which I believe you get as a VCP anyway), and my personal favorite:
  3. Subscription based “evaluation” software – EVALExperience (http://www.vmug.com/p/cm/ld/fid=8792) gets you access to a ton of VMware software for non-production use.  Continuous improvement and learning is critical to staying at the top of your game and the easiest way for many to do that is with a lab…yes, you could perpetually reinstall and reconfigure every 60 or 90 days using trial software, but that gets old.  And yes, if you work for a partner there are benefits associated with that.  But for the rest of us, EVALExperience might be just what the doctor ordered.
  4. Free training and hosted labs – VMware Hands on Labs (http://labs.hol.vmware.com/HOL/catalogs/)…wow, how did I go so long without using this?  Standing up your own “home lab” is rewarding in and of itself, but sometimes it can be difficult to recreate certain “enterprise class” solutions at home.  VMware HOL has an expansive catalog of labs from “software defined datacenter” to “mobility” to “end user computing” and everything in between.  I recently took advantage of their “Introduction to NSX” lab and it was a great experience…considering you can’t download a trial of NSX for your home lab at this time, it was an easy way for me to get hands-on with NSX in pursuit of the VCP-NV certification.  Allowing VMware “users” access to this sort of training is invaluable and helps grow and enhance the skills of the people who will be the “champion” of their products in the future.

I’m a firm believer in “competition breeds innovation” and hope that Citrix sees some of the positive ways VMware has embraced their user community and comes up with something great of their own.

EMC VNX Pool LUNs + VMware vSphere + VAAI = Storage DEATH

**Cliffs notes – a bug in the VNX OE causes massive storage latency when using vSphere with VAAI enabled – disabling VAAI fixes issue**

Hello, and welcome to my very first blog post! I’ve owned this domain and WordPress subscription for nearly a year and a half and am finally getting around to posting something on it. Considering I’ve spent the last 3 years focused on end user computing, and the majority of that being done with Citrix products, I always figured my first post would be in that domain…but alas, that was not the case.

The problem…

I recently started a new gig and one of the first orders of business was untangling some storage and performance issues in a vSphere 5.5 environment running on top of a Gen 1 EMC VNX 5300.  It was reported that there was very high storage latency, often resulting in LUNs being disconnected from the hosts, during certain operations like a Storage vMotion or deploying a new VM from template.

After a general review of the environment I was able to rule out a glaringly obvious mis-configuration, so I turned to a couple useful performance monitoring tools – Esxtop and Unisphere Analyzer.  While I am by no means an expert with either tool, with a little bit of Google-fu and the assistance of a couple great blogs (which I’ll link to later in this post), I was able to get the info I needed to verify my theory – a bug involving VAAI that was supposed to be addressed in the latest VNX Operating Environment (which at the time of this posting is 5.32.000.5.215) still exists.

I started out by doing some performance baselining with VisualEsxtop (https://labs.vmware.com/flings/visualesxtop) so I could get a picture of what the hosts were seeing during operations that involved VAAI (Storage vMotion, deploy-from-template/clone, etc.).  As you can see in the below screenshot, the VNX is quite pissed off.  The “DAVG” value represents disk latency (in milliseconds) that is likely storage processor or array related.  The “KAVG” value represents disk latency (in milliseconds) associated with the VMkernel.  Obviously, the latency on either side of the equation is nowhere near a reasonable number.  Duncan Epping has a great overview of Esxtop (http://www.yellow-bricks.com/esxtop), and I highly recommend you give it a read if you’re newer to the tool like I am.

1
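To make the DAVG/KAVG split a bit more concrete, here is a rough sketch of how a single Esxtop sample could be triaged.  The thresholds (roughly 25 ms for DAVG and 2 ms for KAVG) are commonly quoted rules of thumb and an assumption on my part, not values taken from this environment; the device names and sample numbers are made up as well.  GAVG, the latency the guest sees, is the sum of the two.

```python
from dataclasses import dataclass

# Assumed rule-of-thumb thresholds in milliseconds; tune for your environment.
DAVG_WARN_MS = 25.0
KAVG_WARN_MS = 2.0


@dataclass
class LatencySample:
    device: str
    davg_ms: float  # latency attributed to the array / storage path
    kavg_ms: float  # latency attributed to the VMkernel

    @property
    def gavg_ms(self):
        # Guest-observed latency is device latency plus kernel latency.
        return self.davg_ms + self.kavg_ms


def diagnose(sample):
    """Return a list of human-readable findings for one sample."""
    findings = []
    if sample.davg_ms > DAVG_WARN_MS:
        findings.append(f"{sample.device}: DAVG {sample.davg_ms} ms points at the array or fabric")
    if sample.kavg_ms > KAVG_WARN_MS:
        findings.append(f"{sample.device}: KAVG {sample.kavg_ms} ms points at queuing in the VMkernel")
    return findings


# Hypothetical samples, not the values from the screenshot.
healthy = LatencySample("naa.example-healthy", davg_ms=3.0, kavg_ms=0.1)
choking = LatencySample("naa.example-choking", davg_ms=250.0, kavg_ms=40.0)
for sample in (healthy, choking):
    print(sample.device, "GAVG:", sample.gavg_ms, diagnose(sample) or "looks fine")
```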

The next step was to use EMC’s Unisphere Analyzer to get a picture of what was occurring on the storage side during these operations.  If you’re not familiar with Unisphere Analyzer, an EMC employee created a brief video on how to capture and review data with it (http://www.youtube.com/watch?v=yCMZ_N7-p7A) – it’s a relatively simple tool that you can garner a lot of valuable information from.  I used it to capture storage side performance metrics during the two following tests.

2

The first test consisted of a Storage vMotion of a VM with VAAI enabled on the host (1 Gb iSCSI to the VNX).  This test moved the VM from LUN_0 to LUN_6, starting at 9:46:44 AM and finishing at 10:01:53 AM.  If you look at the corresponding time period on the Unisphere Analyzer graph you’ll see that response time is through the roof.  While it did not occur during this test, the hosts would often lose their connection to the LUNs during these periods of high latency…not good, obviously.

These warnings always show up in the vSphere Client when this issue occurs (yeah yeah, I’m not using the Web Client for this):

3

The second test consisted of a Storage vMotion of the same VM with VAAI disabled.  This test moved the VM back to LUN_0 from LUN_6, starting at 10:04:13 AM and finishing at 10:14:03 AM.  This time, the Unisphere Analyzer data looks MUCH better.

2
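For a quick comparison of the two runs, the start and finish times quoted above can be turned into durations.  This is just the arithmetic from the two tests; the helper name `duration_seconds` is made up for the sketch.

```python
from datetime import datetime


def duration_seconds(start, end, fmt="%I:%M:%S %p"):
    """Seconds elapsed between two same-day clock times like '9:46:44 AM'."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


vaai_on = duration_seconds("9:46:44 AM", "10:01:53 AM")    # LUN_0 -> LUN_6, VAAI enabled
vaai_off = duration_seconds("10:04:13 AM", "10:14:03 AM")  # LUN_6 -> LUN_0, VAAI disabled

for label, seconds in (("VAAI enabled", vaai_on), ("VAAI disabled", vaai_off)):
    minutes, remainder = divmod(seconds, 60)
    print(f"{label}: {minutes} min {remainder} s")
print(f"Enabled run took {vaai_on / vaai_off:.2f}x as long")
```

That works out to 15 min 9 s with VAAI enabled versus 9 min 50 s with it disabled, so the "offloaded" copy took roughly one and a half times as long on top of the latency spike.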

Here is an example of what Esxtop looked like during the test with VAAI disabled:

4

The LUNs with ~ 1400 read/write IO are obviously the ones involved in the Storage vMotion…notice the lack of “SAN choking”.  I re-ran this test multiple times using other LUNs with identical results…it was obvious at this point that there is still an issue with VAAI being used on this VNX OE.  Fortunately, our production datacenter utilizes 10 GbE for the iSCSI network and Storage vMotions finish in just a minute or two.  I could see this flaw being particularly problematic in larger environments where Storage vMotion is frequent or something like VDI where VMs are frequently spun up, torn down, or updated.

The solution…

Obviously, disabling VAAI in vSphere is a guaranteed “work around”.  I wouldn’t necessarily call this a “fix” since it leaves the VAAI feature unusable, but it will stop the high latency and disconnects when vSphere tries to offload certain storage tasks to the array.  Once I had some hard evidence in hand, I opened a ticket with EMC, and the support engineer was able to confirm this was indeed still a bug that had not been addressed by the latest OE version.

This VMware KB article details the process of disabling VAAI (http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1033665).  I found that just the “DataMover.HardwareAcceleratedMove” parameter in the article had to be disabled.  The EMC support engineer also mentioned they had some success increasing the “MaxHWTransferSize” parameter while leaving VAAI enabled, but that it hadn’t worked for everyone.
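As a rough sketch, the command-line route for flipping just that one parameter looks something like the following.  The small Python helper (a hypothetical name, purely for illustration) only builds the `esxcli` command line to run in the ESXi shell on each host; the option path and flags are the `esxcli system settings advanced` syntax as I understand it, so double-check them against the KB article before running anything.

```python
def vaai_move_command(enabled):
    """Build the esxcli command that toggles DataMover.HardwareAcceleratedMove."""
    return [
        "esxcli", "system", "settings", "advanced", "set",
        "--option", "/DataMover/HardwareAcceleratedMove",
        "--int-value", "1" if enabled else "0",
    ]


# Command to verify the current value afterwards.
CHECK_COMMAND = [
    "esxcli", "system", "settings", "advanced", "list",
    "--option", "/DataMover/HardwareAcceleratedMove",
]

print(" ".join(vaai_move_command(enabled=False)))
print(" ".join(CHECK_COMMAND))
```

Setting the value back to 1 re-enables the offload if a fixed OE version ever shows up.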

You can see more information in this KB article (https://support.emc.com/kb/191685 – you may need an active support account; I had to log in to view this page).  I decided to just disable VAAI and call it a day until a valid bug fix was released in some future OE version.  ***update 11/06/15*** It has come to my attention that the preceding EMC KB191685 can no longer be accessed at the supplied link.  I searched through the support portal and could not find a replacement, so I don’t know if they pulled the KB documenting this issue entirely or if it’s been merged into another.  I did, however, find a support bulletin from June 2015 saying that the VAAI improvements had been added into the .217 firmware.  At one point I did request the .217 firmware, only to find out they’d pulled it due to some issue it was causing.  I can only assume the VAAI improvements would’ve been added into some subsequent firmware version, but my VNXs are no longer in production or under support, so I won’t be able to personally test.

Hopefully this information will be beneficial to someone out there…luckily I found my way through the rabbit hole, but there wasn’t a whole lot publicly available regarding this issue when I was initially seeking a cause.