I recently gave a short presentation at the local VMUG Usercon on “my journey to micro-segmentation” and thought I’d adapt part of my slide deck to make a blog post on how I go about planning and implementing firewall rules for applications or tenants using the VMware NSX Distributed Firewall.
The more I work with the Distributed Firewall the more I realize there really isn’t one “right way” to secure your VMs with it…it’s less about how the sausage is made and more about the end result. However, there are a few steps I see as a requirement for laying down firewall rules over an existing production environment, so that rules are scoped accurately and don’t block necessary traffic or disrupt operations.
Step 1 – Gather Information
It is important to gather as much information about the traffic needs of the application or tenant upfront. There are plenty of well-known protocols/ports that we can rattle off from memory for various things…a web server will probably need HTTP/S over ports 80/443, a file server will probably be talking SMB over port 445, SQL Server on port 1433, LDAP over 389 or 636, etc. These are the easy ones.
However, there are inevitably going to be in-house developed or industry-specific apps, within an application tier or a tenant, that use custom ports or protocols and are harder to identify. The vendor may have good documentation for all ports and directions of communication, but I think this is more the exception than the rule.
If you go to your application owner or support team and ask “what does each tier of your application talk to and over what protocols and ports” you will probably get a reaction pretty close to this:
If you do get something back from them, there’s a good chance it’s not right…or at least not the full picture. This is not a knock on app teams; it’s just the reality that servers and applications are far chattier than many people realize or give thought to. Being able to block traffic at the virtual NIC level is extremely powerful, and when you have to take into consideration east-west traffic flows that were typically never firewalled in a “physical world”, there is a lot more work that needs to be done.
Step 2 – Trust…But Verify
To have any legitimate chance for success at laying down DFW rules on an existing application or tenant without breaking things you need a traffic monitoring tool. That statement should probably be in CAPS, italicized, and the color red. You will inevitably block traffic that you were unaware of if you do not monitor the actual traffic flows to and from the VM. It will happen. Even if you do somehow know all ports and protocols that are in use by a particular system, it’s not uncommon to find misconfigurations in the guest OS such as pointing to incorrect DNS servers that could be problematic when firewalling a VM.
There are lots of solutions out there that can get you this data. You could pull it from syslog, maybe your network team already has some sort of NetFlow aggregator, or some in-house developed solution. Whatever you choose, the important part is that you are able to pull relevant, accurate, and easy to consume information from it that you can act on. Manually parsing 20,000 lines in a .CSV is none of those things.
For this, we chose vRealize Network Insight (vRNI). Not only does it log all flows from the vSphere Distributed Switch, there are tons of “NSX’y” things it does above and beyond traffic flows, such as environmental health checks, alerting, and visualization of the network in both the logical and physical realms. I wish VMware would consider bundling in some of the vRNI features at certain tiers of NSX licensing – the easier you make it to consume the solution, the more people will buy it. VMware did pay a pretty penny for the acquisition of Arkin, so I also understand the need to monetize it. Regardless, we found it fairly reasonably priced, and when you take into consideration the potential financial loss from causing an application outage, it was a no-brainer.
The following screenshot is an example of the information you can extract from vRNI. In my opinion, the most powerful piece of vRNI is the “search” feature. It’s extremely intuitive to query and that’s how I generate all of my flow data used in firewall rule development.
I will typically use a query like “show flows where src ip = 192.168.1.0/24” or “show flows where vm name like [vm name]”. What I’m after is both sides of the communication flow…I want to know all traffic inbound to the app / VM / tenant as well as all communication outbound from it, and I adjust my queries accordingly.
Once I’ve gotten the data I want from vRNI I will export the reports to a .CSV file and further massage the data…possibly removing traffic I know will not be required, removing rows based on the IP or subnet, etc. This is one of those areas that has to be left open to interpretation, as there are a ton of different ways to scope and filter the data to be applicable to your specific environment. If you’ve done your job well, at this point you should have identified probably 95% of the required traffic and can begin creating rules in the Distributed Firewall.
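As a rough illustration of this clean-up step, here is a minimal Python sketch that collapses an exported flow list into unique source/destination/port/protocol combinations and drops destinations you have decided not to allow. The column names (src_ip, dst_ip, port, protocol), the sample rows, and the excluded network are all assumptions made up for the example; a real export will have its own header row, so adjust the names to match.

```python
import csv
import io
import ipaddress
from collections import Counter

# Hypothetical export: column names and rows are invented for illustration.
SAMPLE_EXPORT = """src_ip,dst_ip,port,protocol
192.168.1.10,10.0.0.5,443,TCP
192.168.1.10,10.0.0.5,443,TCP
192.168.1.11,10.0.0.53,53,UDP
192.168.1.12,203.0.113.7,80,TCP
"""

# Destination networks to drop from the rule candidates (example value only).
EXCLUDED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]


def summarize_flows(csv_text, excluded=EXCLUDED_NETWORKS):
    """Collapse raw flow rows into unique (src, dst, port, protocol) tuples with hit counts."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        dst = ipaddress.ip_address(row["dst_ip"])
        if any(dst in net for net in excluded):
            continue  # skip flows to networks we have decided not to allow
        key = (row["src_ip"], row["dst_ip"], int(row["port"]), row["protocol"])
        counts[key] += 1
    return counts
```

Calling summarize_flows(SAMPLE_EXPORT).most_common() lists the busiest flows first, which is a reasonable order to start drafting rules in.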
Step 3 – Proceed With Caution
With great power comes great responsibility – as mentioned previously, the ability to apply firewall rules at the vNIC level, before traffic ever hits physical media, let alone routes through a physical firewall, is extremely powerful. It introduces all sorts of new ways you can really screw up your day if you don’t take a few precautionary steps.
I’m going to give a brief overview of my approach to micro-segmentation with the DFW so that the rest will make sense. When I deploy NSX into an environment, I do so from a position of extreme caution. I want to prevent an administrator’s mistake, a firewall rule that is too broadly scoped, or a bug/issue with the solution itself from creating an outage. Again, there are lots of ways to go about this – I’m not saying my way is THE way, but hear me out.
- The first thing I do is ensure that the “default rule” in the NSX DFW is set to allow. In earlier versions of NSX, you were given the choice during deployment of NSX Manager whether or not you wanted your default rule to “allow” or “deny” traffic.
If you chose “deny” you’d probably end up “islanding” your environment, as there are no DFW rules yet so of course traffic will be blocked. However, the default rule is set to “allow” by default during installation in current versions of NSX, and I don’t think you are even given the option to choose otherwise. Anyone (myself included here) who played around with NSX in their lab has at one point or another probably blocked all their traffic by mistake and had to issue a REST API call to remove it.
Instead of having a “global” default rule set to deny, I’ve been doing it on a per-app or per-tenant basis – at least during the rollout phase where the majority of the systems in the environment are not being firewalled yet. Having the default rule set to “allow” ended up being highly prescient, which I’ll touch on in the next section.
- The next thing I do is place any VM not actively being firewalled in the “exclusion list” on the NSX Manager. This prevents any of that VM’s traffic from being processed for filtering by the DFW.
By doing this I’m hoping to avoid an issue where I have VMs that don’t have DFW rules created for them yet, and somehow the default rule gets flipped to “block” or a rule is too broadly scoped and ends up causing me problems.
There is actually a bug in one of the recent versions of NSX where a condition occurs that causes all VMs to be removed from the exclusion list by mistake, suddenly opening them up to DFW filtering. If your default rule is set to “block” and you don’t have rules in place allowing the necessary traffic, you now have an outage on your hands. Thanks VMware. This did actually happen to me, and I suddenly felt very glad I had left my “global” default rule as “allow”, because that is what avoided an outage.
- I use the “applied to” field in the Distributed Firewall to limit the scope of systems considered for processing of that particular firewall rule. The default setting is to apply a newly created firewall rule to the “distributed firewall”, therefore any VM not on the exclusion list is checked against it for processing. In a large environment, that’s going to be hundreds or even thousands of rules being checked that have nothing at all to do with that system. There have been times I was troubleshooting an issue by showing what rules were applied to the vNIC, and if literally every rule in the environment had shown up in the list, it would have greatly complicated troubleshooting.
If the firewall rule set applies to an entire tenant, I’ll create a Security Group that contains that tenant’s VMs and have all the rules in that section have their “applied to” field configured for that Security Group. If a firewall rule set applies to a single VM or tier of VMs, I may select the individual VM or possibly a Security Group in the “applied to” field.
Having “applied to” configured limits the scope and failure domain that a misconfigured rule may impact…instead of it applying to the entire environment, maybe you just block traffic on a handful of VMs and the damage is much smaller in scope.
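To make the scoping idea concrete, here is a toy Python model of how “applied to” narrows the set of rules a given VM has to be checked against. This is only a sketch of the concept, not how NSX implements it; the VM names, Security Group names, and rule IDs are all hypothetical.

```python
from dataclasses import dataclass

DFW_WIDE = "distributed-firewall"  # stand-in for the default "applied to" scope


@dataclass(frozen=True)
class Rule:
    rule_id: int
    name: str
    applied_to: str  # DFW_WIDE or the name of a Security Group


# Hypothetical Security Group membership and rules, for illustration only.
SECURITY_GROUPS = {
    "SG-TenantA": {"tenant-a-web01", "tenant-a-db01"},
    "SG-TenantB": {"tenant-b-app01"},
}

RULES = [
    Rule(1101, "TenantA web to db", "SG-TenantA"),
    Rule(1201, "TenantB app outbound", "SG-TenantB"),
    Rule(1001, "Unscoped rule", DFW_WIDE),
]


def rules_on_vnic(vm_name, rules=RULES, groups=SECURITY_GROUPS):
    """Return the IDs of the rules that would be evaluated for the given VM."""
    return [
        r.rule_id
        for r in rules
        if r.applied_to == DFW_WIDE or vm_name in groups.get(r.applied_to, set())
    ]
```

In this model a Tenant A VM only sees its own tenant’s rule plus the one unscoped rule, and a mistake in the Tenant B rule cannot touch it. The unscoped rule is the one that lands on every VM.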
Step 4 – You’ve Got the Data, Now Do Something With It
By now you’ve probably (or not) talked with your application owners about how their application communicates with the environment, you’ve generated flow data from vRealize Network Insight, and massaged the .CSV output to further refine the data.
It’s now time to take that output and create your initial firewall rule set in the DFW. The below screenshot depicts a sample firewall rule set for a tenant. There are multiple applications within this tenant, and its VMs have been placed into Security Groups by app or function. The flow data from vRNI was used to allow the appropriate traffic inbound or outbound. A Security Group containing all the tenant’s VMs was used as the scoping object in the “Applied To” field, for the reasons mentioned previously.
At the end of this rule set you will see the two “default” rules for this tenant. The “outbound” default rule has a source of [tenant security group], a destination of “any”, and a service of “any”…with the source and destination reversed for the “inbound” default rule. The “Action” is currently set to “Allow” during the analysis phase, so that logging can be enabled on the “default rules” to see if any traffic that didn’t match one of the previous rules in the rule set registers as a “hit”. These “hits” are obviously going to result in blocked traffic once the default rules get set to “block”.
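The fall-through behaviour can be sketched in a few lines of Python. This is a simplified model that assumes top-down, first-match evaluation and matches on IP subnets rather than Security Groups. Rule 1125 is the outbound default rule ID used in the log example in this post; rule IDs 1120 and 1126, the tenant subnet, and the destination addresses are hypothetical.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional

ANY = ipaddress.ip_network("0.0.0.0/0")
TENANT = ipaddress.ip_network("192.168.1.0/24")  # example tenant subnet


@dataclass(frozen=True)
class Rule:
    rule_id: int
    src: ipaddress.IPv4Network
    dst: ipaddress.IPv4Network
    port: Optional[int]  # None means "any" service
    action: str
    log: bool = False


RULES = [
    Rule(1120, TENANT, ipaddress.ip_network("10.0.0.5/32"), 443, "allow"),  # hypothetical app rule
    Rule(1125, TENANT, ANY, None, "allow", log=True),  # outbound default rule
    Rule(1126, ANY, TENANT, None, "allow", log=True),  # inbound default rule (hypothetical ID)
]


def first_match(src_ip, dst_ip, port, rules=RULES):
    """Return the first rule matching the flow, or None if nothing in this section matches."""
    src = ipaddress.ip_address(src_ip)
    dst = ipaddress.ip_address(dst_ip)
    for rule in rules:
        if src in rule.src and dst in rule.dst and rule.port in (None, port):
            return rule
    return None
```

A flow from 192.168.1.10 to 10.0.0.5 on port 443 stops at rule 1120, while the same VM talking to 10.0.0.9 on port 8080 falls through to rule 1125 and would be logged. Those logged fall-throughs are the flows to review before the action is switched to “block”.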
With logging enabled, we can go to the host(s) that contain the app or tenant’s VMs and parse the “dfwpktlogs.log” log file to see if any traffic that wasn’t accounted for is hitting the default rule. This is kind of the “last chance” to rectify any missed traffic – there may be traffic logged here that you are happy to see blocked and do not need to be concerned about…outbound web traffic to Microsoft on a Windows server for example. It’s the “other” we are concerned about now.
To parse the “dfwpktlogs.log” file, open an SSH session to your host(s) and enter the following commands:
- cd /var/log
- cat dfwpktlogs.log | grep 1125 | grep 192.168.1 | grep 2017-05-16
The above command parses the dfwpktlogs.log file, filtering by rule ID 1125 (the outbound “default rule”), filtering by IP subnet, and filtering by date (to avoid returning flows from days where logging was enabled previously).
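One caveat with plain substring matching is that “1125” can also appear in a source port, and “192.168.1” also matches 192.168.10.x. If you copy the log off the host, a small Python sketch like the one below can match on the actual fields and summarize the hits instead. The line layout assumed by the regular expression (timestamp, “match”, action, domain/rule ID, direction, size, protocol, source->destination) and the sample lines are assumptions for illustration; check them against the lines in your own dfwpktlogs.log before relying on it.

```python
import ipaddress
import re
from collections import Counter

# Assumed line layout; verify against your own log before relying on it.
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s.*?\bmatch\s+(?P<action>\w+)\s+\S+/(?P<rule_id>\d+)\s+"
    r"(?P<direction>IN|OUT)\s+\d+\s+(?P<proto>\w+)\s+(?P<src>\S+?)->(?P<dst>\S+)"
)

# Invented sample lines, not real log output.
SAMPLE_LINES = [
    "2017-05-16T14:02:11.123Z 28341 INET match PASS domain-c7/1125 OUT 52 TCP 192.168.1.10/51234->10.0.0.9/8080 S",
    "2017-05-16T14:02:12.456Z 28341 INET match PASS domain-c7/1125 OUT 52 TCP 192.168.1.10/51240->10.0.0.9/8080 S",
    "2017-05-16T14:03:00.000Z 28341 INET match PASS domain-c7/1120 OUT 52 TCP 192.168.1.10/51125->10.0.0.5/443 S",
    "2017-05-15T09:00:00.000Z 28341 INET match PASS domain-c7/1125 OUT 76 UDP 192.168.1.11/123->10.0.0.1/123",
    "2017-05-16T15:00:00.000Z 28341 INET match PASS domain-c7/1125 OUT 76 UDP 192.168.10.20/5353->10.0.0.53/53",
]


def default_rule_hits(lines, rule_id, day, src_subnet):
    """Count unique (protocol, source IP, destination) flows that hit rule_id on the given day."""
    network = ipaddress.ip_network(src_subnet)
    hits = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # not a line in the assumed layout
        if m["rule_id"] != str(rule_id) or not m["ts"].startswith(day):
            continue
        src_ip = m["src"].split("/")[0]  # drop the ephemeral source port
        if ipaddress.ip_address(src_ip) not in network:
            continue
        hits[(m["proto"], src_ip, m["dst"])] += 1
    return hits
```

With the sample lines, default_rule_hits(SAMPLE_LINES, 1125, "2017-05-16", "192.168.1.0/24") keeps only the two hits from 192.168.1.10 to 10.0.0.9 on port 8080. The line with source port 51125, the line from the previous day, and the line from 192.168.10.20 are all excluded, even though the substring filters would have let the first and last of those through.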
Enabling logging on a “default rule” can generate a large amount of data, so it’s recommended to only leave logging enabled temporarily – a time period measured in hours or maybe a day. Enable it during times that would represent “normal” business function or during a time that some core process runs for that app/tenant to give yourself the most valid logging data.
If you see traffic that you believe should be allowed but is instead hitting the default rule, either modify an existing rule to include the traffic or create a new rule within the rule set to allow it. Once the logs are clean, or you’re only seeing traffic you expect to be blocked (e.g. outbound internet traffic to Microsoft from a Windows server), then you’re ready to flip the “default rules” to “block”.
Step 5 – Ongoing Operations
So you’ve planned out all your micro-segmentation rules, you’ve created the initial rule set, you’ve monitored the dfwpktlogs.log files to make sure you didn’t miss anything and adjusted the DFW rules where necessary, and you’ve switched your “default rule” to block, and everything went well…….now what?
First thing – pat yourself on the back. While not overly difficult, properly planning the micro-segmentation of an application or tenant can be quite time consuming, because you have to account for all necessary traffic to avoid issues.
OK, now that’s out of the way…you’ll inevitably get a panicked email from an application owner saying “we performed an upgrade and now the application won’t start”….the upgrade changed or added some ports used and they weren’t in the original rule set created for the app, now they’re blocked.
While you can certainly generate some new flow data from vRNI, I’ve found the quickest and easiest place to check for a blocked flow is the “Flow Monitoring” section in the NSX management GUI. The time window to show flow data for is completely configurable – if you’re like me, you rarely find out about an issue shortly after it happens…most likely you’ll be going back several days to find that needle in the haystack. By using an appropriate time window, selecting from the “Blocked Flows” tab, and using the “filter” mechanism, you should be able to find the issue with little effort.
Hopefully you found this post helpful. As mentioned several times already, the NSX Distributed Firewall is extremely powerful and it gives you great flexibility in how you improve the security posture of your environment. This methodology is not necessarily THE way, but it’s my way and has worked out pretty well for me so far. As always, I’m open to hearing about new and better ways if you have a different way to do it.