Over the past couple of weeks we had issues with a handful of legacy Windows Server 2003 boxes that would randomly “lose connection to the domain”: they could not access other resources on the domain and could not authenticate interactive logons with domain accounts. Credit goes to my coworkers, who uncovered the source of the issue, but I wanted to get this out there in case anyone else runs into it and bangs their head on the wall trying to figure out WTF happened.
A few days after upgrading the Dell/Quest/whoever KACE appliance to version 7.0, the first round of servers had domain authentication issues and logged the following errors in the Event Log:
It appears that the “storage” referenced by these two errors is actually memory, not available space on the system drive.
When attempting to log on to the server with a domain account (local accounts still worked fine):
The culprit: “konea.exe” (the KACE agent) with ~12K open handles.
Killing the konea.exe process restored domain functionality almost immediately (so would a reboot, as a side effect of the service restarting), but unless the agent’s Windows service was disabled, it was only a matter of time before the issue returned. We ended up disabling the service temporarily while the issue was escalated through KACE support.
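To catch a leak like this before logons start failing, one option is to check per-process handle counts on a schedule. The sketch below is only an illustration, not something from KACE support. It assumes you have exported process data as CSV with PowerShell, for example `Get-Process | Select-Object Name,Id,Handles | ConvertTo-Csv -NoTypeInformation` (PowerShell is a separate install on Server 2003). The function name, the 10,000 threshold, and the sample rows are all made up for the example.

```python
import csv
import io

# Hypothetical alert threshold. The only data point from the incident is
# ~12K handles on konea.exe at failure time, so 10,000 is an arbitrary cutoff.
HANDLE_THRESHOLD = 10_000


def find_handle_hogs(csv_text, threshold=HANDLE_THRESHOLD):
    """Return (name, pid, handles) tuples for processes above the threshold.

    Expects CSV text with Name, Id and Handles columns, such as the output
    of the PowerShell command mentioned above.
    """
    hogs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            entry = (row["Name"], int(row["Id"]), int(row["Handles"]))
        except (KeyError, TypeError, ValueError):
            continue  # skip rows without usable values
        if entry[2] > threshold:
            hogs.append(entry)
    # Worst offenders first
    return sorted(hogs, key=lambda entry: entry[2], reverse=True)


if __name__ == "__main__":
    # Made-up sample rows, for illustration only
    sample = (
        '"Name","Id","Handles"\n'
        '"konea","1234","12034"\n'
        '"lsass","600","1450"\n'
    )
    for name, pid, handles in find_handle_hogs(sample):
        print(f"{name} (PID {pid}): {handles} open handles")
```

Parsing exported CSV rather than querying Windows directly keeps the script free of third-party modules. Note that `Get-Process` reports the process name without the `.exe` suffix.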
Initially we thought the issue was isolated to 2003 servers, but it occurred on a handful of 2012 servers about a week after the incident on the 2003 servers. I’m assuming the delay is because 2003 cannot support as many open handles as more modern Windows operating systems; as uptime increased on the newer servers, they too would eventually fall victim to it.
While the KACE server was upgraded to 7.0, the agents deployed on most of the servers were still at 6.4. KACE support stated this should not normally be an issue, but recommended upgrading the agents to the same level as the server (7.0) in order to “resolve” the issue. As of this writing, we are still waiting to hear what actually “broke” in the 6.4 agent > 7.0 server interaction.
KACE support confirmed that an increasing number of other customers were also reporting this issue. 7.0 is still fairly new and I imagine many customers have not moved to it yet, but if they do and the agents are not upgraded at the same time, people will likely run into this issue. If you are running KACE 7.0 and still have older agents deployed, be aware that you could start having servers fall off the domain.
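If you want a quick list of machines that are exposed, you can compare each agent’s version against the server’s. The sketch below assumes you can get a hostname-to-agent-version mapping out of your inventory. How you export it is up to you, and the function names and hostnames here are hypothetical. It compares only major.minor, so a 6.4 agent against a 7.0 server is flagged.

```python
def parse_version(text, depth=2):
    """Turn a dotted version string into a tuple of `depth` ints.

    Missing components are padded with zeros, so '7' becomes (7, 0).
    Non-numeric components raise ValueError.
    """
    parts = [int(part) for part in text.strip().split(".")[:depth]]
    parts += [0] * (depth - len(parts))
    return tuple(parts)


def agents_behind_server(server_version, agents, depth=2):
    """Return a sorted list of hostnames whose agent is older than the server.

    `agents` maps hostname -> agent version string. Only the first `depth`
    version components are compared (major.minor by default).
    """
    target = parse_version(server_version, depth)
    return sorted(
        host
        for host, version in agents.items()
        if parse_version(version, depth) < target
    )


if __name__ == "__main__":
    # Hypothetical inventory, for illustration only
    inventory = {
        "srv-2003-01": "6.4",
        "srv-2012-01": "6.4",
        "srv-2012-02": "7.0",
    }
    for host in agents_behind_server("7.0", inventory):
        print(f"{host}: agent is older than the 7.0 server")
```

Anything this prints is a candidate for an agent upgrade, or for keeping the agent service disabled until it can be upgraded.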