This was one of those “ohhhh yeah…” moments – one of those issues that made complete sense once you could see the root cause, but where you couldn’t “see the forest for the trees” when troubleshooting began.  It eventually took a case with Microsoft Premier Support, along with multiple rounds of different engineers, but we finally arrived at the root cause and a solution.  Without further delay, the issue:

Lync2013-CLSCmdlet-Blank
Success Code - 0, Successful on 1 agents

Some of you may look at the above screenshot and think, “uh…Trevor…so what?”, assume I’ve started drinking too early in the morning, and move on to someone else’s blog.  The catch, however, was that CLS wasn’t actually working, and the screenshot above shows that the cmdlet output is missing the information that would confirm CLS was functional.  What I should have seen was this:

Lync2013-CLSCmdlet-WorkingExample

Notice how the cmdlet actually returned Tracing Status to me and indicated the status, scenario and computer I’m interacting with.  That’s the piece that was missing in this customer’s environment.
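
For reference, the cmdlet in both screenshots is simply Show-CsClsLogging run from the Lync Server Management Shell.  A minimal sketch is below; the pool FQDN is a placeholder, and omitting -Pools should query every pool in the topology:

# Query CLS tracing status for a specific pool (pool FQDN is hypothetical)
Show-CsClsLogging -Pools "pool01.contoso.local"

# Or query everything
Show-CsClsLogging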

So to recap, this is what I knew to be true:

  1. CLS simply didn’t function at all, even though the Windows service for the CLS agent was running.  No amount of service restarts or server reboots would help, and no cmdlet requests were being processed to make CLS do what it is supposed to do.
  2. It took a significantly long time – on the order of 5+ minutes – for any of the CLS cmdlets to complete.  I have always known that CLS isn’t exactly the Bugatti Veyron Super Sport of the logging world, but a simple Show-CsClsLogging cmdlet has never taken that long to complete in any deployment I’ve been involved in (a couple of quick checks for items 1 and 2 are sketched just after this list).
  3. This was happening on multiple servers (4 at the time the ticket was opened with Microsoft), so it definitely seemed like something had occurred or changed in this customer’s environment.
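
For context, the quick checks behind items 1 and 2 look something like the sketch below.  The CLS agent service name (RTCCLSAGT) is from memory rather than from this environment’s logs, so verify it against your own Get-Service output:

# Confirm the CLS agent service is actually running (service name is an assumption - verify locally)
Get-Service -Name RTCCLSAGT | Select-Object Name, Status

# Time how long a simple CLS status query takes to complete
Measure-Command { Show-CsClsLogging } | Select-Object -ExpandProperty TotalSeconds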

Given what I knew about the environment and about how Lync Server and CLS function, I started to dig in on my own…

First Things First

Knowing that Sophos Anti-Virus had caused multiple Lync Server related issues in this environment in the past, I immediately began focusing my attention there.  I fired up Process Monitor to look at process traces whilst I executed a CLS cmdlet via PowerShell, and in doing so I saw this:

Lync2013-CLSagent-SophosDetour
Sophos_Detoured_x64.dll

The CLSAGENT.EXE executable is having its operation forced through a Sophos DLL, “Sophos_Detoured_x64.dll”.  I had actually encountered this issue before and had previously penned a blog post involving security hardening of SQL Server, but I had not yet seen it cause an issue with CLS.  Knowing that this DLL detouring wasn’t supported by Microsoft, nor a help to performance in general, I went into regedit and $NULL’d out the following registry key entries per the instructions from Sophos:

HKLM\Software\Microsoft\Windows NT\CurrentVersion\Windows\AppInit_Dlls

HKLM\Software\Wow6432Node\Microsoft\Windows NT\CurrentVersion\Windows\AppInit_Dlls

Note:  If the AppInit_Dlls value contains any text – in my case it contained the NTFS file path to the Sophos_Detoured_x64.dll – then DLL detouring is being utilized by your Anti-Virus vendor.
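
If you would rather check for (and clear) the DLL detouring configuration without clicking through regedit, a sketch like the following reads the AppInit_DLLs value from both registry locations and then blanks it out:

# Check whether any DLL is being injected via AppInit_DLLs (64-bit and 32-bit views)
Get-ItemProperty 'HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Windows' -Name AppInit_DLLs
Get-ItemProperty 'HKLM:\SOFTWARE\Wow6432Node\Microsoft\Windows NT\CurrentVersion\Windows' -Name AppInit_DLLs

# Blank the values out (a reboot is required afterwards)
Set-ItemProperty 'HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Windows' -Name AppInit_DLLs -Value ''
Set-ItemProperty 'HKLM:\SOFTWARE\Wow6432Node\Microsoft\Windows NT\CurrentVersion\Windows' -Name AppInit_DLLs -Value ''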

I rebooted the server after making the registry change and tried the CLS cmdlets again, but I didn’t get any different results.  Things still appeared broken.

One Step Closer

I went back into Process Monitor and began looking at traces while I again executed a CLS cmdlet in PowerShell.  The process traces looked very different this time (with Sophos not in the picture) and I could see that the CLSAgent executable was actually trying to do something:

Lync2013-CLSagent-NoSophosDetour

The Process Monitor traces showed that the CLSAGENT.EXE process was stuck in a perpetual loop of “Thread Create” and then immediately after, a “Thread Exit”.  When comparing the log above to a server where CLS was functional, there is a significant difference:

Lync2013-CLSagent-WorkingExample

In the working example directly above, after one of the first “Thread Create” operations, you see the CLSAGENT.EXE process begin writing information to multiple .cache files in the “C:\Windows\ServiceProfiles\NetworkService\AppData\Local\Temp\Tracing” NTFS location.  From that point on in the trace, CLSAGENT.EXE seems perfectly happy.  On the non-working server, the traces never indicated getting to the point where logs were being written.
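
A quick way to confirm whether CLSAGENT.EXE is getting far enough to write its .cache files – without firing up Process Monitor – is to simply watch that Tracing directory for recently modified files:

# List the .cache files CLSAGENT.EXE should be writing, newest first
Get-ChildItem 'C:\Windows\ServiceProfiles\NetworkService\AppData\Local\Temp\Tracing' -Filter *.cache |
    Sort-Object LastWriteTime -Descending |
    Select-Object Name, Length, LastWriteTime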

Final Examinations & Troubleshooting

Thinking logically, it seemed as though the CLSAGENT.EXE process was still potentially being interfered with, so I went through the list of items that made sense to check:

  • Is Sophos configured to exclude all Lync Server related NTFS directories and application executables from real-time scanning?
    • Confirmed that yes, exclusions are in place.
  • Can Sophos be turned off to fully confirm that it is not in any way playing a part?
    • Confirmed that yes, even with Sophos Anti-Virus turned off the end result did not change.
  • Are any firewall ports being blocked that would prevent CLS from functioning?
    • Confirmed both through Wireshark and Process Monitor that TCP 50001-50003 were open and that network flows were present.
  • Given that CLS runs under the NetworkService account, are any NTFS restrictions in place that would prevent the account from writing to the desired NTFS locations?
    • Confirmed that while some GPO configuration was present, there were no GPO settings that would prevent the NetworkService account from accessing or writing to the desired NTFS directories (a couple of the command-line checks used for the port and NTFS questions are sketched just after this list).
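
For reference, the port and NTFS checks from the list above can also be approximated from the command line.  The server FQDN below is a placeholder, and Test-NetConnection is only available on newer Windows Server builds – on older builds netstat or a telnet test works just as well:

# Check whether anything answers on one of the CLS agent ports (run from another server)
Test-NetConnection -ComputerName "lyncfe01.contoso.local" -Port 50001

# Review the NTFS permissions on the CLS tracing directory
icacls "C:\Windows\ServiceProfiles\NetworkService\AppData\Local\Temp\Tracing"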

It was at this point that we got Microsoft Premier Support involved.  It took a number of weeks and a number of engineers, but we finally had an answer presented to us this past week.

The Root Cause

Before just “showing my cards”, a little background info to help set the stage…

Dynamic Port Background

Starting in Windows Server 2008, Microsoft made a change to the way dynamic TCP/UDP ports are used within the operating system to bring it in line with IANA standards.  Prior to Windows Server 2008 the default dynamic port range was 1025-5000, but in Windows Server 2008 and newer the dynamic port range changed to 49152-65535:

https://support.microsoft.com/en-us/kb/929851

What this means is that any process that needs to request a TCP port for network communications (think about applications that use RPC) will, by default, be handed an open port in the 49152-65535 range.  Additionally, you can customize that port range – say, to ports 50000-55000 – to help allay the concerns of your InfoSec team.
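
You can see exactly what a given server’s dynamic port range is currently set to with netsh, which is handy for spotting a customized range like the one in this environment:

netsh int ipv4 show dynamicport tcp
netsh int ipv4 show dynamicport udp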

Specifying Static Ports for NetLogon

In addition to the dynamic port range of the OS, system administrators can set a few registry values that instruct the Windows OS to use specific, static ports for certain communications:

https://support.microsoft.com/en-us/kb/224196

Registry key 1

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NTDS\Parameters

Registry value: TCP/IP Port
Value type: REG_DWORD
Value data: (available port)

You need to restart the computer for the new setting to become effective.

Registry key 2

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters

Registry value: DCTcpipPort
Value type: REG_DWORD
Value data: (available port)

You need to restart the Netlogon service for the new setting to become effective.
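
To quickly check whether either of these static port values has been configured on a server, the registry can be queried from PowerShell – if neither value exists, nothing has been customized:

# Check for a static NTDS RPC port (the value name contains a space and a slash, hence the quotes)
Get-ItemProperty 'HKLM:\SYSTEM\CurrentControlSet\Services\NTDS\Parameters' -Name 'TCP/IP Port' -ErrorAction SilentlyContinue

# Check for a static NetLogon port
Get-ItemProperty 'HKLM:\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters' -Name 'DCTcpipPort' -ErrorAction SilentlyContinue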

Despite the KB article above being written for Active Directory domain controllers, the important piece to remember is that LSASS.EXE runs even on member servers in the domain.  Additionally, LSASS.EXE is the host process for the NetLogon service, so it consumes any of these settings that are configured on the base OS build:

https://blogs.technet.microsoft.com/askpfeplat/2015/01/11/rpc-endpoint-mapper-returns-dynamic-port-incorrectly-when-active-directory-is-configured-to-use-static-port/
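
You can see which services are hosted inside LSASS.EXE on any given box with tasklist – on a typical member server the NetLogon service shows up alongside the other security-related services:

tasklist /svc /fi "IMAGENAME eq lsass.exe"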

How it all fits together

In this particular environment, the servers had been customized both in regards to the dynamic port range and in regards to static ports for the NetLogon service.  After multiple rounds of logs and investigation, the final engineer eventually focused on which TCP ports each ProcessID had open on the system by using the ‘netstat’ command in conjunction with the ‘findstr’ command:

netstat -ano | findstr 5000

Lync2013-CLS-NetstatMissingCLS

The engineer then took the ProcessID from the results and looked in Task Manager to find which service or executable was tied to that ProcessID.  What the engineer eventually discovered was that another Windows process was bound to the CLS TCP ports – lsass.exe.
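
The same PID-to-process mapping can also be done from the command line instead of Task Manager; both of the following are sketches with a placeholder PID:

# Map a PID from the netstat output to its owning executable (1234 is a placeholder)
tasklist /fi "PID eq 1234"

# Or the PowerShell equivalent
Get-Process -Id 1234 | Select-Object Id, ProcessName, Path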

Lync2013-CLS-LSASSConflict

LSASS.EXE is the Windows Local Security Authority Subsystem which is responsible for all security processing on a server including authentication, security policy processing, audit policy processing, and token processing.  But why would LSASS.EXE be listening on the port CLS wants to use?  The answer to that question is two-fold:

  1. Since LSASS.EXE relies on the dynamic port range configuration of the Windows OS, it simply grabs an available port from that range at boot.
    • Note:  Given that the servers also had a static port configured for the NetLogon service that overlapped the CLS ports, even reboots would not have solved the issue – the same port would have been claimed after every single reboot.
  2. Since LSASS.EXE starts much earlier in the boot process than CLSAGENT.EXE, it has free rein to bind to the TCP ports that CLS needs because the CLS service isn’t running yet.

As a result of the port range configuration and the boot process order, CLS was effectively being starved out of a port needed to function.

The Fix?

In short, the fix was very simple:  change the dynamic port range and move the static NetLogon port configuration.

Dynamic Ports

This is a pretty easy change to implement using netsh commands:

netsh int ipv4 set dynamicport tcp start=24419 num=16383
netsh int ipv4 set dynamicport udp start=24419 num=16383

netsh int ipv6 set dynamicport tcp start=24419 num=16383
netsh int ipv6 set dynamicport udp start=24419 num=16383

NetLogon Static Ports

This is also a pretty easy change to implement through regedit:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NTDS\Parameters

Registry value: TCP/IP Port
Value type: REG_DWORD
Value data: 30000

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters

Registry value: DCTcpipPort
Value type: REG_DWORD
Value data: 30001
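
If you prefer to script the registry side of the change rather than use regedit, a sketch along these lines sets the same two values shown above (only relevant on servers where the values were configured in the first place):

# Move the static NTDS RPC port
New-ItemProperty 'HKLM:\SYSTEM\CurrentControlSet\Services\NTDS\Parameters' -Name 'TCP/IP Port' -Value 30000 -PropertyType DWord -Force

# Move the static NetLogon port
New-ItemProperty 'HKLM:\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters' -Name 'DCTcpipPort' -Value 30001 -PropertyType DWord -Force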

The Results…

Following the change of the dynamic port range and of the NetLogon configuration, you need to reboot the servers in question.  Lo and behold…after doing so we had functional CLS:

Lync2013-CLSCmdlet-ServerFixed

A quick examination of netstat also confirmed that CLS was bound to the TCP ports, as expected:

Lync2013-CLS-FixedPorts

Short Re-Cap

In the end it was part configuration error and part dumb luck.  This customer had overlapped their dynamic TCP port range and static TCP port allocations with ports that CLS wanted to use.  It was simply the luck of the draw that LSASS.EXE had grabbed one of the CLS ports, and all because LSASS.EXE starts when Windows boots…long before CLSAGENT.EXE, which is set to “Delayed Start” by default.  When you combine everything, you get the case of CLS “running” but not actually working.

Short Aside:

If you are using Microsoft’s recommendations for TCP/UDP ports for Lync/Skype4B QoS, any port above 40803 should be dedicated to the various Lync/Skype4B MCUs.  Don’t overlap your dynamic or static OS ports with the range your MCUs will operate in…instead move them below 40803, as we did in this fix.  You may have to ensure your firewall rules are updated to reflect the new source ports for communications (if you are using internal firewalling), but it’s a small price to pay to be able to actually use CLS!