Uncovering SIP trunk High Availability issues through Lync CQM

High availability is crucial for any critical application an enterprise relies on. High availability means different things to different IT professionals, but the core of the definition is that a service should be fault tolerant and able to withstand failures without administrator intervention. Where this becomes a “grey area” is when you start considering how HA crosses sites/geographic boundaries, but we’ll save that conversation for a later date.

When it comes to Lync/Skype4B there are two load balancing configurations for High Availability (remember that Microsoft considers HA within a single site or data center only): DNS load balancing and hardware load balancing.

I generally recommend DNS load balancing for SIP as it is considerably easier to configure and troubleshoot. DNS load balancing can be used for Mediation Server pools as well, but unless the PSTN gateway is also configured to use DNS LB, you end up with no HA at all for calls coming from that gateway. During a recent Lync health check I was able to identify exactly this lack of HA through the Call Quality Methodology data.

Call Quality Methodology

If you’ve never run this in your Lync environment, then you should. Here’s the documentation on it. Now – go forth and exercise your Lync CQM-fu!

Analyzing the CQM Data

The CQM scorecard outputs a bunch of useful data, but I wanted to examine things a little further so I took a look at the Plant_2_Mediation_Gateway CSV file. This file contains all the streams between a Mediation Server and the trunks associated with the Gateways.

Looking at the raw table can provide some insight, but the data becomes far more useful once you build a Pivot Table from it. Select the data source, create your Pivot Table and voila!
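If you would rather script the same breakdown than build it in Excel, here is a minimal sketch of the idea in Python. The column names `MediationServer` and `Gateway`, the server names, and the sample rows are all made up for illustration; check the header row of your own CSV export and adjust accordingly.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical column names; the real CQM export may label these differently.
MEDIATION_COL = "MediationServer"
GATEWAY_COL = "Gateway"


def pivot_streams(csv_file):
    """Count streams per gateway, broken down by Mediation Server."""
    pivot = defaultdict(Counter)
    for row in csv.DictReader(csv_file):
        pivot[row[GATEWAY_COL]][row[MEDIATION_COL]] += 1
    return {gateway: dict(counts) for gateway, counts in pivot.items()}


# Made-up sample rows standing in for the exported CSV file.
SAMPLE = """MediationServer,Gateway
med01.contoso.local,gw-a.contoso.local
med02.contoso.local,gw-a.contoso.local
med01.contoso.local,gw-b.contoso.local
med01.contoso.local,gw-b.contoso.local
"""

if __name__ == "__main__":
    for gateway, counts in pivot_streams(io.StringIO(SAMPLE)).items():
        print(gateway, counts)
```

To run it against a real export, pass an opened file instead of the sample string, for example `pivot_streams(open(path, newline=""))`.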

Once I had the Pivot Table, it took me only a few seconds to identify an issue. Go ahead and search for it, I’ll wait… Still don’t see it? Ok, I’ll help…

Remember that for each Gateway to connect with a Mediation Server, it utilizes a Trunk in the Lync Topology. If a Gateway is correctly configured and is utilizing load balancing for calls sent to Lync (either DNS based or potentially IP address based), we should see the trunk equally utilized across all servers in the mediation pool. In the environment I was examining, calls from a certain Gateway were limited to a single server within the mediation pool. Effectively the CQM data showed that HA was likely not configured.
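The check I did by eye can be sketched in a few lines of Python. This is an illustration only, not part of the CQM tooling: the function name, the 10% threshold, and the gateway and server names are all assumptions, and the input is a simple gateway-to-server stream count like the one a Pivot Table gives you.

```python
def find_unbalanced_gateways(pivot, pool_servers, min_share=0.1):
    """Return gateways that send little or no traffic to some pool member.

    pivot: {gateway: {mediation_server: stream_count}}
    pool_servers: every server expected in the mediation pool
    min_share: smallest acceptable fraction of a gateway's streams per
               server (an arbitrary threshold chosen for illustration)
    """
    flagged = {}
    for gateway, counts in pivot.items():
        total = sum(counts.values())
        if total == 0:
            continue  # no streams recorded, nothing to judge
        starved = [
            server
            for server in pool_servers
            if counts.get(server, 0) / total < min_share
        ]
        if starved:
            flagged[gateway] = starved
    return flagged


if __name__ == "__main__":
    # Made-up counts: gw-a is balanced, gw-b only ever reaches med01.
    sample_pivot = {
        "gw-a": {"med01": 480, "med02": 515},
        "gw-b": {"med01": 730},
    }
    print(find_unbalanced_gateways(sample_pivot, ["med01", "med02"]))
    # prints {'gw-b': ['med02']}
```

A flagged gateway is a hint rather than proof. A pool member that was recently added or was offline during the sample window could produce the same pattern, so confirm the gateway or provider configuration before drawing conclusions.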

Moral of the story

After checking with the customer, we were able to confirm that the SIP provider had configured only a single IP address on their SIP trunks to the Lync mediation pools. If the Lync Front End server behind that address had gone down, all inbound calls from that provider would have begun failing. Thankfully, the Lync Front End servers had not gone down, and the CQM data let us identify and proactively remediate the issue before it resulted in an outage.

You may have thought that CQM was for call/network stats only, but it has proven to be much, MUCH more useful. Kudos to Jens Trier Rasmussen and colleagues for this super-helpful tool!
