This issue was discovered during an SBA upgrade from Lync Server 2010 to Lync Server 2013.  A few posts out on the Internet seemed loosely related to it (although none directly resolved my issue), but it ultimately turned out that I was the problem.  I’ve since learned my lesson and am much wiser now…and hopefully you all will be too after reading through this post.

Without further delay…the issue (and thus error):

10013; reason="Gateway peer in inbound call is not found in topology document or does not depend on this Mediation Server"

Well that’s interesting…  It seems a Mediation Server is receiving a call from a gateway that isn’t defined in Topology.  Which gateway and which mediation server, though?  Looking through QoE reports I was able to determine it was coming from a PSTN gateway in a branch that I had just completed an SBA upgrade on.  I had to pull the syslogs from the Audiocodes Mediant 1000 gateway and when I did, I discovered something entirely unexpected:

SIP/2.0 400 Bad Request

Inbound PSTN calls were failing with a SIP/2.0 400 Bad Request error that was being generated by Lync.  Even more interesting was that the gateway was actually sending calls to the Front End Mediation Pool and not the SBA Mediation Pool.  This was surprising for a number of reasons:

  • The SBA is configured as the first entry in the ProxySet within the Audiocodes gateway – all incoming calls from the PSTN should go there first.
  • The Front End Mediation pool is configured as the second entry in the ProxySet within the Audiocodes gateway – incoming calls from the PSTN will go there only if the SBA is marked as down.
    [Screenshot: Lync ProxySet configuration on the Audiocodes gateway]
  • The Front End is configured for resiliency with the SBA within topology, so calls should be accepted no problem in the event the SBA is down.
    [Screenshot: SBA resiliency configuration in Topology]

Digging In

First off I had to determine why the calls weren’t going to the SBA Mediation Server.  The Audiocodes gateway monitors the ProxySet peers through SIP OPTIONS requests and sends calls to available servers.   Upon examination it turned out that the Mediation Server service on the SBA had simply failed to start up and wasn’t running.  As a result, calls were being sent to the Front End pool per the ProxySet configuration, which is by design.  Why, however, is the Front End pool failing to process the calls?
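The gateway-side behavior described above can be pictured with a toy Python sketch. This is only an illustration of priority-ordered failover, not Audiocodes' actual implementation: the FQDNs are hypothetical, and the `is_alive` callback stands in for the result of the gateway's SIP OPTIONS keep-alive.

```python
from typing import Callable, Optional, Sequence


def pick_proxy(proxy_set: Sequence[str],
               is_alive: Callable[[str], bool]) -> Optional[str]:
    """Return the first peer, in priority order, that passes its keep-alive."""
    for peer in proxy_set:
        if is_alive(peer):
            return peer
    return None  # no peer is reachable


# Hypothetical FQDNs, listed in ProxySet priority order.
proxy_set = ["sba.branch.example.com", "fepool.example.com"]

# Simulated OPTIONS results: the SBA Mediation Server service is stopped.
options_ok = {"sba.branch.example.com": False, "fepool.example.com": True}

target = pick_proxy(proxy_set, lambda peer: options_ok.get(peer, False))
print(target)  # fepool.example.com
```

With the SBA marked as down, the second entry is selected, which matches what the syslogs showed: the gateway was doing exactly what it had been configured to do.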

Using the AlwaysOn and IncomingAndOutgoingCall scenarios in CLS, I was able to examine the debug traces on the Front End servers and find the following piece of information:

The host portion of the from header, gwfqdn.domain.com, arriving at MS listening port (5068) did not match any next hop peers' FQDN or IP Address.

Huh?  What?  This gateway had previously worked just fine through the SBA, and I re-confirmed this by turning on the Mediation Server service on the SBA: calls immediately began working.  Turning off the Mediation Server service on the SBA sent the calls to the Front End pool again, where they failed as before.

Down Into the Rabbit Hole…

Into Topology Builder I went to begin examining the Branch Site configuration.

  • PSTN Gateway object correctly defined?  CHECK.
  • Trunk object correctly defined to SBA?  CHECK.
  • SBA Resiliency correctly defined to FE Pool?  CHECK.

At this point my brain is spinning in circles trying to figure out what’s wrong and then it dawns on me…the error in the debug logs is telling me exactly what I need to know but I wasn’t understanding how it was saying it.  The error again:

The host portion of the from header, gwfqdn.domain.com, arriving at MS listening port (5068) did not match any next hop peers' FQDN or IP Address.

What this error is really saying is:

An incoming call from the PSTN gateway, gwfqdn.domain.com, arrived at my Mediation Server PSTN gateway port, but I can't accept the call because that gateway is not defined as a next-hop for me.

A Mediation Server cannot accept calls from a PSTN Gateway that it has no association with.  How do you define that association in Topology, you ask?  Simple:  you define a TRUNK.

Looking at the Central Site configuration, my error became instantly apparent:  I did not have a Trunk defined between the Front End Mediation Pool and the Gateway in the branch.  😳
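To make the rule concrete, here is a minimal Python sketch of the check as I understand it from the trace message. It is a toy model, not Lync's actual code: a trunk is represented as a (Mediation Server, gateway) pair, and the SBA and Front End pool FQDNs are hypothetical placeholders.

```python
from typing import NamedTuple, Set


class Trunk(NamedTuple):
    mediation_server: str  # Mediation Server or pool FQDN
    gateway: str           # PSTN gateway FQDN


def accepts_inbound(mediation_server: str, from_host: str,
                    trunks: Set[Trunk]) -> bool:
    """Accept the call only if a trunk ties this Mediation Server to the gateway."""
    return Trunk(mediation_server, from_host) in trunks


GW = "gwfqdn.domain.com"
SBA = "sba.branch.example.com"   # hypothetical SBA Mediation Server FQDN
FE = "fepool.example.com"        # hypothetical Front End Mediation Pool FQDN

# The topology as I originally had it: only the SBA has a trunk to the gateway.
trunks = {Trunk(SBA, GW)}
print(accepts_inbound(SBA, GW, trunks))  # True
print(accepts_inbound(FE, GW, trunks))   # False: no next-hop peer match

# The fix: define a second trunk from the Front End pool to the same gateway.
trunks.add(Trunk(FE, GW))
print(accepts_inbound(FE, GW, trunks))   # True
```

Note that the resiliency relationship between the SBA and the Front End pool never enters into this check, which is the whole point.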

I quickly configured a new trunk and published Topology with the new information:

[Screenshot: new trunk from the Front End Mediation Pool to the branch gateway in Topology]

Once CMS replication had completed to the SBA, I monitored the syslog traffic from the Audiocodes gateway and SUCCESS!  Calls were now completing:

[Screenshot: Audiocodes gateway syslog showing a completed call]

I can assure you that no other changes were made; the only thing that changed was the Topology configuration.  In addition, I tested this exact scenario against both a Lync Server 2010 infrastructure and a Lync Server 2013 infrastructure, and in both cases calls would fail unless a trunk was correctly defined in Topology.

My Concern and Confusion

Confusion #1

The biggest confusion I had was incorrectly assuming the resiliency configuration in Topology would handle this particular failure scenario – namely the failure of the Mediation Server on the SBA for incoming call scenarios.  In reality, it doesn’t.  I would strongly recommend that people test this scenario because I am very confident you will have the exact same failures I had.  As a result you will need to define additional trunks in Topology to properly handle this failure scenario in Lync Server 2013 (and above) environments.

Concern #1

There isn’t a single piece of documentation (that I can find) from Audiocodes or Sonus or Microsoft that talks about the requirement for adding an additional trunk to the central site mediation pool when deploying SBAs.  Audiocodes clearly defines adding the FEPool as the second entry in the ProxySet but that configuration in-and-of-itself does you absolutely zero good without the additional trunk configuration in the Lync Server topology.  Configuring the ProxySet without the additional trunk in Topology doesn’t get you automatic inbound call failover, it gets you failed calls.  Not good, not good at all.

Concern #2

SBAs have been around since Lync Server 2010, but Lync Server 2010 is restricted when compared to Lync Server 2013, especially with regard to Mediation Server flexibility.  In the Lync Server 2010 world a Mediation Server is 1:N, meaning a single Mediation Server (or pool) can have only a single trunk to a particular gateway, but that same server (or pool) can connect to multiple different gateways.

Note:  Yes, there are ways of “faking” this and creating multiple trunks to a single gateway using alias DNS records, but many don’t do that and the documentation for SBA installs doesn’t mention it anyway.

Where it becomes a problem with Lync Server 2010 is that, because you cannot define a second trunk from the same gateway to another Mediation Server, automatic inbound call failover from SBA to FEMediation won’t work.  As a result, in the Lync Server 2010 failure scenario you are forced to update the PSTN Gateway association in Topology first, but only after you’ve detected the failure and have chosen to continue to route through the FEPool.  Not exactly automated…and a bit of a “chicken and egg” scenario.

Concern #3

I’ve talked with many colleagues on this and many were surprised at my findings and/or suspect there’s a “code issue”.  I tested this exact scenario both against Lync Server 2010 Mediation Servers and Lync Server 2013 Mediation Servers and got the same behavior with both.

Note:  I don’t think I’m crazy here, but I will acknowledge a mistake if one has been made.  If someone can prove me wrong, please let me know the details and I’ll update the post to be correct.

That being said, the behavior makes complete sense in that a Mediation Server won’t accept a call from a gateway for which it has no trunk configured in Topology.  You can take this same scenario and stretch it to a non-SBA deployment:  take any gateway in your Topology and begin sending calls to a Mediation Server for which there is no association defined (or Trunk, in Lync-2013 parlance) and your calls will begin to fail.

Bottom Line

If you have Lync Server 2013 (or above), make sure you define alternate trunks to the Front End Mediation Pool so that automatic inbound call failover can occur with your branch SBA deployments.  You don’t necessarily have to use those trunks for outbound calls if you don’t want to, but given that the SBA Mediation Server could fail it does provide you an alternate path to the gateway through backup PSTNUsage routing.  If you have Lync Server 2010 – which is beginning to be long in the tooth, anyway – you will have to manually update the PSTN Gateway association to reflect the Front End Mediation Pool (in a scenario where the SBA Mediation Server is down) so that calls are accepted and processed.
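If you want to sanity-check a deployment on paper, the rule boils down to this: every peer a gateway might send calls to needs a trunk back to that gateway. Below is a small Python sketch of that audit. It assumes you have already written down each gateway's ProxySet and the trunks defined in Topology by hand; nothing here talks to Lync or to a gateway, and all of the names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Trunk = Tuple[str, str]  # (Mediation Server or pool FQDN, gateway FQDN)


def missing_trunks(proxy_sets: Dict[str, List[str]],
                   trunks: Set[Trunk]) -> List[Trunk]:
    """List every (peer, gateway) pair in a ProxySet that has no trunk defined."""
    missing = []
    for gateway, peers in sorted(proxy_sets.items()):
        for peer in peers:
            if (peer, gateway) not in trunks:
                missing.append((peer, gateway))
    return missing


# Hypothetical inventory: each gateway's ProxySet entries in priority order.
proxy_sets = {
    "gw1.branch.example.com": ["sba1.branch.example.com", "fepool.example.com"],
    "gw2.branch.example.com": ["sba2.branch.example.com", "fepool.example.com"],
}

# Trunks currently defined in Topology (also hypothetical).
trunks = {
    ("sba1.branch.example.com", "gw1.branch.example.com"),
    ("fepool.example.com", "gw1.branch.example.com"),
    ("sba2.branch.example.com", "gw2.branch.example.com"),
}

for peer, gateway in missing_trunks(proxy_sets, trunks):
    print(f"No trunk from {peer} to {gateway}")
# No trunk from fepool.example.com to gw2.branch.example.com
```

Any pair the audit reports is a path where the gateway will happily send calls and the Mediation Server will reject them.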

Again, what this boiled down to was a horribly ignorant understanding of inbound call processing, my incorrect assumptions of SBA/FE resiliency functionality, and an ignorance of configured Trunks in Topology.  It’s my own fault, but I’m appreciative of the opportunity to correct an oversight and set concepts straight within my brain!  As an additional note, make sure that you define alternate trunks for non-SBA deployment scenarios as well – say a FE disaster recovery scenario – so that you don’t run into the same issue in that scenario!

Note:  This environment is running the January 2016 CU for Lync Server 2013, so if some of you are thinking it’s a code issue…I’m not sure I agree!

6 thoughts on “‘Gateway peer in inbound call is not found in topology document’ with Lync Server 2013”

  1. Your findings are absolutely correct. And I am happy that we weren’t the only ones with this lack of understanding. At first we thought that we were the only dumb ones not knowing this, but most people we talked to were just as clueless.
    And you’re also right that you can’t find much about this issue from Microsoft or the gateway vendors.
    Anyway, I can tell you that it is a lot of work to add these trunks, especially if you have hundreds of gateways worldwide, have to follow change processes, and have to verify that it really works.

    There’s another thing we’ve overlooked which is related to that issue. Imagine you have two pools and you have pool failover configured. If Frontend Pool A goes down the users move over to Frontend Pool B – perfect. But if Mediation Pool A goes down as well, no PSTN calls are possible until you add the necessary trunks to Pool B.

    1. Correct on all counts. It basically boils down to this: if the possibility exists that you would send calls from GWs to any other Mediation Server in your environment, you effectively need to have a mesh of trunks to handle an any<->any scenario. It’s a boatload of work, as you pointed out, but it’s the only option for automatic failover scenarios.

    2. Hello,

      There is another thing you can do.
      I am not sure about Audiocodes, but this is what we do with Sonus.
      At sites that have two GWs/SBAs, instead of creating so many cross trunks between the two SBAs/GWs,
      just create one SIP trunk between the two GWs,
      so that if, say, your first GW’s Mediation Server goes down, the external calls will move over to the second GW via the SIP trunk between them…and that GW will send the calls to its Mediation Server…
      The above is also valid even if the whole SBA for the first GW goes down…users will automatically register to the Front End server and will receive calls from the second GW’s Mediation Server

      And of course this is just for incoming calls…outgoing is also easy

      Regards
      Sepe

      1. I’m not sure I would use that approach unless I absolutely had to, but it is certainly an option within Audiocodes too. You could certainly choose to create a mesh, either within Skype trunks or your gateways/SBCs. The ultimate approach depends on the customer and the scenario, but I would be inclined to leave the complexity out of the GW/SBCs, if I had a choice.

        1. Hello,
          Yes, I agree, but I think that applies where the customer has few users or sites.
          We found this to be the best solution where we have customers with more than 400 sites.
          In that case, I think that if we handle everything in Lync it will make everything very slow.
          Already, when we open the Lync control it takes a few MINUTES before changes are reflected, and the same when we publish…
          Regards
          Sepe

          1. Sure, that is understandable. It would be easier to manage if we had the ability to create trunks through PowerShell instead of having to go through Topology Builder, but since Topology Builder defines the topology…there isn’t much that can be done. I completely understand your pain!
