23Jan/17

Examining the Call Quality Dashboard Template in SOF

On the week of January 9, 2017, Microsoft added some considerable new offerings within the Skype Operations Framework (SOF). While SOF is helpful in many respects, its breadth and scope make it difficult to understand what to use, where to use it, and how to use it. In more than one way, trying to understand SOF is like the old saying ‘trying to drink from a fire hose’ – the content is all good but the sheer volume gets in the way. Even so, Microsoft hit a home run with the new content by giving customers a template to use with the Call Quality Dashboard within Skype for Business Online. If you are using Skype for Business Online today, you should go download this template and begin looking at your data, because the findings will be eye-opening and worthwhile.

What’s CQD, anyway?

For some out there, you may have no idea what CQD is.  Maybe you don’t use Skype4B.  Maybe you do but you haven’t delved into the inner-workings.  Either way, CQD can simply be described as an advanced way to analyze representations of media streams, media quality, and usage metrics. Before diving in to CQD though, you need a small history lesson…

Within on-premises deployments, you have two databases that comprise what’s known as the ‘Monitoring Databases’:

  • CDR.mdf – The CDR database contains call detail records – session information that contains who did something, what they did, and when they did it.  Examples include: SIP URI’s, modality type, timestamps, etc.
  • QoE.mdf – The QoE database contains quality metrics – specific network and performance information that contains where someone did something and how it performed. Examples include: IP addresses, modality type, packet loss, jitter, MOS, etc.

The big problem back in the Lync Server 2010/2013 era was that while the CDR/QoE information was great to have, the Monitoring Reports that MSFT provided to query the data weren’t overly robust. The pre-built reports offered value but they were not customizable (meaning they were static in the data they queried), and creating new reports required an intimate knowledge of SSRS and T-SQL plus an understanding of the CDR/QoE database schema. Most folks – myself included – don’t have that level of understanding so we simply used things as-is.

When Skype for Business Server 2015 landed, Microsoft offered a new solution called the ‘Call Quality Dashboard‘. There are several good things about the solution but my top three would be:

  • Reporting and analysis using the power and speed of Microsoft SQL Server Analysis Services – CQD utilizes Microsoft SQL Analysis Services to provide fast summary, filter, and pivoting capabilities to power the dashboard via an Analysis Cube. Reporting execution speed and the ability to drill down into the data can reduce analysis times dramatically.
  • New data schema optimized for call quality reporting – The Cube has a schema designed for voice quality reporting and investigations. Portal users can focus on the reporting tasks instead of figuring out how the QoE Metrics database schema maps to the views they need. The combination of the QoE Archive and the Cube provides an abstraction that reduces the complexity of reporting and analysis via CQD. The QoE Archive database schema also contains tables that can be populated with deployment-specific data to enhance the overall value of the data.
  • Built-in report designer and in-place report editing – The Portal component comes with several built-in reports modeled after the Call Quality Methodology. Portal users can modify the reports and create new reports via the Portal’s editing functionality.

It’s fast. It’s easier to use. It’s customizable. Win-win-win. Not so fast…

A significant remaining limitation was the lack of in-depth templates (and thus, guidance) for what you should be querying, but bigger than that was the complete lack of visibility into user accounts hosted within Skype for Business Online. Customers were left completely in the dark and unable to examine quality issues for user accounts that were homed within Skype for Business Online. Microsoft heard the complaints though and eventually released the Call Quality Dashboard for Skype for Business Online, thus allowing customers the same data analysis that is available to CQD on-premises. Even though CQD (in both scenarios, on-premises and online) contains some pre-built reports, customers were still left scratching their heads about what other pieces of information they should be examining. What other metrics could shed light on issues they’ve been having? Enter the CQD SOF template (v1).

What’s Included?

The SOF CQD sample template is a multi-layered set of reports with a primary mission of examining audio quality. While audio is the primary reporting factor, that does not mean you cannot duplicate reports to search for video or application sharing data. If a customer wants to report on that data then they absolutely can, but start with audio analysis, resolve your issues, and you should see the remaining modalities start to fall in line.

The top-most report is a usage/trend report that aims at showing you the total number of streams and the percentage of those streams that classify the call as having been poor (or outright failures):

If you click ‘Edit’ to examine the query, you see the data that is being pulled:

A few things to know and understand when looking at queries in the query editor:

  • Dimensions – These are the items placed on the X-axis of the chart; they act as groupings that summarize the queried measurements into a clearer visualization of the data. Month/Year is very common but you could report on others as well, such as Network Subnet or Building Name.
  • Measurements – These are the pieces of data that make up the Y-axis of the chart. Your available query options here contain all the pieces of stream information imported from the QoE database.  Jitter, packet loss, round-trip time, percentages, etc. are all at your disposal.
  • Filters – Filters can be used to isolate and return only specific sets of data from the larger CQD data sets. Filters impact what is returned for the ‘measurements’ and could be configured to be many things. Month/Year is common (to look at data from only a certain set of months) or you could configure a filter to look at only internal network segments, etc.

Each of the types above effectively correlates back to T-SQL, so if you can wrap your mind around T-SQL then you can work with the query editor more easily:

  • Dimensions are like T-SQL GROUP BY clauses
  • Measurements are like the columns and aggregates in a T-SQL SELECT list
  • Filters are like T-SQL WHERE clauses
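
To make the mapping above concrete, here is a minimal sketch that runs an equivalent query against an in-memory SQLite table from Python. The table name, column names, and rows are all made up for illustration (real CQD data has far more columns and lives in an Analysis Services cube, not SQLite), but the shape of the query is the point: the dimension shows up in GROUP BY, the measurements are the aggregates in the SELECT list, and the filter is the WHERE clause.

```python
import sqlite3

# Hypothetical, heavily simplified stream table used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE streams (month TEXT, subnet TEXT, is_internal INTEGER, is_poor INTEGER)"
)
conn.executemany(
    "INSERT INTO streams VALUES (?, ?, ?, ?)",
    [
        ("2017-01", "10.1.0.0", 1, 0),
        ("2017-01", "10.1.0.0", 1, 1),
        ("2017-01", "10.2.0.0", 1, 0),
        ("2017-02", "10.1.0.0", 1, 0),
        ("2017-02", "203.0.113.0", 0, 1),
    ],
)

query = """
    SELECT month,                         -- dimension (shown in the output)
           COUNT(*)     AS total_streams, -- measurement
           SUM(is_poor) AS poor_streams   -- measurement
    FROM streams
    WHERE is_internal = 1                 -- filter
    GROUP BY month                        -- dimension (the grouping itself)
    ORDER BY month
"""
rows = conn.execute(query).fetchall()
for month, total, poor in rows:
    print(month, total, poor)
```

Swapping `month` for `subnet` in both the SELECT list and the GROUP BY is the equivalent of changing the dimension in the query editor.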

To dig deeper into these reports, you simply ‘follow the rabbit hole’ by clicking on the hyperlink of the report name.

Note: If a report name is clickable, that means there are sub-reports available, otherwise the report name will not be clickable.

One-Level Deeper

The first sub-report contains a bevy of data and includes reports that offer even more sub-reports.  The second level top reports include:

Audio

This report (and its sub-reports) is where you will likely spend most of your time. We’ll dive a bit further into this report as we keep peeling back the layers of the CQD onion.

Media Reliability – Call Setup Failures

This report (and its sub-reports) is another useful report where you will likely spend some time. If you want to determine why clients aren’t able to connect to media or for supplementary information about why calls are poor, then you’ll find that additional data here. We’ll dive a bit further into this report as we keep peeling back the onion layers.

User Reported Experiences – Rate My Call

This report (and its sub-reports) is, IMHO, useless. Most people I know don’t fill out those prompts asking them to rate a call. Maybe your users do but I don’t find this report all that useful. YMMV.

Client Versions

This report is a useful way to track client versions. Guidance from MSFT is to remain no more than 4 months behind the current version of the client software and this report will help you identify folks using out-of-date versions.

Devices – Microphone

This report is a useful way to track microphones used by clients. Want to find out people who are using internal microphones instead of certified devices? This report (and sub-reports) will tell you.

Devices – Speaker

This report is a useful way to track speakers used by clients. Want to find out people who are using internal speakers instead of qualified devices? This report (and sub-reports) will tell you.

Audio Sub-Reports

You will spend most of your time here, as these reports identify specific stream paths and metrics issues that help you identify the biggest problems in your network environment. The best reports here and most useful (IMHO) are as follows:

Client-to-Server Poor Audio Streams

Use this report to easily examine the number of streams considered poor that involve things like conferences or CloudPBX calling. Dig further into the sub-reports to begin identifying what buildings and/or subnets are the most prone to issues…

Client<->Server Poor Audio Streams by Building

This report will give you the exact location (assuming you have filled out and imported your subnet locations – which you absolutely should!) of the building and network name involving your poor streams. Dig further into the sub-reports to begin identifying what made the calls poor, including metrics and/or connection type…

Client<->Server Poor Audio Streams by Building, Subnet and Network Connection

This report will give you the reason ‘why’ a call was classified as poor and where those calls are from. Instead of simply labelling the call as ‘poor’, the table breaks the poor calls down by classification – packet loss, degradation, round trip, concealed ratio. Another report at this level includes additional information that can help you identify last-hop routing issues…
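
If you export stream-level data and want to reproduce that kind of breakdown yourself, the logic is roughly “which metric crossed its limit?”. The sketch below is only an illustration: the field names are hypothetical and the threshold values are my assumptions, not something the template states, so check Microsoft’s current stream classification documentation before relying on them.

```python
from typing import Dict, List

# ASSUMED thresholds and HYPOTHETICAL field names, for illustration only.
ASSUMED_THRESHOLDS = {
    "packet_loss_rate": 0.10,  # fraction of packets lost
    "round_trip_ms": 500,      # milliseconds
    "degradation_avg": 1.0,    # MOS points lost
    "concealed_ratio": 0.07,   # fraction of concealed audio samples
}


def poor_reasons(stream: Dict[str, float],
                 thresholds: Dict[str, float] = ASSUMED_THRESHOLDS) -> List[str]:
    """Return the name of every metric that exceeds its threshold.

    An empty list means the stream would not be classified as poor
    under these assumed limits. Missing metrics are treated as 0.
    """
    return [name for name, limit in thresholds.items()
            if stream.get(name, 0) > limit]


print(poor_reasons({"packet_loss_rate": 0.15, "round_trip_ms": 120}))
print(poor_reasons({"packet_loss_rate": 0.01, "round_trip_ms": 80}))
```

A stream can trip more than one limit at once, which is why the function returns a list rather than a single reason.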

Client<->Server Poor Audio Streams by Reflexive IP

This report will give you the reason of ‘why’ a call was classified as poor and where those calls are from, but also adds the reflexive IP address used in the stream. The Reflexive IP is the IP address as seen by Office365 (the NAT IP or STUN address) of the stream. Use this to help you determine if media streams are egressing from an unexpected network location or to identify if a particular network egress point is potentially saturated.

TCP Usage

This report (and sub-reports) will identify audio streams that use TCP for transport instead of UDP. These reports will effectively help you quantify and isolate firewall configurations that don’t allow the right protocol or right ports. Dive in further to determine network subnets that are the culprits…

TCP Breakdown by Building and Subnet

This report gives you subnets involved in calls using TCP as transport. If TCP is used for transport, there is a possibility that either ports or IP’s may be mis-configured in your network firewalls. Another possibility is that client streams may be egressing via an HTTP proxy…

HTTP Proxy Usage

If you have streams egressing via a proxy, be ready for some significant issues. Avoid proxies if at all possible; you can do so by ensuring traffic to the Skype4B Online IP’s bypasses the proxy. Unfortunately this report doesn’t show what network sites these calls are coming from, but one could easily build a sub-report to do so.

Peer-to-Peer Sub-Reports

Without re-posting a bunch of pictures, this report (and its sub-reports) contains the same information as the Client-to-Server reports but is filtered to show calls between endpoints (P2P) within your network. These calls should never go out to the Internet (or ExpressRoute), so the report helps you isolate and identify problematic segments within your internal network.

Media Reliability Sub-Reports

Call Setup Failures by Building, Subnet and Reflexive IP

This report helps you identify subnets and/or external IPs that have firewall rules blocking traffic to the Skype4B Online IP ranges. It potentially also helps you identify firewall rules that may be configured for SSL/TLS inspection or DPI/IPS traffic manipulation. Use the ‘Call Setup Failure Reason’ column to help with that identification. Despite this being great, it doesn’t identify which IP addresses are in the communication failure…

Custom Report – Failures to Office365 Media Relays

You can create this custom report to identify exactly which Office365 IP addresses are in the failed communication path. Your firewall team claims that they have things right? Well…if so, this report will show nearly zero failures. If failures exist, then somewhere there is a firewall or router blocking communication to Office365 and you can show them this data to prove it.

Limitations to Note

While the data is great, you should note a few ‘gotchas’:

Limitation One

Since the client submits the QoE reports at the end of the call, the default report data may include information for elements outside of your corporate network. CDR/QoE reports include all parties for a call and/or conference, so it could include federated partners or anonymous guests. As a result, your reports and tables will include IP addresses that confuse and confound both yourself and your network team. You will almost certainly need to filter the queries to isolate your internal network using one of a few methods:

  1. Use the Second Tenant ID filter
  2. Use the Second Inside Corp filter
  3. Use the Inside Corp Pair filter

There seem to be multiple ways to filter the data, and unfortunately I receive varying results from each of the filters above. You’ll likely need to play around with the queries and export the data to CSV for some manual analysis, but at the end of the day you can begin to identify network segments using one (or many) of the methods above.
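
For the manual CSV analysis, one approach is to keep only the rows whose client IP falls inside subnets you know to be yours. This is a rough sketch, assuming an export with a `client_ip` column; that column name, the subnet list, and the sample rows are all hypothetical, so adjust them to whatever your export actually contains.

```python
import csv
import io
import ipaddress

# Hypothetical corporate subnets; in practice, reuse your building/subnet data.
CORPORATE_SUBNETS = [
    ipaddress.ip_network("10.10.0.0/16"),
    ipaddress.ip_network("10.20.0.0/16"),
]


def is_corporate(ip_text: str) -> bool:
    """True if the address parses and sits inside a known corporate subnet."""
    try:
        ip = ipaddress.ip_address(ip_text)
    except ValueError:
        return False
    return any(ip in net for net in CORPORATE_SUBNETS)


def internal_rows(csv_text: str, ip_column: str = "client_ip"):
    """Return only the exported rows that belong to the corporate network."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if is_corporate(row[ip_column])]


# Made-up sample export: one federated/guest address mixed in with internal ones.
SAMPLE_EXPORT = """client_ip,poor_streams
10.10.4.25,12
198.51.100.7,3
10.20.8.9,5
"""

kept = internal_rows(SAMPLE_EXPORT)
print([row["client_ip"] for row in kept])
```

Doing the filtering outside CQD also gives you a stable baseline to compare against when the built-in filters disagree with each other.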

Limitation Two

CQD is all historical data. You cannot use CQD to pre-emptively identify quality issues nor is CQD useful if you haven’t imported your building data. Take the time before deployments to fill out this data.

What’s Next?

The template is great. The insights are valuable. It’s not perfect, however. I’ve already built some tables and reports with content that I’d like to see, especially around actual metrics reports for streams and not just what CQD uses as classification for ‘poor calls’. Microsoft will undoubtedly continue building this template and I definitely look forward to what’s next. Kudos to MSFT for a solid foundation on this!

14Dec/16

Handling SIP OPTIONS Requests on Audiocodes SBCs

12/20/2016 – Updated to include alternate IP-to-IP Routing configuration

SIP OPTIONS requests are a crucial piece of functionality for Lync/Skype4B deployments, but even so, OPTIONS requests are utilized within other Unified Communications platforms as well.  OPTIONS requests are most commonly used as a keepalive mechanism between SIP-based systems to determine if the remote end is ‘alive’.  For many of the IT Admins out there, you’ll recognize this as the difference between performing a TCP test on Port 80:

[Image: monitoringexample-tcptest-overallfailure]

vs. ensuring an HTTP 200 OK is returned to an HTTP GET request to the same server:

[Image: monitoringexample-httptest-overallsuccess]

The difference is critical:  just because a port is open or a TCP connect completes, doesn’t mean the application on the remote end utilizing that port is actually functional.  In the above scenario, our web server may be functional but perhaps the IIS Application Pool isn’t running and as a result, HTTP 500 errors are being generated thus taking an e-commerce website offline.  Not good!
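
If you’d rather see the difference in code than in a monitoring console, here is a small sketch using only the Python standard library. The function names are mine, and treating only a 200 as healthy is a simplifying assumption (your monitoring may accept redirects or other codes).

```python
import http.client
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection completes. Says nothing about the application."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_healthy(host: str, port: int, path: str = "/", timeout: float = 3.0) -> bool:
    """True only if the server answers an HTTP GET with 200 OK."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().status == 200
    except (OSError, http.client.HTTPException):
        return False
    finally:
        conn.close()
```

A server with a dead application behind it will happily pass `tcp_port_open` while failing `http_healthy`, which is exactly the gap described above.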

For most environments you’ll see monitoring systems and hardware load balancers configured for this additional in-depth configuration, but OPTIONS requests are the equivalent in the SIP world of this more advanced monitoring capability.  Despite this common keepalive usage, OPTIONS requests as per RFC 3261 can actually be utilized to obtain much more:

The SIP method OPTIONS allows a UA to query another UA or a proxy server as to its capabilities. This allows a client to discover information about the supported methods, content types, extensions, codecs, etc. without "ringing" the other party. For example, before a client inserts a Require header field into an INVITE listing an option that it is not certain the destination UAS supports, the client can query the destination UAS with an OPTIONS to see if this option is returned in a Supported header field. All UAs MUST support the OPTIONS method.

Want to know what codecs are supported by the remote endpoint?  Check.

Note:  You typically don’t see codecs listed in most OPTIONS requests but the capability does exist within the RFC to handle it.

Want to know what content types are supported by the remote endpoint?  Check.

Bottom line:  OPTIONS requests are your baseline heartbeat for SIP user agents.  If an OPTIONS request isn’t responded to by the remote UAS, then the UA believes the remote UAS is ‘down’ and must attempt to re-route requests to another remote UAS.
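
For reference, here is roughly what such a heartbeat looks like on the wire. The sketch below builds an OPTIONS request modeled on the example in RFC 3261 and pulls the status code out of a response. The host names are placeholders and the helper names are mine; this is not how Skype4B or any SBC implements it, just a minimal illustration of the message format.

```python
import uuid
from typing import Optional


def build_options_request(target_uri: str, local_host: str, from_uri: str,
                          transport: str = "UDP") -> str:
    """Build a minimal SIP OPTIONS request as text (no body)."""
    branch = "z9hG4bK" + uuid.uuid4().hex  # RFC 3261 branch "magic cookie" prefix
    lines = [
        f"OPTIONS {target_uri} SIP/2.0",
        f"Via: SIP/2.0/{transport} {local_host};branch={branch}",
        "Max-Forwards: 70",
        f"To: <{target_uri}>",
        f"From: <{from_uri}>;tag={uuid.uuid4().hex[:10]}",
        f"Call-ID: {uuid.uuid4().hex}@{local_host}",
        "CSeq: 1 OPTIONS",
        "Accept: application/sdp",
        "Content-Length: 0",
    ]
    # Headers are CRLF-separated and terminated by an empty line.
    return "\r\n".join(lines) + "\r\n\r\n"


def parse_status_code(response_text: str) -> Optional[int]:
    """Return the status code from a SIP response, or None if it isn't one."""
    status_line = response_text.split("\r\n", 1)[0]
    parts = status_line.split(" ", 2)
    if len(parts) >= 2 and parts[0] == "SIP/2.0" and parts[1].isdigit():
        return int(parts[1])
    return None


# Placeholder host names, for illustration only.
print(build_options_request("sip:sbc.example.com",
                            "medsrv.example.com",
                            "sip:medsrv.example.com"))
```

Whether a given status code counts as ‘alive’ is a policy decision for the sender; no response at all within the timeout is what marks the remote end as down.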

Where Skype4B Fits In

Mediation Servers in Skype4B send OPTIONS requests to all Trunks defined in Topology, which means that each Mediation Pool is checking each PSTN Gateway for status.  If SIP OPTIONS requests are not processed by the PSTN Gateway then Skype4B thinks the trunk is down and won’t attempt to route calls to it.  After all, why attempt to route a call somewhere when it may be down? In turn, the SBC typically maintains status of the remote endpoints within its configuration, such as an upstream Cisco Call Manager cluster:

[Image: optionsarchitecture-skype4bandcucm]

So long as OPTIONS requests are processed by all endpoints, calls should flow without issue.  Even so, a potentially outage-inducing scenario exists if you aren’t prescriptive in your configuration…

The Audiocodes Specifics

When you look at most of the Audiocodes guides out there, you’ll notice that they don’t typically outline what is required to properly handle OPTIONS requests.  In most general documentation, you see more of a wildcard (‘*’) approach whereby all messages from Skype are simply forwarded to a remote system without much thought:

[Image: sbcconfig-wildcardroutingsample]

What the rule above shows is that if any type of SIP request (Request Type=All) comes from the Skype4B Mediation Servers (Source IP Group), the message is to be routed to a remote SBC (Destination IP Group).  This is all fine and dandy – and will likely result in calls that function – but there’s a very dirty secret:  it results in OPTIONS messages being routed to the remote endpoint as well.  Instead of the SBC terminating the OPTIONS message from Skype4B, the OPTIONS message gets passed along to the remote endpoint (say CUCM, for example) and the Audiocodes SBC won’t report back status to Skype4B until it hears back from CUCM.  In effect, your OPTIONS status from Skype4B to the SBC becomes dependent upon the successful completion of the OPTIONS status being reported by another remote system upstream (CUCM, for example) as a result of the routing rule configuration.

Take my advice:  don’t use the approach above!

The correct way to handle this is to configure the Audiocodes SBC so that it handles OPTIONS messages locally.  Instead of passing OPTIONS requests from one system along to another remote system, you want the SBC to answer each OPTIONS request itself, thus ensuring an independent view of status between the SBC and each connected remote system.

For each system the SBC is interacting with, you need to define an IP-to-IP routing rule that resides at the very top of your rule list:

[Image: sbcconfig-optionsconfig]

There are a few critical differences in this rule:

  • Request Type = OPTIONS
  • Destination Type = Dest Address
  • Destination Address = internal

With this rule in place, the SBC locally handles the OPTIONS request from Skype4B and immediately reports its own status back to Skype4B.  It does not pass the message along to an upstream system (CUCM, for example).  And yes, you need to have as many of these rules as you have SIP systems sending OPTIONS requests to the Audiocodes SBC.
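
Why does the rule need to sit at the very top? Here is a toy model in Python. To be clear, this is not Audiocodes configuration syntax and the rule/group names are made up; it simply assumes the routing table is evaluated top-down with the first matching rule winning, which is what the placement advice above implies.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rule:
    name: str
    source_group: str  # "Any" matches every source IP Group
    request_type: str  # "All" matches every SIP method
    destination: str   # "internal" means the SBC answers locally


def pick_rule(rules: List[Rule], source_group: str, method: str) -> Optional[Rule]:
    """Return the first rule that matches (assumed top-down, first-match-wins)."""
    for rule in rules:
        source_ok = rule.source_group in ("Any", source_group)
        method_ok = rule.request_type in ("All", method)
        if source_ok and method_ok:
            return rule
    return None


# Hypothetical rule list: the specific OPTIONS rule sits above the wildcard rule.
RULES = [
    Rule("Skype4B OPTIONS", "Skype4B", "OPTIONS", "internal"),
    Rule("Skype4B to CUCM", "Skype4B", "All", "CUCM"),
]

print(pick_rule(RULES, "Skype4B", "OPTIONS").destination)  # answered by the SBC
print(pick_rule(RULES, "Skype4B", "INVITE").destination)   # routed upstream
```

Reverse the list and the wildcard rule swallows the OPTIONS request before the specific rule is ever considered, which is the dependency problem described earlier.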

12/20/2016 Update

Note: Another configuration is possible.  Instead of creating an IP-to-IP rule for each IP Group (potentially resulting in tens or hundreds of rules), you can essentially create a “catch all” rule that allows a single IP-to-IP rule to handle receiving OPTIONS requests.  Using the same base format as the rule above, you simply need to change the Source IP Group so that it is ‘Any’:

[Image: sbcconfig-optionsconfig-wildcard]

This rule will function exactly the same and allow the SBC to locally handle OPTIONS requests.  The big difference is that this rule will allow any IP Group defined in the SBC to be responded to all while using a single rule instead of defining individual rules for each IP Group in the configuration.  Personally, I would define an IP-to-IP rule per IP Group but choose the configuration that suits you best.

As a result of this rule, the flow changes dramatically and the SBC processes the request locally:

[Image: sbcconfig-optionsconfig-inboundrequest]
[Image: sbcconfig-optionsconfig-200ok]

The syslog entries show additional detail of the SBC processing the OPTIONS request locally:

[Image: sbcconfig-optionsconfig-inboundrequest-sysloginfo]
[Image: sbcconfig-optionsconfig-200ok-sysloginfo]

Moral of the Story

When configuring Audiocodes SBC’s, make sure you have specific IP-to-IP routing rules defined, using the above as a basis, for properly handling SIP OPTIONS messages.  There are a few Audiocodes documents out there that have these settings defined, but many of the Lync/Skype4B related documents seem to lack this info.  It seems counterintuitive that you’d have to define special rules for OPTIONS requests, but given that the SBC is flexible enough to actually route OPTIONS messages at all, it does make some logical sense.  An OPTIONS message is still a SIP message, so just remember the extra steps required to properly configure an Audiocodes SBC to handle the messages.

As a final note, this type of logic doesn’t seem to exist in the Audiocodes gateway code.  In that scenario, the OPTIONS requests seem to be ‘automatically handled’ and additional configuration isn’t required.  IMHO, it seems that the SBC code is truly the first place this became a requirement.

31Oct/16

Musing About ‘Enterprise Control Issues’ with Office365 Networking Configuration

First off, all opinions and thoughts here are my own.  You, my dear reader, are not required to agree with me nor are you required to read the post.  Continue at your own peril.

Second, while I’m not a neurosurgeon, psychotherapist, psychologist, or sociologist, I can still use that mushy-grey-matter in between my ears to notice things and draw conclusions using deductive reasoning and critical thinking.

Third, I fully realize that I’m making some generalizations in my statements but I also realize that many of these generalizations have been proven time and time again by the customers/organizations I’ve interacted with over the past four years.

The Musing?

"A vast majority of Enterprises - but especially large(r) ones - act like petulant children when they begin their journey to 'the Cloud'."

This seems to manifest itself in multitudes of ways:

  • Features/Functionality may be different, resulting in user training challenges and thus resistance (or outright refusal) to adapt.
  • Internal Business Processes must be retooled to work around deficiencies or adapted to take advantage of enhancements, which sometimes never occurs.
  • Cost model structures typically need to change to account for simple service consumption costs instead of complex CapEx/OpEx models that were previously used.
  • Internal corporate fiefdoms battling each other to ‘maintain face’ or ‘maintain their ground’ or ‘maintain reason for this is how it’s done’, resulting in significant delays or stoppage altogether.
  • Etc…

As a result, Enterprises often act like children and throw temper tantrums, scream and cry, or go pout in a corner as a response to the changes seen as a result of their ‘Cloud journey’.

Acknowledging Reality

Despite my statements above, the reality is that all of these responses are natural to our human nature and our psychology.  We, as humans, all hate change.

https://hbr.org/2012/09/ten-reasons-people-resist-chang

http://www.huffingtonpost.com/morty-lefkoe/is-it-really-human-nature_b_906331.html

http://www.forbes.com/sites/lisaquast/2012/11/26/overcome-the-5-main-reasons-people-resist-change/#5beaf0553393

For all of the reasons I’ve listed in the previous section, there is a valid point to be expressed. Each one impacts the nature of the Enterprise and how it operates.  It impacts not only the business but also those employed by the business, which include people like you and me.  As a result, we all have skin in this game.

 I make no argument that our concerns ought not be fleshed out.  Concerns must be acknowledged, worked on, and resolved.  If any organization is to be successful in the journey to “the Cloud”, they must embrace Operational Change Management and solve problems that arise.

That being said, there are one or two groups within IT that often exhibit a far greater resistance to change. They often remain steadfastly gripped to their existing ways, resisting at all opportunities, mumbling ‘over my dead body’ with each change that comes.  Coincidentally enough, these are my old peeps.  My old team members from a previous life.  Folks responsible for defending the Enterprise castle from the Barbarians that seek to take it over.  Who, you ask?

Information Security and their counterparts, Enterprise Networking.

The Enterprise Castle

For years, InfoSec and EntNet were responsible for defending the castle:

  • 10.0.0.0/8
  • 172.16.0.0/12
  • 192.168.0.0/16

We used firewalls, VPN tunnels, IDP, IPS, HTTP proxies, and other defense-in-depth ‘stuff’ to keep the bad guys out. Our data and processes were based around keeping the Enterprise castle safe and making sure the crown jewels remained in the king and queen’s vault.

Where Office365 Breaks the Castle

With the advent of ‘the Cloud’ and offerings like Office365, the castle mentality fails at the outset.  The Cloud is SaaS (Software-as-a-Service) and runs in data centers outside of your control.  You, the enterprise, interact with services that run over the public Internet and thus the ‘protect our internal stuff’ mentality fails because your ‘stuff’ isn’t internal anymore.  Despite that, the SaaS ‘castle’ mentality is similar to the Enterprise ‘castle’ mentality:  they still protect their internal stuff just as you do, but given that their castle contains jewels for multiple kings and queens, they operate in a new state that completely differs from how a single Enterprise operates.

When Office365 comes into the picture, InfoSec and EntNet usually step in and issue their list of demands:

  • Communication must be restricted only to specific IPs
  • Ports must be restricted to only TCP/UDP ports required
  • HTTPS traffic must be inspected
  • We control how/where the traffic goes
  • Etc…

Now this is fine-and-dandy to demand but the reality is that this is a tall order to implement.  More importantly, some of these simply cannot be attained no matter how much you dislike it.  Examples?  Ok, no problem…

Office 365 URLs and IP address ranges

In the URL above, Microsoft provides customers with a centralized list of all the IPs, FQDNs, and TCP/UDP ports required to interface with their Office365 services.  They even break it down by service…how nice of them!  When this list is presented to InfoSec/EntNet, the pushback is immediate and fierce:

  • There are hundreds of IP ranges listed!  Not allowed.  We need something more specific.
  • Our firewalls cannot handle DNS-based rules.  See demand #1.
  • The ports are too many (especially Skype4B)!  Not allowed.  We need to restrict them.
  • The FQDNs are too many and go across too many domains.  Not allowed.  We need to restrict them.
  • Etc.

While I ‘understand’ the asks above, the stark reality is that you won’t get your demands or there are significant issues with your asks:

  • While MSFT breaks down the IP ranges by service, due to the dynamic nature of HA/DR within Office365 for each service, your data could be accessible via any of the IP ranges listed.  If you restrict IP’s and a failover occurs within Office365 that makes a different IP block responsible for communication, and that IP block is not in your ‘allowed list’, then the fault is yours, not Microsoft’s.
  • While MSFT lists the TCP/UDP port ranges by service, you risk an outage if you alter your config to not allow the listed ports.  There is a functional reason for those port ranges being required and you risk causing a service disruption for your end-users if you deny the ports.  Don’t shoot yourself in the foot.
  • MSFT lists FQDNs for each of the services because it is easier to administer by DNS than by ranges of IP blocks.  MSFT adds and removes IP blocks and FQDNs from Office365 as required, so DNS-based resolution automatically keeps up with those changes if you can implement it.  Otherwise, you – the Enterprise – must keep track of changes that occur to the service and respond accordingly.
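
To illustrate the first point, here is a tiny sketch of what a trimmed-down allow list does when traffic shifts to a block you left out. The ranges below are documentation-reserved example networks (RFC 5737), not real Office365 blocks, and the function name is mine.

```python
import ipaddress
from typing import Iterable


def is_allowed(destination_ip: str, allowed_blocks: Iterable[str]) -> bool:
    """True if the destination falls inside any allowed CIDR block."""
    ip = ipaddress.ip_address(destination_ip)
    return any(ip in ipaddress.ip_network(block) for block in allowed_blocks)


# Example-only ranges: pretend the published list contained both blocks,
# but the firewall team only permitted the first one.
published = ["192.0.2.0/24", "198.51.100.0/24"]
permitted = ["192.0.2.0/24"]

print(is_allowed("192.0.2.10", permitted))     # normal traffic path
print(is_allowed("198.51.100.10", permitted))  # same service after a failover
print(is_allowed("198.51.100.10", published))  # would have worked with the full list
```

Nothing changed on your side between the first and second call; the service simply moved, and the shortened list turned that into an outage.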

Why OCM and the ‘petulant’ mentality matters

At the end of the day, Microsoft is providing you a service and guaranteeing that it will work with the published configuration.  Despite that published information, nothing in IT is static, and Office365 is no exception.  Almost every month Microsoft updates the Office365 FQDN/IP/Port page and announces those updates through an RSS feed that every single Office365 customer should follow, because that feed includes changes that are not yet active on the published FQDN/IP/Port lists:

https://support.office.com/en-us/o365ip/rss

What typically happens is that Enterprises often take the original lists, plug them into firewalls, IDP, IPS, HTTPS proxies, etc, and then move on to other tasks.  Until something breaks, that is, and then Microsoft support is involved and determines that the issue is because the Enterprise didn’t keep up with the changes in the Office365 service or that security was too restrictive:

  • Maybe firewall rules weren’t updated to account for new IP blocks.
  • Maybe HTTP proxies weren’t updated with new FQDNs.
  • Maybe firewall rules weren’t updated to account for new TCP/UDP ports.
  • Maybe client communication paths are using CDNs or other non-Microsoft controlled endpoints:
WARNING: IP addresses filtering alone isn’t a complete solution due to dependencies on internet based services such as Domain Name Services, Content Delivery Networks (CDNs), Certificate Revocation Lists, and other third party or dynamic services. These dependencies include dependencies on other Microsoft services such as the Azure Content Delivery Network and will result in network traces or firewall logs indicating connections to IP addresses owned by third parties or Microsoft but not listed on this page. These unlisted IP addresses, whether from third party or Microsoft owned CDN and DNS services are dynamically assigned and can change at any time.

Whatever the reason, it often boils down to a stagnant mentality by the Enterprise that change doesn’t occur, or that they don’t ‘agree’ with the change, or maybe it was just an honest mistake.  For instance, 61 IP sets were added to the Skype for Business Online service this month, and those IP addresses become effective on 12/1/2016.  You, as the Enterprise customer, simply don’t have an option on whether those IPs are used for your ‘stuff’.  If you draw a line in the sand and say “NO!  We don’t WANT it that way!”, then expect issues and egg on your face.  The better option is to keep up with the changes and implement as required so that things continue to function.
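
Keeping up doesn’t have to be a manual chore. As a rough sketch, the snippet below picks out feed entries newer than the last one you processed. It assumes a standard RSS 2.0 layout (`item` elements with `title` and `pubDate`), which I have not verified against the actual feed, and the sample XML is made up purely so the example is self-contained; fetching the feed and acting on the changes is left to you.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def new_items(rss_xml: str, last_seen: datetime):
    """Return (published, title) pairs for items newer than last_seen, oldest first."""
    root = ET.fromstring(rss_xml)
    found = []
    for item in root.iter("item"):
        pub_text = item.findtext("pubDate")
        if not pub_text:
            continue  # skip entries without a publication date
        published = parsedate_to_datetime(pub_text)
        if published > last_seen:
            found.append((published, item.findtext("title")))
    return sorted(found, key=lambda pair: pair[0])


# Made-up sample feed, assuming a plain RSS 2.0 structure.
SAMPLE_FEED = """<rss version="2.0"><channel>
<item><title>Example change A</title><pubDate>Sat, 01 Oct 2016 00:00:00 +0000</pubDate></item>
<item><title>Example change B</title><pubDate>Tue, 01 Nov 2016 00:00:00 +0000</pubDate></item>
</channel></rss>"""

pending = new_items(SAMPLE_FEED, datetime(2016, 10, 15, tzinfo=timezone.utc))
for published, title in pending:
    print(published.date(), title)
```

Run something like this on a schedule, open a change ticket for whatever it returns, and the ‘we didn’t know’ excuse goes away.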

Bottom Line

I “get it”.  I really do.  I understand the ‘old mentality’ and the ‘castle’ mindset.  That mindset will bite you though.

Enterprises and my fellow friends in InfoSec/EntNet must adapt to the changes and realities of a shared service like Office365.  Every decision made is a trade-off between security, usability, and risk.  Microsoft isn’t perfect and neither are Enterprises.  Microsoft is, however, doing its part in alerting Enterprises that may have stricter security needs.  We all hate change, myself included, but change is a bona-fide fact of life and those who don’t adapt will suffer (or fail) in their journey.  Please, please, please make sure you start thinking about OCM and how you will adapt to the dynamic structure of not only Office365 but other cloud services as well.  Your future truly does depend on it.

22Aug/16

Skype4B Online PSTN Conferencing Service Numbers

8/30/2016 – Additional information/clarification regarding outbound dialing & PSTN Consumption Billing

Microsoft is ever-expanding the availability of PSTN conferencing services within Skype4B Online, adding significant geographic footprint every 6 months or so.  Despite this increase, there is still rampant confusion about what is available and where it can be had.  Given how widespread the confusion is – especially amongst customers looking at the service – this post attempts to clear up some components to give a clearer picture of the service.

Where can you purchase it?

Your first step is to simply find out if PSTN Conferencing is available for your users.  This boils down to two separate requirements:

  1. Is PSTN Conferencing available where my Office365 tenant is located?
  2. Is PSTN Conferencing available where my end-users are physically located?

Microsoft does have an Office support article on this subject that outlines exactly what countries are available for PSTN Conferencing, but it doesn’t mean they explain it very well.  The list within the Office support article means that if your tenant or end-user is physically located within the list, the user is officially able to be licensed to utilize PSTN Conferencing features.  Microsoft also refers to this list as the ‘sell-to’ list.

I’m not in the list – can I still use it?

This is where things get tricky and very confusing.  In most cases it boils down to these scenarios:

  1. My Office365 tenant is in the ‘sell-to’ list but my end-user is located in a country that is not
  2. My Office365 tenant and my end-user are in the ‘sell-to’ list but callers who dial into my conferences are not

Scenario #1

For each user that you enable within Office365, there is a critical attribute that is required for each user object and that attribute is the ‘UsageLocation’ attribute:

Skype4BOnline-UserAttributes-UsageLocation

There are many blog posts out there that talk about how UsageLocation is utilized, and the PSTN Conferencing feature follows suit with what those other posts describe.  If your end-user exists in Pakistan, for example, you’ll notice that assigning the license isn’t available (or the license is removed), and it’s all because the UsageLocation attribute doesn’t match up with where the service is available.  I’ve seen some customers try to change the UsageLocation attribute, say in the aforementioned scenario, to the US, and after a short while the licensing then works.  The problem with this approach is twofold:

  • The Office365 terms of service don’t officially allow this
  • The UsageLocation attribute is used to help determine domestic VS international calling types when it comes to PSTN billing charges

Since the PSTN Conferencing Service allows dial-out as part of the functionality*, you could end up with tens-of-thousands of dollars of unexpected international call charges when you start altering the UsageLocation.  Take for instance the case where you change a Pakistani user’s location to the US and then they join a conference and have the conference dial back to their local number…  In this case, the call is international and not domestic, because their UsageLocation is ‘US’.  In nearly every scenario, this is not advisable and customers should not change usage locations just to get PSTN conferencing.
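
To make the billing consequence concrete, here is a deliberately simplified Python sketch. It assumes that a dial-out counts as domestic only when the destination country matches the user’s UsageLocation. The function name and the two-letter codes are illustrative, and Microsoft’s real billing logic is certainly more involved than this:

```python
def classify_dial_out(usage_location: str, destination_country: str) -> str:
    """Simplified model: domestic only when both country codes match."""
    if usage_location.upper() == destination_country.upper():
        return "domestic"
    return "international"


# A user physically in Pakistan whose UsageLocation was changed to 'US',
# asking the conference to dial back to a local Pakistani number.
print(classify_dial_out("US", "PK"))  # international

# The same dial-back had the attribute reflected the user's real location.
print(classify_dial_out("PK", "PK"))  # domestic
```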

*Note:  Prior to September 2016, users could utilize international dial-out functionality within PSTN Conferencing with no additional charges as Microsoft covered call charges under an ‘all-you-can-eat style’ method of billing.  Starting in September 2016 you must have PSTN Consumption Billing enabled within your tenant to support outbound international dialing within PSTN Conferencing.  Domestic outbound dialing should continue to be ‘free’ and does not require PSTN Consumption Billing.

Scenario #2

In this scenario you’ve got true access to utilize the PSTN Conferencing functionality but the callers who join your meetings may not be calling from a location that MSFT has in its ‘sell-to’ list.  This is somewhat less of a problem because Microsoft does actually have local dial-in numbers available in locations that are not included in the ‘sell-to’ list.  Callers can simply dial a number within a country closest to them and reach the conference without much thinking. If the caller can’t find a local, domestic number though, long distance and/or international call charges may apply until Microsoft adds numbers to that geography in the future.

Types of Numbers Available

When PSTN Conferencing originally was released, only shared local toll numbers were available for each geographic region.  Toll-free numbers were not available until recently, with the introduction of PSTN Consumption Billing.  Now that toll-free and toll numbers are available, the PSTN Conferencing feature set is a bit more complete and on-par with other solutions in the market.

Within the types of numbers available, there are three different configurations for those numbers:

  1. Shared Phone Numbers
  2. Dedicated Phone Numbers
  3. Service Phone Numbers

Shared Phone Numbers

These are essentially the toll numbers that have been available since the introduction of PSTN Conferencing.  These numbers are shared across the entirety of the Office365 infrastructure and any customer can utilize these numbers for inbound calling to their meetings.  Additionally, the language support for the IVR menu system cannot be changed for shared phone numbers.  Microsoft pre-populates these numbers within Office365 and all you must do is assign a number to a user account (matching the number’s geography with the location the user is in).  This list is by far the largest in terms of sheer scope of geography.

Dedicated Phone Numbers

This is new(-ish) and includes toll numbers that are specific to your organization/tenant.  These numbers are not shared with other Office365 customers.  To obtain dedicated phone numbers you have two options:

  1. Obtain a phone number directly from Microsoft
  2. Port an existing number from your on-premises PSTN provider to Microsoft

The important thing to note is that while these seem like great options, dedicated phone numbers have now been deprecated for use within PSTN Conferencing.  As part of this deprecation, Microsoft has begun to separate end-user phone numbers (dedicated phone numbers) from conferencing or auto-attendant phone numbers (service phone numbers).  As a result, dedicated phone numbers can no longer be utilized for PSTN Conferencing and must be used exclusively as end-user phone numbers.  The ‘Dedicated Phone Number’ functionality within PSTN Conferencing has shifted to Service Phone Numbers, even though Service Phone Numbers are still dedicated to your tenant.  It’s more of a logical distinction related to billing and capacity support.

Service Phone Numbers

This is new and is intended to take the place of Dedicated Phone Numbers for features such as PSTN Conferencing, Call Queues, and Auto-Attendants within Office365.  Service Phone Numbers allow customers to request dedicated numbers that are specific to their organization/tenant and use those numbers for the aforementioned functionality.  These numbers include toll and toll-free numbers within a subset of the countries that are supported by the Shared Phone Number functionality.  The current countries included in support are:

Skype4BOnline-ServiceNumbers-AvailableCountries1
Skype4BOnline-ServiceNumbers-AvailableCountries2

The list covers 26 countries where dedicated Service Numbers can be obtained for organizations to utilize for PSTN Conferencing functionality.  Even better, each Country/Region not only allows toll-free numbers but also allows you to request numbers specific to a region within that country.  This becomes advantageous in countries of large geography, say Australia, where a call from Perth to Sydney may be billed differently than an intra-Sydney call.  Obtaining numbers that are as local to your users as possible helps organizations keep calling costs down.  Service numbers additionally allow you to specify the language utilized for IVR menus, unlike shared numbers.

Despite the Office Support articles saying that porting numbers is an option for Service Numbers, there is a significant limitation in that Microsoft will only allow number porting for countries where PSTN Calling Services are active:

Skype4BOnline-NumberPorting-AllowedCountries

At the current time, that means that only US or UK numbers can be ported.  All other numbers are unavailable to be ported until services are expanded to reach the other geographies.

Shared Number Availability

Given that shared numbers are, well, shared, it may help some customers and architects to see what numbers are available prior to purchasing the PSTN Conferencing service.  Why bother, you ask?  Just because a number is available in a country doesn’t mean that it is a local call for someone within that country.  A user in Melbourne or Perth calling a PSTN number in Sydney is generally billed at a different rate than someone in Sydney calling a PSTN number in Sydney (same goes for the United States intra-region calling or UK intra-region calling).  Microsoft doesn’t publish these numbers publicly so the spreadsheet below may help architects in planning the cost structure of a rollout of PSTN Conferencing within Skype4B Online:

Service Number Availability

Given that Service Numbers are very, very new, I haven’t had a chance to put together a full spreadsheet of availability for the 26 countries and the regions supported.  Stay tuned as that will be forthcoming…

Wrapping Up

Hopefully this helps clear up some confusion around PSTN Conferencing.  The service and details are always changing – literally – so this will likely be outdated in a few months.  I will endeavor to keep this updated every 3-4 months or so to reflect the latest information available from Microsoft.

16Aug/16

Using Lync Server 2013 Persistent Chat Whilst Blocking IM Capabilities

In the midst of a 2010 to 2013 migration, a requirement was proposed that was, well, one of those ‘head scratcher’ asks:

"We are upgrading from Group Chat 2010 to Persistent Chat but we don't want Persistent Chat users to be able to IM each other.  IM must be disabled for the users who utilize Persistent Chat".

I’ll openly admit I struggled to understand why anyone would need to do that, but it was a business requirement from the managers of the specific business units so we simply had to take it as-is and move on.  One of the big advancements with Lync 2013 was that the Persistent Chat client was built in to the normal Lync client application and did not require the deployment of a separate application like Group Chat 2010.  Given the desire to upgrade to the newer client and newer back-end, a few options are available, all with caveats and issues to consider.  So without further ado, the options:

Option 1 – Keep Using the Group Chat 2010 Client

Given that the Group Chat 2010 client is supported with Persistent Chat, it does provide you a method to allow the restriction of IM but allow the usage of P-Chat.  It’s not exactly an elegant solution, however, especially if you want to take advantage of the built-in functionality of P-Chat within the Lync 2013 (or Skype 2015/2016) client.  You’re now maintaining two different sets of applications and having to ensure they only get installed on systems that require it.  Not ideal and certainly not the easiest of solutions.

Option 2 – Deploy Client-Side Registry Keys to Disable IM

This one comes from way back in the OCS days where you could utilize registry keys to manage the client modalities and functionalities.  Even with the newest clients, there are still registry keys that take precedence over what a client receives through in-band policy configurations.

reg add HKLM\Software\Policies\Microsoft\Office\15.0\Lync /v DisableIM /t REG_DWORD /d 1 /f

reg add HKLM\Software\Policies\Microsoft\Office\16.0\Lync /v DisableIM /t REG_DWORD /d 1 /f

reg add HKCU\Software\Policies\Microsoft\Office\15.0\Lync /v DisableIM /t REG_DWORD /d 1 /f

reg add HKCU\Software\Policies\Microsoft\Office\16.0\Lync /v DisableIM /t REG_DWORD /d 1 /f
  • Using Office 2013?  Make sure you’re using the ‘15.0’ key above.
  • Using Office 2016?  Make sure you’re using the ‘16.0’ key above.
  • Disabling IM for every user on a machine?  Make sure you use one of the ‘HKLM’ keys above.
  • Disabling IM for a specific user on a machine?  Make sure you use one of the ‘HKCU’ keys above.
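
The four commands above boil down to two choices: the Office version and the registry hive. As a rough illustration, a small Python helper could assemble the right command for a deployment script. The function name and the version lookup table are made up for this sketch, and only the reg add syntax itself comes from the commands above:

```python
# Maps the Office release to the version segment used in the policy key.
OFFICE_VERSIONS = {"2013": "15.0", "2016": "16.0"}


def build_disable_im_command(office_version: str, per_machine: bool) -> str:
    """Return the reg add command that sets DisableIM for the given scope."""
    hive = "HKLM" if per_machine else "HKCU"  # all users vs. current user
    key = "\\".join(
        [hive, "Software", "Policies", "Microsoft", "Office",
         OFFICE_VERSIONS[office_version], "Lync"]
    )
    return f"reg add {key} /v DisableIM /t REG_DWORD /d 1 /f"


# Office 2016, every user on the machine
print(build_disable_im_command("2016", per_machine=True))
# Office 2013, current user only
print(build_disable_im_command("2013", per_machine=False))
```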

You can add the registry key to your standard Windows Image, add it via Group Policy Preferences or add it to a batch file for usage by SCCM or login scripts.  However you accomplish it, it should look something like this when you’re done:

Skype2016-ClientRegistry-NoIM

Option 3 – Use Ethical Walling Software

This would be the equivalent of Hub Transport rules in the Exchange Server world, but in Lync Server these are applications that run MSPL and/or UCMA functionality to examine and intercept SIP traffic.  There are several third-party solutions that could be utilized for ethical walls:

  1. Ethical Wall for Lync (Microsoft)
  2. Vantage (Actiance)
  3. Ethical Wall (MultiEx)
  4. Ethical Wall (SkypeShield)

None of these are free, however, and given the desire to remain low-cost by this client, any third-party solutions were simply not an option.

The Result?

As a result of the deliberations, Option #2 above was chosen and implemented, and the registry key was pushed out to the enterprise.  Once the key is in place, restart the Skype4B (or Lync 2013) application and you’ll notice that IM capabilities are no longer available…

Skype2016-ClientRegistry-NoIMResult

…while the Persistent Chat functionality is available…

Skype2016-PChatAccess-NoIM

I can chat away, all day long, within the confines of Persistent Chat but I have no ability to utilize normal Instant Messaging features within the client.  Problem solved, right?

The Limitations

Given that this is a client-side registry key, it only applies to systems the key is installed on.  Unfortunately this leaves a very large gap of places that someone could use IM:

  1. Systems that don’t have the registry key deployed
  2. Outlook Web App IM
  3. Mobility Clients
  4. Skype Basic Clients (or Lync Basic)
  5. Lync for Mac 2011 clients
  6. Skype for Business for Mac client

Generally speaking, Options 1 & 2 at the top of the post are valid ways to prevent users – in a well-managed desktop environment – from utilizing IM.  That being said, there are still numerous ways to potentially circumvent these settings and send IMs.  Most circumventions could be managed by various policy configurations within Lync and/or Exchange, but in my opinion, you are far better off utilizing ethical walling software to limit those interactions and to provide reporting on users that may have breached those policy requirements.

20Jun/16

The Curious Case of Lync Server 2013 CLS “Running” but not Actually Working

This was one of those “ohhhh yeah…” moments – one of those issues that made complete sense once you were able to see root cause but where you weren’t initially able to “see the forest for the trees” when troubleshooting began.  It eventually took a case with Premier Support, along with multiple rounds of different Premier engineers, but we finally arrived at the root cause and a solution.  Without further delay, the issue:

Lync2013-CLSCmdlet-Blank
Success Code - 0, Successful on 1 agents

Some of you may look at the above screenshot and think, “uh…Trevor…so what?”, assume I’ve started drinking too early in the morning and move on to someone else’s blog.  The true secret, however, was that CLS wasn’t actually working and the screenshot above shows that the cmdlet output is missing the information that would confirm CLS was functional.  What I should have seen was this:

Lync2013-CLSCmdlet-WorkingExample

Notice how the cmdlet actually returned Tracing Status to me and indicated the status, scenario and computer I’m interacting with.  That’s the piece that was missing in this customer’s environment.

So to recap, this is what I knew to be true:

  1. CLS simply didn’t function at all in any capacity, even though the Windows Service for the CLS process was running.  No amount of service restart or server reboot would help.  No cmdlet requests were being processed or executed to make CLS do what it is supposed to do.
  2. It took a significantly long time – on the order of 5+ minutes – for any of the CLS cmdlets to complete.  I have always known that CLS isn’t exactly the Bugatti Veyron Super Sport of the logging world, but it has never taken that long for a simple Show-CsClsLogging cmdlet to complete in any deployment I’ve been involved in.
  3. This was happening on multiple servers (4 at the time the ticket was opened with Microsoft) so it definitely seemed like something had occurred or changed in this customer’s environment that would be at play with this issue.

Given what I knew about the environment and about the functioning of Lync Server and CLS, I started to dig in on my own…

First Things First

Knowing the fact that Sophos Anti-Virus has caused multiple Lync Server related issues in this environment in the past, I immediately began focusing my attention there.  I fired up Process Monitor to begin looking at process traces whilst I executed a CLS cmdlet via PowerShell, and in doing so I saw this:

Lync2013-CLSagent-SophosDetour
Sophos_Detoured_x64.dll

The CLSAGENT.EXE executable is having its operation forced through a Sophos DLL, “Sophos_Detoured_x64.dll”.  I had actually encountered this issue before and had previously penned a blog post involving security hardening to SQL Server, but I had not yet seen this cause an issue with CLS.  Knowing that this DLL detouring was neither supported by Microsoft nor a performance help in general, I went into regedit and $NULL’d out the following registry key entries per the instruction at Sophos:

HKLM\Software\Microsoft\Windows NT\CurrentVersion\Windows\AppInit_Dlls

HKLM\Software\Wow6432Node\Microsoft\Windows NT\CurrentVersion\Windows\AppInit_Dlls

Note:  If the AppInit_Dlls value contains any text – in my case it contained the NTFS file path to the Sophos_Detoured_x64.dll – then DLL detouring is being utilized by your Anti-Virus vendor.

I rebooted the server after making the registry change and tried running the CLS cmdlets again after the reboot, but I didn’t get any different results.  Things still appeared broken.

One Step Closer

I went back into Process Monitor and began looking at traces while I again executed a CLS cmdlet in PowerShell.  The process traces looked very different this time (with Sophos not in the picture) and I could see that the CLSAgent executable was actually trying to do something:

Lync2013-CLSagent-NoSophosDetour

The Process Monitor traces showed that the CLSAGENT.EXE process was stuck in a perpetual loop of “Thread Create” and then immediately after, a “Thread Exit”.  When comparing the log above to a server where CLS was functional, there is a significant difference:

Lync2013-CLSagent-WorkingExample

In the working example directly above, after one of the first “Thread Create” operations, you see the CLSAGENT.EXE process begin writing information to multiple .cache files in the “C:\Windows\ServiceProfiles\NetworkService\AppData\Local\Temp\Tracing” NTFS location.  From that point on in the trace, CLSAGENT.EXE seems perfectly happy.  On the non-working server, the traces never indicated getting to the point where logs were being written.

Final Examinations & Troubleshooting

Thinking logically, it seemed as though the CLSAGENT.EXE process was still potentially being interfered with, so I went through the list of items that made sense to check:

  • Is Sophos configured to exclude all Lync Server related NTFS directories and application executables from real-time scanning?
    • Confirmed that yes, exclusions are in place.
  • Can Sophos be turned off to fully confirm that it is not in any way playing a part?
    • Confirmed that yes, even with Sophos Anti-Virus turned off the end result did not change.
  • Are any firewall ports being blocked that would prevent CLS from functioning?
    • Confirmed both through Wireshark and Process Monitor that TCP 50,001-50,003 were open and that network flows were present.
  • Given that CLS runs under the NetworkService account, are any NTFS restrictions in place that would prevent the account from writing to the desired NTFS locations?
    • Confirmed that while some GPO configuration was present, there were no GPO settings that would prevent the NetworkService account from accessing or writing to the desired NTFS directories.

It was at this point that we got Premier Support from Microsoft involved.  It took a number of weeks and a number of engineers, but we finally had an answer presented to us this past week.

The Root Cause

Before just “showing my cards”, a little back ground info to help set the stage…

Dynamic Port Background

Starting in Windows Server 2008, Microsoft made a change to the way dynamic UDP/TCP ports are used within the operating system to bring it in line with IANA standards.  Prior to Windows Server 2008 the default dynamic port range was 1025-5000, but in Windows Server 2008 and newer the default dynamic port range changed to 49152-65535:

https://support.microsoft.com/en-us/kb/929851

What this means is that any process that may need to request a TCP port for networking communications (think about applications that may use RPC) will use, by default, an open port in the 49152-65535 range for communications. Additionally, you can customize those port ranges to help allay your InfoSec team’s concerns so that a smaller port range may be used – say ports 50000-55000.

Specifying Specific Ports for NetLogon

In addition to the dynamic ports of the OS, system administrators can set a few registry keys that direct the Windows OS to use specific ports for certain communications:

https://support.microsoft.com/en-us/kb/224196

Registry key 1

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NTDS\Parameters

Registry value: TCP/IP Port
Value type: REG_DWORD
Value data: (available port)

You need to restart the computer for the new setting to become effective.

Registry key 2

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters

Registry value: DCTcpipPort
Value type: REG_DWORD
Value data: (available port)

You need to restart the Netlogon service for the new setting to become effective.

Despite the KB article above talking about Active Directory, the important piece to remember is that LSASS.EXE is utilized even by member servers in the domain.  Additionally, LSASS.EXE is the parent process that spawns the NetLogon process so it consumes any settings that are configured on the base OS build:

https://blogs.technet.microsoft.com/askpfeplat/2015/01/11/rpc-endpoint-mapper-returns-dynamic-port-incorrectly-when-active-directory-is-configured-to-use-static-port/

How it all fits together

In this particular environment, the servers had been customized in regards to the dynamic port range configuration and in regards to static ports for the NetLogon service.  After multiple rounds of logs and investigation, the final engineer eventually focused in on what each ProcessID’s active network ports were on the system by using the ‘netstat’ command in conjunction with the ‘findstr’ command:

netstat -ano | findstr 5000

Lync2013-CLS-NetstatMissingCLS

The engineer then took the ProcessID from the results and looked in Task Manager to find which service or executable was tied to that ProcessID.  What the engineer eventually discovered was that there was another Windows process bound to the CLS TCP ports – lsass.exe.

Lync2013-CLS-LSASSConflict

LSASS.EXE is the Windows Local Security Authority Subsystem which is responsible for all security processing on a server including authentication, security policy processing, audit policy processing, and token processing.  But why would LSASS.EXE be listening on the port CLS wants to use?  The answer to that question is two-fold:

  1. Since LSASS.EXE relies on the dynamic port range configuration of the Windows OS, it simply looks for a random port available upon boot up.
    • Note:  Given that the servers also had a static port configuration set for the NetLogon service that overlapped the CLS ports, it meant that even reboots would not have solved the issue because the same port would have been used after every single reboot.
  2. Since LSASS.EXE starts much earlier in the boot process than CLSAGENT.EXE, it has free rein to bind to the TCP ports that CLS needs because the CLS service isn’t running yet.

As a result of the port range configuration and the boot process order, CLS was effectively being starved out of a port needed to function.
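
The netstat check the engineer performed by hand can also be scripted. The following Python sketch parses netstat -ano style text and reports which ProcessID is listening on each CLS port (TCP 50001-50003, as noted earlier). The function name, sample rows, and PIDs are invented for illustration and are not taken from the customer’s servers:

```python
CLS_PORTS = {50001, 50002, 50003}


def find_listeners(netstat_output: str, ports=CLS_PORTS):
    """Map each port of interest to the PID listening on it."""
    listeners = {}
    for line in netstat_output.splitlines():
        fields = line.split()
        # TCP listener rows: TCP <local addr> <foreign addr> LISTENING <pid>
        if len(fields) != 5 or fields[0] != "TCP" or fields[3] != "LISTENING":
            continue
        port = int(fields[1].rsplit(":", 1)[1])
        if port in ports:
            listeners[port] = int(fields[4])
    return listeners


# Hypothetical sample output; the PIDs are made up.
SAMPLE = """\
  TCP    0.0.0.0:135      0.0.0.0:0    LISTENING    712
  TCP    0.0.0.0:50001    0.0.0.0:0    LISTENING    548
  TCP    [::]:50001       [::]:0       LISTENING    548
"""

print(find_listeners(SAMPLE))  # {50001: 548}
```

On a live server the text would come from running netstat -ano itself, and each PID would then be matched to a process name in Task Manager, exactly as described above.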

The Fix?

In short, the fix was very simple:  change the dynamic port range and move the static NetLogon port configuration.

Dynamic Ports

This is a pretty easy change to implement using netsh commands:

Netsh int ipv4 set dynamicport tcp start=24419 num=16383
Netsh int ipv4 set dynamicport udp start=24419 num=16383

Netsh int ipv6 set dynamicport tcp start=24419 num=16383
Netsh int ipv6 set dynamicport udp start=24419 num=16383
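
The netsh syntax takes a starting port and a count rather than an end port, so it is worth double-checking where the range actually ends. A quick Python sketch of that arithmetic follows. The function and constant names are made up, the CLS ports are the TCP 50001-50003 mentioned earlier, and 40803 is the MCU boundary referenced in the aside at the end of this post:

```python
CLS_PORTS = range(50001, 50004)  # TCP 50001-50003
MCU_FLOOR = 40803                # ports above this are reserved for the MCUs


def dynamic_port_range(start: int, num: int):
    """Return the first and last port of a netsh dynamicport range."""
    return start, start + num - 1


first, last = dynamic_port_range(24419, 16383)
print(first, last)                                  # 24419 40801
print(last < MCU_FLOOR)                             # True
print(any(first <= p <= last for p in CLS_PORTS))   # False
```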

NetLogon Static Ports

This is also a pretty easy change to implement through regedit:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NTDS\Parameters

Registry value: TCP/IP Port
Value type: REG_DWORD
Value data: 30000

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters

Registry value: DCTcpipPort
Value type: REG_DWORD
Value data: 30001

The Results…

Following the change of the dynamic port range and the change of the NetLogon configuration, you need to reboot the servers in question.  Lo and behold…after doing so we had functional CLS:

Lync2013-CLSCmdlet-ServerFixed

A quick examination of netstat also confirmed that CLS was bound to the TCP ports, as expected:

Lync2013-CLS-FixedPorts

Short Re-Cap

In the end it was part configuration error and part dumb luck.  This customer had overlapped their dynamic TCP port and static TCP port allocations with ports that CLS wanted to use.  It was simply the luck-of-the-draw that LSASS.EXE had grabbed one of the CLS ports, all because LSASS.EXE starts when Windows boots…long before CLSAGENT.EXE, which is set to “Delayed Start” by default.  When you combine everything together, you get the case of CLS “running” but not actually working.

Short Aside:

If you are using Microsoft’s recommendations for TCP/UDP ports for Lync/Skype4B QoS, any port above 40803 should be dedicated for the various Lync/Skype4B MCUs.  Don’t overlap your dynamic or static OS ports in the same range that your MCUs will operate in…instead move it below the 40803 range, as we did in this fix. You may have to ensure your firewall rules are updated to reflect the new SrcPort for communications (if you are using internal firewalling), but it’s a small price to pay to be able to actually use CLS!

13Jun/16

Multiple E-911 Number Support in Skype4B

Update 9/28/2016 – Including information for Lync Phone Edition Cumulative Update that adds support for multiple E-911 number functionality.  Fixes for client support table.

Update 7/15/2016 – Including information for official release via CU3 for Skype for Business Server, including information for official release via July 5, 2016 CU for Office 2016 Skype for Business client.

E-911 is a passion of mine, largely due to some past experiences in my life that have proven the importance of proper E-911 setup within unified communications environments. Despite having done the configuration multiple times within many Lync Server deployments, there has always been a critical limitation within Lync Server with regard to E-911. Most US-based deployments never really encountered this issue because we here in the US have a single emergency number to call for all emergency types, but when you step into a country like Switzerland, a breakdown occurs because Switzerland offers multiple emergency numbers for different types of emergencies. Lync Server 2010/2013 couldn’t support the multiple number configuration of these countries within the context of the platform’s E-911 logic, which resulted in an incomplete solution (and potentially dangerous workarounds). This problem remained throughout the Lync Server 2013 lifecycle and even through RTM of Skype for Business Server 2015 until July of 2016, when Microsoft officially added multiple emergency number support within Skype for Business, alleviating the issues that had plagued the platform. This is a BIG step forward in making the platform more flexible but isn’t without limitations. It’s a more holistic turn-key solution, but is still incomplete and has some pieces left to fill.

Understanding the Problem

Within Lync Server 2010/2013, there simply did not exist a way to support the dialing of multiple emergency numbers for countries that support it.  For example, in Switzerland you could dial 117 for Police, 144 for Ambulance, and 118 for Fire, in addition to 112 for the EU-centric global emergency number.  In Lync Server 2010/2013 you absolutely could account for users dialing those numbers within a location policy, but the mapping was a 1:M association:

Skype4B-E911-SingleNumberSupport
An E9-1-1 dial mask is effectively a normalization rule, defining all the different digits that are recognized to indicate an emergency call.

An E9-1-1 dial number is the singular number that all dial masks map to and is the number that actually gets dialed by the Skype for Business client.

As you can see, there is a 1:M mapping which effectively means that while you can have multiple numbers defined in the mask, they all get translated into a single emergency number of 112. What this also means is that you cannot support calling the other emergency numbers specified within the dial mask due to the lack of multiple number support calling within the Location Policy logic.
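
To illustrate the limitation, here is a minimal Python model of the single-number logic. It is a sketch of the behavior described above, not Microsoft’s implementation. The function name is invented, and the mask string simply uses the Swiss numbers from the example, separated by semicolons:

```python
def resolve_emergency_call(dialed: str, dial_number: str, dial_mask: str):
    """Return the number actually dialed, or None for a non-emergency call."""
    masks = [m for m in dial_mask.split(";") if m]
    if dialed == dial_number or dialed in masks:
        return dial_number  # every mask collapses to the one dial number
    return None


# Police (117), Fire (118) and Ambulance (144) all end up as 112.
for dialed in ("117", "118", "144"):
    print(dialed, "->", resolve_emergency_call(dialed, "112", "117;118;144"))
```

Whatever the user dials, only 112 ever leaves the client, which is exactly why the individual services could not be reached directly.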

Note: Some readers would correctly point out that you could build normalization rules into dial plans to allow users to dial each of those numbers. That being said, defining those numbers within a dial plan allows external users and mobile users to make calls to those numbers. The whole purpose of E9-1-1 is to precisely map the caller’s location to a known civic address location via LIS.  LIS functionality isn’t natively available for mobility clients nor external users. I am a strong opponent of allowing normalization rules in dial plans for emergency calls and do not advise this approach.

So what’s New?

The new addition is that Microsoft has added the ability to have the location policy recognize unique and independent combinations of emergency dialing masks and emergency dial numbers, effectively giving an M:N option for emergency numbers.

https://technet.microsoft.com/en-us/library/mt723406.aspx

When planning for multiple emergency numbers, keep the following in mind:

• You can define up to five emergency numbers for a given location.

• For each emergency number, you can specify an emergency dial mask, which is unique to a given location policy.

A dial mask is a number that you want to translate into the value of the emergency dial number value when it is dialed. For example, assume you enter a value of 212 in this field and the emergency dial number field has a value of 911. When a user dials 212, the call will be made to 911. This allows for alternate emergency numbers to be dialed and still have the call reach emergency services (for example, if someone from a country or region with a different emergency number attempts to dial that country or region’s number rather than the number for the country or region they are currently in). You can define multiple emergency dial masks by separating the values with semicolons. For example, 212;414. The string limit for a dial mask is 100 characters. Each character must be a digit 0 through 9.

• Each location policy has a single public switched telephone network (PSTN) usage that is used to determine which voice route is used to route emergency calls from clients using this policy. The usage can have a unique route per emergency number.

• If a location policy has both the EmergencyNumbers and DialString parameters defined, and the client supports multiple emergency numbers, then the emergency number takes precedence. If the client does not support emergency numbers, then the emergency dial string is used.

• For information about which Skype for Business and Lync clients support receiving multiple emergency numbers, dial masks, and public switched telephone network (PSTN) usages, see Supported clients.
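
The planning limits quoted above are easy to check before you ever touch a server. The following Python sketch is only a pre-flight check I would run against a planning spreadsheet; the function names are made up, and I am assuming the 100-character limit applies to the whole mask string including the semicolons, which the documentation does not spell out.

```python
MAX_EMERGENCY_NUMBERS = 5     # "up to five emergency numbers for a given location"
MAX_DIAL_MASK_LENGTH = 100    # string limit for a dial mask (assumed to include ';')


def validate_dial_mask(dial_mask):
    """Return a list of problems found in a semicolon-separated dial mask."""
    problems = []
    if len(dial_mask) > MAX_DIAL_MASK_LENGTH:
        problems.append("dial mask exceeds 100 characters")
    for part in dial_mask.split(";"):
        # Each value between semicolons must consist only of the digits 0-9.
        if not (part.isascii() and part.isdigit()):
            problems.append("not a digit string: %r" % part)
    return problems


def validate_plan(numbers):
    """numbers is a list of (dial_string, dial_mask) tuples for one location policy."""
    problems = []
    if len(numbers) > MAX_EMERGENCY_NUMBERS:
        problems.append("more than five emergency numbers defined")
    for dial_string, dial_mask in numbers:
        for issue in validate_dial_mask(dial_mask):
            problems.append("%s: %s" % (dial_string, issue))
    return problems


print(validate_plan([("911", "212;414")]))
```

An empty list means the plan fits within the documented limits; it says nothing about whether the server will accept it.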

To configure this new feature, you must abandon the GUI and turn to PowerShell, using the new New-CsEmergencyNumber cmdlet. This cmdlet allows you to create the individual mask-to-number mappings, within the limitations above, of course.

Step 1 – Research your Emergency Number Needs

The first task you should take is simply to define your number mappings.  Ask yourself these questions:

  • Given the locale of the office location, how many different number types do I need to support?
  • Given the locale of the office location, do I need to account for other regions’ emergency numbers being dialed by visiting personnel?

Step 2 – Plan your Emergency Number Mappings

Once you have identified the needs above, you can create a table that outlines the configuration you will put into place:

Location Policy Name            | Emergency Dial String | Emergency Dial Mask | PSTN Usage
CH-BE-Bern-Hochschulstrasse-FL1 | 117                   | 117                 | CH-BE-Bern-Hochschulstrasse-Emergency
CH-BE-Bern-Hochschulstrasse-FL1 | 144                   | 144                 | CH-BE-Bern-Hochschulstrasse-Emergency
CH-BE-Bern-Hochschulstrasse-FL1 | 118                   | 118                 | CH-BE-Bern-Hochschulstrasse-Emergency
CH-BE-Bern-Hochschulstrasse-FL1 | 112                   | 112;999;911         | CH-BE-Bern-Hochschulstrasse-Emergency

You’ll notice in my table above, I’ve accounted for each of the individual emergency types first and mapped them directly to their unique dial string. This configuration ensures that each user can dial each emergency number and not have that number be changed in any way. The last line of the table is the “catch all”, allowing users to dial the EU-centric emergency number ‘112’, along with some other emergency numbers from other locales (such as the UK and the US) that will automatically map to ‘112’. This final configuration helps to ensure that emergency calls complete for users who may not know the specific emergency number for a given locale (think visitors).
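
If it helps to see the table as logic, here is a small Python sketch of how I think about the lookup. It is a mental model only: the function name is hypothetical, the pairs are copied from the table above, and the first-match-wins ordering is my assumption rather than documented client behavior.

```python
# (dial string, dial mask) pairs copied from the planning table above.
EMERGENCY_NUMBERS = [
    ("117", "117"),
    ("144", "144"),
    ("118", "118"),
    ("112", "112;999;911"),
]


def resolve_emergency_call(dialed, numbers=EMERGENCY_NUMBERS):
    """Return the dial string for a dialed number, or None if no entry matches."""
    for dial_string, dial_mask in numbers:
        # A call matches an entry on its dial string or on any value in its mask.
        if dialed == dial_string or dialed in dial_mask.split(";"):
            return dial_string
    return None


for number in ("117", "144", "118", "112", "999", "911"):
    print(number, "->", resolve_emergency_call(number))
```

The Swiss numbers pass through unchanged, while the UK and US numbers fall through to the catch-all entry and come out as 112.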

Step 3 – Configure the Number Mappings

PowerShell will be your friend here, as it is the only method by which you can configure this new functionality. Do not attempt to configure this within the Control Panel web portal, as you won’t find it anywhere!  Open up your Skype for Business Server Management Shell and add the configuration:

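# Quote any dial mask that contains semicolons; unquoted, PowerShell treats ";" as a statement separator.
$a = New-CsEmergencyNumber -DialString 117 -DialMask 117
$b = New-CsEmergencyNumber -DialString 144 -DialMask 144
$c = New-CsEmergencyNumber -DialString 118 -DialMask 118
$d = New-CsEmergencyNumber -DialString 112 -DialMask "112;999;911"
# Attach all four mask-to-number mappings to the location policy.
Set-CsLocationPolicy -Identity CH-BE-Bern-Hochschulstrasse-FL1 -EmergencyNumbers @{add=$a,$b,$c,$d}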

Step 4 – Configure Legacy Client Number Mappings

This is done via the CSCP GUI, just as before. Configure this for the legacy clients that still need emergency services information via the legacy location policy logic. Remember that these clients will be limited to a single emergency number, so make sure to utilize the most global emergency number such as ‘112’:

Skype4B-E911-LegacySingleNumber
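
The precedence rule from the planning notes (EmergencyNumbers wins when the client supports it, otherwise the legacy dial string is used) can be sketched in a few lines of Python. The dictionary keys and the function name here are hypothetical stand-ins of mine, not real policy properties.

```python
def effective_emergency_targets(policy, client_supports_multiple):
    """Pick which part of a location policy a given client would act on.

    policy is a plain dict standing in for a location policy; the keys
    'emergency_numbers' and 'dial_string' are hypothetical names.
    """
    numbers = policy.get("emergency_numbers") or []
    if client_supports_multiple and numbers:
        # Newer clients: the multiple emergency number entries take precedence.
        return ("EmergencyNumbers", [dial_string for dial_string, _mask in numbers])
    # Legacy clients: only the single emergency dial string is available.
    dial_string = policy.get("dial_string")
    return ("DialString", [dial_string] if dial_string else [])


policy = {
    "dial_string": "112",
    "emergency_numbers": [("117", "117"), ("144", "144"),
                          ("118", "118"), ("112", "112;999;911")],
}
print(effective_emergency_targets(policy, client_supports_multiple=True))
print(effective_emergency_targets(policy, client_supports_multiple=False))
```

This is why the legacy setting still matters: any client that ignores the new entries falls back to that one number.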

Step 5 – Configure Location Policy PSTN Usage

Take a look at your PSTN Usage assigned to your location policy. Remember that your PSTN Usage determines the available voice routes for the calls users make, so you need to ensure that the voice route assigned to the PSTN Usage allows all the different emergency dial numbers you have configured.

Skype4B-E911-LegacySingleNumber
Skype4B-E911-MultipleNumberSUpport-PSTNUsage

If needed, make changes to your PSTN Usage, otherwise simply make note of the voice route(s) you need to edit and move on to step 6.

Step 6 – Edit your Emergency Number Voice Route

This can be accomplished by the CSCP GUI. Go into your voice routing configuration and edit the appropriate emergency voice route (the ones tied to the PSTN Usage in Step 5) to now match all the emergency numbers you have configured. Simply use a logical “|” (or) in the matching rule:

Skype4B-E911-MultipleNumberSupport-VoicRoute
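
Before committing a pattern, I like to sanity-check it offline. Below is a rough Python sketch of that check. The pattern is my own illustrative guess for the Swiss numbers in this post, not the exact rule from my screenshot, and Python's regex engine is only a stand-in for the .NET engine the server uses.

```python
import re

# Assumed pattern for illustration only: an optional "+" followed by
# exactly one of the configured emergency dial strings.
EMERGENCY_ROUTE_PATTERN = re.compile(r"^\+?(112|117|118|144)$")


def route_matches(number):
    """True when the dialed number would be matched by the emergency route."""
    return EMERGENCY_ROUTE_PATTERN.match(number) is not None


for candidate in ("112", "117", "118", "144", "+112", "911", "1177"):
    print(candidate, route_matches(candidate))
```

Note that 911 and 999 do not need to match here in this model, because the dial mask has already rewritten them to 112 before routing.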

Step 7 – Commit Changes

Commit all your changes and test, test, TEST!

Limitations of this Approach

Nothing is 100% fool-proof and the new multiple emergency number support falls in line with that statement.

Clients

There are now three separate clients that support the multiple E-911 number functionality, a large increase from the original support of one when this feature was first released:

Mobility, Lync for Mac 2011, and legacy 2013/2015 clients seem to be excluded from support.  In addition to that list are unknowns about Polycom VVX phones (or other 3PIP phones from AudioCodes, for example) and the new Skype for Business 2016 client for Mac.  Note:  Given that the new Mac client works off the same functionality as the mobility clients (UCWA), I can almost guarantee support is not available at this time.  I’m sure Microsoft will continue to add to this list as time goes on, but be aware of this limitation and check back on TechNet to find out the latest client supportability.

Note: Sadly, this is simply another reason to make sure you are staying up-to-date with software releases, as the best stuff is only available in the latest versions.

Servers

With Microsoft now releasing information on this, you must have the CU3 update installed for Skype for Business Server 2015.  Microsoft may add this functionality back into Lync Server 2013 (it seems to be back-porting much functionality these days), but I wouldn’t hold my breath.

Wrapping Up

This addition is a BIG-WIN for non-US based Skype4B deployments and adds a sorely missing feature. While I’ve focused mainly on non-US for this post, there are distinct cases where additional emergency numbers could be utilized within the US, such as corporations, manufacturers or hospitals that would require the ability to have multiple emergency numbers.  Those organizations could allow unique emergency number combinations for any number of scenarios that may meet internal life-safety requirements.

23May/16

Dissecting a failed CMS Migration in Lync Server 2013 due to LRS Meeting Portal

Ah…CMS…the Centralized Management Store…  How important you are, yet so little the respect you are given.  I gaze upon your XML and bask in its glory…  The simplicity and elegance of your knowledge of all things Lync/Skype4B…  The heartache your unknown ways cause when things don’t always go as planned…

For those following along with my blog, you may have noticed that I had posted a CMS migration guide a while back.  The guide itself is solid and I’ve used it within many different migrations within Lync Server 2010, Lync Server 2013 and Skype4B Server 2015.  Despite the fact it exists, it doesn’t mean that it is correct 100% of the time.  I was reminded of this last week when I encountered a unique CMS migration scenario that forced me to dig way in and use some out-of-the-box thinking to work around a limitation of the Move-CsManagementServer cmdlet.  In the end it was all OK but my pain will hopefully be your gain.

The Environment

The environment consisted of two Enterprise Edition Front End Pools:  one Lync 2010 pool and one Lync 2013 pool.  Each pool consisted of five servers.  The CMS store was located on the 2010 pool and the goal was to migrate it to the 2013 pool, as shown in the picture below:

Lync-CMStopology-start

Following along with the process of my CMS migration post, I was ready to run Step 7 from a server in the receiving pool.  I tee’d up the cmdlet, hit ‘Enter’, watched with a smile as the cmdlet progressed as expected and then my heart sank:

This cmdlet moves Central Management Server to the pool that contains this computer.

 Current State:
 Central Management Server Pool: "pool01.domain.com"
 Central Management File Store: "\\FQDN\share"
 Central Management Store: "SQLFQDN1.domain.com\sqlinst1"
 Central Management Store SCP: "SQLFQDN1.domain.com\sqlinst1"

 Proposed State:
 Central Management Server Pool: "pool02.domain.com"
 Central Management File Store: "\\FQDN\share"
 Central Management Store: "SQLFQDN2.domain.com\sqlinst2"
 Central Management Store SCP: "SQLFQDN2.domain.com\sqlinst2"


Do you want to move the Central Management Server, Central Management Store, and File Store in the current topology and assign permissions for computers in Active Directory? (Note: Please read the help provided for this cmdlet using
the Get-Help cmdlet before you proceed.)
[Y] Yes [A] Yes to All [N] No [L] No to All [S] Suspend [?] Help
(default is "Y"):A

...

WARNING: Move-CsManagementServer failed.
Move-CsManagementServer : Failed to execute the following PowerShell command -. "D:\Program Files\Microsoft Lync Server 2013\Deployment\Bootstrapper.exe" .
At line:1 char:1

The Error

Looking at the error, my expectation was that the cmdlet failed to install the CMS replication components on the local server. When I examined the HTML log file it became apparent that the error didn’t even have anything to do with the CMS replication components:

Lync-CMSMigration-Error
Installing MeetingRoomPortalInstaller.msi(Feature_Web_MeetingRoomPortal_Ext, Feature_Web_MeetingRoomPortal_Int)...Log file was: %TEMP%\Bootstrap-CsMachine-[2016_05_19][23_45_58].html
End executing command "9".
Error: Failed to execute the following PowerShell command - . "D:\Program Files\Microsoft Lync Server 2013\Deployment\Bootstrapper.exe" . 

Verify that the account used to run this cmdlet has sufficient permissions to run bootstrapper located at "D:\Program Files\Microsoft Lync Server 2013\Deployment\Bootstrapper.exe". 

Note: The move operation failed after modifying the topology. This means that there are no active Central Management services to replicate configuration changes. To complete the move please run this cmdlet again after the issues encountered during this run are resolved. If the issues cannot be resolved then run the cmdlet on the original pool with -force option to rollback the move operation.

After picking my heart up off the floor, I began to focus on the error.  MeetingRoomPortalInstaller.msi???  Why on earth would the CMS cmdlet be looking for that MSI?  For whatever reason the cmdlet was expecting it and it borked leaving CMS in a horrible state:

 Current State:
 Central Management Server Pool: "pool02.domain.com"
 Central Management File Store: "\\FQDN\share"
 Central Management Store: "SQLFQDN2.domain.com\sqlinst2"
 Central Management Store SCP: "SQLFQDN1.domain.com\sqlinst1"

The CMS topology document indicated that the new pool was the CMS master, but the Active Directory SCP object indicated that the old pool was the master.  Effectively both sides are referring one another to the other pool.  Great.
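
This kind of split state is easy to miss when you are skimming console output under pressure. As an illustration only, here is a small Python sketch that parses the "Current State" text shown above and flags a disagreement between the topology's store and the AD SCP. The function names are mine, and the parsing assumes the exact 'Label: "value"' layout printed in this post.

```python
import re

# The broken "Current State" block from this post, pasted as-is.
STATE_OUTPUT = r'''
 Central Management Server Pool: "pool02.domain.com"
 Central Management File Store: "\\FQDN\share"
 Central Management Store: "SQLFQDN2.domain.com\sqlinst2"
 Central Management Store SCP: "SQLFQDN1.domain.com\sqlinst1"
'''


def parse_cms_state(text):
    """Turn 'Label: "value"' lines into a dict keyed by label."""
    state = {}
    for line in text.splitlines():
        match = re.match(r'\s*(.+?):\s*"(.*)"\s*$', line)
        if match:
            state[match.group(1)] = match.group(2)
    return state


def cms_is_consistent(state):
    """True when the topology's store and the AD SCP name the same SQL instance."""
    store = state.get("Central Management Store", "")
    scp = state.get("Central Management Store SCP", "")
    return bool(store) and store.lower() == scp.lower()


state = parse_cms_state(STATE_OUTPUT)
print("consistent:", cms_is_consistent(state))
```

Run against the healthy "Current State" block from before the move, the same check comes back consistent.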

Abort!  Abort!

With the customer on the call with me, tensions are beginning to rise.  The order was given to roll back CMS.  So I try running the move commands from a server in the Lync 2010 pool…and that fails, too, with the 2010 shell complaining about entries in the XML file.  Ugh.  We’re officially in a pickle now, and we’ve got no path but forward.  I go grab some coffee and settle in for a long night.

Some Background Info

Going back to the error, I start asking questions…

Me:  “The MeetingRoomPortalInstaller.msi is utilized for Lync Room System management…do you utilize this in your environment?”

Customer:  “Uh…we don’t know.  We have many LRS systems, but we have no knowledge of this portal you’re talking about.”

Me:  “That software is not listed in Add/Remove Programs on this machine…is there a reason it is not installed on this machine?”

Customer:  “That server was recently added to our pool.  We didn’t install any additional software during the provisioning.”

*chatter…grumbling…cross talk*

Me:  “I’ve examined the web.config file for the LRS portal on one of the other servers and it doesn’t appear that it is truly configured for proper operation.  Instead it just looks like someone installed the MSI file and left it as is.”

Customer:  *silence*  “Uh…ok…  So now what?”

I go ahead and install the LRS portal components on the server and try running the move cmdlet again, but it complains that the new pool already owns CMS and won’t complete.  Ugh!

Digging Deeper

Apparently the LRS portal installer adds entries to the CMS topology when it gets installed on a server within a Front End pool and as a result, bootstrapper expects the software to be installed.  Since it wasn’t installed, bootstrapper fails which in turn resulted in the CMS move cmdlet failing as well.  Unfortunately for me it resulted in CMS being in a half-baked state where the topology doc says it moved but Active Directory says it didn’t.  Back to the logs…

Importing the LIS configuration data into the new Central Management Store.
Executing ImportLisConfigurationTask.
Importing Location Information Services (LIS) configuration.
Begin executing command "5": "Import-CsLisConfiguration -FileName "C:\Users\username\AppData\Local\Temp\2\Move-CsManagementServer-CsLisConfiguration-New-3-316cfe43-bca1-488b-a285-87a06c05ad0b.zip"".

Executing MoveCmsInTopologyTask.
Exporting Central Management Store configuration.
Importing Central Management Store configuration.
Begin executing command "6": "Import-CsConfiguration -FileName "C:\Users\username\AppData\Local\Temp\2\Move-CsManagementServer-CsConfiguration-New-2-722d35fe-c75a-4a87-b3c2-efec483a1bce.zip"".
End executing command "6".

The logs showed that the CMS data itself did get moved to the new store.  Additionally, the CMS replication components did get installed on the new server even though bootstrapper failed on the LRS components:

Installing MgmtServer.msi(Feature_MGMTServer, Feature_FTA, Feature_Master)...success
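
Picking installer results out of a long bootstrapper log by eye is tedious, so here is a hedged Python sketch of how one might summarize them. It is built only from the two "Installing …" lines quoted in this post: I am assuming a line that ends in "...success" succeeded and that anything else (such as the "Log file was:" tail) deserves a closer look. The function name is hypothetical.

```python
import re

# Matches lines such as:
#   Installing MgmtServer.msi(Feature_MGMTServer, Feature_FTA, Feature_Master)...success
INSTALL_LINE = re.compile(
    r"Installing (?P<msi>\S+\.msi)\((?P<features>[^)]*)\)\.\.\.(?P<result>.*)"
)


def summarize_installs(lines):
    """Map each MSI name to True when its line ends in 'success', else False."""
    results = {}
    for line in lines:
        match = INSTALL_LINE.search(line)
        if match:
            results[match.group("msi")] = match.group("result").strip() == "success"
    return results


sample = [
    "Installing MgmtServer.msi(Feature_MGMTServer, Feature_FTA, Feature_Master)...success",
    r"Installing MeetingRoomPortalInstaller.msi(Feature_Web_MeetingRoomPortal_Ext, "
    r"Feature_Web_MeetingRoomPortal_Int)...Log file was: %TEMP%\Bootstrap-CsMachine-[2016_05_19][23_45_58].html",
]
for msi, ok in summarize_installs(sample).items():
    print(msi, "OK" if ok else "CHECK")
```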

Remember that at the moment, CMS is not functional.  I can’t start the replication services on the new pool because the AD connection point resolves to the legacy pool and the services stop when they detect a database schema mis-match:

Lync-2013Pool-CMSServiceStopError

I also can’t start the replication services on the legacy pool because the topology document says the new pool is the CMS owner and the services stop upon this detection:

Lync-2010Pool-CMSServiceStopError

The environment has no functional CMS replication until this can be resolved, so the clock is ticking and people are continuing to get ‘twitchy’.  I come back to what I know:

  1. CMS data did get moved to the new database
  2. Active Directory did not get updated to point to the new SQL server

I’ve got one last ditch effort to see if I can get this up and running before we have to call Microsoft support.  At this point, if I update AD to resolve to the new SQL server there is a chance that I can get CMS components to be functional again.

A Fix?

There are cmdlets available that allow you to forcibly change the CMS configuration point and Active Directory SCP object, so I tee up the following commands and hit ‘Enter’:

Set-CsConfigurationStoreLocation -SqlServerFqdn "SQLFQDN2.domain.com" -SqlInstanceName "sqlinst2" -Verbose
Set-CsManagementConnection -StoreProvider Sql -Connection "SQLFQDN2.domain.com\sqlinst2" -Verbose

Note:  Be sure to make note of the original configuration of the cmdlet values, in case you need to revert the changes back to the original pool!

I wait about 15 minutes for Active Directory replication to fully complete and then attempt to start the replication services on the new pool server.  I watch the event log and filter for ‘LS File Transfer Agent Service, LS Master Replicator Agent Service, LS Replica Replicator Agent Service’.  As I’m watching entries come in, the expected entries from my CMS migration blog post come by, indicating the server detected the new values in AD and has assumed the active CMS master role.

LS Master Replicator Agent
Lync-2013Pool-CMSMasterReplicatorSuccess
LS File Transfer Agent
Lync-2013Pool-CMSFileTransferSuccess

I continue to run Step 10 of my post and examine CMS replication…within 5 minutes CMS replicas all report up to date.   Success!  I log in to Topology Builder and pull down the topology, and all entries are there as expected with the new pool showing as CMS master.  I check the LIS database and all 911 information is there.  With a Hail Mary pass, I somehow averted a disaster.

Bottom Line

This little find was wholly unexpected.  Who would’ve thought that an error like this could have resulted in CMS migrations going haywire?  Had I run the cmdlet from a server that already had the LRS portal components on it, I likely would not have had any issues and it would have gone without incident.  At the end of the day this was caused by dependent software (LRS meeting portal) not being installed on a new server that was recently added to a front end pool.  I guess this is something I should start checking for, ensuring all software is present and accounted for, before attempting CMS migrations.  This is also a wake-up call for administrators and architects to make sure that your change management processes alert you to inconsistencies like this!

16May/16

‘Preliminary Primary FileShareName Parameter is Unusable’ with Lync Server 2013

I’ve been working through a Lync Server 2010 to Lync Server 2013 migration recently, and the error below popped up after I upgraded an existing Lync Server 2010 SBA to Lync Server 2013:

Skype4B-EventID32080
Event 32080, LS Storage Service

A queue flush operation has encountered a file error.

Preliminary primary fileShareName parameter: is unusable.  Exception: System.ArgumentNullException: Value cannot be null.
Parameter name: path
     at System.IO.DirectoryInfo..ctor(String path)
     at Microsoft.Rtc.Internal.Storage.Sql.LyssDal.ValidateFileShareName(StoreContext ctx, string fileShareName, String timestamp)
Cause:  There may be permission issues to the file share, local file location, temporary directory, or disk is full.

This is the first time I’ve seen this error – ever – so I was a bit perplexed as to what was causing this.  The error involved the Lync Server Storage Service (LYSS), which is not exactly the best documented piece of Lync Server (or Skype for Business Server), so finding the root cause might be like finding a needle in a haystack.  LYSS, for the curious folks out there, looks like this from a conceptual point of view:

Skype4B-LyncServerStorageService-Conceptual

Note:  For more in-depth information about LYSS, see Mattias Kressmark’s blog post on the topic.

Given that the LYSS Database on each registrar is a temporary storage ground for many Lync Server related activities, it didn’t initially make much sense that LYSS was involving a file share. The only file share that could come to my mind was the Front End pool file share which is defined in topology.  When opening up topology and looking at the file share configuration, some alarm bells started to sound:

LyncServer-FileShareConfig-Invalid
File server FQDN:  NETBIOS
File share:  SHARE\FOLDER

I’ve changed the true names above to protect the innocent, but the configuration above gives enough of a picture.  To my eyes there were two significant issues:

  1. The File server FQDN was not defined as an FQDN.  It was defined as NETBIOS.
  2. The File share was not defined as a share.  It was defined as a share\folder.

Nearly all Microsoft documentation is pretty clear on defining the file share as FQDN\Share.  Could it be possible that this is truly the root cause?
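
The two concerns above are simple enough to check mechanically. Here is a minimal Python sketch of such a check. It is a heuristic of my own (a name without a dot is treated as NETBIOS, and a share containing a path separator is treated as share\folder), not the validation that Topology Builder performs, and the "good" server and share names in the usage lines are made up.

```python
def check_file_store(server, share):
    """Return warnings for a file store definition (heuristic, assumed rules)."""
    warnings = []
    # Heuristic: a name without an inner dot is treated as NETBIOS, not an FQDN.
    if "." not in server.strip("."):
        warnings.append("file server is not an FQDN")
    # A share name should be a single path segment, not share\folder.
    if "\\" in share.strip("\\") or "/" in share:
        warnings.append("file share includes a sub-folder")
    return warnings


print(check_file_store("NETBIOS", "SHARE\\FOLDER"))
print(check_file_store("fs01.domain.com", "LyncShare"))  # hypothetical names
```

The configuration from this environment trips both warnings, while a plain FQDN\Share definition comes back clean.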

A Short Aside

Almost any Lync/Skype4B architect would agree with avoiding #1 above – everything in Topology must be defined as an FQDN.  Or, well, it should be and Topology Builder should help enforce that.  Lync Server 2013 Topology Builder does not validate the File server FQDN format.  Go ahead and try it yourself…you’ll see no validation errors when you configure things with NETBIOS name within Lync Server 2013 Topology Builder.  Try and do that within Skype for Business Topology Builder, however, and you’ll see a validation error saying the file share is not in FQDN format:

Skype4B-ToplogyBuilderFileShare-Invalid

Some Lync/Skype4B architects may not completely agree with avoiding #2 above – that you should only define a share name and not share\folder.  Perform a quick internet search and you’ll find plenty of examples of file shares configured this way with no documented issues that result from it.  Additionally, Microsoft makes no specific statement about this issue in a KB article dedicated to documenting errors for unsupported File Share configuration.  Historically speaking, I have never configured a file share as share\folder, so I couldn’t be 100% certain it was unsupported or caused issues.

Aside Over

The SBA seemed to be perfectly functional, with the exception of the error at the start of this post, and the front end pool (Lync Server 2013) was also functional (and did not exhibit the error), so I needed to try and remove variables from the equation.  There seemed to be two potential root causes:

  1. File Share Permissions Issues
  2. SBA code issues with current file share configuration

Checking #1 proved to be very simple and we were able to confirm permissions were correct:

Group                         | Permission | Note
RTCHSUniversalServices        | Change     | Standard Requirement for Lync Server
RTCComponentUniversalServices | Change     | Standard Requirement for Lync Server
RTCUniversalServerAdmins      | Change     | Standard Requirement for Lync Server
RTCUniversalConfigReplication | Change     | Standard Requirement for Lync Server

With file share permissions ruled out, the question became whether the file store configuration simply wouldn’t work with the SBA code.

Digging In…

First step in obtaining more information is through CLS logs.  I already had the AlwaysOn scenario running on the entire environment so gathering the needed data would be easy.  Additionally, the Event 32080 error was only generated once per day and it was almost at the exact same time every day so I could pull CLS information for a very short period of time and more easily find the information. After pulling the data and looking at the Trace logs I was able to find the exact moment when the error was generated:

LyncSBA-LyssDalValidateFileShareName-Broken
LyssDal.ValidateFileShareName:lyssdal.cs
Preliminary primary fileShareName parameter: is unusable.  Exception: System.ArgumentNullException: Value cannot be null.

The second entry above showed that the code also began to access a local NTFS location of ‘C:\ProgramData\Microsoft\Lync Server’.  Examining that folder path showed that, indeed, there was a folder structure that seemed to have been created by the LYSS functionality:

LyncSBA-LyssDalValidateFileShareName-LocalNTFSFolders

Even better, I had Process Monitor loaded at the time the error was recorded and I could validate through Process Monitor logs that the SBA a) never attempted to contact any UNC-based resources and b) did successfully access the local NTFS folder location above:

LyncSBA-LyssDalValidateFileShareName-ProcMon

Interesting…very interesting…

Based on the information I’d found thus far, it certainly seems like the SBA code doesn’t like something with the file share.  How about the Front End Servers though?  What do they have to say about this, if anything?

After pulling the data and looking at the Trace logs I was able to find the exact moment when the Front End kicked off this process:

LyncFE-LyssDalValidateFileShareName-Working
LyssDal.ValidateFileShareName:lyssdal.cs
Primary file share location is [\\NETBIOS\SHARE\FOLDER\1-WebServices-29\StorageService]

The Front End has no issues whatsoever: it validates the share name and begins a process to connect to ‘1-WebServices-29\StorageService\DataExport\Date’.  Examining that folder path showed that, indeed, there was a folder structure that included a folder for each front end in the pool, created by the LYSS functionality:

LyncFE-LyssDalValidateFileShareName-ShareNTFSFolders

Getting Closer…

Given that the Front End servers seemed to be ‘happy’, it certainly seems like the SBA code might have a specific issue with the file share configuration.  The file share is defined in topology and that gets replicated by CMS to each server in topology.  CMS replication was showing UpToDate for all nodes and I even went so far as examining the local XML data within the FE and SBA local databases to make sure they truly were working off the same topology data.  Indeed, each server had the correct data from CMS:

File Server Configuration (both servers)
Lync-LocalCMSComparison-FileShareConfiguration
SBA File Share Configuration (SBA)
LyncSBA-LocalCMS-FileShareConfiguration
Front End File Share Configuration (Front End)
LyncFE-LocalCMS-FileShareConfiguration

The servers do have correct CMS information and the SBA still doesn’t like what’s configured.  Running out of options in my arsenal, I was led to a nearly final conclusion:  define a new file store configuration in topology (a supported file store configuration) and migrate content for the Lync Server 2013 infrastructure in topology.  This process has multiple steps and is well documented within these URLs:

  1. http://social.technet.microsoft.com/wiki/contents/articles/15374.change-the-file-store-location-for-lync-server-2013-pool.aspx
  2. http://ucken.blogspot.com/2014/01/presentation-issues-after-moving-lync.html

The Results Were?

So we went through the process and made the changes.  Updated Topology to use a correctly formatted file store and updated all related configuration items.  Restarted services.  Following all the changes, I was able to validate…failure.  The change didn’t seem to have any impact and the errors persist today.  A big thanks goes out to Amanda Debler for confirming that she too sees these errors in her SBA event logs even though the file store in that environment is specified in the recommended manner.

Bottom line:  I have no idea why the error appears nor any idea of negative ramifications the error indicates.  Sadly I don’t have a fix for this one…yet.  Even so, this customer would have had to change their file store configuration anyway since the Skype4B Topology Builder wouldn’t allow them to use the current one due to the NETBIOS name configuration.  Not a complete loss, but not the home-run I was hoping for.  🙁

Note:  The SBA was running January 2016 Cumulative Update for Lync Server 2013.

Note:  If you want to manually invoke this LyssDal.Cs process, run the Invoke-CsStorageServiceFlush cmdlet specifying a flushtype of “FullFlush”.  This cmdlet will generate the error on command.

09May/16

Lync Server 2013 Front End Patch Installer Fails with Error 1603

Another day, another odd error.  Another trip into the deep, dark depths of Windows.  Another enlightening find that reminded me of the inter-dependency of Lync, Windows, and SQL Server.

The error:

Product: Microsoft Lync Server 2013, Front End Server - Update 'Lync Server 2013 (KB3120728)' could not be installed.  Error code 1603.  Additional information is available in the log file D:\Source\Microsoft\05-Lync Server 2013 - Jan 2016 CU\Server.msp-computername-[2016-05-06][15-19-10]_log.txt

So how did this error come about, you ask?

The Back Story

This error was part of a new Front End Pool installation.  At this point in the process I had completed the following tasks:

  • SQL Express instances had been pre-installed
  • Lync Server 2013 Core Components were installed
  • Lync Server 2013 deployment wizard steps 1 & 2 were run
    • Local Configuration Store
    • Local Components and Services

The error itself was appearing when I was attempting to run the LyncServerUpdateInstaller.exe patch for the January 2016 Cumulative Update.  Typically this is a slam-dunk process and goes without issue, but the Front End Server patch failed and rolled back.  Examining the log file in the error message was ultimately helpful, but given the amount of information in there, it was truly finding a needle in a haystack.  But the needle was found:

Lync-CUInstaller-FirstLogError
Product: Microsoft Lync Server 2013, Front End Server -- Error 29024.  Error 0x80004005 (Unspecified error) occurred while executing command 'D:\Program Files\Microsoft SQL Server\110\Tools\Binn\osql.exe'. For more details check log file 'C:\users\username\AppData\Local\Temp\LCSSetup_Commands.log'.

A log file within a log file…interesting…  Alright, I’ll follow the bread crumbs:

Lync-CUInstaller-SecondLogError
Msg 5011, Level 13, State 9, Server computername\RTCLOCAL, Line 5
User does not have permission to alter database 'rtc', the database does not exist, or the database is not in a state that allows access checks.
Msg 5069, Level 16, State 1, Server computername\RTCLOCAL, Line 5
ALTER DATABASE statement failed.

The KB installer is calling an executable, osql.exe, and using a T-SQL script to initiate changes.  I had to look up each of the osql.exe command line switches, but the one that is most important to notice is the “-E”:

Uses a trusted connection instead of requesting a password

Effectively what that command means is “use Windows Integrated Authentication”, which thereby means that my user account should be used.  My user account has all the rights in the world (including sysadmin in SQL), so why is this failing?  I tried many, many things – even going so far as blowing away Lync (bootstrapper /scorch) and SQL databases – but none of them made any difference.  The CU installer would always fail with the same error every time.  Nothing seemed to make a difference.

The Plot Thickens

Given my failure and frustration, I fired up SQL Tracing:

Lync-SQLTracing-Failure

I was very, very surprised to see the “NT AUTHORITY\SYSTEM” account being used for the LoginName.  My user account is launching the application executable – why aren’t those credentials being used!?  Looking at Management Studio for the RTCLOCAL instance, the “NT AUTHORITY\SYSTEM” account does have a login, but it is not granted any elevated permissions or rights:

Lync-SQLSYSTEM-ServerRoleRights

No sysadmin role means that it cannot alter databases within the instance.  That’s sort of an explanation, but why is this the first time I’m seeing this problem!?

Digging Further

The ultimate epiphany came when I began to look at how the LyncServerUpdateInstaller.exe worked.  The executable extracts .MSP files that contain each of the individual Lync Server application patches.  The .MSP file contains all the logic and T-SQL scripts that are being executed for this particular Front End Server patch.  The big difference is found in how the .EXE and .MSP differ:

  • My user account launches the LyncServerUpdateInstaller.exe executable
  • My user account is used to initially launch the .MSP files, but the Local System account actually runs the .MSP files.

Microsoft patch files get executed by the Local System account, so that explains why the -E switch to the osql.exe command was passing the “NT AUTHORITY\SYSTEM” credentials.  The osql.exe executable was being called by the .MSP file and that .MSP file was run with the SYSTEM account.  OK, fair enough, but why aren’t permissions correct on my SQL configuration, especially considering I’ve done this hundreds of times before without any previous issue?!

I looked at a few other server installs within this environment and within the TechNet virtual labs, and they all had one SQL Server login that was missing from this server:

Lync-SQLSYSTEM-WorkingServerRoleRights
BUILTIN\ADMINISTRATORS

This group was granted sysadmin rights on the working servers, which meant that any local admin of the server had sysadmin rights within SQL.  Nearly any SQL administrator will advocate against having the local server Administrators group as a login, and generally I would agree that is a best practice.  Given all this information, however, it still didn’t explain why the patching process was failing, so further research was required…

Continuing On

The ultimate “A-HA!” moment came when I ran across these articles whilst searching for the relationship between the Local System account and the built-in Administrators group:

https://msdn.microsoft.com/en-us/library/windows/desktop/ms684190(v=vs.85).aspx

The LocalSystem account is a predefined local account used by the service control manager. This account is not recognized by the security subsystem, so you cannot specify its name in a call to the LookupAccountName function. It has extensive privileges on the local computer, and acts as the computer on the network. Its token includes the NT AUTHORITY\SYSTEM and BUILTIN\Administrators SIDs; these accounts have access to most system objects.

https://technet.microsoft.com/en-us/library/cc778824(v=ws.10).aspx

System is a hidden member of Administrators. That is, any process running as System has the SID for the built-in Administrators group in its access token.

Go ahead and read those again.  See if the “A-HA!” moment comes to you, too…  OK…I’ll help you…

Effectively what the articles are saying is that the Local System account is, by default, a bona-fide member of the BUILTIN\Administrators group because its token includes the Administrators group SID.  Taking it one step further:  What server role does that group have on the working servers’ SQL instances?…that’s right….sysadmin.  On those servers, the .MSP file running as Local System picks up sysadmin through the Administrators group SID in its token.  On the offending server, the Administrators group had no login in the instance, so the only login that matched was “NT AUTHORITY\SYSTEM” itself, and that login had no rights to actually update the databases.

Note:  recall that the SYSTEM account did have a login within SQL, but the available server role rights were set to ‘public’ which means it basically had no rights to do much of anything.

A Fix?

I manually added the “BUILTIN\Administrators” group as a login on the offending Front End Servers’ local SQL instances and granted that group sysadmin rights.  I re-ran LyncServerUpdateInstaller.exe and…SUCCESS!!!

Lync-CUInstaller-Success
Done: Installing KB3120728 for Server.msp 

Had I not found those two articles, I may never have known the true reason for the behavior I was seeing.  The fix was simply to make sure the “BUILTIN\Administrators” login matched the configuration of the other servers, which in turn matched the TechNet virtual labs configuration.
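
I made the change by hand in Management Studio, but it could be scripted roughly as follows.  Treat this as a sketch rather than the exact steps I ran: it assumes a SQL Server 2012 or later instance (the \110\ tools path in the log suggests 2012) and that you are connected to the affected local instance as a sysadmin:

```sql
-- Create the Windows group login only if it is missing from the instance.
IF NOT EXISTS (SELECT 1
               FROM sys.server_principals
               WHERE name = N'BUILTIN\Administrators')
BEGIN
    CREATE LOGIN [BUILTIN\Administrators] FROM WINDOWS;
END;

-- Add the group to the sysadmin fixed server role.
-- ALTER SERVER ROLE ... ADD MEMBER requires SQL Server 2012 or later.
ALTER SERVER ROLE [sysadmin] ADD MEMBER [BUILTIN\Administrators];
```

If your security team removed the group on purpose, clear a change like this with them first, since it hands sysadmin to every local administrator of the server.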

Wrap Up

Coming full circle: What had actually occurred was that the SQL team had removed the BUILTIN\Administrators group for security reasons after I had initially pre-installed SQL (the group was present at the time of my installation), and that removal was unbeknownst to me.  All of Microsoft’s standard Lync and Skype installers include that group for the SQL instances (and grant it sysadmin), so it truly is critical that the login exists for the purposes of patching.  As I saw, installation of the product occurred just fine, but patches then failed outright because the patching process uses the SYSTEM account and not a specific user account.

Note:  As an alternative workaround, you could grant the “NT AUTHORITY\SYSTEM” login sysadmin rights in SQL for the purposes of patching, but I doubt many people would want to take on that additional management complexity.
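
A rough sketch of that alternative, assuming the “NT AUTHORITY\SYSTEM” login already exists in the instance (as it did on my server) and the instance is SQL Server 2012 or later.  Removing the role again after patching is my own suggestion, not something I tested as part of this issue:

```sql
-- Before patching: let the SYSTEM login act as sysadmin.
ALTER SERVER ROLE [sysadmin] ADD MEMBER [NT AUTHORITY\SYSTEM];

-- After patching (optional, untested suggestion): return SYSTEM to public only.
-- ALTER SERVER ROLE [sysadmin] DROP MEMBER [NT AUTHORITY\SYSTEM];
```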

Bottom line:  if you choose to change the sysadmin rights on your Lync Front Ends and remove the Administrators group, be aware of this issue and plan for workarounds accordingly!

Note:  This issue is another good case study that belongs in my other post, ‘The Dangers of SQL Server Security Hardening for Lync Server & Skype4B Server’, but I separated it into a distinct post for the sake of clarity and so that it would be more easily discoverable via search engines.