Tuesday, February 7, 2017

Outgoing Email Fail

I recently stood up a SharePoint 2016 farm and configured outgoing email using the settings from a working SharePoint 2010 farm. However, during testing, alerts were not being received.

  • Verified the 'Immediate Alert' job was running every five minutes successfully.
  • Double checked the config information for outgoing email in Central Admin.
  • Verified the SMTP server was configured to allow emails from the Web and App server.
  • ULS was not helpful
  • Restarted SharePoint Timer service
  • Restarted IIS
  • Even restarted all servers in the farm
Before enabling verbose logging and slogging through all that data, I decided to broaden my testing scope and created a simple workflow that sent me an email when triggered.  The workflow completed with the error: "The e-mail message cannot be sent. Make sure the outgoing e-mail settings for the server are configured correctly. For more information, please read this article: http://go.microsoft.com/fwlink/?LinkID=323543&clcid=0x409"

This information was not immediately helpful but it did confirm that the issue was within the SharePoint farm and not with Exchange or some other outside service.

Taking another look at Outgoing E-mail Settings, I saw one setting that is available in SharePoint 2016 that was not available in SharePoint 2010: "Use TLS connection encryption", which defaults to 'Yes'.  I set it to 'No' and restarted IIS and the SharePoint Timer service.  Resetting IIS and Timer was probably unnecessary but I wanted to be sure.

I was now able to receive alerts and workflow emails.  Turns out our email server was not using a compatible version of TLS.  It would have been nice to find an error with these details.

So, if you are not receiving alerts and system emails but you are sure your environment is configured correctly, disable TLS and see what happens.  If you need encryption, work with your Exchange admin to ensure compatible versions are available.
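
Since SharePoint gave no useful error here, one way to see what the mail server actually offers is to probe it from outside SharePoint. The following is a minimal Python sketch, not something from my original troubleshooting. It assumes the relay listens on port 25 and that you can run Python on a machine allowed to connect to it. The function names are made up for illustration. Keep in mind that the TLS versions Python will accept depend on its OpenSSL build and may differ from what SharePoint accepts, so treat the result as a hint rather than proof.

```python
import smtplib
import ssl


def summarize(advertises_starttls, tls_version):
    """Turn probe results into a one-line verdict (pure helper, easy to test)."""
    if not advertises_starttls:
        return "server does not advertise STARTTLS"
    if tls_version is None:
        return "STARTTLS advertised but the TLS handshake failed"
    return "STARTTLS negotiated " + tls_version


def probe_smtp_tls(host, port=25, timeout=10):
    """Connect to an SMTP server and report whether STARTTLS works."""
    context = ssl.create_default_context()
    # Assumption: an internal relay may use a self-signed certificate, so
    # certificate checks are relaxed for this diagnostic only.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE

    smtp = smtplib.SMTP(host, port, timeout=timeout)
    try:
        smtp.ehlo()
        if not smtp.has_extn("starttls"):
            return summarize(False, None)
        try:
            smtp.starttls(context=context)
        except (OSError, smtplib.SMTPException):
            # Handshake problems (including protocol version mismatches)
            # surface here as ssl.SSLError, a subclass of OSError.
            return summarize(True, None)
        # After STARTTLS the underlying socket is an SSLSocket.
        return summarize(True, smtp.sock.version())
    finally:
        smtp.close()
```

Calling `probe_smtp_tls("mail.example.local")` (a hypothetical host name) returns one of the three verdict strings. If the server does not advertise STARTTLS at all, or the handshake fails, that would be consistent with the behavior I saw.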

Friday, January 13, 2017

Feed Cache Repopulation Timer Job(s) fail

For the past few days, I have been combatting an issue with failing timer jobs on two SharePoint Farms and I believe I have finally resolved the issue.

Specifically, the following timer jobs were failing with the below message.

Failed Timer Jobs
  • User Profile Service Application - Feed Cache Full Repopulation Job
  • User Profile Service Application - Feed Cache Repopulation Job
And the message was:
Unexpected exception in FeedCacheService.IsRepopulationNeeded: Failed to Decrypt data... (Correlation=XXX...)

Failed to decrypt data?  Do I have a permission issue?  ULS provided no extra leads so I ramped up logging to verbose but unfortunately, that also provided no new leads. Quite a bit of research on the Internet gave me a few suggestions, such as restarting IIS, which worked for 20 to 40 minutes before the errors returned.  Unsurprisingly, the errors also returned after rebooting the farm.

I also verified...
  • Permissions for the UPS service account
  • Permissions for the AppFabric service account (since this is a caching issue and AppFabric handles it)
  • UPS and AppFabric were using the same service account (Not required)
  • UPS and AppFabric listed the same cache host server
  • Version of AppFabric.  I am running 1.1 CU 7
  • Version of .NET Framework.  I am running 4.6.2
  • Firewall was configured properly
In my searching I found a nice set of monitoring scripts by Filip Bosmans (found here). While enlightening, they did not give me an immediate answer.

During my search, I came across this page authored by Wictor Wilén which gave me a lead!

"The three queues, or rather the three Timer Jobs are created per Web Application..."

This is when I noticed on the Job Definitions page in Central Admin that I had two web apps (mysites and portal) and that the following jobs were scheduled to run every minute for both.
  • My Site Instantiation Interactive Request Queue
  • My Site Instantiation Non-Interactive Request Queue
  • My Site Second Instantiation Interactive Request Queue
What is the connection you may ask?  These jobs create the Newsfeed and Microblog for new My Site profiles. SharePoint 2013 and 2016 cache Newsfeeds and Microblogs.

Another section of Wictor's page gave me my next lead.

"The Timer Jobs are based on the SPWorkItemJobDefinition Job Definition Type. This is a really nice timer job implementation that has a queue per content database"

This is significant as I have approximately 30 content databases and a new My Sites web app with only a dozen or so profiles.  For a large company, this could mean that every minute, these three jobs run against every web application and touch every content database.  Frankly, that is insane.  Why do services that create My Sites content need to run against the Portal web app?  Again, searching the Internet yielded no satisfying answer.  If you, the reader, know, please feel free to comment below.  As a test, I disabled the three jobs that were running on the Portal web application.  The Feed Cache jobs took longer to fail but ultimately, they did fail.
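
To put a rough number on that, here is a back-of-the-envelope calculation in Python. It assumes each job checks each content database's queue once per run, which is my reading of the quote above rather than something I measured, and the function name is just for illustration.

```python
def queue_checks_per_day(jobs, content_databases, runs_per_hour=60):
    """Rough count of per-database queue checks in 24 hours.

    Assumption: every job visits every content database once per run.
    """
    return jobs * content_databases * runs_per_hour * 24


# Three My Site instantiation jobs, about 30 content databases, run every minute.
print(queue_checks_per_day(jobs=3, content_databases=30))  # 129600
```

That is on the order of 130,000 queue checks a day in my small environment, almost all of them against databases that will never hold a My Site.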

With this information I started to suspect the cache was filling up, but that did not make sense: even if the service was only running on the MySites web application, a large company would have dozens and maybe hundreds of MySite content databases to fill the cache.  This is when Filip's monitoring scripts provided the potential solution.  On his "How to read the results" page, one of the first issues discussed is Background Garbage Collection, a feature provided in AppFabric 1.1 CU3 that is not turned on by default.  You have to manually change a related config file for the feature to actually work.  So I did.

"After you apply this cumulative update, AppFabric uses a nonblocking garbage collection (background server garbage collection). Nonblocking garbage collection is a new feature in the .NET Framework 4.5.
To apply this fix, follow these steps:
1. Upgrade the servers to the .NET Framework 4.5.
2. Install the cumulative update package.
3. Enable the fix by using the following setting <appSettings><add key="backgroundGC" value="true"/></appSettings> in the DistributedCacheService.exe.config file between
       </configSections>
   <appSettings><add key="backgroundGC" value="true"/></appSettings>
   <dataCacheConfig>
4. Restart the AppFabric Caching service for the update to take effect.
Note By default, the DistributedCacheService.exe.config file is located under the following directory:
%ProgramFiles%\AppFabric 1.1 for Windows Server"
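
I made the change by hand, but the edit described in step 3 could also be scripted. The sketch below is one hedged way to do it in Python with the standard library. It assumes the file's root element is `<configuration>`, as in other .NET config files, and the function name is made up for illustration. Note that `xml.etree.ElementTree` drops XML comments when parsing, so back up the original file before writing anything back.

```python
import xml.etree.ElementTree as ET


def enable_background_gc(config_xml):
    """Return the config XML with backgroundGC set to true in <appSettings>."""
    root = ET.fromstring(config_xml)

    app_settings = root.find("appSettings")
    if app_settings is None:
        app_settings = ET.Element("appSettings")
        # Place the new element right after <configSections>, as the quoted
        # instructions describe (or first, if that element is absent).
        index = 0
        for i, child in enumerate(list(root)):
            if child.tag == "configSections":
                index = i + 1
        root.insert(index, app_settings)

    # Update an existing key if present, otherwise add it.
    for add in app_settings.findall("add"):
        if add.get("key") == "backgroundGC":
            add.set("value", "true")
            break
    else:
        ET.SubElement(app_settings, "add", {"key": "backgroundGC", "value": "true"})

    return ET.tostring(root, encoding="unicode")
```

The function works on a string so it can be tried against a copy of the file contents first. Running it twice does not add a second key, and the AppFabric Caching service still needs the restart from step 4 afterwards.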

My working theory is that the cache was filling up and the 'failed to decrypt' message was not related to permission issues but to space issues.  Unfortunately, I forgot to leave one of the farms in the original configuration as a control.  *shakes head in disappointment*  This warrants further research, but I wanted to at least get this information out in the event it may assist someone else.