Friday, January 13, 2017

Feed Cache Repopulation Timer Job(s) fail

For the past few days, I have been combatting an issue with failing timer jobs on two SharePoint Farms and I believe I have finally resolved the issue.

Specifically, the following timer jobs were failing with the below message.

Failed Timer Jobs
  • User Profile Service Application - Feed Cache Full Repopulation Job
  • User Profile Service Application - Feed Cache Repopulation Job
And the message was:
Unexpected exception in FeedCacheService.IsRepopulationNeeded: Failed to Decrypt data... (Correlation=XXX...)

Failed to decrypt data?  Do I have a permission issue?  ULS provided no extra leads, so I ramped up logging to verbose, but unfortunately that also provided no new leads. Quite a bit of research on the Internet gave me a few suggestions, such as restarting IIS, which worked for 20 to 40 minutes before the errors returned.  Unsurprisingly, the errors also returned after rebooting the farm.

I also verified...
  • Permissions for the UPS service account
  • Permissions for the AppFabric service account (since this is a caching issue and AppFabric handles it)
  • UPS and AppFabric were using the same service account (Not required)
  • UPS and AppFabric listed the same cache host server
  • Version of AppFabric.  I am running 1.1 CU 7
  • Version of .NET Framework.  I am running 4.6.2
  • Firewall was configured properly
In my searching I found a nice set of monitoring scripts by Filip Bosmans (found here) which, while enlightening, did not give me an immediate answer.

During my search, I came across this page authored by Wictor Wilén which gave me a lead!

"The three queues, or rather the three Timer Jobs are created per Web Application..."

This is when I noticed, on the Job Definitions page in Central Admin, that I had two web apps (mysites and portal) and that the following jobs were scheduled to run every minute for both.
  • My Site Instantiation Interactive Request Queue
  • My Site Instantiation Non-Interactive Request Queue
  • My Site Second Instantiation Interactive Request Queue
What is the connection, you may ask?  These jobs create the Newsfeed and Microblog for new My Site profiles, and SharePoint 2013 and 2016 cache Newsfeeds and Microblogs.

Another section of Wictor's page gave me my next lead.

"The Timer Jobs are based on the SPWorkItemJobDefinition Job Definition Type. This is a really nice timer job implementation that has a queue per content database"

This is significant as I have approximately 30 content databases and a new My Sites web app, so only a dozen or so profiles.  For a large company, this could mean that every minute, these three jobs run against every web application and touch every content database.  Frankly, that is insane.  Why do services that create My Sites content need to run against the Portal web app?  Again, searching the Internet yielded no satisfying answer.  If you, the reader, know, please feel free to comment below.  As a test, I disabled the three jobs that were running on the Portal web application.  The Feed Cache jobs took longer to fail, but ultimately they did fail.
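To put a rough number on that, here is a back-of-the-envelope sketch. It assumes (this is an assumption, not something confirmed above) that each of the three jobs visits the work-item queue of every content database in its web application once per run, and it uses the approximate figure of 30 content databases from this farm. The function name is made up for illustration.

```python
# Rough estimate only: assumes one queue visit per job, per content
# database, per scheduled run. Real timer job behavior may differ.
JOBS_PER_WEB_APP = 3      # the three My Site instantiation queue jobs
CONTENT_DATABASES = 30    # approximate total across both web apps
RUNS_PER_HOUR = 60        # the jobs are scheduled every minute


def queue_visits_per_hour(jobs, databases, runs_per_hour=RUNS_PER_HOUR):
    """Return the estimated number of queue visits in one hour."""
    return jobs * databases * runs_per_hour


print(queue_visits_per_hour(JOBS_PER_WEB_APP, CONTENT_DATABASES))  # 5400
```

Under those assumptions that is about 90 queue visits every minute, nearly all of them against databases that will never hold a My Site.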

With this information I started to suspect the cache was filling up, but that did not make sense because even if the service was only running on the MySites web application, a large company would have dozens and maybe hundreds of MySite content databases to fill the cache.  This is when Filip's monitoring scripts provided the potential solution.  On his "How to read the results" page, one of the first issues discussed is Background Garbage Collection, a feature provided in AppFabric 1.1 CU3 that is not turned on by default.  You have to manually change a related config file for the feature to actually work.  So I did.

"After you apply this cumulative update, AppFabric uses a nonblocking garbage collection (background server garbage collection). Nonblocking garbage collection is a new feature in the .NET Framework 4.5.
To apply this fix, follow these steps:
1. Upgrade the servers to the .NET Framework 4.5.
2. Install the cumulative update package.
3. Enable the fix by using the following setting <appSettings><add key="backgroundGC" value="true"/></appSettings> in the DistributedCacheService.exe.config file between
       </configSections>
   <appSettings><add key="backgroundGC" value="true"/></appSettings>
   <dataCacheConfig>
4. Restart the AppFabric Caching service for the update to take effect.
Note By default, the DistributedCacheService.exe.config file is located under the following directory:
%ProgramFiles%\AppFabric 1.1 for Windows Server"
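Since the setting has to be added by hand on every cache host, it is easy to miss one. Below is a minimal sketch of a check for it, using only Python's standard library. The helper name and the two sample documents are hypothetical and heavily trimmed; a real DistributedCacheService.exe.config contains much more, and this sketch only reads the XML, it does not modify the file.

```python
import xml.etree.ElementTree as ET


def has_background_gc(config_xml: str) -> bool:
    """Return True if appSettings holds backgroundGC with value "true"."""
    root = ET.fromstring(config_xml)
    # Look only at <add> entries directly under <configuration>/<appSettings>.
    for add in root.findall("./appSettings/add"):
        if add.get("key") == "backgroundGC":
            return (add.get("value") or "").strip().lower() == "true"
    return False


# Hypothetical, trimmed-down samples for illustration only.
SAMPLE_ENABLED = """<configuration>
  <configSections>
    <section name="dataCacheConfig" type="placeholder" />
  </configSections>
  <appSettings><add key="backgroundGC" value="true"/></appSettings>
  <dataCacheConfig />
</configuration>"""

SAMPLE_MISSING = """<configuration>
  <configSections />
  <dataCacheConfig />
</configuration>"""

print(has_background_gc(SAMPLE_ENABLED))  # True
print(has_background_gc(SAMPLE_MISSING))  # False
```

To use it against a real server, read the config file's text from the directory mentioned in the note above and pass it to the function. The AppFabric Caching service still has to be restarted after any change, as step 4 says.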

My working theory is that the cache was filling up and the 'failed to decrypt' message was not related to permission issues but to space issues.  Unfortunately, I forgot to leave one of the farms in the original configuration as a control.  *shakes head in disappointment*  This warrants further research, but I wanted to at least get this information out in the event it may assist someone else.

3 comments:

  1. Hi Barrett,

    Thanks for the detailed and insightful description of this issue. We have been having this on our SP 2016 MinRole farm. All servers are OK apart from one of the Search servers. DC runs only on the 2 DC servers in the MinRole deployment. On the failing server the AppFabric service does not run, the same as on the twin DC server, which runs without the error. We tried clearing the SharePoint cache and restarting the server, but the errors are still coming. Any help will be appreciated.

    Kind regards,
    Ali

  2. This comment has been removed by the author.

  3. Hi,

    I have set up a 3 server farm (1 WFE+DC, 1 App+Search and 1 SQL Server) and modified the AppFabric config file right after installation of prerequisites. The Feed Cache errors still occur until I restart the DC service in Central Admin. Both servers on Oct. 2017 CU.
