Monday, December 29, 2014

SharePoint 2013: The Execute method of job definition Microsoft.SharePoint.Diagnostics.SPDiagnosticsMetricsProvider

Problem

You see the following event occurring daily at about 6:12 AM on your SharePoint 2013 farm servers:

Log Name:      Application
Source:        Microsoft-SharePoint Products-SharePoint Foundation
Date:          [date/time]
Event ID:      6398
Task Category: Timer
Level:         Critical
Keywords:      
User:          [DOMAIN/FarmServiceAcct]
Computer:      [SharePoint Server]
Description:
The Execute method of job definition Microsoft.SharePoint.Diagnostics.
SPDiagnosticsMetricsProvider (ID 9cde39fb-4971-4a03-9612-0978098691d7) 
threw an exception. More information is included below.

An update conflict has occurred, and you must re-try this action. The 
object SPWebService was updated by [DOMAIN/FarmServiceAcct], in the 
OWSTIMER (8696) process, on machine [SharePoint Server].  View the 
tracing log for more information about the conflict.
Event Xml:
...

This error is generated when the Config Refresh timer job finds that the configuration caches on the SharePoint servers are out of sync.  The farm here consists of one application server (APP1) and two web front end servers (WFE1 and WFE2).

This posting unfortunately does not present a solution, but it documents the troubleshooting steps and reference articles for future use.

Troubleshooting
  1. Check cache ID
    1. APP1: 2248964
    2. WFE1: 2248968
    3. WFE2: 2248970
      Note: this value changes constantly.  If you open the cache.ini file on each machine one after another, you may see different values, not because the caches are out of sync but because the cache ID changed between the time you opened the file on one machine and the time you opened it on the next.  To get an accurate snapshot of this value across all machines, open remote sessions on all of them simultaneously and then, in quick succession, make a copy of the file on each one.  Open these copies to determine the actual cache IDs.
  2. Clear cache
    1. First, stopped Timer service on all SharePoint servers.
    2. Then on each server, starting with App1:
      1. Navigated to C:\ProgramData\Microsoft\SharePoint\Config.
      2. Looked for current GUID folder (check dates).
      3. Deleted all XML files in this folder.
      4. Replaced contents of cache.ini with "1".
    3. Started timer service on APP1
      1. Waited for it to rebuild fully.
        Note: if you have security software (McAfee, etc.) installed, scanning all of the newly created cache files will consume significant resources and temporarily degrade performance.
      2. 1658 XML cache files generated
    4. Started timer service on WFE1
      1. Waited for it to rebuild fully.
      2. 1658 XML cache files generated
      3. Refreshed application logs on APP1 and WFE1
    5. Started timer service on WFE2
      1. Waited for it to rebuild fully
      2. 1658 XML cache files generated
  3. Verify cache IDs
    1. After cache rebuild completed, checked contents of each cache.ini:
      1. APP1: 2248998
      2. WFE1: 2248998
      3. WFE2: 2248998
        See note above on getting accurate values for ID.
  4. Verify solution
    1. Checked event logs one day later: same issue recurring.
    2. Checked event logs five days later: same issue recurring.
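
The per-server part of step 2 (clear cache) can be sketched in Python as follows. This is only an illustration of the manual steps above, not a tool used in the original troubleshooting: the function names `find_current_cache_folder`, `clear_config_cache` and `count_cache_files` are made up for this example, and it assumes that the Timer service has already been stopped on every server and that the "current" GUID folder is the most recently modified one containing a cache.ini.

```python
from pathlib import Path

# Location given in the steps above; pass a different root when testing.
CONFIG_ROOT = Path(r"C:\ProgramData\Microsoft\SharePoint\Config")


def find_current_cache_folder(config_root: Path) -> Path:
    """Return the most recently modified subfolder that contains cache.ini.

    Assumption: "current" means newest modification time, mirroring the
    manual "check dates" step.
    """
    candidates = [p for p in config_root.iterdir()
                  if p.is_dir() and (p / "cache.ini").is_file()]
    if not candidates:
        raise FileNotFoundError(f"no cache folder found under {config_root}")
    return max(candidates, key=lambda p: p.stat().st_mtime)


def clear_config_cache(config_root: Path = CONFIG_ROOT) -> int:
    """Delete the XML cache files and reset cache.ini to "1".

    Run only while the SharePoint Timer service is stopped.
    Returns the number of XML files deleted.
    """
    folder = find_current_cache_folder(config_root)
    xml_files = list(folder.glob("*.xml"))  # snapshot the list before deleting
    for xml_file in xml_files:
        xml_file.unlink()
    (folder / "cache.ini").write_text("1")
    return len(xml_files)


def count_cache_files(config_root: Path = CONFIG_ROOT) -> int:
    """Count XML files in the current cache folder after a rebuild."""
    return len(list(find_current_cache_folder(config_root).glob("*.xml")))
```

After the Timer service is restarted and the cache has rebuilt, `count_cache_files()` should return the same number on each server (1658 in the farm described here).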
Solution
  1. None found as of 12/29/14.
References
Notes
  • Also checked development farm servers: APP1: 2733002, WFE1: 2733006 and WFE2: 2733006. After performing the above procedure: APP1: 2733071, WFE1: 2733071 and WFE2: 2733071.  Same experience: issue continues to occur after clearing cache.
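
The before/after comparison in this note amounts to checking whether every server reports the same cache ID. A rough Python sketch of that check follows; `read_cache_ids` and `caches_in_sync` are hypothetical helpers, and the files passed to `read_cache_ids` are assumed to be copies of each server's cache.ini taken in quick succession, as described in the note under step 1.

```python
from pathlib import Path
from typing import Dict


def read_cache_ids(cache_files: Dict[str, Path]) -> Dict[str, int]:
    """Read the integer cache ID from each server's cache.ini copy."""
    return {server: int(path.read_text().strip())
            for server, path in cache_files.items()}


def caches_in_sync(cache_ids: Dict[str, int]) -> bool:
    """True when every server reports the same cache ID."""
    return len(set(cache_ids.values())) <= 1


# Values recorded on the development farm in this post.
before = {"APP1": 2733002, "WFE1": 2733006, "WFE2": 2733006}
after = {"APP1": 2733071, "WFE1": 2733071, "WFE2": 2733071}

print(caches_in_sync(before))  # False
print(caches_in_sync(after))   # True
```

As the results above show, matching IDs did not stop event 6398 from recurring, so a `True` result only rules out one possible cause.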

1 comment:

Anonymous said...

Until Microsoft adds timer job configuration at the server level rather than the farm level, similar issues will remain.