Rocket iCluster

Continuous Sync Checks: Should we run them in Rocket iCluster?


    Posted 11-11-2020 12:22
    Edited by Mark Watts 11-19-2020 18:31
    In this post we are going to discuss ways to both increase the fidelity of your HA/DR deployment and reduce the administrative overhead required to maintain that reliable state.  There are two primary modes for running sync checks in iCluster, and the two can be deployed together to take advantage of the strengths of both.  Picking periods of low production activity for the intensive validations, and choosing when continuous sync checks provide the most benefit, lets us combine deep fidelity checks with early notification, so we have high confidence and restful nights. 

    What is 'fidelity' when discussing logical replication?  It refers to 'the degree of exactness with which something is copied or reproduced'.  The primary way we validate fidelity in iCluster is with Synchronization Checks, an automated process that compares the existence, associated attributes, and contents of each object pair (source and target) in the replication selection. 

    The main Sync Check strategies for validating fidelity are either periodic batch runs (usually nightly, perhaps once a day) or something called a 'Continuous Sync Check' (CSC from here forward).  There are also on-demand Sync Checks, but those are not a 'strategy', so let's omit that discussion for now.  A CSC is a sync check that runs continuously while replication is active.  The main concept is that the sync check builds a work list of all the objects in scope for the replication group and then proceeds to sync check, or compare, each object against the matching BACKUP node object.  When it reaches the end of the list, it starts again at the beginning and works through the list again and again while replication is active.  Only one Sync Check process is allowed per group.  That means that if a CSC is active, an additional on-demand or batch sync check will be rejected.  You would need to end the Continuous Sync Check first, then submit the single-pass Sync Check, and finally restart the CSC if you want it active again; a sketch of that sequence follows. 
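
    For example, the sequence might look something like the following minimal sketch.  BACKUP01 and PRODGRP are placeholder names, the GROUP keyword on ENDHASC is an assumption (prompt the command with F4 to confirm the exact keyword), and the middle step is whatever on-demand or batch Sync Check you normally use:

        ENDHASC GROUP(PRODGRP)                              /* End the active Continuous Sync Check       */
        /* ...submit your single-pass (on-demand or batch) Sync Check for the group here...               */
        STRCNSC TARGET(BACKUP01) GROUP(PRODGRP) DELAY(600)  /* Restart the CSC once the single pass ends  */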

    The iCluster continuous sync check provides quick notification in the monitor when an object pair fails the compare (is out of sync).  With the multiple options available, there is little need to run nightly scheduled sync checks, and you can refer to the monitor to view the current health, or fidelity, of the objects selected in the replication group. 

    When you create or change a replication group, you can have a continuous sync check start when the group starts by specifying CONTSYNC(*YES) on the DMADDGRP or DMCHGGRP command.  Select the additional options available to control how the Sync Check process behaves.  To control the resource consumption of a CSC process, use the "Delay Between Objects" value so the activity fits your system's workload balance.  The default value is 200 milliseconds.  I suggest a starting value of approximately 600 milliseconds and then adjust as desired.

    [Screenshot: Automatic CSC invocation when a group starts and ends]
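
    As a rough sketch, enabling this on an existing group might look like the following.  PRODGRP is a placeholder group name, CONTSYNC(*YES) is the parameter discussed above, and the GROUP keyword plus the exact keyword for "Delay Between Objects" are assumptions, so prompt the command with F4 to confirm them:

        DMCHGGRP GROUP(PRODGRP) CONTSYNC(*YES)   /* Start a CSC automatically whenever the group starts        */
        /* Prompt with F4 to locate the 'Delay Between Objects' value (default 200 ms; try approximately 600)  */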


    Alternatively, to start a continuous sync check for an active group, use the command STRCNSC.  Here is example command syntax using a 600 millisecond delay between each object pair:  STRCNSC TARGET(<backup node name>) GROUP(<group name>) DELAY(600)
    This can be submitted from a custom CLP or from a job scheduler to control the period of time you want Continuous Sync Checks active; a scheduling sketch follows below.  End a Sync Check with the command ENDHASC.  The only parameter required is the Replication Group name.

    [Screenshot: Use STRCNSC to invoke a CSC on an ACTIVE group]
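
    If you want the CSC active only during certain hours, one approach (a sketch, assuming a backup node named BACKUP01 and a group named PRODGRP; the GROUP keyword on ENDHASC is an assumption, so prompt with F4 before scheduling) is to drive STRCNSC and ENDHASC from the IBM i job scheduler:

        /* Start the CSC every morning at 06:00 with a 600 ms delay between object pairs */
        ADDJOBSCDE JOB(STRCSC) CMD(STRCNSC TARGET(BACKUP01) GROUP(PRODGRP) DELAY(600)) +
                   FRQ(*WEEKLY) SCDDAY(*ALL) SCDTIME(060000)

        /* End the CSC every evening at 19:00, for example ahead of a nightly batch Sync Check */
        ADDJOBSCDE JOB(ENDCSC) CMD(ENDHASC GROUP(PRODGRP)) +
                   FRQ(*WEEKLY) SCDDAY(*ALL) SCDTIME(190000)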



    The CSC does generate user journal entries in the system audit journal, QAUDJRN.  If you have a CSC running for every replication group and the delay value is very low, the activity in QAUDJRN can become significant.  If you issue DSPJRN JRN(QSYS/QAUDJRN) USRPRF(DMCLUSTER), you will see a number of journal entries with code U (user) and types o, A, B, C, and g.  Many IBM Power Systems today are configured with high-performance controllers and SSD devices, and the additional I/O to QAUDJRN does not cause the system to break a sweat.  However, if you wish to control or reduce the QAUDJRN activity, you can use one (or both) of two strategies: 1. Increase the "Delay Between Objects" value.  2. Only run a CSC for the more critical replication groups.

    [Screenshot: QAUDJRN entries for user DMCLUSTER with CSC ACTIVE]
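
    If you want to quantify that activity rather than eyeball it, one option (a sketch using standard IBM i commands; CSCENTS is a placeholder outfile name) is to dump the DMCLUSTER entries to an output file and then count them by entry type:

        /* Dump the audit entries generated under the DMCLUSTER profile to an outfile */
        DSPJRN JRN(QSYS/QAUDJRN) USRPRF(DMCLUSTER) JRNCDE((U)) +
               OUTPUT(*OUTFILE) OUTFILFMT(*TYPE1) OUTFILE(QTEMP/CSCENTS)

        /* Then summarize by entry type, for example with SQL:                        */
        /*   SELECT JOENTT, COUNT(*) FROM QTEMP/CSCENTS GROUP BY JOENTT               */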

    Make sure you have automatic journal management turned on for QAUDJRN (and all journals used in replication) so that journal receivers that are no longer required are deleted automatically and storage requirements are maintained with very little effort once it is set up. 
    That is a topic for a different blog post, but you can find that area with the iCluster command:  DMWRKJRN *ALL/*ALL *NO 
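
    For QAUDJRN itself, a minimal sketch with the base operating system command looks like the following; check the QAUDJRN restrictions for your IBM i release before changing it, and manage the journals actually used in replication through the DMWRKJRN area mentioned above rather than deleting their receivers out from under iCluster:

        /* Have the system generate and attach new receivers and delete the detached ones it no longer needs */
        CHGJRN JRN(QSYS/QAUDJRN) JRNRCV(*GEN) MNGRCV(*SYSTEM) DLTRCV(*YES)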

    Let's discuss one final topic around Sync Checks, and that is "How do you review an exception from a CSC?" 

    As we covered already, a Sync Check compares object existence, contents, and/or object attributes to validate that replication is healthy.  As each object pair is compared, if there is an exception, iCluster records it and displays the exception count in the monitor.  You can see the exception count in the OOS column beside each replication group where a Sync Check or CSC was run.  You can also, if you wish, go to the BACKUP node and run the command DMSCRPT with the desired parameters turned on to get a bigger picture of your Sync Check strategy.  Drilling into the object details will tell you which aspect of the objects failed to compare.

    It is important to know that even though an object was flagged during this pass of the sync check, in many exception cases the object is still replicating with no interruption.  A simple example of that is the Object Description field.  If for some reason the object descriptions of the source and target object do not match, iCluster identifies that as an OOS (out of sync) condition.  Even though that attribute is minor and not something we would consider worth worrying about, it does reveal the possibility that there was some sort of break in the original sync process, a violation of the access rules on your BACKUP node, or other mysteries of the Universe.  So you can decide: do I just go fix the description (as sketched below) and let the next pass of the CSC discover the object has been repaired?  Or should I activate the object within iCluster and let it automatically repair the object with a new image?  When objects are small, the decision is easy: let iCluster reimage the object.  If the object is very large, then some additional factors should be weighed before activating the object and refreshing it.
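
    For the object-description example, the manual fix is trivial.  A sketch with the base operating system command (MYLIB/MYFILE and the text value are placeholders) looks like this, after which the next CSC pass should find the pair back in sync:

        /* Make the target object's text description match the source, then let the next CSC pass confirm it */
        CHGOBJD OBJ(MYLIB/MYFILE) OBJTYPE(*FILE) TEXT('Customer master file')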

    One bonus factor of a CSC we should include, and one that is very important to me, is highly active files.  What if a CSC is running and replication has some latency, or the file being validated is changing at a (very) high rate?  It is possible in that instant that the attribute compare (although there is logic to keep this to a minimum) could result in a 'false positive'.  What does that mean?  It is the possibility that a couple of specific attributes that are changing at a high rate do not compare during this pass of the CSC (the change has happened on the source but not yet on the target, causing the compare to throw an exception).  The attributes to watch out for are CNR (Current number of records) and NDR (Number of deleted records).  If you are reviewing OOS exceptions during a busy period and you see these exceptions (CNR or NDR), I advise you to wait for the next pass of the sync check.  Some people think that iCluster is correcting these automatically when they suddenly disappear, but what actually happens is that on the next compare cycle the process discovers they now compare successfully and the exception is cleared.  The only corrective actions a CSC takes are controlled by the Repair options on the command; otherwise, the next pass over the objects determines that the exception condition no longer exists and removes it.

    As a result, look at the exception code when you are using a CSC while the system is (very) busy, and understand which exceptions are anticipated.  You can also, on the BACKUP node, run the command DMSCRPT with the details turned on, which will list the comparator failure details.  In that case, if the NDR or CNR difference listed is a small number of records, then you can be confident in ignoring the error unless it fails again during a less busy time.  If instead the record count difference is large, then the file is likely actually OOS, and the next pass of the sync check, run while the system is quiet and the latency is resolved, will not find the situation improved.  One strategy to eliminate these 'false positive' exceptions, if you have determined that the exception is invalid in this case, is to turn OFF checking for this attribute.  That can be done at the replication group level and can be further restricted to only a handful of files if a group selection is created for this purpose.  All other replication groups can continue to check all the default attributes as desired (or even different custom attributes).
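
    If you want to sanity-check a CNR or NDR exception yourself, one simple approach (a sketch using a standard IBM i command; MYLIB/MYFILE is a placeholder) is to display the member description on both nodes and compare the record counts:

        /* Run on both the PRIMARY and BACKUP nodes, then compare the                     */
        /* 'Current number of records' and 'Number of deleted records' values shown       */
        DSPFD FILE(MYLIB/MYFILE) TYPE(*MBR)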

    The point of all this is that iCluster provides high-performance and accurate replication.  Sync Checks help you confirm the fidelity of the configuration and the reliability of your HA/DR strategy.  These automated processes are designed to reduce administrative overhead while providing advanced validation of your environment.  Batch Sync Checks and Continuous Sync Checks can be deployed together to take full advantage of the strengths of both.  Nightly batch Sync Checks are typically a once-a-day validation run at a time when the system is most quiet; this leverages the availability of excess resources, minimizes competition with production activity, and effectively eliminates false positives.  Continuous Sync Checks have the power of notifying us of an OOS condition as early as possible so that the cause can be identified and a correction started immediately.  Knowing what the report and the monitor are revealing about your environment is key to keeping it easy, reliable, and stress free. 


      ------------------------------
      Mark Watts
      Rocket Software
      ------------------------------