Rocket iCluster


Continuous Sync Checks: Should we run them in Rocket iCluster?

  • 1.  Continuous Sync Checks: Should we run them in Rocket iCluster?

    ROCKETEER
    Posted 11-11-2020 12:22
    Edited by Mark Watts 11-19-2020 18:31
    In this post we discuss ways to both increase the fidelity of your HA/DR deployment and reduce the administrative overhead required to maintain that reliable state.  There are two primary modes for running sync checks in iCluster, and the two can be deployed together to take advantage of the strengths of both.  Choosing periods of low production activity for intense validations, and choosing when continuous sync checks provide the most benefit, lets us balance deep fidelity checks against early notification so that we have high confidence and restful nights.

    What is 'fidelity' when discussing logical replication?  It refers to 'the degree of exactness with which something is copied or reproduced'.  The primary way we validate fidelity in iCluster is with Synchronization Checks (Sync Checks), an automated process that compares object existence, associated attributes and contents for each object pair (source and target) in the replication selection.

    The two main Sync Check strategies for validating fidelity are periodic batch runs (usually nightly) and something called a 'Continuous Sync Check' (CSC from here forward).  There are also on-demand Sync Checks, but that is not a 'strategy', so let's omit that discussion for now.  A CSC is a sync check that runs continuously while replication is active.  The main concept is that the sync check builds a work list of all the objects in scope for the replication group and then compares each object against the matching object on the BACKUP node.  When it reaches the end of the list, it starts again at the beginning and goes through the work list again and again while replication is active.  Only one Sync Check process is allowed per group.  That means that if a CSC is active, an additional on-demand or batch sync check will be rejected.  You would need to end the Continuous Sync Check first, then submit the single-pass Sync Check, and then restart the CSC if you want it active again.
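
    The work-list cycle described above can be pictured with a small sketch.  This is only an illustration of the concept, not iCluster's implementation: the function name, the `compare` callback and the `max_checks` limit (added so the loop terminates) are all assumptions made for the example.

```python
import time
from itertools import cycle, islice

def continuous_pass(work_list, compare, max_checks, delay_ms=0):
    """Walk the work list over and over, tracking which objects fail the compare.

    compare(name) is a hypothetical callback returning True when the source and
    target copies match. max_checks bounds the loop for illustration only; a
    real CSC keeps going while replication is active.
    """
    out_of_sync = set()
    for name in islice(cycle(work_list), max_checks):
        if compare(name):
            out_of_sync.discard(name)  # a later clean pass clears the exception
        else:
            out_of_sync.add(name)
        time.sleep(delay_ms / 1000)    # the "delay between objects" throttle
    return out_of_sync
```

    Note how an object flagged on one pass drops out of the exception set as soon as a later pass finds it matching again.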

    The iCluster continuous sync check provides quick notification in the monitor when some object pair fails the compare (is out of sync). With the multiple options available there is little need to run nightly scheduled sync checks and you can refer to the monitor to view the current health or fidelity of the objects selected in the replication group. 

    When you create or change a replication group, you can have a continuous sync check configured to start when the group starts by specifying CONTSYNC(*YES) on the DMADDGRP or DMCHGGRP command.  Select the additional options you want to control how the Sync Check process will behave.  To control the resource consumption of a CSC process, use the "Delay Between Objects" value so that the activity fits your system balance.  The default value is 200 milliseconds.  I suggest a starting value of approximately 600 and then adjusting as desired.

    Automatic CSC invocation when a group starts and ends
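
    To get a feel for what the delay value means in practice, here is a rough back-of-the-envelope calculation.  It assumes the delay is applied once per object pair and ignores the compare time itself, so it gives only a lower bound on how long one full pass takes; the object count is a made-up figure.

```python
def min_pass_seconds(object_count, delay_ms):
    """Lower bound on one full CSC pass: delay only, compare time not included."""
    return object_count * delay_ms / 1000

# Hypothetical group with 10,000 objects in scope.
default_pass = min_pass_seconds(10_000, 200)    # 2000.0 seconds, about 33 minutes
suggested_pass = min_pass_seconds(10_000, 600)  # 6000.0 seconds, 100 minutes
```

    A longer delay lowers the load but also stretches the time before any one object is checked again, so the value is a trade-off rather than a setting to maximize.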


    Alternatively, to start a continuous sync check for an active group use the command STRCNSC.  Here is an example command syntax using 600 milliseconds delay between each object pair.  STRCNSC TARGET(<backup node name>) GROUP(<group name>) DELAY(600)
    This can be submitted from a custom CLP or from a job scheduler to control the period of time you want Continuous Sync Checks active.  End a Sync Check with command ENDHASC.  The only parameter required is the Replication Group name.

    Use STRCNSC to invoke a CSC on an ACTIVE group
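
    If the command is generated by a scheduler or script, a small helper can keep the syntax consistent.  The sketch below only formats the STRCNSC string exactly as shown in the example above; the function name is hypothetical, and it deliberately does not build an ENDHASC command because the parameter keyword is not given in this post.

```python
def strcnsc_command(target, group, delay_ms=600):
    """Format the STRCNSC command string shown in the post (illustrative helper)."""
    if not isinstance(delay_ms, int) or delay_ms < 0:
        raise ValueError("delay_ms must be a non-negative integer")
    return f"STRCNSC TARGET({target}) GROUP({group}) DELAY({delay_ms})"

# BACKUP1 and PAYROLL are placeholder names, not real nodes or groups.
example = strcnsc_command("BACKUP1", "PAYROLL")
```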



    The CSC does generate user journal entries in the system audit journal (QAUDJRN).  If you have CSC running for all replication groups and the delay value is very low, the activity to the QAUDJRN can become significant.  If you issue a DSPJRN JRN(QSYS/QAUDJRN) USRPRF(DMCLUSTER) you will see a number of journal entries of code U (user) and types \o, \A, \B, \C, and \g.  Many IBM Power Systems today are configured with high performance controllers and SSD devices, and the additional I/O to the QAUDJRN does not cause the system to break a sweat.  However, if you wish to control or reduce the QAUDJRN activity, you can use one (or both) of two strategies: 1. Increase the "Delay Between Objects" value. 2. Only run CSC for the more critical replication groups.

    QAUDJRN entries for user DMCLUSTER with CSC ACTIVE

    Make sure you have automatic journal management turned on for QAUDJRN (and all journals used in replication) so that journal receivers that are no longer required are deleted automatically and storage is kept under control with very little effort once it is set up.
    That is a topic for a different blog post, but you can find that area with the iCluster command:  DMWRKJRN *ALL/*ALL *NO

    Let's discuss one final topic around Sync Checks and that is "How do you review an exception from a CSC?" 

    As we covered already, a Sync Check compares object existence, contents, and/or object attributes to validate that replication is healthy.  As each object pair is compared, if there is an exception, iCluster records it and displays the exception count in the monitor.  You can then see the exception count in the OOS column beside each replication group where a Sync Check or CSC was run.  You can also, if you wish, go to the BACKUP node and run command DMSCRPT with the desired parameters turned on to get a bigger picture of your SC strategy.  Drilling into the object details will tell you what aspect of the objects failed to compare.  It is important to know that even though the object was flagged during this pass of the sync check, in many exception cases the object is still replicating with no interruption.  A simple example of that is the Object Description field.  If for some reason the object descriptions of the source and target objects do not match, iCluster would identify that as an OOS (out of sync) condition.  Even though that attribute is minor and not something we would consider a concern, it does reveal the possibility that there was some sort of break in the original sync process, a violation of the access rules of your BACKUP node, or other mysteries of the Universe.  So you can decide: do I just go fix the description and let the next pass of the CSC discover the object has been repaired?  Or should I activate the object within iCluster and let it automatically repair the object with a new image?  When objects are small, the decision is easy: let iCluster reimage the object.  If the object is very large, then some additional factors should be weighed before activating the object and refreshing it.

    One bonus factor of a CSC that we should include, and that is very important to me, is highly active files.  What if a CSC is running and replication has some latency, or the file being validated is changing at a (very) high rate?  It is possible in that instant that the attribute compare (although there is logic to keep this to a minimum) could result in a 'false positive'.  What does that mean?  It is the possibility that a couple of specific attributes that are changing at a high rate do not compare during this pass of a CSC (the change has happened on the source but has not yet occurred on the target, causing the compare to throw an exception).  The attributes to watch out for are CNR (Current number of records) and NDR (Number of deleted records).  If you are reviewing OOS exceptions during a very busy period and you see these exceptions (CNR or NDR), I advise you to wait for the next pass of the sync check.  Some people think that iCluster is correcting these automatically when they are suddenly gone, but what actually happens is that on the next compare cycle the process discovers they now compare successfully and the exception is cleared.  A CSC corrects an exception in only two ways: through the Repair options in the command, or when the next pass over the objects finds that the exception condition no longer exists and removes it.  As a result, look at the exception code when you are using CSC while the system is (very) busy, and understand what exceptions are anticipated.  You can also run command DMSCRPT on the BACKUP node; with the details turned on, it will list the comparator failure details.  In that case, if the NDR or CNR difference listed is a small number of records, then you can be confident in ignoring the error until later (after it has failed again during a less busy time).
    If instead the record count difference is large, then you will know the file is likely actually OOS, and the next pass of the sync check while the system is quiet and latency is resolved will not find the situation improved.  One strategy to eliminate these 'false positive' exceptions, if you have determined that the exception is invalid in this case, is to turn OFF checking for this attribute.  That can be done at the replication group level and can be further restricted to only a handful of files if a group's selection is created for this purpose.  All other replication groups can continue to check all the default attributes as desired (or even different custom attributes).
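
    The review logic for CNR and NDR exceptions can be summarized as a simple decision rule.  The sketch below is one way to write it down; the function name and especially the `small_diff` threshold of 100 records are assumptions for illustration, since what counts as a "small" difference depends on the file and the workload.

```python
LATENCY_SENSITIVE = {"CNR", "NDR"}  # record-count attributes named in the post

def triage(reason_code, record_diff, system_busy, small_diff=100):
    """Suggest a next step for one OOS exception (illustrative rule only).

    small_diff is a hypothetical threshold; tune it to your own files.
    """
    likely_false_positive = (
        reason_code in LATENCY_SENSITIVE
        and system_busy
        and abs(record_diff) <= small_diff
    )
    return "wait for next pass" if likely_false_positive else "investigate"
```

    For example, a CNR difference of 12 records during a busy period would be left for the next pass, while the same code with a difference of 50,000 records, or any exception raised while the system is quiet, would be investigated.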

    The point of all this is that iCluster provides high performance and accurate replication.  Sync Checks help you confirm the fidelity of the configuration and the reliability of your HA/DR strategy.  These automated processes are designed to reduce administrative overhead while providing some advanced validation of your environment.  Batch Sync Checks and Continuous Sync Checks can be deployed together to take full advantage of the strengths of both.  Batch Sync Checks are typically a once-a-day validation, run nightly at the time when the system is most quiet.  This leverages the availability of excess resources for the lowest competition with production activity and effectively eliminates any false positives.  Continuous Sync Checks have the power of notifying us of an OOS condition as early as possible so that the cause can be identified and a correction can be started immediately.  Knowing what the report and monitor are revealing to you about your environment is key to making it easy, reliable and stress free.
         


      ------------------------------
      Mark Watts
      Rocket Software
      ------------------------------


    • 2.  RE: Continuous Sync Checks: Should we run them in Rocket iCluster?

      Posted 03-24-2021 02:11
      Hi Mark,

      When looking at out of sync objects, which reports should you be running, where (source/target), and when, to keep some sort of historical record of out of sync objects?  The reporting does help to identify why objects are out of sync.

      What would be a good practice for running specific overnight reports as a point-in-time sync check before you move to a continuous sync check, and which reports should be running for CSC?  What process and reports are recommended to slowly move to Batch Sync Checks and Continuous Sync Checks together, ultimately having no objects out of sync?

      Thank You,

      ------------------------------
      Charles Charalambous
      DXC Technology
      ------------------------------



    • 3.  RE: Continuous Sync Checks: Should we run them in Rocket iCluster?

      ROCKETEER
      Posted 03-24-2021 15:44
      Hi Charles,

      Thanks for the questions and participation in the Forum.

      If there is any confusion, it typically comes from all the various options and the flexibility in how the sync check results can be delivered to you.  You do have to consider how you want to consume the sync check results and then match that with the options available to the IT team.  One option is to turn on all of the delivery options, then decide which results communications are too much and turn those off.

      You can submit a sync check from either node, Primary or Backup nodes (source or target).  I typically recommend the request is placed in the job schedule entry on the Primary Node.  Within the parameters of the SC request there is an option to have the task generate a resulting report and options of where the report should be delivered (including an option to send a copy of the report direct to an email destination or multiple destinations upon completion).  The results of the Out of Sync count are also displayed in the Cluster Monitor and you can choose to start your investigation there. 

      Also, from the BACKUP node, you can request a report generated from the collected statistics of the last completed sync check audits using the command DMSCRPT which can include summary information and also can include the actual comparison data that mismatched in the sync check attributes.  For instance if a file is identified as out of sync with reason code NAU, the resulting report can include that the NAU code reveals that there was a "Number of Authorized Users" difference and further reveals details of the number of AUTHORIZED USERS on Source and the number of AUTHORIZED USERS on the target for this exception file. 

      To correct the difference you can take one of three options:
      1. You can activate the OOS object and iCluster will save and send a fresh copy of the object and restore it and remove the exception when successful.
      2. Because this is an 'AUTHORITY' attribute and iCluster sync checks can be submitted with a 'repair' option for these, a new Sync Check could be submitted with the repair option *AUTH, and the file difference would be resolved without resending the entire object (and would clear the exception).
      3. It is possible to do a bit more investigation, especially if the automated options did not resolve the exception.  For instance we could retrieve the object authority for the object with command DSPOBJAUT on both source and target and manually compare the results.  An example of error that would need manual intervention is that the source node object contains a user that does not exist on the target node (likely due to an exclusion).  A choice is required to resolve the exception to either remove the extra authorized user from the source object list or include the existence of the user on the target.  Once that is resolved, running a new sync check with the *AUTH repair option would resolve the exception and clear it from the monitor and reports.

      There is some personal preference in how you manage the cluster in the answer to your last set of questions.  Remember first that the sync check does not 'insure' there are no out of sync objects, it 'assures' there are no out of sync objects.  Although there are some repair options, most of the time the value of an audit or sync check is to validate that the environment 'remains in sync'.  We set up the application selections for replication, synchronize the starting point, and iCluster replication identifies and replicates the changes for objects.  A nightly or continuous sync check assures the objects are equal.  As long as no replication environment management strategies are violated and no errors in replication occur, the environment of the target remains synchronized with the source node.  The sync check validates that condition and can automatically repair some differences on the fly (*AUTH, *JRN, *CBU, *CRD differences are automatically corrected during the sync check process with no intervention whatsoever).
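
      The three correction options for the NAU example can be expressed as a small decision helper.  This is only a sketch: the only reason-code-to-repair mapping taken from this thread is NAU to *AUTH, and the function name, the `already_tried_repair` flag and the returned wording are assumptions for illustration.

```python
# Only mapping stated in the thread: NAU is an authority attribute repaired by *AUTH.
# Other reason codes would need to come from your own DMSCRPT report output.
REPAIR_OPTION_BY_REASON = {"NAU": "*AUTH"}

def next_action(reason_code, already_tried_repair=False):
    """Pick a next step for an OOS object (illustrative, not an iCluster API)."""
    repair = REPAIR_OPTION_BY_REASON.get(reason_code)
    if repair and not already_tried_repair:
        return f"rerun sync check with repair option {repair}"
    if repair:
        return "investigate manually (e.g. DSPOBJAUT on both nodes)"
    return "activate the object to refresh it"
```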

      If an environment has a period of time that is relatively quiet with zero latency, a once a day, quiet time sync check is typically adequate to confirm the environment remains synchronized.  If you run it again 24 hours later (daily) and it returns error free, that is a reliable indicator that the application environment is being replicated with high fidelity. 

      Then what does a Continuous Sync Check (CSC) provide above and beyond a periodic sync check?  The primary benefit of a CSC is early identification of an OOS state.  Since the sync checks or validations are running alongside replication continuously, you could become aware of a synchronization problem much sooner after the event that caused it.  This added service does take a little getting used to, and it requires us to be aware of situations that can display as OOS when in fact the real issue is latency.  For instance, if a process introduced latency that grew to a 30 minute delay in the replication apply on the BACKUP node, a sync check running in that condition would be expected to identify objects that do not match, in spite of an otherwise healthy replication environment.  Simply delaying any action to correct the OOS objects would likely result in the system healing itself once the latency condition was resolved.

      If you want to combine CSC and Periodic Sync Checks from batch to take advantage of the best features of both, a set of Scheduled jobs or a CLP to provide the best timing of each would be a good strategy to get both reliable low latency batch sync checks and early notification value delivered from CSC validations.
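
      As a sketch of the timing logic such a scheduled job or CLP would implement, the function below decides which mode should be active at a given time of day.  The quiet window of 01:00 to 03:00 is an invented example; since only one sync check per group is allowed, the CSC would be ended when the window opens and restarted when it closes.

```python
from datetime import time

def mode_for(now, quiet_start=time(1, 0), quiet_end=time(3, 0)):
    """Return 'batch' inside the quiet window, otherwise 'continuous'.

    The window times are hypothetical defaults; a window may wrap past midnight.
    """
    if quiet_start <= quiet_end:
        in_quiet = quiet_start <= now < quiet_end
    else:
        in_quiet = now >= quiet_start or now < quiet_end
    return "batch" if in_quiet else "continuous"
```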



      ------------------------------
      Mark Watts
      Rocket Software
      ------------------------------



    • 4.  RE: Continuous Sync Checks: Should we run them in Rocket iCluster?

      Posted 05-06-2021 20:20
      Hi Mark,

      Firstly thanks for the prompt and detailed replies to previous questions. The forum does hold a lot of good information and interesting topics.
      This Question is specific to our environment but could apply to other customers. Could be posting in the wrong question category so sorry in advance.
      We create a new library called U2TEST on the source system, then create objects, etc.  Later we decide to delete library U2TEST, and Rocket recreates library U2TEST as an empty library on the target system.
      We believe this is related to the system rules around U2S* libs. Additional detail is provided in the attached uploaded file.
      Currently on source 3 libs prefixed U2S* exist. On target 14 libs prefixed U2S* exist.
      Our internal support team meets weekly to ensure we have no OOS issues. Currently we are tracking like the Long March 5B rocket plummeting out of control but we know we are in good hands with mission control - Rocket Software. Just some rules re-alignment and minor understanding and re-configurations, should be back in synch.
      Q2. This Q may be completely out of context but raised by the Team. If you create a new library on the Target, this auto creates a new rule on the target. If the new library is deleted on the target why is the rule not removed?

      Thank You,       


      ------------------------------
      Charles Charalambous
      DXC Technology
      ------------------------------

      Attachment(s)

      docx
      U2Sup Rocket Groups.docx   28 KB 1 version


    • 5.  RE: Continuous Sync Checks: Should we run them in Rocket iCluster?

      ROCKETEER
      Posted 05-07-2021 09:56
      Hi Charles,

      Your question is one I have thought about and have filed an enhancement request regarding the challenge for discussion.  You too can report or request enhancements through the Rocket customer portal or Community, same as when you have a question or request assistance from Rocket Support.

      To explain what is happening, let's discuss what happens when you manually remove an old library selection and its contents from a replication group.  The steps are to end the replication group, remove the selections, run command 'DMSETPOS GROUP(grpname) JRN(*ALL) JRNPOSLRG(*LASTAPY)', and restart the group normally.  iCluster will rebuild the metadata and restart replication based on the revised selection criteria.

      When we add a generic selection as in your example, "U2S*", all existing and any new libraries will automatically be added on-the-fly, when they are created and automatically begin replicating in near real time.  If in your generic selections you also included not only the contents of the generic set, but also included a generic selection for the library object in QSYS then here is what happens when the generic library is deleted: 

      1. The library on the primary node is deleted per the request on the primary node (whether automated or manual)
      2. iCluster would perform the deletion of all the contents of the library on the target
      3. If the group selection includes the library object, the target side library is also deleted.
      4. Everything is cleaned up, right?  Yes, except there is still an explicit request in the group to replicate the library and its contents, and the group must be ended to remove it.  Another possibility is to leave the automatically generated selections but change them to *EXCLUDE; however, we must still stop the group before the change can be made.
      5. If no selection updates are made, then the next time the replication groups are ended and restarted normally, either manually or through the scheduled restart (perhaps weekly), the target library is created automatically for each selection in the group for which no primary library exists, in anticipation that new objects will follow in replication.

      As part of the required cleanup there are two strategies we can use. 
      1. When you run your *FULL sync checks, include the option to check for OBSOLETE objects and set the option to DELETE all found OBSOLETES.  Again we need to include the selection for the library object and the contents for this to work smoothly.  Although iCluster would continue to recreate the ghostly target libraries each time the groups are restarted, they would be empty and the sync check would delete them again when used with these options.
      2. If you would instead like to remove the selections that are responsible for creating the OBSOLETE libraries, you can identify them easily: run the sync check, check for OBSOLETEs, but omit the option to delete them.  Now that you have your list of OBSOLETEs in a requested report, run the sync check again and allow iCluster to delete them from the target and remove the exceptions from your sync check report and status display.  Then, using your report, you can end the group, identify which selections should be removed, delete them, and request a rebuild of the metadata before you restart (the DMSETPOS step described earlier in this reply).
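
      Conceptually, the OBSOLETE check for a generic library selection is a set difference: libraries that match the generic name on the target but no longer exist on the source.  The sketch below illustrates that idea on plain lists of library names; it is not how iCluster performs the check, and the function name and sample library names are invented.

```python
from fnmatch import fnmatchcase

def obsolete_libraries(source_libs, target_libs, generic="U2S*"):
    """List target libraries matching the generic name that are gone from the source."""
    source = set(source_libs)
    return sorted(
        lib for lib in set(target_libs)
        if fnmatchcase(lib, generic) and lib not in source
    )

# Hypothetical library names for illustration.
leftovers = obsolete_libraries(
    ["U2SA", "U2SB"],
    ["U2SA", "U2SB", "U2SOLD", "QGPL"],
)
```

      Here `leftovers` holds only U2SOLD: it matches the generic selection, exists on the target, and is missing from the source.  QGPL is ignored because it falls outside the generic name.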

      I think this example above also answers why a new auto-generated selection rule is not removed after a library in selection is deleted.  The group must be ended to remove or modify an existing selection.  
       
      Thanks for the feedback Charles.  We are happy you and your team are getting some value from the Forum.  Please encourage everyone in your team that might benefit from the Forum to enroll and take advantage of the information shared.


      ------------------------------
      Mark Watts
      Rocket Software
      ------------------------------



    • 6.  RE: Continuous Sync Checks: Should we run them in Rocket iCluster?

      PARTNER
      Posted 05-09-2021 20:31
      Hi Mark,

      In paragraph 2 of your response to Charles, you stated that the metadata will be rebuilt by using the process described. What's the impact of doing this? How regularly can this be done? For example, if the process you described were executed daily immediately prior to the scheduled sync checks, would this have a negative impact? Would this resolve the issue Charles is describing?

      Thanks,

      ------------------------------
      Warwick Craig
      IBMi Technical Specialist
      AMP
      ------------------------------



    • 7.  RE: Continuous Sync Checks: Should we run them in Rocket iCluster?

      ROCKETEER
      Posted 05-10-2021 12:07
      Hi Warwick,

      Thank you for participating in our discussion.

      There are two potential impacts of running the DMSETPOS command and timing the rebuild process before starting a sync check.  
      • The group is ended, the DMSETPOS command is performed, and the group is restarted.  There is a delay in the completed restart of the group due to the metadata rebuild activity.  (The replication groups today can even detect an issue and automatically rebuild metadata as an automated clean start.)  The more objects in selection for the group, the longer it could potentially take.  It is mostly an I/O process so it doesn't require a great deal of CPU but if it is performed while there is a significant load on the PRIMARY system, the delay in start up can be unsettling.  On the new Power 9 systems with improved I/O performance outfitted with SSD drives, the start delay is typically of little concern.
      • Notice that if you have 'Suspended objects' reported, their entries in the monitor are cleared during the metadata rebuild.  Does that mean the condition that caused the suspension is resolved and the suspended object is now in sync?  Not necessarily.  Since any suspension exception from iCluster was cleared, the next sync check for the group would report whether the audit failed for the object pair(s).  It is also possible that the next update processed for the object(s) causes the suspended condition to return, if the reason for the suspension still remains.
      • Also note that if there were OOS exceptions in the monitor for the group, rebuilding metadata does not clear them.  Clearing them requires 1. a successful activation, 2. a new sync check run that confirms fidelity, or 3. running the PRGHASC command on the backup node to clear a sync check exception.  A DMSETPOS start does not clear Sync Check results from previous validations.
      You might ask "How do I detect whether a Metadata Rebuild process is active?".  It's actually pretty easy with these steps.
      • From the iCluster monitor on the primary node, use option 94 next to the replication group for which you have noticed a startup delay.
      • From the Work with Active Job display use option 10 (Work with call stack) next to the PGM-HADDAS job for your group.
      • See the following step process being performed... Notice in the 'Procedure' column, the call stack indicates it is performing the procedure "rebuildMetadataForSETPOS..."  (If the HADDAS job is not yet active, we can perform the same check on the OMGROUPJOB and potentially see process "AuditAllObjects" that must complete before the rebuild will commence.  Very large groups with 1000s of objects selected may take a few minutes to complete.  Don't cancel it because the next time you restart it will have to start from the beginning again.  If in doubt, contact Rocket Support)
      • When that step completes normal startup will resume. 
      Rebuild Metadata for DMSETPOS

      Your other questions:
      How regularly can this be done?
      The replication group must be ended before you can submit the request (DMSETPOS).  You could perform it each time the group is ended; however, I don't believe it is necessary to run it as part of your nightly process.  Remember this command is recommended in the article when we remove a selection or add a new EXCLUSION to a replication group.  The other time a DMSETPOS is used is when a replication group is ended 'uncontrolled' and some recovery intervention is required to get the group restarted successfully.  (I have to say, if someone requests a controlled end of a group but only allows it 30 seconds to end, that could result in an 'immediate' end of the group and require recovery actions.  iCluster V8.3.1 ends very quickly using the default end times in the command.)

      For example, if the process you described were executed daily immediately prior to the scheduled sync checks, would this have a negative impact?
      The first thing that comes to mind is timing.  A sync check soon after the DMSETPOS startup is a good idea, since we may have cleared some suspended exceptions.  However, we would likely need to include some delay before the SC starts to allow the group to start completely and, depending on how long the group was ended, some additional time to ensure there is no latency in replication that could make our sync check results inaccurate.
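
      One way to picture that delay is as a polling loop that waits until the group is fully started and latency has drained before the sync check is submitted.  The sketch below shows only the control flow: `is_active` and `latency_seconds` are hypothetical callbacks that you would have to supply yourself (for example from your own monitoring), and the timeout and poll interval are arbitrary.

```python
import time

def wait_until_ready(is_active, latency_seconds, timeout_s=600, poll_s=5,
                     sleep=time.sleep):
    """Return True once the group is active with zero latency, False on timeout.

    is_active() and latency_seconds() are caller-supplied, hypothetical probes.
    """
    waited = 0
    while waited <= timeout_s:
        if is_active() and latency_seconds() == 0:
            return True
        sleep(poll_s)
        waited += poll_s
    return False
```

      Submitting the sync check only when this returns True avoids comparing objects while the apply process is still catching up.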

      Would this resolve the issue Charles is describing?
      Recall the process I described for Charles was to run a sync check for obsoletes to identify libraries that exist on the backup but do not exist on the primary node.  Capture the report that lists the obsolete libraries. Run the *FULL sync check again, this time with Check OBSOLETE and DELETE OBSOLETE *YES to automatically delete all the extra libraries from the backup node.  End the replication group.  Go to the group with the obsolete libraries and remove the selections for the library contents and the library object.  Then run the DMSETPOS process to assimilate the new selection for the group.  When the group is started again normally, it will not recreate the empty libraries on the target.   (If we want these libraries to never get created again even if we still have a generic selection and the same library name is used for a temporary library in the future, we could change the selection for the explicit library name to an EXCLUDE specifier.)
      All of these steps are required to prevent the group from re-creating the empty libraries on the target side.

      ------------------------------
      Mark Watts
      Rocket Software
      ------------------------------