Rocket iCluster

iCluster Synchronization - Synchronizing all groups

    Posted 05-13-2021 19:07
    Synchronization is a big topic.  It is the part of starting logical replication that must be done well so that replication is reliable and the BACKUP node begins at a known point in time and data position.  There are several different methods and scenarios, and no single step list, including the one I share below, applies to all of them.  This method requires a save of the libraries that are in scope for replication.  As we know, the best possible save is one taken when the applications have been ended and no users are signed on and active.  Below are the high-level steps to produce a reliable start every time.

    If this is a new install and a first-time synchronization start, I assume the BACKUP node has been seeded with a recent backup and restore, all network settings are in place, iCluster is successfully installed, nodes are created, analysis is completed, replication groups are created with each group's default journals defined, and all object selections have been added to the configuration.  Once all of that is complete, you are at the same synchronization starting point as systems that have been running for a long while but, for some reason, now need a resynchronization.  The main difference is that the latter is an established and tested configuration, whereas with a new installation you should expect some tuning and configuration adjustments after the synchronization activity is complete.  It is rare that no adjustment is needed to make everything operate smoothly and optimally.

    Let's begin with the question, "Is there already a scheduled full save during which the system is available to I.T. for maintenance?"  With many businesses now facing a continuous availability demand, getting the operations window to take the tape save of the system is the most difficult part of a resync.  If you already have an outage scheduled, we want to leverage that outage time for our synchronization save.  If not, we will need to estimate how long it will take and request the interruption.
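
    For the estimate, a rough sketch of the arithmetic may help.  It assumes the outage is made up of stopping and restarting the applications, one DMMRKPOS per replication group run one after another, and the save itself, plus a safety margin.  The function name, the group names, the margin and every number below are hypothetical placeholders, not iCluster output; plug in your own timings from a DMMRKPOS dry run and from previous saves.

```python
def estimate_outage_minutes(mark_minutes_per_group, save_minutes,
                            stop_start_minutes=0.0, margin=0.2):
    """Estimate the outage window in minutes.

    mark_minutes_per_group: dict of group name -> measured DMMRKPOS minutes
                            (assumed to run sequentially, so they are summed)
    save_minutes:           expected duration of the save
    stop_start_minutes:     time to end and later restart the applications
    margin:                 safety factor, 0.2 means add 20 percent
    """
    if margin < 0:
        raise ValueError("margin must not be negative")
    base = sum(mark_minutes_per_group.values()) + save_minutes + stop_start_minutes
    return base * (1 + margin)


if __name__ == "__main__":
    # Hypothetical timings, for illustration only.
    dry_run = {"SYSTEM": 3, "PAYROLL": 8, "ORDERS": 12}
    window = estimate_outage_minutes(dry_run, save_minutes=90, stop_start_minutes=20)
    print(f"Request an outage of roughly {window:.0f} minutes")
```

    Remember that a dry run with users on the system can hit exclusive locks and run longer than it will with the applications quiesced, so treat the result as an upper-leaning guide rather than a commitment.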

    Assuming that has been negotiated and approved, here are the steps starting at the approved outage.
    a. <optional> Once I have all the groups created with the default journals defined and set, I prefer to run a DMMRKPOS as a dry run in advance of the scheduled activity.  This helps identify any issues with the group definitions and also provides a benchmark of how long you can expect it to take.  Be aware that after the DMMRKPOS is run the applications are journaled, so some additional journal management may be necessary if you perform this step long before the actual working DMMRKPOS.  Also, if the DMMRKPOS encounters exclusive locks, it will take a little longer to complete than it would with the applications quiesced.  The last MARKED position is the one the start will use.

    1. At the scheduled time, end the applications and the applications' subsystems (SBS), and disconnect users.
    2. With the iCluster SBS and nodes active, run the iCluster mark position function < DMMRKPOS > from the PRIMARY node for each replication group to mark the replication starting position.  The event log on the PRIMARY node will contain DMMRKPOS completion messages for each replication group.
    3. Perform the Save of the applications and data.
            * This is where admins get creative with the capabilities they have to image their system.  Will they use a tape device from production? Will they FLASH the storage to another system and back it up from the FLASH image?  Will they save everything to save files and then back up the save files to tape?  There are many other strategies.  The idea is to complete the imaging as quickly as possible and make the system available to the user community again. An accurately captured image of the system data, coupled with the successful MARKED position, is all that is needed during the user interruption. Review the job log to confirm all objects were saved successfully. Once these two steps (the mark position and the save) are complete, the applications can be restarted and made available again for production workload.
    4. Transport the media to be restored to the BACKUP node location.
    5. To ensure a successful restore of your data, clear the target libraries on the BACKUP node (the libraries we will be restoring into) before the restore.  This prevents the system from auto-generating new logicals over the existing physical files during the restore and eliminates OBSOLETE objects residing in the target libraries.
    6. Restore the user profiles first, followed by libraries, IFS paths, DLO if included, and finally restore authority.
    7. Examine the job log of the restore to verify all objects were restored successfully.  Start the iCluster SBS and nodes in the cluster.
    8. Perform an iCluster Sync Check with command DMSTRSC with type *OBJATTR for each group.  Include repair options for *AUTH, *CBU, *CRD.  You are mostly watching for a significant number of errors that might indicate there was a problem with the restore process.  Recall that the production system has been active since the save (how many hours or days passed between the start of the backup and the end of the restore?), so we would expect some differences.  A clean restore job log confirms we are good to go.
    9. The iCluster SBS and nodes should still be active.  If not, start them and verify the nodes are active.
    10. Start replication at the MARKED position.  I recommend you start the SYSTEM group first (or whatever you have called the group that contains your user profiles and authorities):
    DMSTGRP GROUP(SYSTEM) STRAPY(*YES) USEMARKED(*YES)
    Then start the remainder of your replication groups:
    DMSTGRP GROUP(groupname) STRAPY(*YES) USEMARKED(*YES)
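
    If you have many groups, the start order in step 10 can be sketched as a small helper that only builds the command strings, putting the system group first.  This is a minimal sketch and not part of iCluster: the function name and the PAYROLL and ORDERS group names are hypothetical, and the only command parameters used are the ones shown above.  It prints the commands for you to review and run; it does not execute anything on the IBM i.

```python
def build_start_commands(groups, system_group="SYSTEM"):
    """Return DMSTGRP command strings with the system group first.

    groups:       list of replication group names
    system_group: the group holding user profiles and authorities
                  (named SYSTEM here; use whatever you called yours)
    """
    # System group first, then the remaining groups in their given order.
    ordered = [g for g in groups if g == system_group]
    ordered += [g for g in groups if g != system_group]
    return [
        f"DMSTGRP GROUP({g}) STRAPY(*YES) USEMARKED(*YES)"
        for g in ordered
    ]


if __name__ == "__main__":
    # Hypothetical group names, for illustration only.
    for command in build_start_commands(["PAYROLL", "SYSTEM", "ORDERS"]):
        print(command)
```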

    Next we monitor the progress of the replication groups.  Once they have worked through the latency and caught up, submit Sync Checks for each replication group.  I recommend using the following command.
    ===> DMSTRSC GROUP(GROUPNAME) OUTPUT(*NONE) LOCK(*NO) SBMJOB(*YES) DLTOBSOBJ(*YES)
             RUNCHKSUM(*NO) RUNCHKOBSL(*YES) REPAIR(*AUTH *JRN *CBU *CRD)
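
    To submit the same Sync Check for every group, the command above can be generated once per group.  Again this is only a sketch that builds strings from the exact parameters shown above; the function name and the group names are hypothetical.

```python
# Options copied from the recommended DMSTRSC command above.
SYNC_CHECK_OPTIONS = (
    "OUTPUT(*NONE) LOCK(*NO) SBMJOB(*YES) DLTOBSOBJ(*YES) "
    "RUNCHKSUM(*NO) RUNCHKOBSL(*YES) REPAIR(*AUTH *JRN *CBU *CRD)"
)


def build_sync_check_commands(groups):
    """Return one DMSTRSC command string per replication group."""
    return [f"DMSTRSC GROUP({g}) {SYNC_CHECK_OPTIONS}" for g in groups]


if __name__ == "__main__":
    # Hypothetical group names, for illustration only.
    for command in build_sync_check_commands(["SYSTEM", "PAYROLL", "ORDERS"]):
        print(command)
```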

    That's it.  Next time let's talk about synchronizing replication over the network.

    #IBMi

    ------------------------------
    Mark Watts
    Rocket Software
    ------------------------------