
I'm with a system integrator helping a bank replace its current core banking system with a new one on IBM i, and iCluster is used for replication from the DC machine to the DR machine. We have installed the new DC and DR machines and are now in the pre-production test phase. My customer's IT team has come up with a test scenario for iCluster like this:

1) A few bank tellers enter test transactions prepared in advance (open new accounts, deposits, withdrawals, etc.) on the DC machine while iCluster is running.

2) After 15 minutes, my customer cuts the data link between the DC and DR machines, but all tellers continue entering test transactions until they are finished (expected to take some 40 minutes more).

3) When the tellers finish entering test transactions, the DC-DR data link is restored.

4) The customer waits until I confirm that all test transactions from the DC machine have been replicated to the DR machine before they shut down the core banking application on DC, start it on DR, and check whether all the test transactions appear on the DR machine.

My question is: what actions should I take with iCluster from beginning to end of this test (with reference to the iCluster Main Menu on the DC machine) to ensure it works as it should? The data link has gone down a few times before during the night, and I noticed from the DC machine that the DR node status was *FAILED and the replication group status was *UNKNOWN even after the data link was back up. All I did was end both the group and the node and restart them. But since no one had worked on the DC machine yet, I do not know whether this is what should be done in the test scenario above.

If you need more information before providing suggestions, let me know.

Thanks in advance.



------------------------------
Satid Singkorapoom
IBM i SME
Rocket Forum Shared Account
------------------------------


Hi Satid,

Thank you for your question and Forum post.  

I would first advise that, ahead of a planned test, you make sure you have moved to the latest version of iCluster on all nodes and applied all available updates. Also check that the IBM i OS version is up to date and is a version supported by iCluster.

Your recovery experience in the previous link outages fits expectations. With the default settings, iCluster replication groups will end on a communication failure when the nodes are unable to heartbeat/handshake with each other. The node that owns the metadata (typically the primary node) is expected to remain ACTIVE, while any additional nodes defined may show FAILED in this scenario.

There are heartbeat settings at the node level that make iCluster nodes retry less frequently, and allow more retries before declaring a failure, which may help when there are frequent communication interruptions. However, there are tradeoffs to checking the link less aggressively if you want replication to end promptly on a communication failure. Give some thought to how you want to tune iCluster's behavior before adjusting these settings.

The scenario you have described, and your past recovery steps, are valid. When it is simply a communication failure/interruption, the iCluster apply processes will continue to apply any received and staged transactions until a node declares a failure or all transactions are applied. Depending on the number of transactions waiting to be applied, it could take a few minutes to finish applying them. In your example it sounds as though there is no anticipated latency at the time of the test, so they should end with little or no delay. You really want to monitor and manage the backup node apply processes (HADTUP) and allow them to end in a controlled manner if possible. When that happens, the risk of any staged or applied transactions experiencing corruption in Stage or in the database is very low. In that case, your restart procedure should work without exception.

If a startup failure occurs for one or more groups, examine the error messages on both the source and target nodes to see whether a communication problem remains or some other error condition exists. Most errors can be resolved by trying again or by issuing DMSETPOS with the *LASTAPY operand for the group. A sample invocation using iCluster 9.1.1 is below:

DMSETPOS GROUP(SAMPLE) JRN(*ALL) JRNPOSLRG(*LASTAPY)

Review the messages issued and if no failures were received, attempt to start the group again normally.
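To make the end-to-end flow concrete, here is a rough sketch of the sequence as it might be run from a DC command line after the link is restored. Only the DMSETPOS invocation above is confirmed for your release; the group end/start command names (DMENDGRP, DMSTRGRP) and the group name SAMPLE are assumptions based on iCluster's DM* command set, so verify the exact names and parameters against your iCluster 9.1 documentation or use the equivalent iCluster Main Menu options:

```
/* 1. Check node and group status from the Main Menu               */
/*    (Work with Nodes / Work with Groups) once the link is back.  */

/* 2. If the group shows *UNKNOWN and the DR node *FAILED, end     */
/*    the group and the DR node as you did before (assumed name):  */
DMENDGRP GROUP(SAMPLE)

/* 3. Restart the DR node from the Main Menu, then start the       */
/*    group and let the staged transactions finish applying:       */
DMSTRGRP GROUP(SAMPLE)

/* 4. If the group fails to start, reposition to the last applied  */
/*    journal entry (as above) and try the start again:            */
DMSETPOS GROUP(SAMPLE) JRN(*ALL) JRNPOSLRG(*LASTAPY)
DMSTRGRP GROUP(SAMPLE)
```

Confirm from both nodes that the apply processes have ended in a controlled manner and that no transactions remain staged before telling the customer the DR copy is complete.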

If more action or error determination needs to occur, please contact Rocket Support.



------------------------------
Mark Watts
Software Engineer
Rocket Software Inc
Waltham MA US
------------------------------


Dear Mark

Thanks for your very informative response; it enlightens me. BTW, would you be able to provide the URL from which I can access iCluster's available updates? My iCluster release is 4RICLUS *BASE 5050 *CODE ICLUSTER V9R1M0.



------------------------------
Satid Singkorapoom
IBM i SME
Rocket Forum Shared Account
------------------------------


Hi Satid,

You can subscribe to news and alert updates on the Rocket Support Community Portal. After you sign in (or use the 'Sign up' link if you have no account), find your user name in the upper right corner and select 'Notifications'.

Or, go to this URL:  https://www.rocketsoftware.com/manage-your-email-preferences

Provide your email address to receive the messages, agree to the Privacy Notice, and find the Product Family for the solution you want to receive news and alerts about. Make your selection and click 'Update your Preferences'. That's it! At the bottom of the notices you receive, there is a link to unsubscribe if your preferences or interests change.



------------------------------
Mark Watts
Software Engineer
Rocket Software Inc
Waltham MA US
------------------------------