Created On:  31 January 2013

Problem:

Enterprise Servers will not start automatically, or if they do start, error 173 "called program file not found" is reported against programs that are correctly deployed and should be found.

A customer noticed that SEPs would occasionally hang and begin taking up extra CPU time, and wanted a way to clean up any such hung SEPs periodically.  The customer wrote a script, executed with root permissions and scheduled to run automatically at 3:00 AM, that invoked the Linux/UNIX command "shutdown".  This rebooted the entire Linux/UNIX machine without first shutting down Enterprise Server (ES).

The trouble is that this causes the OS to send a SIGTERM signal to all running processes.  It is hard to predict the order in which the OS sends these signals, or the order in which the processes associated with Enterprise Server (cassi, casmgr, mfcs, mfds, eslm, etc.) terminate.  In practice the order should be treated as arbitrary.

When each "cas" process receives a SIGTERM signal, on its way down it tries to write updated information to the ES registry, and it relies on the "mfds" process to do this.  But if "mfds" has already terminated, the registry cannot be updated, and this can cause corruption in the registry.

The corrupted registry is what caused Enterprise Servers to fail on their next start, or to report 173 "not found" against valid programs.

Resolution:

The best solution is to find out why SEPs are hanging in the first place and fix that underlying issue, so that periodic reboots are no longer necessary.

If hung SEPs must be cleaned up in the meantime, ES has to be shut down in an orderly way instead of simply rebooting the machine.  A script could be invoked prior to the reboot which (with the appropriate COBOL environment variables set) would invoke "casstop" for each Enterprise Server and then wait 5 minutes before proceeding.  The script could also look for hung SEPs by examining the output of "ps -ef" and kill any that remain before allowing the machine to reboot.  The problem is avoided as long as "mfds" really is the last ES process to terminate.
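
The steps above could be sketched roughly as follows.  This is only an illustration under stated assumptions, not a tested production script: the function names are hypothetical, the "-r" form of the casstop region option should be checked against the product documentation, SEPs are assumed to appear as "cassi" processes, and the command is assumed to be the eighth field of "ps -ef" output (as on typical Linux systems).

```shell
#!/bin/sh
# Sketch: stop Enterprise Server regions in an orderly way before a reboot.
# Assumptions (not from the article): region names are passed as arguments,
# the COBOL environment (COBDIR, PATH, ...) is already set, and casstop
# takes the region name as -r<name>.  Verify these for your installation.

WAIT_SECONDS="${WAIT_SECONDS:-300}"   # the article suggests waiting 5 minutes

# Read "ps -ef" output on stdin and print the PIDs of leftover cassi
# processes.  Only the command field ($8) is checked, so mfds, casmgr and
# unrelated processes are never selected.
leftover_sep_pids() {
    awk '$8 ~ /(^|\/)cassi$/ { print $2 }'
}

# Stop each named region, wait, then kill any SEPs that are still running.
# mfds is left alone so that it remains available until the machine reboots.
stop_regions() {
    for region in "$@"; do
        casstop "-r$region" || echo "casstop failed for $region" >&2
    done

    sleep "$WAIT_SECONDS"

    for pid in $(ps -ef | leftover_sep_pids); do
        echo "killing leftover SEP $pid" >&2
        kill "$pid"
    done
}
```

A nightly job might then call something like stop_regions ESDEMO1 ESDEMO2 (placeholder region names) and invoke "shutdown" only after it returns.  Filtering on the command field alone is deliberately conservative: if the "ps -ef" layout differs, the filter matches nothing rather than killing an unrelated process.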

Incident #2595187