Summary Orbix servers in a domain stop working and raise IT_POA:LOCATION_DOMAIN_UNAVAIL
Article Number 14331
Environment UNIX Orbix 6.3.4
Question/Problem Description Orbix servers in a domain stop working after some time and raise IT_POA:LOCATION_DOMAIN_UNAVAIL.
The servers log these warnings:
(IT_ATLI2_IP:101) W - ATLI2 failure receiving data with minor_code 1230242771 occurred in TCPConnectionImpl::readable()
(IT_ATLI2_IP:102) W - ATLI2 failure caused by function ::recvmsg() failing with system error 131 ('Connection reset by peer')
(IT_ATLI2_IP:103) W - ATLI2 failure occurred in TCP connection from 127.0.0.1.34562 to 127.0.0.1.53079 after sending 53 bytes and receiving 0 bytes
(IT_ATLI2_IOP:105) W - ATLI2 Failure occurred on connection to 127.0.0.1.53079: ::recvmsg() failed in TCPConnectionImpl::readable() with: Connection reset by peer
(IT_GIOP:105) W - exception occurred while sending LocateRequest: IDL:omg.org/CORBA/COMM_FAILURE:1.0: minor = 0x49540200 (IT_GIOP:CONNECTION_LOST), completion status = MAYBE
(IT_GIOP:105) W - exception occurred while sending LocateRequest: IDL:omg.org/CORBA/COMM_FAILURE:1.0: minor = 0x49540200 (IT_GIOP:CONNECTION_LOST), completion status = MAYBE
(IT_GIOP:105) W - exception occurred while sending LocateRequest: IDL:omg.org/CORBA/TRANSIENT:1.0: minor = 0x495404C5 (IT_ATLI2_IOP:CONNECTION_CLOSED_SENDING_BUFFER), completion status = NO
(IT_GIOP:105) W - exception occurred while sending LocateRequest: IDL:omg.org/CORBA/COMM_FAILURE:1.0: minor = 0x49540200 (IT_GIOP:CONNECTION_LOST), completion status = MAYBE
(IT_GIOP:105) W - exception occurred while sending LocateRequest: IDL:omg.org/CORBA/COMM_FAILURE:1.0: minor = 0x49540200 (IT_GIOP:CONNECTION_LOST), completion status = MAYBE
IDL:omg.org/CORBA/OBJ_ADAPTER:1.0: minor = 0x49540500 (IT_POA:LOCATION_DOMAIN_UNAVAIL), completion status = NO

The node daemon log contains many warnings of this kind:

(IT_ATLI2_IP:102) W - ATLI2 failure caused by function ::fcntl(F_GETFL) failing with system error 9 ('Bad file number')
(IT_NodeDaemon:2014) F - Process does not exist. process name: Server.replica01
(IT_ATLI2_IP:101) W - ATLI2 failure creating connection with minor_code 1230242767 occurred in IPPoolImpl::prepare_socket()


The Orbix locator log file also shows errors when the locator tries to activate servers:

(IT_POA_LOCATOR:68) W - could not contact node daemon "iona_services.node_daemon.localhost" to find/activate POA. IDL:omg.org/CORBA/TRANSIENT:1.0: minor = 0x49540B40 (IT_NodeDaemon:PROCESS_ALREADY_EXISTS), completion status = MAYBE
(IT_POA_LOCATOR:5) I - PERSISTENT POA removed from cache.
POA name: POA
ORB Name: my.server.replica01
(IT_POA_LOCATOR:25) W - POA could not be activated in replica.
POA name: POA
ORB Name: my.server.replica01

Cause

In the particular case where this problem was observed, the node daemon had run out of free file descriptors. It could therefore not open any more socket connections, and the Orbix servers that were already running lost their connections to the node daemon. The locator was affected as well and logged the warning that it could not contact the node daemon.

The key indicator for this problem is the node daemon's failed call to fcntl(), which is reported as a "Bad file number" error (system error 9) in the following line of the node daemon log:

(IT_ATLI2_IP:102) W - ATLI2 failure caused by function ::fcntl(F_GETFL) failing with system error 9 ('Bad file number')
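
To check a node daemon log for this indicator, a small script can count the matching lines. The following is a minimal sketch, not an Orbix tool: the function names are hypothetical, the log file path has to be supplied by the caller, and the pattern assumes the warning is worded exactly as in the line quoted above.

```python
import re

# Pattern for the node daemon warning quoted above: an ATLI2 failure in which
# fcntl(F_GETFL) fails with system error 9 ('Bad file number').
# Assumption: the log wording matches the sample line in this article.
FD_INDICATOR = re.compile(
    r"\(IT_ATLI2_IP:102\).*::fcntl\(F_GETFL\) failing with system error 9\b"
)


def count_fd_indicator(lines):
    """Return how many of the given log lines match the indicator."""
    return sum(1 for line in lines if FD_INDICATOR.search(line))


def scan_log(path):
    """Count indicator lines in a log file; the path is supplied by the caller."""
    with open(path, errors="replace") as log:
        return count_fd_indicator(log)
```

A count above zero suggests that the file descriptor limit of the node daemon is worth checking; other IT_ATLI2_IP:102 warnings, such as the recvmsg() failure in the server log, are deliberately not counted.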

Each process can use only a limited number of file descriptors. Running "ulimit -n" in a shell shows the limit that applies to that shell and to the processes started from it. On many modern Unix operating systems the default file descriptor limit is already several thousand, which is generally enough.
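
The same limit can also be read from inside a process. The sketch below uses Python's standard resource module to show the soft and hard limits; the helper names are illustrative only, and the values apply to the Python process itself, which inherits them from the shell that started it.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) file descriptor limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(limit):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


if __name__ == "__main__":
    soft, hard = nofile_limits()
    # The soft limit is the value in effect; the hard limit is its ceiling.
    print(f"soft limit: {describe(soft)}")
    print(f"hard limit: {describe(hard)}")
```

The soft limit is the one that causes descriptor allocation to fail when it is reached; an unprivileged process can raise it only up to the hard limit.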

In Orbix domains where a single node daemon instance manages hundreds of servers (i.e. these servers all run on the same physical machine), a file descriptor limit of only 1024 might not be high enough and will need to be increased.
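
To judge how close a running node daemon is to its limit, the number of descriptors it currently holds can be counted. The following sketch assumes an operating system that exposes open descriptors under /proc/<pid>/fd; the function names are hypothetical, and the process ID and the limit to compare against must be supplied by the caller.

```python
import os


def count_open_fds(pid):
    """Count entries in /proc/<pid>/fd, or return None if it is unavailable."""
    fd_dir = f"/proc/{pid}/fd"
    try:
        return len(os.listdir(fd_dir))
    except OSError:
        # No /proc on this platform, no such process, or no permission.
        return None


def fd_headroom(pid, limit):
    """Return how many descriptors remain below limit, or None if unknown."""
    used = count_open_fds(pid)
    return None if used is None else limit - used


if __name__ == "__main__":
    # Demonstration on the current process; pass the node daemon's PID instead.
    print(count_open_fds(os.getpid()))
```

A headroom that shrinks toward zero as more servers are started indicates that the limit is too low for the number of managed servers.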

Resolution The fix for this problem is to raise the file descriptor limit for the node daemon process using the ulimit command. For example, to set the file descriptor limit to 10000, run "ulimit -n 10000" in the shell that starts the node daemon, or apply the setting system wide (affecting all processes on the machine). The new limit applies only to processes started afterwards, so the node daemon must be restarted for it to take effect.
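
As an alternative to calling ulimit in the start-up shell, a small wrapper can raise the limit and then start the node daemon. This is only a sketch under stated assumptions: the wrapper and its function name are hypothetical, the command that starts the node daemon is passed in by the caller rather than assumed here, and the soft limit is never raised beyond the hard limit.

```python
import os
import resource
import sys


def raise_nofile_limit(target):
    """Raise the soft descriptor limit toward target; return the soft limit in effect."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)  # the soft limit cannot exceed the hard limit
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft  # already high enough; never lower an existing limit
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return target


if __name__ == "__main__":
    # Hypothetical usage: python wrapper.py <command that starts the node daemon>
    new_soft = raise_nofile_limit(10000)
    print(f"soft file descriptor limit: {new_soft}", file=sys.stderr)
    if len(sys.argv) > 1:
        # Replace this process with the given command; it inherits the limit.
        os.execvp(sys.argv[1], sys.argv[1:])
```

If the hard limit itself is below the required value, it has to be raised by an administrator first; the wrapper then only reaches the hard limit.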
Created date: 06 September 2011
Last Modified: 13 February 2013
Last Published: 23 June 2012
First Published date: 10 September 2011

#Orbix
#KnowledgeDocs