Orbix servers in a domain stop working and raise IT_POA:LOCATION_DOMAIN_UNAVAIL

Dominique Sacre · 2013-05-17T17:44:00+00:00

Summary	Orbix servers in a domain stop working and raise IT_POA:LOCATION_DOMAIN_UNAVAIL
Article Number	14331
Environment	UNIX Orbix 6.3.4
Question/Problem Description	Orbix servers in a domain stop working and raise IT_POA:LOCATION_DOMAIN_UNAVAIL.

Orbix servers stop working after some time and raise these warnings in the log:

(IT_ATLI2_IP:101) W - ATLI2 failure receiving data with minor_code 1230242771 occurred in TCPConnectionImpl::readable()
(IT_ATLI2_IP:102) W - ATLI2 failure caused by function ::recvmsg() failing with system error 131 ('Connection reset by peer')
(IT_ATLI2_IP:103) W - ATLI2 failure occurred in TCP connection from 127.0.0.1.34562 to 127.0.0.1.53079 after sending 53 bytes and receiving 0 bytes
(IT_ATLI2_IOP:105) W - ATLI2 Failure occurred on connection to 127.0.0.1.53079: ::recvmsg() failed in TCPConnectionImpl::readable() with: Connection reset by peer
(IT_GIOP:105) W - exception occurred while sending LocateRequest: IDL:omg.org/CORBA/COMM_FAILURE:1.0: minor = 0x49540200 (IT_GIOP:CONNECTION_LOST), completion status = MAYBE
(IT_GIOP:105) W - exception occurred while sending LocateRequest: IDL:omg.org/CORBA/COMM_FAILURE:1.0: minor = 0x49540200 (IT_GIOP:CONNECTION_LOST), completion status = MAYBE
(IT_GIOP:105) W - exception occurred while sending LocateRequest: IDL:omg.org/CORBA/TRANSIENT:1.0: minor = 0x495404C5 (IT_ATLI2_IOP:CONNECTION_CLOSED_SENDING_BUFFER), completion status = NO
(IT_GIOP:105) W - exception occurred while sending LocateRequest: IDL:omg.org/CORBA/COMM_FAILURE:1.0: minor = 0x49540200 (IT_GIOP:CONNECTION_LOST), completion status = MAYBE
(IT_GIOP:105) W - exception occurred while sending LocateRequest: IDL:omg.org/CORBA/COMM_FAILURE:1.0: minor = 0x49540200 (IT_GIOP:CONNECTION_LOST), completion status = MAYBE
IDL:omg.org/CORBA/OBJ_ADAPTER:1.0: minor = 0x49540500 (IT_POA:LOCATION_DOMAIN_UNAVAIL), completion status = NO

Node daemon log containing many warnings of this kind:

(IT_ATLI2_IP:102) W - ATLI2 failure caused by function ::fcntl(F_GETFL) failing with system error 9 ('Bad file number')
(IT_NodeDaemon:2014) F - Process does not exist.
    process name: Server.replica01
(IT_ATLI2_IP:101) W - ATLI2 failure creating connection with minor_code 1230242767 occurred in IPPoolImpl::prepare_socket()

Also, the Orbix locator log file shows errors when trying to activate servers:

(IT_POA_LOCATOR:68) W - could not contact node daemon "iona_services.node_daemon.localhost" to find/activate POA. IDL:omg.org/CORBA/TRANSIENT:1.0: minor = 0x49540B40 (IT_NodeDaemon:PROCESS_ALREADY_EXISTS), completion status = MAYBE
(IT_POA_LOCATOR:5) I - PERSISTENT POA removed from cache.
    POA name: POA
    ORB Name: my.server..replica01
(IT_POA_LOCATOR:25) W - POA could not be activated in replica.
    POA name: POA
    ORB Name: my.server.replica01
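
To get a feel for how often each warning occurs in a large log file, the entries can be tallied by their event ID. The following is only a minimal sketch: it assumes that every entry starts with a prefix of the form "(IT_SUBSYSTEM:NUMBER) SEVERITY - " as in the excerpts above, and the function names count_events and count_bad_fd_warnings are hypothetical helpers, not part of Orbix.

```python
import re
from collections import Counter

# Assumed entry prefix, taken from the log excerpts: "(IT_ATLI2_IP:102) W - "
EVENT_RE = re.compile(r"\((IT_[A-Za-z0-9_]+:\d+)\)\s+[A-Z]\s+-\s")


def count_events(log_text: str) -> Counter:
    """Count log entries per event ID, e.g. 'IT_ATLI2_IP:102'."""
    return Counter(match.group(1) for match in EVENT_RE.finditer(log_text))


def count_bad_fd_warnings(log_text: str) -> int:
    """Count lines where ::fcntl(F_GETFL) failed with system error 9."""
    return sum(
        1
        for line in log_text.splitlines()
        if "::fcntl(F_GETFL)" in line and "system error 9 (" in line
    )
```

Both functions take the log contents as a string, for example the result of reading the node daemon log file, and do not modify anything.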
Cause	In the particular case where this problem was observed, the node daemon had run out of free file descriptors. It was therefore unable to open any more socket connections, which also caused the already running Orbix servers to lose their connections to the node daemon. The locator was affected as well and logged the warning that it could no longer contact the node daemon.

The key indicator for this problem is the node daemon's error when calling the function fcntl(), resulting in a "Bad file number" error. This is shown in the following log line of the node daemon:

(IT_ATLI2_IP:102) W - ATLI2 failure caused by function ::fcntl(F_GETFL) failing with system error 9 ('Bad file number')

The maximum number of file descriptors that each process can use is limited. The limit can be queried by running "ulimit -n". On many modern Unix operating systems the default file descriptor limit should already be in the range of several thousand, which should generally be enough. In Orbix domains where hundreds of servers are managed by a single node daemon instance (i.e. these servers all run on the same physical machine), a file descriptor limit of only 1024 might not be sufficient and will need to be increased.
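
The limit check described above can also be scripted. The sketch below reads the soft and hard open-file limits of the current process (the soft limit corresponds to what "ulimit -n" reports) and, where possible, counts the descriptors a process has open. The function names are hypothetical, and the /proc/<pid>/fd directory is an assumption that holds on Linux but may not exist or may differ on other Unix variants, in which case the count is reported as unavailable.

```python
import os
import resource
from typing import Optional, Tuple


def nofile_limits() -> Tuple[int, int]:
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def count_open_fds(pid: int) -> Optional[int]:
    """Count open descriptors via /proc/<pid>/fd (assumed layout), or None."""
    try:
        return len(os.listdir(f"/proc/{pid}/fd"))
    except OSError:
        # No /proc entry, no such process, or no permission to read it.
        return None


def describe(limit: int) -> str:
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


soft, hard = nofile_limits()
print(f"soft limit: {describe(soft)}, hard limit: {describe(hard)}")

used = count_open_fds(os.getpid())
if used is not None:
    print(f"descriptors currently open in this process: {used}")
```

The limits printed are those of the process running the script, so it should be run from the same shell and user account that start the node daemon. To inspect the daemon itself, its process ID could be passed to count_open_fds, given sufficient permissions.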
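Resolution	The fix to this problem is to raise the file descriptor limit for the node daemon process. This is done with the shell's ulimit command. For example, to set the file descriptor limit to 10000, run "ulimit -n 10000". The higher limit can be applied either system-wide (affecting all processes on the machine) or only in the shell that starts the node daemon. In the latter case, run the command before starting the node daemon from that shell so that the daemon inherits the new limit; a node daemon that is already running must be restarted to pick it up.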
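
As an alternative to typing "ulimit -n" by hand, a small launcher can raise the soft limit and then start a child process that inherits it. This is only a sketch under stated assumptions: raise_nofile_limit and start_with_limit are hypothetical names, the command to start the node daemon is site-specific and must be supplied by the caller, and the soft limit can only be raised up to the hard limit without additional privileges.

```python
import resource
import subprocess
from typing import List


def raise_nofile_limit(target: int) -> int:
    """Raise the soft open-file limit towards target; return the resulting soft limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Nothing to do if the limit is already unlimited or high enough.
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft
    # An unprivileged process cannot exceed the hard limit, so cap the request.
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return new_soft


def start_with_limit(command: List[str], target: int = 10000) -> subprocess.Popen:
    """Raise the limit in this process, then start command, which inherits it."""
    raise_nofile_limit(target)
    return subprocess.Popen(command)
```

If the returned soft limit is lower than the requested value, the hard limit is the constraint and has to be raised through the operating system's configuration first.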

Created date:	06 September 2011
Last Modified:	13 February 2013
Last Published:	23 June 2012
First Published date:	10 September 2011

#Orbix
#KnowledgeDocs
