Saturday, July 31, 2010

Only One Node Starts in a 2-Node RAC Cluster

Recently we had a server move from one data center to another. This post describes the issue we faced on one of the systems.

The configuration was as follows:

1) AIX machine
2) 11.1.0.7 database
3) 2-node cluster

We did a clean shutdown before the move. On starting the cluster and database afterwards, only one node came up. Whichever node started first had the full CRS/CSS stack running; the other node failed to start CSS and logged the following errors in crsd.log:

20xx-xx-xx xx:xx:xx.xxx: [ COMMCRS][903]clsc_connect: (600000000033e030) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_njzdb12_crs))
20xx-xx-xx xx:xx:xx.xxx: [ CSSCLNT][1]clsssInitNative: connect failed, rc 9
20xx-xx-xx xx:xx:xx.xxx: [ CRSRTI][1] CSS is not ready. Received status 3 from CSS. Waiting for good.

Googling for the above error mostly points to a problem with the /tmp/.oracle or /var/tmp/.oracle directory, where the files underneath were deleted while CRS was running. It was true that one of my colleagues had deleted those files. The easy fix suggested was to reboot both machines. We tried that, but no luck.

On checking the CSSD logs, I could see the message below:

ocssd.log:[ CSSD]2010-03-05 17:48:21.908 [84704144] >TRACE: clssnmReadDskHeartbeat: node 3, vm-lnx-rds1173, has a disk HB, but no network HB, DHB has rcfg 0, wrtcnt, 2, LATS 1185024, lastSeqNo 2, timestamp 1267791501/1961474
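
For anyone who wants to look for this symptom quickly, here is a minimal sketch in Python that scans ocssd.log lines for the "has a disk HB, but no network HB" trace and reports which nodes it names. The function name and the regular expression are my own illustration, based only on the line format shown above; other Clusterware versions may word the message differently.

```python
import re

# Pattern based on the single trace line quoted in this post (an assumption:
# the exact wording may differ between Clusterware versions).
NO_NETWORK_HB = re.compile(
    r"clssnmReadDskHeartbeat: node (\d+), ([^,\s]+), "
    r"has a disk HB, but no network HB"
)


def find_missing_network_hb(lines):
    """Return sorted (node_number, node_name) pairs that CSS reports as
    having a disk heartbeat but no network heartbeat."""
    found = set()
    for line in lines:
        match = NO_NETWORK_HB.search(line)
        if match:
            found.add((int(match.group(1)), match.group(2)))
    return sorted(found)


if __name__ == "__main__":
    import sys

    # Usage: python check_hb.py /path/to/ocssd.log
    with open(sys.argv[1], errors="replace") as log:
        for number, name in find_missing_network_hb(log):
            print("node %d (%s): disk HB present, network HB missing" % (number, name))
```

If this reports the peer node, the node is alive and writing to the voting disk, so the private network is the next thing to check.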

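That prompted me to check the cluster interconnect: a disk heartbeat without a network heartbeat means the other node is alive and writing to the voting disk but cannot be reached over the private network. Sure enough, the cluster interconnect was not pinging between the servers. We informed the sysadmins, who fixed it, and after that both nodes came up fine.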
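
The interconnect check itself is just a ping over the private network. As a rough sketch, it could be scripted like this in Python; the host name node2-priv is hypothetical (use whatever private interconnect name or IP your cluster is configured with), and I am assuming a ping that accepts -c for the packet count.

```python
import subprocess


def build_ping_command(host, count=2):
    """Build a ping command line; assumes ping supports '-c <count>'."""
    return ["ping", "-c", str(count), host]


def interconnect_reachable(host, count=2, runner=subprocess.call):
    """Return True if ping to the given private host exits with status 0.

    'runner' is injectable so the logic can be exercised without a network.
    """
    return runner(build_ping_command(host, count)) == 0


if __name__ == "__main__":
    # "node2-priv" is a hypothetical private interconnect name.
    peer = "node2-priv"
    if interconnect_reachable(peer):
        print("interconnect to %s is reachable" % peer)
    else:
        print("interconnect to %s is NOT reachable" % peer)
```

Run it from each node against the other node's private name, since the link has to work in both directions.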
