2010. 2. 3. 15:47 Oracle
Abnormal growth in the number of racgmain check daemons
This is a bug in 10g RAC environments: racgmain check daemons are forked abnormally, memory usage climbs, and the system eventually becomes unusable.
oracle 26024 1 0 Dec 6 ? 0:00 /oracle/crs/bin/racgmain check
oracle 23218 1 0 Dec 6 ? 0:00 /oracle/crs/bin/racgmain check
oracle 23179 1 0 Dec 4 ? 0:00 /oracle/ora10/bin/racgmain check
oracle 27277 1 0 Dec 6 ? 0:00 /oracle/ora10/bin/racgmain check
oracle 1028 1 0 Dec 5 ? 0:00 /oracle/ora10/bin/racgmain check
oracle 7991 1 0 Dec 4 ? 0:00 /oracle/ora10/bin/racgmain check
oracle 15324 1 0 Dec 3 ? 0:00 /oracle/ora10/bin/racgmain check
oracle 14314 1 0 Dec 4 ? 0:00 /oracle/ora10/bin/racgmain check
oracle 10895 1 0 Dec 4 ? 0:00 /oracle/ora10/bin/racgmain check
oracle 404 1 0 Dec 3 ? 0:00 /oracle/ora10/bin/racgmain check
The fix is to apply the CRS bundle #2 patchset or to use the workaround, as described below.
=====================================================================================
Applies to:
Oracle Server - Enterprise Edition - Version: 10.2.0.1 to 11.1.0.6
Information in this document applies to any platform.
Oracle Server Enterprise Edition - Version: 10.1.0.2 to 10.2.0.4
Symptoms
The system slows down and many "racgmain check" processes may appear in ps output. The CRS log shows messages such as the following.
oracle@HA5-ZW05:[/home/oracle] ps -ef|grep "racgmain check"|wc -l
1290
~~~~
CAAMonitorHandler :: 0:Action Script /opt/oracle/product/crs/bin/racgwrap(check) timed out for ora.harac1.vip! (timeout=60)
CheckResource error for ora.harac1.vip error code = -2
CAAMonitorHandler :: 0:Could not join /opt/oracle/product/crs/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0,
other: Abnormal termination of the child
~~~~
Cause
crsd.bin invokes racgmain to check the status of the resources that are managed by CRS. racgmain is invoked through the wrapper script racgwrap. If the resource action times out, crsd kills the action script, which is racgwrap, but the racgmain process is not killed. Over time, this can leave many orphaned racgmain processes on the system, which eventually slows the system down because of resource contention at the OS level.
Internal bug 6196746 addresses this issue.
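The mechanism can be reproduced without touching a CRS installation. The following is a minimal sketch, not Oracle code: child.sh stands in for racgmain, wrap_fork.sh mimics the original tail of racgwrap, and wrap_exec.sh mimics the exec form given in the workaround under Solution. All file names and the run_demo function are hypothetical and exist only for this illustration.

```shell
#!/bin/sh
# Stand-in demo: child.sh plays the role of racgmain, wrap_*.sh the role of racgwrap.
# All names are hypothetical; nothing here touches a real CRS installation.

# run_demo MODE   (MODE is "fork" or "exec")
# Prints "orphaned" if the child outlives its killed wrapper, otherwise "killed".
run_demo() {
    mode=$1
    dir=$(mktemp -d) || return 1
    pidfile=$dir/child.pid

    # The child records its own PID, then blocks like a hung check action.
    printf '%s\n' 'echo $$ > "$1"' 'exec sleep 30' > "$dir/child.sh"

    # Original wrapper tail: the child runs as a separate process.
    printf '%s\n' 'sh "$1/child.sh" "$1/child.pid"' 'status=$?' 'exit $status' \
        > "$dir/wrap_fork.sh"

    # Workaround tail: exec replaces the wrapper, so wrapper PID == child PID.
    printf '%s\n' 'exec sh "$1/child.sh" "$1/child.pid"' > "$dir/wrap_exec.sh"

    sh "$dir/wrap_$mode.sh" "$dir" >/dev/null 2>&1 &
    wrapper_pid=$!

    # Wait (up to about 10 seconds) until the child has written its PID.
    tries=0
    while [ ! -s "$pidfile" ] && [ "$tries" -lt 10 ]; do
        sleep 1
        tries=$((tries + 1))
    done
    child_pid=$(cat "$pidfile")

    # Simulate crsd killing the timed-out action script.
    kill -9 "$wrapper_pid"
    wait "$wrapper_pid" 2>/dev/null || true

    if kill -0 "$child_pid" 2>/dev/null; then
        echo "orphaned"
        kill -9 "$child_pid"    # clean up the stand-in orphan
    else
        echo "killed"
    fi
    rm -rf "$dir"
}
```

Calling run_demo fork should report that the child survived its wrapper, while run_demo exec should report that it died with it. With exec the wrapper and the child are the same process, so the kill that crsd sends to the action script reaches the check itself.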
Solution
- This is fixed in the 11.1.0.7 patchset. If you are running into this issue in 10gR2, apply the 10.2.0.4 patchset and the latest CRS bundle patch. The fix is included in CRS bundle patches from bundle #2 onwards.
- The following option can be used as a temporary workaround until the patch is applied.
1. Make a copy of racgwrap located under $ORACLE_HOME/bin and $CRS_HOME/bin on ALL Nodes
2. Edit the file racgwrap and modify the last 3 lines from:
~~~
$ORACLE_HOME/bin/racgmain "$@"
status=$?
exit $status
~~~
to:
~~~
# Line added to fix for Bug 6196746
exec $ORACLE_HOME/bin/racgmain "$@"
~~~
3. Kill all the orphan racgmain processes running.
$ ps -ef|grep "racgmain check"
oracle 18701 1 0 Aug 1 ? 0:00 /oracle/product/10.2.0/database/bin/racgmain check
oracle 14653 1 0 Aug 1 ? 0:00 /oracle/product/10.2.0/database/bin/racgmain check
oracle 24517 1 0 Aug 1 ? 0:00 /oracle/product/10.2.0/database/bin/racgmain check
$ kill -9 <PID of racgmain>
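When there are hundreds of orphans, killing them one PID at a time is impractical. The following is a minimal sketch of how step 3 could be scripted, assuming a ps that accepts the POSIX -o option. The function names and the DO_KILL variable are hypothetical and are not part of any Oracle tooling. It selects only processes whose parent PID is 1, which matches the orphans shown in the ps output above, and it only prints what it would do unless DO_KILL=1 is set.

```shell
#!/bin/sh
# Hypothetical helper, not Oracle tooling: find and optionally kill orphaned
# "racgmain check" processes (those that have been re-parented to PID 1).

# Reads "PID PPID COMMAND ARGS..." lines on stdin and prints the PIDs of
# processes whose parent is PID 1 and whose command is ".../racgmain check".
filter_orphans() {
    awk '$2 == 1 && $3 ~ /\/racgmain$/ && $4 == "check" { print $1 }'
}

# Lists orphan PIDs on the live system using POSIX ps output columns.
list_orphans() {
    ps -e -o pid= -o ppid= -o args= | filter_orphans
}

# Dry run by default; set DO_KILL=1 to actually send the signal.
kill_orphans() {
    for pid in $(list_orphans); do
        if [ "${DO_KILL:-0}" = "1" ]; then
            kill -9 "$pid"
        else
            echo "would kill $pid"
        fi
    done
}
```

Run kill_orphans once without DO_KILL to review the list, then again with DO_KILL=1. Filtering on the parent PID avoids touching a racgmain check that is still attached to a live racgwrap, and matching on ps columns avoids counting the grep process itself, which a plain ps -ef|grep pipeline includes.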