Patch Name: PHSS_35862
Patch Description: s700_800 11.11 Serviceguard A.11.16.00
Creation Date: 07/03/28
Post Date: 07/03/30
Hardware Platforms - OS Releases:
s700: 11.11
s800: 11.11
Symptoms:
PHSS_35862:
1. Defect: JAGag21443 SR: 8606465899
cmcld aborts when the select() system call is interrupted
by a signal, and the node is reset by the safety timer.
The following messages are logged in the syslog file:
cmcld[2257]: Aborting! select failed (file:
lcomm/local_server.c, line: 1165)
cmcld[2257]: select for port 46356 failed with
Interrupted system call
cmcld[2257]: select for port 46100 failed with
Interrupted system call
cmcld[2257]: 29, 95774e60, 8ef
cmcld[2257]: 17 (read)
cmcld[2257]: 19 (read)
cmcld[2257]: 20 (read)
cmcld[2257]: 21 (read)
cmcld[2257]: 26 (read)
cmcld[2257]: 27 (read)
cmcld[2257]: 28 (read)
cmcld[2257]: 29 (read)
cmcld[2257]: Aborting! select failed (file:
rcomm/comm_ip.c, line: 443)
cmcld[2257]: 33, 95774e60, 8f0
cmcld[2257]: 22 (read)
cmcld[2257]: 23 (read)
cmcld[2257]: 24 (read)
cmcld[2257]: 25 (read)
cmcld[2257]: 30 (read)
cmcld[2257]: 31 (read)
cmcld[2257]: 32 (read)
cmcld[2257]: 33 (read)
cmcld[2257]: Aborting! select failed (file:
rcomm/comm_ip.c, line: 443)
cmclconfd[2256]: The Serviceguard daemon,
/opt/cmcluster/bin/cmcld[2257], died upon receiving
signal number 6.
2. Defect: JAGag25946 SR: 8606470887
When multiple volume groups are activated at the same time
on a heavily loaded system, some of the vgchange commands
might fail with an error like the following:
vgchange: Failed to establish a connection with clvmd
for volume group /dev/vg17
3. Defect: JAGaf46362 SR: 8606386208
During cluster reformation, the cmviewcl command might
incorrectly show the status of a running package as "down".
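4. Defect: JAGag12644 SR: 8606456223
If the script or program specified in the package
SERVICE_CMD parameter does not exist or does not have
execute permission, cmsrvassistd keeps trying to restart
the service until the defined maximum service restart count
is reached. If the count is "infinite", cmsrvassistd
therefore consumes large amounts of CPU and, on a
single-CPU system, effectively takes over the system.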
5. Defect: JAGag28374 SR: 8606473752
Running Serviceguard on a uniprocessor system can lead to
cmcld consuming 100% of the CPU, resulting in a hang or a
system TOC. This problem does not occur on multiprocessor
systems.
6. Defect: JAGag16350 SR: 8606460296
Running "cmhaltcl -f" on a cluster configured with
VxVM-CVM-pkg can cause cmcld to abort, resulting in a
node TOC. This happens after a cmhaltpkg run on the
VxVM-CVM-pkg has failed.
The following errors are reported in syslog:
cmcld: Synchronous Notification to client Veritas -
vxclustd with pid did not return within specified
timeout 45.
cmcld: Aborting: sdbapi/srv_sdb.c 2060
(Synchronous Notification to external client did not
return within specified timeout.
...
cmclconfd: The Serviceguard daemon, /usr/lbin/cmcld
died upon receiving signal number 6.
7. Defect: JAGaf61508 SR: 8606401571
On a system that has disk devices with descriptions longer
than 40 characters, Serviceguard commands such as
cmapplyconf, cmcheckconf, cmviewcl or cmquerycl may fail
with messages like the following, caused by a SIGSEGV in
cmclconfd:
$ cmcheckconf -v -C cluster.ascii -p pkg.conf
Checking cluster file: cluster.ascii
Checking nodes ... Done
Checking existing configuration ... Done
Gathering configuration information ... Done
Gathering configuration information ... Done
Gathering configuration information ..
Gathering storage information ..
Found 52 devices on node <node1>
Found 62 devices on node <node2>
Analysis of 114 devices should take approximately 13
seconds
0%----10%----20%.....
Gathering Network Configuration ........ Done
Error: Unable to receive device query message from
<node2>: Software caused connection abort
Error: Unable to receive device query message from
<node1>: Software caused connection abort
Internal error: Error waiting for messages: Error 0
Error: Unable to determine a unique identifier for
physical volume /dev/dsk/c2t0d0 on node <node1>. Use
pvcreate to give the disk an identifier.
Error: Unable to determine a unique identifier for
physical volume /dev/dsk/c1t0d0 on node <node1>. Use
pvcreate to give the disk an identifier.
The stack trace from the cmclconfd core (segmentation
violation) looks like this:
#0 0xc016fb18 in strcpy+0x440 () from ./usr/lib/libc.2
#1 0x1c53bc in io_hw_path_to_str+0xf4 ()
#2 0x20e54 in disk_inquire+0x410 ()
#3 0x219d4 in disk_list_inquire+0x250 ()
#4 0x23958 in disk_info_query+0x23c ()
#5 0x27310 in disk_query+0x5e4 ()
#6 0x2c2b8 in get_tcp_messages+0x15dc ()
#7 0x2e448 in main+0x149c ()
8. Defect: JAGag20034 SR: 8606464337
Serviceguard does not fail over IPv6 addresses when the
standby interface is configured on a lower-index interface
(for example, lan1) and the primary interface is configured
on a higher-index interface (for example, lan2).
9. Defect: JAGaf77716 SR: 8606417883
cmcheckconf and cmapplyconf can abort and dump core when
there is a network configuration problem (for example, an
invalid configuration on a standby LAN card). Output like
the following is seen:
# cmcheckconf -C cluster.ascii
Begin cluster verification...
Abort(coredump)
If verbose output is enabled, you can see that the abort
occurs while network configuration information is being
gathered:
.....
Gathering Network Configuration ......Abort(coredump)
The following messages may be logged in syslog:
cmclconfd[15991]: DLPI error 1, unix error 0 sending to
aa080009167f
cmclconfd[15991]: Problem with network interface 0:
Connection timed out
The stack trace looks like this:
#0 0xc020a110 in kill () from /usr/lib/libc.2
#1 0xc01a500c in raise () from /usr/lib/libc.2
#2 0xc01e53f0 in abort_C () from /usr/lib/libc.2
#3 0xc01e544c in abort () from /usr/lib/libc.2
#4 0x3aae4 in cl_cassfail ()
#5 0x14c948 in cf_private_evaluate_dlpi_connectivity ()
#6 0x14f9ec in cf_private_evaluate_network_probing ()
#7 0x175904 in cf_private_find_config ()
#8 0x175f50 in cf_find_config ()
#9 0x68754 in config_main ()
#10 0x78394 in main ()
10. Defect: JAGag29645 SR: 8606475212
cmquerycl, cmcheckconf and cmapplyconf may log errors in
syslog if a TEAC CD-ROM drive is present on the node,
although the command completes successfully.
A message like the following is logged in syslog.log:
cmclconfd[3730]: Unable to open disk /dev/rdsk/c0t0d0:
Error 0
11. Defect: JAGaf16346 SR: 8606355632
Configuration commands such as cmgetconf fail after
reporting that a disk does not have an ID when it actually
does:
Warning: The disk at /dev/dsk/c25t0d0 on node kelvin
does not have an ID, or a disk label.
Error: Unable to determine a unique identifier for
physical volume /dev/dsk/c25t0d0 on node kelvin. Use
pvcreate to give the disk an identifier.
The following errors are reported in syslog:
Feb 6 20:01:07 kelvin cmclconfd[6345]: Unable to open
disk /dev/dsk/c25t0d0: Resource temporarily unavailable
Feb 6 20:02:20 kelvin cmclconfd[6345]: Physical volume
/dev/dsk/c25t0d0 in volume group /dev/vgXX does not
have an ID!
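Defect Description:
PHSS_35862:
1. Defect: JAGag21443 SR: 8606465899
The select() system call was not being retried for errors
caused by an interrupted system call.
Resolution:
The select() call has been modified to retry up to 10 times
for errors caused by an interrupted system call.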
2. Defect: JAGag25946 SR: 8606470887
cmlvmd allowed a maximum of only 5 connections, which is
too few.
Resolution:
The maximum number of allowed connections has been
increased to 4096.
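3. Defect: JAGaf46362 SR: 8606386208
When a communication error was detected during cluster
reformation, the command displayed the status of running
packages as "down".
Resolution:
The command code has been modified to display the status
"unknown" instead of "down" when a communication error is
detected.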
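4. Defect: JAGag12644 SR: 8606456223
Before running a service, there was no check that the
service script or program existed and had execute
permission.
Resolution:
The code has been modified to attempt a service restart
only if a script or program with execute permission exists.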
5. Defect: JAGag28374 SR: 8606473752
This problem was caused by Serviceguard using multiple
thread priorities within the same process.
Resolution:
Synchronization between the threads has been added.
6. Defect: JAGag16350 SR: 8606460296
The cmhaltpkg manpage did not state that cmhaltpkg must not
be run on the VxVM-CVM-pkg.
Resolution:
The cmhaltpkg manpage now states that cmhaltpkg must not be
run on the VxVM-CVM-pkg.
7. Defect: JAGaf61508 SR: 8606401571
The buffer for the disk description was too small to handle
long disk descriptions.
Resolution:
The buffer size has been increased to handle disk
descriptions of up to 256 characters.
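8. Defect: JAGag20034 SR: 8606464337
The lower-index interface was processed first and added to
the bridged net. In this scenario, however, the primary
interface has not yet been added to the bridged net, so the
standby interface configured on the lower-index interface
cannot tell whether IPv6 is configured.
Resolution:
The code has been modified to run the plumbing routine only
after all interface entries have been added to the
database.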
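9. Defect: JAGaf77716 SR: 8606417883
On systems with an invalid network configuration or with
intermittent network problems, configuration commands could
abort during network probing. The cause was an attempt to
probe a network card that has no subnet.
Resolution:
The code has been modified to probe only when the subnet is
non-zero.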
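10. Defect: JAGag29645 SR: 8606475212
TEAC CD/DVD devices with a unique peripheral description
were not excluded from probing. As a result, they were not
detected as CD/DVD devices.
Resolution:
cmclconfd has been modified to recognize and exclude the
specific TEAC CD/DVD devices.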
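11. Defect: JAGaf16346 SR: 8606355632
This type of device open error (EAGAIN) occurs only with
FC60 disk arrays. The error is transient, however, and
whether the "Resource temporarily unavailable" message is
returned depends on timing in the hardware or its firmware.
The investigation concluded that, in this case, an
immediate retry of the device open always succeeds.
Resolution:
The code has been modified to retry the device open when it
returns EAGAIN.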
-----------------------------------------------------------------------------
Patch Name: PHSS_35862
Patch Description: s700_800 11.11 Serviceguard A.11.16.00
Creation Date: 07/03/28
Post Date: 07/03/30
Hardware Platforms - OS Releases:
s700: 11.11
s800: 11.11
Products:
Serviceguard A.11.16.00
Filesets:
Cluster-Monitor.CM-CORE,fr=A.11.16.00,fa=HP-UX_B.11.11_32/64,v=HP
ATS-CORE.ATS-RUN,fr=A.11.16.00,fa=HP-UX_B.11.11_32/64,v=HP
Package-Manager.CM-PKG,fr=A.11.16.00,fa=HP-UX_B.11.11_32/64,v=HP
Package-Manager.CM-PKG-MAN,fr=A.11.16.00,fa=HP-UX_B.11.11_32/64,v=HP
Cluster-Monitor.CM-CORE-MAN,fr=A.11.16.00,fa=HP-UX_B.11.11_32/64,v=HP
Automatic Reboot?: No
Status: General Release
Critical:
Yes
PHSS_35862: ABORT HANG PANIC
cmcld aborts when the select() system call is
interrupted by a signal. This results in the node
being reset by the safety timer.
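The resolution recorded for this defect (JAGag21443) is to
retry select() up to 10 times when the call is interrupted.
A minimal C sketch of that pattern follows. The function name
select_read_retry and the macro SELECT_MAX_RETRIES are
hypothetical and not taken from the Serviceguard source, and
the sketch handles only a read set.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Retry limit stated in the resolution text. */
#define SELECT_MAX_RETRIES 10

/*
 * Wait for readability on the descriptors in *want, retrying when a
 * signal interrupts select() (errno == EINTR). The fd set and timeout
 * are copied before every attempt because select() may modify them.
 * Returns select()'s result and stores the ready set in *ready, or
 * returns -1 with errno set (EINTR if every attempt was interrupted).
 */
int select_read_retry(int nfds, const fd_set *want, fd_set *ready,
                      const struct timeval *timeout)
{
    int attempt;

    assert(nfds >= 0 && want != NULL && ready != NULL);

    for (attempt = 0; attempt <= SELECT_MAX_RETRIES; attempt++) {
        fd_set rfds = *want;
        struct timeval tv;
        int rc;

        if (timeout != NULL)
            tv = *timeout;
        rc = select(nfds, &rfds, NULL, NULL, timeout != NULL ? &tv : NULL);
        if (rc >= 0) {
            *ready = rfds;
            return rc;
        }
        if (errno != EINTR)
            return -1; /* a real error: report it to the caller */
    }
    errno = EINTR;
    return -1;
}
```

Each retry in this sketch restarts with the full timeout; a
caller with a hard deadline would need to subtract the time
already spent.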
cmsrvassistd will loop when the script or program
specified in a package SERVICE_CMD parameter does not
exist or does not have execute permission, attempting to
restart the service until the defined maximum service
restart count has been reached. If the count is infinite,
cmsrvassistd will consume large amounts of CPU, effectively
taking over a single-CPU system.
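The resolution recorded for this defect (JAGag12644) is to
attempt a service restart only when a script or program with
execute permission exists. A minimal C sketch of such a check
follows. The function name service_cmd_is_runnable is
hypothetical, and the sketch assumes the argument is a bare
path with no command-line arguments attached.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Return true only if path names an existing regular file that the
 * calling process is permitted to execute.
 */
bool service_cmd_is_runnable(const char *path)
{
    struct stat st;

    assert(path != NULL);

    if (stat(path, &st) != 0)
        return false;              /* missing or unreachable */
    if (!S_ISREG(st.st_mode))
        return false;              /* directory, device, and so on */
    return access(path, X_OK) == 0;
}
```

Note that access() checks permissions against the real user
ID, which may differ from the effective ID in a setuid
program.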
Serviceguard on uniprocessor systems can lead to
cmcld consuming 100% of cpu resulting in a hang or system
TOC. This does not apply to multi-processor systems.
"cmhaltcl -f" run on a cluster configured with
VxVM-CVM-pkg can cause cmcld to abort resulting in a
node TOC. This happens after a cmhaltpkg run on the
VxVM-CVM-pkg has failed.
Serviceguard commands such as cmapplyconf, cmcheckconf,
cmviewcl or cmquerycl may abort with a core dump due to a
SIGSEGV from cmclconfd when run on a system that has disk
devices with descriptions longer than 40 characters.
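The resolution recorded for this defect (JAGaf61508) enlarges
the description buffer to hold up to 256 characters, and the
stack trace for it ends in strcpy. As a general illustration
only, the C sketch below copies a description with an explicit
bound so that an over-long string is truncated rather than
written past the buffer. The names copy_disk_description and
DISK_DESC_MAX are hypothetical, and the patch text does not say
that the fix uses a bounded copy.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Longest description the resolution text says is handled. */
#define DISK_DESC_MAX 256

/*
 * Copy src into dst, which has room for dst_size bytes including the
 * terminating NUL. Returns 1 if the whole string fit, 0 if it was
 * truncated. dst is always NUL-terminated.
 */
int copy_disk_description(char *dst, size_t dst_size, const char *src)
{
    int n;

    assert(dst != NULL && src != NULL && dst_size > 0);

    n = snprintf(dst, dst_size, "%s", src);
    return n >= 0 && (size_t)n < dst_size;
}
```

A caller would typically declare the destination as
char desc[DISK_DESC_MAX + 1] and pass sizeof desc.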
cmcheckconf and cmapplyconf can abort and core dump when
there are network configuration problems such as invalid
configurations on standby lan cards.
PHSS_35302: ABORT PANIC
cmcld aborts with Failed to set keep alive.
The following message will be logged in the syslog:
vmunix: Serviceguard Aborting!
vmunix: Cause: setsockopt failed
vmunix: (File: rcomm/comm_ip_setup.c, Line: 1135)
vmunix: Aborting! setsockopt failed
vmunix: (file: rcomm/comm_ip_setup.c, line: 1135)
cmcld: Failed to set keep alive: Invalid argument
cmcld: Aborting! setsockopt failed
System repeatedly TOCs when AUTOSTART_CMCLD is set
to 1 if VxVM-CVM-pkg is unable to start.
Reuse of memory during reprobe of DGC disks can cause a
SIGSEGV in cmclconfd, resulting in command failures.
PHSS_34759: ABORT CORRUPTION PANIC
In view_system_pkg(), error_str_p and node_name_p
structures are freed twice inside a loop which may
cause cmviewcl to abort due to internal memory
allocation problems.
The Serviceguard NMAPI interface fails if the file
descriptor used to connect to cmgmsd is greater than the
default FD_SETSIZE (24576), causing data corruption in
the client process.
Use of the cl_log call instead of the context-specific
cl_clog for logging inside the cmlvm daemon brings down
cmlvmd, which eventually TOCs the node.
Formation of 2 clusters may potentially result in
packages running on 2 nodes at the same time and may
potentially result in data corruption issues.
PHSS_34503: ABORT HANG
Corruption in link level messages can lead to a cmcld
SIGSEGV even with checksummed messages. The stack traces
of the resulting core files can vary but often include
one of the functions ns_if_setgood or dlpi_recv.
A socket call failure due to insufficient available
memory causes cmcld to abort.
UDP messages were not marked as invalid even if there
were invalid values for length and offset fields in the
message, causing cmclconfd to exit without receiving
the message and/or cmviewcl to spin indefinitely.
PHSS_33836: PANIC HANG ABORT
When the timer loop thread is stuck (not holding
cm_lock) or the system clock is not advancing, cmcld
threads will not be scheduled. This prevents cmcld
timeout and prevents the safety timer being updated
resulting in all nodes being TOC'd.
A failfast package which is being halted as a result of
a cmhaltnode -vf will not be re-started on an adoptive
node if the halting node TOC's as a result of a hung
package halt script.
A small timing window exists that can cause the
configuration daemon (cmclconfd) to go into an
infinite loop on a node that is being configured
out of the cluster.
Asymmetrical subnet information does not get properly
removed in the cluster database. The subnet objects
remain but they are orphaned with no associated node
objects. When the node is re-started through
SG Manager cmomd aborts with a signal 6 error.
PHSS_33834: ABORT
During a cluster reformation, the cmcld core dumps
with a segmentation violation. The stack trace of the
core shows the function verify_required_replies() is
causing the segmentation violation.
PHSS_32733: ABORT PANIC
The Serviceguard configuration daemon cmclconfd can dump
core if the disk hardware path shown by ioscan is longer
than 30 characters.
cmcld may core dump upon encountering a corrupted DLPI
packet or when it is full polling with a remote node
that has been deleted from the cluster with the
following messages in syslog:
cmcld: Failed to lookup dlpi peer node 2
cmcld: This might be due to a corrupted dlpi packet
If a node is joining and an online change adds a
package and starts it, this may cause an assertion
failure and cmcld core on the joining node.
Under rare circumstances, cmcld may abort with the
following message:
"Bad election state handling node failure"
During local lan failover, the cmcld daemon dies
of a SIGSEGV. The stack trace of cmcld is:
#0 0x60000000c0327300:0 in
T_19_f81_cl___doprnt_main+0x99b0 () from
/usr/lib/hpux32/libc.so.1
#1 0x60000000c0315570:0 in _doprnt+0x30 ()
from /usr/lib/hpux32/libc.so.1
#2 0x60000000c033e000:0 in vsnprintf+0xa0 ()
from /usr/lib/hpux32/libc.so.1
#3 0x42eba40:0 in fr_log_nolock () at
utils/fr/fr.c:159
#4 0x42ce190:0 in cl_vlog () at
utils/cl_log.c:590
#5 0x42ce3c0:0 in cl_log () at
utils/cl_log.c:630
#6 0x41b6930:0 in ns_if_down () at
netsen/hpux/os_ns_interface.c:716
#7 0x41c02c0:0 in ns_card_error_return () at
netsen/hpux/os_ns_poll.c:1215
#8 0x42382f0:0 in dlpi_recv () at
rcomm/comm_dlpi.c:1450
#9 0x4239ac0:0 in cl_comm_dlpi_loop () at
rcomm/comm_dlpi.c:1764
#10 0x60000000c00bad20:0 in
__pthread_bound_body+0x170 () from
/usr/lib/hpux32/libpthread.so.1
Invalid data can be specified in the
USER_NAME field for the access control
policies in the cluster ascii file and
a cmapplyconf will complete without
error. When a cmapplyconf is
re-executed to correct this, and if
the cluster is running, cmcld will
abort, resulting in a node TOC.
PHSS_32731: OTHER ABORT
A defect was introduced which broke the ability for
Serviceguard to handle ip addresses in the cmclnodelist
file. This will result in the customer seeing
"Permission Denied" errors in response to commands.
Serviceguard daemons, cmgmsd for example, could not
handle more than 2048 clients: cmgmsd fails with
SIGBUS if the number of open sockets exceeds 2048.
PHSS_31075: CORRUPTION ABORT PANIC HANG
The package startup of a package using VxVM disk
groups can result in data corruption if any disk groups
used by the package are currently imported on another
node. This could occur if the package had previously
failed to halt on another node without this being
manually cleaned up or if the disk group was manually
imported for administration purposes.
Serviceguard daemon cmcld might abort if cluster is
configured with multiple heartbeats and one of the
networks is highly loaded or a local switch is
happening on it.
When no buffer space is available, the LVM daemon
aborts, causing the cluster daemon to also abort,
leading to a TOC.
Multiple cmhaltnode commands issued at the same time
may cause a deadlock in cmcld. The symptoms vary for
this problem. When cmcld is in this deadlock
situation, all the package commands will hang.
At other times, cluster reformations may repeat
indefinitely.
Category Tags:
defect_repair enhancement general_release critical panic
halts_system corruption manual_dependencies
Path Name: /hp-ux_patches/s700_800/11.X/PHSS_35862
Symptoms:
PHSS_35862:
1. Defect: JAGag21443 SR: 8606465899
cmcld aborts when the select() system call is interrupted
by a signal. This results in the node being reset by the
safety timer. The following messages will be logged in
the syslog file:
cmcld[2257]: Aborting! select failed (file:
lcomm/local_server.c, line: 1165)
cmcld[2257]: select for port 46356 failed with
Interrupted system call
cmcld[2257]: select for port 46100 failed with
Interrupted system call
cmcld[2257]: 29, 95774e60, 8ef
cmcld[2257]: 17 (read)
cmcld[2257]: 19 (read)
cmcld[2257]: 20 (read)
cmcld[2257]: 21 (read)
cmcld[2257]: 26 (read)
cmcld[2257]: 27 (read)
cmcld[2257]: 28 (read)
cmcld[2257]: 29 (read)
cmcld[2257]: Aborting! select failed (file:
rcomm/comm_ip.c, line: 443)
cmcld[2257]: 33, 95774e60, 8f0
cmcld[2257]: 22 (read)
cmcld[2257]: 23 (read)
cmcld[2257]: 24 (read)
cmcld[2257]: 25 (read)
cmcld[2257]: 30 (read)
cmcld[2257]: 31 (read)
cmcld[2257]: 32 (read)
cmcld[2257]: 33 (read)
cmcld[2257]: Aborting! select failed (file:
rcomm/comm_ip.c, line: 443)
cmclconfd[2256]: The Serviceguard daemon,
/opt/cmcluster/bin/cmcld[2257], died upon receiving
signal number 6.
2. Defect: JAGag25946 SR: 8606470887
When activating multiple volume groups at the same time
on a heavily loaded system, some of the vgchange commands
might fail with the following error message:
vgchange: Failed to establish a connection with clvmd
for volume group /dev/vg17
3. Defect: JAGaf46362 SR: 8606386208
The cmviewcl command might show the status of packages
as 'down' during cluster reformation even though the
packages are actually running.
4. Defect: JAGag12644 SR: 8606456223
cmsrvassistd will loop when the script or program
specified in a package SERVICE_CMD parameter does not
exist or does not have execute permission, attempting to
restart the service until the defined maximum service
restart count has been reached. If the count is infinite,
cmsrvassistd will consume large amounts of CPU, effectively
taking over a single-CPU system.
5. Defect: JAGag28374 SR: 8606473752
Serviceguard on uniprocessor systems can lead to
cmcld consuming 100% of cpu resulting in a hang or system
TOC. This does not apply to multi-processor systems.
6. Defect: JAGag16350 SR: 8606460296
"cmhaltcl -f" run on a cluster configured with
VxVM-CVM-pkg can cause cmcld to abort resulting in a
node TOC. This happens after a cmhaltpkg run on the
VxVM-CVM-pkg has failed.
The following errors are reported in syslog:
cmcld: Synchronous Notification to client Veritas -
vxclustd with pid did not return within specified
timeout 45.
cmcld: Aborting: sdbapi/srv_sdb.c 2060
(Synchronous Notification to external client did not
return within specified timeout.
...
cmclconfd: The Serviceguard daemon, /usr/lbin/cmcld
died upon receiving signal number 6.
7. Defect: JAGaf61508 SR: 8606401571
Serviceguard commands such as cmapplyconf, cmcheckconf,
cmviewcl or cmquerycl may fail with messages like the
following, due to a SIGSEGV from cmclconfd, when run on a
system that has disk devices with descriptions longer than
40 characters:
$ cmcheckconf -v -C cluster.ascii -p pkg.conf
Checking cluster file: cluster.ascii
Checking nodes ... Done
Checking existing configuration ... Done
Gathering configuration information ... Done
Gathering configuration information ... Done
Gathering configuration information ..
Gathering storage information ..
Found 52 devices on node <node1>
Found 62 devices on node <node2>
Analysis of 114 devices should take approximately 13
seconds
0%----10%----20%.....
Gathering Network Configuration ........ Done
Error: Unable to receive device query message from
<node2>: Software caused connection abort
Error: Unable to receive device query message from
<node1>: Software caused connection abort
Internal error: Error waiting for messages: Error 0
Error: Unable to determine a unique identifier for
physical volume /dev/dsk/c2t0d0 on node <node1>. Use
pvcreate to give the disk an identifier.
Error: Unable to determine a unique identifier for
physical volume /dev/dsk/c1t0d0 on node <node1>. Use
pvcreate to give the disk an identifier.
The stack trace for the core from cmclconfd with
segmentation violation would look like:
#0 0xc016fb18 in strcpy+0x440 () from ./usr/lib/libc.2
#1 0x1c53bc in io_hw_path_to_str+0xf4 ()
#2 0x20e54 in disk_inquire+0x410 ()
#3 0x219d4 in disk_list_inquire+0x250 ()
#4 0x23958 in disk_info_query+0x23c ()
#5 0x27310 in disk_query+0x5e4 ()
#6 0x2c2b8 in get_tcp_messages+0x15dc ()
#7 0x2e448 in main+0x149c ()
8. Defect: JAGag20034 SR: 8606464337
Serviceguard does not fail over IPv6 addresses when the
standby is configured on a lower-index interface such
as lan1 and the primary is configured on a higher-index
interface such as lan2.
9. Defect: JAGaf77716 SR: 8606417883
cmcheckconf and cmapplyconf can abort and core dump when
there are network configuration problems such as invalid
configurations on standby lan cards. The following output
is seen:
# cmcheckconf -C cluster.ascii
Begin cluster verification...
Abort(coredump)
If verbose output is enabled, you can see that the abort
occurs while gathering network configuration information:
.....
Gathering Network Configuration ......Abort(coredump)
The following messages can be seen in syslog:
cmclconfd[15991]: DLPI error 1, unix error 0 sending to
aa080009167f
cmclconfd[15991]: Problem with network interface 0:
Connection timed out
The stack trace would look like:
#0 0xc020a110 in kill () from /usr/lib/libc.2
#1 0xc01a500c in raise () from /usr/lib/libc.2
#2 0xc01e53f0 in abort_C () from /usr/lib/libc.2
#3 0xc01e544c in abort () from /usr/lib/libc.2
#4 0x3aae4 in cl_cassfail ()
#5 0x14c948 in cf_private_evaluate_dlpi_connectivity ()
#6 0x14f9ec in cf_private_evaluate_network_probing ()
#7 0x175904 in cf_private_find_config ()
#8 0x175f50 in cf_find_config ()
#9 0x68754 in config_main ()
#10 0x78394 in main ()
10. Defect: JAGag29645 SR: 8606475212
cmquerycl, cmcheckconf and cmapplyconf may log errors
in syslog if CD-ROM drives from TEAC are present in a
node, though the command succeeds.
The following message may be seen in syslog.log:
cmclconfd[3730]: Unable to open disk /dev/rdsk/c0t0d0:
Error 0
11. Defect: JAGaf16346 SR: 8606355632
Configuration commands such as cmgetconf fail after
reporting that disks do not have an ID when they do:
Warning: The disk at /dev/dsk/c25t0d0 on node kelvin
does not have an ID, or a disk label.
Error: Unable to determine a unique identifier for
physical volume /dev/dsk/c25t0d0 on node kelvin. Use
pvcreate to give the disk an identifier.
The following errors are reported in syslog:
Feb 6 20:01:07 kelvin cmclconfd[6345]: Unable to open
disk /dev/dsk/c25t0d0: Resource temporarily unavailable
Feb 6 20:02:20 kelvin cmclconfd[6345]: Physical volume
/dev/dsk/c25t0d0 in volume group /dev/vgXX does not
have an ID!
PHSS_35302:
1. Defect: JAGag09971 SR: 8606453198
With Serviceguard Extension for RAC, when the
filesystem where /etc/cmcluster resides becomes full
and Oracle is trying to request a group membership
change, messages like the following will appear in
syslog:
cmgmsd[1997]: Unable to apply the configuration change
due to insufficient disk space.
cmgmsd[1997]: ERROR: commit_cdb_txn: Failed to commit
transaction(28,No space left on device)
This could ultimately manifest itself in various Oracle
failures.
2. Defect: JAGag14977 SR: 8606458777
The Serviceguard daemon cmcld does not terminate upon
receipt of SIGILL.
3. Defect: JAGag03979 SR: 8606446615
Incorrectly formatted IP addresses in the cluster
ascii file are not correctly detected by cmcheckconf
and cmapplyconf resulting in confusing error messages.
The IP addresses are reported as 255.255.255.255 rather
than the text that was entered in the ascii file. For
example the entry:
HEARTBEAT_IP 16.113.153.bad
results in the following error:
Network interface lan0 on node ogre has a different IP
address (16.113.153.12 != 255.255.255.255)
4. Defect: JAGag11741 SR: 8606455170
A system can become unresponsive during a cmquerycl if
there are a large number of logical volumes configured
on the system. For example a system configured with
1400 logical volumes was unresponsive for 10 minutes
while cmquerycl was running. Similar delay may be
experienced with other Serviceguard commands like
cmgetconf, cmapplyconf.
5. Defect: JAGag11719 SR: 8606455144
When cmgmsd cannot be halted correctly within the timeout,
cmcld hits an assertion and the node TOCs if the safety
timer is still enabled. However, cmgmsd leaves no core
file from which to determine why it could not halt.
6. Defect: JAGag13927 SR: 8606457625
The cmquerycl command aborts with the output given below
when the cluster configuration contains DGC devices that
have long hardware paths. A similar abort may be
experienced with other Serviceguard commands such as
cmgetconf and cmapplyconf.
........
Gathering storage information
Found 23 devices on node omztcl2
Analysis of 23 devices should take approximately 5
seconds
0%----10%----20%----30%----40%----50%----60%----70%
----80%----90%----100%
Unable to receive device query message from omztcl2:
Software caused connection abort
Could not send message to node omztcl2: Software caused
connection abort
Assertion failed: conn->inuse, file:
config/config_storage.c, line:2399
7. Defect: JAGag06135 SR: 8606448943
The following warning message will be logged in the
flight recorder log, even if the kernel ticks since
boot are advancing.
FAILURE : Kernel ticks_since_boot has not been
advanced for 4.00 seconds, which is greater than or
equal to maximum allowable interval of 10.00 seconds.
8. Defect: JAGag11398 SR: 8606454772
cmcld aborts with Failed to set keep alive.
The following message will be logged in the syslog:
vmunix: Serviceguard Aborting!
vmunix: Cause: setsockopt failed
vmunix: (File: rcomm/comm_ip_setup.c, Line: 1135)
vmunix: Aborting! setsockopt failed
vmunix: (file: rcomm/comm_ip_setup.c, line: 1135)
cmcld: Failed to set keep alive: Invalid argument
cmcld: Aborting! setsockopt failed
9. Defect: JAGag13268 SR: 8606456893
If cmviewcl fails, a package that uses VxVM disk groups
will fail to start and will report that a disk group may
be imported on another node even when it is not. The
following will be seen in the package log file:
check_dg: Error DG may still be imported on HOST
10. Defect: JAGaf91590 SR: 8606432148
When hostname is greater than 12 characters, cmviewcl
will truncate the hostname to only 12 characters.
11. Defect: JAGag05782 SR: 8606448540
System repeatedly TOCs when AUTOSTART_CMCLD is set
to 1 if VxVM-CVM-pkg is unable to start.
12. Defect: JAGag20225 SR: 8606464542
cmquerycl, cmcheckconf and cmapplyconf may
encounter errors if DVD-RW drives are present
in a node.
The following message may be seen in syslog.log:
cmclconfd[9660]: Error looking up device
/dev/dsk/c17t1d0: /dev/config is not open.
PHSS_34759:
1. If a package fails to start or halt successfully for
some reason, temporary files like vgchange_pids$$,
mount_pids$$, umount_pids$$ and fsck_pids$$ may be left
over in the root directory "/".
2. In rare circumstances, the cmviewcl command may abort due
to internal memory allocation problems.
3. In the package control script templates, the explanation
fields for the VGCHANGE examples are inaccurate. The
examples and log messages from the control script do not
take shared volume group activation mode into account.
4. The Serviceguard NMAPI interface fails if the file
descriptor used to connect to cmgmsd is greater than the
default FD_SETSIZE, i.e. 24576 causing data corruption
of the client process. This is applicable only
for SGeRAC and Oracle client processes.
5. A 2-node Serviceguard cluster with a cluster lock may
form two clusters if all heartbeat networks experience
prolonged heavy network congestion and if there are
frequent kernel hangs during a cluster reformation. This
will result in a data integrity problem.
6. The cmcheckconf and cmapplyconf commands may fail with a
misleading error message when a Standby LAN in the
cluster configuration has been disconnected or has
failed.
7. For Serviceguard cluster configurations that do not have
the SGeRAC product installed a SCSI bus reset is not
issued at the appropriate time for exclusively activated
volume groups.
8. Error message output by cmcheckconf and cmapplyconf is
not helpful if the bridged network assignment changes.
9. The results of cmcheckconf -k and cmapplyconf -k differ
when a volume group listed in the cluster configuration
ascii file is powered off.
10. A port scan brings down cmlvmd, which eventually TOCs
the node.
11. As a prerequisite for supporting HP Integrity Virtual
Machine (HPVM) nodes as members of a Serviceguard
cluster, the quiescence period during cluster
reformation for HPVM guests needs to be extended.
PHSS_34503:
1. Corruption in link level messages can lead to a cmcld
SIGSEGV even with checksummed messages. The stack traces
of the resulting core files can vary but often include
one of the functions ns_if_setgood or dlpi_recv.
2. In a reforming cluster that has the
NETWORK_FAILURE_DETECTION parameter set to
INONLY_OR_INOUT, full network polling would not be
performed even if the primary lan has missed the
maximum number of inbound polling packets thus causing
a local lan failover to the standby lan not to occur.
3. cmcld aborts with "Not enough space" on socket
allocation. The following message will be logged in the
syslog:
vmunix: Failed to allocate a socket: Not enough space
vmunix: Service Guard Aborting!
vmunix: Cause: socket failed
vmunix: (File: rcomm/comm_ip_setup.c, Line: 538)
vmunix: Aborting! socket failed
4. Serviceguard automatically plumbs standby network
interfaces for IPv6 use, even when IPv6 is not being
used in the cluster configuration.
5. Sometimes a package using a psmmon EMS resource may not
come up when Serviceguard is re-started.
6. If the hacl-cfg UDP port is scanned by Linux utilities
like nmap and amap, Serviceguard commands potentially
fail for ten minutes. If inetd logging is enabled the
following message is logged to syslog:
"inetd[27802]: hacl-cfg/udp: Server failing
(looping), service terminated."
Sometimes cmviewcl ends up spinning forever with the
following output from cmviewcl:
"Protocol failure talking with cmclconfd on
10.144.196.135 (5)"
PHSS_33836:
1. Under some circumstances cmrunnode may issue an error
message "Detected Partition" although the command
succeeds. The node being started has an asymmetrical
subnet configured. Also, when the same type of node is
started through the SG Manager, the cmomd process
sometimes aborts with signal 6.
cmomd process terminated with signal 6, Aborted.
#0 0xc0214128 in kill+0x10 () from /usr/lib/libc.2
#1 0xc01ab554 in raise+0x24 () from /usr/lib/libc.2
#2 0xc01f0df0 in abort_C+0x160 () from /usr/lib/libc.2
#3 0xc01f0e4c in abort+0x1c () from /usr/lib/libc.2
#4 0x81d0 in crash_handler (s=11) at om/om_main.c:226
#5
#6 0xc019b178 in tree_concatenate+0x8 ()
from /usr/lib/libc.2
#7 0xc019c4d0 in real_free+0x498 () from /usr/lib/libc.2
#8 0xc019f698 in free+0x340 () from /usr/lib/libc.2
#9 0xc96dccc4 in cf_private_evaluate_ip6_partition
(cl=0x401cabb0, scope=25, ret=0x7bff5388,logh=0x7bff5174,
flags=336) at config/config_net_evaluate.c:1616
#10 0xc96dd224 in cf_private_evaluate_network_probing
(cl=0x401cabb0, scope=25, flags=336, logh=0x7bff5174) at
config/config_net_evaluate.c:1718
#11 0xc9710ef8 in cf_private_find_config (cl=0x401cabb0,
scope=25, flags=336, make_copy=1, logh=0x7bff5174) at
config/config_query.c:889
#12 0xc97113a8 in cf_find_config (cl=0x401cabb0,
scope=25, flags=336, logh=0x7bff5174) at
config/config_query.c:981
#13 0xc9712000 in cf_validate_network (cl=0x401cabb0,
flags=336, logh=0x7bff5174) at
config/config_query.c:1183
#14 0xc94345ac in cmp_validate_network_connections
(cl=0x401b28f0, vflag=0, log=0x40011480 "OMOB") at
providers/cmprovider/cmp_utils.c:2316
#15 0xc9454478 in exec_method_op (context=0x400388d0
"OMOB", providerOp=0x40167560 "OMOB")
at providers/cmprovider/cluster/cmp_cluster_node.c:1665
#16 0xc9454e1c in cmp_op_SGClusterNodeContainment
(context=0x400388d0 "OMOB", providerOp=0x40167560
"OMOB")
at providers/cmprovider/cluster/cmp_cluster_node.c:1772
#17 0xc9428944 in CMProviderOperation (providerOp=
0x40167560 "OMOB")
at providers/cmprovider/cmp_provider.c:1346
#18 0xc93c81c4 in _OMProviderOperation
(provider=0x40021958
"OMOB", providerOp=0x40167560 "OMOB") at
om/cm_provider.c:487
#19 0xc93dc508 in CMProviderOperation
(providerOp=0x40167560
"OMOB") at om/cm_provider_linkage.c:439
#20 0xc9370c58 in _OMExecMethodOp (
class_name=0x40166910 "SGClusterNodeContainment",
method_name=0x40012608 "start",
2. cmcld has not been updated to handle the drivers for
the newer supported cluster lock interface cards such
as the Ultra160 and Ultra320 cards. Therefore cmcld
falls back to the default (worst case) cluster lock
timings, leading to longer failover times than would be
expected: approximately 60 seconds rather than the 30
seconds which would be seen with the c720 driver for a
simple 2-node cluster with a 2-second node timeout.
3. Under very rare circumstances all the nodes in the
cluster may TOC at the same time, when the timer loop
thread is stuck (not holding cm_lock) or the system
clock is not advancing. This prevents cmcld timeout and
prevents the safety timer from being updated resulting
in a TOC.
4. Too many "Unable to stat /etc/cmcluster/cmclconfig,
No such file or directory" messages fill up syslog.
5. After deleting a node from the cluster, the
configuration daemon (cmclconfd) on the deleted
node goes into an infinite loop.
6. A failfast package which is being halted as a result of
a cmhaltnode -vf will not be re-started on an adoptive
node if the halting node TOCs as a result of a hung
failfast package halt script.
7. When using the NFS toolkit scripts in a Serviceguard
package control script, if the package is halting and
the NFS scripts are unable to cleanly shut down NFS, the
Serviceguard package script logs the following message
in the log:
Node "nodename": Package start failed at Wed Dec 7
09:22:19 EST 2005
This is a misleading message, since in reality, the
package stop failed, not the package start.
8. The documentation in the package control script states
that the HA NFS script should be named "ha_nfs.sh"
instead of the correct name "hanfs.sh".
PHSS_33834:
1. This enhancement allows the activation mode of a
Volume Group to be changed from shared mode to
exclusive mode or vice versa using vgchange while the
Volume Group is already activated on a single node.
This enhancement depends on the following patches:
LVM commands patch PHCO_33310 and Kernel patch
PHKL_33390. If these patches are not installed, this
enhancement is disabled.
2. Under very rare circumstances, if a communication
problem occurs during the very small window in which
the cluster is reforming, cmcld core dumps with a
segmentation violation. The stack trace of the core
shows that the function verify_required_replies() is
causing the segmentation violation, as follows:
#0 0x40c2310:0 in verify_required_replies+0x200 ()
#1 0x40c3310:0 in
cdb_send_coord_begin_req_event_handler+0x170 ()
#2 0x42b77b0:0 in cl_event_loop+0xd40 ()
#3 0x60000000c00b3d20:0 in __pthread_bound_body+0x170
() from /usr/lib/hpux32/libpthread.so.1
3. When a node is removed from a cluster by issuing the
command 'cmapplyconf -C cluster.ascii' on the node that
is to be removed, the following error messages will show
up:
Permission denied to <node name>
Exceeded waiting for next incarnation, proceeding
anyway..
Cannot connect to configuration daemon (cmclconfd) on
node
Unable to log transaction outcome to all nodes
4. While it is not supported to do an online change for
the cluster lock device, cmapplyconf neither logs an
error message nor does it update the cluster lock if
the cluster lock disk is changed online to a disk
belonging to the same volume group. The command should
report an error since changing the cluster lock online
is not allowed.
5. When running cmquerycl without the "-k" option, in a
cluster with a large number of disks, the system
performance of the cluster nodes becomes severely
degraded and the cmquerycl command can take over an
hour to complete. While the cmquerycl command is
running, other applications on the cluster nodes may
also experience degraded performance because cmclconfd
uses the block device files to examine each device.
This means the kernel must spend unnecessary CPU cycles
examining the buffer cache when cmclconfd closes the
device and also cmclconfd may block other processes
which want to perform file system operations because
the block device close holds a high level file system
semaphore.
6. In case of a local lan failover, cmsnmpd fails to send
out a hpmcSGLocalSwitch trap. This can happen when
there are no package IP addresses and only one IPv4 and
one IPv6 IP address on the subnets being switched. This
problem is more likely to occur on subnets using link
aggregation, such as APA or InfiniBand.
7. When using Serviceguard Manager to configure a cluster,
if one of the IPv6 IP addresses configured is longer
than 20 characters, the cluster configuration operation
will fail.
8. Incorrect usage of a Serviceguard command that is not
supported for customer use can result in high CPU usage
by the cmclconfd daemon. The cmclconfd daemon will not
exit unless the customer kills the daemon.
PHSS_32733:
1. When identd functionality is disabled by using the -i
option, the following message is logged to syslog
each time cmclconfd is executed:
"cmclconfd running with weak security (identd
disabled)"
This can result in many duplicate messages if many
cmclconfd daemons are started. For example each
Serviceguard command will result in at least one
of these messages.
2. Serviceguard SNMP Subagent may send an hpmcSGNodeFailed
trap instead of an hpmcSGNodeHalted trap when a node
halted successfully and did not actually fail.
3. Serviceguard configuration daemon cmclconfd can dump
core if the disk hardware path as shown by ioscan is
longer than 30 characters.
4. If a VOLUME_GROUP in the cluster configuration file
is removed or commented out and cmapplyconf is
performed on the file, the cluster ID will be
removed silently from that volume group, even if it
is currently active via a running package. The
next attempt to start the package will fail as a
result. The user should be made aware of this change.
5. When a LAN interface is configured in an IP subnet of a
cluster without any other remote interfaces in the
same subnet, and the LAN cable to this interface has
been disconnected or severely damaged, the cmcld daemon
on this node aborts during node start up with the
following messages in syslog:
cmcld: Unable to send DLPI info request, Bad file number
cmcld: Aborting! Failed to communicate with DLPI
6. The cmscancl command uses swlist to determine the
version of Serviceguard and Cluster Object Manager
loaded on the server. This command cannot account
for the Serviceguard product contained in the OE
bundles, hence, it cannot properly report the version
of Serviceguard loaded.
Example lines found in scancl.out:
HPUX11i-OE-MC B.11.11.0206 HP-UX Mission
Critical Operating Environment Component (ssttzd02)
HPUX11i-OE-MC B.11.11.0206 HP-UX Mission
Critical Operating Environment Component (ssttzd03)
HPUX11i-OE-MC B.11.11.0206 HP-UX Mission
Critical Operating Environment Component (ssttzd01)
7. If one node crashes while another node is joining the
cluster, a third node in the cluster may also crash with
the following message:
"Assertion failed:
(cm_cluster->e_state == CM_COMM_VERIFY_COORDINATOR ||
cm_cluster->e_state == CM_COMM_VERIFY_MEMBER), file:
cm/membership.c, line: 118"
8. When the cmgmsd log file is switched by the gmsetlog -f
command, the previous log file is not closed.
9. cmcld may core dump upon encountering a corrupted
DLPI packet. This can also happen when the cluster
configuration parameter, NETWORK_FAILURE_DETECTION
is set to INONLY_OR_INOUT and a node has been deleted
from the cluster configuration while the cluster is
online (and the cluster has not been halted and
restarted in the online configuration change).
Serviceguard may try to poll with the network
interface on the deleted node, causing this problem.
In either case the following messages will be
seen in syslog.log:
cmcld: Failed to lookup dlpi peer node 2
cmcld: This might be due to a corrupted dlpi packet
10. If a node is attempting to join the cluster and
an online configuration change adds a new package
and starts it, this may cause an assertion failure
because the package pointer was null, and cmcld
will core dump on the joining node with a stack
trace such as the following:
#0 0x60000000c0343410:0 in kill+0x30 () from
/usr/lib/hpux32/libc.so.1
#1 0x60000000c023a430:0 in raise+0x30 () from
/usr/lib/hpux32/libc.so.1
#2 0x60000000c02fc370:0 in abort+0x190 ()
from /usr/lib/hpux32/libc.so.1
#3 0x435a1d0:0 in cl_assfail (module=2,
assertion=0x40017a70 "p_ptr ! = NULL",
file=0x40017a80 "pkg/pkg_owner_handler.c",
line=1701) at utils/cl_log.c:1220
#4 0x42683b0:0 in
coordinator_to_node_events_handler
(event=0x400ff710) at
pkg/pkg_owner_handler.c:1701
#5 0x42258d0:0 in pm_remote_msg_event
(event=0x400ff710) at pkg/pkg_comm.c:137
#6 0x42220d0:0 in pm_event_handler
(event=0x400ff710) at pkg/pkg.c:624
#7 0x434cdd0:0 in cl_event_loop (arg=0x0)
at utils/cl_event.c:460
#8 0x60000000c00b3d20:0 in
__pthread_bound_body+0x170 () from
/usr/lib/hpux32/libpthread.so.1
11. Under rare circumstances, cmcld may abort with the
following message:
"Bad election state handling node failure"
12. This message appears in syslog:
"inetd[891]: hacl-cfg/udp: Server failing
(looping), service terminated."
13. The cmsnmpd subagent trap, hpmcSGLocalSwitch may
be lost if the trap destination IP address
happens to be on the network interface that is
in the process of doing a local LAN failover.
The trap may be sent before the local LAN
failover has completed, so it may be lost.
14. During local lan failover, the cmcld daemon dies
of a SIGSEGV. The stack trace of cmcld is:
#0 0x60000000c0327300:0 in
T_19_f81_cl___doprnt_main+0x99b0 () from
/usr/lib/hpux32/libc.so.1
#1 0x60000000c0315570:0 in _doprnt+0x30 ()
from /usr/lib/hpux32/libc.so.1
#2 0x60000000c033e000:0 in vsnprintf+0xa0 ()
from /usr/lib/hpux32/libc.so.1
#3 0x42eba40:0 in fr_log_nolock () at
utils/fr/fr.c:159
#4 0x42ce190:0 in cl_vlog () at
utils/cl_log.c:590
#5 0x42ce3c0:0 in cl_log () at
utils/cl_log.c:630
#6 0x41b6930:0 in ns_if_down () at
netsen/hpux/os_ns_interface.c:716
#7 0x41c02c0:0 in ns_card_error_return () at
netsen/hpux/os_ns_poll.c:1215
#8 0x42382f0:0 in dlpi_recv () at
rcomm/comm_dlpi.c:1450
#9 0x4239ac0:0 in cl_comm_dlpi_loop () at
rcomm/comm_dlpi.c:1764
#10 0x60000000c00bad20:0 in
__pthread_bound_body+0x170 () from
/usr/lib/hpux32/libpthread.so.1
15. When the network polling interval
(NETWORK_POLLING_TIMEOUT in the cluster
configuration file) is configured to 30 seconds,
the LAN interface will be marked
down even for a single miss in the link level messages.
Because of this, the LAN will be down for one poll
interval. Messages such as the following may appear
in syslog:
Jun 29 16:00:37 zeon cmcld: lan6 failed
Jun 29 16:00:37 zeon cmcld: Subnet 10.0.0.0 switched
from lan6 to lan1
Jun 29 16:00:37 zeon cmcld: lan6 switched to lan1
Jun 29 16:00:37 zeon cmcld: Switched 10.0.0.2 from
lan6 to lan1
Jun 29 16:00:37 zeon cmcld: Finished moving off lan6
Jun 29 16:01:07 zeon cmcld: Interface lan1 missed 1
both send & receive packet(s), being marked
doubtful. [1]
Jun 29 16:01:07 zeon cmcld: Interface lan1 has max
misses of send and receive packets.1.
Jun 29 16:01:07 zeon cmcld: lan1 failed
Jun 29 16:01:07 zeon cmcld: Subnet 10.0.0.0 down
Jun 29 16:01:37 zeon cmcld: lan1 recovered
Jun 29 16:01:37 zeon cmcld: Subnet 10.0.0.0 up
16. In a CVM configuration, the SG package scripts are
run prior to the completion of the CVM startup.
Specifically, the VxVM-CVM-pkg script completes.
This causes the SG packages to start; however,
the vxrecover process initiated by the
VxVM-CVM-pkg has not completed its recovery,
therefore CVM disk groups used in packages may
not be ready to use. Messages will appear in the
package log. For example, if the DG is being used
for SGeRAC, the messages would look like this:
- Node "laurent": Activating disk
group dbdg with non-exclusive option.
- Node "laurent": Activating disk
group ops_dg with non-exclusive option.
- Starting GSD
Successfully started GSD on local node
- Starting Oracle instance RAC1
PRKP-1001 : Error starting instance RAC1 on
node laurent
ORA-27302: failure occurred at: skgpwreset1
ORA-27303: additional information: invalid
shared ctx
ORA-27146: post/wait initialization failed
ORA-27300: OS system dependent operation:semget
failed with status: 28
ORA-27301: OS failure message: No space left
on device
ORA-27302: failure occurred at: sskgpbitsper
SQL>
These error messages will vary based on the types
of packages which are accessing disk groups.
For more details about which packages are affected
and how to fix this problem please see the Special
Installation Instructions section.
17. A package run/halt/mod command could be processed
when cluster reformation is still in progress. This
could cause cmcld to hit an assertion, and the node
would TOC.
18. cmviewcl -p package would display the package status
as 'unknown', when in fact the cmviewcl -v would
show that the package is 'running'.
19. Invalid data can be specified in the
USER_NAME field for the access control
policies in the cluster ascii file and
a cmapplyconf will complete without
error. When a cmapplyconf is
re-executed to correct this, and if
the cluster is running, cmcld will
abort, resulting in a node TOC.
The following message will be logged
in syslog when an invalid username
is applied:
Jul 12 11:34:08 sly cmcld: ERROR:
Invalid user name in RBA
Privilege lookup
The following messages will be logged
in syslog when the invalid username
is corrected:
Jul 12 11:35:06 sly cmcld:
cdb_db_handle_lookup - More than
one found
Jul 12 11:35:06 sly cmcld: CDB Prepare -
Unable to delete /acps/sly/*, object
does not exist
Jul 12 11:35:06 sly cmcld: CDB Prepare -
Unable to perform configuration
operation 2. Return value is 22.
Jul 12 11:35:06 sly cmcld: Aborting:
cdb/cdb_db_server.c 1937 (Failed to
roll back config change
Jul 12 11:35:06 sly cmcld:
cdb_db_handle_lookup - More than one
found
Jul 12 11:35:10 sly cmclconfd[6699]: The
Serviceguard daemon, /usr/lbin/cmcld[6700],
died upon receiving signal number 6.
PHSS_32731:
1. A defect was introduced which broke the ability for
Serviceguard to handle ip addresses in the cmclnodelist
file. This will result in the customer seeing
"Permission Denied" errors in response to commands.
This is only an issue before the cluster is configured.
Once the cluster is configured, the cmclnodelist file
is ignored and access is controlled via the access
control policies (ACPs) specified in the cluster
configuration file. These ACPs only allow the hostname
to be specified and not IP addresses. These hostnames
must also be associated with all permanent network
interface cards by adding the ip addresses to the
/etc/hosts file.
A second defect in the code that does the ip address
resolution via /etc/hosts made it fail to find the
correct hostname if the addresses in the /etc/hosts
file were not in the right order. This too would
result in various command failures with the
message "Permission Denied".
Additionally, the code was supposed to allow the use of
aliases in the /etc/hosts file, but if the ACPs were
set up using CLUSTER_NODE or ANY_SERVICEGUARD_NODE, a
root user in that cluster might be denied the ability
to execute cmapplyconf if the command came in on one of
the aliased interfaces.
2. SG version A.11.16.00 limits the minimum value for
Quorum Server (QS) polling interval to 60 seconds.
This change lowers the minimum value allowed to 10
seconds.
3. For online node reconfiguration to proceed in an SGeRAC
cluster, the cluster cannot have any LVM VG activated
in shared mode. cmapplyconf and cmlvmd will prevent
the online node reconfiguration from going through if
there are VGs activated in shared mode, and the
following error messages will appear on the terminal:
Error: VG: <vg name> is currently
activated in shared mode by node <node name>
Error: One or more volume groups are activated in
shared mode. We cannot do on line node configuration.
Please deactivate all SLVM VGs activated in shared
mode before doing online node configuration.
It is desired to have this restriction removed in order
to make it possible to have LVM VG activated in shared
mode and not receive these errors during Online node
reconfiguration in SGeRAC clusters.
4. Remote cluster configuration through cmapplyconf
reports the following errors:
Begin cluster verification...
Adding node abc to cluster xyz.
Error: Permission denied to abc.
<repeated 10 times>
Unable to verify cluster configuration change
completion, proceeding anyway..
Completed the cluster creation.
Remote cluster configuration is not allowed in
11.16.00, but the way it is handled with the errors
noted above is not user-friendly and does not make
this clear.
5. A new internal file called cmknowncmds now exists in
/etc/cmcluster. The file was originally shipped as
an empty file, giving the customer no idea of the
importance of the file. If this file is accidentally
deleted, the cluster can fail to start and messages
are logged in syslog that say: "Error opening
/etc/cmcluster/cmknowncmds"
6. Serviceguard daemons, cmgmsd for example, could not
handle more than 2048 clients: cmgmsd fails with
SIGBUS if the number of open sockets exceeds 2048.
7. Serviceguard configuration commands such as cmquerycl
and cmapplyconf do not recognize and support IP
over InfiniBand network interfaces when the interfaces
are configured in the systems. The output of the
cmquerycl command and the cluster ascii file generated
by this command do not show any information about
these interfaces. NOTE: This item is mentioned here
in this patch for HP-UX 11.11 only for completeness,
but this does not indicate introduction of support of
InfiniBand on HP-UX 11.11. The corresponding
SG 11.16.00 Patch for HP-UX 11.23 (PHSS_32740) does
introduce InfiniBand support on HP-UX 11.23.
PHSS_31075:
1. SG version A.11.16.00 will not recognize disk arrays
such as Model 10, Model 20 and Model 30 disk arrays.
Configuration commands like cmquerycl/cmcheckconf will
fail with the following error for these disks:
Error: Unable to determine a unique identifier for
physical volume /dev/dsk/<disk> on node
<node_name>. Use pvcreate to give the disk an
identifier.
Error: Unable to determine a unique identifier for
physical volume /dev/dsk/<disk> on node
<node_name>. Use pvcreate to give the disk an
identifier.
2. If the OpenView Operations (OVO) library exists at
/opt/OV/lib/libopccv.sl then every time cmsnmpd
attempts to send an OPC message, an error is written to
the cmsnmpd log at /var/adm/SGsnmpsuba.log and the call
fails so OVO does not get the message. If the OVO
library does not exist at this location, then no errors
are logged, since cmsnmpd will not attempt to send OPC
messages.
The error logged is:
***Could not load the shared
library: '/opt/OV/lib/libopccv.sl', Exec format error
This error is repeated many times, once for every
attempt to send a message. The full error reported by
dld when it tries to do the load is:
***/usr/lib/dld.sl: Can't shl_load() a library
containing Thread Local
Storage: /usr/lib/libpthread.1
/usr/lib/dld.sl: Exec format error
3. Duplicate roles are not allowed in Access Control
Policies. When wild-card access control policies
are defined using ANY_USER from ANY_SERVICEGUARD_NODE,
you cannot define another role. Configuration commands
like cmcheckconf/cmapplyconf will fail with the
following error:
Error: Duplicate access control policy for user john
at line 25. Either remove policy for ANY_USER from
ANY_SERVICEGUARD_NODE or remove policies for john.
4. A pkg will fail to start or halt due to the add/delete
IP ioctls encountering the transient ENOMEM (Not enough
space) error without SG retrying the add/delete IP
ioctls. The pkg control logs will show something
like:
"Failed to add IP 192.1.1.1 to subnet xyz: Not
enough space"
5. Serviceguard daemon cmcld might abort if the cluster is
configured with multiple heartbeats and one of the
networks is highly loaded or a local switch is
happening on it. The following messages can be seen in
syslog:
node2 cmcld: Pausing HB connection to xx.xx.xx.xx
node2 cmcld: Timed out node node1.
node2 cmcld: Attempting to form a new cluster
node2 cmcld: Assertion failed: icp->in_state ==
CL_CONN_INBOUND_READY, file: rcomm/comm_ip_state.c,
line: 183
6. Multiple cmhaltnode commands issued at the same time
may cause a deadlock in cmcld. The symptoms vary for
this problem. When cmcld is in this deadlock
situation, all the package commands will hang.
At other times, there could be infinite cluster
reformations happening in the cluster.
7. Serviceguard commands such as cmquerycl, cmcheckconf and
cmapplyconf fail with errors indicating non-uniform
connections detected between two IPv6 addresses
similar to the following:
Error: Non-uniform connection detected,
nodeA lanX fec0::1 successfully received from nodeB
lanY fec0::2
but nodeB lanY fec0::2 did not receive from nodeA
lanX fec0::1
This could be due to heavy network traffic, or heavy
load on nodeA
8. The cmsnmpd subagent will send a hpmcSGLocalSwitch
trap with incorrect lan names when there are multiple
lans with different length names. For example, if
lan4 fails over to lan11, the trap message sent may
show: "lan4 failed over to lan1".
9. cmapplyconf command fails with the message: "The
configuration change is too large to process while
the cluster is running. Split the configuration
change into multiple requests or halt the cluster."
For configurations with a large number of nodes and/or
packages, this limit is too small and makes doing
certain configuration operations cumbersome.
10. In a stressed environment, the cmsnmpd may become
unresponsive and log "***Error: get_all_status()
failed" in /var/adm/SGsnmpsuba.log. When this happens,
Serviceguard SNMP traps and mib table will no longer be
maintained and the subagent may yield incorrect data
about the cluster.
11. Volume group deactivation fails at package halt time,
causing package halt to also fail. The package control
log file shows these errors:
vgchange: Couldn't deactivate volume
group "[volume group name]": Device busy
ERROR: Function deactivate_volume_group
ERROR: Failed to deactivate [volume group name]
12. Package control script log files are created with
incorrect permissions.
13. For a particular package, if that package uses VxVM
disk groups then the VxVM disk groups might get
activated on multiple nodes, leading to possible
data corruption. If CVM disk groups and/or LVM volume
groups are used without VxVM disk groups then those
packages are not affected.
This problem can happen if that package fails to
halt on one node, leaving the VxVM disk group(s)
active and then a user attempts to start the package
on another node without first having cleaned up the
VxVM disk groups that were left active on the node
where the package halt failed.
This problem could also occur if the user had manually
imported a VxVM disk group for maintenance reasons on
one node, and then accidentally started the package
using that VxVM disk group without first deporting the
VxVM disk group.
For more details about which packages are affected
and how to fix this problem see the Special
Installation Instructions section.
14. The what (1m) command will not display any information
about Serviceguard package control scripts.
15. The cmgetconf command fails to report the converted
access control policies after rolling upgrade.
16. If running cmviewcl or cmgetconf from an HP-UX cluster
against a cluster (with -C) running on Linux with a lock
LUN configured, the lock LUN information will be missing.
17. A message may be seen in the syslog during a cluster
or a node halt that may mislead users into thinking
that it is an error message.
The message is something like below:
Lost connection with a process (pid = <pid>)
participating in configuration changes for transaction
ID (x-xxx-x-xx-x-x-x-x-x-x-x-x) (<error = errid>,
<error string>)
18. For large timeout values in cluster configuration,
Serviceguard displays overflowed garbage values in the
operation log of Serviceguard Manager.
19. A TOC occurs and the following error shows up in syslog:
Sep 2 14:32:28 u962004w cmlvmd: Failed to accept
connections from commands: No buffer space available
20. If there are no volume groups on a node, the
configuration commands (cmquerycl, cmcheckconf,
cmapplyconf, cmgetconf) may fail, and cmclconfd
may abort. Sample error:
# cmgetconf
Error: Unable to receive device query message from
nscooby: Software caused connection abort
Sample cmclconfd stack trace:
#0 0xc019da98 in free+0x140 () from /usr/lib/libc.2
#1 0x267b8 in lvm_query+0x117c ()
#2 0x2bfa0 in get_tcp_messages+0x12c4 ()
#3 0x2e448 in main+0x149c ()
PHSS_31071:
1. This patch only contains an enhancement that allows
Serviceguard to utilize the services of the identd
daemon. Therefore there is no external symptom.
Defect Description:
PHSS_35862:
1. Defect: JAGag21443 SR: 8606465899
The select() system call was not retried when it failed
because of an interrupted system call.
Resolution:
The select() system call is now retried up to ten
times when it fails because of an interrupted system
call.
2. Defect: JAGag25946 SR: 8606470887
cmlvmd could accept a maximum of only 5 connections,
which is too small.
Resolution:
The number of connections is now increased to 4096.
3. Defect: JAGaf46362 SR: 8606386208
During cluster reformation, the cmviewcl command might
experience a communication failure, which can result in
the command showing 'down' status for running packages.
Resolution:
If the command encounters this communication problem,
'unknown' status is displayed instead of 'down'.
4. Defect: JAGag12644 SR: 8606456223
There was no check to see whether the service script
or program exists or has execute permission before
attempting to run the service.
Resolution:
A service restart is attempted only if the service
script or program exists with execute permission.
5. Defect: JAGag28374 SR: 8606473752
The issue resulted from Serviceguard's use of multiple
thread priorities within the same process.
Resolution:
Added additional synchronization between the threads.
6. Defect: JAGag16350 SR: 8606460296
The cmhaltpkg man page does not clearly state that
cmhaltpkg must not be run on VxVM-CVM-pkg.
Resolution:
The cmhaltpkg man page has been modified to clearly
state that cmhaltpkg should not be run on VxVM-CVM-pkg.
7. Defect: JAGaf61508 SR: 8606401571
The buffer used for the disk description was too small
to handle long disk descriptions.
Resolution:
The buffer size is extended to 256 to handle long disk
descriptions.
8. Defect: JAGag20034 SR: 8606464337
The lower index interface is processed first and added
to the bridged net. In this scenario, the standby
configured on the lower index interface does not find
the primary interface in the bridged net to figure out
if IPv6 is configured.
Resolution:
The plumbing routine is done only after all interface
entries are added to the database.
9. Defect: JAGaf77716 SR: 8606417883
Configuration commands can abort during network probing
on a system which has an invalid network configuration
or has intermittent network problems. The abort was the
result of attempting to probe network cards which did
not have a subnet.
Resolution:
Changes were made to only probe when the subnet is not
zero.
10. Defect: JAGag29645 SR: 8606475212
TEAC CD/DVD devices were not being excluded from
probing due to their unique peripheral descriptions
resulting in them not being detected as CD/DVDs.
Resolution:
cmclconfd has been modified to recognize and exclude
specific TEAC CD/DVD devices.
11. Defect: JAGaf16346 SR: 8606355632
This device open failure with EAGAIN occurs only with
FC60 disk arrays. The failure is very transient, and
the "resource temporarily not available" return can be
due to timing in the hardware and its firmware.
Investigation concluded that an immediate retry of the
device open will always be successful.
Resolution:
The device open is retried only when it fails with
EAGAIN.
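The retry behavior described in items 1 and 11 above can
be sketched as follows. This is a minimal Python
illustration, not the Serviceguard source: retry_on_errno
and FlakyCall are hypothetical names, and the failing call
is simulated because Python's own select.select() already
retries after EINTR (since Python 3.5). The limit of ten
retries is taken from item 1; no retry count is stated for
the EAGAIN case in item 11, so reusing the same limit there
is an assumption.

```python
import errno


def retry_on_errno(call, retry_errnos, max_retries=10):
    """Run call(), retrying while it fails with an errno in retry_errnos.

    max_retries=10 mirrors the "maximum of ten times" in item 1; the rest
    of this helper is illustrative and not taken from Serviceguard.
    """
    retries = 0
    while True:
        try:
            return call()
        except OSError as exc:
            # Give up on unrelated errors or once the retry budget is spent.
            if exc.errno not in retry_errnos or retries >= max_retries:
                raise
            retries += 1


class FlakyCall:
    """Hypothetical stand-in for a system call that fails a few times."""

    def __init__(self, failures, err):
        self.failures = failures  # how many calls fail before one succeeds
        self.err = err            # errno value reported by each failure
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise OSError(self.err, "simulated failure")
        return "ready"


# A select()-like call interrupted twice by a signal, then successful.
flaky_select = FlakyCall(failures=2, err=errno.EINTR)
select_result = retry_on_errno(flaky_select, {errno.EINTR})

# A device open that returns EAGAIN once, retried only on that errno.
flaky_open = FlakyCall(failures=1, err=errno.EAGAIN)
open_result = retry_on_errno(flaky_open, {errno.EAGAIN})
```

Errors other than the listed ones are raised immediately
rather than retried, which matches the "only when it
returns with EAGAIN" wording of item 11.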
PHSS_35302:
1. Defect: JAGag09971 SR: 8606453198
cmgmsd required that group membership transactions be
committed to the /etc/cmcluster/cmclconfig cluster
binary file on all nodes before the transaction would
complete. If this filesystem is full the transaction
would fail resulting in Oracle errors. However, group
membership information is transient and does not have
to be written to disk.
Resolution:
cmgmsd transactions no longer fail if there is not
enough disk space to write them to the binary
configuration file. An error is written to syslog but
the transaction completes preventing Oracle errors. The
transaction will be written to disk on nodes which have
enough space.
2. Defect: JAGag14977 SR: 8606458777
cmcld was coded to ignore SIGILL.
Resolution:
The code to ignore SIGILL has been removed.
3. Defect: JAGag03979 SR: 8606446615
cmcheckconf fails when an incorrectly formatted IP
address is specified in the cluster configuration
ascii file giving an error message that is misleading.
Resolution:
The cmcheckconf code has been modified to report an
appropriate error message.
4. Defect: JAGag11741 SR: 8606455170
cmquerycl opens the block logical volume device instead
of the raw logical volume while querying logical
volumes. This results in overhead of closing the block
logical volume device in terms of holding the
filesystem alpha semaphore.
Resolution:
The code is modified to open the raw logical volume
device rather than the block logical volume device.
5. Defect: JAGag11719 SR: 8606455144
cmgmsd produced no core file because it was not
aborted when it failed to halt correctly within the
timeout.
Resolution:
cmcld will now send an abort signal to cmgmsd if the
daemon fails to halt within the timeout, thus causing
cmgmsd to dump a core file. This change helps future
troubleshooting of the underlying problem.
6. Defect: JAGag13927 SR: 8606457625
Disks which identify as DGC are probed twice when
configuration commands such as cmapplyconf are run.
During the second probe the hardware path of the device
is written to the wrong area of memory resulting in
possible failure of cmclconfd and subsequent command
failure as a result.
Resolution:
Code modified to copy the hardware path to the correct
location in memory.
7. Defect: JAGag06135 SR: 8606448943
When 3 heartbeats are exchanged between coordinator and
member node in the same tick, "ticks_since_boot not
advancing for last 4 seconds" is logged in the flight
recorder log. This message is misleading.
Resolution:
The code has been changed so that no warning message
is logged if 3 heartbeats are received in the same
tick; the warning is logged starting at 4 heartbeats
received in the same tick.
8. Defect: JAGag11398 SR: 8606454772
Failure to set keep alive with EINVAL causes cmcld to
abort.
Resolution:
The TCP connection is closed when setting keep alive
fails with EINVAL. This triggers the rcomm health
monitor to reestablish the connection.
9. Defect: JAGag13268 SR: 8606456893
The package control script was checking the exit status
of "cmviewcl | sed" which is set to the exit status of
the sed command when the exit status of cmviewcl was
required. This meant that cmviewcl was never retried as
it was supposed to if the cmviewcl command failed.
Resolution:
The code was modified to check the status of cmviewcl
so it could be retried if it fails.
10. Defect: JAGaf91590 SR: 8606432148
When hostname is greater than 12 characters, cmviewcl
will truncate the hostname to only 12 characters.
Resolution:
Enhanced the code to display the hostname up to
31 characters by modifying cmviewcl output format
dynamically based on host name length.
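A column whose width follows the longest host name can be
sketched with the printf "*" width and precision fields.
This is a minimal illustration, not the cmviewcl code; the
function names and the two constants are assumptions based
only on the 12 and 31 character figures quoted above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MIN_NODE_COL 12   /* old fixed width mentioned above */
#define MAX_NODE_COL 31   /* new upper bound mentioned above */

/* Return the column width needed for the longest name, clamped
 * to the range MIN_NODE_COL..MAX_NODE_COL. */
int node_column_width(const char *const *names, size_t count)
{
    size_t width = MIN_NODE_COL;

    for (size_t i = 0; i < count; i++) {
        size_t len = strlen(names[i]);
        if (len > width)
            width = len;
    }
    if (width > MAX_NODE_COL)
        width = MAX_NODE_COL;
    return (int)width;
}

/* Format one output row into buf. The name is left-justified and
 * padded to 'width'; names longer than 'width' are cut at 'width'. */
int format_node_row(char *buf, size_t buflen, int width,
                    const char *name, const char *status)
{
    assert(buf != NULL && width > 0);
    return snprintf(buf, buflen, "%-*.*s %s", width, width, name, status);
}
```

The width is computed once over all names and then reused
for every row, so the columns stay aligned.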
11. Defect: JAGag05782 SR: 8606448540
When system multi-node package VxVM-CVM-pkg fails
at startup, it causes the node to TOC. If
AUTOSTART_CMCLD is set to 1, the system can repeatedly
TOC and reboot if there are any problems with the
system multi-node package after cmcld is up and
running.
Resolution:
Cluster activities are not automatically started if the
node TOCs twice in a row due to a failure in starting
VxVM-CVM-pkg. A message is logged in syslog.log and
/etc/rc.log.
12. Defect: JAGag20225 SR: 8606464542
cmclconfd was coded to exclude CD-ROM and DVD-ROM
devices by explicitly excluding devices which matched
these descriptions. This did not exclude other types
of devices such as re-writable CD and DVD devices
causing them to be probed as a regular disk resulting
in errors.
Resolution:
cmclconfd has been modified to exclude more CD and DVD
devices.
PHSS_34759:
1. If a package fails to start or halt successfully for
some reason, temporary files like vgchange_pids$$,
mount_pids$$, umount_pids$$ and fsck_pids$$ which were
created by the package control script will not be
deleted and will be left over in the root directory "/".
Resolution:
Use local shell variables to keep the pid information
instead of creating the temporary file.
2. In view_system_pkg(), error_str_p and node_name_p
structures are freed twice in a loop.
Resolution:
Fixed the code so that each structure is freed only once.
3. The current control script does not take shared VG
activation into account, and the original comments for
the existing VG activation examples are inaccurate.
Resolution:
Corrected the inaccurate comment field for existing VG
activation examples; added two new examples for shared
VG activations; changed the control script log message
to reflect the current activation mode including shared
mode.
4. If an Oracle client has more than FD_SETSIZE (i.e.
24576) files open before it registers with the group
membership services via NMAPI2, the file descriptor
which the process gets when connecting to cmgmsd via a
socket could be greater than FD_SETSIZE, causing data
corruption.
Resolution:
The code has been modified to return the error "EMFILE"
if a file descriptor higher than FD_SETSIZE is used, thus
preventing data corruption. Also, the following message
will be logged in syslog.log: "Required file descriptor
(file descriptor obtained) exceeds the maximum limit
(24576)".
5. During multiple cluster reformations, the cmcld daemon
on the cluster coordinator node may get stuck in the
clear cluster lock state because of short kernel hangs
and network congestion on the node. If this happens,
the member node can get the cluster lock and form a
single node cluster while the coordinator is stuck in
the clear cluster lock state. Later when cmcld on the
coordinator node continues, it clears the cluster lock
allowing it to form another single node cluster
resulting in two single node clusters.
Resolution:
A fix is added to ensure that the cluster lock is
only cleared after successful completion of cluster
formation.
6. cmapplyconf and cmcheckconf fail if a standby LAN is
down/disconnected, without indicating the exact reason
for the failure.
Resolution:
cmapplyconf and cmcheckconf will log a specific error if
the standby LAN card is disconnected.
7. During a cluster reformation the Serviceguard daemon
should issue a bus reset for all cluster aware volume
groups. This was not being done for exclusively
activated volume groups on SCSI buses.
Resolution:
Issue bus reset for all cluster aware volume groups on
SCSI buses.
8. cmcheckconf and cmapplyconf fail when -C cmcluster.ascii
is specified and the bridged net assignment has changed.
This could be due to a link failure on one of the LAN
cards in a bridged net, since this LAN card cannot talk
to any other LAN on the local node. There were no
internal/external logging messages for this error.
Resolution:
cmapplyconf and cmcheckconf will log a specific error if
the standby LAN card is disconnected.
9. The results of cmcheckconf -k and cmapplyconf -k
differ when a volume group listed in the cluster config
ascii file is powered off. cmcheckconf does not probe
the disks listed under the volume group.
Resolution:
cmcheckconf -k has been modified to probe disks
belonging to volume groups listed in the cluster config
ascii file. cmcheckconf -k will now report an error if
it is not able to access disks belonging to volume
groups listed in the cluster config ascii file.
10. The cmlvm daemon used cl_log calls for logging
instead of the context specific cl_clog.
Resolution:
The calls of cl_log are changed to context specific
cl_clog inside cmlvm daemon.
11. The IO requests from a failed VM member node in a
Serviceguard cluster take longer to drain than those of
regular host members, because the VM host lives on
even when the VM guest dies.
Resolution:
Quiescence period during cluster reformation is
extended to accommodate the VM member nodes in the
cluster.
PHSS_34503:
1. If the revision field of a Serviceguard link level
message is corrupt, the message can pass through,
eventually causing cmcld to abort. If the revision is
corrupted to a value lower than 3, no checksum checking
is done on the message. This results in many cmcld
SIGSEGVs when the link level polling messages are
corrupted.
Resolution:
cmcld ignores corrupted link level messages.
2. In a reforming cluster that has the
NETWORK_FAILURE_DETECTION parameter set to
INONLY_OR_INOUT, full network polling would not be
performed even if the primary lan has missed the
maximum number of inbound polling packets thus causing
a local lan failover to the standby lan not to occur.
Resolution:
A full polling is performed even if the state of the
cluster is REFORMING.
3. A socket call failure due to insufficient available
memory causes cmcld to abort.
Resolution:
cmcld now retries the socket call if it fails due to
insufficient available memory.
4. Serviceguard automatically plumbs standby network
interfaces for IPv6 use, even when IPv6 is not being
used in the cluster configuration.
Resolution:
Serviceguard does not plumb any network interfaces for
inet6 (IPv6) when IPv6 is not being used within the
cluster.
5. The SIGPIPE signal was being set to default action
by the psmmon resource monitor. So, when a package
with an ems resource, using psmmon, is configured
in the cluster, the psmmon died when it tried to
re-connect with cmcld, when cmcld went down and came
up. This caused the package to not come up.
Resolution:
Fixed the libsgcl used by the ems framework to handle
SIGPIPE appropriately when connecting to cmcld.
6. UDP messages were not marked as invalid even if there
were invalid values for length and offset fields in the
message, causing cmclconfd to exit without receiving
the message and/or cmviewcl to spin indefinitely. In
the cmclconfd case the message hence remains in the
inetd socket buffer causing inetd to spawn another
cmclconfd server. This is repeated until it reaches 40
servers in 60 seconds when it terminates the service
and only reinstates the service again after 10 minutes.
Resolution:
Mark the message as invalid if the length and offset
fields in the message contained improper values.
PHSS_33836:
1. When a node is deleted through a cmapplyconf to remove
the node from an online cluster, asymmetrical subnet
information does not get properly removed in the
cluster database. The subnet objects remain but they
are orphaned with no associated node objects. This
problem only occurs when the node is re-added to the
cluster and then restarted. A network partition error
occurs when the node tries to join the running cluster
through cmrunnode. When the node is re-started through
SG Manager cmomd aborts with a signal 6 error.
Resolution:
During the cmapplyconf, when nodes are being removed
from the cluster, delete any subnet objects related to
the deleted node that exist in the old cluster's cluster
database but not in the new cluster. This removes
subnet objects that are configured only on the node
that is being removed.
Action:
If there is existing orphaned subnet information
in the cluster database, cmrunnode will generate one of
the following error messages.
"Detected a partition of IPv6 subnet <ipaddress>" or
"Detected a partition of IPv4 subnet <ipaddress>"
Also, starting a node through SG Manager will generate
this same log message. The only way to remove this
information is to delete and re-add the node using the
cmapplyconf command.
2. cmcld has not been updated to handle the drivers for
the newer supported cluster lock interface cards such
as the Ultra160 and Ultra320 cards which use the c8xx
and mpt drivers respectively.
Resolution:
Update cmcld to have the appropriate values set for
the c8xx and mpt drivers.
3. Under rare circumstances when the timer loop thread is
stuck (not holding cm_lock) or the system clock is not
advancing, cmcld threads will not be scheduled. This
prevents cmcld timeout and prevents the safety timer
being updated resulting in all nodes being TOC'd.
Resolution:
Enhanced the code to check the time stamps of received
heartbeat messages to ensure clock is advancing, rather
than using the heartbeat sequence number. cmcld will
abort if it detects the system clock is not advancing
for a set period of time resulting in a failure of the
single errant node rather than the entire cluster.
4. Too many "Unable to stat /etc/cmcluster/cmclconfig,
No such file or directory" messages fill up syslog.
This occurs every time a UDP probe comes in and the
cluster is not configured on the system.
Resolution:
The message is changed to level 1 and LOG_INTERNAL.
5. A small timing window exists that can cause the
configuration daemon (cmclconfd) to go into an
infinite loop on a node that is being configured
out of the cluster.
Resolution:
Reset the variable controlling the loop exit
when the configuration file size goes to zero.
6. A cmhaltnode -f causes node switching and global
switching for the packages running on the halting node
to be disabled before the packages are actually halted.
Once the packages are halted, global switching is
re-enabled so that the packages can start on an adoptive
node. If there were failures when halting any package,
global switching was not re-enabled for any packages, so
they were not started on an adoptive node.
Resolution:
When a node is halted, global switching of its packages
is no longer disabled, so that successfully halted
packages and node fail fast packages can move over to an
adoptive node even if the halting node is TOC'd.
7. The same test_return value was used for both starting
and stopping the NFS script. When the function failed
the same error message was logged indicating that a
start failure had occurred irrespective of whether
this was a start or stop.
Resolution:
Different test_return values are used for starting and
stopping the NFS script.
8. The documentation in the package control script states
that the HA NFS script should be named "ha_nfs.sh"
instead of the correct name "hanfs.sh".
Resolution:
Changed the documentation to reflect the correct
name "hanfs.sh".
PHSS_33834:
1. Changing the activation mode of Volume Group (VG) from
shared mode to exclusive mode or vice versa using
vgchange, was not supported when the Volume Group is
already active.
Resolution:
Code is added to support reactivation of VGs from
shared mode to exclusive mode without having to
deactivate the VG in between. In addition to this
ServiceGuard patch (or a later one), the user needs to
have the LVM commands and kernel patches. The relevant
patch ids are PHCO_33310 and PHKL_33390 or their
superseding patches. If the LVM and kernel patches are
not installed, this enhancement is disabled. There is a
new option added to the vgchange command. Refer to the
white paper at docs.hp.com under "Core HP-UX" -> 11iv2
or 11iv1, in LVM Volume Manager, and look for "SLVM
Online Volume Group Reconfiguration whitepaper".
2. The code to verify the replies was using an invalid
array index.
Resolution:
The code was modified to use the correct array index.
3. After cmapplyconf has removed the node from cluster,
sometimes at the end of transaction completion,
Serviceguard still tries to contact the online node.
This is where the problem arises. Since the node is
no longer a part of the cluster, we will get
permission denied messages, unless the node is in
the Role Based Access list.
Resolution:
cmapplyconf has been changed so that it will not allow
the command to be run from the node that is being
deleted from the cluster.
4. While it is not supported to do an online change for
the cluster lock device, cmapplyconf neither logs an
error message nor does it update the cluster lock if
the cluster lock disk is changed online to a disk
belonging to the same volume group. The command should
report an error since changing the cluster lock online
is not allowed.
Resolution:
cmapplyconf has been changed so that changing the
cluster lock online fails and generates an error
message even if the new disk is in the same volume group.
5. cmclconfd used the block device to read data from the
disks. Closing a block device can take a very long
time, because cmclconfd uses the block device file to
examine each device. This means the kernel must spend
unnecessary CPU cycles examining the buffer cache when
cmclconfd closes the device and also cmclconfd may
block other processes which want to perform file system
operations because the block device close holds a high
level file system semaphore. So the performance with
many disks is slow.
Resolution:
Changed cmclconfd to read from the raw devices.
6. This problem can happen during local lan failover if
there are no package IPs on the subnets being switched.
If a cluster reformation happens, or the polling
interval expires, just after the last stationary/HB IP
has been switched, then the update which triggers the
hpmcSGLocalSwitch trap to be sent is skipped.
Resolution:
Perform the update right after the last IP is switched.
7. The stdout of cmquerycl does not allow for the
maximum length of IPv6 subnet information, so it leaves
no space between the subnet name and its corresponding
network interface when the IPv6 address is more than
19 characters.
Resolution:
Allocate maximum space to print the IPv6 subnet.
8. The usage lines for unsupported commands are more clear
about not supporting customer use.
Resolution:
Configuration daemon checks for the commands being
incorrectly used.
PHSS_32733:
1. cmclconfd had no way of determining if the "identd
disabled" log message had already been logged. This
resulted in the message being logged each time
cmclconfd was executed.
Resolution:
Add a timestamp file to limit logging this message
to specific events including first execution since
boot and first execution since the node joined the
cluster.
2. Subagent might miss a notification from Serviceguard
because the node shuts down before the event loop gets
around to handling the subagent api events.
Resolution:
Drain all the events in the subagent API
event queue before cmcld exits when a node halts.
3. After getting the disk device hardware path from the
system, cmclconfd copies it into its own data structure,
which has a size of only 30 characters. So if the
system returns more than 30 characters for the disk
device, cmclconfd can run into problems.
Resolution:
The cmclconfd daemon will copy only the last 29
characters if the system returns more than 30 characters
for the disk hardware path. In this case the disk
hardware path will also start with a * character.
4. When the volume group is removed from the cluster
configuration file and when the user changes
configuration there will be a log in syslog that the
volume group has been cleared from the cluster.
There is a need for the proper warning message
on the console as well as in syslog file.
Resolution:
Log messages are added to display on the console and
the messages logged to syslog file have been modified.
5. The cmcheckconf and cmapplyconf commands do not
detect that there is no peer interface for this
lan on the same bridged network, thus allowing the
interface to be included in the Configuration
Database, which in turn causes cmcld to misbehave.
Resolution:
The cmcheckconf and cmapplyconf commands will fail if
a similar situation exists; thus preventing the cmcld
daemon from aborting unnecessarily.
6. The cmscancl command uses swlist to determine the
version of Serviceguard and Cluster Object Manager
loaded on the server. This command cannot account
for the OE bundles, hence, cannot properly report the
version of Serviceguard loaded.
Resolution:
Use the what command on cmcld and cmomd, and grep for
Date and Ver respectively, instead of swlist.
7. cmcld no longer discontinues connecting to nodes when a
new election is started. This assertion should have
been removed when that change was made. It no longer
applies.
Resolution:
Code changed to remove the assertion.
8. During log switch, cmgmsd doesn't close the previous
file handle after creating a new one.
Resolution:
Use a new global variable global_flogh to keep the file
handle and call cl_flog_destroy to close the old file
handle before creating a new file handle when log
switch happens.
9. cmcld may core dump upon encountering a corrupted
DLPI packet. This can also happen when the cluster
configuration parameter, NETWORK_FAILURE_DETECTION
is set to INONLY_OR_INOUT and a node has been deleted
from the cluster configuration while the cluster is
online (and the cluster has not been halted and
restarted in the online configuration change).
Serviceguard may try to poll with the network
interface on the deleted node, causing this problem.
We end up dereferencing a NULL pointer thus causing
the core.
Resolution:
Do not dereference the NULL pointer and do not poll
with a node that gets deleted from the cluster.
10. If a node joins through cmcld -j and an online change
adds a new package and starts it, the coordinator
sends a status update to the joining node. This causes
an assertion failure and cmcld core because the joining
node does not know about the new package.
Resolution:
Ignore the status update message.
11. Because of an omission while implementing quorum
server, a valid election state was missed.
Resolution:
Added handling for the valid election state.
12. When cmclconfd finds out a message is invalid after
peeking at the message header, it exits without
receiving the message. The message hence remains in
the inetd socket buffer causing inetd to spawn
another cmclconfd server. This is repeated until
it reaches 40 servers in 60 seconds when it terminates
the service and only reinstates the service again after
10 minutes.
Resolution:
Receive the message even if it is invalid, then
discard it.
13. The status database is updated with the local switch
before the switch actually happens. This triggers the
subagent to send out the hpmcSGLocalSwitch trap. Since
the trap destination is set to the IP address associated
with the interface that failed, the trap will get lost
because this IP address has not been switched over yet.
Resolution:
Update status database of a subnet switch only when
its stationary address has been moved to the standby
lan card. Update status database of a lan card
fail-over only when all of its stationary addresses
have been failed over to the standby lan card.
14. When printing a message to notify that Serviceguard is
trying to switch away from a card that is experiencing
hard errors to a less sick one that is just
experiencing ping fail problem, cmcld dereferences a
null pointer.
Resolution:
Get the data through another structure which is not
null.
15. When we re-assign a poller, we should make sure that
we check the interface marked as poller ONLY ONCE and
if there is a problem with it re-assign it. However if
the re-assignment yielded the same interface again, we
should not invoke ns_check_intf() on it again (which
was the bug) since we have not yet polled and thus
wait for the next polling interval. This bug
of "checking the same interface again" becomes an
issue when the polling interval is more than a
particular LAN interface driver's time to detect the
failure. (For example: In the case of an Ethernet
interface it is 12 seconds.) This means that a loss
of > 1 inbound/outbound packet (12 seconds / 30
seconds) would be treated as a card failure. Invoking
ns_check_intf() again on the card caused the missed
inbound/outbound packets to be incremented again
without having even polled thus incorrectly marking
the card down.
Resolution:
In the case of an Ethernet bridged network, the wait
time for failure detection of a LAN interface is 12
seconds. When the network polling interval is more than
12 seconds, the number of polls required to detect a
failure is one. When there were no updates to the
inbound and outbound static data for the polling
interval, the network monitor thread used to mark the
interface as down without sending the poll packets.
Before checking the interface, there is now a check to
see whether this is the last interface which was used
for checking the state.
16. When SG is configured with CVM version 3.5 and a
large number of disk groups, SG packages are started
before all the disk groups get enabled. The SG packages
are not able to access the disks associated with them.
Resolution:
Before executing the function
customer_defined_run_commands() in the package
control script, make sure that all the disk groups
associated with the package are enabled.
For more details about which packages are affected
and how to fix this problem please see the Special
Installation Instructions section.
17. There was a small timing window where the Package
Manager was still reconfiguring when a package command
got accepted. This caused an assertion to be hit.
Resolution:
Added additional checks to prevent Package Manager
from accepting new commands when it is in
reconfiguring state.
18. cmviewcl reported "unknown" without checking whether
it retrieved the package status or not.
Resolution:
Fixed the code to report the package status as
"unknown" only when we couldn't retrieve the package
status.
19. Data from the USER_NAME field is not
validated when cmapplyconf is run,
although an error is reported in syslog
by cmcld if the USER_NAME is invalid.
When an invalid username is corrected
and cmapplyconf is re-executed, cmcld aborts
due to the invalid data in the CDB.
Resolution:
Appropriate checks are now added to be
consistent with the checking for CLUSTER NAME,
PACKAGE NAME, etc.
PHSS_32731:
1. IP addresses in the cmclnodelist file are treated
incorrectly. Also, the mechanism that searched the
ACPs for the privilege based on the hostname would only
search until it found a hostname that had any level of
privilege instead of searching for the hostname with
the highest privilege.
Resolution:
Code corrected to now compare the ip address of the
sender to any ip addresses listed in the cmclnodelist
file. Also, the code was changed to keep searching the
ACPs until the hostname with the highest privilege was
found.
2. If the node where the QS is running TOCs,
it may take up to 60 seconds for SG to find
out that the QS is not available. If the QS
was running as a package, then the QS package
may failover and be available faster than
the 60 seconds interval. This change helps
SG to detect the QS failure earlier
and reconnect to the new QS.
Resolution:
Reduced the minimum value for the QS polling
interval to 10 seconds.
3. The Kernel portion of LVM had a defect that made SGeRAC
have a limitation on the Online node reconfiguration
operation. The limitation is that the cluster should
not have any VGs activated in shared mode when the
online node reconfiguration is to be performed.
Resolution:
The problem in the kernel portion of the LVM is fixed
in the patch PHKL_31216. The checks to prevent online
node reconfiguration will not be done if this kernel
patch is installed.
4. With Role-Based access only root users from within
the cluster can perform configuration operations. Once
a cluster is configured on a node, cmapplyconf issued
from the node which is not configured in the cluster
is prohibited from making any configuration changes.
The command fails towards the end while writing a
configuration database to the disk of the node due to
the introduction of Role-Based access.
Resolution:
The cmapplyconf command detects remote configuration
attempts and exits out with a user-friendly error
message such as:
cmapplyconf : Command can not be executed remotely.
Addition of node abc to cluster xyz is possible
only when command is issued from a node which
is configured to be a part of the cluster.
Please login to one of the cluster member nodes of
cluster xyz and reissue the command.
5. The /etc/cmcluster/cmknowncmds file was shipped as an
empty file giving no indication of its importance and
the requirement not to delete it.
Resolution:
A comment was added to this file to instruct the
customer not to remove the file.
6. Programs cannot use select() call with an FD_SETSIZE
greater than 2048 without special compilation options.
Serviceguard code was using the default size of 2048
preventing daemons selecting with more than 2048
socket file descriptors.
Resolution:
The Serviceguard part of the resolution is to recompile
the source with appropriate compiler flags to be able
to support larger number of file descriptors. This
change is contained in this Serviceguard patch, and
this patch can be applied on its own without harm.
To take full advantage of this change on SG eRAC, a
Serviceguard eRAC patch is also required. The
Serviceguard eRAC patch will have this described in
its patch documentation, which will refer to this
SG Patch ID.
7. Serviceguard does not currently support IP over
InfiniBand in a cluster as heartbeat or data network.
Resolution:
Code is added to support IP over InfiniBand network
interface in a cluster as heartbeat and data networks
ON HP-UX 11.23 ONLY. This code change is noted here
in this patch for HP-UX 11.11 only for completeness,
but this does not indicate introduction of support of
InfiniBand on HP-UX 11.11. The corresponding
SG 11.16.00 Patch for HP-UX 11.23 (PHSS_32740) does
introduce Infiniband Support on HP-UX 11.23.
PHSS_31075:
1. Model 10, Model 20 and Model 30 disk arrays require a
different algorithm to probe them to work around a
firmware defect. Recently when the algorithm to probe
all disks was enhanced, some of this probing code was
omitted resulting in these disks not being recognized
correctly by Serviceguard.
Resolution:
Missing disk probing code added.
2. The OpenView Operations (OVO) library
/opt/OV/lib/libopccv.sl requires libpthread.1 to be
able to load. However, since the OPC functionality is
no longer supported, ServiceGuard A.11.14 patches and
later are no longer linked to this library file. This
leads to a loading problem of the OVO library.
Resolution:
Remove OPC functionality altogether.
3. Allowing duplicate roles to be defined can cause
configuration commands such as cmcheckconf/cmapplyconf
to fail.
Resolution:
The configuration commands will allow desirable
combinations of the duplicate roles if all the nodes in
the cluster have this patch installed.
4. The package start/halt fails because the ioctls to
add/delete IPs are not retried upon encountering the
transient ENOMEM error.
Resolution:
The add/delete IP ioctls will be retried upon
encountering the transient ENOMEM error thus increasing
the resiliency towards this error and not causing the
package start/halt to fail.
5. In the case of multiple heartbeat connections if a local
switch is taking place on one network or if one network
is under heavy load, the heartbeat messages might be
slower on that network. Serviceguard handles this by
pausing other connections, allowing the slow connections
to catch up. If a cluster reformation takes place during
this time the paused connections are not cleaned up
correctly and an assertion is hit.
Resolution:
Clean up paused connections correctly.
6. Sometimes on very fast machines, when multiple
cmhaltnode commands were issued, the commandServer
thread and the main event loop thread could be waiting
on each other to release a mutex. At this time, cmcld
is deadlocked and stops doing any useful work,
triggering infinite cluster reformations.
Resolution:
Clean up the code to release the mutex appropriately to
avoid this deadlock situation.
7. Serviceguard commands such as cmquerycl, cmcheckconf
and cmapplyconf are not resilient against IPv6 UDP
packet drops, resulting in the loss of network probe
messages during network topology discovery.
Resolution:
Internally re-send the IPv6 UDP message(s) if they were
previously dropped.
8. The local switch lan names may be incorrect because
the string comparison matches a longer lan name
(e.g. lan11) with a shorter lan name (e.g. lan1).
Resolution:
First compare the string lengths of the lan names, and
if they match then do a "strncmp" of the actual lan name
strings.
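The length-then-compare check described above can be
sketched as follows. The prefix-style function is only a
guess at how the original mismatch could arise; neither
function is taken from the Serviceguard source.

```c
#include <assert.h>
#include <string.h>

/* Guess at the faulty style of check: compares only the first
 * strlen(wanted) characters, so "lan1" also matches "lan11". */
int lan_name_prefix_match(const char *wanted, const char *candidate)
{
    return strncmp(wanted, candidate, strlen(wanted)) == 0;
}

/* Check as described in the resolution: the lengths must be equal
 * before the characters are compared. Returns 1 on a match. */
int lan_name_equal(const char *a, const char *b)
{
    size_t len;

    assert(a != NULL && b != NULL);

    len = strlen(a);
    if (len != strlen(b))
        return 0;
    return strncmp(a, b, len) == 0;
}
```

With equal lengths established first, lan1 and lan11 can no
longer be confused.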
9. The configuration database has a limit on the size
of a transaction that it can handle when the cluster
is up.
Resolution:
Increase this limit to 4 times its current value.
10. On higher end systems, race conditions can happen
when the subagent tries to query the cluster
health when cmcld is busy. The looping mechanism
needs some type of time interval to wait between
retries.
Resolution:
Added a 0.5 second time interval between query retries.
11. If raw logical volume devices are still being accessed
after applications have been halted during package
shutdown, the deactivation of the volume groups will
fail leading to a package halt failure.
Resolution:
An enhancement to the package control script is made
in which two parameters are added.
The KILL_PROCESSES_ACCESSING_RAW_DEVICES parameter has
YES or NO as values. When user indicates YES, any
process that is accessing the raw devices that belong
to the volume groups specified in the package control
script at package halt time will be killed. When user
indicates NO, such processes will be left intact.
The DEACTIVATION_RETRY_COUNT parameter allows users to
specify how many times the script attempts to deactivate
an LVM volume group or CVM disk group after a
deactivation failure.
12. Processes spawned by cmsrvassistd incorporate incorrect
umask values.
Resolution:
Control script file permissions are now correct.
13. If a VxVM disk group is currently imported on a node in
the cluster, such as after disk group administration or
from failing to cleanup after a package halt failure,
and then a user attempts to start up a package which
uses that VxVM disk group, that package will fail to
start with an appropriate error. However, during this
failure, due to a problem in the package control
script, the disk group can get forcefully imported and
deported, and the File System can get mounted and
umounted triggering File System synchronization and
clean up, potentially corrupting the File System
(which is still active on the other node). This
forceful import/deport also clears the hostid
protection from disk groups, so that any subsequent
attempt to start that package without first cleaning
up the VxVM disk group(s) (as indicated in the package
log file during the first package start failure) would
succeed, activating VxVM disk groups used by that
package on multiple nodes.
Resolution:
The package control script template has been
corrected so that it will exit after detecting a
disk group is possibly active on another node rather
than removing hostid protection from a VxVM disk group,
and mounting/unmounting it.
14. The what (1m) command parses the first line of a
package control script to create the output. As the
format of the first line in the package control script
is not correct, what (1m) fails to create output.
Resolution:
Correct the format in the package control script.
15. The cmgetconf doesn't process the access control
policies after conversion.
Resolution:
The command now checks the Serviceguard version
to determine whether access control policies need
to be processed.
16. cmviewcl/cmgetconf do not process lock LUN information
on an HP-UX node when it comes from a Linux cluster.
Resolution:
Lock LUN information is now processed on an HP-UX node.
17. An extraneous message may be seen in syslog because
of a fix made in SGeRAC for defect JAGaf42435, which
was delivered in patch PHSS_31079.
Resolution:
The message is now logged at a higher log level
so that it is not seen by default.
18. An unsigned long value was converted to unsigned int,
causing an overflow.
Resolution:
Keep the variable's type as unsigned long to avoid
data loss.
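The data loss described in this item can be shown with a small
sketch. It assumes a 64-bit unsigned long and a 32-bit unsigned
int, and uses the fixed-width ctypes types to imitate the C
conversion; the helper name is hypothetical and is not taken from
the Serviceguard source.

```python
import ctypes


def narrow_to_uint32(value: int) -> int:
    """Imitate assigning a wider unsigned value to a 32-bit unsigned int.

    ctypes performs no overflow checking, so the high-order bits are
    silently discarded, as they are in the C conversion.
    """
    return ctypes.c_uint32(value).value


# A value that needs more than 32 bits.
big = 2**32 + 5

# Kept in a 64-bit unsigned type, the value survives intact.
kept = ctypes.c_uint64(big).value

# Narrowed to 32 bits, only the low-order bits remain.
narrowed = narrow_to_uint32(big)
```

Here `kept` still equals `big`, while `narrowed` holds only the
low 32 bits of the original value.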
19. When no buffer space is available, the LVM daemon
aborts, causing the cluster daemon to also abort,
leading to a TOC.
Resolution:
Because "no buffer space available" is a transient
error, the LVM daemon can safely ignore it after
logging the error in syslog.
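The approach of logging and ignoring a transient "no buffer space
available" error can be sketched as follows. The notes do not say
which operation fails, so the `send` callable and the function
name are hypothetical, and Python's logging module stands in for
syslog. Any other error is still raised.

```python
import errno
import logging

log = logging.getLogger("lvm_daemon_sketch")


def send_tolerating_enobufs(send, payload: bytes) -> bool:
    """Return True if the payload was sent, False if ENOBUFS occurred.

    `send` is a hypothetical callable that raises OSError on failure.
    ENOBUFS is treated as transient: it is logged and ignored rather
    than aborting the caller. All other errors propagate unchanged.
    """
    try:
        send(payload)
        return True
    except OSError as exc:
        if exc.errno == errno.ENOBUFS:
            log.warning("no buffer space available; ignoring: %s", exc)
            return False
        raise
```

Keeping the check limited to ENOBUFS means that genuine failures
are not masked by the workaround.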
20. cmclconfd does not handle the absence of any LVM volume
group correctly.
Resolution:
Initialize variables and avoid situations where the
code might try to allocate 0 bytes.
PHSS_31071:
1. This patch contains an enhancement that allows
Serviceguard to utilize the services of the identd
daemon. This patch requires a sendmail patch level
of PHNE_28810 or later and a Cluster Object Manager
patch level of PHSS_31073 or later.
Enhancement:
No (superseded patches contained enhancements)
PHSS_33834:
This patch delivers an enhancement that allows the
activation mode of a volume group (VG) to be changed
from shared mode to exclusive mode, or vice versa,
using vgchange, without requiring the volume group to
be deactivated first.
PHSS_32733:
This patch delivers an enhancement to put
additional log messages in syslog notifying
users of node(s) that have left the cluster
as follows:
"The following node(s) [name] left the cluster."
PHSS_32731:
This is an enhancement to support the online node
reconfiguration feature in an SGeRAC cluster even when
there are LVM VGs activated in shared mode. This
is supported when the kernel LVM code fix
mentioned above in the "Resolution" text is
installed.
PHSS_31075:
This patch delivers new functionality that
allows the cmapplyconf command to handle
configuration operations 4 times larger than
before.
An enhancement to the package control script
is made in this patch in which two parameters are
added:
The KILL_PROCESSES_ACCESSING_RAW_DEVICES parameter has
YES or NO as val |