Patch Name: PHSS_36636
Patch Description: s700_800 11.23 Serviceguard A.11.17.00
Creation Date: 07/08/22
Post Date: 07/08/30
Hardware Platforms - OS Releases:
s700: 11.23
s800: 11.23
Symptoms:
PHSS_36636:
1. Defect: JAGag34293 SR: 8606480162
Other applications on the network that use the
Serviceguard reserved network ports (hacl-cfg ports
5302/udp and 5302/tcp) can cause cmquerycl to fail with
misleading error messages. The cmcheckconf, cmviewcl,
cmgetconf, cmrunnode, cmapplyconf and cmruncl commands
are affected in the same way.
# cmquerycl
Unable to receive a datagram from the configuration
daemon (cmclconfd): No message of desired type
cmquerycl: Unable to find any configuration information
# cmapplyconf -v -C cluster.ascii
Checking cluster file: cluster.ascii
Checking nodes ... Done
Checking existing configuration ... Done
Node <node1> is refusing Serviceguard communication.
Please make sure that the proper access is
configured on node <node1> through either file-based
access (pre-A.11.16 version) or role-based access
version A.11.16 or higher) and/or that the host name
lookup on node <node1> resolves the IP address
correctly.
cmapplyconf: Failed to gather configuration information
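2. Defect: JAGag34599 SR: 8606480516
Switching back from the standby lan to the primary lan
can cause the lan failover message to be logged hundreds
of times in syslog. This typically happens when two
subnets, one with a package ip configured and the other
without, recover from failure at about the same time.
The following messages are logged in syslog:
cmcld: lan2 switched to lan1
(this message repeats 440 times)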
cmcld: lan2 switched to lan1
3. Defect: JAGag37580 SR: 8606484458
If cmsrvassistd terminates for any reason, a system TOC
occurs, but no information is provided to determine the
cause of the problem. Messages like the following are
logged in syslog:
cmcld: Service assistant daemon died unexpectedly!
It may be due to a pending reboot or panic.
cmcld: Exiting with status 1.
4. Defect: JAGag41341 SR: 8606488693
If HEARTBEAT_INTERVAL is not set in the cluster
configuration file, cmapplyconf succeeds but cmruncl
dumps core.
Running cmviewconf shows that the heartbeat interval is
set to 0.
# cmviewconf
Cluster information:
cluster name: abc
heartbeat interval: 0.00 (seconds)
5. Defect: JAGag42544 SR: 8606490065
On an HP-UX 11.31 system with VxFS 4.1 or later
installed, cmclconfd displays the following false syslog
message each time the logical volume groups are queried:
cmclconfd: Cannot recognize version 6 or later VxFS file
systems. Make sure that libc patch PHCO_32488 or later
is installed if such file systems are used.
This defect does not occur with Serviceguard on HP-UX
11.23 systems.
6. Defect: JAGag36170 SR: 8606482272
Very rarely, cmcld can dump core with a segmentation
violation. The last 2 frames in the stack trace show the
following functions:
#0 0x175b60 in cl_list_remove+0xbc ()
#1 0x161024 in st_delete_callback_private+0x548 ()
7. Defect: JAGag37994 SR: 8606484939
cmviewcl displays the status of system multinode
packages as "starting" when a cmhaltnode or cmhaltcl
command is executing at the same time.
8. Defect: JAGag42785 SR: 8606490333
Serviceguard commands may fail because the udp cmclconfd
daemon dumps core with a SIGSEGV. Below is an example
of the error message displayed when this happens:
# cmcheckconf -v -C ./cmclconf.ascii
Checking cluster file: ./cmclconf.ascii
Checking nodes ... Done
Checking existing configuration ... Done
Warning: Can not find configuration for cluster
<cluster_name>
Error: Unable to establish communication to node
<node_name>: 19
cmcheckconf : Failed to gather configuration
information
Messages like the following are logged in syslog:
inetd: hacl-cfg/udp: Died on signal 11
The stack trace looks like this:
#0 0x60000000c0320300:0 in
T_19_f81_cl___doprnt_main+0x99b0 ()
from /usr/lib/hpux32/libc.so.1
#1 0x60000000c030e570:0 in _doprnt+0x30 ()
from /usr/lib/hpux32/libc.so.1
#2 0x60000000c03341c0:0 in snprintf+0x140 ()
from /usr/lib/hpux32/libc.so.1
#3 0x432b920:0 in add_alias_ip_addrs+0xe0 ()
#4 0x432e110:0 in
sg_sec_check_filebased_security+0x10b0 ()
#5 0x432fa40:0 in sg_get_security_privilege+0x220 ()
#6 0x40d0c20:0 in get_udp_message+0x850 ()
#7 0x40d37f0:0 in main+0x27a0 ()
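9. Defect: JAGag43673 SR: 8606491388
The import of VxVM disk groups in a Metrocluster/SRDF
environment can fail during package startup. This
typically happens when nodes on the SRDF R1 side are
rebooted or restarted while the RDF device groups on the
SRDF R2 side are being reconfigured. During RDF
reconfiguration, the devices on the R1 side can be in a
state that causes the VxVM disk scan run at system boot
to mark them as "offline". When a package on the R1 side
later attempts to import the VxVM disk groups, the import
fails.
10. Defect: JAGag42796 SR: 8606490345
When cluster services are halted on a node, if external
IP addresses are configured, cmcld continues to consume
memory after the node has left the cluster until it
reaches the kernel limit, at which point it dumps core.
As a result, the external IP addresses are not removed
and must be removed manually after cmcld dumps core.
The stack traces of the resulting core files vary but
often include the function
"add_netsen_shutdown_links_to_chain".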
#0 0x60000000c035d830:0 in _brk+0x30 ()
from /usr/lib/hpux32/libc.so.1
#1 0x60000000c036cb00:0 in sbrk+0xf0 ()
from /usr/lib/hpux32/libc.so.1
#2 0x60000000c0231980:0 in malloc_sbrk+0x280 ()
from /usr/lib/hpux32/libc.so.1
#3 0x60000000c0232590:0 in grow_arena+0x210 ()
from /usr/lib/hpux32/libc.so.1
#4 0x60000000c022fa40:0 in real_malloc+0x920 ()
from /usr/lib/hpux32/libc.so.1
#5 0x60000000c022ef20:0 in _malloc+0x800 ()
from /usr/lib/hpux32/libc.so.1
#6 0x60000000c023c950:0 in malloc+0x140 ()
from /usr/lib/hpux32/libc.so.1
#7 0x41c2ae0:0 in add_netsen_shutdown_links_to_chain
() at netsen/ns_shutdown_chain.c:222
#8 0x41c3540:0 in ns_start_shutdown_chain () at
netsen/ns_shutdown_chain.c:342
#9 0x41c3880:0 in ns_shutdown () at
netsen/ns_shutdown_chain.c:372
#10 0x42bc760:0 in cl_chain_link_done () at
utils/cl_chain.c:121
#11 0x436b690:0 in cm_shutdown_event_handler ()
at cm/utils.c:708
#12 0x42c18c0:0 in cl_event_loop () at
utils/cl_event.c:460
#13 0x60000000c00c7420:0 in
__pthread_bound_body+0x170 ()
from /usr/lib/hpux32/libpthread.so.1
11. Defect: JAGag44706 SR: 8606492535
cmcld frequently logs the following message when many
packages are configured in the cluster:
cmcld: Unable to set socket buffer size to 360448
bytes (No buffer space available), continuing anyway.
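These messages are not an indication of a failure, as
Serviceguard properly handles this situation.
12. Defect: JAGag43289 SR: 8606490922
"cmviewcl -f line" always displays os_status as "unknown"
for remote nodes in the cluster.
13. Defect: JAGag45533 SR: 8606493360
cmviewconf displays the service fail fast flag as
"disabled" even when the flag is enabled.
14. Defect: JAGag45718 SR: 8606493785
In the cluster ASCII configuration file, if the quorum
server hostname, QS_HOST, is specified with an invalid
backslash character ("\"), cmapplyconf dumps core.
The stack trace shows the following functions: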
0x60000000c0345510:0 in kill+0x30 ()
from /usr/lib/hpux32/libc.so.1
#1 0x60000000c023bd50:0 in raise+0x30 ()
from /usr/lib/hpux32/libc.so.1
#2 0x60000000c02ff250:0 in abort+0x190 ()
from /usr/lib/hpux32/libc.so.1
#3 0x40a7510:0 in cdb_db_commit+0x890 ()
#4 0x40af270:0 in cdb_external_access+0x890 ()
#5 0x40bc660:0 in cl_config_commit_transaction+0x1560()
#6 0x4181520:0 in cf_configure_cluster+0x2ab0 ()
#7 0x4133b10:0 in config_main+0x5480 ()
#8 0x4146590:0 in main+0x900 ()
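15. Defect: JAGag41937 SR: 8606489376
A node that experiences multiple hangs during cluster
reformation can cause both the hung node and the
candidate for coordinator to die when the safety timer
expires.
16. Defect: JAGag46086 SR: 8606494153
cmapplyconf displays an inappropriate error message when
multiple QS_HOST entries are specified.
cmquerycl may dump core if the -q option is specified as
the last parameter of the cmquerycl command.
The "cmviewcl -v -f line" command displays the quorum
server ip address in an invalid format:
quorum_server:<node_name>|ip_address:192.76.1.2|name=192.76.1.2
It should be displayed as follows: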
quorum_server:<node_name>|ip_address=192.76.1.2|name=192.76.1.2
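17. Defect: JAGag46092 SR: 8606494159
The cmapplyconf command appears to accept changes to the
QS_POLLING_INTERVAL and QS_TIMEOUT_EXTENSION values
while the cluster is running, but the values are not
actually changed in the cluster configuration.
cmapplyconf should prevent these parameters from being
modified while the cluster is running.
18. Defect: JAGag27798 SR: 8606473093
When a halted node cannot communicate with the other
nodes of the cluster, cmviewcl on the halted node
displays the package STATUS and STATE as "down" and
"halted" respectively, as shown below. However, the
packages are running normally on the other nodes, so the
package STATE should be displayed as UNKNOWN.
When node2 is halted and cannot communicate with the
other cluster nodes, cmviewcl on that node returns the
following: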
# cmviewcl
CLUSTER STATUS
cluster1 unknown
NODE STATUS STATE
node1 unknown unknown
node2 down unknown
UNOWNED_PACKAGES
PACKAGE STATUS STATE AUTO_RUN NODE
pkg1 down halted enabled unowned
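19. Defect: JAGag38581 SR: 8606485608
Non-root users with monitor or admin access privileges
can view all configured access control policies for the
cluster.
Defect Description:
PHSS_36636:
1. Defect: JAGag34293 SR: 8606480162
Serviceguard exited with an error message whenever it
received an invalid message.
Resolution:
The code has been modified to ignore invalid messages.
2. Defect: JAGag34599 SR: 8606480516
A shared variable was used across different bridged
networks, so when two subnets recovered at about the
same time, the shared variable ended up in an
inconsistent state. As a result, the message above was
logged for each bridged network.
Resolution:
The code has been modified so that the variable is no
longer shared across different bridged networks.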
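3. Defect: JAGag37580 SR: 8606484458
In A.11.17, the exit status code of cmsrvassistd was
discarded in all cases.
Resolution:
Code has been added to handle the exit status of
cmsrvassistd.
4. Defect: JAGag41341 SR: 8606488693
The case where HEARTBEAT_INTERVAL is not set in the
cluster configuration file was not anticipated, so the
default value was set to 0.
Resolution:
The code has been modified to handle this case properly.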
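One plausible way to handle a missing HEARTBEAT_INTERVAL is
to substitute a non-zero default at parse time and to reject
non-positive values. The Python sketch below illustrates the
idea only. It is not the Serviceguard implementation: the
parser, the function names and the default of 1000000 are
assumptions made for this example.

```python
# Hypothetical default; an assumption for illustration only.
ASSUMED_DEFAULT_HEARTBEAT_INTERVAL = 1_000_000


def parse_cluster_ascii(text):
    """Parse 'KEY value' lines, skipping blank lines and '#' comments."""
    params = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            params[parts[0]] = parts[1].strip()
    return params


def heartbeat_interval(params):
    """Return a usable interval, never the implicit 0 of a missing entry."""
    value = params.get("HEARTBEAT_INTERVAL")
    if value is None:
        # Missing entry: fall back to the assumed default instead of 0.
        return ASSUMED_DEFAULT_HEARTBEAT_INTERVAL
    interval = int(value)
    if interval <= 0:
        raise ValueError("HEARTBEAT_INTERVAL must be greater than 0")
    return interval
```

Validating at configuration time means an unusable value is
caught when the configuration is applied, before any daemon
tries to use it.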
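5. Defect: JAGag42544 SR: 8606490065
The syslog message above is specific to HP-UX 11.23
systems and does not apply to HP-UX 11.31 systems.
However, the code did not check whether the OS version
running on the machine was HP-UX 11.23 before logging
the message.
Resolution:
The code has been modified so that the syslog message
above is not displayed on HP-UX 11.31 systems.
6. Defect: JAGag36170 SR: 8606482272
Serviceguard locks a mutex before updating the callback
structure, but it unlocked the mutex before removing the
callback from the list.
Resolution:
The code has been modified to unlock the mutex after the
callback has been removed from the list.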
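The locking order described above can be sketched in Python
as follows. This is an illustration of the general pattern,
not Serviceguard code; the CallbackRegistry class and its
method names are hypothetical.

```python
import threading


class CallbackRegistry:
    """Hypothetical registry whose list is only touched under its lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._callbacks = []

    def add(self, callback):
        with self._lock:
            self._callbacks.append(callback)

    def delete(self, callback):
        # The removal happens while the lock is still held. Releasing
        # the lock first would let another thread modify the list
        # between the update and the removal.
        with self._lock:
            if callback in self._callbacks:
                self._callbacks.remove(callback)
                return True
            return False

    def snapshot(self):
        with self._lock:
            return list(self._callbacks)
```

Keeping the removal inside the critical section closes the
kind of small timing window the symptom describes.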
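7. Defect: JAGag37994 SR: 8606484939
When a node or cluster was halted, the status of system
multinode packages was displayed as "starting".
Resolution:
The code has been modified to display the status of
system multinode packages as "changing" while cmhaltnode
or cmhaltcl is running.
8. Defect: JAGag42785 SR: 8606490333
While a hostent structure was in use, it was modified by
a subsequent gethost*() call.
Resolution:
The code has been modified to use a copy of the hostent
structure instead of the actual structure, to protect it
from subsequent gethost*() calls.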
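9. Defect: JAGag43673 SR: 8606491388
While disk operations that refresh data from the remote
side (R2) to the primary side (R1) are in progress, the
disks are briefly in a "Not Ready" state (hidden from the
host). If the primary node on R1 reboots during this time
and VxVM starts as part of the boot sequence, VxVM marks
the "Not Ready" disks as "offline" because it cannot
access them. As a result, the VxVM disk group could not
be imported and the package failed to start on failback.
Resolution:
A new parameter, "VXVM_DG_RETRY", which can be set to
"YES" or "NO", has been introduced in the Serviceguard
package control script. When this parameter is set to
"YES", "vxdisk scandisks" is run on the disks that belong
to the failed disk group.
10. Defect: JAGag42796 SR: 8606490345
Running cmhaltcl or cmhaltnode on a cluster configured
with a package that uses relocatable IP addresses on a
subnet outside the cluster configuration caused cmcld to
grow in size and dump core. The array elements were not
extended properly when the IP addresses were removed.
Resolution:
The code has been modified to extend the array elements
properly.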
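11. Defect: JAGag44706 SR: 8606492535
When the default buffer size of the unix domain socket
was insufficient, the log message was logged at the
default log level. However, cl_msg flow control takes
care of adjusting the size and sending the message.
Resolution:
The log level of the message has been raised and its log
category changed so that the message is not logged at the
default log level.
12. Defect: JAGag43289 SR: 8606490922
cmviewcl cannot obtain the os_status value unless it
probes the node, and it probes the node only when the
verbose option is specified. cmviewcl should therefore
not display os_status when the verbose option is not
specified.
Resolution:
cmviewcl has been modified to display os_status only when
the verbose option is specified.
13. Defect: JAGag45533 SR: 8606493360
When displaying the status of the service fail fast flag,
the cmviewconf command compared the flag using values in
different byte orders, so the wrong status was displayed.
Resolution:
The code has been modified to compare the values in the
correct byte order so that the correct status of the
service fail fast flag is obtained.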
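The effect of comparing values in different byte orders can
be reproduced with a short Python sketch. The flag value
0x0001, the 32-bit field width and the function name are
assumptions for illustration; the actual Serviceguard data
layout is not described in this notice.

```python
import struct

# Hypothetical flag bit; the real value is not given in this notice.
SERVICE_FAIL_FAST = 0x0001


def is_fail_fast_enabled(raw):
    """Decode a 4-byte network-order field before testing the flag."""
    (flags,) = struct.unpack("!I", raw)
    return bool(flags & SERVICE_FAIL_FAST)


# The flag as it would be stored in network (big-endian) byte order.
stored = struct.pack("!I", SERVICE_FAIL_FAST)

# Reading the same bytes in the opposite byte order moves the bit,
# so a direct comparison against SERVICE_FAIL_FAST fails.
misread = struct.unpack("<I", stored)[0]
```

Here misread & SERVICE_FAIL_FAST is zero, which is how an
enabled flag can end up being reported as "disabled".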
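14. Defect: JAGag45718 SR: 8606493785
In the cluster ASCII configuration file, if the QS_HOST
value was specified with an invalid backslash character
("\"), cmapplyconf dumped core.
Resolution:
The code has been modified to check the QS_HOST and
QS_ADDR values specified in the cluster configuration
file for invalid backslash characters and to display an
error message if any are found.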
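A check of this kind can be sketched in a few lines of
Python. The function name and the error text are
hypothetical; only the parameter names QS_HOST and QS_ADDR
come from this notice.

```python
def validate_qs_value(name, value):
    """Reject a QS_HOST or QS_ADDR value that contains a backslash."""
    if "\\" in value:
        raise ValueError(
            "%s: invalid backslash character in %r" % (name, value)
        )
    return value
```

Rejecting the value with an error message while the file is
being read keeps the invalid name from reaching the step that
commits the configuration.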
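15. Defect: JAGag41937 SR: 8606489376
The hung node died when its safety timer expired because
it could not update the timer. The candidate for
coordinator, which was waiting for a heartbeat from the
hung node before updating its own safety timer, died in
the same way when its safety timer expired.
Resolution:
The code has been modified so that a node that hangs is
forcibly terminated before it affects the other nodes in
the cluster.
16. Defect: JAGag46086 SR: 8606494153
cmapplyconf displayed an inappropriate error message when
multiple QS_HOST entries were specified.
cmquerycl did not increment the array index correctly
when reading the -q option from the command line.
"cmviewcl -v -f line" displayed the quorum server ip
address in an invalid format.
Resolution:
cmapplyconf has been modified to display an appropriate
error message.
cmquerycl has been modified to read the -q option from
the command line arguments correctly.
"cmviewcl -v -f line" has been modified to display
ip_addresses in the correct format.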
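The difference between ":" and "=" matters to any script
that parses the line output. The Python sketch below assumes
a simple grammar inferred from the example in this notice:
fields are separated by "|", "key:value" names a scope and
"key=value" is an attribute. The grammar and the function
name are assumptions, not a documented format specification.

```python
def parse_line(line):
    """Split one 'cmviewcl -f line' style record into scopes and attributes."""
    scopes, attrs = [], {}
    for field in line.strip().split("|"):
        eq = field.find("=")
        colon = field.find(":")
        if eq != -1 and (colon == -1 or eq < colon):
            # 'key=value' is an attribute of the current record.
            key, value = field.split("=", 1)
            attrs[key] = value
        elif colon != -1:
            # 'key:value' names a scope such as a node or server.
            key, value = field.split(":", 1)
            scopes.append((key, value))
        else:
            raise ValueError("unrecognized field: %r" % field)
    return scopes, attrs
```

With the corrected format, ip_address is parsed as an
attribute. With the old format it is parsed as an extra
scope, so a script looking up the ip_address attribute finds
nothing.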
17. Defect: JAGag46092 SR: 8606494159
cmapplyconf did not check for online changes to the
QS_POLLING_INTERVAL and QS_TIMEOUT_EXTENSION values.
Resolution:
cmapplyconf has been modified to prevent online changes
to QS_POLLING_INTERVAL and QS_TIMEOUT_EXTENSION. If an
online change is attempted, cmapplyconf exits with an
error.
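The check can be pictured as a comparison of the running and
proposed configurations. In this Python sketch the function
name, the dictionary representation and the error text are
hypothetical; only the two parameter names come from this
notice.

```python
# Parameters that may only be changed while the cluster is down.
OFFLINE_ONLY_PARAMS = ("QS_POLLING_INTERVAL", "QS_TIMEOUT_EXTENSION")


def check_quorum_changes(current, proposed, cluster_running):
    """Return the changed offline-only parameters, or fail if online."""
    changed = [
        name
        for name in OFFLINE_ONLY_PARAMS
        if current.get(name) != proposed.get(name)
    ]
    if changed and cluster_running:
        raise ValueError(
            "cannot modify %s while the cluster is running"
            % ", ".join(changed)
        )
    return changed
```

Failing outright avoids the earlier behavior, where the
command appeared to succeed but the values stayed the same.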
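18. Defect: JAGag27798 SR: 8606473093
cmviewcl displayed the STATUS and STATE of all unowned
packages as "down" and "halted" respectively by default,
without checking whether the cluster was reachable.
Resolution:
cmviewcl has been modified to display the package STATUS
as "unknown" when the cluster nodes are not reachable.
19. Defect: JAGag38581 SR: 8606485608
The commands that display cluster information showed all
configured access control policies to non-root users who
had permission to run the commands. This is not a problem
in itself, but it was decided to modify the commands so
that the policies are displayed only to users with the
same or a higher level of access.
Resolution:
The roles displayed now match the privilege level of the
user running the command, based on the user name and the
host from which the command is run.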
-----------------------------------------------------------------------------
Patch Name: PHSS_36636
Patch Description: s700_800 11.23 Serviceguard A.11.17.00
Creation Date: 07/08/22
Post Date: 07/08/30
Hardware Platforms - OS Releases:
s700: 11.23
s800: 11.23
Products:
Serviceguard A.11.17.00
Filesets:
Cluster-Monitor.CM-CORE,fr=A.11.17.00,fa=HP-UX_B.11.23_IA,v=HP
Package-CVM-CFS.CM-CVM-CFS,fr=A.11.17.00,fa=HP-UX_B.11.23_IA,v=HP
Package-Manager.CM-PKG,fr=A.11.17.00,fa=HP-UX_B.11.23_IA,v=HP
Cluster-Monitor.CM-CORE,fr=A.11.17.00,fa=HP-UX_B.11.23_PA,v=HP
Package-CVM-CFS.CM-CVM-CFS,fr=A.11.17.00,fa=HP-UX_B.11.23_PA,v=HP
Package-Manager.CM-PKG,fr=A.11.17.00,fa=HP-UX_B.11.23_PA,v=HP
Cluster-Monitor.CM-CORE-COM,fr=A.11.17.00,fa=HP-UX_B.11.23_IA/PA,v=HP
Package-CVM-CFS.CM-CVM-CFS-COM,fr=A.11.17.00,fa=HP-UX_B.11.23_IA/PA,v=HP
Package-Manager.CM-PKG-MAN,fr=A.11.17.00,fa=HP-UX_B.11.23_IA/PA,v=HP
Cluster-Monitor.CM-CORE-MAN,fr=A.11.17.00,fa=HP-UX_B.11.23_IA/PA,v=HP
Automatic Reboot?: No
Status: General Release
Critical:
Yes
PHSS_36636: ABORT HANG PANIC
If HEARTBEAT_INTERVAL is not set in cluster
configuration file cmapplyconf succeeds but cmruncl
dumps core.
There is a very small timing window when cmcld can dump
core with a segmentation violation. The last 2 frames
in the stack trace will show the following functions:
#0 0x175b60 in cl_list_remove+0xbc ()
#1 0x161024 in st_delete_callback_private+0x548 ()
Serviceguard commands may fail due to the udp cmclconfd
daemon core dumping with a SIGSEGV. Below is an example
of the error message for one such command when this
happens:
# cmcheckconf -v -C ./cmclconf.ascii
Checking cluster file: ./cmclconf.ascii
Checking nodes ... Done
Checking existing configuration ... Done
Warning: Can not find configuration for cluster
<cluster_name>
Error: Unable to establish communication to node
<node_name>: 19
cmcheckconf : Failed to gather configuration
information
The syslog would look like:
inetd: hacl-cfg/udp: Died on signal 11
The stack trace would look like:
#0 0x60000000c0320300:0 in
T_19_f81_cl___doprnt_main+0x99b0 () from
/usr/lib/hpux32/libc.so.1
#1 0x60000000c030e570:0 in _doprnt+0x30 () from
/usr/lib/hpux32/libc.so.1
#2 0x60000000c03341c0:0 in snprintf+0x140 ()
from /usr/lib/hpux32/libc.so.1
#3 0x432b920:0 in add_alias_ip_addrs+0xe0 ()
#4 0x432e110:0 in
sg_sec_check_filebased_security+0x10b0 ()
#5 0x432fa40:0 in sg_get_security_privilege+0x220 ()
#6 0x40d0c20:0 in get_udp_message+0x850 ()
#7 0x40d37f0:0 in main+0x27a0 ()
When the cluster services are halted on a node,
if external IP addresses are configured cmcld
will continue to consume memory after the node
has left the cluster until it reaches the kernel
limit at which point it will core dump. Due to this
it takes a long time to halt, since it will be trying
to clean up the ip resources which could not be
removed by cmmodnet during halt procedure.
The stack traces of the resulting core files can vary
but often include the function
"add_netsen_shutdown_links_to_chain".
#0 0x60000000c035d830:0 in _brk+0x30 ()
from /usr/lib/hpux32/libc.so.1
#1 0x60000000c036cb00:0 in sbrk+0xf0 ()
from /usr/lib/hpux32/libc.so.1
#2 0x60000000c0231980:0 in malloc_sbrk+0x280 ()
from /usr/lib/hpux32/libc.so.1
#3 0x60000000c0232590:0 in grow_arena+0x210 ()
from /usr/lib/hpux32/libc.so.1
#4 0x60000000c022fa40:0 in real_malloc+0x920 ()
from /usr/lib/hpux32/libc.so.1
#5 0x60000000c022ef20:0 in _malloc+0x800 ()
from /usr/lib/hpux32/libc.so.1
#6 0x60000000c023c950:0 in malloc+0x140 ()
from /usr/lib/hpux32/libc.so.1
#7 0x41c2ae0:0 in add_netsen_shutdown_links_to_chain
() at netsen/ns_shutdown_chain.c:222
#8 0x41c3540:0 in ns_start_shutdown_chain () at
netsen/ns_shutdown_chain.c:342
#9 0x41c3880:0 in ns_shutdown () at
netsen/ns_shutdown_chain.c:372
#10 0x42bc760:0 in cl_chain_link_done () at
utils/cl_chain.c:121
#11 0x436b690:0 in cm_shutdown_event_handler ()
at cm/utils.c:708
#12 0x42c18c0:0 in cl_event_loop () at
utils/cl_event.c:460
#13 0x60000000c00c7420:0 in
__pthread_bound_body+0x170 ()
from /usr/lib/hpux32/libpthread.so.1
In the cluster ASCII configuration file, if the QS_HOST
value is specified with an invalid backslash character
('\'), cmapplyconf dumps core. The stack trace
will show the following functions:
0x60000000c0345510:0 in kill+0x30 ()
from /usr/lib/hpux32/libc.so.1
#1 0x60000000c023bd50:0 in raise+0x30 ()
from /usr/lib/hpux32/libc.so.1
#2 0x60000000c02ff250:0 in abort+0x190 ()
from /usr/lib/hpux32/libc.so.1
#3 0x40a7510:0 in cdb_db_commit+0x890 ()
#4 0x40af270:0 in cdb_external_access+0x890 ()
#5 0x40bc660:0 in cl_config_commit_transaction+0x1560()
#6 0x4181520:0 in cf_configure_cluster+0x2ab0 ()
#7 0x4133b10:0 in config_main+0x5480 ()
#8 0x4146590:0 in main+0x900 ()
A node in a cluster experiencing multiple hangs during
cluster reformation can cause the node experiencing the
hangs and the candidate for coordinator to die when
safety timer expires.
PHSS_35427: ABORT HANG PANIC
cmcld aborts when the select() system call is
interrupted by a signal. This results in the node
being reset by the safety timer.
cmsrvassistd will loop when the script or program
specified in a package SERVICE_CMD parameter does not
exist or does not have execute permission, attempting to
restart the service until the defined maximum service
restart count has been reached. If the count is infinite
cmsrvassistd will take large amounts of CPU effectively
taking over a single cpu system.
System repeatedly TOC's when AUTOSTART_CMCLD is set
to 1 if system multinode package is unable to start.
pthreads patch PHCO_34944 or later exposes a defect in
Serviceguard on uniprocessor systems which can lead to
cmcld consuming 100% of cpu resulting in a hang or system
TOC. This does not apply to multi-processor systems.
When cmgmsd cannot be halted correctly within timeout,
cmcld hits an assertion and the node will TOC if the
safety timer is still enabled. But, there is no core
from cmgmsd to understand the reason why it could not
halt.
With CVM4.1, failover time could be increased by a few
seconds if a FAILFAST service fails while the cluster is
reforming. This can delay the TOC of the local node and
eventually cluster reformation.
The Serviceguard daemon cmlvmd terminates upon receipt
of SIGHUP. This causes cmcld to abort and a potential
node TOC.
Improved integration with Distributed Systems
Administration Utilities (DSAU). Intermittent command
hangs in cmapplyconf and cmdeleteconf have been seen
when DSAU is running on nodes in a Serviceguard cluster.
PHSS_35371: ABORT PANIC
Reuse of memory during reprobe of DGC disks can lead
cmclconfd to SIGSEGV resulting in command failures.
Invalid data can be specified in the USER_NAME field for
the access control policies in the cluster ascii file
and a cmapplyconf will complete without error. When a
cmapplyconf is re-executed to correct this, and if
the cluster is running, cmcld will abort, resulting
in a node TOC.
PHSS_34337: ABORT HANG PANIC CORRUPTION
Corruption in link level messages can lead to a cmcld
SIGSEGV even with checksummed messages. The stack traces
of the resulting core files can vary but often include
one of the functions ns_if_setgood or dlpi_recv.
A socket call failure due to insufficient available
memory causes cmcld to abort.
UDP messages were not marked as invalid even if there
were invalid values for length and offset fields in the
message, causing cmclconfd to exit without receiving
the message and/or cmviewcl to spin indefinitely.
The Serviceguard NMAPI interface fails if the file
descriptor used to connect to cmgmsd is greater than the
default FD_SETSIZE, i.e. 24576 causing data corruption
of the client process.
Formation of 2 clusters may potentially result in
packages running on 2 nodes at the same time and may
potentially result in data corruption issues.
When no buffer space is available, the LVM daemon
aborts, causing the cluster daemon to also abort,
leading to a TOC.
When the timer loop thread is stuck (not holding
cm_lock) or the system clock is not advancing, cmcld
threads will not be scheduled. This prevents cmcld
timeout and prevents the safety timer being updated
resulting in all nodes being TOC'd.
The cmviewconf command can core dump if it cannot get
node information (for example, if it cannot contact the
cmclconfd daemon).
When connection fails between cmcld and quorum server
frequently and at adjacent intervals, cmcld may dump
core.
Memory freed twice during cluster reformation may
cause cmcld to dump core.
Removing a node from a cluster when a package is
running on that node causes cmquerycl to dump core.
PHSS_33840: HANG ABORT
After deleting a node from the cluster, the
configuration daemon (cmclconfd) on the deleted
node goes into an infinite loop.
When multiple nodes are started and join an existing
cluster at the same time, a node may abort.
cmcld aborts because cmapplyconf incorrectly succeeds,
instead of failing as it should, when it has 'detected
a partition of IP subnet' and the 'minimum network
configuration requirement for the cluster have not
been met.'
cmapplyconf can fail and dump core when the
administrator runs it to modify the cluster
configuration and one of the members has a network
interface with intermittent problems, where the
device's availability turns on and off.
Category Tags:
defect_repair enhancement general_release critical panic
halts_system corruption manual_dependencies
Path Name: /hp-ux_patches/s700_800/11.X/PHSS_36636
Symptoms:
PHSS_36636:
1. Defect: JAGag34293 SR: 8606480162
Other applications on the network using Serviceguard
reserved network ports (hacl-cfg ports 5302/udp
and 5302/tcp) can cause cmquerycl to fail unexpectedly
with misleading error messages. This is also applicable
to cmcheckconf, cmviewcl, cmgetconf, cmrunnode,
cmapplyconf and cmruncl commands.
# cmquerycl
Unable to receive a datagram from the configuration
daemon (cmclconfd): No message of desired type
cmquerycl: Unable to find any configuration information
# cmapplyconf -v -C cluster.ascii
Checking cluster file: cluster.ascii
Checking nodes ... Done
Checking existing configuration ... Done
Node <node1> is refusing Serviceguard communication.
Please make sure that the proper access is
configured on node <node1> through either file-based
access (pre-A.11.16 version) or role-based access
version A.11.16 or higher) and/or that the host name
lookup on node <node1> resolves the IP address
correctly.
cmapplyconf: Failed to gather configuration information
2. Defect: JAGag34599 SR: 8606480516
Switching back from standby lan to primary lan can
cause lan failover message to be displayed multiple
times in the syslog. This is seen typically when
two subnets, one with package ip configured and the
other without, recover from failure at about the
same time. Messages similar to the following can
be seen in syslog.
cmcld: lan2 switched to lan1
above message repeats 440 times
cmcld: lan2 switched to lan1
3. Defect: JAGag37580 SR: 8606484458
If cmsrvassistd terminates for any reason, there will
be a system TOC, but no information to determine the
cause of the problem. The following message is
displayed in syslog:
cmcld: Service assistant daemon died unexpectedly!
It may be due to a pending reboot or panic.
cmcld: Exiting with status 1.
4. Defect: JAGag41341 SR: 8606488693
If HEARTBEAT_INTERVAL is not set in cluster
configuration file, cmapplyconf succeeds but cmruncl
dumps core.
Running cmviewconf shows that the heartbeat interval
is set to zero.
# cmviewconf
Cluster information:
cluster name: abc
heartbeat interval: 0.00 (seconds)
5. Defect: JAGag42544 SR: 8606490065
Each time the logical volume groups are queried by
cmclconfd on a HP-UX 11.31 system which has VxFS 4.1
or later installed, the following false syslog message
is displayed:
cmclconfd: Cannot recognize version 6 or later VxFS file
systems. Make sure that libc patch PHCO_32488 or later
is installed if such file systems are used.
This defect is not applicable to Serviceguard on HP-UX
11.23 systems.
6. Defect: JAGag36170 SR: 8606482272
There is a very small timing window when cmcld can dump
core with a segmentation violation. The last 2 frames
in the stack trace will show the following functions:
#0 0x175b60 in cl_list_remove+0xbc ()
#1 0x161024 in st_delete_callback_private+0x548 ()
7. Defect: JAGag37994 SR: 8606484939
cmviewcl displays the status of system multinode
packages as "starting" when a cmhaltnode or cmhaltcl
command is executing at the same time.
8. Defect: JAGag42785 SR: 8606490333
Serviceguard commands may fail due to the udp cmclconfd
daemon core dumping with a SIGSEGV. Below is an example
of the error message for one such command when this
happens:
# cmcheckconf -v -C ./cmclconf.ascii
Checking cluster file: ./cmclconf.ascii
Checking nodes ... Done
Checking existing configuration ... Done
Warning: Can not find configuration for cluster
<cluster_name>
Error: Unable to establish communication to node
<node_name>: 19
cmcheckconf : Failed to gather configuration
information
The syslog for this would look like:
inetd: hacl-cfg/udp: Died on signal 11
The stack trace would look like:
#0 0x60000000c0320300:0 in
T_19_f81_cl___doprnt_main+0x99b0 ()
from /usr/lib/hpux32/libc.so.1
#1 0x60000000c030e570:0 in _doprnt+0x30 ()
from /usr/lib/hpux32/libc.so.1
#2 0x60000000c03341c0:0 in snprintf+0x140 ()
from /usr/lib/hpux32/libc.so.1
#3 0x432b920:0 in add_alias_ip_addrs+0xe0 ()
#4 0x432e110:0 in
sg_sec_check_filebased_security+0x10b0 ()
#5 0x432fa40:0 in sg_get_security_privilege+0x220 ()
#6 0x40d0c20:0 in get_udp_message+0x850 ()
#7 0x40d37f0:0 in main+0x27a0 ()
9. Defect: JAGag43673 SR: 8606491388
The importing of VxVM disk groups in
Metrocluster/SRDF environment can fail during
package start up. This typically happens when
nodes on the SRDF R1 side are rebooted or
restarted while SRDF is in the process of
reconfiguring the RDF device groups on the SRDF R2
side. During RDF reconfiguration, the devices on the
R1 side can be in a state which causes the VxVM disk
scan done at system boot up to label the devices as
offline, then later when a package on the R1 side
attempts to import the VxVM disk groups, the import
will fail.
10. Defect: JAGag42796 SR: 8606490345
When the cluster services are halted on a node,
if external IP addresses are configured cmcld
will continue to consume memory after the node
has left the cluster until it reaches the kernel
limit at which point it will core dump. The external
IP addresses do not get removed and need to be removed
manually after cmcld dumps core.
The stack traces of the resulting core files can vary
but often include the function
"add_netsen_shutdown_links_to_chain".
#0 0x60000000c035d830:0 in _brk+0x30 ()
from /usr/lib/hpux32/libc.so.1
#1 0x60000000c036cb00:0 in sbrk+0xf0 ()
from /usr/lib/hpux32/libc.so.1
#2 0x60000000c0231980:0 in malloc_sbrk+0x280 ()
from /usr/lib/hpux32/libc.so.1
#3 0x60000000c0232590:0 in grow_arena+0x210 ()
from /usr/lib/hpux32/libc.so.1
#4 0x60000000c022fa40:0 in real_malloc+0x920 ()
from /usr/lib/hpux32/libc.so.1
#5 0x60000000c022ef20:0 in _malloc+0x800 ()
from /usr/lib/hpux32/libc.so.1
#6 0x60000000c023c950:0 in malloc+0x140 ()
from /usr/lib/hpux32/libc.so.1
#7 0x41c2ae0:0 in add_netsen_shutdown_links_to_chain
() at netsen/ns_shutdown_chain.c:222
#8 0x41c3540:0 in ns_start_shutdown_chain () at
netsen/ns_shutdown_chain.c:342
#9 0x41c3880:0 in ns_shutdown () at
netsen/ns_shutdown_chain.c:372
#10 0x42bc760:0 in cl_chain_link_done () at
utils/cl_chain.c:121
#11 0x436b690:0 in cm_shutdown_event_handler ()
at cm/utils.c:708
#12 0x42c18c0:0 in cl_event_loop () at
utils/cl_event.c:460
#13 0x60000000c00c7420:0 in
__pthread_bound_body+0x170 ()
from /usr/lib/hpux32/libpthread.so.1
11. Defect: JAGag44706 SR: 8606492535
cmcld logs the message below frequently when the number
of packages configured in the cluster is high.
cmcld: Unable to set socket buffer size to 360448
bytes (No buffer space available), continuing anyway.
These messages are not an indication of a failure, as
Serviceguard properly handles this situation.
12. Defect: JAGag43289 SR: 8606490922
"cmviewcl -f line" always yields os_status as 'unknown'
for any remote node in a cluster.
13. Defect: JAGag45533 SR: 8606493360
cmviewconf displays the service fail fast flag
as "disabled" even though the flag was enabled.
14. Defect: JAGag45718 SR: 8606493785
In the cluster ASCII configuration file, if the quorum
server hostname, QS_HOST, is specified with an invalid
backslash character ('\'), cmapplyconf dumps core.
The stack trace will show the following functions:
0x60000000c0345510:0 in kill+0x30 ()
from /usr/lib/hpux32/libc.so.1
#1 0x60000000c023bd50:0 in raise+0x30 ()
from /usr/lib/hpux32/libc.so.1
#2 0x60000000c02ff250:0 in abort+0x190 ()
from /usr/lib/hpux32/libc.so.1
#3 0x40a7510:0 in cdb_db_commit+0x890 ()
#4 0x40af270:0 in cdb_external_access+0x890 ()
#5 0x40bc660:0 in cl_config_commit_transaction+0x1560()
#6 0x4181520:0 in cf_configure_cluster+0x2ab0 ()
#7 0x4133b10:0 in config_main+0x5480 ()
#8 0x4146590:0 in main+0x900 ()
15. Defect: JAGag41937 SR: 8606489376
A node in a cluster experiencing multiple hangs during
cluster reformation can cause the node experiencing
the hangs and the candidate for coordinator to die
when safety timer expires.
16. Defect: JAGag46086 SR: 8606494153
cmapplyconf displays inappropriate error message when
multiple QS_HOST entries are specified.
There is a possibility that cmquerycl may core dump if
-q option is specified as last parameter in the
cmquerycl command.
The 'cmviewcl -v -f line' command displays quorum server
ip addresses in an invalid format.
The 'cmviewcl -v -f line' output should be
quorum_server:<node_name>|ip_address=192.76.1.2|
name=192.76.1.2
instead of
quorum_server:<node_name>|ip_address:192.76.1.2|
name=192.76.1.2
17. Defect: JAGag46092 SR: 8606494159
While the cmapplyconf command appears to allow the
QS_POLLING_INTERVAL and QS_TIMEOUT_EXTENSION values
to be modified while the cluster is running, these
values are not actually changed in the cluster
configuration.
cmapplyconf should prevent these parameters from being
modified while the cluster is running.
18. Defect: JAGag27798 SR: 8606473093
When a halted node cannot communicate with other
nodes of the cluster, cmviewcl on the halted node
displays the package status and state as "down"
and "halted" respectively, as shown below. But the
packages are running fine on the other nodes, so
the package state should be displayed as UNKNOWN.
When the node, node2 is halted and cannot communicate
with the other cluster nodes, cmviewcl on that node
returns the following:
# cmviewcl
CLUSTER STATUS
cluster1 unknown
NODE STATUS STATE
node1 unknown unknown
node2 down unknown
UNOWNED_PACKAGES
PACKAGE STATUS STATE AUTO_RUN NODE
pkg1 down halted enabled unowned
19. Defect: JAGag38581 SR: 8606485608
Non-root users having monitor or admin access
privileges can view all configured access control
policies for the cluster.
PHSS_35427:
1. Defect: JAGag21443 SR: 8606465899
cmcld aborts when the select() system call is
interrupted by a signal. This results in the node
being reset by the safety timer. The following
messages will be logged in the syslog file:
cmcld[2257]: Aborting! select failed (file:
lcomm/local_server.c, line: 1165)
cmcld[2257]: select for port 46356 failed with
Interrupted system call
cmcld[2257]: select for port 46100 failed with
Interrupted system call
cmcld[2257]: 29, 95774e60, 8ef
cmcld[2257]: 17 (read)
cmcld[2257]: 19 (read)
cmcld[2257]: 20 (read)
cmcld[2257]: 21 (read)
cmcld[2257]: 26 (read)
cmcld[2257]: 27 (read)
cmcld[2257]: 28 (read)
cmcld[2257]: 29 (read)
cmcld[2257]: Aborting! select failed (file:
rcomm/comm_ip.c, line: 443)
cmcld[2257]: 33, 95774e60, 8f0
cmcld[2257]: 22 (read)
cmcld[2257]: 23 (read)
cmcld[2257]: 24 (read)
cmcld[2257]: 25 (read)
cmcld[2257]: 30 (read)
cmcld[2257]: 31 (read)
cmcld[2257]: 32 (read)
cmcld[2257]: 33 (read)
cmcld[2257]: Aborting! select failed (file:
rcomm/comm_ip.c, line: 443)
cmclconfd[2256]: The Serviceguard daemon,
/opt/cmcluster/bin/cmcld[2257], died upon receiving
signal number 6.
2. Defect: JAGag20225 SR: 8606464542
Defect: JAGag29645 SR: 8606475212
The cmquerycl, cmcheckconf and cmapplyconf commands log
errors in syslog if CD/DVD drives from TEAC and other
manufacturers are present in a node, even though the
commands succeed.
The following messages may be seen in syslog.log:
cmclconfd[9660]: Error looking up device
/dev/dsk/c17t1d0: /dev/config is not open.
cmclconfd[3730]: Unable to open
disk /dev/rdsk/c0t0d0: Error 0
3. Defect: JAGag06135 SR: 8606448943
The following warning message will be logged in the
flight recorder log, even if the kernel ticks since
boot are advancing.
FAILURE : Kernel ticks_since_boot has not been
advanced for 4.00 seconds, which is greater than or
equal to maximum allowable interval of 10.00 seconds.
4. Defect: JAGag14977 SR: 8606458777
Defect: JAGag33746 SR: 8606479578
The Serviceguard daemons cmcld and cmnetassistd
do not terminate upon receipt of SIGILL.
5. Defect: JAGag11719 SR: 8606455144
When cmgmsd cannot be halted correctly within the
timeout, cmcld hits an assertion and the node will TOC
if the safety timer is still enabled. However, cmgmsd
leaves no core file from which to determine why it could
not halt. This applies only to SGeRAC installations.
6. Defect: JAGag25946 SR: 8606470887
When activating multiple volume groups at the same time
in a very heavily loaded system, if the parameter
CONCURRENT_VGCHANGE_OPERATIONS in the package
configuration file is set to greater than 1, some of
the vgchange commands might fail with the following
error message in the package log:
vgchange: Failed to establish a connection with cmlvmd
for volume group /dev/vg1
7. Defect: JAGag12644 SR: 8606456223
cmsrvassistd will loop when the script or program
specified in a package SERVICE_CMD parameter does not
exist or does not have execute permissions, attempting
to restart the service until the defined maximum service
restart count has been reached. If the count is
infinite, cmsrvassistd will consume large amounts of
CPU, effectively taking over a single-CPU system.
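One way to avoid a restart loop of this kind is to check, before any restart attempt, that the service command can run at all. The sketch below illustrates such a pre-check in Python. It is an assumption for illustration, not how cmsrvassistd is implemented, and the function name service_cmd_is_runnable is hypothetical. It only tests that the program is a regular file with an execute bit set. A fuller check would also consider which user runs the command.

```python
import os
import stat


def service_cmd_is_runnable(service_cmd):
    """Return True if the program named first in the command can be executed.

    Only the first word of the command is checked. It must be an existing
    regular file with at least one execute permission bit set.
    """
    parts = service_cmd.split()
    if not parts:
        return False
    program = parts[0]
    if not os.path.isfile(program):
        return False
    mode = os.stat(program).st_mode
    execute_bits = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
    return bool(mode & execute_bits)
```

A caller would skip the restart attempts and report the problem once when this returns False.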
8. Defect: JAGag05782 SR: 8606448540
The system repeatedly TOCs when AUTOSTART_CMCLD is set
to 1 if a system multinode package is unable to start.
9. Defect: JAGag27672 SR: 8606472905
pthreads patch PHCO_34944 or later exposes a defect in
Serviceguard on uniprocessor systems which can lead to
cmcld consuming 100% of cpu resulting in a hang or
system TOC. This does not apply to multi-processor
systems.
10. Defect: JAGag20034 SR: 8606464337
Serviceguard does not fail over IPv6 addresses when the
standby is configured on a lower-index interface such
as lan1 and the primary is configured on a higher-index
interface such as lan2.
11. Defect: JAGag25522 SR: 8606470431
The CMGMSD_LOG_FILE parameter is defined in
/etc/cmcluster.conf. As a result, the cmgmsd daemon
logs into the location specified by CMGMSD_LOG_FILE
instead of /var/adm/syslog/syslog.log, which is
supposed to be the default.
12. Defect: JAGag13439 SR: 8606457100
Defect: JAGag25508 SR: 8606470417
The permissions for the Serviceguard SNMP subagent log
file /var/adm/SGsnmpsuba.log are 666 instead of 644.
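The difference between mode 666 and mode 644 is that 666 lets any user write to the file. A minimal Python sketch of how such a file could be detected follows. The helper name world_writable is hypothetical, and the check is illustrative rather than part of the patch.

```python
import os
import stat


def world_writable(path):
    """Return True if users other than the owner and group may write the file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return bool(mode & stat.S_IWOTH)
```

A file with mode 666 makes this return True, and the same file with mode 644 makes it return False.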
13. Defect: JAGag18064 SR: 8606462172
In a Serviceguard cluster with CFS and HP Integrity
Virtual Machine nodes as Serviceguard nodes,
cmapplyconf will allow the first virtual machine node
to be added to the cluster, or the last virtual
machine node to be removed from the cluster while the
System Multi-Node package, SG-CFS-pkg is up on other
nodes. This should not be allowed.
14. Defect: JAGag11227 SR: 8606454589
In extremely rare circumstances, if cmcld dies or is
killed while it is halting, it may not be restartable
on that node. The following message would appear in
syslog.
cmcld: It appears that package applications or
cmcld: resources may be active on this node.
cmcld: Re-starting the cluster could cause data
corruption.
cmcld: To recover from this situation
cmcld: reboot this system:
cmcld: shutdown -r (stops package components)
cmcld: After ensuring that no package applications
cmcld: or resources are active, you can override this
data
cmcld: integrity protection by issuing the following
commands
cmcld: (which allow the daemon to start without
rebooting):
cmcld: rm /var/adm/cmcluster/.cm_start_time
cmcld: touch /var/adm/cmcluster/.cm_start_time
cmcld: For CFS customers, it is highly recommended that
cmcld: they reboot the node instead of using the data
cmcld: override mechanism
15. Defect: JAGaf91648 SR: 8606432206
cmquerycl does not recognize JFS filesystems created
with the default layout version 6 and does not report
them.
In addition to this Serviceguard patch (or its
superseding patch), the libc patch PHCO_32488 or its
superseding patch should be installed.
If the libc patch PHCO_32488 or its superseding patch
is not installed, cmquerycl will still be unable to
recognize JFS filesystems created with the default
layout version 6 and will not report them.
16. Defect: JAGag30170 SR: 8606475859
The Serviceguard daemon cmlvmd terminates upon receipt
of SIGHUP. This causes cmcld to abort and a potential
node TOC.
17. Defect: JAGag31538 SR: 8606477058
cmapplyconf or cmcheckconf of a package with incorrect
syntax for "resource_up_value" might succeed.
18. Defect: JAGag34015 SR: 8606479869
Serviceguard with CVM 4.1 does not support APA or
Infiniband heartbeat interfaces. The Serviceguard
configuration commands currently allow this type of
configuration, which will cause CVM/CFS to be unable to
initialize successfully. This can increase perceived
failover times. As this is an unsupported configuration,
undiscovered symptoms are possible.
19. Defect: JAGag21411 SR: 8606465861
Under rare circumstances, cmrunnode will core dump.
The stack trace of cmrunnode is:
#0 0x60000000c04d0410:0 in kill+0x30 ()
from /usr/lib/hpux32/libc.so.1
#1 0x60000000c03c7430:0 in raise+0x30 ()
from /usr/lib/hpux32/libc.so.1
#2 0x60000000c0489370:0 in abort+0x190 ()
from /usr/lib/hpux32/libc.so.1
#3 0x40f2600:0 in cl_cassfail ()
at utils/cl_clog.c:230
#4 0x4345800:0 in cf_start_post_rba_nodes ()
at config/config_start.c:338
#5 0x4348450:0 in cf_start_cluster ()
at config/config_start.c:716
#6 0x4178d10:0 in cmd_private_fork_daemon ()
at cmd/cmd_utils.c:103
#7 0x4172260:0 in node_main () at cmd/cmd_node.c:403
#8 0x416a2a0:0 in main () at cmd/cmd_main.c:180
20. Defect: JAGaf82011 SR: 8606422187
Improved integration with Distributed Systems
Administration Utilities (DSAU). Intermittent command
hangs in cmapplyconf and cmdeleteconf have been seen
when DSAU is running on nodes in a Serviceguard cluster.
21. Defect: JAGag34316 SR: 8606480189
cmquerycl, cmapplyconf and cmcheckconf do not enforce
the supported limit of 8 nodes for clusters using
CVM 4.1/CFS.
22. Defect: JAGag34191 SR: 8606480056
With CVM 4.1, failover time could be increased by
a few seconds if a FAILFAST service fails while the
cluster is reforming. This can delay the TOC of the
local node and eventually cluster reformation.
23. Defect: JAGag26716 SR: 8606471740
Unused functions in /etc/cmcluster/cfs/SG-CFS-util.sh
are obsoleted. This is not defective behavior, hence
there are no symptoms.
24. Enhancement: JAGag08750 SR: 8606451844
Serviceguard did not support APA's LACP mode and only
supported up to 4 ports per link aggregate for FEC and
MANUAL mode.
This is an enhancement to support Serviceguard with
APA's LACP mode and up to 8 ports per link aggregate
for FEC and MANUAL modes, 32 ports per link aggregate
for LACP mode.
This enhancement requires installing either the 11.23
December 2005 HP-UX 11i v2 fusion release or the APA
patch PHNE_34774. The enhancement is disabled if
neither is installed.
25. Enhancement: JAGag36461 SR: 8606482593
Enhancement to allow Serviceguard Extension for
Faster Failover to be supported with Serviceguard
Storage Management Suite bundles containing CFS.
26. Enhancement: JAGaf87266 SR: 8606427785
Hostnames in Serviceguard and Serviceguard Extension
for RAC cluster nodes are supported only up to 31
characters long.
This is an enhancement to support hostnames in
Serviceguard and Serviceguard Extension for RAC cluster
nodes up to 39 characters long.
This enhancement requires the NodeHostNameXpnd bundle,
available in Software Pack media release
SPK0505-11.23, Part Number: 5013-3681.
This enhancement is disabled if the bundle
NodeHostNameXpnd is not installed.
27. Enhancement: JAGaf93937 SR: 8606435509
With the release of Quorum Server A.03.00.00, multiple
IP addresses for the quorum server can be specified.
This is an enhancement that allows Serviceguard to
support configuration of multiple IP addresses for
the quorum server.
For this enhancement it is required to upgrade the
quorum server version to A.03.00.00. For more
information on how to install and configure Quorum
Server version A.03.00.00, please refer to the release
notes for Quorum Server A.03.00.00. Note that this
release document is expected to be released April or
May 2007. This enhancement remains disabled if
quorum server version is not upgraded to A.03.00.00.
When this patch is used with versions of quorum server
earlier than A.03.00.00, only one quorum server IP
address is supported.
PHSS_35371:
1. Defect: JAGag13927 SR: 8606457625
The cmquerycl command aborts when the cluster
configuration contains DGC devices having long hardware
paths, with the output given below. A similar abort may
be experienced with other Serviceguard commands such as
cmgetconf and cmapplyconf.
........
Gathering storage information
Found 23 devices on node omztcl2
Analysis of 23 devices should take approximately 5
seconds
0%----10%----20%----30%----40%----50%----60%----70%
----80%----90%----100%
Unable to receive device query message from omztcl2:
Software caused connection abort
Could not send message to node omztcl2: Software caused
connection abort
Assertion failed: conn->inuse, file:
config/config_storage.c, line:2399
2. Defect: JAGag08257 SR: 8606451287
In a cluster with a package configured with a dependent
EMS resource, if you issue cmrunnode on one node while
the cluster is down, cmrunnode will fail and cmcld
will exit cleanly. However, the EMS resource for the
package is not deregistered before cmcld exits.
A subsequent cmruncl will not start the package that
depends on the EMS resource on this node and will log
the following error.
Jul 7 11:44:20 bit cmcld[12586]:
Resource /net/interfaces/lan/status/lan0 does
not meet package RESOURCE_UP_VALUE for package snarf.
Jul 7 11:44:20 bit cmcld[12586]: Package snarf cannot
run on this node.
3. Defect: JAGaf69163 SR: 8606409265
Invalid data can be specified in the USER_NAME field for
the access control policies in the cluster ascii file
and a cmapplyconf will complete without error. When a
cmapplyconf is re-executed to correct this, and if
the cluster is running, cmcld will abort, resulting in a
node TOC.
The following message will be logged in syslog when an
invalid username is applied:
Jul 12 11:34:08 sly cmcld: ERROR:
Invalid user name in RBA
Privilege lookup
The following messages will be logged
in syslog when the invalid username
is corrected:
Jul 12 11:35:06 sly cmcld:
cdb_db_handle_lookup - More than
one found
Jul 12 11:35:06 sly cmcld: CDB Prepare -
Unable to delete /acps/sly/*, object
does not exist
Jul 12 11:35:06 sly cmcld: CDB Prepare -
Unable to perform configuration
operation 2. Return value is 22.
Jul 12 11:35:06 sly cmcld: Aborting:
cdb/cdb_db_server.c 1937 (Failed to
roll back config change
Jul 12 11:35:06 sly cmcld:
cdb_db_handle_lookup - More than one
found
Jul 12 11:35:10 sly cmclconfd[6699]: The
Serviceguard daemon, /usr/lbin/cmcld[6700],
died upon receiving signal number 6.
4. Defect: JAGag09971 SR: 8606453198
With Serviceguard Extension for RAC, when the filesystem
where /etc/cmcluster resides becomes full and Oracle is
trying to request a group membership change,
messages like the following will appear in syslog:
cmgmsd[1997]: Unable to apply the configuration change
due to insufficient disk space.
cmgmsd[1997]: ERROR: commit_cdb_txn: Failed to commit
transaction(28,No space left on device)
This could ultimately manifest itself in various Oracle
failures.
5. Defect: JAGag13268 SR: 8606456893
If cmviewcl fails, a package which uses VxVM disk groups
will fail to start and will report that a disk group
may be imported on another node even if it is not.
The following will be seen in the package log file:
check_dg: Error DG may still be imported on HOST
6. Defect: JAGag11741 SR: 8606455170
A system can become unresponsive during a cmquerycl if
there are a large number of logical volumes configured
on the system. For example a system configured with
1400 logical volumes was unresponsive for 10 minutes
while cmquerycl was running. A similar delay may be
experienced with other Serviceguard commands such as
cmgetconf and cmapplyconf.
7. Defect: JAGag11992 SR: 8606455475
The Serviceguard boot script /sbin/init.d/cmcluster
can take a very long time to execute, resulting in a
long system boot time after a Metrocluster/SRDF
package failover. The following error might show
up in /etc/rc.log:
VxVM vxdisk ERROR V-5-1-531 Device c14t12d1:
clearimport failed: Disk write failure
If the Metrocluster/SRDF package is then restarted on
the R1 node, the import of the VxVM data group fails
with
VxVM vxdisk ERROR V-5-1-539 Device c14t12d1:
get_contents failed: Disk device is offline
VxVM vxdg ERROR V-5-1-587 Disk group dgpkgEMC-1: import
failed: No valid disk found containing disk group
PHSS_34337:
1. In a reforming cluster that has the
NETWORK_FAILURE_DETECTION parameter set to
INONLY_OR_INOUT, full network polling would not be
performed even if the primary lan had missed the
maximum number of inbound polling packets, so a local
lan failover to the standby lan would not occur.
2. Sometimes a package using a psmmon EMS resource may
not come up when Serviceguard is restarted.
3. cmapplyconf incorrectly allows a cluster lock volume
group to be unclustered (VOLUME_GROUP line removed
from cluster ascii file).
4. Serviceguard automatically plumbs standby network
interfaces for IPv6 use, even when IPv6 is not being
used in the cluster configuration.
5. Corruption in link level messages can lead to a cmcld
SIGSEGV even with checksummed messages. The stack traces
of the resulting core files can vary but often include
one of the functions ns_if_setgood or dlpi_recv.
6. cmcld aborts with "Not enough space" on socket
allocation. The following message will be logged in the
syslog:
vmunix: Failed to allocate a socket: Not enough space
vmunix: Service Guard Aborting!
vmunix: Cause: socket failed
vmunix: (File: rcomm/comm_ip_setup.c, Line: 538)
vmunix: Aborting! socket failed
7. syslog shows the following diagnostic message from
cmcld:
connect to 192.77.1.5 for port 5300 failed with
Invalid argument
8. In the package control script templates, the
explanation fields for the VGCHANGE examples are
inaccurate. The examples and log messages from the
control script do not take shared volume group
activation mode into account.
9. Serviceguard A.11.17 provides a SCRIPT_LOG_FILE
parameter to set the log file for a package. Messages
from the control script are written to this log file,
but service command output is not; it still goes to the
default log file. The default log file is named by
appending .log to the control script name.
10. cmhaltnode fails when executed in parallel on
multiple nodes. The error from the failed cmhaltnode
command may look like the following. The error message
is not unique; other error messages are also possible.
$ cmhaltnode -f
Disabling package switching to all nodes being halted.
Warning: Do not modify or enable packages until the
halt operation is completed.
Failed to query the package information
11. Sometimes a package using a simple package dependency
may not start even though the package that it depends
on has started.
12. Error message output by cmcheckconf and cmapplyconf is
not helpful if the bridged network assignment changes.
13. Under very rare circumstances all the nodes in the
cluster may TOC at the same time, when the timer loop
thread is stuck (not holding cm_lock) or the system
clock is not advancing. This prevents cmcld timeout and
prevents the safety timer from being updated resulting
in a TOC.
14. If the hacl-cfg UDP port is scanned by Linux utilities
like nmap and amap, Serviceguard commands potentially
fail for ten minutes. If inetd logging is enabled the
following message is logged to syslog:
"inetd[27802]: hacl-cfg/udp: Server failing
(looping), service terminated."
Sometimes cmviewcl ends up spinning forever with the
following output:
"Protocol failure talking with cmclconfd on
10.144.196.135 (5)"
15. cmviewconf core dumps if it is unable to communicate
with cmclconfd for any reason.
16. cmapplyconf does not provide correct information when
run in a VERITAS Cluster Volume Manager 4.1
environment. Serviceguard does not provide correct
information about minimum LAN requirements when they
are not met. Also the message "Need not have to look
for shared VGs" logged by cmapplyconf is unclear.
17. cmrunnode times out after approximately 35 seconds
rather than waiting for AUTO_START_TIMEOUT to expire
even though cmcld is still trying to form a cluster
in the background.
18. When cmviewcl is run with the options '-v -f line',
the value displayed for the cluster_formation_time
attribute is incorrect.
19. The Serviceguard NMAPI interface fails if the file
descriptor used to connect to cmgmsd is greater than
the default FD_SETSIZE (i.e. 24576), causing data
corruption in the client process. This applies only to
SGeRAC and Oracle client processes.
20. Incorrectly formatted IP addresses in the cluster
ascii file are not correctly detected by cmcheckconf
and cmapplyconf resulting in confusing error messages.
The IP addresses are reported as 255.255.255.255 rather
than the text that was entered in the ascii file. For
example the entry:
HEARTBEAT_IP 16.113.153.bad
results in the following error:
Network interface lan0 on node ogre has a different IP
address (16.113.153.12 != 255.255.255.255)
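A clearer diagnostic is possible when the address text is validated as it is read, so that the error can quote what was actually entered. The following Python sketch shows this idea under stated assumptions: the function name parse_heartbeat_ip is hypothetical, only IPv4 entries are handled, and this is not the code used by cmcheckconf or cmapplyconf.

```python
import ipaddress


def parse_heartbeat_ip(line):
    """Parse a 'HEARTBEAT_IP <address>' entry and return the IPv4 address.

    Raises ValueError quoting the original text if the address is malformed.
    """
    fields = line.split()
    if len(fields) != 2 or fields[0] != "HEARTBEAT_IP":
        raise ValueError("not a HEARTBEAT_IP entry: %r" % line)
    text = fields[1]
    try:
        return ipaddress.IPv4Address(text)
    except ipaddress.AddressValueError:
        # Report the text as entered rather than a placeholder address.
        raise ValueError("HEARTBEAT_IP %r is not a valid IPv4 address" % text) from None
```

Given the entry from the example, the error names 16.113.153.bad directly instead of reporting 255.255.255.255.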
21. A 2-node Serviceguard cluster with a cluster lock may
form two clusters if all heartbeat networks experience
prolonged heavy network congestion and if there are
frequent kernel hangs during a cluster reformation.
This will result in a data integrity problem.
22. For Serviceguard cluster configurations that do not
have the SGeRAC product installed, a SCSI bus reset is
not issued at the appropriate time for exclusively
activated volume groups.
23. The cmcheckconf and cmapplyconf commands may fail with
a misleading error message when a Standby LAN in the
cluster configuration has been disconnected or has
failed.
24. Too many "Unable to stat /etc/cmcluster/cmclconfig,
No such file or directory" messages fill up syslog.
25. A quorum server going up and down at times causes
cmcld to dump core. cmcld aborts with signal 6.
The stack trace of cmcld is:
#0 0x60000000c04a9690:0 in kill+0x30 ()
from /usr/lib/hpux32/libc.so.1
#1 0x60000000c03a0430:0 in raise+0x30 ()
from /usr/lib/hpux32/libc.so.1
#2 0x60000000c04625f0:0 in abort+0x190 ()
from /usr/lib/hpux32/libc.so.1
#3 0x4351700:0 in cl_cassfail (clog_handle=0x0,
module=11,assertion=0x400402f0 "FALSE",
file=0x40029630 "utils/cl_select.c", line=482) at
utils/cl_clog.c:228
#4 0x4382900:0 in cl_select_notify_error () at
utils/cl_select.c:482
#5 0x4383ca0:0 in cl_select_loop () at
utils/cl_select.c:671
#6 0x60000000c00b3d20:0 in
__pthread_bound_body+0x170 ()
from /usr/lib/hpux32/libpthread.so.1
26. The results of cmcheckconf -k and cmapplyconf -k
differ when a volume group listed in the cluster
configuration ascii file is powered off.
27. Configuration commands such as cmgetconf fail after
reporting disks do not have an ID when they do:
Warning: The disk at /dev/dsk/c25t0d0 on node kelvin
does not have an ID, or a disk label.
Error: Unable to determine a unique identifier for
physical volume /dev/dsk/c25t0d0 on node kelvin. Use
pvcreate to give the disk an identifier.
The following errors are reported in syslog:
Feb 6 20:01:07 kelvin cmclconfd[6345]: Unable to open
disk
/dev/dsk/c25t0d0: Resource temporarily unavailable
Feb 6 20:02:20 kelvin cmclconfd[6345]: Physical
volume
/dev/dsk/c25t0d0 in volume group /dev/vgXX does not
have an ID!
28. In certain configurations, cmquerycl could hang
indefinitely if any volume group is removed from the
system while cmquerycl is in progress.
29. The cmviewcl(5) man page that documents cmviewcl -f
line is missing.
30. As a prerequisite to supporting HP Integrity Virtual
Machine (HPVM) nodes as members of a Serviceguard
cluster, the quiescence period during cluster
reformation for HPVM guests needs to be extended.
31. Enhanced the Serviceguard package control script
templates to support integration with the EVFS
(Encrypted Volume File System) product.
32. A TOC occurs and the following error shows up in
syslog:
cmlvmd: Failed to accept connections from commands:
No buffer space available
33. The cmviewcl command takes about a minute to return
the error message when the user does not have an
adequate access level to view the cluster information.
34. Certain memory within the Serviceguard daemon cmcld
was freed twice during cluster reformation, resulting
in memory corruption. This can result in unexpected
behavior ranging from no effect at all to an error
message or a core dump.
35. The documentation for the Serviceguard configuration
parameter NODE_TIMEOUT is improved. The improved
documentation appears in the cluster ASCII file, the
cmquerycl man page and the "Note" message printed after
the cmcheckconf/cmapplyconf commands.
36. During heavy network traffic, cmcld may log the
following message to syslog:
cmcld: Failed to receive IP message from 192.77.1.13
on 5300, Resource temporarily unavailable.
37. Log messages like the following fill up syslog.log
vmunix: LLT INFO V-14-1-10023
lost 12 hb seq 97 from 3 link 1 (lan2)
vmunix: LLT INFO V-14-1-10019
delayed hb 3350 ticks from 2 link 1 (lan2)
...
vmunix: LLT INFO V-14-1-10019
delayed hb 22100 ticks from 3 link 0 (lan1)
These messages appear when Serviceguard is configured
with Veritas Cluster File System or Veritas Cluster
Volume Manager and there is a standby lan configured
for the heartbeat interface.
38. When using the NFS toolkit scripts in a Serviceguard
package control script, if the package is halting and
the NFS scripts are unable to cleanly shut down NFS, the
Serviceguard package script logs the following message
in the log:
Node "nodename": Package start failed at Wed Dec 7
09:22:19 EST 2005
This is a misleading message, since in reality, the
package stop failed, not the package start.
39. The documentation in the package control script states
that the HA NFS script should be named "ha_nfs.sh"
instead of the correct name "hanfs.sh".
40. The cmquerycl -f line option, when used in
conjunction with the -c and -C options, does not display
PV information of a node that is being added into the
cluster.
41. The cmquerycl -f line option displays nodes outside
the cluster, and their node id is set to the highest
node id of the cluster.
42. "cmquerycl -v -fline -c cluster -n nodeA", where the
-n list does not include all the nodes in the existing
cluster, generates a command core.
43. If a cluster node name is configured with more than
eight characters, Serviceguard commands and daemons
may fail with an error indicating that the operating
system release string is null.
44. Serviceguard commands fail to connect to a
Serviceguard node. This can happen when there is a high
volume of Serviceguard traffic on the network, combined
with slow DNS servers and a configuration in which not
all Serviceguard nodes on the network are listed in
/etc/hosts.
For example cmrunpkg can see the cluster as
"down/not running" even if the cluster is running.
cmrunpkg -n april -n may pkg-m-1070450443_7
cmrunpkg: Cluster appears to be down
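Since the symptom depends on cluster nodes missing from /etc/hosts, it can help to check which node names have no entry there. The Python sketch below does this for hosts-format text. The function name hosts_missing is hypothetical, the sample addresses are made-up documentation addresses, and the node names are taken from the example above.

```python
def hosts_missing(hosts_text, node_names):
    """Return the node names that have no entry in hosts-format text."""
    known = set()
    for raw in hosts_text.splitlines():
        # Drop comments, then split into address, canonical name and aliases.
        fields = raw.split("#", 1)[0].split()
        known.update(name.lower() for name in fields[1:])
    return [name for name in node_names if name.lower() not in known]


# Illustrative input only; a real check would read /etc/hosts on each node.
SAMPLE_HOSTS = """
127.0.0.1   localhost loopback
192.0.2.10  april april.example.com   # cluster node
"""
```

For the sample text, hosts_missing(SAMPLE_HOSTS, ["april", "may"]) reports that only may lacks an entry.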
45. Packages start on the first node that satisfies the
package dependencies at cluster start up time and not
necessarily on the configured primary node.
PHSS_33840:
1. This patch provides VERITAS Cluster File System
(CFS 4.1) capability with appropriate Serviceguard
Storage Management Suite bundles such as T2775BA,
T2776BA, or T2777BA. This patch also enables
VERITAS CVM 4.1 capability with Serviceguard A.11.17.
Item 1 therefore does not represent a defect, and
there is no external symptom for it.
2. The boot-time cluster initialization script
(/sbin/init.d/cmcluster) does not retry up to
the full AUTO_START_TIMEOUT time in the case
where the cluster is not already running and
one or more of the configured cluster nodes
are not reachable via the network.
3. After deleting a node from the cluster, the
configuration daemon (cmclconfd) on the deleted
node goes into an infinite loop.
4. Messages such as the following appear in syslog:
cmfileassistd[26383]: The cluster daemon aborted our
connection (231).
cmcld[12375]: Too much pending message memory
(1120552 bytes, 1048576 max)
cmfileassistd[26383]: Lost connection with Serviceguard
cluster daemon (cmcld): Software caused
connection abort
5. When multiple nodes are started and join an existing
cluster at the same time, a node may abort and the
message in syslog will be:
Assertion failed: trans_id != NULL, file:
cdb/cdb_utils.c, line: 263.
This may also appear as a segmentation violation
abort of the cmcld with the stack trace of the core
dump showing cl_config_disconnect.
6. Misspellings were found in the cmquerycl man page.
7. cmcld aborts because cmapplyconf incorrectly passes
rather than failing as it should when 'detected a
partition of IP subnet' and 'minimum network
configuration requirement for the cluster have not
been met.'
An example of the warnings is shown below:
Detected a partition of IP subnet 192.76.1.0.
Partition 1
node1 lan9000
node2 lan9000
node3 lan9000
Partition 2
node4 lan9000
Detected a partition of IPv6 subnet fec0:0:0:4c01::.
Partition 1
node1 lan9000
node2 lan9000
node3 lan9000
Partition 2
node4 lan9000
Checking for inconsistencies
Minimum network configuration requirements for
the cluster have not been met. Minimum network
configuration requirements are:
- 2 or more heartbeat networks OR
- 1 heartbeat network with local
switch (HP-UX Only) OR
- 1 heartbeat network using APA with 2 trunk
members (HP-UX Only) OR
- 1 heartbeat network with serial
line (HP-UX Only) OR
- 1 heartbeat network using bonding (mode 1)
with 2 slaves (Linux Only)
...
Adding configuration to node node4
Modifying configuration on node node1
Modifying configuration on node node3
Modifying configuration on node node2
Adding node node4 to cluster
cluster_node1_200510181839
Marking/unmarking volume groups for
use in the cluster
Completed the cluster creation
8. cmcld has not been updated to handle the
drivers for the newer supported cluster
lock interface cards such as the Ultra160
and Ultra320 cards. Therefore cmcld defaults
the cluster lock timings to the default (worst
case) leading to longer failover times than
would be expected, approximately 60 seconds
rather than 30 seconds which would be seen
for c720 driver for a simple 2 node cluster
with 2 second node timeout.
9. There is no symptom for this item. This is just a
description of the addition of a 64bit version of
libsgcl shared library to the Patch Depot. The
file is being added now for future internal use.
10. cmapplyconf can core dump when one of the cluster
members has a network interface with intermittent
problems, where the device's availability turns on and
off. If the administrator runs cmapplyconf to modify
the cluster configuration while a member has such a
device, cmapplyconf will fail and core dump.
The messages from cmapplyconf might look like:
Checking nodes ... Done
Checking existing configuration ... Done
Probing dst_node_id = 3 dst_net_id= 1 dst_ppa=0
with ZERO dst_mac_addr=0x
Probing dst_node_id = 3 dst_net_id= 2 dst_ppa=0
with ZERO dst_mac_addr=0x
Probing dst_node_id = 3 dst_net_id= 1 dst_ppa=0
with ZERO dst_mac_addr=0x
Probing dst_node_id = 3 dst_net_id= 2 dst_ppa=0
with ZERO dst_mac_addr=0x
Probing dst_node_id = 3 dst_net_id= 1 dst_ppa=0
with ZERO dst_mac_addr=0x
Probing dst_node_id = 3 dst_net_id= 2 dst_ppa=0
with ZERO dst_mac_addr=0x
Assertion failed: NULL != sub_net, file:
config/config_net_evaluate.c, line: 245
The command will fail.
Also, the syslog messages on the node where the
network device is having problems might look like:
Oct 1 05:08:45 nodename cmclconfd[5832]: Lan
interface 0 (PPA) on node id 3 has ZERO MAC address
Oct 1 05:08:45 nodename cmclconfd[5832]: DLPI error 1,
unix error 0 sending to aa080009167f - snap value
Oct 1 05:08:45 nodename cmclconfd[5832]: Problem
with network interface 0: Connection timed out.
11. During a cmrunnode, "Detected Partition" error
messages appear and the cmrunnode succeeds.
Also, a cmomd core dump can occur in the API code path:
when evaluating IP addresses, the code expects at least
one netd object to represent the node with the
configured subnet.
The stack trace would look something like this:
Program terminated with signal 6, Aborted.
#0 0xc0214128 in kill+0x10 () from /usr/lib/libc.2
#0 0xc0214128 in kill+0x10 () from /usr/lib/libc.2
#1 0xc01ab554 in raise+0x24 () from /usr/lib/libc.2
#2 0xc01f0df0 in abort_C+0x160 () from /usr/lib/libc.2
#3 0xc01f0e4c in abort+0x1c () from /usr/lib/libc.2
#4 0x81d0 in crash_handler (s=11) at om/om_main.c:226
#5 <signal handler called>
#6 0xc019b178 in tree_concatenate+0x8 ()
from /usr/lib/libc.2
#7 0xc019c4d0 in real_free+0x498 ()
from /usr/lib/libc.2
#8 0xc019f698 in free+0x340 ()
from /usr/lib/libc.2
#9 0xc96dccc4 in
cf_private_evaluate_ip6_partition
(cl=0x401cabb0, scope=25,
ret=0x7bff5388, logh=0x7bff5174, flags=336)
at config/config_net_evaluate.c:1616
#10 0xc96dd224 in cf_private_evaluate_network_probing
(cl=0x401cabb0, scope=25, flags=336,
logh=0x7bff5174)
at config/config_net_evaluate.c:1718
#11 0xc9710ef8 in cf_private_find_config (cl=0x401cabb0,
scope=25, flags=336, make_copy=1, logh=0x7bff5174)
at config/config_query.c:889
#12 0xc97113a8 in cf_find_config (cl=0x401cabb0, scope=25,
flags=336, logh=0x7bff5174)
at config/config_query.c:981
#13 0xc9712000 in cf_validate_network (cl=0x401cabb0,
flags=336, logh=0x7bff5174)
at config/config_query.c:1183
#14 0xc94345ac in cmp_validate_network_connections
(cl=0x401b28f0, vflag=0, log=0x40011480 "OMOB")
at providers/cmprovider/cmp_utils.c:2316
#15 0xc9454478 in exec_method_op
(context=0x400388d0 "OMOB",
providerOp=0x40167560 "OMOB") at
providers/cmprovider/cluster/cmp_cluster_node.c:1665
#16 0xc9454e1c in cmp_op_SGClusterNodeContainment
(context=0x400388d0 "OMOB",
providerOp=0x40167560 "OMOB") at
providers/cmprovider/cluster/cmp_cluster_node.c:1772
#17 0xc9428944 in CMProviderOperation
(providerOp=0x40167560 "OMOB")
at providers/cmprovider/cmp_provider.c:1346
#18 0xc93c81c4 in _OMProviderOperation
(provider=0x40021958 "OMOB",
providerOp=0x40167560 "OMOB") at
om/cm_provider.c:487
#19 0xc93dc508 in CMProviderOperation
(providerOp=0x40167560 "OMOB")
at om/cm_provider_linkage.c:439
#20 0xc9370c58 in _OMExecMethodOp (
class_name=0x40166910 "SGClusterNodeContainment",
method_name=0x40012608 "start",
instance_id=0x40167830
"SGClusterNodeContainment:SGCluster:1052344974+C
MNode:dad.cup.hp.com", return_value=0x7bff3888,
input_parameters=0x40166888 "OMOB",
output_parameters=0x7bff3878,
clientContext=0x400125a0 "OMOB",
log=0x40011480 "OMOB") at om/cm_ops.c:742
#21 0x1bffc in parse_exec_method (cl=0x400111e0)
at om/om_network.c:3069
#22 0x289d4 in connection_handler (fd=5, key=0x400111e0 "")
at om/om_network.c:5035
#23 0xa2b4 in OMSelectLoop (doneOMSelectLoop=0x40002774)
at om/om_select.c:185
#24 0x9704 in main (argc=5, argv=0x7bff0054)
at om/om_main.c:573
12. When cmcld is starting up, ClusterUp and NodeUp
snmp traps are missed.
13. "cmviewcl -f line" is not fully documented in
man pages.
Defect Description:
PHSS_36636:
1. Defect: JAGag34293 SR: 8606480162
Serviceguard used to exit displaying an error message
whenever an invalid message was received.
Resolution:
The code has been modified to ignore invalid messages.
2. Defect: JAGag34599 SR: 8606480516
There is a shared variable between different bridged
networks. When both subnets recover at about the same
time, the shared variable is left in an inconsistent
state. This causes the messages to appear for one of
the bridged networks.
Resolution:
The code has been modified to not use a shared
variable between different bridged networks.
3. Defect: JAGag37580 SR: 8606484458
The code that handled the cmsrvassistd exit status was
removed in A.11.17 for all cases.
Resolution:
Code has been added to handle the cmsrvassistd exit
status.
4. Defect: JAGag41341 SR: 8606488693
A case was missed: if HEARTBEAT_INTERVAL was not set in
the cluster configuration file, the value defaulted to
zero.
Resolution:
The code has been modified to handle this case.
5. Defect: JAGag42544 SR: 8606490065
The syslog message is relevant only on HP-UX 11.23, not
on HP-UX 11.31 systems. There was no check before
logging to see whether the machine was running HP-UX
11.23.
Resolution:
The code has been modified not to display the false
syslog message on HP-UX 11.31 systems.
6. Defect: JAGag36170 SR: 8606482272
Serviceguard locked a mutex before updating the callback
structure, but unlocked the mutex before removing the
callback from the list.
Resolution:
The code has been modified to unlock the mutex after
the callback is removed from the list.
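The lock ordering described in the resolution can be
sketched as follows. This is a minimal Python
illustration, not Serviceguard code: the
CallbackRegistry class, its methods, and the "state"
field are hypothetical names chosen for the example.

```python
import threading


class CallbackRegistry:
    """Hypothetical callback list guarded by a single mutex."""

    def __init__(self):
        self._lock = threading.Lock()
        self._callbacks = []

    def add(self, callback):
        with self._lock:
            self._callbacks.append(callback)

    def update_and_remove(self, callback, new_state):
        # The mutex stays held across BOTH steps: updating the callback
        # structure and removing it from the list. Releasing it between
        # the two steps would let another thread see a half-updated entry.
        with self._lock:
            callback["state"] = new_state
            self._callbacks.remove(callback)

    def snapshot(self):
        # Return a copy so callers never iterate the guarded list directly.
        with self._lock:
            return list(self._callbacks)


registry = CallbackRegistry()
cb = {"name": "on_event", "state": "active"}
registry.add(cb)
registry.update_and_remove(cb, "done")
```

Holding the lock for the whole update-and-remove step
keeps the list and the entry consistent for any other
thread that takes the same lock.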
7. Defect: JAGag37994 SR: 8606484939
The status of system multinode package is shown as
"starting" during node/cluster halting.
Resolution:
The code has been modified to show the status of
system multinode package as "changing" during
cmhaltnode/cmhaltcl execution.
8. Defect: JAGag42785 SR: 8606490333
The hostent structure could be modified by a later
gethost*() call while it was still in use.
Resolution:
A copy of the hostent structure is used instead of the
original, which protects it against subsequent
gethost*() calls.
9. Defect: JAGag43673 SR: 8606491388
During certain disk operations that refresh data from
the remote side (R2) to the primary side (R1), disks are
rendered "Not Ready" (not visible to the host) for a
short period of time. If the primary node on the R1
side is rebooting during this period and VxVM is
starting up as part of the boot sequence, VxVM marks
these "Not Ready" disks offline because it cannot access
them. Later, this results in a package startup failure
during package failback, because the VxVM disk groups
cannot be imported while the disks are offline.
Resolution:
The Serviceguard package control script has been
enhanced with a new parameter, "VXVM_DG_RETRY", which
can be set to either "YES" or "NO". Setting this
parameter to "YES" runs the command "vxdisk scandisks"
on the disks that belong to the failed disk group.
10. Defect: JAGag42796 SR: 8606490345
Execution of cmhaltcl or cmhaltnode on a cluster
configured with packages using relocatable IP
addresses on subnets that are not in the cluster
configuration causes cmcld to grow in size and dump
core. The elements of the array were not advanced
properly during the removal of IP addresses.
Resolution:
Code has been modified to make sure that the elements
of the array are advanced properly.
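The class of bug described here (mis-advancing an index
while removing entries) can be illustrated with a
generic sketch. The data structure used by cmcld is not
described in this document, so the list of address
strings and the function name below are assumptions made
only for the example.

```python
def remove_addresses(addresses, should_remove):
    """Remove matching entries in place, advancing the index correctly."""
    i = 0
    while i < len(addresses):
        if should_remove(addresses[i]):
            # Deleting shifts the following elements down by one, so the
            # index must NOT advance here; the next element is now at i.
            del addresses[i]
        else:
            i += 1
    return addresses


ips = ["10.0.0.1", "192.168.5.7", "192.168.5.8", "10.0.0.2"]
remove_addresses(ips, lambda ip: ip.startswith("192.168.5."))
```

Advancing the index after every deletion would skip the
element that slid into the freed slot, which is the kind
of error that leaves stale entries behind.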
11. Defect: JAGag44706 SR: 8606492535
A message was logged at the default log level when the
default buffer size of the UNIX domain socket was
insufficient, even though cl_msg's flow control takes
care of adjusting the size and sending the message.
Resolution:
The message log level has been raised and the log
category changed, so that the message is not logged at
default log levels.
12. Defect: JAGag43289 SR: 8606490922
cmviewcl cannot obtain the os_status value unless
the node is probed. This is only done when the verbose
option is used. cmviewcl should not display the
os_status unless the verbose option is used.
Resolution:
os_status is only displayed when the verbose option is
specified with cmviewcl.
13. Defect: JAGag45533 SR: 8606493360
While displaying the status of the service fail fast
flag, the cmviewconf command compared the flag using
values in different byte orders, which caused an
incorrect flag status to be displayed.
Resolution:
The comparison is now performed in the proper byte
order, so the correct status of the service fail fast
flag is displayed.
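A byte order mismatch of this kind can be sketched as
follows. The example assumes, for illustration only,
that the flag is stored as a 4-byte value in network
(big-endian) byte order; the constant FAIL_FAST_ENABLED
and the function name are hypothetical.

```python
import struct

FAIL_FAST_ENABLED = 0x00000001  # hypothetical flag bit


def is_fail_fast_enabled(raw_flags: bytes) -> bool:
    """Decode a 4-byte network-order flags field, then test the bit."""
    (flags,) = struct.unpack("!I", raw_flags)
    return bool(flags & FAIL_FAST_ENABLED)


# The flag as stored, in network byte order.
stored = struct.pack("!I", FAIL_FAST_ENABLED)

# The same four bytes misread as little-endian: the bit lands elsewhere,
# so a comparison against FAIL_FAST_ENABLED gives the wrong answer.
misread = struct.unpack("<I", stored)[0]
```

Converting to one byte order before comparing is what
makes the test reliable on both big-endian and
little-endian hosts.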
14. Defect: JAGag45718 SR: 8606493785
In the cluster ASCII configuration file, if the QS_HOST
value contains an invalid backslash character ('\'),
cmapplyconf dumps core.
Resolution:
The values specified in the cluster configuration file
for QS_HOST and QS_ADDR are checked for invalid
backslash characters. An error is reported if either
value contains one.
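The check described in the resolution can be sketched as
below. This is an illustration only: the dictionary
representation of the configuration, the function name,
and the wording of the error text are assumptions and do
not reproduce actual cmapplyconf output.

```python
def check_quorum_server_values(config):
    """Return error strings for QS_HOST/QS_ADDR values with a backslash."""
    errors = []
    for keyword in ("QS_HOST", "QS_ADDR"):
        value = config.get(keyword)
        if value is not None and "\\" in value:
            # Hypothetical message text, for illustration only.
            errors.append(
                f"{keyword} value {value!r} contains an invalid "
                "backslash character"
            )
    return errors


problems = check_quorum_server_values({"QS_HOST": "qs\\host"})
```

Returning the errors instead of failing on the first one
lets a caller report every bad value in a single pass.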
15. Defect: JAGag41937 SR: 8606489376
The node experiencing hangs dies because it is unable to
update its safety timer. The candidate for coordinator
dies because it is waiting for a heartbeat from the
node experiencing hangs in order to update its safety
timer.
Resolution:
The node experiencing hangs is failed before it causes
problems for other nodes in the cluster.
16. Defect: JAGag46086 SR: 8606494153
cmapplyconf displays an inappropriate error message
when multiple QS_HOST entries are specified.
When the -q option is read from the cmquerycl command
line, the array index is not incremented properly.
In 'cmviewcl -v -f line' output, the quorum server IP
addresses are displayed in an invalid format.
Resolution:
The error message has been updated to display an
appropriate, valid message.
The way cmquerycl reads command line arguments for the
-q option has been corrected.
The 'cmviewcl -v -f line' output format string has been
changed to display the ip_address values properly.
17. Defect: JAGag46092 SR: 8606494159
cmapplyconf does not check for online changes in
QS_POLLING_INTERVAL and QS_TIMEOUT_EXTENSION values.
Resolution:
cmapplyconf is modified so it disallows changes to
QS_POLLING_INTERVAL and QS_TIMEOUT_EXTENSION online
and fails with an error if this is attempted.
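The rule in this resolution can be sketched as a simple
comparison of the running and proposed configurations.
The dictionaries, the function name, and the
QS_TIMEOUT_EXTENSION value below are hypothetical; only
the two parameter names come from this document.

```python
ONLINE_LOCKED_KEYS = ("QS_POLLING_INTERVAL", "QS_TIMEOUT_EXTENSION")


def rejected_online_changes(running_config, new_config, cluster_is_up):
    """List quorum server timing parameters that differ while the cluster is up."""
    if not cluster_is_up:
        return []  # offline changes are not restricted by this check
    return [
        key
        for key in ONLINE_LOCKED_KEYS
        if running_config.get(key) != new_config.get(key)
    ]


running = {"QS_POLLING_INTERVAL": 300000000, "QS_TIMEOUT_EXTENSION": 2000000}
proposed = {"QS_POLLING_INTERVAL": 120000000, "QS_TIMEOUT_EXTENSION": 2000000}
blocked = rejected_online_changes(running, proposed, cluster_is_up=True)
```

A non-empty result corresponds to the case where the
command would fail with an error instead of applying the
change.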
18. Defect: JAGag27798 SR: 8606473093
For all unowned packages, the status and state were
displayed by default as "down" and "halted"
respectively, without checking whether the cluster was
reachable.
Resolution:
Package status is now displayed as "unknown" when the
nodes of the cluster are not reachable.
19. Defect: JAGag38581 SR: 8606485608
Commands that show cluster information will display all
configured access control policies to a non-root user
that has privilege to run the command. This itself is
not a problem but it has been decided to show policies
only to those with the same or higher level of access.
Resolution:
The roles displayed will match the privilege level of
the user running the command based on the user name and
host from where command is run.
PHSS_35427:
1. Defect: JAGag21443 SR: 8606465899
The select() system call was not retried when it
failed because of an interrupted system call.
Resolution:
The select() system call is retried up to ten times
when it fails because of an interrupted system call.
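A bounded retry on interruption can be sketched as
follows. This is not the Serviceguard implementation;
the wrapper name and the stand-in function flaky_select
are hypothetical. (In Python 3.5 and later,
select.select() already retries automatically when
interrupted, so a stand-in function is used here to show
the pattern.)

```python
import errno


def call_with_eintr_retry(func, *args, max_retries=10):
    """Call func(*args); retry up to max_retries times if it is interrupted."""
    retries = 0
    while True:
        try:
            return func(*args)
        except InterruptedError:  # OSError subclass raised for errno EINTR
            if retries >= max_retries:
                raise  # retry budget exhausted; report the failure
            retries += 1


attempts = []


def flaky_select():
    """Stand-in for a system call that is interrupted twice, then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise InterruptedError(errno.EINTR, "Interrupted system call")
    return "ready"


result = call_with_eintr_retry(flaky_select)
```

Capping the retries keeps a persistently interrupted
call from looping forever while still tolerating the
occasional signal.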
2. Defect: JAGag20225 SR: 8606464542
Defect: JAGag29645 SR: 8606475212
TEAC CD/DVD devices were not excluded from probing
because their unique peripheral descriptions prevented
them from being detected as CD/DVD devices. Some CD/DVD
devices from other manufacturers were also probed
because their descriptions were not present in
cmclconfd.
Resolution:
cmclconfd has been modified to exclude more CD and DVD
devices and specific TEAC CD/DVD devices.
3. Defect: JAGag06135 SR: 8606448943
When 3 heartbeats are exchanged between coordinator and
member node in the same tick, "ticks_since_boot not
advancing for last 4 seconds" is logged in the flight
recorder log. This message is misleading.
Resolution:
The code has been changed so that no warning message is
logged if 3 heartbeats are received in the same tick; a
warning is logged only if 4 heartbeats are received in
the same tick.
4. Defect: JAGag14977 SR: 8606458777
Defect: JAGag33746 SR: 8606479578
cmcld and cmnetassistd were coded to ignore SIGILL.
Resolution:
The code that ignored SIGILL has been removed.
5. Defect: JAGag11719 SR: 8606455144
No core file was produced by cmgmsd because it was not
aborted when it failed to halt correctly within the
timeout.
Resolution:
cmcld now sends an abort signal to cmgmsd if the daemon
fails to halt within the timeout, causing cmgmsd to
dump a core file. This change helps in troubleshooting
the underlying problem in the future.
6. Defect: JAGag25946 SR: 8606470887
The maximum number of connections that could be
accepted by the cmlvmd daemon was 5, which is too small.
Resolution:
The number of connections that can be accepted by the
cmlvmd daemon has been increased to 4096.
7. Defect: JAGag12644 SR: 8606456223
There was no check to see whether the service script
or program exists or has execute permission before
attempting to run the service.
Resolution:
Only allow service restart to be attempted if service
script or program exists with execute permission.
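The precondition in this resolution can be sketched as
a small check. This is an illustration only: the
function name is hypothetical, and the sketch checks
just the program path, not any arguments that a service
command might carry.

```python
import os
import sys


def can_run_service(path):
    """True only if path is a regular file the caller may execute."""
    return os.path.isfile(path) and os.access(path, os.X_OK)


# The running Python interpreter serves as a known executable file.
interpreter_ok = can_run_service(sys.executable)
missing_ok = can_run_service("/nonexistent/service.sh")
```

Checking both conditions before a restart avoids
repeated restart attempts that could never succeed.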
8. Defect: JAGag05782 SR: 8606448540
When a system multi-node package fails at startup, it
causes the node to TOC. If AUTOSTART_CMCLD is set to 1,
the system can repeatedly TOC and reboot if there are
any problems with the system multi-node package after
cmcld is up and running.
Resolution:
Cluster activities are not automatically started if the
node has TOC'ed twice in succession because of a
failure to start the system multinode package. A
message is logged in syslog.log and /etc/rc.log.
9. Defect: JAGag27672 SR: 8606472905
The issue resulted from Serviceguard's use of multiple
thread priorities within the same process.
Resolution:
Added additional synchronization between the threads.
10. Defect: JAGag20034 SR: 8606464337
Lower index interface is processed first and added to
the bridged net. In this scenario, the standby
configured on lower index interface does not find the
primary interface in the bridge net to figure out if
IPv6 is configured.
Resolution:
The plumbing routine is done only after all interface
entries are added to the database.
11. Defect: JAGag25522 SR: 8606470431
The variable CMGMSD_LOG_FILE was included
in /etc/cmcluster.conf.
Resolution:
The CMGMSD_LOG_FILE variable has been removed from
/etc/cmcluster.conf so that the cmgmsd daemon
henceforth logs to syslog.
12. Defect: JAGag13439 SR: 8606457100
Defect: JAGag25508 SR: 8606470417
The permissions for /var/adm/SGsnmpsuba.log were set to
666, which allows any user to write to this log file.
Resolution:
The permissions for the log file
/var/adm/SGsnmpsuba.log are now set to 644.
13. Defect: JAGag18064 SR: 8606462172
Incorrect behavior of cmapplyconf allows the
first virtual machine node to be added or last
virtual machine node to be removed from a
Serviceguard CFS Cluster even when system multi
node package SG-CFS-pkg is up on other nodes.
Resolution:
Modified the behavior of cmapplyconf to disallow
this configuration.
14. Defect: JAGag11227 SR: 8606454589
Changes to cmcld increased the halt sequence time,
where if cmcld were to die, it would not be restartable
on that node.
Resolution:
Modified the code to decrease the halt sequence time.
15. Defect: JAGaf91648 SR: 8606432206
cmquerycl command does not recognize JFS filesystems
created with the default layout version 6 and does not
report them.
Resolution:
cmquerycl has been enhanced to identify and report
the logical volumes with JFS version 6 layout file
system.
In addition to this Serviceguard patch (or its
superseding patch), the libc patch PHCO_32488 or its
superseding patch should be installed.
If the libc patch PHCO_32488 or its superseding patch
is not installed, cmquerycl cannot recognize JFS
filesystems created with the default layout version 6
and does not report them.
16. Defect: JAGag30170 SR: 8606475859
cmlvmd was coded not to ignore SIGHUP.
Resolution:
cmlvmd has been modified to ignore SIGHUP.
17. Defect: JAGag31538 SR: 8606477058
cmapplyconf or cmcheckconf did not check for string
boundaries while parsing string value given for
"resource_up_value".
Resolution:
Modified cmapplyconf and cmcheckconf to look for end
of string before parsing the subsequent token.
18. Defect: JAGag34015 SR: 8606479869
Serviceguard should prevent clusters using APA or
Infiniband from being configured when SG-CFS-pkg is
configured.
Resolution:
Added checks so that cmquerycl, cmcheckconf and
cmapplyconf will fail if Serviceguard with CVM 4.1 is
configured with APA or infiniband heartbeat interfaces.
19. Defect: JAGag21411 SR: 8606465861
Execution of the command cmrunnode on a cluster
even before the earlier command cmapplyconf has
finished writing the config file, can cause the value
of "node" to be NULL, which causes an assertion.
Resolution:
Assertion has been replaced with the following message,
"Unable to execute the command at this time, please try
again." asking the user to try again.
20. Defect: JAGaf82011 SR: 8606422187
The execution of cmapplyconf, cmdeleteconf invokes
scripts in a Distributed Systems Administration
Utilities (DSAU) environment. The command 'ps -ef' on
any script launched from
/usr/sbin/cmconfig_change_callout shows wrong process
name. It will show the process name of the parent
script, '/usr/bin/sh usr/sbin/cmconfig_change_callout'.
Resolution:
/usr/sbin/cmconfig_change_callout has been modified to
use nohup to launch the script.
21. Defect: JAGag34316 SR: 8606480189
cmquerycl, cmapplyconf, cmcheckconf do not enforce the
supported limit of 8 nodes for clusters using
CVM 4.1 /CFS.
Resolution:
Disallow cmquerycl, cmapplyconf, cmcheckconf operation
for CVM 4.1 /CFS cluster of more than 8 nodes.
22. Defect: JAGag34191 SR: 8606480056
With CVM 4.1, when a failfast service fails while
the cluster is reforming, safety time will be updated
to a value beyond what it should be.
Resolution:
Added a check to ensure safety time is not set past
its current value.
23. Defect: JAGag26716 SR: 8606471740
Unused functions in /etc/cmcluster/cfs/SG-CFS-util.sh
are obsoleted. This is not a defective behavior hence
there is no defect description.
24. Enhancement: JAGag08750 SR: 8606451844
Serviceguard did not support APA's LACP mode and only
supported up to 4 ports per link aggregate for FEC and
MANUAL mode.
Resolution:
Serviceguard has been enhanced to support APA's LACP
mode and up to 8 ports per link aggregate for FEC and
MANUAL modes, 32 ports per link aggregate for LACP mode.
In addition to this Serviceguard patch or a later one,
the user needs to install the December 2005 HP-UX 11i
v2 (11.23) fusion release or the APA patch PHNE_34774.
This enhancement is disabled if either of them is not
installed.
25. Enhancement: JAGag36461 SR: 8606482593
Certification and software limitations were required
for support of the Storage Management Suite's cluster
filesystem component within a cluster configured for
faster failover.
Resolution:
Added checks during cmapplyconf to enforce a minimum
node timeout limit for clusters configured with both
CFS and faster failover.
26. Enhancement: JAGaf87266 SR: 8606427785
In Serviceguard Extension for RAC (SGeRAC)
configuration, the SLVM subsystem is unable to support
more than 31 character hostnames. Hence it limits the
support of hostnames in Serviceguard and SGeRAC cluster
nodes to only 31 characters.
Resolution:
Removed the 31-character hostname restriction in cmcld
and enhanced cmlvmd to allow the existing vgdisplay
command in an SGeRAC configuration to truncate
39-character hostnames of the cluster nodes to 30
characters plus a '*' at the end in its output.
Note that the SLVM subsystem on HP-UX 11.23 still
does not support hostnames longer than 31 characters.
This enhancement lets Serviceguard support 39-character
hostnames and at the same time integrate seamlessly
with the existing SLVM subsystem.
In addition to this Serviceguard patch or a later one,
the user needs to install the following bundle:
NodeHostNameXpnd, available in Software pack media
release: SPK0505-11.23, Part Number: 5013-3681.
This enhancement is disabled if the bundle
NodeHostNameXpnd is not installed.
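The truncation rule described for this enhancement (30
characters plus a '*') can be sketched as below. The
function name is hypothetical, and showing hostnames of
31 characters or fewer unchanged is an assumption based
on the SLVM limit stated above.

```python
SLVM_MAX_HOSTNAME = 31  # limit stated for the SLVM subsystem on HP-UX 11.23


def display_hostname(hostname, limit=SLVM_MAX_HOSTNAME):
    """Return hostname unchanged if it fits, else limit-1 chars plus '*'."""
    if len(hostname) <= limit:
        return hostname
    return hostname[: limit - 1] + "*"


long_name = "n" * 39
shown = display_hostname(long_name)
```

The trailing '*' signals that the displayed name is
abbreviated while keeping the output within the
31-character width.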
27. Enhancement: JAGaf93937 SR: 8606435509
When the quorum server is needed during a cluster
reconfiguration, if the subnet on which the cluster
nodes communicate with the quorum server goes down,
then the cluster will go down. If an additional
subnet (a total of two subnets) can be configured
for communication between nodes in the cluster and
quorum server, this will provide additional redundancy.
In Quorum Server A.03.00.00 release, it is possible to
specify two IP addresses for nodes to communicate with
the quorum server. Prior to this feature, the nodes of a
Serviceguard cluster could communicate with the quorum
server through only one subnet.
This is an enhancement to add the capability in
Serviceguard to configure an additional IP address for
communication between quorum server and cluster nodes.
Resolution:
This feature will enable the Serviceguard cluster to be
configured in such a way that the nodes of the cluster
can communicate with the quorum server through multiple
subnets. This will also need an enhancement to the
quorum server to accept connections through multiple IP
addresses which is supported in Quorum Server version
A.03.00.00.
Supported platforms and configurations
======================================
When this feature is used, it is recommended that the
nodes of the Serviceguard cluster and the Quorum Server
be physically connected on two different subnets in
order to realize the redundancy that this feature
offers.
The Quorum Server multiple IP address feature is
supported only when configured with the new Quorum
server version A.03.00.00 that supports this feature.
This feature is supported by this patch only on
Serviceguard version A.11.17.00 on HP-UX 11.23. When
this patch is used with versions of quorum server
earlier than A.03.00.00, only one quorum server IP
address is supported.
Up to one additional quorum server IP address is
currently supported, which means a total of two
addresses can be configured.
For more information on how to install and configure
quorum server version A.03.00.00, please refer to the
release notes for Quorum Server A.03.00.00. Note that
this release document is expected to be released April
or May 2007.
How to configure multiple quorum server IP addresses
====================================================
cmquerycl, the command to generate a cluster
configuration file, has been modified to accept an
additional quorum server IP address as shown below.
Please note that for all further steps to configure a
second quorum server IP address to succeed, Quorum
Server version A.03.00.00 is needed and it must be
configured as described in its release notes.
To generate the cluster configuration with the
additional quorum server IP address on a second subnet,
qsip2, apart from the first one, qsip1, for a cluster
consisting of nodes node1 and node2, run the following
command.
# cmquerycl -n node1 -n node2 -q qsip1 qsip2 -C cluster.conf
This generates a cluster configuration file that has a
second quorum server IP address, qsip2, specified in it.
This alternate IP address is specified by the new
keyword "QS_ADDR" in the cluster ASCII configuration
file.
Alternatively, an existing cluster configuration file
can be edited to add the QS_ADDR keyword with the
alternate quorum server IP address. This can then be
used with the cmcheckconf and cmapplyconf commands. The
QS_ADDR keyword must be specified after the QS_HOST
keyword.
Use the cluster configuration file, cluster.conf,
generated above to configure the Serviceguard cluster.
Please note that a cluster configured with one quorum
server IP address must be halted before configuring it
with two quorum server IP addresses. Online addition of
a second quorum server IP address is not supported.
Run the following command to check the cluster
configuration file.
# cmcheckconf -C cluster.conf
Run the following command to verify and apply the
cluster configuration file.
# cmapplyconf -C cluster.conf
Bring the cluster up by running the following command.
# cmruncl
For cmquerycl, cmapplyconf and cmcheckconf to succeed
with a multiple quorum server IP address configuration,
the following conditions must be met:
1. All the nodes must be able to communicate with all
the quorum server subnets.
2. Both of the quorum server IP addresses specified
must belong to the same quorum server.
3. The quorum server must be Quorum Server A.03.00.00
or later.
4. The authorization file of the quorum server must
specify all the IP addresses from which each of the
Serviceguard nodes will communicate with it (for
more details, refer to the Quorum Server A.03.00.00
release notes mentioned above).
After configuring a cluster with multiple quorum server
IP addresses, to verify that the cluster is configured
correctly, run the cmviewcl command as shown below and
verify that it reports two IP addresses.
# cmviewcl -v -f line | grep quorum_server
quorum_server:qsip1|name=qsip1
quorum_server:qsip1|ip_address=15.70.191.21
|name=15.70.191.21
quorum_server:qsip1|ip_address=15.70.191.46
|name=15.70.191.46
quorum_server:qsip1|polling_interval=300000000
quorum_server:qsip1|node:node1|status=up
quorum_server:qsip1|node:node2|state=running
The second quorum server IP address is reported in the
"quorum_server" section of the cmviewcl output, as
shown above, by the "ip_address" field.
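The verification step above (confirming that two IP
addresses are reported) can be automated with a short
script. This is a sketch only: it works on text already
captured from "cmviewcl -v -f line | grep quorum_server",
uses the sample lines shown above as input, and the
function name is hypothetical.

```python
import re

# Sample text copied from the cmviewcl output shown in this document.
SAMPLE_OUTPUT = """\
quorum_server:qsip1|name=qsip1
quorum_server:qsip1|ip_address=15.70.191.21
|name=15.70.191.21
quorum_server:qsip1|ip_address=15.70.191.46
|name=15.70.191.46
quorum_server:qsip1|polling_interval=300000000
"""


def quorum_server_addresses(line_output):
    """Collect every ip_address value found in the given text."""
    # A value ends at the next '|' field separator or at whitespace.
    return re.findall(r"ip_address=([^|\s]+)", line_output)


addresses = quorum_server_addresses(SAMPLE_OUTPUT)
```

A result with two entries corresponds to a cluster that
reports both quorum server IP addresses.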
cmviewcl has been modified to report the status of
quorum server. It will report a status of "up", if
quorum server is reachable via any of the quorum server
IP addresses.
cmviewconf and cmgetconf commands have also been
modified to report alternate quorum server IP addresses,
if configured. These commands also use the keyword
QS_ADDR to report the additional quorum server IP
address.
How does this feature work
==========================
If a Serviceguard node is unable to communicate with
the quorum server because of a network failure, the
node connects to the quorum server via the alternate
quorum server IP address. At any given time only one of
the quorum server subnets is in use, and only this
subnet is monitored periodically. This also means
that different nodes in the Serviceguard cluster may be
communicating with the quorum server via different
quorum server IP addresses.
IMPORTANT NOTE: When SGeFF and quorum server multiple
IP addresses are both configured, it is very important
that the QS_POLLING_INTERVAL be tuned to your network
environment and reduced to as low a value as possible,
without going so low as to generate considerable load
on the Quorum server or the network. The default value
of QS_POLLING_INTERVAL is set to 30 seconds by cmquerycl
when SGeFF and quorum server multiple IP addresses
are both configured.
Please note that Serviceguard Manager cannot be used to
configure a cluster with the additional quorum server IP
address. Configuration operations on a cluster already
configured with multiple quorum server IP addresses will
fail when performed from Serviceguard Manager. A change
request (CR: JAGag37574 / SR: 8606484450) has been
filed against Serviceguard Manager for this issue.
PHSS_35371:
1. Defect: JAGag13927 SR: 8606457625
Disks that identify as DGC are probed twice when
configuration commands such as cmapplyconf are run.
During the second probe the hardware path of the device
is written to the wrong area of memory, which can cause
cmclconfd to fail and the command to fail as a result.
Resolution:
The code has been modified to copy the hardware path to
the correct location in memory.
2. Defect: JAGag08257 SR: 8606451287
If cmrunnode fails to start the cluster services on a
node with packages which have dependent EMS resources
configured, the EMS resources are not deregistered
before cmcld exits. These packages will then fail to
start on this node on subsequent attempts after the
cluster is started.
Resolution:
EMS resource is deregistered before cmcld exits.
3. Defect: JAGaf69163 SR: 8606409265
Data from the USER_NAME field is not
validated when cmapplyconf is run,
although an error is reported in syslog
by cmcld if the USER_NAME is invalid.
When the invalid username is corrected
and cmapplyconf is re-executed, cmcld
aborts due to the invalid data in the CDB.
Resolution:
Appropriate checks are now added to be
consistent with the checking for CLUSTER NAME,
PACKAGE NAME, etc.
4. Defect: JAGag09971 SR: 8606453198
cmgmsd required that group membership transactions be
committed to the /etc/cmcluster/cmclconfig cluster
binary file on all nodes before the transaction would
complete. If this filesystem is full the transaction
would fail resulting in Oracle errors. However, group
membership information is transient and does not have
to be written to disk.
Resolution:
cmgmsd transactions no longer fail if there is not
enough disk space to write them to the binary
configuration file. An error is written to syslog but
the transaction completes preventing Oracle errors. The
transaction will be written to disk on nodes which have
enough space.
5. Defect: JAGag13268 SR: 8606456893
The package control script was checking the exit status
of "cmviewcl | sed", which is the exit status of the
sed command, when the exit status of cmviewcl was
required. This meant that cmviewcl was never retried
when it failed, as it was supposed to be.
Resolution:
The code has been modified to check the exit status of
cmviewcl so that it can be retried if it fails.
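The difference between the exit status of a pipeline and
the exit status of its first command can be demonstrated
with a small sketch. It is an illustration only and does
not reproduce the package control script: the two child
processes are stand-ins for cmviewcl and sed, and the
function name is hypothetical.

```python
import subprocess
import sys


def run_pipeline(producer_cmd, filter_cmd):
    """Run producer | filter; return (producer_status, filter_status, output)."""
    producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
    filt = subprocess.Popen(filter_cmd, stdin=producer.stdout,
                            stdout=subprocess.PIPE)
    # The parent does not read from the pipe itself; closing its copy
    # lets the producer receive SIGPIPE if the filter exits early.
    producer.stdout.close()
    output, _ = filt.communicate()
    producer_status = producer.wait()
    return producer_status, filt.returncode, output


# Stand-ins: the producer fails with status 3, the filter succeeds.
failing_producer = [sys.executable, "-c",
                    "import sys; print('partial output'); sys.exit(3)"]
passthrough_filter = [sys.executable, "-c",
                      "import sys; sys.stdout.write(sys.stdin.read())"]

producer_status, filter_status, _ = run_pipeline(failing_producer,
                                                 passthrough_filter)
```

Here the filter reports success even though the producer
failed, so a retry decision has to be based on the
producer's own status.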
6. Defect: JAGag11741 SR: 8606455170
cmquerycl opens the block logical volume device instead
of the raw logical volume device while querying logical
volumes. Closing the block logical volume device adds
overhead because the filesystem alpha semaphore must be
held.
Resolution:
The code is modified to open the raw logical volume
device rather than the block logical volume device.
7. Defect: JAGag11992 SR: 8606455475
After the failover of a Metrocluster/SRDF package from
the R1 to the R2 side, the Serviceguard boot
script /sbin/init.d/cmcluster potentially tries to
run "vxdisk clearimport <disk>" on write disabled EMC
disk devices belonging to the package's SRDF device
group. After some timeout the vxdisk command fails
and VxVM marks the device "offline". This adds
considerable time to the overall boot process of the
node that owned the package on the R1 side before.
If the package is restarted on the R1 node later, the
import of the VxVM disk group that uses the disk
devices that were marked offline during system boot,
fails.
Resolution:
The Serviceguard RC script /sbin/init.d/cmcluster
does not try to run "vxdisk clearimport <disk>" on
write disabled EMC Symmetrix disks managed by VxVM.
For a complete fix the patch PHSS_35451 for
Metrocluster/SRDF A.05.00 or later is required.
PHSS_34337:
1. In a reforming cluster that has the
NETWORK_FAILURE_DETECTION parameter set to
INONLY_OR_INOUT, full network polling would not be
performed even if the primary LAN had missed the
maximum number of inbound polling packets, so a local
LAN failover to the standby LAN did not occur.
Resolution:
A full polling is performed even if the state of the
cluster is REFORMING.
2. The SIGPIPE signal was being set to the default
action by the psmmon resource monitor. As a result,
when a package with an EMS resource using psmmon was
configured in the cluster, psmmon died when it tried to
reconnect to cmcld after cmcld went down and came back
up. This prevented the package from coming up.
Resolution:
Fixed the libsgcl used by the ems framework to handle
SIGPIPE appropriately, when connecting to cmcld.
3. An incorrect check allowed unclustered volume groups.
Resolution:
Fixed the logic that checks for clustered and
unclustered volume groups.
4. Serviceguard automatically plumbs standby network
interfaces for IPv6 use, even when IPv6 is not being
used in the cluster configuration.
Resolution:
Serviceguard does not plumb any network interfaces for
inet6 (IPv6) when IPv6 is not being used within the
cluster.
5. If the revision field of a Serviceguard link level
message is corrupted, the message can pass through,
eventually causing cmcld to abort. If the revision is
corrupted to a value lower than 3, no checksum checking
is done on the message. This results in many cmcld
SIGSEGVs when the link level polling messages are
corrupted.
Resolution:
cmcld ignores corrupted link level messages.
6. A socket call failure due to insufficient available
memory causes cmcld to abort.
Resolution:
cmcld now retries the socket call if it fails due to
insufficient available memory.
7. A transient error was encountered during a connect()
call and it was recovered after a retry. The
diagnostic was unnecessarily logged.
Resolution:
syslog will no longer show that message.
8. The current control script does not take shared VG
activation into account, and the original comments for
the existing VG activation examples are inaccurate.
Resolution:
Corrected the inaccurate comment field for existing VG
activation examples; added two new examples for shared
VG activations; changed the control script log message
to reflect the current activation mode including shared
mode.
9. Service command output of a package goes to the default
log file even if SCRIPT_LOG_FILE parameter is set.
Resolution:
The log file is now set to SCRIPT_LOG_FILE if it is
defined, or to the default log file otherwise.
10. Some of the transient errors caused the failure of
cmhaltnode when it is run in parallel (within a short
window of time) on multiple nodes.
Resolution:
A retry mechanism has been added in the cmcluster
script for cmhaltnode to handle the situation where
shutdown(1m) commands are executed on multiple cluster
nodes. This ensures that cmhaltnode succeeds on a
subsequent retry if it failed because of a transient
error. To manually halt cluster services on
multiple nodes at the same time, the only supported
command is "cmhaltnode <node1> <node2> .." which
serializes the actions and therefore avoids the
problem.
11. Sometimes a package with dependency does not start
after the package that it depends on is running.
This occurs when there is an activity on the dependent
package when the package it depends on comes up and
we cannot start the package at that time.
Resolution:
Modified Serviceguard to remember the event
and later to start the dependent package.
12. cmcheckconf and cmapplyconf fail when -C
cmcluster.ascii is specified and the bridged net
assignment has changed. This could be due to a link
failure on one of the LAN cards in a bridged net,
since this LAN card cannot talk to any other LAN on
the local node. There were no internal/external
logging messages for this error.
Resolution:
cmapplyconf and cmcheckconf now log a specific error if
the standby LAN card is disconnected.
13. Under rare circumstances when the timer loop thread is
stuck (not holding cm_lock) or the system clock is not
advancing, cmcld threads will not be scheduled. This
prevents cmcld timeout and prevents the safety timer
being updated resulting in all nodes being TOC'ed.
Resolution:
Enhanced the code to check the time stamps of received
heartbeat messages to ensure clock is advancing, rather
than using the heartbeat sequence number. cmcld will
abort if it detects the system clock is not advancing
for a set period of time resulting in a failure of the
single errant node rather than the entire cluster.
14. UDP messages were not marked as invalid even if there
were invalid values for length and offset fields in the
message, causing cmclconfd to exit without receiving
the message and/or cmviewcl to spin indefinitely. In
the cmclconfd case the message hence remains in the
inetd socket buffer causing inetd to spawn another
cmclconfd server. This is repeated until it reaches 40
servers in 60 seconds when it terminates the service
and only reinstates the service again after 10 minutes.
Resolution:
Mark the message as invalid if the length and offset
fields in the message contained improper values.
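The kind of bounds check described in this resolution
can be sketched as follows. The actual message format is
not described in this document, so the header layout
below (a 16-bit offset followed by a 16-bit length, both
in network byte order) and the function name are
hypothetical.

```python
import struct

# Hypothetical header layout: payload offset, payload length.
HEADER = struct.Struct("!HH")


def is_valid_message(datagram: bytes) -> bool:
    """Reject datagrams whose offset/length fields point outside the buffer."""
    if len(datagram) < HEADER.size:
        return False  # too short to even hold the header
    offset, length = HEADER.unpack_from(datagram)
    if offset < HEADER.size:
        return False  # payload would overlap the header
    return offset + length <= len(datagram)


good = HEADER.pack(4, 5) + b"hello"
bad = HEADER.pack(4, 500) + b"hello"
```

Marking such a datagram invalid lets the receiver
consume and discard it rather than leaving it in the
socket buffer.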
15. cmviewconf dumps core when it cannot get a node handle
back from cmclconfd or it is not able to communicate
with cmclconfd for any reason.
Resolution:
Instead of dumping core, the command returns an error
statement and exits.
16. cmapplyconf message is unclear when run in a
VERITAS Cluster Volume Manager 4.1 environment.
Serviceguard does not provide correct information
about minimum LAN requirements when they are not met.
The message "Need not have to look for shared VGs" is
unclear.
Resolution:
cmapplyconf output has been changed to indicate the
correct information. Unclear messages related to shared
volume groups have been removed.
17. cmrunnode times out after approximately 35 seconds
rather than waiting for