intra-mart Advent Calendar 2013, Day 11: An Introduction to the Resin HealthSystem

This article is the Day 11 entry in the intra-mart Advent Calendar 2013.

In Resin, a separate monitoring server called the watchdog runs alongside the Resin server that hosts your applications, and it monitors that Resin server.

Incidentally, Resin Admin, which was introduced in an earlier post, is the screen for viewing the information collected by this watchdog monitoring.

The file that configures this behavior is conf/health.xml.

health.xml itself is a fairly long XML file, but the important parts are the following two:

  <health:HealthSystem>
    <enabled>true</enabled>
    <startup-delay>15m</startup-delay>
    <period>5m</period>
    <recheck-period>30s</recheck-period>
    <recheck-max>5</recheck-max>
  </health:HealthSystem>

and

  <health:ActionSequence>
    <health:IfHealthCritical time="2m"/>
    <health:FailSafeRestart timeout="10m"/>
    <health:DumpJmx/>
    <health:DumpThreads/>
    <health:DumpHeap/>
    <health:DumpHeap hprof="true"
                     hprof-path="${resin.logDirectory}/heap.hprof"/>
    <health:StartProfiler active-time="2m" wait="true"/>
    <health:Restart/>
  </health:ActionSequence>

These are the two parts that matter.

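<health:HealthSystem> specifies the overall behavior of the HealthSystem:

<enabled>: enables or disables the HealthSystem itself (default: true)
<startup-delay>: how long after startup the HealthSystem waits before it begins evaluating checks; errors that occur during this window trigger no action (default: 15 minutes)
<period>: the interval between health checks (default: 5 minutes)
<recheck-period>: the interval between rechecks once a problem has been detected (default: 30 seconds)
<recheck-max>: the maximum number of rechecks once a problem has been detected (default: 5)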

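<health:ActionSequence> specifies the following flow:

<health:IfHealthCritical time="2m"/>: if the status stays Critical for 2 minutes, then
<health:FailSafeRestart timeout="10m"/>: allow up to the timeout (10 minutes) for collecting shutdown information; if the sequence has not finished by then, the restart is forced
<health:DumpJmx/>: output a JMX dump
<health:DumpThreads/>: output a thread dump
<health:DumpHeap/>: output a heap dump
<health:DumpHeap hprof="true" hprof-path="${resin.logDirectory}/heap.hprof"/>: write a heap dump that can be analyzed with hprof tools to the specified directory (the log directory)
<health:StartProfiler active-time="2m" wait="true"/>: collect 2 minutes of profiling data
<health:Restart/>: restart Resin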

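In other words, with the default settings, once 15 minutes have passed since Resin started, the restart sequence runs when any of the 10 health checks defined in health.xml stays in Critical status for 2 minutes or longer.
In practice, however, collecting the shutdown information (largely the 2-minute profiler run) takes about another 2 minutes before the restart actually happens.

So Resin is not restarted until at least about 4 minutes after the failure occurs.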

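At first we were puzzled too: Resin was not restarting even when the status went Critical, and it was restarting at odd times. Only after going through the settings one by one did we finally understand why.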

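For reference, on our in-house system we run the checks listed below every 30 seconds, starting 5 minutes after startup. If the status stays Critical for 2 minutes, Resin collects a 30-second profile and then restarts.
In other words, we operate it so that the information is collected and Resin is restarted about 2 minutes 30 seconds after a failure occurs.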
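
As a rough illustration, the settings described above might look like the following in health.xml. This is only a sketch built from the values mentioned in the text. It assumes that recheck-period and recheck-max stay at their defaults and that the dump actions and FailSafeRestart are dropped from the sequence, neither of which the description confirms.

```xml
<!-- Sketch only: values come from the description above. -->
<health:HealthSystem>
  <enabled>true</enabled>
  <!-- start evaluating checks 5 minutes after startup -->
  <startup-delay>5m</startup-delay>
  <!-- run the health checks every 30 seconds -->
  <period>30s</period>
</health:HealthSystem>

<health:ActionSequence>
  <!-- act only after 2 minutes of continuous Critical status -->
  <health:IfHealthCritical time="2m"/>
  <!-- profile for 30 seconds and wait for it to finish -->
  <health:StartProfiler active-time="30s" wait="true"/>
  <health:Restart/>
</health:ActionSequence>
```

With these values the restart follows roughly 2 minutes of Critical status plus the 30-second profile, which matches the figure of about 2 minutes 30 seconds.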

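Health checks (default values except for HttpStatus)

ConnectionPool: whether the connection pool has exceeded its maximum
Cpu: whether CPU load is at or above 95%
HealthSystem: whether the HealthSystem itself is healthy
HttpStatus: whether the iAP login screen is displayed correctly
JvmDeadlock: whether the JVM is deadlocked
License: whether the Resin license is valid
MemoryPermGen: whether the Java PermGen space is healthy (at least 1M free)
MemoryTenured: whether the Java Tenured space is healthy (at least 1M free)
Transaction: whether any transactions have failed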
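
The checks listed above correspond to health check elements in health.xml. The following sketch shows roughly how they could be declared. The element and child names follow Resin 4's health check naming as I understand it, so verify them against the health.xml shipped with your Resin version. The URL in HttpStatusHealthCheck is a hypothetical path standing in for the iAP login screen.

```xml
<!-- Sketch only: check element names should be verified against your Resin version. -->
<health:ConnectionPoolHealthCheck/>

<health:CpuHealthCheck>
  <!-- flag CPU load of 95% or more -->
  <warning-threshold>95</warning-threshold>
</health:CpuHealthCheck>

<health:HealthSystemHealthCheck/>

<health:HttpStatusHealthCheck>
  <!-- hypothetical path; replace with your actual iAP login URL -->
  <url>/imart/login</url>
</health:HttpStatusHealthCheck>

<health:JvmDeadlockHealthCheck/>
<health:LicenseHealthCheck/>

<health:MemoryPermGenHealthCheck>
  <!-- require at least 1M free -->
  <memory-free-min>1m</memory-free-min>
</health:MemoryPermGenHealthCheck>

<health:MemoryTenuredHealthCheck>
  <memory-free-min>1m</memory-free-min>
</health:MemoryTenuredHealthCheck>

<health:TransactionHealthCheck/>
```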

Beyond this, you can also do things like send email or run arbitrary shell scripts.
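
As a minimal sketch of what such actions might look like, the following assumes Resin's SendMail and ExecCommand health actions. The mail addresses and the script path are hypothetical placeholders, and sending mail presumably also requires a mail server configured for Resin, which is not shown here.

```xml
<!-- Sketch only: addresses and script path are hypothetical. -->
<health:SendMail>
  <to>admin@example.com</to>
  <from>resin@example.com</from>
  <!-- send only when the status is Critical -->
  <health:IfHealthCritical/>
</health:SendMail>

<health:ExecCommand>
  <!-- hypothetical script to run when the status is Critical -->
  <command>/opt/scripts/notify.sh</command>
  <health:IfHealthCritical/>
</health:ExecCommand>
```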

For more details, see the Caucho site.