Interpreting hanganalyze and systemstate

Preamble

The story is quite common for anyone acting as DBA: a production database is hanging and people start to complain that they get no results or are not even able to connect. We all know that before rushing to a shutdown immediate/abort and startup, it is worth taking hanganalyze and systemstate dumps for post-mortem analysis. In any case, if you open an SR, Oracle support will request those files…

To help you do this, the note to refer to is (I personally have it in my MOS favorites):

  • How to Collect Diagnostics for Database Hanging Issues (Doc ID 452358.1)
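For reference, the collection itself (covered in detail by the note above) boils down to a few oradebug calls. The sketch below is from memory, so double-check the levels and syntax against Doc ID 452358.1 for your version before relying on it:

```
$ sqlplus -prelim / as sysdba
SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug hanganalyze 3
SQL> oradebug dump systemstate 266
SQL> oradebug tracefile_name
```

The -prelim flag attaches to the SGA without creating a full session, which is often the only way in when the instance no longer accepts connections.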

Then you can either blindly upload them to Oracle support or try to understand them and see if you reach the same conclusion. To help you interpret those dumps, first refer to:

  • Interpreting HANGANALYZE trace files to diagnose hanging and performance problems for 9i and 10g. (Doc ID 215858.1)
  • Reading and Understanding Systemstate Dumps (Doc ID 423153.1)

The first interesting thing I noticed in the hanganalyze document is that Oracle presents it up front as containing much more accurate information than the systemstate, even though Oracle support continues to focus on the systemstate file…

The production database that had the issue is 10.2.0.4 running on HP-UX Itanium 11.31. I did not have direct access to the database server and only received the trace files…

I started with the alert.log:

.
.
Fri Apr 10 09:09:45 2015
System State dumped to trace file /ora_mesprd/dump/MESPRD/udump/mesprd_ora_24442.trc
Fri Apr 10 09:32:29 2015
>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=156
System State dumped to trace file /ora_mesprd/dump/MESPRD/udump/mesprd_ora_21462.trc
Fri Apr 10 09:33:15 2015
>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=210
Fri Apr 10 09:33:29 2015
>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=212
.
.

The local DBA had the very good habit of taking a systemstate dump, and he also provided me with the hanganalyze file.

Hanganalyze and systemstate

Hanganalyze

From the hanganalyze dump I was able to see the following (the hanganalyze file is easier to read and understand than the systemstate one):

.
.
==============
HANG ANALYSIS:
==============
Found 28 objects waiting for <cnode/sid/sess_srno/proc_ptr/ospid/wait_event>
    <0/444/44899/0x1be32d0/12891/cursor: pin S>
.
.
==============
HANG ANALYSIS:
==============
Found 58 objects waiting for <cnode/sid/sess_srno/proc_ptr/ospid/wait_event>
    <0/444/44899/0x1be32d0/12891/cursor: pin S>
.
.

So the wait event cursor: pin S looks like the one that made the database hang. Starting from this we might say that ospid 12891 is the blocker of the whole database, but the systemstate dump file will say something different… Searching the systemstate for “ospid: 12891” gives you all the needed information on this process: OS user, executed SQL statements and so on.

First I noticed that this is a bit far from what we see in the AWR report covering the problem period:

Top 5 Timed Events

Event                   Waits        Time(s)  Avg Wait(ms)  % Total Call Time  Wait Class
row cache lock          131,826      388,129  2,944         79.2               Concurrency
cursor: pin S wait on X 8,165,729    188,876  23            38.5               Concurrency
cursor: pin S           156,329,083  127,834  1             26.1               Other
enq: US - contention    19,779       58,184   2,942         11.9               Other
latch: shared pool      1,834        12,292   6,702         2.5                Concurrency

Systemstate

Now let’s focus on the systemstate. I initially opened the file with Notepad++, but as the file is around 130 MB it was quite slow, and in any case I was missing the powerful Unix commands grep and awk. Just before transferring the file to one of my Linux servers I remembered that MobaXterm ships a bash emulator under Windows.

Under the MobaXterm emulated bash, go to /drives, then c (for the C drive in my case), and navigate to the Windows directory where your dump files are located. The magic is that you can then run Unix commands on files sitting on your Windows desktop!

Using a slightly tricky awk command I was able to count the number of sessions per wait event (\x27 is the hexadecimal code for a single quote, which avoids quoting issues inside the shell):

[yjaquier.DESKTOP1]awk 'index($0,"waiting for \x27")!=0 {print substr($0,0,index($0,"\x27 "))}' mesprd_ora_24442.trc | sort | uniq -c
      2     waiting for 'LNS ASYNC end of log'
     43     waiting for 'SQL*Net message from client'
      1     waiting for 'Streams AQ: qmn coordinator idle wait'
      1     waiting for 'Streams AQ: qmn slave idle wait'
      2     waiting for 'Streams AQ: waiting for time management or cleanup tasks'
     61     waiting for 'cursor: pin S wait on X'
     39     waiting for 'cursor: pin S'
      1     waiting for 'enq: TX - row lock contention'
      1     waiting for 'pmon timer'
     11     waiting for 'rdbms ipc message'
    161     waiting for 'row cache lock'

So based on the systemstate I would focus on row cache lock, as this is the event on which most sessions are waiting. Clearly not what hanganalyze said… cursor: pin S is only in third position…

I then generated an intermediate file with:

awk '/row cache enqueue/{print "----------------------------------------"; for (i=1;i<=5;i++) {print; getline}}' mesprd_ora_24442.trc > /drives/c/temp/yan.txt

Studying this newly generated file, the object that came out on top was object=c0000004005eaf28. But all sessions were requesting it in shared (S) mode and, unlike in the MOS note, I was not able to identify a session holding an exclusive (X) lock on this object…
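Ranking the object addresses in the extract is a one-liner. A sketch on a tiny synthetic sample (the 'object=<addr>, request=S' layout is assumed from typical 10.2 row cache enqueue output; on the real extract, point the grep at the yan.txt produced above):

```shell
# Synthetic stand-in for the row cache enqueue extract (yan.txt):
cat > /tmp/yan.txt <<'EOF'
row cache enqueue: session: c000...  object=c0000004005eaf28, request=S
row cache enqueue: session: c000...  object=c0000004005eaf28, request=S
row cache enqueue: session: c000...  object=c00000040012aaaa, request=S
EOF
# Count occurrences per object address, most requested first:
grep -o 'object=[0-9a-f]*' /tmp/yan.txt | sort | uniq -c | sort -rn
```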

I then moved to the second most waited-for event, cursor: pin S wait on X, using the approach described in the MOS note:

awk '/waiting for \x27cursor: pin S wait on X\x27/{print; getline; print}' mesprd_ora_24442.trc

But here again there were so many distinct values for idn= that I was not able to conclude anything…
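Counting the distinct idn values makes that call quickly. A sketch on a synthetic two-waiter sample (it assumes the idn= wait parameter sits on the line after the wait line, as in the command above; run against mesprd_ora_24442.trc for the real count):

```shell
# Synthetic stand-in: two waiters on two different cursors:
cat > /tmp/pins.trc <<'EOF'
    waiting for 'cursor: pin S wait on X' blocking sess=0x0
      idn=11111111, value=1, where|sleeps=...
    waiting for 'cursor: pin S wait on X' blocking sess=0x0
      idn=22222222, value=1, where|sleeps=...
EOF
# Pull the idn= token from the line following each wait line and count
# distinct values ('.' matches the quote, avoiding shell quoting tricks):
awk '/waiting for .cursor: pin S wait on X./{getline; if (match($0,/idn=[0-9a-f]+/)) print substr($0,RSTART,RLENGTH)}' /tmp/pins.trc | sort -u | wc -l
```

Many distinct values, as here, means contention is spread over many cursors and no single statement can be blamed.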

So I then moved to the third wait event, cursor: pin S, with the same awk command:

awk '/waiting for \x27cursor: pin S\x27/{print; getline; print}' mesprd_ora_24442.trc

But this time all idn= values were exactly the same (bbcee4f7), so I continued the investigation with something like the following, listing every mutex operation on idn bbcee4f7:

[yjaquier.DESKTOP1]awk '/idn bbcee4f7 oper/{print}' mesprd_ora_24442.trc
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper DECR_EXAM
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
       Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD

Finally, searching for the process related to idn bbcee4f7 oper DECR_EXAM, I found process 161. Again, this has more or less nothing to do with what we found in the hanganalyze file, but it is what Oracle support asked us to look for. It turned out that process 161 was a Zabbix process, so more or less a bad monitoring implementation…
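The mapping from the lone DECR_EXAM entry back to a process number can be sketched by having awk remember the last 'PROCESS n:' header it passed. A minimal sample (the process numbers below are synthetic; the same one-liner against mesprd_ora_24442.trc yields the real answer):

```shell
# Synthetic stand-in: two process sections, only one doing DECR_EXAM:
cat > /tmp/mutex.trc <<'EOF'
PROCESS 160:
      Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
PROCESS 161:
      Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper DECR_EXAM
EOF
# Remember the last PROCESS header, report it when DECR_EXAM is hit:
awk '/^PROCESS [0-9]+:/{proc=$2; sub(":","",proc)} /idn bbcee4f7 oper DECR_EXAM/{print "found in PROCESS " proc}' /tmp/mutex.trc
```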
