Preamble
The story is quite common for anyone acting as a DBA: a production database is hanging and people start to complain that they get no results or are not even able to connect. We all know that before rushing to a shutdown immediate/abort and startup, it is worth taking hanganalyze and systemstate dumps for post-mortem analysis. In any case, if you open an SR with Oracle Support they will request those files…
To help you do this, the note to refer to is (I personally have it in my MOS favorites):
- How to Collect Diagnostics for Database Hanging Issues (Doc ID 452358.1)
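For the record, the dumps themselves are usually taken with oradebug from a SYSDBA connection on the database server. The snippet below is only a minimal sketch of that approach (dump levels 3 and 266 are common choices and the preliminary connection helps when normal logins hang); refer to the note above for the authoritative steps, and ideally repeat the dumps a minute or so apart.

# Sketch only: connect with a preliminary SYSDBA connection, dump a hanganalyze
# plus a systemstate, then display the generated trace file name.
sqlplus -prelim "/ as sysdba" <<'EOF'
oradebug setmypid
oradebug unlimit
oradebug hanganalyze 3
oradebug dump systemstate 266
oradebug tracefile_name
exit
EOF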
Then you can either blindly upload them to Oracle Support or try to interpret them yourself and see whether you reach the same conclusion. To help you understand those dumps, first refer to:
- Interpreting HANGANALYZE trace files to diagnose hanging and performance problems for 9i and 10g. (Doc ID 215858.1)
- Reading and Understanding Systemstate Dumps (Doc ID 423153.1)
The first interesting thing I noticed in the hanganalyze note is that Oracle presents it as containing much more accurate information than a systemstate dump, even though Oracle Support keeps focusing on the systemstate file…
The production database that had the issue is 10.2.0.4 running on HP-UX Itanium 11.31. I did not have direct access to the database server and only received the trace files…
I started with the alert.log:
.
.
Fri Apr 10 09:09:45 2015
System State dumped to trace file /ora_mesprd/dump/MESPRD/udump/mesprd_ora_24442.trc
Fri Apr 10 09:32:29 2015
>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=156
System State dumped to trace file /ora_mesprd/dump/MESPRD/udump/mesprd_ora_21462.trc
Fri Apr 10 09:33:15 2015
>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=210
Fri Apr 10 09:33:29 2015
>>> WAITED TOO LONG FOR A ROW CACHE ENQUEUE LOCK! pid=212
.
.
Fortunately the local DBA had the very good habit of taking a systemstate dump, and he also provided me with the hanganalyze file.
Hanganalyze and systemstate
Hanganalyze
From the hanganalyze dump I was able to see the following (the hanganalyze file is easier to read and understand than the systemstate one):
.
.
==============
HANG ANALYSIS:
==============
Found 28 objects waiting for <cnode/sid/sess_srno/proc_ptr/ospid/wait_event>
 <0/444/44899/0x1be32d0/12891/cursor: pin S>
.
.
==============
HANG ANALYSIS:
==============
Found 58 objects waiting for <cnode/sid/sess_srno/proc_ptr/ospid/wait_event>
 <0/444/44899/0x1be32d0/12891/cursor: pin S>
.
.
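As a side note, if you need to extract just that section from a large trace file, a small awk does the job. The hanganalyze trace name below is hypothetical (only the systemstate trace names appear in my alert.log), and the end marker is the usual END OF HANG ANALYSIS footer:

# Print everything between the 'HANG ANALYSIS:' banner and the section footer;
# replace the file name with your actual hanganalyze trace.
awk '/^HANG ANALYSIS:/ { show=1 }
     show { print }
     /END OF HANG ANALYSIS/ { show=0 }' mesprd_ora_hanganalyze.trc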
So the cursor: pin S wait event looks like the one that made the database hang. Starting from this we might say that ospid 12891 is the blocker of the whole database, but the systemstate dump file will tell a different story… Searching the systemstate for “ospid: 12891” gives you all the needed information on this process: OS user, executed SQL statements and so on.
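As a quick illustration on the systemstate trace referenced in the alert.log above, a simple grep with some context is enough to pull out that process block (the context size is arbitrary):

# Show the lines around the process whose operating system pid is 12891.
grep -n -B 2 -A 20 "ospid: 12891" mesprd_ora_24442.trc | less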
First, I noticed that this is quite different from what the AWR report taken around the time of the problem shows:
Top 5 Timed Events

Event                      Waits        Time(s)  Avg Wait(ms)  % Total Call Time  Wait Class
row cache lock             131,826      388,129         2,944               79.2  Concurrency
cursor: pin S wait on X    8,165,729    188,876            23               38.5  Concurrency
cursor: pin S              156,329,083  127,834             1               26.1  Other
enq: US - contention       19,779        58,184         2,942               11.9  Other
latch: shared pool         1,834         12,292         6,702                2.5  Concurrency
Systemstate
Now let’s focus on the systemstate. I initially opened the file with Notepad++, but as the file is around 130 MB it was quite slow, and in any case I was missing the powerful Unix grep and awk commands. Just before transferring the file to one of my local Linux servers I remembered that MobaXterm provides a bash emulator under Windows.
In the MobaXterm emulated bash, go to /drives and then c (the C: drive in my case), then navigate to the Windows directory where your dump files are located. The magic is that you can start using Unix commands on files sitting on your Windows desktop, directly from MobaXterm!
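For example (the Windows path below is just an illustration of where your trace files might sit):

# Windows drives are mounted under /drives in the MobaXterm bash emulator,
# so C:\Users\yjaquier\Desktop becomes /drives/c/Users/yjaquier/Desktop.
cd "/drives/c/Users/yjaquier/Desktop"
ls -lh mesprd_ora_*.trc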
Using a tricky awk command I was able to count the number of sessions per wait event (\x27 is the hexadecimal code for a single quote):
[yjaquier.DESKTOP1] ➤ awk 'index($0,"waiting for \x27")!=0 {print substr($0,0,index($0,"\x27 "))}' mesprd_ora_24442.trc | sort | uniq -c
      2 waiting for 'LNS ASYNC end of log'
     43 waiting for 'SQL*Net message from client'
      1 waiting for 'Streams AQ: qmn coordinator idle wait'
      1 waiting for 'Streams AQ: qmn slave idle wait'
      2 waiting for 'Streams AQ: waiting for time management or cleanup tasks'
     61 waiting for 'cursor: pin S wait on X'
     39 waiting for 'cursor: pin S'
      1 waiting for 'enq: TX - row lock contention'
      1 waiting for 'pmon timer'
     11 waiting for 'rdbms ipc message'
    161 waiting for 'row cache lock'
So using the systemstate I would focus on row cache lock, as this is the event most sessions are waiting on. Clearly not what the hanganalyze points to… cursor: pin S is only in third position…
I started by generating an intermediate file with:
awk '/row cache enqueue/{print "----------------------------------------"; for (i=1;i<=5;i++) {print; getline}}' mesprd_ora_24442.trc > /drives/c/temp/yan.txt
Studying this newly generated file, the object that came out on top was object=c0000004005eaf28. But all sessions were requesting this row cache enqueue in shared (S) mode and, unlike in the MOS note, I was not able to identify a session holding it in exclusive (X) mode on this object…
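A quick, admittedly crude, way to double-check this was simply to count how many row cache enqueue entries reference that object address and eyeball a few of them (the exact field layout of those lines depends on the trace format, hence no clever parsing here):

# How many entries reference the hot dictionary object, plus a small sample.
grep -c "object=c0000004005eaf28" mesprd_ora_24442.trc
grep "object=c0000004005eaf28" mesprd_ora_24442.trc | head -5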
Then I moved to the second most waited event, cursor: pin S wait on X, using the principle described in the MOS note:
awk '/waiting for \x27cursor: pin S wait on X\x27/{print; getline; print}' mesprd_ora_24442.trc
But here again there were so many distinct values for idn= that I was not able to conclude anything…
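For the record, a small variation of the same awk can count the waiters per idn value, assuming (as the output of the previous command suggests) that the idn= token sits on the parameter line that follows the wait event line:

# Extract the idn=<hash> token following each 'cursor: pin S wait on X' wait
# and count the occurrences per distinct value.
awk '/waiting for \x27cursor: pin S wait on X\x27/ {
       getline
       if (match($0, /idn=[0-9a-fx]+/))
         print substr($0, RSTART+4, RLENGTH-4)
     }' mesprd_ora_24442.trc | sort | uniq -c | sort -rn | head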
So I then moved to the third wait event, cursor: pin S, and again used the same kind of awk command:
awk '/waiting for \x27cursor: pin S\x27/{print; getline; print}' mesprd_ora_24442.trc
But this time all the idn= values were exactly the same (bbcee4f7), so I continued the investigation with something like the following, listing all sessions referencing idn bbcee4f7 together with their mutex oper (GET_XXX and friends):
[yjaquier.DESKTOP1] ➤ awk '/idn bbcee4f7 oper/{print}' mesprd_ora_24442.trc
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper DECR_EXAM
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
Mutex c0000003f6e6fe78(798, 0) idn bbcee4f7 oper GET_SHRD
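The single DECR_EXAM line is the interesting one. Below is a hedged sketch of how to map it back to its owning process block, assuming the usual systemstate convention that each block starts with a "PROCESS <n>:" header:

# Track the current process header while scanning the trace and report the
# process owning the DECR_EXAM operation on mutex idn bbcee4f7.
awk '/^PROCESS [0-9]+:/ { proc=$2; sub(/:$/, "", proc) }
     /idn bbcee4f7 oper DECR_EXAM/ { print "DECR_EXAM found in PROCESS " proc }' mesprd_ora_24442.trc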
Finally I searched for the process related to idn bbcee4f7 oper DECR_EXAM and found process 161. Again, this has more or less nothing to do with what we found in the hanganalyze file, but it is what Oracle Support asked us to look for. Fact is that process 161 was a Zabbix process, so more or less a bad monitoring implementation…