Table of contents
Initial problem
One of our production database suffered from hangs and we were obliged to shut it down, alert log error messages were:
Fri Feb 15 22:28:39 2013 PMON failed to acquire latch, see PMON dump Fri Feb 15 22:29:41 2013 PMON failed to acquire latch, see PMON dump Fri Feb 15 22:30:43 2013 PMON failed to acquire latch, see PMON dump Fri Feb 15 22:31:46 2013 PMON failed to acquire latch, see PMON dump .... Fri Feb 15 22:49:18 2013 WARNING: inbound connection timed out (ORA-3136) Fri Feb 15 22:49:18 2013 WARNING: inbound connection timed out (ORA-3136) |
We initially thought it was a bug as the Oracle release is 10.2.0.4 and same application, with more or less same behavior, was running well on at least two other sites. And so tried to push Oracle support in this direction.
For such a severity 1 service request opened has been opened and we provided Hanganalyze and Systemstate dump files. First bad surprised was to discover that even if sustaining support was still there, reading dump file and help identifying a bug is included only in extended support contract, which we have not purchased…
Then after a tense discussion with Oracle support they directed us to work on SQL with high number of child and no execution. As a postmortem analysis the flow to reach this conclusion is not that obvious.
The top 5 wait event is “classical”, even if having CPU time in top 1 wait event is something to investigate:
Top 5 Timed Events Avg %Total ~~~~~~~~~~~~~~~~~~ wait Call Event Waits Time (s) (ms) Time Wait Class ------------------------------ ------------ ----------- ------ ------ ---------- CPU time 2,099 84.2 db file scattered read 70,291 230 3 9.2 User I/O log file sync 50,130 114 2 4.6 Commit log file parallel write 52,139 96 2 3.8 System I/O db file sequential read 415,411 85 0 3.4 User I/O
Unfortunately I do not have the CPU usage breakdown (parse, recursive, other) but server was clearly not CPU overloaded from what I observed at OS level (top and glance as HPUX machine). When looking at time model statistics I can’t see much time on parse:
Time Model Statistics -> Total time in database user-calls (DB Time): 2492.7s -> Statistics including the word "background" measure background process time, and so do not contribute to the DB time statistic -> Ordered by % or DB time desc, Statistic name Statistic Name Time (s) % of DB Time ------------------------------------------ ------------------ ------------ sql execute elapsed time 2,179.08 87.42 DB CPU 2,098.53 84.19 parse time elapsed 129.35 5.19 hard parse elapsed time 59.61 2.39 . . .
Then my personal next action would have been to look in SQL Statistics section and focus on SQL ordered by Gets and SQL ordered by Reads but Oracle had a different good idea and noticed that we had multiple sql_id with high number of child (some AWR reports we even showing number above 30,000) and no execution:
SQL ordered by Version Count
- Only Statements with Version Count greater than 20 are displayed
Version Count | Executions | SQL Id | SQL Module | SQL Text |
---|---|---|---|---|
26,834 | ag3x03cnr96vz | Select DMSKey From (Select DMS… | ||
6,197 | gkqfhakg8ggh3 | ** SQL Text Not Available ** | ||
5,263 | d7spvxbdzqgvd | Select DMSKey From (Select DMS… | ||
5,172 | d11cy1pb6kzdz | Select DMSKey From (Select DMS… | ||
4,571 | 64sbr6m745pwm | Select DMSKey From (Select DMS… | ||
4,486 | danhtqgtdc38r | Select DMSKey From (Select DMS… | ||
4,154 | cpka41z0xskd1 | Select DMSKey From (Select DMS… | ||
2,998 | caq9f0czwjq1y | Select DMSKey From (Select DMS… | ||
2,744 | 3q72d5fdqtn14 | Select DMSKey From (Select DMS… | ||
2,519 | f9d038dpjvv3j | Select DMSKey From (Select DMS… | ||
2,498 | 96cqkgvyd1mz2 | Select DMSKey From (Select DMS… | ||
2,326 | 5z820x1dv4grm | Select DMSKey From (Select DMS… | ||
1,568 | cttujhvkbux9s | Select DMSKey From (Select DMS… | ||
1,482 | 5ff132jcqfkfw | select sysid from fweqpequipme… | ||
1,418 | cka4wk4uak6u7 | SELECT pkg_brutil.Join(cursor(… | ||
1,095 | 39 | 3ywn07p6taaxs | DTPFsrv@mux018 (TNS V1-V3) | delete from t_usedlotid where… |
1,047 | dfxzwbdzczn7x | Select DMSKey From (Select DMS… | ||
941 | cdx3rfcqcg96s | SELECT pkg_brutil.Join(cursor(… | ||
811 | 25 | 2xp8q9wnfmngp | PrintingService.exe | SELECT “T_TICKETSTAGING”.”LOT… |
First solution and second problem
To understand why so many child for one sql_id you can use v$SQL_SHARED_CURSOR view but unfortunately all columns where equal to N and 10gR2 release is lacking the HASH_MATCH_FAILED new in 11gR2. Another convenient way is to use version_rpt script from MOS note 438755.1:
SELECT * FROM TABLE(version_rpt('ag3x03cnr96vz')); |
But the script was simply hanging the database and generating same error we initially had. Finally launching it at right time we have finally been able to get the result but not the one we expected as the result was empty and simply giving advice on how to go further:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Details for HASH_MATCH_FAILED : No details available #### To further debug Ask Oracle Support for the appropiate level LLL. alter session set events 'immediate trace name cursortrace address 3370658479, level LLL'; To turn it off do use address 1, level 2147483648 ================================================================ |
To activate trace use:
SELECT hash_value FROM v$sqlarea WHERE sql_id='ag3x03cnr96vz'; ALTER SESSION SET events 'immediate trace name cursortrace address <hash_value>, level 512'; |
Remark:
512 is a first good start, if not enough information can be even increased…
To deactivate trace use:
ALTER SESSION SET events 'immediate trace name cursortrace address 1, level 2147483648'; |
But again the trace was not really useful and Oracle reached to a conclusion that CURSOR_SHARING=similar (our setting) was creating issue and requested us to set it to force (by the way similar is a setting that could be deprecated in further release). After application supplier approval we finally did it, number of child decreased for almost all sql_id and we did not encounter problem again…
Second solution
But still we had sql_id with high number of child even with CURSOR_SHARING=force (CPU time was still first Top 5 Timed Events at more than 60%), all not-shared statements where of the form:
SELECT package1.function1(CURSOR(SELECT ... FROM ... WHERE id=<hard value>)) FROM dual; |
To be honest using SYS_REFCURSOR was new to me and it would have never come to my mind naturally to create a procedure accepting a cursor in its argument. While I was asking myself the added value of doing this, I found in Oracle magazine March/April 2013 an article from Steven Feuerstein giving few reasons:
Pass a cursor variable back to the host environment that called the program unit—the result set can be “consumed” for display or other processing.
Construct a result set inside a function, and return a cursor variable to that set. This is especially handy when you need to use PL/SQL, in addition to SQL, to build the result set.
Pass a cursor variable to a pipelined table function—a powerful but quite advanced optimization technique. A full explanation of cursor variables, including the differences between strong and weak REF CURSOR types, is beyond the scope of this article.
This immediately remind me the post I have written on pipelined function and I share that this would be a nice example of using SYS_REFCURSOR but that’s the only one I see…
I expected those statements to also have a shared cursor but they had not. I initially thought of a 10gR2 bug but I have been able to reproduce in 11.2.0.3 using the below test case:
DROP TABLE test1; CREATE TABLE test1(id NUMBER, descr VARCHAR(50)) TABLESPACE users; DECLARE i NUMBER; nbrows NUMBER; BEGIN i:=1; nbrows:=50000; LOOP EXIT WHEN i>nbrows; INSERT INTO test1 VALUES(i,TO_CHAR(TO_DATE(i,'j'), 'jsp')); i:=i+1; END LOOP; COMMIT; END; / CREATE INDEX test1_idx_id ON test1(id) TABLESPACE users; EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname=>'yjaquier', tabname=>'test1', estimate_percent=>NULL, method_opt=>'FOR ALL INDEXED COLUMNS SIZE AUTO', degree=>DBMS_STATS.DEFAULT_DEGREE, CASCADE=>TRUE); --No histogram CREATE OR REPLACE PACKAGE package1 AS FUNCTION function1(cursor1 SYS_REFCURSOR) RETURN VARCHAR2; END; / show error CREATE OR REPLACE PACKAGE BODY package1 AS FUNCTION function1(cursor1 SYS_REFCURSOR) RETURN VARCHAR2 IS descr1 VARCHAR2(50); BEGIN LOOP FETCH cursor1 INTO descr1; EXIT WHEN cursor1%notfound; END LOOP; RETURN descr1; END; END package1; / show err |
Then you will see that each call to something like below is creating a child in V$SQL:
SELECT package1.function1(CURSOR(SELECT descr FROM test1 WHERE id=<hard value>)) FROM dual; |
To automatically generate multiple calls you can use below PL/SQL:
DECLARE i NUMBER; sql_stmt VARCHAR2(200); BEGIN i:=1; DBMS_RANDOM.SEED(SYSDATE); FOR i IN 1..50 LOOP sql_stmt:='SELECT package1.function1(CURSOR(SELECT descr FROM test1 WHERE id=' || ROUND(DBMS_RANDOM.VALUE(1,50000)) || ')) FROM dual'; EXECUTE IMMEDIATE sql_stmt; END LOOP; END; / |
I submitted it to Oracle support, they opened a bug and Oracle development team finally closed it as expected behavior:
Because that is a completely different type of call from an Oracle internal point of view. That is simple SQL statement where the sharing of the cursor happens only from a SQL view – The binds *are* data literals (as they are predicated against a column) When you have SELECTpackage1.function1(CURSOR(SELECT descr FROM test1 WHERE id=:A)) FROM dual; it is immediately obvious there is a degree more complexity (The SQL is within a function (CURSOR) within a function (function1) )
So, with my words, Oracle rate this statement too complex and prefer not to share cursor between multiple execution to avoid wrong/different results. While we know that there will not be any issue. So clearly doing massively such things is not advisable. As a very final test I have tried directly with a bind variable:
SQL> var id1 NUMBER; SQL> EXEC :id1:=1002; PL/SQL PROCEDURE successfully completed. SQL> SELECT package1.function1(CURSOR(SELECT descr FROM test1 WHERE id=:id1)) FROM dual; PACKAGE1.FUNCTION1(CURSOR(SELECTDESCRFROMTEST1WHEREID=:ID1)) -------------------------------------------------------------------------------- one thousand two |
And changed multiple time the value of id1 variable and, as expected, cursor is shared for multiple id1 value. This is another good example (was it needed ?) that trying to correct application bad design at Oracle database level is always a very bad idea…
References
- How to Collect Diagnostics for Database Hanging Issues [ID 452358.1]
- Troubleshooting: High Version Count Issues [ID 296377.1]
- High SQL Version Counts – Script to determine reason(s) [ID 438755.1]
- Unsafe Literals or Peeked Bind Variables [ID 377847.1]
- High Version Count with CURSOR_SHARING = SIMILAR or FORCE [ID 261020.1]
- Resolving Issues Where There Are Too Many Child Cursors Due to CURSOR_SHARING = SIMILAR [ID 1476096.1]