Database hanging for high version count and SYS_REFCURSOR usage

Initial problem

One of our production database suffered from hangs and we were obliged to shut it down, alert log error messages were:

Fri Feb 15 22:28:39 2013
PMON failed to acquire latch, see PMON dump
Fri Feb 15 22:29:41 2013
PMON failed to acquire latch, see PMON dump
Fri Feb 15 22:30:43 2013
PMON failed to acquire latch, see PMON dump
Fri Feb 15 22:31:46 2013
PMON failed to acquire latch, see PMON dump
....
Fri Feb 15 22:49:18 2013
WARNING: inbound connection timed out (ORA-3136)
Fri Feb 15 22:49:18 2013
WARNING: inbound connection timed out (ORA-3136)

We initially thought it was a bug as the Oracle release is 10.2.0.4 and same application, with more or less same behavior, was running well on at least two other sites. And so tried to push Oracle support in this direction.

For such a severity 1 service request opened has been opened and we provided Hanganalyze and Systemstate dump files. First bad surprised was to discover that even if sustaining support was still there, reading dump file and help identifying a bug is included only in extended support contract, which we have not purchased…

Then after a tense discussion with Oracle support they directed us to work on SQL with high number of child and no execution. As a postmortem analysis the flow to reach this conclusion is not that obvious.

The top 5 wait event is “classical”, even if having CPU time in top 1 wait event is something to investigate:

Top 5 Timed Events                                         Avg %Total
~~~~~~~~~~~~~~~~~~                                        wait   Call
Event                                 Waits    Time (s)   (ms)   Time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
CPU time                                          2,099          84.2	 
db file scattered read               70,291         230      3    9.2   User I/O
log file sync                        50,130         114      2    4.6     Commit
log file parallel write              52,139          96      2    3.8 System I/O
db file sequential read             415,411          85      0    3.4   User I/O

Unfortunately I do not have the CPU usage breakdown (parse, recursive, other) but server was clearly not CPU overloaded from what I observed at OS level (top and glance as HPUX machine). When looking at time model statistics I can’t see much time on parse:

Time Model Statistics
-> Total time in database user-calls (DB Time): 2492.7s
-> Statistics including the word "background" measure background process
   time, and so do not contribute to the DB time statistic
-> Ordered by % or DB time desc, Statistic name 

Statistic Name                                       Time (s) % of DB Time
------------------------------------------ ------------------ ------------
sql execute elapsed time                             2,179.08        87.42
DB CPU                                               2,098.53        84.19
parse time elapsed                                     129.35         5.19
hard parse elapsed time                                 59.61         2.39
.
.
.

Then my personal next action would have been to look in SQL Statistics section and focus on SQL ordered by Gets and SQL ordered by Reads but Oracle had a different good idea and noticed that we had multiple sql_id with high number of child (some AWR reports we even showing number above 30,000) and no execution:

SQL ordered by Version Count

  • Only Statements with Version Count greater than 20 are displayed
Version CountExecutions SQL IdSQL ModuleSQL Text
26,834 ag3x03cnr96vz Select DMSKey From (Select DMS…
6,197 gkqfhakg8ggh3 ** SQL Text Not Available **
5,263 d7spvxbdzqgvd Select DMSKey From (Select DMS…
5,172 d11cy1pb6kzdz Select DMSKey From (Select DMS…
4,571 64sbr6m745pwm Select DMSKey From (Select DMS…
4,486 danhtqgtdc38r Select DMSKey From (Select DMS…
4,154 cpka41z0xskd1 Select DMSKey From (Select DMS…
2,998 caq9f0czwjq1y Select DMSKey From (Select DMS…
2,744 3q72d5fdqtn14 Select DMSKey From (Select DMS…
2,519 f9d038dpjvv3j Select DMSKey From (Select DMS…
2,498 96cqkgvyd1mz2 Select DMSKey From (Select DMS…
2,326 5z820x1dv4grm Select DMSKey From (Select DMS…
1,568 cttujhvkbux9s Select DMSKey From (Select DMS…
1,482 5ff132jcqfkfw select sysid from fweqpequipme…
1,418 cka4wk4uak6u7 SELECT pkg_brutil.Join(cursor(…
1,095393ywn07p6taaxsDTPFsrv@mux018 (TNS V1-V3) delete from t_usedlotid where…
1,047 dfxzwbdzczn7x Select DMSKey From (Select DMS…
941 cdx3rfcqcg96s SELECT pkg_brutil.Join(cursor(…
811252xp8q9wnfmngpPrintingService.exe SELECT “T_TICKETSTAGING”.”LOT…

First solution and second problem

To understand why so many child for one sql_id you can use v$SQL_SHARED_CURSOR view but unfortunately all columns where equal to N and 10gR2 release is lacking the HASH_MATCH_FAILED new in 11gR2. Another convenient way is to use version_rpt script from MOS note 438755.1:

SELECT * FROM TABLE(version_rpt('ag3x03cnr96vz'));

But the script was simply hanging the database and generating same error we initially had. Finally launching it at right time we have finally been able to get the result but not the one we expected as the result was empty and simply giving advice on how to go further:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Details for HASH_MATCH_FAILED :
 
No details available
####
To further debug Ask Oracle Support for the appropiate level LLL.
alter session set events
 'immediate trace name cursortrace address 3370658479, level LLL';
To turn it off do use address 1, level 2147483648
================================================================

To activate trace use:

SELECT hash_value
FROM v$sqlarea
WHERE sql_id='ag3x03cnr96vz';
 
ALTER SESSION SET events 'immediate trace name cursortrace address <hash_value>, level 512';

Remark:
512 is a first good start, if not enough information can be even increased…

To deactivate trace use:

ALTER SESSION SET events 'immediate trace name cursortrace address 1, level 2147483648';

But again the trace was not really useful and Oracle reached to a conclusion that CURSOR_SHARING=similar (our setting) was creating issue and requested us to set it to force (by the way similar is a setting that could be deprecated in further release). After application supplier approval we finally did it, number of child decreased for almost all sql_id and we did not encounter problem again…

Second solution

But still we had sql_id with high number of child even with CURSOR_SHARING=force (CPU time was still first Top 5 Timed Events at more than 60%), all not-shared statements where of the form:

SELECT package1.function1(CURSOR(SELECT ... FROM ... WHERE id=<hard value>)) FROM dual;

To be honest using SYS_REFCURSOR was new to me and it would have never come to my mind naturally to create a procedure accepting a cursor in its argument. While I was asking myself the added value of doing this, I found in Oracle magazine March/April 2013 an article from Steven Feuerstein giving few reasons:

Pass a cursor variable back to the host environment that called the program unit—the result set can be “consumed” for display or other processing.

Construct a result set inside a function, and return a cursor variable to that set. This is especially handy when you need to use PL/SQL, in addition to SQL, to build the result set.

Pass a cursor variable to a pipelined table function—a powerful but quite advanced optimization technique. A full explanation of cursor variables, including the differences between strong and weak REF CURSOR types, is beyond the scope of this article.

This immediately remind me the post I have written on pipelined function and I share that this would be a nice example of using SYS_REFCURSOR but that’s the only one I see…

I expected those statements to also have a shared cursor but they had not. I initially thought of a 10gR2 bug but I have been able to reproduce in 11.2.0.3 using the below test case:

DROP TABLE test1;
 
CREATE TABLE test1(id NUMBER, descr VARCHAR(50)) TABLESPACE users;
 
DECLARE
  i NUMBER;
  nbrows NUMBER;
BEGIN
  i:=1;
  nbrows:=50000;
LOOP
  EXIT WHEN i>nbrows;
    INSERT INTO test1 VALUES(i,TO_CHAR(TO_DATE(i,'j'), 'jsp'));
    i:=i+1;
  END LOOP;
  COMMIT;
END;
/
 
CREATE INDEX test1_idx_id ON test1(id) TABLESPACE users;
 
EXEC DBMS_STATS.GATHER_TABLE_STATS(ownname=>'yjaquier', tabname=>'test1', estimate_percent=>NULL, method_opt=>'FOR ALL INDEXED COLUMNS SIZE AUTO', degree=>DBMS_STATS.DEFAULT_DEGREE, CASCADE=>TRUE); --No histogram
 
CREATE OR REPLACE PACKAGE package1
AS
FUNCTION function1(cursor1 SYS_REFCURSOR) RETURN VARCHAR2;
END;
/
show error
 
CREATE OR REPLACE PACKAGE BODY package1
AS
FUNCTION function1(cursor1 SYS_REFCURSOR) RETURN VARCHAR2
IS
  descr1 VARCHAR2(50);
BEGIN
  LOOP
    FETCH cursor1 INTO descr1;
    EXIT WHEN cursor1%notfound;
  END LOOP;
  RETURN descr1;
END;
END package1;
/
show err

Then you will see that each call to something like below is creating a child in V$SQL:

SELECT package1.function1(CURSOR(SELECT descr FROM test1 WHERE id=<hard value>)) FROM dual;

To automatically generate multiple calls you can use below PL/SQL:

DECLARE
  i NUMBER;
  sql_stmt VARCHAR2(200);
BEGIN
  i:=1;
  DBMS_RANDOM.SEED(SYSDATE);
  FOR i IN 1..50
  LOOP
    sql_stmt:='SELECT package1.function1(CURSOR(SELECT descr FROM test1 WHERE id=' || ROUND(DBMS_RANDOM.VALUE(1,50000)) || ')) FROM dual';
    EXECUTE IMMEDIATE sql_stmt;
  END LOOP;
END;
/

I submitted it to Oracle support, they opened a bug and Oracle development team finally closed it as expected behavior:

Because that is a completely different type of call from an Oracle internal point of view. That is simple SQL statement where the sharing of the cursor happens only from a SQL view – The binds *are* data literals (as they are predicated against a column) When you have SELECTpackage1.function1(CURSOR(SELECT descr FROM test1 WHERE id=:A)) FROM dual; it is immediately obvious there is a degree more complexity (The SQL is within a function (CURSOR) within a function (function1) )

So, with my words, Oracle rate this statement too complex and prefer not to share cursor between multiple execution to avoid wrong/different results. While we know that there will not be any issue. So clearly doing massively such things is not advisable. As a very final test I have tried directly with a bind variable:

SQL> var id1 NUMBER;
SQL> EXEC :id1:=1002;
 
PL/SQL PROCEDURE successfully completed.
 
SQL> SELECT package1.function1(CURSOR(SELECT descr FROM test1 WHERE id=:id1)) FROM dual;
 
PACKAGE1.FUNCTION1(CURSOR(SELECTDESCRFROMTEST1WHEREID=:ID1))
--------------------------------------------------------------------------------
one thousand two

And changed multiple time the value of id1 variable and, as expected, cursor is shared for multiple id1 value. This is another good example (was it needed ?) that trying to correct application bad design at Oracle database level is always a very bad idea…

References

  • How to Collect Diagnostics for Database Hanging Issues [ID 452358.1]
  • Troubleshooting: High Version Count Issues [ID 296377.1]
  • High SQL Version Counts – Script to determine reason(s) [ID 438755.1]
  • Unsafe Literals or Peeked Bind Variables [ID 377847.1]
  • High Version Count with CURSOR_SHARING = SIMILAR or FORCE [ID 261020.1]
  • Resolving Issues Where There Are Too Many Child Cursors Due to CURSOR_SHARING = SIMILAR [ID 1476096.1]

About Post Author

Share the knowledge!
This entry was posted in Oracle and tagged , . Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes:

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>