Aggregated global statistics on partitioned tables or not ?

Preamble

One of my internal customers came with a basic question on how to gather statistics on partitioned table. I thought I was mastering the subject but after few small tests and read of ton of posts displayed in references I have discovered that it was not so obvious. The main concern you must have is around global statistics and their accuracy over time !

On partition tables you have statistics at partition and subpartitions level and at table level that are called global statistics.

Those global statistics can be gathered or derived from partition/subpartition statistics aggregation. I thought the aggregated statistics method was the optimal one but I was apparently wrong…

Global statistics are used when you are selecting more than one partition/subpartition means you are not using something quite restrictive like:

.
FROM partitioned_table
WEHRE partition_code=xx
.

Or

.
FROM partitioned_table PARTITION (xx)
.

The Database Performance Tuning Guide state:

Depending on the SQL statement undergoing optimization, the optimizer can choose to use partition, subpartition, or global statistics. Both global and partition statistics are important for most applications.

Testing has been done on Oracle enterprise edition 11.2.0.4 running on HPUX 11.31. The server has 4 Itanium 2 9100 series processor (1.59 GHz, 18 MB) so 8 cores and 32737 MB (31.97 GB) of memory.

Partitioned test table

My test table is a partition-range one, that is something we massively do where I work:

DROP TABLE test01;
 
CREATE TABLE test01(partition_code NUMBER, val VARCHAR2(100))
NOLOGGING
TABLESPACE tools
PARTITION BY RANGE (partition_code)
(
PARTITION p00 VALUES less than (1),
PARTITION p01 VALUES less than (2),
PARTITION p02 VALUES less than (3),
PARTITION p03 VALUES less than (4),
PARTITION p04 VALUES less than (5),
PARTITION p05 VALUES less than (6),
PARTITION p06 VALUES less than (7),
PARTITION p07 VALUES less than (8),
PARTITION p08 VALUES less than (9),
PARTITION p09 VALUES less than (10)
);

I fill it up with few rows:

INSERT INTO test01 VALUES(0,RPAD(0,10,0));
INSERT INTO test01 VALUES(1,RPAD(1,10,1));
INSERT INTO test01 VALUES(2,RPAD(2,10,2));
INSERT INTO test01 VALUES(3,RPAD(3,10,3));
INSERT INTO test01 VALUES(4,RPAD(4,10,4));
INSERT INTO test01 VALUES(5,RPAD(5,10,5));
INSERT INTO test01 VALUES(6,RPAD(6,10,6));
INSERT INTO test01 VALUES(7,RPAD(7,10,7));
INSERT INTO test01 VALUES(8,RPAD(8,10,8));
INSERT INTO test01 VALUES(9,RPAD(9,10,9));
COMMIT;

So val column has currently 10 distinct values…

In all below example I have changed default date format to have exact gathering time:

SQL> ALTER SESSION SET nls_date_format='dd-mon-yyyy hh24:mi:ss';
 
SESSION altered.

For any test delete the previous statistics with:

SQL> EXEC dbms_stats.delete_table_stats(ownname=>'yjaquier', tabname=>'test01');
 
PL/SQL PROCEDURE successfully completed.

Aggregated global statistics

To benefit from aggregated global statistics I gather statistics on all partition and expect Oracle to aggregate them to derive global table statistics. So using:

EXEC dbms_stats.gather_table_stats(ownname=>'yjaquier', tabname=>'test01', partname=>'xx', granularity=>'partition');

Which gives:

SQL> SELECT partition_name,last_analyzed,global_stats,stale_stats
     FROM dba_tab_statistics
     WHERE table_name='TEST01'
     AND owner='YJAQUIER';
 
PARTITION_NAME                 LAST_ANALYZED        GLO STA
------------------------------ -------------------- --- ---
                               25-aug-2015 12:27:43 NO  NO
P00                            25-aug-2015 12:27:42 YES NO
P01                            25-aug-2015 12:27:43 YES NO
P02                            25-aug-2015 12:27:43 YES NO
P03                            25-aug-2015 12:27:43 YES NO
P04                            25-aug-2015 12:27:43 YES NO
P05                            25-aug-2015 12:27:43 YES NO
P06                            25-aug-2015 12:27:43 YES NO
P07                            25-aug-2015 12:27:43 YES NO
P08                            25-aug-2015 12:27:43 YES NO
P09                            25-aug-2015 12:27:43 YES NO

When partition_name column is null it means we have table level statistics. Note the global_stats column to NO for table level statistics. Means we have aggregated global statistics for our partitioned table.

If we focus on column statistics (for display_raw function see references section):

SQL> col low_value FOR a15
SQL> col high_value FOR a15
SQL> SELECT a.column_name,a.last_analyzed,a.num_distinct,a.density,
     display_raw(a.low_value,b.data_type) AS low_value,display_raw(a.high_value,b.data_type) AS high_value
     FROM dba_tab_col_statistics a, dba_tab_cols b
     WHERE a.owner=b.owner
     AND a.table_name=b.table_name
     AND a.column_name=b.column_name 
     AND a.table_name='TEST01'
     AND a.owner='YJAQUIER';
 
COLUMN_NAME                    LAST_ANALYZED        NUM_DISTINCT    DENSITY LOW_VALUE       HIGH_VALUE
------------------------------ -------------------- ------------ ---------- --------------- ---------------
PARTITION_CODE                 25-aug-2015 12:32:18           10         .1 0               9
VAL                            25-aug-2015 12:32:18           10         .1 0000000000      9999999999

Now I add one (particular) line to P09 partition and gather statistics only on this partition:

SQL> INSERT INTO test01 VALUES(9,RPAD(8,10,8));
 
1 ROW created.
 
SQL> COMMIT;
 
COMMIT complete.
 
SQL> EXEC dbms_stats.gather_table_stats(ownname=>'yjaquier', tabname=>'test01', partname=>'p09', granularity=>'partition');
 
PL/SQL PROCEDURE successfully completed.
 
SQL> SELECT partition_name,last_analyzed,global_stats,stale_stats
     FROM dba_tab_statistics
     WHERE table_name='TEST01'
     AND owner='YJAQUIER';
 
PARTITION_NAME                 LAST_ANALYZED        GLO STA
------------------------------ -------------------- --- ---
                               25-aug-2015 12:37:07 NO  YES
P00                            25-aug-2015 12:32:18 YES NO
P01                            25-aug-2015 12:32:18 YES NO
P02                            25-aug-2015 12:32:18 YES NO
P03                            25-aug-2015 12:32:18 YES NO
P04                            25-aug-2015 12:32:18 YES NO
P05                            25-aug-2015 12:32:18 YES NO
P06                            25-aug-2015 12:32:18 YES NO
P07                            25-aug-2015 12:32:18 YES NO
P08                            25-aug-2015 12:32:18 YES NO
P09                            25-aug-2015 12:37:07 YES NO
 
SQL> SELECT a.column_name,a.last_analyzed,a.num_distinct,a.density,
     display_raw(a.low_value,b.data_type) AS low_value,display_raw(a.high_value,b.data_type) AS high_value
     FROM dba_tab_col_statistics a, dba_tab_cols b
     WHERE a.owner=b.owner
     AND a.table_name=b.table_name
     AND a.column_name=b.column_name 
     AND a.table_name='TEST01'
     AND a.owner='YJAQUIER';
 
COLUMN_NAME                    LAST_ANALYZED        NUM_DISTINCT    DENSITY LOW_VALUE       HIGH_VALUE
------------------------------ -------------------- ------------ ---------- --------------- ---------------
PARTITION_CODE                 25-aug-2015 12:37:07           10         .1 0               9
VAL                            25-aug-2015 12:37:07           11 .090909091 0000000000      9999999999

And this is main drawback of aggregated global statistics, Oracle has no global view of the data and so cannot really provide precise global statistics, particularly for number of distinct value (NDV). Here the number of NDV for val column should still be 10. RPAD(8,10,8) has already been inserted in P08 partition. Aggregated global statistics have been updated at the time of P09 partition gather statistics but notice the YES in stale_stats column which means automatic gather statistics job with select this table as a candidate (which will most probably break your gather statistics strategy).

Another drawback I have seen is related to histograms that have been computed at partition level but not aggregated at table level…

Conclusion
Global statistics must be gathered explicitly to be as close as possible to reality… Means extra resource consumption (and obviously longer time).

Gathered global statistics

If I restart from original example and let Oracle decide for granularity parameter of DBMS_STATS, means using:

EXEC dbms_stats.gather_table_stats(ownname=>'yjaquier', tabname=>'test01', partname=>'xx');

Global statistics at table level (partition_name is null) have been gathered explicitly (YES value):

SQL> SELECT partition_name,last_analyzed,global_stats,stale_stats
     FROM dba_tab_statistics
     WHERE table_name='TEST01'
     AND owner='YJAQUIER';
 
PARTITION_NAME                 LAST_ANALYZED        GLO STA
------------------------------ -------------------- --- ---
                               25-aug-2015 12:47:13 YES NO
P00                            25-aug-2015 12:47:12 YES NO
P01                            25-aug-2015 12:47:12 YES NO
P02                            25-aug-2015 12:47:12 YES NO
P03                            25-aug-2015 12:47:12 YES NO
P04                            25-aug-2015 12:47:12 YES NO
P05                            25-aug-2015 12:47:12 YES NO
P06                            25-aug-2015 12:47:12 YES NO
P07                            25-aug-2015 12:47:13 YES NO
P08                            25-aug-2015 12:47:13 YES NO
P09                            25-aug-2015 12:47:13 YES NO

If I again insert a (particular) row in P09 partition:

SQL> INSERT INTO test01 VALUES(9,RPAD(8,10,8));
 
1 ROW created.
 
SQL> COMMIT;
 
COMMIT complete.

If I re-gather statistics notice the global statistics that are updated at same time as the underlining partition statistics:

SQL> EXEC dbms_stats.gather_table_stats(ownname=>'yjaquier', tabname=>'test01', partname=>'p09');
 
PL/SQL PROCEDURE successfully completed.
 
SQL> SELECT partition_name,last_analyzed,global_stats,stale_stats
     FROM dba_tab_statistics
     WHERE table_name='TEST01'
     AND owner='YJAQUIER';
 
PARTITION_NAME                 LAST_ANALYZED        GLO STA
------------------------------ -------------------- --- ---
                               25-aug-2015 12:48:12 YES NO
P00                            25-aug-2015 12:47:12 YES NO
P01                            25-aug-2015 12:47:12 YES NO
P02                            25-aug-2015 12:47:12 YES NO
P03                            25-aug-2015 12:47:12 YES NO
P04                            25-aug-2015 12:47:12 YES NO
P05                            25-aug-2015 12:47:12 YES NO
P06                            25-aug-2015 12:47:12 YES NO
P07                            25-aug-2015 12:47:13 YES NO
P08                            25-aug-2015 12:47:13 YES NO
P09                            25-aug-2015 12:48:12 YES NO
 
11 ROWS selected.
 
SQL> SELECT a.column_name,a.last_analyzed,a.num_distinct,a.density,
     display_raw(a.low_value,b.data_type) AS low_value,display_raw(a.high_value,b.data_type) AS high_value
     FROM dba_tab_col_statistics a, dba_tab_cols b
     WHERE a.owner=b.owner
     AND a.table_name=b.table_name
     AND a.column_name=b.column_name 
     AND a.table_name='TEST01'
     AND a.owner='YJAQUIER';
 
COLUMN_NAME                    LAST_ANALYZED        NUM_DISTINCT    DENSITY LOW_VALUE       HIGH_VALUE
------------------------------ -------------------- ------------ ---------- --------------- ---------------
PARTITION_CODE                 25-aug-2015 12:48:12           10 .045454545 0               9
VAL                            25-aug-2015 12:48:12           10         .1 0000000000      9999999999

This time everything is well handle and NDV matches correct number (10). Remember it implies additional resource consumption

Global statistics performance comparison

To compare performance drawback of the aggregated statistics versus the gathered ones I have worked on a big table of our application. The table size is 3549 MB, has 12 partitions and around 1 million rows per partition. I have used the biggest partition to test execution time.

Over multiple execution below command, so aggregated global statistics, took around 15 seconds:

EXEC dbms_stats.gather_table_stats(ownname=>'xx', tabname=>'yy', partname=>'zz', granularity=>'partition', degree=>16);

Over multiple execution below command, so gathered global statistics, took around 1 minute:

EXEC dbms_stats.gather_table_stats(ownname=>'xx', tabname=>'yy', partname=>'zz', degree=>16);

So the time difference is substantial and it is 4 times slower to gather partition statistics and global statistics at same time. But the added value is clear… Fortunately the situation has been drastically improved with incremental statistics.

References

About Post Author

Share the knowledge!
This entry was posted in Oracle and tagged , . Bookmark the permalink.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes:

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>