Data Guard apply lag gap troubleshooting

Preamble

We have a SAP WM instance with an infrastructure that is partially following the Oracle Maximum Availability Architecture (MAA). In our primary data center the configuration is made of two servers and a RAC deployment. We have also setup a Data Guard configuration (physical standby) in our disaster recovery site (around 10 Km far from main site). An enterprise architect teammate asked me the lag between the master RAC database and the physical standby one: simple and fair question isn’t it ?

The first annoying thing is that Data Guard Broker has not been configured so I will use only SQL commands but reaching conclusion with DG Broker is much simpler…

If you are here you know what is Data Guard, the official Oracle high level picture is below one. Dta Guard is not more than transferring redo from primary site to secondary site:

Typical Oracle Data Guard Configuration
Typical Oracle Data Guard Configuration

Data Guard apply lag

To measure the lag between primary and standby site I have used two methods with obvious idea of having same result. First method is using V$DATAGUARD_STATS view that is accessible even in mount state so on standby database:

SQL> SELECT * FROM v$dataguard_stats;
 
NAME                             VALUE                                                            UNIT                           TIME_COMPUTED                  DATUM_TIME
-------------------------------- ---------------------------------------------------------------- ------------------------------ ------------------------------ ------------------------------
transport LAG                    +00 00:00:00                                                     DAY(2) TO SECOND(0) INTERVAL   05/02/2014 15:23:29            05/02/2014 15:23:28
apply LAG                        +00 00:22:55                                                     DAY(2) TO SECOND(0) INTERVAL   05/02/2014 15:23:29            05/02/2014 15:23:28
apply finish TIME                +00 00:00:02.764                                                 DAY(2) TO SECOND(3) INTERVAL   05/02/2014 15:23:29
estimated startup TIME           17                                                               SECOND                         05/02/2014 15:23:29

Second method is using Oracle System Change Number (SCN), on standby database:

SQL> SELECT current_scn FROM v$database;
 
CURRENT_SCN
-----------
 6587402685

On primary database:

SQL> SELECT current_scn FROM v$database;
 
CURRENT_SCN
-----------
 6587738190

Then convert these two SCNs in date on primary database (query with SCN_TO_TIMESTAMP not possible on physical standby):

SQL> SELECT SCN_TO_TIMESTAMP(6587831530), SCN_TO_TIMESTAMP(6587402685) FROM dual;
 
SCN_TO_TIMESTAMP(6587831530)                                                SCN_TO_TIMESTAMP(6587402685)
--------------------------------------------------------------------------- ---------------------------------------------------------------------------
02-MAY-14 03.23.33.000000000 PM                                             02-MAY-14 03.00.31.000000000 PM

The two methods are showing same (bad) result (the small difference is explained by the time to do the different queries). So a bit disappointed as the lag is huge (23 minutes) and I have even seen it going up to more than 1 hour…

Remark:
The V$STANDBY_EVENT_HISTOGRAM view is empty if you are not using Active Data Guard paid option that allows you to perform read only query on your physical standby database. This is expected behavior…

My configuration is following best practices:

log_archive_config                   string      DG_CONFIG=(WAL,WAL_standby)
log_archive_dest_1                   string      LOCATION=+WAL_ARCH
log_archive_dest_2                   string      SERVICE=WAL_standby LGWR ASYNC
                                                  AFFIRM VALID_FOR=(ONLINE_LOGF
                                                 ILES,PRIMARY_ROLE) DB_UNIQUE_N
                                                 AME=WAL_standby delay=0 OPTION
                                                 AL max_failure=15 reopen=300 r
                                                 egister

Remark:
LGWR is now deprecated and should be removed

I have standby redo log groups on primary database:

SQL> col MEMBER FOR a50
SQL> SELECT * FROM v$logfile ORDER BY TYPE,GROUP#;
 
    GROUP# STATUS  TYPE    MEMBER                                             IS_
---------- ------- ------- -------------------------------------------------- ---
        11         ONLINE  +WAL_OLOG/wal/onlinelog/group_11.257.810042773     NO
        11         ONLINE  +WAL_MLOG/wal/onlinelog/group_11.257.810042775     NO
        12         ONLINE  +WAL_OLOG/wal/onlinelog/group_12.258.810042775     NO
        12         ONLINE  +WAL_MLOG/wal/onlinelog/group_12.258.810042777     NO
        13         ONLINE  +WAL_OLOG/wal/onlinelog/group_13.259.810042779     NO
        13         ONLINE  +WAL_MLOG/wal/onlinelog/group_13.259.810042779     NO
        14         ONLINE  +WAL_OLOG/wal/onlinelog/group_14.260.810042781     NO
        14         ONLINE  +WAL_MLOG/wal/onlinelog/group_14.260.810042783     NO
        21         ONLINE  +WAL_OLOG/wal/onlinelog/group_21.261.810042803     NO
        21         ONLINE  +WAL_MLOG/wal/onlinelog/group_21.261.810042805     NO
        22         ONLINE  +WAL_OLOG/wal/onlinelog/group_22.262.810042807     NO
        22         ONLINE  +WAL_MLOG/wal/onlinelog/group_22.262.810042807     NO
        23         ONLINE  +WAL_OLOG/wal/onlinelog/group_23.263.810042809     NO
        23         ONLINE  +WAL_MLOG/wal/onlinelog/group_23.263.810042811     NO
        24         ONLINE  +WAL_OLOG/wal/onlinelog/group_24.264.810042813     NO
        24         ONLINE  +WAL_MLOG/wal/onlinelog/group_24.264.810042813     NO
        71         STANDBY +WAL_OLOG/wal/onlinelog/group_71.265.844256955     NO
        71         STANDBY +WAL_MLOG/wal/onlinelog/group_71.265.844256957     NO
        72         STANDBY +WAL_OLOG/wal/onlinelog/group_72.266.844256957     NO
        72         STANDBY +WAL_MLOG/wal/onlinelog/group_72.266.844256959     NO
        73         STANDBY +WAL_OLOG/wal/onlinelog/group_73.267.844256961     NO
        73         STANDBY +WAL_MLOG/wal/onlinelog/group_73.267.844256963     NO
        74         STANDBY +WAL_OLOG/wal/onlinelog/group_74.268.844256965     NO
        74         STANDBY +WAL_MLOG/wal/onlinelog/group_74.268.844256965     NO
        81         STANDBY +WAL_OLOG/wal/onlinelog/group_81.269.844256969     NO
        81         STANDBY +WAL_MLOG/wal/onlinelog/group_81.269.844256971     NO
        82         STANDBY +WAL_OLOG/wal/onlinelog/group_82.270.844256973     NO
        82         STANDBY +WAL_MLOG/wal/onlinelog/group_82.270.844256975     NO
        83         STANDBY +WAL_OLOG/wal/onlinelog/group_83.271.844256975     NO
        83         STANDBY +WAL_MLOG/wal/onlinelog/group_83.271.844256977     NO
        84         STANDBY +WAL_OLOG/wal/onlinelog/group_84.272.844256979     NO
        84         STANDBY +WAL_MLOG/wal/onlinelog/group_84.272.844256981     NO
 
32 ROWS selected.

Started to investigate on redo generation:

SQL> ALTER SESSION SET nls_date_format='dd-mon-yyyy hh24:mi:ss';
 
SESSION altered.
 
SQL> SET lines 200
SQL> SET pages 100
SQL> col metric_unit FOR a20
SQL> col metric_name FOR a25
SQL> SELECT * FROM (SELECT * FROM DBA_HIST_SYSMETRIC_HISTORY
     WHERE metric_name='Redo Generated Per Sec' ORDER BY snap_id DESC) WHERE rownum<=10;
 
   SNAP_ID       DBID INSTANCE_NUMBER BEGIN_TIME                 END_TIME                      INTSIZE   GROUP_ID  METRIC_ID METRIC_NAME                    VALUE METRIC_UNIT
---------- ---------- --------------- -------------------------- -------------------------- ---------- ---------- ---------- ------------------------- ---------- --------------------
      9947 2087071167               2 02-may-2014 14:00:35       02-may-2014 14:01:35             6017          2       2016 Redo Generated Per Sec    344912.814 Bytes Per SECOND
      9947 2087071167               2 02-may-2014 14:09:35       02-may-2014 14:10:35             6017          2       2016 Redo Generated Per Sec    185112.049 Bytes Per SECOND
      9947 2087071167               2 02-may-2014 14:08:35       02-may-2014 14:09:35             6018          2       2016 Redo Generated Per Sec    131244.799 Bytes Per SECOND
      9947 2087071167               2 02-may-2014 14:07:35       02-may-2014 14:08:35             5917          2       2016 Redo Generated Per Sec    281785.567 Bytes Per SECOND
      9947 2087071167               2 02-may-2014 14:06:35       02-may-2014 14:07:35             6018          2       2016 Redo Generated Per Sec    153441.941 Bytes Per SECOND
      9947 2087071167               2 02-may-2014 14:05:35       02-may-2014 14:06:35             6018          2       2016 Redo Generated Per Sec    154841.542 Bytes Per SECOND
      9947 2087071167               2 02-may-2014 14:04:35       02-may-2014 14:05:35             6018          2       2016 Redo Generated Per Sec    105532.802 Bytes Per SECOND
      9947 2087071167               2 02-may-2014 14:03:35       02-may-2014 14:04:35             6017          2       2016 Redo Generated Per Sec    217653.582 Bytes Per SECOND
      9947 2087071167               2 02-may-2014 14:02:35       02-may-2014 14:03:35             6017          2       2016 Redo Generated Per Sec    251052.285 Bytes Per SECOND
      9947 2087071167               2 02-may-2014 14:01:35       02-may-2014 14:02:35             5917          2       2016 Redo Generated Per Sec    209524.624 Bytes Per SECOND
 
10 ROWS selected.

I’m not generating huge amount of redo log information (no mass update currently running) as I’m around 200-300 KB per second and the transport lag is not showing any gap…

Finally had a look to V$ARCHIVE_DEST_STATUS table and have seen, on primary:

SQL> SELECT dest_id,status,database_mode,recovery_mode
FROM v$archive_dest_status
WHERE status <> 'INACTIVE';
 
   DEST_ID STATUS    DATABASE_MODE   RECOVERY_MODE
---------- --------- --------------- -----------------------
         1 VALID     OPEN            IDLE
         2 VALID     MOUNTED-STANDBY MANAGED

On standby:

SQL> SELECT dest_id,status,database_mode,recovery_mode
FROM v$archive_dest_status
WHERE status <> 'INACTIVE';
 
   DEST_ID STATUS    DATABASE_MODE   RECOVERY_MODE
---------- --------- --------------- -----------------------
         1 VALID     MOUNTED-STANDBY MANAGED
         2 VALID     UNKNOWN         IDLE
        32 VALID     UNKNOWN         IDLE

So in short all is ready for real time redo log apply but the feature has not been activated. So a final command is expected:

ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE;

References

About Post Author

This entry was posted in Oracle and tagged . Bookmark the permalink.

3 thoughts on “Data Guard apply lag gap troubleshooting

  1. Colin Dawson says:

    Interesting article.

    Following a SR query Oracle came back with the number of standby logs should be redo logs+1

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes:

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>