Monitoring Data Guard
Proactive monitoring plays a big part in making sure that your Data Guard environment is working correctly and that potential problems are picked up quickly. This is one area the business never sees; it takes a lot of work to keep a system running 365 days per year without downtime. This section covers both physical and logical standby databases: we look at log files, statistical information, log gaps and redo transport, and we also take a closer look at the Data Guard Menu utility which I mentioned in my broker section.
We first start with the alert log file, which every DBA should be familiar with. I use a script supplied by the www.dataguardbook.com website; I have made a few changes and linked it into my monitoring system (Big Brother, or BB for short). The script compares the alert.log to the previous version that was checked and alerts on any new ORA- errors. The script is called alert_log_monitor.ksh. You need to set up an environment first, which means creating a file called .ORACLE_BASE that you source before running the script, or put it in your profile so that it gets sourced every time you log in.
.ORACLE_BASE
    export BASE_DIR=/home/oracle
    export ORACLE_BASE=/home/oracle
    export PATH=/usr/local/bin:/bin:/usr/bin:/usr/sbin:$PATH
    export SH=$ORACLE_BASE/general/sh
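To have the environment picked up automatically you could source the file from your profile, for example (the path is an assumption, adjust it to wherever you created the file):

    ## in ~/.profile - source the Data Guard monitoring environment at login
    . $HOME/.ORACLE_BASE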
The script uses sqlplus -V to determine whether you are using 10g or 11g and thus the correct default location of your alert log, or you can just edit the script to point to your own location. When the script detects new ORA- errors it calls the alert_notification.ksh script, which in turn uses alert_notification.ctl (which you edit to point to the email addresses that you want to alert).
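As a rough illustration of the version check (this is a sketch of the general idea, not the script's actual code), sqlplus -V prints a release banner that can be parsed in the shell to pick the right alert log directory:

    ## derive the major version from the SQL*Plus banner,
    ## e.g. "SQL*Plus: Release 11.2.0.1.0 Production" -> "11"
    VERSION=$(sqlplus -V | awk '/Release/ {split($3, v, "."); print v[1]}')
    if [ "$VERSION" = "11" ]; then
        ALERT_DIR=$ORACLE_BASE/diag/rdbms/$DB_NAME/$ORACLE_SID/trace   ## 11g ADR location
    else
        ALERT_DIR=$ORACLE_BASE/admin/$ORACLE_SID/bdump                 ## 10g bdump location
    fi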
Here is an example of the alert_notification.ctl file; notice the catch-all recipient in the last line. The script will pick up the following categories: MINOR, WARNING and CRITICAL.
alert_notification.ctl
    blockmon.ksh          DB  CRITICAL dbasupport@datadisk.co.uk itsupport@datadisk.co.uk
    logical_log_alert.ksh DB  CRITICAL dbasupport@datadisk.co.uk itsupport@datadisk.co.uk
    alert_log_monitor.ksh OSE WARNING  dbasupport@datadisk.co.uk
    archmon.ksh           OSE MINOR    dbasupport@datadisk.co.uk
    *:                    OTH MINOR    dbasupport@datadisk.co.uk
You can also just check the alert log each day; I generally mail it to myself and have a look. This has the added benefit that you become familiar with it and can see what else Oracle is up to, though this really only applies if you have a small environment.
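If you want to automate the daily mail-out, a simple cron entry will do. A minimal sketch, assuming mailx is available and a 10g-style bdump location (the database name, path and schedule are examples, adjust for your environment):

    ## mail the alert log to the DBAs every morning at 07:00
    0 7 * * * /bin/cat /home/oracle/admin/PRIM/bdump/alert_PRIM.log | /bin/mailx -s "PRIM alert log" dbasupport@datadisk.co.uk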
In an ideal world a log switch should occur every 30 minutes, and no more often than every 15 minutes; you can check this by viewing the v$log_history view. Try and size the logs so that you hit 30 minutes (remember that if you increase the primary log sizes you must increase the standby ones too). To work out how many standby redo logs you need, you can use the below formula:

    (number of ORLs per thread + 1) * number of threads

So if you had a 3-node RAC with 3 ORLs per instance you would need (3 + 1) * 3 = 12 standby redo logs.
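As a reminder of the syntax, standby redo logs are added per thread with ALTER DATABASE. A minimal sketch for one thread (the group numbers, file paths and 512m size are examples; the size must match your online redo logs, and you would repeat this for threads 2 and 3):

    -- add four standby redo logs for thread 1
    alter database add standby logfile thread 1 group 10 ('/u01/oradata/PRIM/srl10.log') size 512m;
    alter database add standby logfile thread 1 group 11 ('/u01/oradata/PRIM/srl11.log') size 512m;
    alter database add standby logfile thread 1 group 12 ('/u01/oradata/PRIM/srl12.log') size 512m;
    alter database add standby logfile thread 1 group 13 ('/u01/oradata/PRIM/srl13.log') size 512m;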
You can use the below to identify the peak archive times for a specified day.
identify peak archive times
    ## You probably will have to increase your linesize
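As a sketch of such a query, a per-hour count from v$log_history along the following lines shows the peak archive times (the date literal is just an example):

    set linesize 140
    -- count log switches per hour for the specified day
    select to_char(first_time, 'DD-MON-RR') day,
           to_char(first_time, 'HH24') hour,
           count(*) log_switches
    from v$log_history
    where trunc(first_time) = to_date('01-JAN-10', 'DD-MON-RR')
    group by to_char(first_time, 'DD-MON-RR'), to_char(first_time, 'HH24')
    order by 1, 2;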
Check the archive_lag_target parameter as this will have an effect on your log switching. By default it is set to 0, which means it will not influence log switching; if you do set the parameter it is recommended to set it to 1800 (seconds), which is 30 minutes.
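For example, to force a log switch at least every 30 minutes (the sid clause is only needed on RAC):

    alter system set archive_lag_target=1800 scope=both sid='*';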
Eventually you will have gaps in the archive log sequences, maybe due to network issues; remember these occur when an archive log is generated on the primary database but not received at the standby site. Archive log gaps can be monitored by examining the low and high sequence numbers in the v$archive_gap view, and you can also use the DG Menu (which we discuss later). Also see my troubleshooting section regarding manually resolving gaps.
archive gap logs
    # Use the thread# to detect missing sequences when using RAC
    select thread#, low_sequence#, high_sequence# from v$archive_gap;
You can identify delays in the redo transport by comparing the highest sequence numbers of both the primary and physical standby databases; you can also use the dg_archivelog_monitor.sh script to monitor the archive logs.
delays in redo transport
    ## run on both the primary and the standby and compare the highest sequence numbers
    select max(sequence#), thread# from v$archived_log group by thread#;
    ## you can use the dg_archivelog_monitor.sh script, which accepts three parameters, primary, physical
Identify the missing logs on the primary
    ## On the primary run the below
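The classic way to do this (a sketch, assuming dest_id 1 is the local destination and dest_id 2 is the standby destination; check your own log_archive_dest_n settings) is to compare v$archived_log across the two destinations:

    -- archive logs registered locally but never received by the standby
    select local.thread#, local.sequence#
    from (select thread#, sequence# from v$archived_log where dest_id = 1) local
    where local.sequence# not in
          (select sequence# from v$archived_log
           where dest_id = 2 and thread# = local.thread#);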
If the archive log destinations are in ASM you can monitor the free space in ASM by using the dgmon_mon.ksh script.
ASM free space
    ## The script takes three parameters: the alert notification threshold, the database name and the diskgroup
    dgmon_mon.ksh <alert notification threshold> <database name> <diskgroup>
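If you want to check by hand instead, a quick sketch against v$asm_diskgroup shows the percentage of free space per diskgroup:

    -- percentage of free space per diskgroup
    select name, total_mb, free_mb,
           round(free_mb / total_mb * 100, 2) pct_free
    from v$asm_diskgroup;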
You can monitor the average apply rate and active apply rate by using the below.
apply rate and active monitoring
    ## run on the standby database
    select to_char(start_time, 'DD-MON-RR HH24:MI:SS') start_time,
           item, units, sofar
    from v$recovery_progress
    where item in ('Active Apply Rate', 'Average Apply Rate');
You can find out how far behind the redo transport and redo apply processes are. The transport lag indicates how far in time the shipment of redo from the primary to the standby database is behind; remember this is what you would lose if the primary were to go down. The apply lag indicates how much time the apply is behind; this can be affected by the DELAY attribute of the log_archive_dest_n parameter, as you may deliberately want a delay (for example 30 minutes). You can also use the dg_time_lag script to find out how far you are behind, with the added bonus of seeing the lag as an actual time.
transport and apply lag
    col name for a13
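The figures usually come from v$dataguard_stats on the standby; a minimal sketch of such a query:

    col value for a20
    -- transport lag = redo not yet received; apply lag = redo received but not yet applied
    select name, value, time_computed
    from v$dataguard_stats
    where name in ('transport lag', 'apply lag');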
You can review the progress of the standby database in managed recovery mode, seeing both the redo apply and redo transport services.
Viewing the status of the managed recovery process
    col client_pid for a10
    select pid, process, status, client_process, client_pid,
           thread#, sequence#, block#, blocks
    from v$managed_standby;
The Data Guard Utility Menu (DG Menu) is a very useful tool to view reports of your current system; you can also see what the scripts are doing in the background and which tables are being used to look up the information. You can even add to or change the menu as long as you know how to script a little, and you can use the scripts in cron to report and email you on a regular basis.
I am not going to go into great detail here as the menu and options are all self-explanatory; the one thing you need to do after downloading the tool is to set up the configuration file called dg.conf. Here is a copy of my configuration to give you an idea of how to configure it.
dg.conf
    Note: the tool only uses SQL*Plus, it does not use the tnsnames.ora file
I have this tool working on all my database servers, as you never know which one may fail; at least it's ready to go if a failover occurs.
The tool has two menus: one I call the main menu, and the other is a standby menu which is invoked from main menu option 20.
DG Menu Utility

    DG Menu
    ========================================
    # ------------------------------------------------------------------------- #
    #                                                                           #
    #                  Data Guard Check List - primarydg01                      #
    #                                                                           #
    #  0. Review database information and status                                #
    #  1. Check for password file                                               #
    #  2. Check for forced logging and unrecoverable activities                 #
    # ------------------------------------------------------------------------- #
    #  3. Check for archive log mode                                            #
    #  4. Check for standby redo logs                                           #
    #  5. Check current SCN on primary and standby databases                    #
    #  6. Check archive log destinations                                        #
    # ------------------------------------------------------------------------- #
    #  7. Check Data Guard Status View for errors and fatal messages            #
    #  8. Check Managed Recovery Process Status                                 #
    #  9. Check for missing archive logs                                        #
    # 10. Check archive log gaps on the standby database                        #
    # 11. Check average apply rate / active apply rate                          #
    # 12. Check transport / apply lag                                           #
    # 13. How far behind is my Data Guard in terms of time?                     #
    #                                                                           #
    # ------------------------------------------------------------------------- #
    # 20. Launch the Logical Standby Data Guard Submenu                         #
    # ------------------------------------------------------------------------- #
    # 21. Generate init.ora entries for primary database                        #
    # 22. Generate init.ora entries for standby database                        #
    # 23. Generate tnsnames.ora entries for primary and standby databases       #
    # 24. Generate SQL syntax to create standby redo logs                       #
    #                                                                           #
    # ------------------------------------------------------------------------- #
    # 30. Generate syntax to duplicate standby database from active database    #
    #                                                                           #
    #  x. Exit                                                                  #
    # ------------------------------------------------------------------------- #
    # Enter Task Number:
As you can see there are lots of options which report on all sorts of things, so have a play around, and don't forget to have a look at the scripts to get a better understanding of which tables are being accessed to supply the information.
There are many ways to monitor your Data Guard setup: some people like menus, others like automatic cron jobs running checks and emailing. Myself, I have taken the Data Guard Menu Utility scripts and modified them to work in a monitoring tool called "Big Brother", editing the scripts to alert when a condition or a threshold has been met; I don't monitor too much, just enough to alert me if things are not working correctly. I have also built a complete test environment at home in my VMware setup, which gives me the practice that I need when testing the DG environment and generally playing around with parameters, tuning, etc.