Data Guard - Client Failover

Now we know how to switchover/failover the database what about the clients or even middleware servers, how do they react to a switchover/failover. In this section I discuss automatic site and client failover, I will cover how to configure your client applications and databases so your connections seamlessly and transparently reconnect to the new primary database. The failover can be divided into three categories

Service relocation - The database service used by the primary application to connect to the database should be active only on the primary database, if a failover or switchover occurs this service should be automatically migrated to the new primary database
Client notification - Once the failover has completed the application should be notified that a failover has occurred and that connections should be migrated to the new primary database
Efficient reconnection - The new and previous sessions should be quickly be able to locate the new primary database and not get stalled waiting for timeouts on unavailable hosts or networks.

You can have two types of problem a complete site failover or a partial site failover

complete site failover

This typically means that a whole DC has gone down, you have converted your physical standby into the primary database and restarted your middleware servers in the DR site and point these to the new primary database. You can use VIP's or DNS to make this task easier

partial site failover

This is were you just lose the primary database server (possible hardware problem), thus you convert your physical standby to a primary database (this could be local or at a DR site), it may be possible that the middleware servers can reconnect to the new primary database without any problems, just a slight pause, again you can use VIP's and DNS to make this happen seamlessly. However if your primary database is in a remote location you may have increased network latency which may or may not affect you application.

Although what I am about to discuss does not just apply to Data Guard it could be used if using Oracle RAC or your own DR architecture, Oracle uses two mechanisms for evenly distributing connections called client-side and server-side load-balancing. I will start with client-side load balancing and connect-time failover in your TNSNAMES.ora file you can add additional addresses so that if the first address is not connectable then the second will be tried, this is the default or setting the LOAD_BALANCE option to no, you you set the LOAD_BALANCE option to yes, then Oracle will randomly choose a address from the list, this will have the affect that all clients should not end up at the same server. You can also split address lists in kinda of a group so it starts with the first group then the second group, here is an example

tnsnames.ora

SALES =
(DESCRIPTION =
   (ADDRESS_LIST =
      (LOAD_BALANCE = YES)
         (ADDRESS = (PROTOCOL = TCP)(HOST = prod1)(PORT = 1521))
         (ADDRESS = (PROTOCOL = TCP)(HOST = prod2)(PORT = 1521))
   )
   (ADDRESS_LIST =
      (LOAD_BALANCE = YES)
         (ADDRESS = (PROTOCOL = TCP)(HOST = prod1dr)(PORT = 1521))
         (ADDRESS = (PROTOCOL = TCP)(HOST = prod2dr)(PORT = 1521))
   )
   (CONNECT_DATA =
      (SERVICE_NAME = SALES)
   )
)

One important point when using the address list in the way above it that if both prod1 and prod2 where down, the client would still try and connect to the first group, so it will try both prod1 and prod2 before trying prod1dr and prod2dr, which would take considerable time. Also remember that if you have your standby in read-only mode then it might be possible to connect to this instead of the primary databases. Bottom line here is that make sure you think about how you are going to configure the tnsnames.ora file from resilience.

I mentioned above that the client will try each address in the list, what if the host is down how long do we wait until the client gives up and moves onto the next host in the list, by default it will wait until the TCP connection times out. You can now configure a timeout option in the sqlnet.ora file for both clients and JDBC clients

sqlnet.ora timeout options (client)

## Time is in seconds
sqlnet.outbound_connect_timeout=10

sqlnetdef.tcp_conntimeout_str=10

Transparent Application Failover (TAF)

What happens to the sessions that are already connected to the primary database when it fails over, this is were Transparent Application Failover also known as TAF comes into play, when existing sessions detect a problem Oracle Net will automatically begin attempting to reconnect that session to another instance. TAF can failover any selects statements that were running but is unable to failover inserts, updates or deletes, an application can be configured to use OCI callbacks to capture those statements and replay them once the new session has been created. Again with the client we use the Oracle NET alias to configure the TAF, the failover_mode attribute has a number of attributes

Type	This attribute describes the type of failover session - this specifies that when the failover occurs, TAF should create session and perform no other action select - this specifies that when the failover occurs, TAF should create session and restart any select statements that were running at the time of the failover, only rows that were not previously returned will be returned to the user none - do not perform a TAF failover
Method	This attribute determines when the session is created basic - this establishes the session at the time of the failover preconnect - when the initial connection to the database is made, this creates the failover session using Oracle Net alias designed by the backup attribute
Retries	This attribute specifies the number of times Oracle Net will go back to the address_list and attempt to connect to the surviving instances.
Delay	This attribute specifies the number of seconds to wait between each retry

With this import our Oracle Net alias would look like something below

tnsnames.ora

SALES =
(DESCRIPTION =
   (ADDRESS_LIST =
      (LOAD_BALANCE = YES)
         (ADDRESS = (PROTOCOL = TCP)(HOST = prod1)(PORT = 1521))
         (ADDRESS = (PROTOCOL = TCP)(HOST = prod2)(PORT = 1521))
         (ADDRESS = (PROTOCOL = TCP)(HOST = prod1dr)(PORT = 1521))
         (ADDRESS = (PROTOCOL = TCP)(HOST = prod2dr)(PORT = 1521))
   )
   (CONNECT_DATA =
      (SERVICE_NAME = SALES)
      (FAILOVER_MODE = (TYPE = SELECT)(METHOD = BASIC)(RETRIES = 20)(DELAY = 15))
   )
)

If you have many client this could become cumbersome to change, so what you can do is to configure it on the server-side and the clients will inherit these settings, on the server-side you have a number of options on how to configure this

init.ora parameter
by using the srvctl in a RAC
by using the dbms_service PL/SQL package

You can view your sessions by looking at v$session

view your TAF configuration

select service , failover_type, failover_method, failed_over from v$session;

You now understand from a client point of view how your sessions are failed over but what about your application (middleware servers), how are they going to know that a failover has occurred, this is where Oracle's Fast Application Notification (FAN) solution can help, FAN can notify applications when a resource becomes unavailable, it will also notify the application when the resource becomes available again. FAN comes in two flavors

FAN Oracle notification service (ONS) - designed for both both JDBC thick and thin clients
FAN OCI is used for all OCI-based clients

FAN ONS is for both Oracle JDBC think and thin drivers, it delivers messages to the client application by using ONS daemons that are running on the database or cluster hosts. When you develop you application you need to configure your JDBC to subscribe to the available daemons upon application startup. Whenever a resource within the cluster changes state, Oracle clusterware will publish a message to the ONS daemons which is then consumed by the application, your code will then react to that message. To subscribe to ONS daemons and receive FAN events they must first be configured for Fast Connection Failover (FCF), all JDBC applications must meet the following

Use the JDBC driver
Have the implicit connection cache enabled
Connect to the database using services
ONS daemons must be running on the database hosts
The JVM must have oracle.ons.oraclehome set to point to your oracle_home

FAN OCI is designed for OCI clients, is differs from NS as it does not have ONS daemons to subscribe to and consume messages, instead it has the database deliver messages directly to the OCI client. When the OCI client that is configured for FAN OCI connects to the database, an entry is placed in the reg$ view that describes the application and it can be contacted. If the state changes a messages is placed into the database alert queue, a database process will wake up and send the message to all OCI clients that are registered in the reg$ view, the application will then act appropriately., FAN OCI clients must meet the following

The OCI_EVENTS must be enabled at the environment create time on the client
The OCI application must be linked with the client or O/S thread library.
The database service used by the application must have the aq_ha_notifications set to true (see dbms_service.modify_server to set this)

One point to note is that logical standby databases do not support FAN OCI messaging.

During a switchover or a failover Data Guard performs a DB_ROLE_CHANGE system event, this event is fired each time the database role is changed, this both applies to physical or logical standby databases. You can use this system event to perform any number of functions at the failover time, an example script is below

using the DB_ROLE_CHANGE system event

create or replace trigger failover_actions after db_role_change on database
BEGIN
  dbms_scheduler.create_job(
  job_name=>'publish_events',
  job_type=>'executable',
  job_action=>'/u01/failover_scripts/failover_actions.sh',
  enabled=>TRUE
  );
END

Conclusion

There are a number of ways that I have described to allow the client or application to failover when you either switching or failing over a database, you can use one or more of each method to make sure that your clients and applications failover seamlessly

You can use VIP's
DNS can be used to change IP addresses and point to the correct database
Use multiple addresses in your Oracle Net alias
For applications you can use FAN ONS or FAN OCI
You can call failover scripts using the DB_ROLE_CHANGE system event