Client Failover

Now we know how to switchover/failover the database what about the clients or even middleware servers, how do they react to a switchover/failover. In this section I discuss automatic site and client failover, I will cover how to configure your client applications and databases so your connections seamlessly and transparently reconnect to the new primary database. The failover can be divided into three categories

You can have two types of problem a complete site failover or a partial site failover

complete site failover This typically means that a whole DC has gone down, you have converted your physical standby into the primary database and restarted your middleware servers in the DR site and point these to the new primary database. You can use VIP's or DNS to make this task easier
partial site failover This is were you just lose the primary database server (possible hardware problem), thus you convert your physical standby to a primary database (this could be local or at a DR site), it may be possible that the middleware servers can reconnect to the new primary database without any problems, just a slight pause, again you can use VIP's and DNS to make this happen seamlessly. However if your primary database is in a remote location you may have increased network latency which may or may not affect you application.

Although what I am about to discuss does not just apply to Data Guard it could be used if using Oracle RAC or your own DR architecture, Oracle uses two mechanisms for evenly distributing connections called client-side and server-side load-balancing. I will start with client-side load balancing and connect-time failover in your TNSNAMES.ora file you can add additional addresses so that if the first address is not connectable then the second will be tried, this is the default or setting the LOAD_BALANCE option to no, you you set the LOAD_BALANCE option to yes, then Oracle will randomly choose a address from the list, this will have the affect that all clients should not end up at the same server. You can also split address lists in kinda of a group so it starts with the first group then the second group, here is an example

tnsnames.ora SALES =
(DESCRIPTION =
   (ADDRESS_LIST =
      (LOAD_BALANCE = YES)
         (ADDRESS = (PROTOCOL = TCP)(HOST = prod1)(PORT = 1521))
         (ADDRESS = (PROTOCOL = TCP)(HOST = prod2)(PORT = 1521))
   )
   (ADDRESS_LIST =
      (LOAD_BALANCE = YES)
         (ADDRESS = (PROTOCOL = TCP)(HOST = prod1dr)(PORT = 1521))
         (ADDRESS = (PROTOCOL = TCP)(HOST = prod2dr)(PORT = 1521))
   )
   (CONNECT_DATA =
      (SERVICE_NAME = SALES)
   )
)

One important point when using the address list in the way above it that if both prod1 and prod2 where down, the client would still try and connect to the first group, so it will try both prod1 and prod2 before trying prod1dr and prod2dr, which would take considerable time. Also remember that if you have your standby in read-only mode then it might be possible to connect to this instead of the primary databases. Bottom line here is that make sure you think about how you are going to configure the tnsnames.ora file from resilience.

I mentioned above that the client will try each address in the list, what if the host is down how long do we wait until the client gives up and moves onto the next host in the list, by default it will wait until the TCP connection times out. You can now configure a timeout option in the sqlnet.ora file for both clients and JDBC clients

sqlnet.ora timeout options (client) ## Time is in seconds
sqlnet.outbound_connect_timeout=10

sqlnetdef.tcp_conntimeout_str=10

Transparent Application Failover (TAF)

What happens to the sessions that are already connected to the primary database when it fails over, this is were Transparent Application Failover also known as TAF comes into play, when existing sessions detect a problem Oracle Net will automatically begin attempting to reconnect that session to another instance. TAF can failover any selects statements that were running but is unable to failover inserts, updates or deletes, an application can be configured to use OCI callbacks to capture those statements and replay them once the new session has been created. Again with the client we use the Oracle NET alias to configure the TAF, the failover_mode attribute has a number of attributes

Type

This attribute describes the type of failover

  • session - this specifies that when the failover occurs, TAF should create session and perform no other action
  • select - this specifies that when the failover occurs, TAF should create session and restart any select statements that were running at the time of the failover, only rows that were not previously returned will be returned to the user
  • none - do not perform a TAF failover
Method

This attribute determines when the session is created

  • basic - this establishes the session at the time of the failover
  • preconnect - when the initial connection to the database is made, this creates the failover session using Oracle Net alias designed by the backup attribute
Retries This attribute specifies the number of times Oracle Net will go back to the address_list and attempt to connect to the surviving instances.
Delay This attribute specifies the number of seconds to wait between each retry

With this import our Oracle Net alias would look like something below

tnsnames.ora SALES =
(DESCRIPTION =
   (ADDRESS_LIST =
      (LOAD_BALANCE = YES)
         (ADDRESS = (PROTOCOL = TCP)(HOST = prod1)(PORT = 1521))
         (ADDRESS = (PROTOCOL = TCP)(HOST = prod2)(PORT = 1521))
         (ADDRESS = (PROTOCOL = TCP)(HOST = prod1dr)(PORT = 1521))
         (ADDRESS = (PROTOCOL = TCP)(HOST = prod2dr)(PORT = 1521))
   )
   (CONNECT_DATA =
      (SERVICE_NAME = SALES)
      (FAILOVER_MODE = (TYPE = SELECT)(METHOD = BASIC)(RETRIES = 20)(DELAY = 15))
   )
)

If you have many client this could become cumbersome to change, so what you can do is to configure it on the server-side and the clients will inherit these settings, on the server-side you have a number of options on how to configure this

You can view your sessions by looking at v$session

view your TAF configuration select service , failover_type, failover_method, failed_over from v$session;

You now understand from a client point of view how your sessions are failed over but what about your application (middleware servers), how are they going to know that a failover has occurred, this is where Oracle's Fast Application Notification (FAN) solution can help, FAN can notify applications when a resource becomes unavailable, it will also notify the application when the resource becomes available again. FAN comes in two flavors

FAN ONS is for both Oracle JDBC think and thin drivers, it delivers messages to the client application by using ONS daemons that are running on the database or cluster hosts. When you develop you application you need to configure your JDBC to subscribe to the available daemons upon application startup. Whenever a resource within the cluster changes state, Oracle clusterware will publish a message to the ONS daemons which is then consumed by the application, your code will then react to that message. To subscribe to ONS daemons and receive FAN events they must first be configured for Fast Connection Failover (FCF), all JDBC applications must meet the following

FAN OCI is designed for OCI clients, is differs from NS as it does not have ONS daemons to subscribe to and consume messages, instead it has the database deliver messages directly to the OCI client. When the OCI client that is configured for FAN OCI connects to the database, an entry is placed in the reg$ view that describes the application and it can be contacted. If the state changes a messages is placed into the database alert queue, a database process will wake up and send the message to all OCI clients that are registered in the reg$ view, the application will then act appropriately., FAN OCI clients must meet the following

One point to note is that logical standby databases do not support FAN OCI messaging.

During a switchover or a failover Data Guard performs a DB_ROLE_CHANGE system event, this event is fired each time the database role is changed, this both applies to physical or logical standby databases. You can use this system event to perform any number of functions at the failover time, an example script is below

using the DB_ROLE_CHANGE system event create or replace trigger failover_actions after db_role_change on database
BEGIN
  dbms_scheduler.create_job(
  job_name=>'publish_events',
  job_type=>'executable',
  job_action=>'/u01/failover_scripts/failover_actions.sh',
  enabled=>TRUE
  );
END

Conclusion

There are a number of ways that I have described to allow the client or application to failover when you either switching or failing over a database, you can use one or more of each method to make sure that your clients and applications failover seamlessly