High Availability Options for DB2/LUW

1 March 2007

 

 

This article lists and briefly describes the options for increasing the availability of your DB2 on Linux, UNIX and Windows (DB2/LUW) systems through the use of redundant systems and clustering.  Please contact me if you want to discuss any of them more in depth.  This sort of technology is very appealing to many of us because we like to brag to our friends about the cool new technology that we are using.  While there are many legitimate uses for this technology, I will to take a moment at the beginning of this article and rain on your parade.  However boring it may be, following good practices in planning, building, maintaining and monitoring your systems yields the best results and is very cost effective.  Since these are not the focus of this article I will not get into them here, but I would be happy to discuss them with you at length.  I would also like to note that I consider these good practices to be a prerequisite to using any of the more advanced features for high availability, as they add complexity and can DECREASE availability if you have not laid a good foundation.  However, once you have implemented good practices the following items can add that incremental availability that many systems need. 

 

In this article I will describe the 4 primary high availability schemes that we have for DB2/LUW.  For simplicity I will assume two server clusters except where noted, but most of the scenarios allow for more servers. The four options that I will discuss are:

 

1.  Traditional Fail Over

2.  DB2 HADR feature

3.  Replication

4. GridSCALE

 

1.  Traditional Fail Over

This is the traditional fail over scenario where you have a shared disk system that can mount file systems on either of the servers in the cluster.  Your instance and database file systems must be able to be mounted on either one of the servers.  Since the instance files can only be on one server at a time, it can only run on one server at a time.  This requires clustering software such as Tivoli System Automation (TSA which free with DB2 on Linux) that can automatically detect and move the file systems and start the instance after a failure.  You can have more than one instance in the cluster.  Under normal circumstances you can either have all instances on one server with the other server as a passive backup, or you can have some instances on each server with the surviving server taking over all instances in the event of a problem.  In the later scenario with two servers, you still need about 50% unused capacity in the cluster to allow full performance processing in the event that a server fails.  To decrease the unused capacity you can have clusters with more than two servers and designate one server as a passive backup for the rest of the cluster.  For example, you could create a 4-server cluster with three servers running at full capacity and one idle server.  In this case you only need about 25% unused capacity.  You just need to keep in mind that while the probability of two servers failing at the same time is low, but it is a possibility.  In addition to TSA, other clustering software options exist including Veritas Cluster Server, HACMP and Microsoft Cluster Server.

 

Advantages

·      Low cost

·      Proven technology -  This failover practice has been in use for years

·      With minimal configuration, many of the clustering software offerings can detect database failures and perform the fail over. 

 

Disadvantages

·      Database is a single point of failure.  Having the data on RAID helps, but not for logical corruption such as when someone deletes a file.

·      Fail over time increases with the number of logical disk volumes.

·      Must use external shared disk hardware

·      Servers must be in relatively close physical proximity.

 

 

2.  DB2 HADR feature

With DB2 High Availability and Disaster Recovery feature you have two separate copies of the database -- one on each server.  There is one active copy and one is a passive copy that is kept up to date automatically by DB2.  There is and active DB2 instance on each server with one containing the primary copy of the database and the other containing a standby copy.  The standby copy is perpetually in "roll forward pending mode” and as changes are committed on the production copy of the database they are automatically sent to the secondary copy.  As with traditional failover, you can have a mutual take over situation if you have multiple databases in the cluster or you can have one server designated solely as a standby.  The HADR feature automatically keeps the copies of the database synchronized, but you need clustering software like TSA to if you wish to automate the process of detecting failure and making the standby the primary database.

 

Advantages

·      Eliminates all single points of failure

·      Extremely fast fail over -  you just tell the secondary copy to takeover primary processing

·      You can use either internal disk or external disk storage

·      Allows for Disaster Recovery

·      Allows for “rolling” fix pack upgrades – Since the instances can be a different maintenance levels for short periods of time, you can perform maintenance upgrades with an outage that lasts only as long as it takes to tell the secondary to become the primary.  This does not apply to version upgrades like going from DB2 v8.2 to DB2 v9.1.

·      Secondary copy can be at your disaster recovery site.

 

Disadvantages

·        Consumes more disk spaces because you have two copies of the database.

 

 

3.  Replication

This option allows you to have two active databases.  For HA you typically only want to have updates occur on one database at a time for the least complex maintenance.  This is often called a master/slave replication.  The really good thing about replication is the ability to do reporting, ad hoc queries and backups from the secondary server without impacting your primary at all.  There are two types of replication -- SQL Replication and Q-replication.  Replication automatically keeps the copies of the database synchronized, but you need clustering software like TSA to if you wish to automate the process of detecting failure and making the standby the primary database.  You can also use our replication products to replicate data to or from certain other non IBM databases.  You can read more about the traditional SQL Replication and the newer Q-Replication at these links:

A Practical Guide to DB2 UDB Data Replication V8

WebSphere Information Integrator Q Replication

 

Advantages

·          Eliminates all single points of failure

·          Extremely fast fail over -  you just tell the secondary copy to takeover primary processing

·          You can use either internal disk or external disk storage

·          Allows for Disaster Recovery

·          Allows offloading of reporting tasks.

·          Secondary copy can be at your disaster recovery site.

 

Disadvantages

·          Consumes more disk spaces because you have two copies of the database.

·          More Complex configuration and maintenance

 

 

 

4.  XKOTO GridSCALE

 

Look for an announcement regarding this technology in the near future!

 

xkoto is an IBM partner that has a database load balancing technology called GRIDSCALE.  In essence, it allows you to have three or more copies of your database that are kept in sync and the read load is divided among the databases.  This option is designed primarily for increasing the capacity of databases by allowing you to scale out with additional commodity servers as your load grows, but since there are multiple copies of it, you also get high availability with no down time.  When one server fails, the rest just keep processing.  The servers can be in the same or location or widely separated. 

 

In this option the software GridSCALE software intercepts all incoming SQL and farms it out to the individual servers as needed.  Each server has its own copy of the database.  For updates/inserts/deletes it sends them to all servers in the cluster.  Your application gets the appropriate SQL return as soon as the first server completes the data change, and GridSCALE then ensures that all servers perform the update or takes a broken server out of the cluster if it has failed.  Queries are assigned to the server with the smallest load.  It also keeps track of the status of updates and will not execute a query on a server that has not processed an update that would affect it.  For additional details pleas see:

http://www-304.ibm.com/jct09002c/gsdod/solutiondetails.do?solutionId=22691&lc=en&stateCd=P&page=1&sa=false#sd

 

Advantages

·          Eliminates all single points of failure

·          Zero fail over time.

·          You can use either internal disk or external disk storage

·          Allows for Disaster Recovery

·          Allows increased scalability.

·                    A copy or copies can be at your disaster recovery site.

 

Disadvantages

·          Consumes more disk spaces because you have multiple copies of the database.

·          More Complex configuration and maintenance

 

 

Further Reading

 

The following link is a really good reference for DB2 backup and recovery as well as for a more in depth discussion of traditional fail over and HADR:

Data Recovery and High Availability Guide and Reference

 

The following links contain technical information about DB2 and Clustering software.  If you have others please send them to me.

Implementing IBM DB2 Universal Database Workgroup Server Edition V8.2 with Microsoft Cluster Server

Automating DB2 Universal Database HADR Failover using Heartbeat for Linux

DB2 Universal Database for Linux with DRBD and Heartbeat

Automating IBM DB2 UDB HADR with HACMP

DB2 UDB Version 8 and VERITAS Storage Foundation for DB2

 

 

 

The End