High Availability Options for DB2/LUW

Dean Compher

1 March 2007

Updated 31 March 2015

This article lists and briefly describes the options for increasing the availability of your DB2 on Linux, UNIX and Windows (DB2/LUW) systems through the use of redundant systems and clustering. The technologies that I will discuss in this article include DB2 High Availability and Disaster Recovery (DB2 HADR), Traditional Failover, Replication and DB2 pureScale. So if you need to determine the best approach to adding high availability, this summary is a good place to start.

Before we jump into the clustering technologies I want to note that following good practices in planning, building, maintaining and monitoring your systems yields the best results for availability and is very cost effective. Since these are not the focus of this article I will not get into them here, but they are very important. I would also like to note that I consider these good practices to be a prerequisite to using any of the more advanced features for high availability, as they add complexity and can DECREASE availability if you have not laid a good foundation. However, once you have implemented good practices the following items can add that incremental availability that many systems need.

In this article I will describe the four primary high availability schemes that we have for DB2/LUW. For simplicity I will assume two-server clusters except where noted, but most of the scenarios allow for more servers. The four options that I will discuss, in order from least to most complex, are:

1. DB2 HADR feature

2. Traditional Fail Over

3. Replication

4. pureScale Feature

1. DB2 HADR feature

With the DB2 High Availability and Disaster Recovery (HADR) feature you have at least two separate copies of the database, one on each server. There is one active copy and there are one or more passive copies that DB2 keeps up to date automatically. An active DB2 instance runs on each server, with one containing the primary copy of the database and the others containing a standby copy. The standby copy is perpetually in “roll forward pending” mode, and as changes are committed on the primary copy of the database they are automatically sent to the standby copies. You can have a mutual takeover configuration if you have multiple databases in the cluster, or you can designate one server solely as a standby. The HADR feature automatically keeps the copies of the database synchronized, but you need clustering software such as Tivoli System Automation (TSA) if you wish to automate the process of detecting a failure and making the standby the primary database. DB2 provides this clustering software at no extra charge. Please see the HADR Best Practices guide for information on how the feature is implemented and operated.
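As a rough illustration of what configuring the primary and standby involves, the following Python sketch only builds the DB2 command line processor commands as strings; it does not run them. The database name, host names, port and instance name are placeholders I have assumed for the example, and in practice the standby database is first created from a restored backup of the primary and is started before the primary.

```python
VALID_SYNCMODES = ("SYNC", "NEARSYNC", "ASYNC", "SUPERASYNC")


def hadr_setup_commands(db, local_host, local_port, remote_host,
                        remote_port, remote_instance, role,
                        syncmode="NEARSYNC"):
    """Build the CLP commands that configure and start HADR on one server.

    Every argument is a placeholder supplied by the caller; nothing is
    executed here.
    """
    role = role.upper()
    if role not in ("PRIMARY", "STANDBY"):
        raise ValueError("role must be PRIMARY or STANDBY")
    if syncmode not in VALID_SYNCMODES:
        raise ValueError(f"syncmode must be one of {VALID_SYNCMODES}")

    # Database configuration parameters that tell each copy where its
    # partner lives and how closely the two copies are kept in step.
    settings = [
        ("HADR_LOCAL_HOST", local_host),
        ("HADR_LOCAL_SVC", local_port),
        ("HADR_REMOTE_HOST", remote_host),
        ("HADR_REMOTE_SVC", remote_port),
        ("HADR_REMOTE_INST", remote_instance),
        ("HADR_SYNCMODE", syncmode),
    ]
    commands = [f"db2 update db cfg for {db} using {name} {value}"
                for name, value in settings]
    commands.append(f"db2 start hadr on db {db} as {role.lower()}")
    return commands


def takeover_command(db, force=False):
    """Command issued on the standby to make it the new primary."""
    suffix = " by force" if force else ""
    return f"db2 takeover hadr on db {db}{suffix}"


if __name__ == "__main__":
    # Hypothetical standby on hostB whose partner (the primary) is hostA.
    for cmd in hadr_setup_commands("SAMPLE", "hostB", 50010,
                                   "hostA", 50010, "db2inst1", "standby"):
        print(cmd)
    print(takeover_command("SAMPLE"))
```

The same function would be called with the host names swapped and the role set to primary for the other server. Clustering software such as TSA essentially automates the decision of when to issue the takeover command.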

On a single cluster of DB2 HADR servers, you can run multiple databases. You can have one primary server that runs the production workload for all databases, with the other server acting only as a standby in case the first fails. Usually with this configuration, you pay a much reduced DB2 licensing fee for the secondary server. Another alternative is to run some production databases on one server and others on the other. In this alternative, each server has some primary databases and some secondary databases, but if a server fails all databases run in primary mode on the surviving server. With this alternative you would license DB2 fully on each server. The HADR feature is included at no extra charge in all editions except Express-C.

An interesting feature of HADR is the Reads on Standby (ROS) feature. With this feature you can run read-only queries against the secondary copy (or copies) of the database, as long as those queries use the Uncommitted Read isolation level. This allows you to offload reporting or ad hoc queries to the secondary server. When using this option you must fully license DB2 on the secondary server.
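A minimal sketch of what enabling Reads on Standby might look like is shown below. It assumes the DB2_HADR_ROS and DB2_STANDBY_ISO registry variables are the mechanism used on your release, and the helper names and the orders table are made up for the example. The code only produces strings and does not connect to a database.

```python
def reads_on_standby_commands(coerce_to_ur=True):
    """db2set commands to run on the standby instance.

    Registry variable changes take effect only after the instance has
    been restarted, which is not shown here.
    """
    commands = ["db2set DB2_HADR_ROS=ON"]
    if coerce_to_ur:
        # Ask DB2 to treat queries on the standby as Uncommitted Read
        # rather than rejecting queries that request a stricter level.
        commands.append("db2set DB2_STANDBY_ISO=UR")
    return commands


def with_uncommitted_read(select_sql):
    """Append the WITH UR isolation clause to a SELECT statement."""
    return select_sql.rstrip().rstrip(";").rstrip() + " WITH UR"


if __name__ == "__main__":
    for cmd in reads_on_standby_commands():
        print(cmd)
    print(with_uncommitted_read("SELECT COUNT(*) FROM orders;"))
```

Stating the isolation level explicitly in reporting queries makes it obvious to the reader of the query that it may see uncommitted data.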

As of DB2 10.1, you can have up to three standby databases. This allows you to have a pair of servers in your primary data center for fail-over and another server in a remote data center for disaster recovery, and several other interesting configurations. One or more of these servers can also be on a time delay so that they are running some number of hours in the past. This is useful if you frequently have problems with massive updates done in error. In this case you can begin using the time delay database that has not yet seen these updates as the primary.
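To make the multiple standby and time delay ideas concrete, here is a small sketch that formats the values involved. It assumes the HADR_TARGET_LIST parameter takes host:port entries separated by a vertical bar and that HADR_REPLAY_DELAY is expressed in seconds and set on the delayed standby; IBM's documentation also describes the delay as requiring the SUPERASYNC synchronization mode, so please verify the details for your release. Host names, ports and the database name are placeholders.

```python
MAX_STANDBYS = 3  # limit stated above for DB2 10.1 and later


def hadr_target_list(standbys):
    """Format (host, port) pairs as a HADR_TARGET_LIST value.

    The first pair is taken to be the principal standby.
    """
    if not 1 <= len(standbys) <= MAX_STANDBYS:
        raise ValueError(f"expected 1 to {MAX_STANDBYS} standbys")
    return "|".join(f"{host}:{port}" for host, port in standbys)


def replay_delay_seconds(hours):
    """Convert a delay given in hours to whole seconds."""
    if hours < 0:
        raise ValueError("delay cannot be negative")
    return int(round(hours * 3600))


def target_list_command(db, standbys):
    """Command for the primary. The statement is quoted so that the
    shell does not interpret the vertical bars as pipes."""
    return (f'db2 "update db cfg for {db} using '
            f'HADR_TARGET_LIST {hadr_target_list(standbys)}"')


def replay_delay_command(db, hours):
    """Command for the standby that should lag behind the primary."""
    return (f'db2 "update db cfg for {db} using '
            f'HADR_REPLAY_DELAY {replay_delay_seconds(hours)}"')


if __name__ == "__main__":
    # One local standby, one remote DR standby, one time-delayed standby.
    standbys = [("hostB", 50010), ("drhost", 50010), ("delayhost", 50010)]
    print(target_list_command("SAMPLE", standbys))
    print(replay_delay_command("SAMPLE", 4))
```

With a four hour delay, an erroneous mass update would have to go unnoticed for more than four hours before it reached the delayed copy.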

Advantages

· No single points of failure

· Extremely fast fail over - you just tell the secondary copy to take over primary processing

· You can use either internal disk or external disk storage

· Allows for both high availability and disaster recovery

· Allows for “rolling” fix pack upgrades: since the instances can be at different maintenance levels for short periods of time, you can perform maintenance upgrades with an outage that lasts only as long as it takes to tell the secondary to become the primary. This does not apply to version upgrades like going from DB2 v10.1 to DB2 v10.5.

· Easy to implement and maintain. Least complex HA solution.
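The “rolling” fix pack upgrade mentioned in the list above can be outlined as an ordered plan. The sketch below is only my summary of the usual order (upgrade the standby, switch roles, then upgrade the former primary); the server names and database name are placeholders, the fix pack installation itself is left as a descriptive step, and the official procedure for your release may include additional steps.

```python
def rolling_fix_pack_plan(db, primary, standby):
    """Return ordered (server, step) pairs for a rolling fix pack upgrade.

    Steps are CLP or system commands where the command is well known,
    and plain descriptions otherwise. Nothing is executed.
    """
    def upgrade(server):
        return [
            (server, f"db2 deactivate db {db}"),
            (server, "db2stop"),
            (server, "install the fix pack on this server"),
            (server, "db2start"),
            (server, f"db2 activate db {db}"),
            (server, f"db2pd -db {db} -hadr  (confirm the pair has reconnected)"),
        ]

    plan = upgrade(standby)
    # The only application outage is this role switch.
    plan.append((standby, f"db2 takeover hadr on db {db}"))
    # The former primary is now the standby and can be upgraded in turn.
    plan += upgrade(primary)
    return plan


if __name__ == "__main__":
    for server, step in rolling_fix_pack_plan("SAMPLE", "hostA", "hostB"):
        print(f"{server}: {step}")
```

If you want the original server to be the primary again, a second takeover can be issued there once both servers are at the new level.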

Disadvantages

· Consumes more disk space because you have two copies of the database.

2. Traditional Fail Over

This is the traditional fail over scenario where you have a shared disk system that can mount file systems on either of the servers in the cluster. Your instance and database file systems must be mountable on either server. Since the instance files can only be mounted on one server at a time, the instance can only run on one server at a time. This requires clustering software such as Tivoli System Automation (TSA, which is free with DB2) that can automatically detect a failure, move the file systems and start the instance on the surviving server. You can have more than one instance in the cluster. Under normal circumstances you can either have all instances on one server with the other server as a passive backup, or you can have some instances on each server with the surviving server taking over all instances in the event of a problem. In the latter scenario with two servers, you still need about 50% unused capacity in the cluster to allow full performance processing in the event that a server fails. To decrease the unused capacity you can have clusters with more than two servers and designate one server as a passive backup for the rest of the cluster. For example, you could create a 4-server cluster with three servers running at full capacity and one idle server. In this case you only need about 25% unused capacity. Just keep in mind that while the probability of two servers failing at the same time is low, it is still a possibility. In addition to TSA, other clustering software options exist including Veritas Cluster Server, HACMP and Microsoft Cluster Server. This is not a DB2 feature, so you can run any DB2 edition under it. However, Express-C does not include the TSA clustering software.
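The 50% and 25% figures above follow from simple arithmetic: with equally sized servers, the capacity held in reserve is the number of passive servers divided by the total number of servers. The function name below is my own, and the sketch assumes all servers in the cluster have the same capacity.

```python
from fractions import Fraction


def reserved_capacity(total_servers, passive_servers=1):
    """Fraction of cluster capacity kept idle to absorb a failure.

    Assumes equally sized servers. A two-server mutual takeover cluster
    is equivalent to one passive server, because each server must leave
    room for the other's workload.
    """
    if total_servers < 2:
        raise ValueError("a cluster needs at least two servers")
    if not 0 < passive_servers < total_servers:
        raise ValueError("passive servers must be between 1 and total - 1")
    return Fraction(passive_servers, total_servers)


if __name__ == "__main__":
    for n in (2, 3, 4):
        share = reserved_capacity(n)
        print(f"{n} servers, 1 passive: {float(share):.0%} reserved")
```

The reserve shrinks as servers are added, but a single passive server only covers one failure at a time, which is the trade-off noted above.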

Advantages

· Low cost

· Proven technology - This failover practice has been in use for years

· With minimal configuration, many of the clustering software offerings can detect database failures and perform the fail over.

Disadvantages

· Database is a single point of failure. Having the data on RAID helps, but not for logical corruption such as when someone deletes a file.

· Fail over time increases with the number of logical disk volumes.

· Must use external shared disk hardware

· Servers must be in relatively close physical proximity.

· UNIX/Linux UID and GID numbers must be identical across servers and other complexities exist.

3. Replication

This option allows you to have two active databases. For HA you typically want updates to occur on only one database at a time, which keeps maintenance as simple as possible. This is often called master/slave replication. The really good thing about replication is the ability to do reporting, ad hoc queries and backups from the secondary server without impacting your primary at all. There are three types of replication: SQL Replication, Q Replication and InfoSphere Change Data Capture (CDC). Please see Sean Byrd’s Replication Options article for a comparison of these types of replication. Replication automatically keeps the copies of the database synchronized, but you need clustering software such as TSA if you wish to automate the process of detecting a failure and making the standby the primary database. You can also use our replication products to replicate data to or from certain non-IBM databases. SQL Replication is available in all DB2 editions at no extra cost. Q Replication and CDC are available in Advanced Workgroup Server Edition (AWSE) and Advanced Enterprise Server Edition (AESE) at no extra charge.

You can read more about the traditional SQL Replication, the newer Q Replication and CDC at these links:

SQL Replication Guide and Reference

Dynamic Information with IBM InfoSphere Data Replication CDC

A Practical Guide to DB2 UDB Data Replication V8

WebSphere Information Integrator Q Replication

Advantages

· Eliminates all single points of failure

· Extremely fast fail over - you just tell the secondary copy to take over primary processing

· You can use either internal disk or external disk storage

· Allows for Disaster Recovery

· Allows offloading of reporting tasks.

· Secondary copy can be at your disaster recovery site.

Disadvantages

· Consumes more disk space because you have two copies of the database.

· More complex configuration and maintenance

· Must deal with update conflicts if updates are allowed on more than one database in the replication complex.

4. DB2 pureScale Feature

The DB2 pureScale feature allows you to run a single instance of DB2 across several servers against one shared copy of the database. It provides the most uptime because there is no fail over. In the event that a server fails, the remaining servers in the cluster continue processing transactions for their clients, and any connections to the failed server simply reconnect to a surviving server to continue working. One of the remaining servers then performs crash recovery of the update transactions that were in flight on the failed server. Only rows being updated by the failed server are unavailable to the other servers, and as soon as crash recovery completes those locked rows become available as well. To provide disaster recovery, a copy of the pureScale database can be maintained using HADR. The DB2 pureScale option also allows you to add capacity by adding servers to the cluster. The DB2 pureScale feature is included with DB2 Advanced Workgroup Server Edition (AWSE) and Advanced Enterprise Server Edition (AESE) at no additional charge.

The DB2 database process that runs on a particular server is called a member. All the members in the cluster constitute the DB2 instance. DB2 is licensed on the servers or VMs running the DB2 members. However, the servers or VMs running only the processes that coordinate the DB2 cluster (the cluster caching facility, or CF) do not need to be licensed. Further, as of DB2 10.5 fix pack 5 there is a new feature called the Business Application Continuity (BAC) offering. This is a licensing model for a 2-node cluster that is similar to DB2 HADR, where the secondary server is not used for business processing and the licensing cost of the secondary is much lower. AESE and AWSE customers can also run one node in the cluster purely as a standby and get the limited BAC licensing on that node as well.

Advantages

· Highly scalable unless using BAC licensing.

· Zero down time upon node failure - rest of cluster just keeps processing as normal

· Allows for both high availability and disaster recovery when combined with HADR.

· Allows for “rolling” fix pack upgrades as of v10.5 fp5. One member at a time can be taken out of the cluster while the fix pack is applied and then be put back in.

· On a purely standby node that has the reduced licensing charge, you can off-load certain utility processing such as Backup, Runstats, Monitoring and others and still comply with the licensing terms.

Disadvantages

· More complex than HADR

· Requires specific server hardware (except BAC)

· Requires specific server interconnect switch hardware when running for scalability

Other

The original article discussed a feature called XKOTO GridScale. This technology is no longer offered and is not supported.
