DB2 LUW WLM (Workload Manager)
Mark Mulligan
27 June 2011
Our company had a large corporate data warehouse database used by power users to provide information to senior executives and high level regional executives. We also had a smaller data mart database with 19 months of information available for all other associates.
After beta testing the new workload management features that came with DB2 LUW V9.5, we determined that we could combine the two databases into one and allow all users to share a single information source. We planned to use Workload Manager to help manage performance. We thought we might also be able to use its threshold limits to help manage access and avoid problem SQL, but we were not certain how to set those at first.
We determined that a workload at our company was a unit of work run by a particular user ID, and that we wanted to manage workloads by user group. We had power users who provided information to senior and high-level regional executives. We had regular users, the rest of the associates with access to our data warehouse information. We had operational users who monitor and maintain our production environments. Finally, we had application job user IDs used by our production extract, transform, load (ETL) jobs.
We decided to build our workload management structure around these user groups. We created a USERS service superclass with subclasses for POWER, REGULAR, NEW, OPERATIONS and APPLICATIONS users. For simplicity we created workloads with names that matched our subclasses and added the user IDs to each workload as system users.
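As a rough sketch of the kind of DDL this involves (the object names are the ones described here, but the user IDs are placeholders and the options should be checked against your DB2 version):

CREATE SERVICE CLASS USERS;
CREATE SERVICE CLASS POWER UNDER USERS;
CREATE SERVICE CLASS REGULAR UNDER USERS;
CREATE SERVICE CLASS "NEW" UNDER USERS;    -- "NEW" is delimited in case it clashes with a reserved word
CREATE SERVICE CLASS OPERATIONS UNDER USERS;
CREATE SERVICE CLASS APPLICATIONS UNDER USERS;

-- One workload per subclass, mapping connections by system user ID;
-- similar workloads are created for the other subclasses
CREATE WORKLOAD POWER SYSTEM_USER ('PWRUSER1','PWRUSER2')
  SERVICE CLASS POWER UNDER USERS;
GRANT USAGE ON WORKLOAD POWER TO PUBLIC;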
Our user IDs have intelligence built into their names, so we can tell which IDs belong to application jobs and which belong to people. We put all of our application job IDs in the APPLICATIONS workload. Since we knew which users were power users and which worked in operations, we put those IDs in the POWER and OPERATIONS workloads. We put the rest of the user IDs in the NEW workload. We monitored the SQL run by NEW users, and once we saw that a user consistently ran well-tuned SQL over time, we moved that user into the REGULAR workload.
We developed a script that adds new user IDs that are not already in a workload to either the APPLICATIONS or NEW workload, and we run it on a periodic basis. We manually override this initial assignment when we know a user belongs in the REGULAR or POWER workload.
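For example (the ID shown is a placeholder), the script essentially issues statements like:

ALTER WORKLOAD "NEW" ADD SYSTEM_USER ('NEWUSER01');

-- Manual override later, once we know where the user belongs
ALTER WORKLOAD "NEW" DROP SYSTEM_USER ('NEWUSER01');
ALTER WORKLOAD POWER ADD SYSTEM_USER ('NEWUSER01');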
We change the CPU, memory and I/O priorities for the various subclasses to give the POWER, REGULAR and NEW subclasses higher priority during the day, and to give the APPLICATIONS subclass higher priority during the batch cycle at night.
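The values below are illustrative, and the mapping of CPU, memory and I/O onto specific options is our reading of the service class attributes available in these releases (agent priority and prefetch priority in V9.5, bufferpool priority in V9.7):

-- Daytime settings for one interactive subclass
ALTER SERVICE CLASS POWER UNDER USERS AGENT PRIORITY -10;        -- CPU: negative values raise priority on UNIX
ALTER SERVICE CLASS POWER UNDER USERS PREFETCH PRIORITY HIGH;    -- I/O
ALTER SERVICE CLASS POWER UNDER USERS BUFFERPOOL PRIORITY HIGH;  -- memory (V9.7 and later)

-- At the start of the nightly batch cycle, a similar set of ALTERs favors APPLICATIONS instead.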
We established thresholds to prevent the POWER, REGULAR and NEW users from running SQL estimated at more than 20 million timerons. Initially we set this as a soft limit that allows the SQL to continue to run. We determined that a 20 million timeron limit was best for our database environments because the users who use federated nicknames to join local data warehouse tables to other databases run SQL that costs between 10 and 20 million timerons. Almost everything above that limit is a mistake, often a Cartesian product. This estimated SQL cost threshold helps us avoid wasting CPU, memory and I/O resources trying to process such SQL.
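A soft threshold of this kind looks roughly like the following (shown for the POWER subclass; the REGULAR and NEW subclasses get the same treatment, and the threshold name is illustrative):

CREATE THRESHOLD POWER_EST_COST
  FOR SERVICE CLASS POWER UNDER USERS ACTIVITIES
  ENFORCEMENT DATABASE
  WHEN ESTIMATEDSQLCOST > 20000000
  CONTINUE;    -- soft limit: the violation is recorded, but the SQL keeps running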
We established thresholds to limit the number of rows returned by different groups of users. Our POWER users are allowed to return up to 50 million rows, and our REGULAR and NEW users are limited to 20 million rows. We set these limits because we found users who would at times attempt to return billions of rows. For POWER users this was usually a mistake made while working under pressure from senior executives to come up with something new fast. For NEW users it was usually due to a lack of experience or knowledge about what information is available in our data warehouse databases and how much of it there is.
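For example (threshold names are illustrative; the row limits are the ones described above):

CREATE THRESHOLD POWER_ROWS_RETURNED
  FOR SERVICE CLASS POWER UNDER USERS ACTIVITIES
  ENFORCEMENT DATABASE
  WHEN SQLROWSRETURNED > 50000000
  STOP EXECUTION;

CREATE THRESHOLD REGULAR_ROWS_RETURNED
  FOR SERVICE CLASS REGULAR UNDER USERS ACTIVITIES
  ENFORCEMENT DATABASE
  WHEN SQLROWSRETURNED > 20000000
  STOP EXECUTION;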
We also established thresholds that stop a user's work from staying active, or a connection from sitting idle, for more than 24 hours. We determined that users whose SQL runs for more than 24 hours usually need technical assistance with what they are trying to accomplish. We also determined that many users do not terminate their connections, and having a threshold to clean these up helps free resources for other work.
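In DDL terms this is along the following lines, again shown for only one subclass; whether the action stops the work or merely reports it is a policy choice:

CREATE THRESHOLD POWER_TOTAL_TIME
  FOR SERVICE CLASS POWER UNDER USERS ACTIVITIES
  ENFORCEMENT DATABASE
  WHEN ACTIVITYTOTALTIME > 24 HOURS
  STOP EXECUTION;

CREATE THRESHOLD IDLE_CONNECTIONS
  FOR DATABASE ACTIVITIES
  ENFORCEMENT DATABASE
  WHEN CONNECTIONIDLETIME > 24 HOURS
  STOP EXECUTION;    -- for this predicate, "stop" closes the idle connection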
We determined that long-running SQL often hurts the performance of shorter-running SQL, so we created thresholds that remap such work from the POWER, REGULAR and NEW subclasses to a new set of subclasses named POWER_LONG_RUNNING, REGULAR_LONG_RUNNING and NEW_LONG_RUNNING. We set the memory and I/O priorities on these new subclasses to medium to reduce the pressure long-running SQL was putting on memory and the I/O contention it was causing for short-running SQL.
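The exact condition that triggers the remap is not spelled out here; one way to express it, assuming a DB2 9.7 Fix Pack 1 or later level where in-service-class thresholds and the REMAP ACTIVITY action are available, is by CPU time spent in the subclass (the 10-minute budget and checking interval are illustrative):

CREATE SERVICE CLASS POWER_LONG_RUNNING UNDER USERS PREFETCH PRIORITY MEDIUM;
ALTER SERVICE CLASS POWER_LONG_RUNNING UNDER USERS BUFFERPOOL PRIORITY MEDIUM;

CREATE THRESHOLD POWER_REMAP_LONG
  FOR SERVICE CLASS POWER UNDER USERS ACTIVITIES
  ENFORCEMENT DATABASE PARTITION
  WHEN CPUTIMEINSC > 10 MINUTES CHECKING EVERY 60 SECONDS
  REMAP ACTIVITY TO POWER_LONG_RUNNING;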
Finally, we determined that SQL reading more than 1 billion rows per partition was causing problems for our data warehouse environment by flushing bufferpools and filling up our temporary tablespaces with large result sets. We created a soft threshold limit for rows read so that the SQL continues to run, and we monitor this activity and are notified when it happens. We then work with the business users to develop more efficient SQL or to break the work into multiple SQL statements with smaller units of work. For example, we determined that running 5 SQL statements to get information for the last 5 years shared resources better than 1 SQL statement that selected all 5 years of information.
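Assuming a level where the SQLROWSREAD predicate is available (DB2 9.7 Fix Pack 1 or later), the soft limit looks something like this; the checking interval is illustrative:

CREATE THRESHOLD ROWS_READ_WARNING
  FOR SERVICE CLASS REGULAR UNDER USERS ACTIVITIES
  ENFORCEMENT DATABASE PARTITION
  WHEN SQLROWSREAD > 1000000000 CHECKING EVERY 300 SECONDS
  CONTINUE;    -- soft limit: the violation is recorded, the SQL keeps running

With CONTINUE, the violation still lands in the threshold violations event monitor described later, which is what drives the notification.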
We developed a script that monitors the number of connections per user ID and forces off connections that exceed a certain limit. This helps prevent a few users from consuming too much of the database server's resources at the expense of everyone else who shares this information source.
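The connection limit itself is a local policy choice; the check behind such a script can be as simple as a query against the SYSIBMADM.APPLICATIONS administrative view (the limit of 10 below is hypothetical), after which the offending agent IDs are forced off with the FORCE APPLICATION command:

-- Authorization IDs holding more than 10 concurrent connections
SELECT AUTHID, COUNT(DISTINCT AGENT_ID) AS NUM_CONNECTIONS
FROM SYSIBMADM.APPLICATIONS
GROUP BY AUTHID
HAVING COUNT(DISTINCT AGENT_ID) > 10;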
We developed another script that identifies agent IDs that are currently executing and then uses db2pd to get the application information, including the SQL, for each agent ID. We load this information into an operations SQL table hourly, analyze the SQL that users run, and contact them when we have recommendations that will improve the performance of what they are running.
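The scripts themselves are in the accompanying presentation; as a rough SQL-only sketch of the same idea, assuming DB2 9.7's SYSIBMADM.MON_CURRENT_SQL administrative view and a hypothetical operations table OPS.SQL_ACTIVITY:

-- Hourly capture of in-flight SQL into an operations table
INSERT INTO OPS.SQL_ACTIVITY
  (CAPTURE_TS, APPLICATION_HANDLE, SESSION_AUTH_ID, ELAPSED_SEC, STMT_TEXT)
SELECT CURRENT TIMESTAMP, APPLICATION_HANDLE, SESSION_AUTH_ID,
       ELAPSED_TIME_SEC, STMT_TEXT
FROM SYSIBMADM.MON_CURRENT_SQL;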
We created one DMS LARGE partitioned tablespace where users are allowed to create their own tables to support their analytical reporting and decision support requirements. We developed a script that identifies, from WLM, the agent IDs that are currently executing, checks the SQL being executed via the db2pd utility, and notifies us when users run INSERT or DELETE SQL against their user tables. We then contact those users and teach them to use the load utility instead, either to load rows into a table or to delete all rows by loading an empty file into it.
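For illustration only (the USERTBLS schema and table name are placeholders, and the actual script works from WLM agent IDs and db2pd output rather than this view), the detection and the LOAD-based alternative look roughly like this:

-- Detect logged INSERT/DELETE statements running against user tables
SELECT APPLICATION_HANDLE, SESSION_AUTH_ID, STMT_TEXT
FROM SYSIBMADM.MON_CURRENT_SQL
WHERE UPPER(STMT_TEXT) LIKE 'INSERT INTO USERTBLS.%'
   OR UPPER(STMT_TEXT) LIKE 'DELETE FROM USERTBLS.%';

-- What we teach instead: LOAD, here via ADMIN_CMD; loading an empty file with
-- REPLACE empties a table without logging individual deletes
CALL SYSPROC.ADMIN_CMD(
  'LOAD FROM /dev/null OF DEL REPLACE INTO USERTBLS.MY_TABLE NONRECOVERABLE');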
We recently had to meet some federal and industry security requirements to restrict certain users' access to information in our databases without changing their IDs or anything else about them. We accomplished this by creating a subclass and workload named RESTRICTED. The workload uses the DISALLOW DB ACCESS option, which prevents a user from connecting to the database once we put the user ID in this workload. Our company could keep the user ID in the operating system, retain all of the other security information and history for auditing purposes, and still know that the information in our databases could not be accessed by that user.
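The essential DDL, with a placeholder ID (the POSITION clause is our assumption, to make sure the restricted ID matches this workload before any other):

CREATE SERVICE CLASS RESTRICTED UNDER USERS;

CREATE WORKLOAD RESTRICTED SYSTEM_USER ('BLOCKEDID')
  DISALLOW DB ACCESS
  SERVICE CLASS RESTRICTED UNDER USERS
  POSITION AT 1;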
We had some performance issues when we turned on too many of the activity information gathering options for WLM objects. We turned almost everything off except some of the activity tracking and the threshold violations, and we only use those event monitors to help us manage our workloads. Details about all of this are available in the PowerPoint presentation that accompanies this document. You can download that presentation, which includes our WLM scripts, from the “Files” section of the db2Dean and Friends Community page.
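A threshold violations event monitor and an activities event monitor of this general shape (the write-to-table form and the names are illustrative) are what the thresholds above report into:

CREATE EVENT MONITOR WLM_THRESH FOR THRESHOLD VIOLATIONS WRITE TO TABLE;
CREATE EVENT MONITOR WLM_ACT FOR ACTIVITIES WRITE TO TABLE;

SET EVENT MONITOR WLM_THRESH STATE 1;
SET EVENT MONITOR WLM_ACT STATE 1;    -- captures only what the COLLECT options on WLM objects allow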
We use DB2 LUW WLM (Workload Manager) daily to help us manage and monitor the SQL that our users and application teams run against our data warehouse database environments. We manually override the CPU, memory and I/O priorities to help particular groups get their work done faster when needed. We would not have been able to move all of our users onto one shared information resource without DB2 LUW Workload Manager. This feature has helped us manage the performance of our workloads while our data warehouse has grown from 15 terabytes to 50+ terabytes.
***
I hope that you found this discussion of the DB2 LUW Workload Manager useful. If so, please “Like” the post about this article at facebook.com/db2dean and leave any feedback that you think would be useful to Mark or others who may be considering the use of WLM.