My manager said I need to build a Data
Warehouse and I can't drive a forklift
A primer for the DB2 for z/OS DBA
Written By Jeff M. Sullivan
Data Warehouse on z Team
19 September 2008
Gasp! What do you mean I have to build a data warehouse? Before you panic in trying to make sense of all this, realize that you probably already know what makes up a data warehouse but are just lacking the definitions that are currently in fashion. Looking at business processes over the years there is one certainty - there is nothing new under the sun. At the end of this article, you should not only know the terminology but more important you will realize that you already have the knowledge, skills, and tools to manage a data warehouse in your z/OS environment.
Starting with the basics
Most zSeries shops have a data base management system (DBMS), possibly a few. You might have CICS or IMS transaction monitors for your on-line transaction processing (OLTP), perhaps a nightly batch processing time, and so on. If we put on the data warehouse glasses, these DBMS databases regardless of on-line, batch, or both are referred to as operational data stores (ODS) which means they are used for day-to-day processes. Do not let anyone confuse you because these are the very same databases that you already reorganize, image copy, and tune.
Notice that the data storage systems can be darn near anything - VSAM, IMS, DB2, flat files. The problem is having a standard storage model. Sure, you can have the tools for a heterogeneous DBMS environment for your warehouse, but because of the data-crunching required by data warehouse applications the warehouse itself may not have the speed and throughput necessary in order to meet your service-level agreements. Now, if you are running a zSeries machine for your ODS, a good solution would be to standardize on DB2 for z/OS. Why? It makes business sense since there is no need for retraining the underlying staff supporting the platform. And if you are running DB2 on z/OS, you already know about throughput compared to other platforms. It is like comparing an IV tube to a fire hose.
What’s all the hype on data warehousing?
As a DBA, you are probably asking yourself what the hype is all about. Easy! Remember all those 3GL languages like Easytrieve and DYL-280? Those were created to be able to quickly produce a report on-the-fly. There were 2 things which made these products work well for the time - they used some extract created by one to many files and they were usually historic in nature. Well, there is a third thing; those users could run their reports and leave YOU alone.
If you ask you friendly data warehouse evangelist the definition of a data warehouse, here is what you will get:
Data Warehouse (DW or sometimes WH) is a multi-subject oriented database populated from operational systems and/or Operational Data Stores (ODS’s). It is historical in nature (versus point-in-time) and typically contains detailed data. It is often looked upon as the single source of corporate “truth”. It is organized by subject, not by application, to support analysis. It is optimized for reporting and stored differently from transaction-oriented database (Star Schema, Snowflake or near third normal form). Finally, information is consistent across all subject areas, e.g. data like customer name and product description are consistent.
Yikes! Let's rewrite that definition again as if a DBA was explaining it:
A data warehouse is made up of the production data, both batch and on-line, supporting your customers. It can populate its' own set of tables, or it can be put into a materialized query table (MQT), or it can use a production table as long as performance is not a problem. We are DBAs, so we can't necessarily speak to the truth of the data but we do make sure it performs fast just like everything else. The criterion for back-up depends on the volatility of the data and the source. If we are using an MQT, a back-up may not be necessary unless the folks using the data have a timing issue (weekly data summary to the hour increment). And because the data is analyzed across subject areas, there are extracts that pull all this data into either MQTs or dedicated DW tables.
See? That was easy! One final definition concerns data extracts which not only pull in data from various tables but also may do some data massaging. This is referred to as data acquisition/data transformation. Data warehouse people will use an acronym to describe this process as ETL or Extract, Transform, and Load.
So the “hype” is the cool factor in all of this: it empowers your user community to not only get their own data, but they can start looking at data relationships. After all, it really is their data, not yours in I.T.
By the way, there is yet another really neat thing if you use Cognos as part of your solution. Remember the hours upon hours of time it would take you developing those entity relationship diagrams (ERD)? Cognos has a component called Framework Manager used to create the metadata model of the data used by two other components of Cognos called Report Server and Query Server. The metadata model created is nothing more than a very sophisticated ERD. That piece alone is worth the price of admission into the data warehousing world. We are no longer in the dark on the data we protect on our watch.
What is metadata? Remember being brought in on those data mapping exercises where some analyst gave you, the DBA, a piece of paper listing all the fields a new system was going to require? Then, being told it was your job to figure out how to populate those fields from the current system? What we’d have to do is talk to using departments, run extracts and SQL, and go through programs and the scant documentation to figure out how fields were used. That exercise created the information on the fields used by the current system. It is, for lack of a better definition, data about data.
If you have an ERD and can clearly explain every field with context of use as well as show the relationships to other data then you would have a metadata model. Why do you need it? At the heart of any business intelligence system is the requirement to correlate data. As a DBA, this may or may not be your problem. But having the system give you a relationship diagram is very nice indeed.
We talked about an Operational Data Store (ODS) above. DW people define it as a set of tables to support queries for point-in-time information for a specific business function such as payables, premiums, credit card transactions. It is highly de-normalized and is typically similar to a transaction record. It is up-to-date (vs. historical) and detailed (vs. summarized). Remember those old Easytrieve flat files?
A Data Mart (DM) is a database designed to support the analysis of a particular business subject area. Data Marts are typically, though not always, extracts from a central data warehouse. Data has usually been transformed and aggregated from the source DW or operational system. Data Marts can be relational, multidimensional (OLAP) or statistical in nature. See below on definition of OLAP. Think subset table(s) with data marts.
A Star Schema consists of one or many fact tables referencing any number of dimension tables. As a DBA, what does this sound like? Referentially defined tables should come to mind. If you haven’t guessed already, a fact table is data that are facts while a reference table is a special twist of the data. For example, a COMPANY_SALES with columns DATE, PRODUCT_ID, STORE_ID, and AMOUNT would be a good fact table, while a dimension table referencing this fact table would have columns PRODUCT_NAME, PRODUCT_BRAND, PRODUCT_MANUFACTURER. Can a dimension table be slightly de-normalized? Yes. Can it be a view? Yes. The point is that this is just a data warehouse concept but a DBA should know the meaning behind-the-scenes.
From a DBA standpoint, these above definitions are more of a question of usage. You would probably see an MQT or possibly a view set up for a data mart but would probably have dedicated tables for the operational data store. Again, it is just semantics.
By now you can probably see the need to have a separate database name for these data warehouse tables. You will want to keep them isolated for easy identification as well as performance tuning. And from an SQL standpoint, expect to see just about anything, so be prepared for some bad SQL. Perhaps turning on the SQL governor would be a good idea. Also, have the system programmer handling Workload Manager (WLM) carefully watch DDF and ssidDIST workload.
The need for tools
Use a real-time performance monitor such as IBM Tivoli Omegamon XE for DB2 Performance Expert on z/OS where you can drill into the SQL. If you have production DB2 data, you probably have a suite of tools you use for backup, recovery, and reorganization. You might also need a fast unload utility to create some of these warehouse tables.
So far we've talked about Data Acquisition (extracts/data massage) and Data Storage (DB2 using MQTs, temp tables, and dedicated DW tables). The final 2 pieces are Information Generation and Information Presentation. Typically, these include end user tools for query, reporting, analysis, and dash boarding.
Going back to the DBA perspective, there really is not much to do here other than getting the distributed mainframe connections working and tuning the occasional out-of-control query. This may take time for both of these because, typically, the data warehouse / business intelligence users work in the actual department requiring the data. These folks are not IT technicians and will require your help. From personal experience the DBA department is usually the dumping ground for unknown IT issues.
To recap on tools used in IBM-centric data warehouse environments, expect to see and use the following:
At the end of the day
Here are some more terms you will hear. As a DB2 DBA, you will want to know these terms because your end-user customers will be using these terms but try not to worry too much about them:
There is no great mystery in setting up a DB2 data warehouse. Understand that these warehouses are 2-tier and 3-tier architected products in most cases. They take data from about anywhere on your production system but store it centrally and if you are using z/OS, ideally on DB2 for z/OS. Before any data is stored in either dedicated tables or MQTs there may be some data manipulation and data massaging. The DB2 objects may be quite large in size so be prepared with a sizeable DASD pool along, enough real storage for your buffer pools, and the use of partitioning.
Finally, once they are in place, there might be changes to the tables as your end users start realizing the full potential of a data warehouse.
Just remember to breath!
Jeff Sullivan is a Software Specialist working with the Data Warehouse for System z Swat Team at IBM's Silicon Valley Lab. He has worked on IMS DB/DC and DB2 for most of his 28 years on systems ranging from health care to aerospace to banking and has been both a DB2 systems programmer and DB2 database administrator. His expertise is focusing on DB2 performance tuning for z Series, but also has assisted in DB2 installation, workload manager set-up, and DB2 education with knowledge transfer. He may be reached at firstname.lastname@example.org.