Managing your Youngest and Oldest Data

Dean Compher

27 December 2011

 

 

The number and size of everyone’s databases seem to keep growing, but there is no additional help to keep up with the additional work of making them perform well.  You are expected to keep several test/QA environments updated with new data, but you can’t copy any sensitive information, and those sensitive fields still have to look like real data.  Your production databases are overflowing with old data, but you are not allowed to delete anything because someone might need that 8 year old record.  You are caught in the middle because the existing hardware will not perform well with that much data and no one will allow you to purchase more, yet you are still expected to meet those strict service level agreements.  As if that were not bad enough, you’ve got systems still around that are no longer used, but no one really knows what is in them, so you have to keep that old software running with its associated databases because you might have to produce some report from it.  The good news is that IBM provides the integrated InfoSphere Optim Information Lifecycle Management (ILM) set of tools to manage these situations for you, whether you are using our databases or our competitors’. 

 

The Optim ILM tools all use the same shared code base and have the same common interfaces for doing similar actions.  They all operate on many database platforms including Oracle, Informix, SQL Server, Sybase, DB2 and several others, and they work for databases on most Linux, UNIX, Windows, System z, System i and even VSE operating systems. 

 

One of the key shared functions is the implementation of the concept of the “Complete Business Object,” where Optim understands the relationships between tables in individual databases and across databases.  Because the Optim ILM tools understand these relationships, they can intelligently copy data out of a database or set of related databases.  With the complete business object you can define criteria that select rows in one table, and the Optim ILM tools will know to also pull the rows in all related tables, whether they are directly related or indirectly related through many generations from the start table.  This goes on until all generations above and below the designated tables have been pulled.  For example, this allows you to easily create an archive request for orders that are older than 2 years, and Optim can automatically archive all rows related to those orders, including address and order detail rows and rows related to those rows.  In another example, I might want to create a test database for customers from a certain region because of an application issue with that region that I want to test.  Again, I just need to define the criteria and the Optim ILM tool will know how to copy all of the related rows to the target databases. 
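
To make the idea concrete, here is a minimal sketch of how a complete-business-object traversal could work.  The table names, relationships and rows are hypothetical and held in Python dictionaries purely for illustration; the actual product works against live databases and its own relationship definitions.

    # Conceptual sketch of a "complete business object" traversal.
    # Tables, relationships, and rows are hypothetical illustrations.

    # Each relationship says: child.column refers to parent.column
    RELATIONSHIPS = [
        ("ORDERS",       "CUST_ID",  "CUSTOMERS", "CUST_NO"),
        ("ORDER_DETAIL", "ORDER_ID", "ORDERS",    "ORDER_ID"),
    ]

    TABLES = {
        "CUSTOMERS":    [{"CUST_NO": 1, "NAME": "Dean"}],
        "ORDERS":       [{"ORDER_ID": 10, "CUST_ID": 1, "ORDER_DATE": "2009-03-01"},
                         {"ORDER_ID": 11, "CUST_ID": 1, "ORDER_DATE": "2011-06-20"}],
        "ORDER_DETAIL": [{"ORDER_ID": 10, "ITEM": "widget"},
                         {"ORDER_ID": 11, "ITEM": "gadget"}],
    }

    def extract(start_table, predicate):
        """Select rows from the start table, then pull every directly or
        indirectly related row, walking both up to parents and down to children."""
        result = {start_table: [r for r in TABLES[start_table] if predicate(r)]}
        pending = [start_table]
        while pending:
            table = pending.pop()
            for child, child_col, parent, parent_col in RELATIONSHIPS:
                # Walk down: rows in child tables that point at rows already selected
                if parent == table:
                    keys = {r[parent_col] for r in result[table]}
                    rows = [r for r in TABLES[child] if r[child_col] in keys]
                    if rows and not result.get(child):
                        result[child] = rows
                        pending.append(child)
                # Walk up: parent rows that the selected child rows point at
                if child == table:
                    keys = {r[child_col] for r in result[table]}
                    rows = [r for r in TABLES[parent] if r[parent_col] in keys]
                    if rows and not result.get(parent):
                        result[parent] = rows
                        pending.append(parent)
        return result

    # Example: extract orders placed before 2010 plus all related rows
    old_orders = extract("ORDERS", lambda r: r["ORDER_DATE"] < "2010-01-01")
    print(old_orders)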

 

The Optim ILM tools provide a number of ways to define the complete business object.  They will import the relationships defined in databases and in certain data modeling tools, and you can also enter them manually.  There is even another tool, InfoSphere Discovery, that will discover those relationships based on the data in the tables.  

 

The Optim ILM tools have an internal scheduler that can run your processes automatically, and they also provide a set of commands that allow you to run them from a script called by your favorite job scheduler like “cron”.  When you create extract criteria, you can also specify that variables are to be used so that you can change the criteria each time you call the job.  For example, this week when I run my job to refresh my test databases I pass in a date of “15 June 2011”, but last week I passed in a date of “8 June 2011” to get the orders entered since that date. 
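
As a rough illustration of that variable idea, a wrapper script called by cron could compute the cutoff date and pass it to the job.  The command name and option names below are placeholders of my own, not the actual Optim command-line syntax, so treat this as a sketch of the pattern rather than a working invocation.

    #!/usr/bin/env python
    # Placeholder sketch of a cron-driven wrapper that passes a date variable
    # to an extract job.  "run_extract_job" stands in for whatever command or
    # API your scheduler actually invokes; it is not the real Optim CLI.
    import subprocess
    from datetime import date, timedelta

    # Refresh with orders entered in the last 7 days
    cutoff = (date.today() - timedelta(days=7)).strftime("%d %b %Y")

    # Substitute the cutoff into the extract criteria variable, e.g. ORDER_DATE >= :since
    subprocess.run(["run_extract_job", "--request", "WEEKLY_TEST_REFRESH",
                    "--var", "since=" + cutoff], check=True)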

 

I will now provide a short summary of each of the ILM tools. 

 

 

InfoSphere Optim Test Data Management with Data Privacy (TDM/DP)

 

A major problem in many organizations is having good, up-to-date test data in their test/dev/QA environments that is stripped of sensitive data like credit card numbers or names, while replacing those values with realistic data that is still good for testing.  TDM/DP takes care of these issues and others as well.  One of the other big issues that TDM/DP solves is that of the non-production environments in total being much bigger than production, because there are often several full copies of production in those environments.  Because TDM/DP uses the complete business object, it is easy to get just the subset of data you want by specifying the criteria.  You can even choose to get every Nth row of the start table along with all of its related data.  This even works across several databases.

While it is copying the data, TDM/DP is also obfuscating (masking) the sensitive fields.  There are several functions built into the product to mask common fields like first name, last name, full name, address, e-mail, social security number, credit card number and many others.  As part of the test data management definition that you create, you pick which functions you want to run on the various fields to mask them.  If none of the dozens of functions included in the tool meets your needs, then you can write your own and call it from the tool.  These functions mask data in context.  So, for example, a Visa credit card will still be a Visa with the appropriate check digit after masking, but the number will be different from the number in the source database.  For strings like names and addresses we typically use a lookup function.  For these functions we provide thousands of good names, valid postal addresses, and other strings that the lookup functions can use to get a good replacement value.  So, for example, Dean might be replaced by Larry, but it would not be replaced by “cXkp%xyt”, which would not make any sense to your testers. 
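
A minimal sketch of the in-context masking idea follows.  The Luhn check-digit logic is standard, but the helper names and the tiny lookup list are my own illustration, not the product’s built-in masking routines.

    import hashlib
    import random

    # Tiny stand-in lookup list; the product ships with thousands of real-looking names.
    NAME_LOOKUP = ["Larry", "Maria", "Priya", "Chen", "Aisha", "Carlos"]

    def luhn_check_digit(digits):
        """Compute the Luhn check digit for a string of digits (without the check digit)."""
        total = 0
        for i, d in enumerate(reversed(digits)):
            n = int(d)
            if i % 2 == 0:          # double every second digit, starting from the right
                n *= 2
                if n > 9:
                    n -= 9
            total += n
        return str((10 - total % 10) % 10)

    def mask_card(card_number, seed):
        """Replace a card number with a different but still Luhn-valid number
        that keeps the original issuer prefix (so a Visa stays a Visa)."""
        rng = random.Random(seed)            # repeatable masking for a given seed
        prefix = card_number[:6]             # keep the issuer identification number
        body = "".join(str(rng.randint(0, 9)) for _ in range(len(card_number) - 7))
        partial = prefix + body
        return partial + luhn_check_digit(partial)

    def mask_name(name):
        """Deterministically map a name to a replacement from the lookup list,
        so the same source name always masks to the same target name."""
        h = int(hashlib.sha256(name.encode()).hexdigest(), 16)
        return NAME_LOOKUP[h % len(NAME_LOOKUP)]

    print(mask_card("4111111111111111", seed=42))   # still 16 digits, starts with 411111
    print(mask_name("Dean"))                        # e.g. "Larry", never random gibberish

Deriving the seed from the source value keeps the masking repeatable, so the same card always masks to the same replacement and related rows still join correctly after masking.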

 

Once you have your various definitions, you can automate the refreshes of your non-production systems either by using the scheduler in TDM/DP or by calling it from your scripts.  That way, you can have several non-production environments of different sizes, based on different extract criteria, that have sensitive data removed and are refreshed frequently.  This is just a high level summary of what TDM/DP does.  It has many more very useful features for creating and managing your test databases.

 

 

InfoSphere Optim Data Growth Solution

 

One common problem of production databases these days is that they contain much more data than is needed by the primary applications using them.  This is because a lot of historical data is kept in the main production databases because someone needs that data on some infrequent basis.  Often these databases have more historical than current data, and this causes performance problems.  The answer to this problem is to archive the data with the Optim Data Growth Solution.  This archive tool will copy the data from the database into Optim extract files and intelligently delete the data from the database.  The good news is that Optim will manage these files for you and allow access to them using SQL through an ODBC or JDBC interface.  Most organizations use their favorite report writer, like Cognos or Crystal Reports, to access that data as needed.  Optim also copies the table definitions, so it knows the table names, column names and data types, making the job of creating reports as easy as it is with a regular database.  This also allows you to easily copy the data back into the existing database or a different database by just pointing Optim at that database and telling it to load the data. 
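
For example, a report or script can read the archived data over ODBC much like a regular database.  The sketch below assumes the Python pyodbc package and an ODBC data source already configured for the archive; the DSN, table and column names are made up for illustration.

    # Hypothetical example of reading archived data over ODBC with pyodbc.
    # The DSN, table, and column names are illustrations, not a real configuration.
    import pyodbc

    conn = pyodbc.connect("DSN=OPTIM_ARCHIVE")       # DSN set up for the archive
    cur = conn.cursor()

    # Query archived orders just as you would against a regular database
    cur.execute("""
        SELECT ORDER_ID, ORDER_DATE, TOTAL
        FROM ORDERS
        WHERE ORDER_DATE < ?
    """, "2009-12-31")

    for order_id, order_date, total in cur.fetchall():
        print(order_id, order_date, total)

    conn.close()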

 

The Optim archiving puts this data into compressed, immutable (can’t be changed) files that are stored on whatever tier of storage is appropriate.  This saves money because you can often save DBMS license costs, the data is compressed, and the data sits on cheaper storage than your database files.  Optim can also manage these files through their lifecycle, moving them to progressively lower tiers of storage and finally to tape, automatically keeping track of the files for you.  Because the files are immutable, the auditors will love this way of archiving data.  You can also index the data to make queries faster. 

 

To make archiving easier, Optim uses the same complete business object that the other Optim products do.  Once you have that complete business object defined, you can create criteria such as removing all data that is older than, say, 2 years.  So, using our order example, you can easily tell Optim to archive all orders older than 2 years and to get the order details as well, because the complete business object (CBO) defines that relationship.  Because of the CBO, Optim will also know that the customer row for the order is related, and you can specify that you only want the customer copied to the archive, not deleted from the database.  That way the customer is still in your database for when they place their next order, but when someone queries an order that has been archived to Optim, they can easily see the related customer data as well. 
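
Conceptually, the per-table choices in such an archive request boil down to something like the structure below.  The names and layout are my own illustration of the decisions involved, not Optim’s actual definition syntax.

    from datetime import date, timedelta

    # Hypothetical sketch of an archive request built around the complete business object.
    # Real Optim definitions are created in the tool; this just illustrates the choices.
    archive_request = {
        "start_table": "ORDERS",
        # Selection criteria on the start table: orders older than 2 years
        "criteria": {"ORDER_DATE <": (date.today() - timedelta(days=2 * 365)).isoformat()},
        # Related tables pulled in through the complete business object,
        # each with its own archive/delete behavior.
        "related_tables": {
            "ORDER_DETAIL": {"archive": True, "delete_from_source": True},
            # Customers are copied to the archive for reference, but kept in
            # the production database for their next order.
            "CUSTOMERS":    {"archive": True, "delete_from_source": False},
        },
    }
    print(archive_request)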

 

Optim archiving can also be scheduled every week, month, quarter or other desired interval to archive any additional data that has aged into your archive window.  This can be done using the scheduler in Optim or by running the archive from a script.  You do need a component called “Optim Connect” (formerly known as ODM) to allow query access across multiple archive files.

 

 

InfoSphere Optim Decommissioning Solution

 

I know many companies that keep old systems around that are really not doing their original function, but are kept on life support just in case anyone needs to query some data in them.  Some of these organizations are keeping their fingers crossed that these old systems don’t fail, because the “ancient ones” who knew how to fix them have long ago travelled to other galaxies.  The resolution to this quandary is the InfoSphere Optim Decommissioning Solution.  This product is much less expensive than keeping old software and the related database licenses around, and it still provides easy access to the data.  It is just a special license model of the InfoSphere Optim Data Growth Solution described above.  The main difference is that instead of archiving data incrementally as it ages, the decommissioning (application retirement) solution archives everything to a set of files over a short time, while allowing the same SQL access through ODBC/JDBC.  Again, most people will use their favorite report writer to access the data. 

 

 

InfoSphere Discovery

 

At this point you may be saying to yourself that this all sounds great, but no one around here knows the relationships between our tables and there are no foreign keys in the database to define them.  We can solve that problem with the InfoSphere Discovery tool.  You just point it at a database or a group of databases and turn it on.  It will compare the data in every column of each table with every other column in the other tables and suggest the relationships that likely exist.  For example, an order table may have a column called “ID” that contains the customer numbers that relate to the “CUST_NO” column in the customer table.  By comparing the data in the customer and order rows it may find statistics like:

 

- 99.3% of the values in the ID column of the order table have a matching value in the CUST_NO column of the customer table

- 91.0% of the values in the CUST_NO column of the customer table have a corresponding value in the ID column of the order table

 

From this, Discovery would deduce that you likely have a parent-child relationship between the customer and order tables and show it in the E/R diagram that it produces after running.  Another way to look at this is that most of your customers (91.0%) have orders, while several do not have orders yet.  Also, all orders should have a corresponding customer, but since there are no foreign keys to enforce that, some garbage rows have crept in over time, so only 99.3% of orders have a corresponding customer.  Discovery is built to find these relationships, show them to you and let you remove the ones that are just a coincidence.  This can save an enormous amount of time over looking for these things manually.  Discovery can also find complex relationships such as multi-part keys, substrings of columns that contain a foreign key and even patterns of columns that indicate a relationship.  Discovery can also find sensitive data such as e-mail addresses, credit card numbers and the like, based on patterns in the data. 
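
A toy version of the value-overlap test that Discovery automates might look like the following; the column values and the 90% threshold are made up for illustration.

    # Toy illustration of value-overlap statistics used to suggest relationships.
    # Column values here are made up; the real tool scans actual tables.
    customer_cust_no = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}
    order_id_values  = [1, 1, 2, 3, 3, 3, 5, 7, 8, 9, 10, 42]   # 42 is a garbage row

    def overlap(values, reference):
        """Percentage of values that have a matching value in the reference column."""
        hits = sum(1 for v in values if v in reference)
        return 100.0 * hits / len(values)

    child_to_parent = overlap(order_id_values, customer_cust_no)        # ~91.7%
    parent_to_child = overlap(customer_cust_no, set(order_id_values))   # ~72.7%

    print(f"{child_to_parent:.1f}% of ORDERS.ID values match CUSTOMERS.CUST_NO")
    print(f"{parent_to_child:.1f}% of CUSTOMERS.CUST_NO values appear in ORDERS.ID")

    # If most child values resolve to a parent value, suggest a candidate
    # parent-child relationship for a human to confirm or discard.
    if child_to_parent > 90.0:
        print("Candidate relationship: CUSTOMERS.CUST_NO -> ORDERS.ID")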

 

Once Discovery finds the relationships, they can easily be imported into any of the tools above, giving you that complete business object.  I may be going out on a limb, but in my opinion you can’t do any of the test data management or archiving functions well without having the relationships, whether you use Optim tools or not. 

 

 

***

 

These tools are very useful and save many organizations a lot of time and money.  Should yours be the next to start using them?  Please take a moment to add your thoughts to my db2Dean Facebook Page or to the “Message Board” section of my db2Dean Community page. 

 

 
