DataWorks
Dean Compher
20 April 2016
One of the great things about the cloud databases and other analytics tools provided on Bluemix is the ability for anyone to inexpensively and easily deploy a data base for analysis. You do not even have to know how to administer a database to do this. But once you have the database deployed, how do you get the data into that database in the format you want it? The answer is IBM DataWorks. It is a tool that makes it easy to shape the data from your on premises sources into your cloud data store or even Watson Analytics. Through an easy to use interface it allows you to join tables from your source, filter data, remove or rename columns, remove spaces, change case and several other things. It also shows you a lot of information about the quality your data. It is undergoing rapid development, so any feature not shown in this article may be there in the weeks and months ahead.
DataWorks is a Bluemix service that you deploy with a few clicks. Once deployed, it allows you to configure your source and target data stores and then define how you want to copy data from the source to the target and which transformation (shaping functions) you want to apply to the data as it is being copied such as filtering, removing columns joining, etc. When you create the set of refinements from a particular source to a particular target, you create an activity and give it a name when you save it. You can then run that activity over again anytime you like. You can also crate several different activities and save them for use under various circumstances. To protect the data in transit from your on-premises data sources, you can configure a secure gateway.
I start this discussion by showing you some of the more interesting things you can do with DataWorks, and then give an overview of how to begin using it. You can also see the documentation for DataWorks. If you go to this link and scroll down, you can see the list of sources and targets. This list is being increased rapidly, so I didn’t add the list here because it would get out of date so quickly.
Once you have configured connections to your target and source, entered the refine process, selected the source tables with which you want work you get to the screen shown here where you begin the process of refining your data. To make it easy to perform the function the first 1000 rows from each table are selected and shown as you proceed:
Here I am in the refine and copy panel. You can see information about the quality of the data. You can also see that I’ve selected three tables to work with called Order_Details, Order_Header and Product. Order_Details is selected and you can see the first few columns and some data in them. You can see this information for the other tables just by clicking on them. Along the left side of the table you can see a number of table level functions. Some of the most interesting refinements that you can do you get to by clicking on the column name that interests you.
Shaping Columns
Using the shaping functions you can do several things to the data including removing columns, running functions on the data, sort and filter. The functions you can choose is dependent on the data type of the particular column. As you can see here, I chose a character column and I get the “TEXT” choice that is not on non-character columns like integers and dates.
Function |
Description |
Properties |
Shows you information about the column like the data type, length, table name, name of the column in the source table and some quality information like cardinality and number of empty values. This dialog box also allows you to change the column name. That name will be used throughout the rest of your activity and will be used when creating the name in the target. |
Filter |
Allows you to work with rows that contain only the values you select in this dialog box. |
Sort |
Allows you to sort the rows in the table by this column |
Remove duplicates |
If selected will select only one row with this value when extracting data from the source. |
Text |
This dialog box allows you to trim spaces in a variety of ways and allows you to change the case of the data to all upper case, all lower case or “Title Case”. This will be applied when you run the activity. |
Remove column |
Columns removed with this function will not be put into the target data store, and will be removed from view while working with this activity. |
Table Refine Level Functions
As you can see in the figure below, the main table and activity level functions are listed to the left of the table of data. In this case I’ve selected the “Metrics” option which inserted a row above the first data row showing quality data for each column. For Column, Fix data and Metrics functions you will be working with one table at a time. So be careful to select the correct table along top of the menu to make sure you are working with the right one. In this instance I’m working with the ORDER_DETAILS, ORDER_HEADER and PRODUCT tables.
Columns
In the columns dialog box you can remove columns from view and reorder the columns. To rearrange the columns with in the selected table just drag individual columns to where you want them to be. You hide a column by unchecking it. It is only hidden from view while you are creating your activity. It will actually be copied when the activity is run. We discussed how to actually remove columns from the activity above.
Fix Data
When you select a table and click “Fix data” it will show you any data quality issues for individual columns. Types of data quality issues include a significant number of missing data fields or non-unique values when DataWorks thinks it should be unique. You can check any fixes you like and click the apply button. This will cause any columns with bad data to be removed from your view and from the target when the job is run.
Metrics
By choosing a table and then clicking Metrics, you will be shown information about the sample data in the first row of the table as shown above. You can hover over the information presented to be shown more details.
History
Click the history button at any time to see the actions that you have taken on the tables and columns that will be implemented when you run the refine process.
Join
One of the things that you can do once you selected the tables that you work with is to join them. This allows you to combine the columns of various related tables into one table in the target database. DataWorks will create the new table in the source if you want it to. This makes some analysis tasks much easier. There are a number of ways to join the tables including only using rows if the common join column(s) are in both tables or if they are in one or the other or both. In the join box, you can see the following join types and select one.
Getting Started
Now that I’ve shown some of the more interesting things that you can do with your data I’ll briefly describe how to get started using DataWorks. This is by no means a comprehensive, step-by-step set of instructions. I’m just providing a few hints about what steps to take to get started. I found it was pretty easy to figure these things out, so I recommend that you just get in and clicking around. Generally the steps you take to get started are:
You can also call DataWorks using the API.
***
There are several other things that you can do that I did not cover in this short summary. If you find any features that you particularly like please post them to my Facebook Page or db2Dean and Friends Community.