CPD First Steps
31 July 2020
In my previous article I discussed how IBM Cloud Pak for Data makes it easy to find data and see all sorts of interesting things about that data like data types, quality measures, associated business terms, data profiles, and data classifications. How did those attributes get linked and defined? It would be a lot of work to do that manually on more than a trivial amount of data. The really great news is that IBM Cloud Pak for Data does much of that work for you with advanced algorithms to make those connections and measurements and add all or part of that data into your catalog. In this article, I’ll describe how this gets done.
This article assumes that you have installed IBM Cloud Pak for Data (CPD) and deployed the needed features like Watson Knowledge Catalog. For details on installing CPD, please see the Planning and Installing sections of the CPD Knowledge Center. To see how to do many of the items I describe in this article, watch Halle Burns’ YouTube First Steps video.
The first thing that you need to do with your new IBM Cloud Pak for Data (CPD) is to add users either locally in CPD or using LDAP, and assign roles to them that allow them to connect to remote data sources and files, and run the processes that evaluate the data. There are several predefined roles with various authorities and you can modify those roles and create new roles. See the Predefined Roles and Permissions page for descriptions. Your administrators and data engineers will likely be among the first users assigned. The predefined roles are:
· Business Analyst
· Data Engineer
· Data Quality Analyst
· Data Scientist
· Data Steward
Next, users with Data Engineer or Data Steward roles will create good business terms, policies and rules for at least the top 50 to 150 critical data elements. They can be imported from certain other dictionaries or entered manually using CPD input screens. Either way, you should define these with good descriptive names, short and long descriptions and abbreviations. These will be important later on as all four of these attributes are used when CPD matches terms with columns in sources you connect or files you upload.
Review the built-in data classes, disable the ones you don’t need, create new classes and link them to the business terms. Classes include things like indicator, code, birth date, city, driver’s license, etc. This helps CPD determine the right business terms and classifications of data that you connect. See the data classes instructions page.
You can create data quality rules for certain types of data elements that get applied to columns in data sources that are related to that term. For example, let’s say you have a standard format for US Social Security number. It may be something like 000000000 or “000-00-0000”. You define this rule and then when CPD scores the quality of a data source, it can tell you the percentage of values in the column that meet your rule. This way you can determine how well sources conform to your standards.
Review the built-in data quality dimensions, disable the ones you don’t want and create new ones that are important to your organization. There are 10 dimensions that apply to any column. Examples include missing values, duplicate values and data type violations. New quality dimensions are created as new data quality rules as described above.
While you are doing the above items, you can start connecting your databases and other data sources to CPD and begin uploading files that you wish to use in your analysis. However, you will want to wait until you have entered the items above before running the built-in automated discovery process on your sources. Data sources include:
· Databases (relational and NoSQL)
· IBM S3 and Amazon S3 storage
· Google Cloud Storage
· File servers with CSV and Excel files on which you are allowed to install the Data Virtualization connector
You will need someone with credentials to these sources to configure the connections. For databases, the user id used to make the connection needs to have access to the database’s catalog that contains the schema of the database. The catalog has things like column names and data types that you will ultimately want copied to the catalog of data assets in CPD.
When adding a connection, you can add it as shared or private. Shared means that any user accessing actual data in the source, for example when previewing data, will be shown the data with the connection being made using the credentials entered by the user who created the connection. Private means that anyone accessing the data will be prompted to enter their own user id and password that is defined in the data source. On the surface it would seem that making connections private may be the best way to go, but that would mean that all users who may need to use the private sources in any capacity would need a user id in those sources. The good news is that CPD allows governance rules that control access to those shared connections with governance policies and rules that you define. Therefore, within CPD you can create shared connections, but allow or disallow access to individual sources and possibly masking data in sensitive columns by user. That will be a topic for another article though.
You can add a new data source in the Collect feature of CPD. This way of adding a new source only allows you to create a private connection. Alternatively, you can add a source to CPD from the Catalog that allows you to select whether you want a shared or private connection. From the main menu choose All Catalogs, then select desired catalog if there is more than one, then click Add to Catalog link and select Connection from the menu. Then select the desired service and enter the credentials where prompted and choose the shared or private radio button.
Once you have entered your governance assets like terms, data classes and quality rules and at least one data source, you are ready to discover assets. This is where the magic happens. You can run a Quick scan or Automated discovery. The quick scan is good for fast initial analysis of many tables from unfamiliar data sources and evaluate whether it is worth running an automated discovery. With automated discovery, database names, table names, column names, data types and other metadata gets copied to the catalog. Further, sophisticated analysis is done on the table and column names as well data within them to assign the business terms and data classes as well as determining quality scores, frequency distributions and more. You can then review the results of the analysis and then choose whether to publish that information to the catalog or not. This is where you can verify that your quality rules and class information is what you want, and if not tweak them and rerun discovery. When everything looks good, you can publish it to the catalog where all this information can be used to search and understand the data that anyone finds.
You run the automated discovery process as shown in the Running automated discovery instructions. You can run the automated discovery process on the entire database, on just the tables with a particular schema name or on just individual tables. For efficiency’s sake, it is best to run discovery on just the schemas or tables you need. Once you have performed the steps in this article, analysts and data scientists in your organization are ready to easily find and use the data they need as described in my Finding and Using Data article.
Over time you will add more data sources and when you do you should run the discovery process on them to build up your catalog of metadata as well as link the fields in those sources to your business terms. This way, your ability to easily use varied data sources in your analysis continues to get better and better. If you would like to see how to actually perform many of the steps I describe in this article then you can watch Halle Burns’ YouTube First Steps video.
These are just the basic steps of configuring Cloud Pak for Data to allow you to organize your data. If there are other steps you think others would benefit from knowing, please tell us about them on my db2Dean Facebook Page and share your thoughts about them.