Finding & Understanding Your Data
Dean Compher & Halle Burns
29 May 2020
It has been said that when doing data analysis, including the use of Artificial Intelligence and Machine Learning to develop predictions, having a lot of good data is more important than having more sophisticated algorithms. Your organization may have lots of data, but if you can’t find it or understand what you’ve found, then all that data is of little use. To solve this problem IBM has introduced Cloud Pak for Data, a platform for consolidating access to all of your data, cataloging the metadata, organizing it and then using it in analysis, as I described in my previous overview article. In this article I’m going to focus on how you can use the platform to both find data and understand what you find. With Cloud Pak for Data you can see the categories and classifications of the data assets you find, the associated business terms, a profile of the data in the columns, the quality of the data and other useful information so that you can decide if what you found works for your purposes.
In this article, I will highlight a number of the useful features of Cloud Pak for Data so that you can get an idea of the sorts of things that can be done to find and evaluate your data assets. Please see Halle Burns’ Finding Data with Cloud Pak for Data video to see how you can use many of these features and others that I only mention briefly.
An essential aspect of being able to find your data is having connections to it in Cloud Pak for Data (CPD). CPD generally uses JDBC to connect to remote data sources. Files in a variety of cloud storage services, including Amazon S3 and IBM Cloud Object Storage, can also be connected. CSV and Excel files on servers in your organization can be connected to CPD using an agent you install there, and users can upload their own files to be stored in CPD. As these various types of connections and files are added, users should allow CPD to crawl them to capture metadata like file name, table name, column name and data type, as well as to evaluate quality and assign terms and categories. By doing this, your organization builds a catalog of your data assets that can be searched and appraised. How this is done will be the subject of a later article, but you can see more in Halle’s video right now. You should note that the administrators of the system can control access to the data and metadata by user or role so that individual users will not be able to see data from which they are restricted.
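To make the idea of crawling a source for metadata more concrete, here is a minimal Python sketch. It is not how CPD does it and does not use any CPD API. It assumes a local SQLite database as a stand-in for a connected source, and the crawl_metadata function and the HOSPITAL_INFO column names are made up for illustration.

```python
import sqlite3


def crawl_metadata(conn):
    """Return one record per column: table name, column name, declared type."""
    records = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table_name,) in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk)
        for _cid, col_name, col_type, *_rest in conn.execute(
            f'PRAGMA table_info("{table_name}")'
        ):
            records.append(
                {"table": table_name, "column": col_name, "type": col_type}
            )
    return records


# Hypothetical table standing in for a connected data source.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE HOSPITAL_INFO (HOSPITAL_ID INTEGER, NAME TEXT, BEDS INTEGER)"
)

catalog = crawl_metadata(conn)
for record in catalog:
    print(record)
```

A real catalog would also store things like the connection it came from, quality scores and assigned terms, but the table, column and type records above are the core of what a crawl collects.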
The search capability is such a fundamental aspect of CPD that there is a search bar on the top of nearly every screen in CPD where you can always easily search the catalog of assets. For example, if I need to do some analysis on hospitals, I can search on “hospital” and see what is out there for it. In example 1, you can see that I searched on “hospital” and the results included the “Hospital” business terms and Data Assets including the HOSPITAL_INFO table and the HOSPITAL_READMISSION.csv file.
Example 1. Search for data
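If it helps to think about what a catalog search does, the following Python sketch shows a simple keyword match over a list of catalog entries. This is only a toy model under my own assumptions and not the CPD search implementation. The CatalogEntry class, the search_catalog function, the term description and the CUSTOMER_ORDERS table are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    name: str
    kind: str  # for example "term", "table" or "file"
    description: str = ""


def search_catalog(entries, keyword):
    """Return entries whose name or description contains the keyword."""
    needle = keyword.lower()
    return [
        entry
        for entry in entries
        if needle in entry.name.lower() or needle in entry.description.lower()
    ]


# Toy catalog; the description and the last entry are invented.
entries = [
    CatalogEntry("Hospital", "term", "A facility that provides inpatient care"),
    CatalogEntry("HOSPITAL_INFO", "table"),
    CatalogEntry("HOSPITAL_READMISSION.csv", "file"),
    CatalogEntry("CUSTOMER_ORDERS", "table"),
]

hits = search_catalog(entries, "hospital")
for hit in hits:
    print(hit.kind, hit.name)
```

The point is that one search returns different kinds of assets together, terms as well as tables and files, which is what makes the drill-down described next possible.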
You can click on any of the items found and drill into them more. There are many things you can see about the items including related items. I show a few examples below to illustrate this, but there are many more things that you could do than I can explain here without being even more boring than my articles typically are. However, you can try them out with a free demo system that you can explore on Cloud Pak for Data Experiences.
You could click on a table or file and see the information about it including which terms it may be associated with. If you clicked on the Hospital term, you would see the description for it, any categories to which it belongs and other information about the term itself. The “Term” page will also have a link called Related Content. Clicking this link shows all of the items related to the term as you can see in example 2.
Example 2. Assets related to the Hospital term
Why is this interesting? Because if your organization has lots of different data, you can search on things that are of interest and find what has already been cataloged in the system, see relationships between your terms and your files, tables and columns, see how they are categorized and even see what ratings and reviews other users of the system gave them. You can not only find the data sets that are of interest, you can learn more about them and possibly see ones that you were not aware of previously. You can even see statistics about the data set like numbers of columns and rows and quality of the data as shown in example 3 for the HOSPITAL_INFO table.
Example 3. Quality information of HOSPITAL_INFO
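As a rough illustration of the kind of statistics a profile contains, this Python sketch counts rows and columns in a small CSV and computes the fraction of non-empty values per column. Completeness is only one possible quality measure and I am not claiming it is the one CPD calculates. The profile_csv function and the sample data are made up.

```python
import csv
import io


def profile_csv(text):
    """Count rows and columns and compute per-column completeness."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    columns = list(reader.fieldnames or [])
    completeness = {}
    if rows:
        for column in columns:
            filled = sum(1 for row in rows if (row[column] or "").strip())
            completeness[column] = filled / len(rows)
    return {
        "rows": len(rows),
        "columns": len(columns),
        "completeness": completeness,
    }


# Invented sample with two missing values.
sample = (
    "HOSPITAL_ID,NAME,BEDS\n"
    "1,General,250\n"
    "2,,120\n"
    "3,Mercy,\n"
    "4,Lakeside,90\n"
)

profile = profile_csv(sample)
print(profile)
```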
In addition to categorizing data by their terms, you can also classify data by how it is used or needs to be governed. For example, you may want to make a classification of “Personally Identifiable Information” so that you can easily find that sort of data in your system. You can also build rules around that classification to tell CPD how to find it and how to control access based on it. In example 4 I show the related content for the PII classification.
Example 4. Personally Identifiable Information related content
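To give a feel for what a rule that finds data for a classification might look like, here is a small Python sketch that flags columns as PII based on patterns in their names. CPD rules are defined in the product rather than written like this, so treat it purely as an assumption-laden illustration. The pattern list, the classify_columns function and the column names are all hypothetical.

```python
import re

PII_LABEL = "Personally Identifiable Information"

# Hypothetical name patterns; a real rule set would be agreed on by your
# governance team and would usually also inspect the data values.
PII_NAME_PATTERNS = ["ssn", "email", "phone", "birth", "patient_name"]


def classify_columns(column_names, patterns=PII_NAME_PATTERNS):
    """Map each column name to the PII label, or None if no pattern matches."""
    result = {}
    for column in column_names:
        is_pii = any(re.search(p, column, re.IGNORECASE) for p in patterns)
        result[column] = PII_LABEL if is_pii else None
    return result


columns = ["PATIENT_NAME", "SSN", "EMAIL_ADDRESS", "BEDS", "HOSPITAL_ID"]
labels = classify_columns(columns)
pii_columns = [name for name, label in labels.items() if label]
print(pii_columns)
```

Once columns carry a label like this, finding all PII in the catalog or restricting access to it becomes a lookup on the classification rather than a table-by-table review.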
To this point I’ve discussed finding data using the search capabilities, which will show anything related to your search term including terms, categories, tables, files, etc. You can then drill into what you find to show what is related to it and get information about those assets too. This is probably the most common way of finding your data. However, you can also start at the top level of your catalog hierarchy and then drill down into the various layers. Let’s say you believe that a particular database has what you want and you just want to check that it really has your tables. In this case you don’t need to do a general search. Instead you could start on the hierarchies page. Example 5 shows the top-level hierarchies. You can click on any that interest you. In this example, you could click on Implemented Data Resources, which would show you the connected sources, where you could drill into the databases under them, the schemas under those, the tables under those and so on. At any level in the hierarchy, you can look at the information about any assets you want.
Example 5. Data Asset Hierarchies
As you find data assets that you want to use in your analysis, you can add them to a “project”. A project is a way to create a view of just the assets you are using and developing for a particular task. A project holds references to the data assets and to the analytics applications you build using those data assets. So, as you find things that are of interest in your searches, you will want to add them to your project. You can have multiple projects and you can add other CPD users as collaborators in the project. CPD includes a robust set of analytics tools, like modern dashboarding capabilities, Python, R, Jupyter Notebooks, and Machine Learning training and deployment. Once you have found the data assets you want and evaluated them with the provided tools, those data assets are at your fingertips. You just start building the analytics application in the project, where you click on the data assets to use them in the application. For example, if you are writing Python in a Jupyter notebook, you can just click on the table you want and a cell is populated with the Python code to query that table. Example 6 shows a project with a few data assets plus a Jupyter Notebook and an Analytics Dashboard.
Example 6. Project
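I can’t reproduce the exact code that CPD inserts into a notebook cell here, but the general shape of a cell that queries a table looks something like the Python sketch below. It assumes an in-memory SQLite database as a stand-in for the real connection, and the HOSPITAL_INFO columns and rows are invented.

```python
import sqlite3

# Stand-in for the connection a notebook would get from the project.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets each row be read by column name

# Invented table contents so the query has something to return.
conn.execute(
    "CREATE TABLE HOSPITAL_INFO (HOSPITAL_ID INTEGER, NAME TEXT, BEDS INTEGER)"
)
conn.executemany(
    "INSERT INTO HOSPITAL_INFO VALUES (?, ?, ?)",
    [(1, "General", 250), (2, "Mercy", 120)],
)

# The part that matters: query the table and keep the rows for analysis.
cursor = conn.execute("SELECT * FROM HOSPITAL_INFO ORDER BY HOSPITAL_ID")
rows = [dict(row) for row in cursor]

for row in rows:
    print(row)
```

In a real notebook the connection details come from the project’s data asset, so you would start from the query and go straight to your analysis.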
In addition, you can make that data available to external analytics applications through the Data Virtualization capability of CPD. With virtualization you can make many of your tables and some files from other sources available so that they all appear to be in one database. These disparate objects can even be joined by creating a virtual view in CPD or by just joining the data assets in a query in your external application.
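The idea of joining tables from separate sources as if they were in one database can be imitated on a small scale. The Python sketch below attaches two separate SQLite databases to one connection and joins across them in a single query. This is only an analogy for what Data Virtualization provides and does not use it. The claims_db name, the READMISSIONS table and all of the data are hypothetical.

```python
import sqlite3

# One connection, two separate databases: "main" and an attached "claims_db".
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS claims_db")

# Invented tables, one in each database.
conn.execute("CREATE TABLE main.HOSPITAL_INFO (HOSPITAL_ID INTEGER, NAME TEXT)")
conn.execute(
    "CREATE TABLE claims_db.READMISSIONS (HOSPITAL_ID INTEGER, READMITTED INTEGER)"
)
conn.executemany(
    "INSERT INTO main.HOSPITAL_INFO VALUES (?, ?)",
    [(1, "General"), (2, "Mercy")],
)
conn.executemany(
    "INSERT INTO claims_db.READMISSIONS VALUES (?, ?)",
    [(1, 12), (1, 8), (2, 5)],
)

# A single query joins the two sources as if they lived side by side.
query = """
    SELECT h.NAME, SUM(r.READMITTED) AS TOTAL_READMITTED
    FROM main.HOSPITAL_INFO AS h
    JOIN claims_db.READMISSIONS AS r ON r.HOSPITAL_ID = h.HOSPITAL_ID
    GROUP BY h.NAME
    ORDER BY h.NAME
"""
totals = conn.execute(query).fetchall()
print(totals)
```

The external application only ever sees one connection and ordinary SQL, which is the convenience that virtualization is meant to give you across much more varied sources.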
I’ve discussed many things you can do with the data. If you have a small group of analysts using the system then you may allow everyone to do all functions and see all data, but other organizations may prefer to have a separation of duties. In this case one person may be able to search the catalog to see what tables are available but need to request access to be allowed to use that data. CPD supports a workflow in which one person requests access and another fulfills that request. We will discuss this more in a future article.
There are many more useful facets of Cloud Pak for Data when it comes to finding and understanding your data. If there are any you especially like, please tell us about them on my db2Dean Facebook Page and share your thoughts about them.