Watson Data Platform
30 January 2018
Like most cloud platforms, the IBM Cloud (formerly IBM Bluemix) has many, many services available to do all sorts of things, including data stores, infrastructure, development environments, etc. However, most professionals typically need only a specific subset of these capabilities and can best use them if they are easy to snap together. So the IBM analytics team has introduced a new platform called the Watson Data Platform to help professionals like business analysts, report developers and data scientists by providing an integrated set of tools for cataloging, finding, refining, and analyzing data. Most of these tools are already on the IBM Cloud, but the Watson Data Platform console makes it even easier to use them together, with better integration and flow from one step to the next.
At the core of the Watson Data Platform (WDP) is the ability for an organization to catalog a list of data sources including cloud, on-premises, relational, non-relational, Hadoop sources, cloud object stores, and files in one place, describe them and tag them so that future projects can easily find and understand those sources. As projects generate new files, tables and other data assets they are automatically added to the organization’s catalog where they can be found and used by future projects. Connectivity to the databases and other data stores is maintained by the catalog. Further, we provide ways to classify data by sensitivity level in the catalog and to restrict access to data assets as desired. Another base feature of the WDP is a cloud object store with significant storage. This is used to hold the information in the catalog as well as files that you upload to WDP through the catalog or generate in projects.
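To make the idea of a described, tagged and access-restricted catalog concrete, here is a minimal sketch in plain Python. It is only an illustration of the concept: the `CatalogEntry` and `Catalog` classes, the field names and the two sensitivity levels are all hypothetical and are not the WDP data model or API.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One data source or data asset (hypothetical structure, not the WDP schema)."""
    name: str
    kind: str                                   # e.g. "connection", "table", "file"
    description: str = ""
    tags: set = field(default_factory=set)
    sensitivity: str = "public"                 # hypothetical levels: "public", "confidential"


class Catalog:
    """A toy in-memory catalog that can be searched by name, description or tag."""

    def __init__(self):
        self._entries = {}

    def add(self, entry):
        self._entries[entry.name] = entry

    def search(self, term, allowed_sensitivity=("public",)):
        """Return the names of visible entries that mention the search term."""
        term = term.lower()
        hits = []
        for entry in self._entries.values():
            # Entries above the caller's sensitivity level are simply not shown.
            if entry.sensitivity not in allowed_sensitivity:
                continue
            text = " ".join([entry.name, entry.description, *sorted(entry.tags)]).lower()
            if term in text:
                hits.append(entry.name)
        return sorted(hits)


catalog = Catalog()
catalog.add(CatalogEntry("SALES_DB", "connection", "On-premises sales orders", {"sales", "orders"}))
catalog.add(CatalogEntry("HR_PAYROLL", "table", "Employee salaries", {"hr", "sales-commission"}, "confidential"))

print(catalog.search("sales"))
print(catalog.search("sales", allowed_sensitivity=("public", "confidential")))
```

The first search only sees the public entry, while the second, run by someone allowed to see confidential assets, finds both. Good descriptions and tags are what make either search useful.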
You can start your use of the WDP by cataloging and describing many of your data sources, or you can let each new project start by cataloging the new sources it needs, building the catalog as you go. In either case, as time goes on you will find that you are adding more and more data sources to your catalog. As sources are added, they can also be described and tagged to allow others to easily find them in searches and understand them.
While it is wonderful to have a place to search your data sources and connect to them just by clicking on them, it is even better to have analytics tools right at your fingertips to actually do the analysis of that data. So with WDP you get the Data Science Experience, Watson Machine Learning and streams processing based on IBM Streams or open source streaming services right on the IBM Cloud. With Data Science Experience (DSX) you get Jupyter Notebooks and RStudio. Jupyter Notebooks allow you to hold your notes, various blocks of Python and Scala code, and visualizations of data in one place. You can also execute those blocks of code in Spark and easily access the data sources in the catalog. Watson Machine Learning provides the ability to create machine learning models, score them, and track their accuracy over time to know when it is time to retrain them. The streaming service allows you to ingest any sort of streaming data and process it, possibly dropping it into a data source in your catalog while flagging anomalies.
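The idea of tracking accuracy over time to decide when to retrain can be sketched in a few lines of plain Python, the kind of cell you might keep in a notebook. This is not the Watson Machine Learning API. The function names, the 0.8 threshold and the three-period window are assumptions chosen only for illustration.

```python
def accuracy(predictions, actuals):
    """Fraction of predictions that matched the actual outcomes."""
    correct = sum(1 for p, a in zip(predictions, actuals) if p == a)
    return correct / len(actuals)


def needs_retraining(accuracy_history, threshold=0.8, window=3):
    """Flag a model when its average accuracy over the last `window` periods drops below `threshold`.

    Both defaults are hypothetical; pick values that suit your own model.
    """
    recent = accuracy_history[-window:]
    if len(recent) < window:
        return False  # not enough history yet to judge
    return sum(recent) / len(recent) < threshold


# Accuracy measured each week as actual outcomes become known.
weekly_accuracy = [0.90, 0.82, 0.78, 0.74]
print(needs_retraining(weekly_accuracy))
```

Averaging over a window rather than reacting to a single bad week keeps one noisy measurement from triggering a retrain.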
Once you get your machine learning model or analytics function created, you want to make it easy for applications to use it. Therefore, WDP makes it easy to deploy your model as an API on the IBM Cloud so applications have easy access to it. Further, anything you create in your project in WDP is available outside of it as well. So, any Notebooks you create can be shared outside of WDP. Any data assets you create are not only available in the Catalog of WDP but available for other applications as well, including Watson Analytics, Cognos Analytics, other BI tools and even applications on your premises.
If you deploy the SPSS service on the IBM Cloud, your service will automatically appear on your WDP console. From there you can easily build your predictive models using the data sources in your catalog and integrate with the other services in WDP.
The WDP console allows you to group analytics tools and sources from the catalog into an entity called a project. The creator of the project is automatically the owner, and the owner can invite others to join the project to work collaboratively. While the catalog is available to all projects, the analytics tools are executed only from within projects. The screen shot below shows the Assets tab of a project that I created and to which I added some data assets from my catalog, but did nothing else. If I were to create or import notebooks, or create some Streams flows or models, they would appear under the appropriate asset listing in the project.
At a minimum, a new project allocates access to the cloud object storage area for storage of project information, models, notebooks and other objects and descriptions. As those items are created, they and their metadata are stored in the object store.
Once you’ve found some data, either already in your catalog or from someplace else, and you’ve placed a connection to it into your catalog, you want to understand what is in it. WDP provides useful tools to understand your data assets. It allows you to see the file or table in a tabular format, showing the datatype of each column as well. So you can browse through some of the data in a file just to get a handle on what is in it. WDP also profiles each column in the file or table so you can get a sense of the quality of the data, as well as a summary of the data contained in each column, including the most frequent values, number of missing values, number of unique values and so on. Additionally, it allows you to visualize data. You can pick one or more columns and WDP will show you the data in a chart, map or graph, allowing you to change the type of visualization. For example, in a data source you might select sales and state, and WDP may show you a map with states color coded by the sum of sales in each state.
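To give a feel for what a column profile contains, here is a small sketch in plain Python that computes the same kinds of statistics (missing values, unique values, most frequent values) and the sum of sales by state that a map might be colored by. WDP does this for you in the console. The sample rows and the `profile_column` and `sum_by` helpers below are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical sample rows standing in for a table or file in the catalog.
rows = [
    {"state": "TX", "sales": 120.0},
    {"state": "CA", "sales": 200.0},
    {"state": "TX", "sales": None},   # a missing value
    {"state": "NY", "sales": 80.0},
    {"state": "TX", "sales": 50.0},
]


def profile_column(rows, column, top_n=1):
    """Summarize one column: missing count, unique count, most frequent values."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None]
    return {
        "missing": len(values) - len(present),
        "unique": len(set(present)),
        "most_frequent": Counter(present).most_common(top_n),
    }


def sum_by(rows, key_column, value_column):
    """Total a numeric column per group, skipping missing values."""
    totals = defaultdict(float)
    for row in rows:
        if row.get(value_column) is not None:
            totals[row[key_column]] += row[value_column]
    return dict(totals)


print(profile_column(rows, "state"))
print(profile_column(rows, "sales"))
print(sum_by(rows, "state", "sales"))
```

The per-state totals are exactly the numbers a color-coded map of sales by state would be drawn from.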
Your data assets are not likely to be exactly as you want them or where you want them, so WDP allows you to copy a data asset and at the same time refine and cleanse the data as you copy it. A few of the refinements available include:
- Removing columns
- Filtering on particular values
- Combining or splitting columns
- Summing, counting, grouping and other functions
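The refinements listed above can be pictured as small transformations applied to rows as they are copied. The following plain Python sketch shows one of each. In WDP you build these steps in the console rather than writing code, so the helper functions and sample data here are purely hypothetical.

```python
from collections import defaultdict

# Hypothetical source rows to be refined while they are copied.
source = [
    {"id": 1, "customer": "Ann Lee", "state": "TX", "sales": 120.0, "notes": "n/a"},
    {"id": 2, "customer": "Bob Ray", "state": "CA", "sales": 200.0, "notes": "n/a"},
    {"id": 3, "customer": "Cy Orr", "state": "TX", "sales": 50.0, "notes": "n/a"},
]


def remove_columns(rows, columns):
    """Drop the named columns from every row."""
    return [{k: v for k, v in row.items() if k not in columns} for row in rows]


def filter_rows(rows, column, value):
    """Keep only the rows where the column has the given value."""
    return [row for row in rows if row[column] == value]


def split_column(rows, column, new_columns, sep=" "):
    """Split one text column into several new columns and drop the original."""
    result = []
    for row in rows:
        parts = row[column].split(sep, len(new_columns) - 1)
        new_row = {k: v for k, v in row.items() if k != column}
        new_row.update(zip(new_columns, parts))
        result.append(new_row)
    return result


def group_sum(rows, key_column, value_column):
    """Sum a numeric column for each distinct value of the key column."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key_column]] += row[value_column]
    return dict(totals)


refined = remove_columns(source, {"notes"})
refined = split_column(refined, "customer", ["first_name", "last_name"])
texas_only = filter_rows(refined, "state", "TX")
print(texas_only)
print(group_sum(refined, "state", "sales"))
```

Each step returns new rows and leaves the source untouched, which mirrors the way a refinement copies a data asset rather than changing it in place.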
To illustrate some of the features of WDP, I’ll now provide a simple scenario for using it. Let’s say that your team has been given the task of gathering data for analysis from 3 on-premises databases, 2 files, a Cloudant JSON data store and a PostgreSQL database in the cloud. Further, you need to put some of the data into a new Db2 Warehouse on Cloud to allow fast and easy business analytics in Cognos Analytics, and create a predictive model to use in an application that is being developed. So here are the steps you can take.
1. Deploy a new Db2 Warehouse on Cloud (Db2WoC) system in your IBM Cloud system.
2. Log into your WDP console and create a new project called MyProject. Add everyone to the project who will be working on it.
3. Next, use your WDP console to search the catalog to see if anyone has already created connections to any of the data sources. To your delight, you see that there are connections to two of the on-premises databases, the Oracle database and the Informix database, and to the cloud-based PostgreSQL database, and that someone has entered good descriptions of them so you can verify that these are indeed the databases you need.
4. Add those existing databases to MyProject.
5. Create new connections to your new Db2WoC warehouse, the existing Db2 database that is on premises and the Cloudant JSON store. You are a good team player, so when you add them you also enter a detailed description and useful tags relating to the content of the databases so that the next team can easily find and understand them.
6. You know that your team is not as familiar with the individual tables in these databases as you are, so for every relevant table that is not already in the catalog you create an entry with tags and a good description. These catalog entries are called “Connect Data” entries.
7. Using the catalog load facility, you upload the two files that are needed for your project. Since WDP comes with a significant amount of cloud object storage those files are just loaded into it. As with the new connections and tables you added, you describe and tag these files as well.
8. Finally, from the catalog, you add all of the new and existing data assets to your project.
9. Next each person on the team can click on MyProject to open the project. Everyone will notice that all of the data assets you added are under the Data Assets section.
10. To verify if the data in your various sources have the information you need and are of the quality you must have, you and your team members start the Refine process on these assets, viewing the data in a tabular format, profiling the data and visualizing the data in charts and graphs.
11. The team members realize that few of the tables, documents or files are in the format that you need for your project, so you create refinement data flows to shape the data as you desire, adding or deleting columns, changing data types, summing data, removing rows with bad data and so on. Each refinement ends with a new table being placed into the Db2WoC database you deployed for this project. WDP creates the table to hold the data and then copies and refines the data from the source. You don’t need to know anything about Db2’s data definition language. Each of these new tables is now available in the catalog for any project to use, assuming the members of the other projects have the appropriate authority. Of course, you describe and tag each of these new assets appropriately.
12. The BI development team starts generating dashboards and analysis in Cognos Analytics on the new tables in Db2WoC as they become available.
13. Next you import a Jupyter Notebook into the notebooks section just by pointing WDP to the URL of the notebook. You do this because this notebook has the Python code used to develop a similar model in the past. You modify the code as you need, easily adding the data assets needed.
14. From here you can continue with your notebooks and create streams flows and machine learning models as needed.
This is not a complete guide to completing such a project. Instead it is just intended to show how various components of WDP might be used. If you have other interesting ideas about using Watson Data Platform, share them on my Facebook Page or my db2Dean and Friends Community along with any other comments you may have.