IBM Cloud Private for Data Overview

Dean Compher

30 July 2018

Updated 29 May

 

 

Please note that IBM Cloud Private for Data is now called IBM Cloud Pak for Data. Part of the reason for the name change is that it now uses Red Hat OpenShift as the Kubernetes implementation.

 

IBM Cloud Private for Data is a platform that lets you configure connections to many data sources and provides tools to access those sources, refine them, and use them to develop analytics applications and dashboards. Each time you start a new analytics project, you have access to the data you need, the ability to get it into the format you want, and tools such as Jupyter Notebooks and Cognos-like dashboards to begin analyzing it. It provides options to profile those sources, describe interesting objects in them, refine and copy data from one source to another, and control access to the data, making the sources easy to use in analytic applications. The platform is deployed as a modern microservices architecture orchestrated by Kubernetes and monitored and controlled by a variety of open source tools. You provide your own cluster of servers in your data center, or IaaS servers on your favorite cloud provider, and access the platform from a web browser.

 

For the rest of this article I will refer to IBM Cloud Private for Data as ICP4Data.  That is not an official term, but I like it.

 

Imagine how much sooner you can begin analytics development when a searchable set of data sources is already connected to a system that also lets you transform data and develop against it. The alternative is asking around the organization about where the data you need lives, determining what type of database it is in, getting access, installing drivers for those new sources, installing tools to refine the data, and only then beginning development. If this sounds useful, then keep reading. While it would be great if your organization had a data steward who finds and connects those sources to ICP4Data, this is not necessary. If your organization uses it for each new project, you will quickly build up a set of connections and refined data sources that make each subsequent project easier, because more and more sources get defined to the system. Fortunately, the drivers/connectors to various data sources are already installed. You just access the ICP4Data system with your browser and all of the tools are at your fingertips.

 

ICP4Data allows you to keep your various data sources where they are and add a centralized access point with the tools you need to manipulate, copy and use the data. Additionally, you can create or import a catalog of terms and policies. You can then assign terms and policies to the data sources, tables, files, columns, entities, etc., within the various data sources. Users of the ICP4Data system can add descriptions to any level of the data and rate the data sources (1 star vs. 5 stars). Those ratings affect future search results, and the more information each person adds, the better the system gets for everyone. Further, you can point ICP4Data to any data source or schema within it and have the product profile the quality of data in the source and catalog all tables, columns, schemas, record types, documents (JSON, BSON, XML), Hadoop, etc.

 

 

I foresee that the first thing anyone starting a new project will do, after creating an analytics project within the tool, is go to the search bar on the home page and begin searching for terms that interest them. The tool will show tables, files, databases, document types, etc. that meet the search criteria. We have even built in machine learning so that the results get better over time. As they find likely candidates, they will add them to the project. A project in ICP4Data is just a way of grouping objects of interest. Once the project owner has collected the interesting data sources and added any new ones to the ICP4Data catalog, they can add collaborators to the project and begin verifying that the sources are actually useful. For data sources to which you have access, you can easily preview the data and do additional profiling and simple visualizations to determine whether the data you have found is useful. You can then remove any unwanted sources from the project. With the latest release you can deploy a Db2 Warehouse database and some open source databases to hold relational data locally on your cluster. One big use for such a database will be as a sandbox onto which you copy your data for analytics purposes.
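To make the idea of profiling a candidate source concrete, here is a minimal sketch in plain Python. It is not ICP4Data code and does not use any ICP4Data API; the function name profile_rows and the sample data are made up for illustration. It only shows the kind of per-column summary (row count, missing values, distinct values) that helps you decide whether a data set is worth keeping in a project.

```python
import csv
import io
from collections import Counter


def profile_rows(rows):
    """Return per-column row count, missing count, distinct count and top value."""
    profile = {}
    for row in rows:
        for column, value in row.items():
            stats = profile.setdefault(
                column, {"rows": 0, "missing": 0, "values": Counter()}
            )
            stats["rows"] += 1
            # Treat absent fields and blank strings as missing.
            if value is None or value.strip() == "":
                stats["missing"] += 1
            else:
                stats["values"][value] += 1
    return {
        column: {
            "rows": stats["rows"],
            "missing": stats["missing"],
            "distinct": len(stats["values"]),
            "most_common": (
                stats["values"].most_common(1)[0][0] if stats["values"] else None
            ),
        }
        for column, stats in profile.items()
    }


# Made-up sample standing in for a previewed data set (hypothetical columns).
SAMPLE_CSV = """customer_id,region,revenue
1,EAST,100
2,WEST,
3,EAST,250
4,,75
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
report = profile_rows(rows)
for column, stats in report.items():
    print(column, stats)
```

A column with many missing values or only one distinct value is usually a hint that the source needs refining, or that it can be dropped from the project.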

 

 

While being able to easily find the data and examine it is very useful, data will frequently not be in just the format you want. Therefore, ICP4Data also provides a set of transformation tools. These tools include a variety of connectors and stages from the Information Server DataStage product. Note: DataStage as a whole was not brought into ICP4Data. As with other parts of ICP4Data, some features of DataStage have been containerized and brought into the product. The look and feel are similar to DataStage, though. These connectors and stages allow you to easily copy data from one source or set of sources to another while changing the data with many easy-to-use functions.
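As a rough illustration of copying data from one source to another while changing it along the way, here is a small Python sketch. It uses two in-memory SQLite databases as stand-ins for a source system and a sandbox database; the table names, columns and the transform function are assumptions made up for this example and have nothing to do with the actual connectors and stages in the product.

```python
import sqlite3

# Two in-memory SQLite databases stand in for a source system and a sandbox.
source = sqlite3.connect(":memory:")
sandbox = sqlite3.connect(":memory:")

# Hypothetical source table with messy region values and amounts in cents.
source.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount_cents INTEGER)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, " east ", 1250), (2, "WEST", 990), (3, None, 400)],
)

sandbox.execute("CREATE TABLE orders_clean (id INTEGER, region TEXT, amount REAL)")


def transform(row):
    """Trim and upper-case the region, default missing regions, convert cents."""
    order_id, region, amount_cents = row
    clean_region = region.strip().upper() if region else "UNKNOWN"
    return (order_id, clean_region, amount_cents / 100)


# Read from the source, transform each row, and write it to the sandbox.
rows = source.execute("SELECT id, region, amount_cents FROM orders ORDER BY id")
sandbox.executemany(
    "INSERT INTO orders_clean VALUES (?, ?, ?)", (transform(r) for r in rows)
)
sandbox.commit()

result = sandbox.execute(
    "SELECT id, region, amount FROM orders_clean ORDER BY id"
).fetchall()
print(result)
```

In the product you would assemble this kind of flow from connectors and stages rather than writing it by hand; the sketch only shows the read, transform, write pattern behind it.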

 

The next aspect of ICP4Data that I’ll discuss is the “Project”. The project includes major components of the IBM Data Science Experience (DSX) and allows you to group items or access to data in one place and collaborate with others. As I noted earlier, Data Sets are one of the types of objects you can have in a project. A data set can be any entity like a table or file and will generally be one that was available in the ICP4Data catalog and that you selected for your project. Once a data set is in the project, it is quite easy to use it in any of the development objects like notebooks or dashboards. Here is a summary of the objects in a project:

 

Data Sets

Tables, files or other entities that you selected from the ICP4Data catalog. You can also upload files directly into your project and optionally allow them to be accessed by everyone through the ICP4Data catalog.

Notebook

·      Jupyter with Python or Scala running on GPU or Spark

·      Zeppelin with Anaconda2 or Python

When you select a data set from the menu while in a cell, ICP4Data generates the code to access that data set, so you don’t have to know how to code access to all sorts of different data sources.

Dashboard

Create nice-looking reports and visualizations of your data sets with a Cognos-like palette. Again, a number of features of Cognos have been containerized and added to ICP4Data. As with the Notebook, you can just click on data sets in the project to use them.

RStudio

Run RStudio functions on any data set added to the project.

Models

Add Machine Learning or Decision Optimization models that you have created, or download them from Git or other sources. Easily train these models and add a REST API to call them from any application.

SPSS Modeler Flows

Add SPSS Modeler Flows to your project

Scripts

Add a Python or R script to your project. 
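To give a feel for what calling a deployed model over REST from an application might look like, here is a hedged Python sketch using only the standard library. The endpoint URL, the bearer token and the fields/values payload shape are all assumptions invented for this example; the real URL, authentication scheme and payload format come from the deployment details of your own model, so check those before relying on any of this.

```python
import json
import urllib.request


def build_scoring_request(endpoint_url, token, fields, values):
    """Build (but do not send) a JSON POST request for a model scoring endpoint.

    The payload layout used here is a hypothetical example, not a documented
    ICP4Data format.
    """
    payload = {"fields": fields, "values": values}
    return urllib.request.Request(
        endpoint_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Placeholder URL and token: replace both with the values for your deployment.
request = build_scoring_request(
    "https://icp4data.example.com/hypothetical/model/score",
    token="REPLACE_WITH_REAL_TOKEN",
    fields=["region", "revenue"],
    values=[["EAST", 100], ["WEST", 250]],
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

The sketch stops at building the request so it can run anywhere. To actually send it you would pass the request to urllib.request.urlopen and parse the JSON response, once the placeholder URL and token are replaced with real ones.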

 

The above are brief descriptions of some of the main functions of ICP4Data. There is also a menu of features that allow you to administer ICP4Data. The Admin Dashboard gives you a summary of things like Worker and Master node memory and disk usage and allows you to drill into more detail. You can see the status of various services in the cluster and of Kubernetes Pods. You can view the cluster log and information about individual nodes. You can see alerts and manage users, plus do a number of other administrative tasks.

 

This has just been a brief description of IBM Cloud Private for Data to make you aware of it. If this seems interesting to you, there are several short videos and some guided demos you can use to explore it further. Please see the links below.

 

ICP4Data Community:  Join for lots of great information including videos and documentation:

https://community.ibm.com/community/user/icpfordata/home

 

ICP4Data DTE Page:  Videos and a self-guided product tour.

https://ibm-dte.mybluemix.net/ibm-cloud-private-for-data

 

ICP4Data Whitepaper

https://ibm.co/2tDKUVF

 

ICP4Data can be run on an independent cluster of servers (on premises or IaaS servers), or it can be run in larger microservices infrastructure systems like IBM Cloud Private or Red Hat OpenShift. On IBM Cloud Private and OpenShift you will be able to deploy ICP4Data using a Helm chart.

 

***

 

ICP4Data provides a single portal into your data for developing your AI and business intelligence applications. There is a remarkable amount of development going on, so be prepared for lots of changes to the web pages and lots of new functionality in the coming months. If you see things that you really like, please post them on my Facebook Page or my db2Dean and Friends Community along with any other comments you may have.

  
