DB2 and Machine Learning
19 December 2022
Db2 can do much more in the AI realm than just provide data to train your machine learning models. It can also host your models, allowing them to be called through user-defined functions and REST services. It even has built-in functions, such as train-test split and linear regression, that let you train models right in the database. Training a model in the database server can save significant network resources, since you don’t need to copy huge result sets to an application server or workstation. There are more sophisticated facilities for hosting your models, like IBM Cloud Pak for Data, but if you already have Db2 databases and need a fast way to give your applications easy access to your machine learning models, then Db2 is the ticket. This article is also useful for those who want to start experimenting with machine learning models, as you can easily try these examples. If you don’t already have a Db2 environment, you can get one on your workstation with the free Db2 Community Edition.
Frequently, machine learning models are not used for grand purposes; instead they score or predict small things, such as a retailer’s order website calling a model to suggest an additional purchase to the customer, or a back-office application calling one to determine the likelihood that a transaction coming from some device is legitimate. These models can proliferate in an organization, and you don’t want to maintain lots of custom interfaces between the models and the existing applications that call them. That is why you want a central place to host them that any application can easily call over the network. Your Db2 database is already there processing SQL requests, and when you call your machine learning model through a database user-defined function, it can just as easily give applications predictions and scores. Further, for applications that don’t already use the database, you can implement the Db2 REST endpoint and let those applications use the models in the database by calling REST services instead of making database connections.
You can see my Db2 REST Endpoint article for information on the endpoint, and see my examples of creating and calling REST services for machine learning models on GitHub. See the Deploying External Models with Python UDF directory in the db2-samples GitHub repository: it contains an example of creating and deploying a model in Db2, and my Db2 REST Jupyter notebooks show the steps for creating and calling the REST service.
If you are already training and saving your models with IBM AutoAI, Python notebooks using scikit-learn, or other development tools, you can still host your models in Db2. Those tools commonly save the trained model as a joblib file. There are three steps you must complete to allow applications to execute the model, plus one optional step:
· Copy the model to the Db2 database server.
· Copy the body of your function (Python, C++ or Lua) to the database server.
· Create a database function to execute the function body.
· Optionally create a REST service to call the function.
Copy the model to the Db2 database server
There are two ways to copy your machine learning model to the Db2 Server:
· Copy the model joblib file to a directory on the database server. It must go into a directory accessible to the Db2 fenced user.
· Convert the joblib file to a base64 encoded string and insert it into a table in the database.
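The second option is simpler than it sounds: a joblib file is just a stream of bytes, so converting it is plain base64 encoding. A minimal sketch using only the Python standard library — the file path and the INSERT statement in the comments are hypothetical, and a stand-in byte string replaces the real joblib bytes so the sketch is self-contained:

```python
import base64

# In practice you would read the saved model file (hypothetical path):
#   with open("model.joblib", "rb") as f:
#       model_bytes = f.read()
model_bytes = b"joblib-serialized-model-bytes"  # stand-in for real joblib bytes

# Encode to a base64 text string that can be inserted into a table, e.g.
#   INSERT INTO ML_MODELS (NAME, MODEL) VALUES ('churn', ?)
model_b64 = base64.b64encode(model_bytes).decode("ascii")

# The UDF later reverses the encoding before loading the model:
restored = base64.b64decode(model_b64)
assert restored == model_bytes
```

The round trip is lossless, so the function body can rebuild the exact joblib bytes from the table column at scoring time.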
Copy the body of your function to the database server
Write a function body in Python, C++ or Lua that calls the model you created. You will then copy the body into a directory on the database server that can be accessed by the Db2 fenced user.
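To give a feel for what such a function body contains, here is a minimal Python sketch. A real body would load your trained model with joblib (shown in the comment); the feature names, coefficients, and file path here are hypothetical stand-ins so the sketch runs on its own:

```python
# Sketch of a scoring function body. A real body would load a trained model:
#   from joblib import load
#   model = load("/models/churn.joblib")  # path readable by the Db2 fenced user
# A tiny hand-built linear model stands in here (hypothetical coefficients).
COEFFICIENTS = {"age": 0.03, "balance": 0.0001}
INTERCEPT = -1.5

def score(age: float, balance: float) -> float:
    """Return a raw score for one row; Db2 passes column values as arguments."""
    return INTERCEPT + COEFFICIENTS["age"] * age + COEFFICIENTS["balance"] * balance

print(score(40, 10000))  # raw score for age 40, balance 10,000
```

However the model is loaded, the shape is the same: the database hands the function one row's column values, and the function returns the model's prediction for that row.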
Create a database function to execute the function body
While connected to the Db2 database, use the CREATE FUNCTION statement to catalog your user-defined function. The Creating and managing user-defined extensions page describes how to create and deploy functions (user-defined extensions). Drill down into the Managing UDXs section to see instructions for Python UDXs. This function calls the function body, which in turn executes the model.
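As a rough sketch of what the cataloging step looks like, the DDL below is modeled on the Python UDX examples in the Db2 documentation. The function name, signature, and file path are hypothetical, and the exact clause list varies by environment, so treat this as a starting point and check the Managing UDXs pages for the options your install requires:

```python
# Hypothetical DDL for cataloging a Python UDF that wraps a model.
# Run it with your usual Db2 client; it is held in a string here only
# so the sketch is self-contained.
ddl = """
CREATE OR REPLACE FUNCTION PREDICT_CHURN(DOUBLE, DOUBLE)
RETURNS DOUBLE
LANGUAGE PYTHON
PARAMETER STYLE NPSGENERIC
FENCED NOT THREADSAFE
NO FINAL CALL
DISALLOW PARALLEL
NO DBINFO
DETERMINISTIC
NO EXTERNAL ACTION
CALLED ON NULL INPUT
NO SQL
EXTERNAL NAME '/home/db2inst1/udfs/score.py'
"""
# Once cataloged, any application can score rows with ordinary SQL, e.g.:
#   SELECT PREDICT_CHURN(AGE, BALANCE) FROM CUSTOMERS
print(ddl.strip())
```

The EXTERNAL NAME clause is what ties the cataloged function to the function body you copied to the server in the previous step.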
Optionally create a REST service to call the function
Once you have created the function in the database you can create a REST service to execute it just as you can for any other Db2 query. I provide generic examples of creating Db2 REST services in my Db2 REST End Point and Database Security article.
As you have probably noticed, I enumerated the steps but left out a significant amount of detail. The good news is that the In Db2 Machine Learning directory of the GitHub db2-samples repository contains subdirectories with detailed examples of the steps noted above.
· In the In DB Model Deployment and Management directory, the README has the detailed steps, with code examples, for creating a machine learning model as a joblib file, converting it to base64, inserting the model into a table, and then creating the user-defined function that calls the score.py function body to do the prediction. It even includes the DDL to create the table that holds the model, and data files to run the example.
· The Deploying External Models with Python UDF directory provides the ML model joblib and Python function body files you can download, and its README describes how to place them on the Db2 server, create the Python user-defined function on them, and execute the function. It also includes the needed DDL and sample data to run the example yourself. Finally, it includes Jupyter notebooks, among them the Db2 RESTful External Py UDF.ipynb notebook I wrote that creates a REST service on the function and calls it. That notebook depends on other notebooks in the directory that all start with “Db2 RESTful”.
In addition to hosting machine learning models for applications to execute, you can train new models right in Db2 using its many training functions and stored procedures, covering data normalization, clustering, linear regression, and more. They are all in the schema IDAX. This is especially useful if the training data is already in Db2. It is also useful if you need data from several sources to train your models: with the Db2 federation capability, queries against the Db2 database can access both Db2 tables and tables in remote databases like Oracle, Postgres, and several others. So instead of copying data from several sources into your training environment, you can just write a single query that joins the data from those sources.
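To make the in-database training idea concrete, here is the calculation an IDAX-style linear regression performs, shown for a single feature in plain Python (no Db2 required); the in-database procedure does the equivalent over table columns:

```python
# Ordinary least squares for one feature: slope = cov(x, y) / var(x).
# An in-database linear regression procedure computes the same fit over
# the rows of a table instead of Python lists.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
            / sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Data that lies exactly on y = 2x + 1
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)  # -> 2.0 1.0
```

The advantage of doing this in the database is purely logistical: the rows never leave the server, which matters when the training set is millions of rows rather than four.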
The various Db2 Warehouse implementations (Db2 Warehouse, Db2 Warehouse on Cloud, and IBM Integrated Analytics System) are automatically installed with those functions. Many of those same functions can be added to your traditional Db2 implementations. In addition to functions like linear regression and clustering, there are many for data exploration, data transformation, model tuning, and model evaluation. If you’ve always wanted to try creating your own machine learning models but didn’t have an environment to do so, you just need a Db2 11.5 database to get started. See the In-database machine learning section of the Db2 knowledge center to learn how to do this and to get a list of all the machine learning functions available.
Even if you aren’t really interested in machine learning, a number of these procedures are useful for other purposes. See the full list. Some examples include:
· IDAX.SUMMARY1000 shows statistics for the columns of a table, including missing value count, average, variance, skewness, minimum, maximum, and others
· IDAX.COLUMN_PROPERTIES creates and populates a table for the source table showing the source table’s column names, data types, etc.
· IDAX.SPLIT_DATA creates two new tables with the same columns as the source table and randomly copies each row of the source table to one or the other target table. You also tell it the percentage of rows that should go into each target table.
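As a rough illustration of what IDAX.SPLIT_DATA does, here is the same idea in plain Python: each row is routed at random to one of two outputs according to a target fraction. This is only a sketch of the behavior; the procedure itself operates on tables and takes its options as parameters:

```python
import random

def split_data(rows, train_fraction, seed=42):
    """Randomly route each row to a train or test list, the way SPLIT_DATA
    routes rows of a source table into two target tables."""
    rng = random.Random(seed)  # seeded so the split is reproducible
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(1000))
train, test = split_data(rows, 0.8)
print(len(train), len(test))  # roughly an 80/20 split
```

Because each row is assigned independently, the split is approximately, not exactly, 80/20 — the same is true of any per-row random split.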
Before you can start using these functions in a traditional Db2 install, you will need to follow these instructions to activate the facility. After that, you can visit Building ML Models with Db2 for step-by-step instructions and detailed examples of using these functions. You can also download Jupyter notebooks that have examples of building classification, clustering, and regression models using the Db2 in-database functions.
Kelly Schlamb also created some great information that can help you get started using the in-database machine learning functions and Python UDFs. Please see these links:
You can get all of the components needed to run the examples from the In_Db2_Machine_Learning repository at no charge. You can install them on your workstation or server; I have them installed on my Mac and ran them all successfully. Here is the list of items and the links where you can download them.
· Db2 Community Edition – On Windows or Linux you can install it natively or as a Docker container. On a Mac, the container is your only option. I run it under the Podman container runtime, but you can use Docker or another container runtime.
· Anaconda – A popular platform for getting Python and Jupyter notebooks. That is where I got those environments for my Mac.
· Db2 REST Endpoint – You can do all of the Db2 ML examples without the REST endpoint except the REST examples, so it is only required if you want to try those too. It only runs in a container, no matter which platform you have.
· Podman – Use any container runtime you like, including Podman or Docker.
If you have ideas about how Db2 can be used for machine learning, or other interesting ways it can make life easier, please share them on my Facebook page.