Data Virtualization
Dean Compher
29 Sept 2020
The IBM Data Virtualization tool allows you to have one place to query and join tables that physically exist in a variety of places and vendors’ database technologies. This makes it convenient for analysts and developers to access many different data sources with one connection. Data Virtualization is a database with tables you can query, but instead of the data behind those tables being in files physically in the database, the data exists in physical tables in other databases. Data Virtualization efficiently queries underlying tables when you select from the virtual tables in it. You can also create tables on Excel and delimited files on remote servers. Think of the tables in the Data Virtualization database as views like you can have in any relational database, except these “views” are built in tables in remote physical databases or files on remote servers. You can use your favorite driver to connect to the DV database like JDBC, ODBC or CLI from your application, BI tool, etc., and query these virtual tables. Advantages of using DV include allowing applications to connect to one database to query and join tables from multiple databases, and one central place to control access to all of those databases. Data virtualization also provides ways to increase performance by allowing you to cache data and to put agents at remote sites to possibly reduce the amount of data sent back to Data Virtualization.
In this article I’ll describe what Data Virtualization does and how it works so that you can decide if it merits further investigation. I’ll frequently shorten Data Virtualization to DV.
What is Data Virtualization? It is similar to any relational database. All relational databases have two main components, the engine or process that runs on a server and the files that contain the data, indexes and other objects. When the engine receives SQL to process, it reads the data in the files and sends the result set to the client. Data Virtualization is the same in that it has an engine that runs on a server, and when it receives SQL, it processes it and sends the result set back to the client. The difference is that instead of finding data in local files, it intelligently sends queries to the source databases that actually have the data. There are several database technologies that it can use including Db2, MongoDB, SQL Server, Oracle, Snowflake, Hive and several others. You can see the full list by going to the Supported data sources page, expanding the data source groups and looking at the Data Virtualization Column. It can even make Excel and CSV files on remote servers appear as relational sources, but you need to run the DV agent on those servers to provide communication.
As you might imagine the first thing you do after installing Data Virtualization, is to log in as the administrator and configure the connections to the data sources. It primarily uses JDBC connections, so you need to enter things like host name, port and database name along with a user id and password for the remote connection. Of course, only tables and views available to that user in the remote source can be used by Data Virtualization. Even after you configure the connections to your remote tables, there are no tables that can be queried from Data Virtualization yet. To allow that you would need to explicitly make tables available for query by creating virtual tables.
Before you start making tables available, you will probably want to create some schema names to organize the tables you make available. These schemas do not have to be related to the schema names in the source databases and you can have tables from many different databases available in one schema. This isn’t required, but is a good idea. Otherwise, the schema for each virtual table will be the user name of the administrative user configuring access.
The process of making a table or tables available for query is called “Virtualize”. In this step the administrator selects the tables to be virtualized from the sources, optionally changes the table names, renames or excludes some columns, and optionally assigns different schemas to those tables. This causes objects to be built in DV called virtual tables. This is accomplished by searching the tables in the sources and selecting the ones to be virtualized. At this point only certain users with a high level of authorizations can see the virtual tables. Finally, one of the administrators grants access to regular users by giving authorities to individual users or roles that contain several users. Once this is done individual users or applications can connect and query data. Data can only be SELECTed: updates, inserts and deletes are not allowed through DV.
There are four DV Roles with different levels of authority:
- DV Admin
- DV Engineer
- DV User
- DV Steward
Users can be defined and authenticated locally in DV or can be users in your LDAP, AD or other IAM system.
DV also lets you create views on your virtual tables. This is convenient if certain tables are typically joined. In this case you can create views that join tables across different databases, making things easier for end users and developers. You can also add columns based on functions such as substring, concatenate, or any number of mathematical columns. For straight forward join views and union all views there are wizards that make them easy. For more complex joins, DV provides an SQL editor where you can issue the CREATE VIEW command.
You may be thinking that selecting data through DV might be slower than directly querying the sources and in some cases you would be correct. However, the speed for most uses will be adequate and using the performance features of DV can mitigate performance issues or make it even faster than connecting to multiple remote sources. First of all, DV has an optimization engine that is used to intelligently query remote sources. IBM used its years of experience engineering Db2 Federation for performance in Data Virtualization. Next you can cache data in DV. You do this by giving it queries that bring back data used by user queries that you want to accelerate and that result is cached. You can have DV automatically refresh these results as frequently as every hour. Queries launched against the cached data do not have to exactly match the query used to create the cache. If the data in the cache will help, DV will use it automatically. The DV console also allows you to easily view which caches are being used and performance statistics for them.
Another very interesting performance capability is the ability to put DV agents in remote locations. If all of your source database systems are in the same data center then the agents are not as useful, but they can increase performance significantly when those sources are in remote data centers. The agent can be placed on one or more source database servers. If you do not wish to install an agent on any data base servers, then the agent can be placed on a server near your database servers. The agent at a remote data center becomes especially useful when there are multiple database servers there whose tables are frequently joined in queries. This is because agent will query the database servers near it, do the join and send only the needed results to the central DV system. This not only reduces the amount of work that the central DV server has to do but can greatly reduce the amount of data transmitted between sites. As a bonus these agents can communicate with each other and have the intelligence to learn the fastest routes back to the central DV server and use the best routes. There is no additional license charge for the agents. Finally, any CSV or Excel files on the server where the agent is located can be configured to appear as virtual tables in DV.
All of the administration can be done through the Data Virtualization Console as shown below. The DV window here shows the main menu in the upper left over the Connection details screen. This screen is where a user can find the information needed to connect the Data Virtualization database from a remote client and even has the link to download the drivers needed.
Users of the virtual database include applications, business analysts creating reports and dashboards, and data scientists. Administrators can also use the client to do any administration task that can be done using SQL including creating schemas and creating views.
Like a number of IBM Data and AI products, Data Virtualization uses a common administration layer for things such as user management, troubleshooting, and deployment. Further, you do not buy a software part called Data Virtualization. Instead you buy enough Cloud Pak for Data for the Data Virtualization that you want and deploy just the components needed for Data Virtualization. On the other hand, if you own Cloud Pak for Data, and have the capacity in your system, Data Virtualization is one of the many features you can deploy and use without acquiring licenses for an additional product.
In this article I hit the highlights of the Data Virtualization features. Additional features and instructions can be found in two sections of the Knowledge Center. The installation and administration section describes how to do things like manage users, create schemas and manage access to virtual tables and views. The Virtualizing Data sections describes how to add data sources, restrictions, creating objects and caches and several other items.
***
This article describes some of the useful features of Data Virtualization. If you discover any interesting use cases, please tell us about them on my db2Dean Facebook Page and share your thoughts about them.