Loading Data to dashDB
30 August 2016
When you want to move a lot of data to your tables in dashDB you have several options. I have been using a number of them and will share my experiences along with information I have learned from Kelly Schlamb in this article. The less data that you have to load the more options you have, because for large data sets you will want to use utilities that compensate for network speeds. That is because the speed of uploads is limited by the network speed of your internet connection. In this article, I’ll start small and discuss options for larger data sets. A number of these options are also available for DB2 on Cloud as well.
In this article I will discuss loading your own structured data into dashDB. dashDB has facilities to get various types of unstructured data like JSON documents, Twitter feeds, and others into it, but I’ll discuss those in another article.
General Table Loading Tips
Please note this about loading data from files into dashDB and DB2 on Cloud. It is a very common practice to load data from CSV and other files exported from other databases for your initial load of table. You will want to be very careful about two things when creating these files. If you use double quotes to enclose character strings, make absolutely sure that you do not have double quotes in any of your character stings. This causes problems for the load and other utilities. The easiest way around is to use something other than double quotes for your character string delimiters when you create the file. Just make sure that the character string delimiter that you use is not in any of the character strings. The other issue that I frequently see is having the new line character in character strings. This causes utilities to break records at inappropriate places. You can either remove those characters from the file or use the delprioritychar modifier in the dashDB load utility. This is only available when you call the LOAD directly since other facilities that call the load for you may not allow passing of parameters like this.
Using the LOAD utility for the initial load of a table will frequently create the best compression dictionaries for column organized tables. Column is the default organization for dashDB analytics databases. It is also good to have a large amount of data in the initial load for the best compression dictionaries. This is because the more data that you have when the compression dictionaries are built the more representative the data set is likely to be of what will ultimately be in the table. The better your compression dictionaries are the better your performance of column organized tables will tend to be.
Loading Less than 100 MB
When loading data sets under approximately 100 Megabytes, the field is wide open. This is not a hard limit. It is only a rule of thumb for an approximate amount of data that can be uploaded in a reasonable time without utilities to compensate for data transfer speeds. Depending on your network speed and definition of acceptable time, then you may need to adjust this number.
A list of options I describe in this article includes:
Š Load files from your workstation using the dashDB Console.
Š Use your favorite ETL Tool like Information Server
Š Use DataWorks
Š Use the DB2 Ingest, Import or Load from Client utility.
Š Use Federation to insert data using SQL
Š Write an application to read your input and insert into you cloud database.
Š Use more advanced utilities discussed below for larger data sets.
Essentially, when working with smaller data sets, you can connect to dashDB or DB2 on cloud just like you would with your on-premises DB2 or other database to load data into tables. For details on connecting including making secure SSL (TSL) connections, please see my recent Connecting to dashDB article.
dashDB from desktop files
The easiest way to load a smaller file is to use the dashDB console. This is the web based interface for administering and using your dashDB service. To upload a file using this console, just select the “Load” option on the left side of the panel and choose “Load from Desktop”. Then follow the wizard. It will accept CSV and Excel files. One really nice thing about using this option is that dashDB can create a new table for you figuring out the likely data type of each column and using the column names if they are present in the first row of the file. It lets you modify its suggestions before creating the table. For more information see my article about converting a file to a table. That uses the old SQLDB database in the examples, but the steps are the same in dashDB. The console also has a feature that is currently in beta for uploading larger files with our new Aspera technology.
You can use your favorite Extract, Transform and Load (ETL) tool like Information Server (Data Stage) that you already have installed on your premises to load data into dashDB and DB2 on Cloud (DB2oC). This is a great option if you need to do lots of transformation of the data before putting it into dashDB. It is also a good choice if you have lots of dirty data that needs to be cleansed. For large amounts of data, use your ETL tool to drop the data into CSV files and load those to the cloud using a method shown below. dashDB and DB2oC can be targets of your ETL just like any other on premises database. See my Connecting to dashDB article for more information connection information including SSL encryption and drivers.
DataWorks is a easy to use ETL tool in the cloud. It allows you to select from several different sources, does shaping of data and then put the data into dashDB or DB2oC.
DB2 Ingest, Import, or Load from Client Utilities
When you install the DB2 Data Server Client on your machine and configure it to connect to dashDB or DB2oC you can use the local utilities it installs to load your on-premises data into dashDB or DB2oC. See my Connecting to dashDB article for more information about where to get this client and for making secure connections. It is best if you are using the very latest DB2 Data Server Client version, as older versions may not work with dashDB. These utilities include the INGEST, IMPORT and the LOAD command using the “CLIENT” option.
With InfoSphere Federation Server (formerly called WebSphere Federation Server) or the federation features built into the advanced editions of DB2 you can create a DB2 database on your local DB2 server that contains objects that look like tables, but are really views of tables in other databases. You can think of this as a virtual database. Your federated (or virtual) database can build these objects using dashDB and other IBM databases (DB2, Informix) and non-IBM databases (Oracle SQL Server, MySQL, etc.) for sources and targets. Once all of these objects (called nicknames) are configured properly in your federated DB2 database, you can just write queries or stored procedures that select from a table or tables in one database and insert rows into your dashDB database. For example, let’s say you have federated database with a nickname called LESSER_DB_TABLE_A built on table in SQL Server and also a nickname called SUPERIOR_DASHDB_TABLE_A. You could then write this query to copy all rows using your federated database:
INSERT INTO SUPERIOR_DASHDB_TABLE_A
SELECT * FROM LESSER_DB_TABLE_A
Another nice thing that federation may allow you to use is the db2move command that is very handy for exporting lots of little tables. The db2move command allows you to export many tables at once and then easily use those files to import into other tables with one more command. While I have not tried this command with federation or dashDB yet and, I suspect that the LOAD option will not work and that you will need to use it with the copy or import option. If you try this, please share your experience on my Facebook Page or my db2Dean and Friends Community
Loading Over 100 MB
One of the challenges of getting large data sets into any cloud database is the speed of uploading data across the internet. In this section I will discuss two utilities that can speed this process significantly.
Cloud Object stores and moveToCloud utility
For larger data sets, you can copy data to IBM Softlayer Object storage or Amazon S3 and then load it to dashDB or DB2 on Cloud (DB2oC). Since I’ve only worked with Softlayer, I only describe my experiences with it, but I understand that Amazon is similar. The Softlayer console has a facility for uploading files up to 20 MB. There are quite a few utilities for copying files to these cloud storage systems, but you may want to use the moveToCloud utility to get your files to the cloud for a couple of reasons. First it compresses your file as part of the move, and that can save a lot of time copying your file over the internet. The next reason is that it automatically breaks the file into pieces in such a way that that the file is put back together as part of the load into the database. Softlayer and S3 have a file size limit of 5 GB. That is why larger files need to be broken up. There is no additional charge for using moveToCloud. You can see the install instructions and get the download link for the utility on the moveToCloud page.
Click this guide for step by step instructions with examples for using moveToCloud with Softlayer Object Store. It even shows where you to get URLs and credentials from your Softlayer account needed for the moveToCloud and load steps.
After you get the file copied to the cloud store you will run a load utility to actually put it into a dashDB or DB2oC table. For dashDB you can initiate the table load from the dashDB Console. In this case you would use the Load from Cloud option that has a set of panels where you enter the file location and cloud storage credentials. For dashDB and DB2oC you can call the load utility from your favorite administration client including IBM Data Studio, Aginity Workbench or a CLP or CLPPLUS command line. There are some particular ways that you will need to do this that I will discuss later. This this guide shows how to get the credentials and URL’s needed in the LOAD utility.
dashDB Console for Cloud Storage Files
Once your data in the Softlayer Object Store or Amazon S3, you can use the dashDB console to load that data into dashDB. Just go to your dashDB console, select the Load option on the left side of the page and then choose Load from Cloud. See these step by step instructions for using the dashDB console to load the data and see exactly where to get the URL and credentials from Softlayer Object Store and where to put them in the Load from Cloud wizard.
dashDB LOAD utility
Once you use the moveToCloud facility to copy your files to the cloud, you can also use the LOAD command run from a local DB2 client including IBM Data Studio, Aginity Workbench or a CLP or CLPPLUS command line on your workstation or on-premises server. You can get the client and configure your connection to dashDB using the instructions in my Connecting to DashDB article. Because it is best to execute the LOAD command on the dashDB server using the libraries on that server, you need to use the admin_cmd facility to call the load. Here is an example of using that facility using the Softlayer information from step 2c in the moveToCloud instructions guide as if we were loading the file uploaded using the commands in that document to load the data into the INVENTORY_LEVELS table.
call sysproc.admin_cmd ('load from Softlayer::https://dal05.objectstorage.softlayer.net/auth/v1.0::IBMOS123456-sandra::aso09823ono78923tea98765abcsf8798302987qq45098axw8978g::dashDBLoadDemo::INVENTORY_LEVELS.csv.gz of del modified by dateformat="YYYY-MM-DD" messages on server replace into INVENTORY_LEVELS')
In this example I added the datefomat modifier for the date format of my input file and the messages on server parameter. You can use the any of the input parameters the LOAD utility allows. Because I used the “messages on server” parameter, the output of the command will give me the query to run to see any errors or warnings that might come out of the load process.
The output from the above load will show a query that looks something like this. It will show any warnings or errors encountered by the load.
SELECT SQLCODE, MSG FROM TABLE(SYSPROC.ADMIN_GET_MSGS('284487704_944477997_XXX0001')) AS MSG
Bluemix Lift, formerly known as DataWorks Lift can copy data from certain on-premises databases and CSV files into dashDB or DB2oC databases using the amazingly fast Aspera technology. When connecting directly from your on-premises database to your cloud database, it is also quite efficient because it does not need to write to an intermediate file to some cloud storage server and then read it. It copies the data directly from your on-premises tables into dashDB or DB2oC tables. When you are loading from a CSV file on your local server it stages the file on your dashDB system and then loads it. Because none of the Bluemix Lift targets need a separate cloud storage area, it has very good security implications. Further, it connects to your dashDB or DB2oC database using SSL/TSL by default.
After you download and install Bluemix Lift you can execute it from a browser or from the command line on the server where it is installed. The current on-premises sources are DB2/LUW, PureData for Analytics (Netezza) and CSV Files. You can load either dashDB or DB2oC from CSV files and you can use CSV files created from any source including Oracle, SQL Server and others. From Pure Data for Analytics you can do the direct load into dashDB. You can also do direct loads from DB2 to DB2oC. As of the time that I am writing this article, Bluemix Lift is only available through a beta program, but I expect that it will be released soon. If you are interested in trying it out, you can contact your friendly IBM technical specialist.
I have used Bluemix Lift as part of a dashDB proof of concept for a customer and found it to be remarkably fast and easy to use. Because I like command line interfaces, that is the only interface I used. One note I would make about using CSV files as your source is to make sure that your data is very clean. For example, if your csv file use quotes to delimit character string data, you need to make sure that there are no quotes in that string data. If your character data contains quotes, then pick a different delimiter around your character strings. Also you need to make sure that your character strings do not contain any new line or character returns. These just need to be removed from the character strings in the csv file.
Sometimes it is convenient to load your data into staging tables on your dashDB system so that you can do additional validation or transformation before moving your data to the real production tables. Here is an example of using the load from cursor, copying one table in dashDB to another. Notice that I use the load command to get the best possible compression dictionaries.
call sysproc.admin_cmd ('load from
(SELECT * FROM STAGE.INVENTORY_LEVELS
between ''2015-01-01'' and ''2015-12-31''
ORDER BY INVENTORY_CNT)
messages on server
In this example notice that I used two single quotes around the date strings. They normally would have one single quote, but because the whole input parameter given to the admin_cmd procedure is quoted, I had to do it this way. For other instances where I had to use single quotes, I actually used three or more quotes. Unfortunately, I didn’t save any of those examples. I found it easiest to create a view on the staging table that used the CAST and other functions to transform my data instead doing it directly in the load command. Views also made it easier to combine tables using a UNION ALL when I needed to load a table in parts to multiple staging tables.
Shipping a Drive
There is one other option for sending large amounts of data and that is mailing a physical hard drive to IBM. There is a little information on this on the dashDB Mail a Drive page, but for this option you should contact your IBM Technical Sales Specialist.