DB2 INGEST Utility

DB2/LUW INGEST Utility

Dean Compher

28 March 2013

The DB2 Ingest Utility or “Continuous Data Ingest” is a new feature of DB2 for Linux, UNIX and Windows (LUW) 10.1 that allows a lot of flexibility for managing data in your database based on records in a file or pipe and is very fast. It can insert new rows, update existing rows, or do a combination of both which is sometimes called a merge or “upsert” operation. It can also delete rows in a database based on the data in a file or pipe. It allows you to use DB2 functions and expressions so you can manipulate the data as it is being ingested. It is faster than the INSERT utility and can be used to load massive amounts of data into a database in a reasonable amount of time. It even exploits partitioned databases! While it is not as fast as the LOAD utility, it does not have the concurrency issues that the LOAD utility faces. I will discuss these topics in more detail in this article.

The ingest utility is only available in DB2 Advanced Enterprise Server Edition (AESE) in DB2 Version 10.1 and the IBM Pure Application System. It is compatible with DB2 pureScale as long as you are licensed for both the pureScale feature and DB2 AESE.

As with load and import, ingest can take input from a file or set of files. You can also get its input from a named pipe or set of pipes. Being able to ingest from pipes means that you can start it up and it will sit there waiting for records to appear on the pipe and when they do, the utility will process them into the database. In this way you don’t have to land data in files in a directory and then process and manage the files. Records in the file or pipe can contain delimited or positional fields.

While you can copy your files to the database server and use the ingest utility there, we anticipate that it will primarily be run from an application server or file server where the data lives. To allow this you just need to install the “IBM Data Server Client” or the “IBM Data Server Runtime Client” on the servers where the data exists. They are a free download for users of DB2 databases and you can read more about them on my DB2 Clients page where you can also find the download links. Once you install the DB2 10.1 or later Client on your application server you can just run the ingest utility to read from files or or set of pipes. If you have the ingest reading from a pipe or pipes, you will also need to modify your application to put the records on those pipes and then DB2 will drop the records into the database as they appear. As of version 3.2, Data Studio does not provide the ingest utility.

As I said earlier, you can do more than just insert new data into the database. The functions shown in Table 1 are provided by the utility. Further you can use advanced SQL features like the case statement to tell it how you want the data processed.

Table 1. INGEST SQL Commands

INSERT	Allows you to use the standard INSERT statement to tell the utility how to insert data into the table. Records that would cause duplicates as defined by primary keys or unique indexes on the table will be rejected.
DELETE	Rows are deleted from the database whose keys are specified in the input file or pipe.
MERGE	Allows you to insert records that do not exist in the table, and update specified columns of records that already exist in the table.
REPLACE	Deletes ALL of the rows in the table and then inserts the records in the file. Be careful when using this command on large tables because the delete is run in one transaction and can cause you to run out of log space. A better alternative may be to manually run the TRUNCATE statement prior to using INGEST.
UPDATE	Updates columns that you specify in the table where the row matches the input record key. Records that have no matching row are ignored.

One of the things that makes the ingest utility really useful is the fact that you can name the input fields. Giving each field a name allows you to process it using DB2 expressions and functions when adding data to the databases. When processing numeric fields you can add, subtract, multiply and do any other math with them to come up with a value to add to the database or use in a predicate (where clause) plus you can use any other mathematical function. For character data you can use substring, concatenate and any string function that you like. I will show a few examples in Figure 1 that I stole from the “What’s New in DB2 10.1” presentation of using fields with different SQL commands. However, for a more comprehensive set of examples I suggest that you review the INGEST Command Info Center Page and visit the set of links appearing at the end of this article.

Figure 1. Examples

-- Input fields are given names meaningful to user and start with “$” symbol

-- Uses fields in INSERT as individual values and in an expression

INGEST FROM FILE my_file FORMAT DELIMITED

(

$prod_ID INTEGER EXTERNAL,

$prod_name CHAR(8),

$unused_field CHAR(1),

$base_price DECIMAL(5,2) EXTERNAL,

$shipping_cost DECIMAL(5,2) EXTERNAL

)

INSERT INTO my_table(prod_ID, prod_name, total_price)

VALUES($prod_ID, $prod_name, $base_price + $shipping_cost);

-- Uses fields in DATE function and uses concatenate operator

INGEST FROM FILE my_file FORMAT DELIMITED

(

$month CHAR(2),

$day CHAR(2),

$year CHAR(2)

)

INSERT INTO my_table(date_column)

VALUES( DATE('20' || $year || '-' || $month || '-' || $day) );

-- Simple merge example

INGEST FROM FILE my_file.del FORMAT DELIMITED

(

… skip field definition …

)

MERGE INTO my_table

ON (key_col1 = $key_fld1) AND (key_col2 = $key_fld2)

WHEN NOT MATCHED THEN

INSERT INTO my_table

WHEN MATCHED THEN

UPDATE my_table SET (data1_col, data2_col, data3_col) =

($data_fld1, $data_fld2, $data_fld3);

-- Delete rows based on keys from an input file.

INGEST FROM FILE my_file.del FORMAT DELIMITED

(

… skip field definition …

)

DELETE FROM my_table

WHERE (key_col1 = $key_fld1) AND (key_col2 = $key_fld2);

Configuration Parameters

The examples of parameters shown so far are only a fraction of what is available. There are many more that can be specified when invoking the utility for control processing including:

what delimiter is used or specifies positions of fields
control restarting after failure,
what to do with records that do not have ingest-specified data type or can’t be inserted due to a unique for foreign key constraint
how many warnings can be encountered before the utility quits
all of the field data types you can specify
business time for temporal tables
several others

In addition to the parameters that that can be specified as part of the utility command, there are others that are defined prior to invoking the utility as part of the Command Line Processor (CLP) environment. That is, you can specify them once after opening the CLP window, and those settings will be in effect for all of the ingest commands that you run until stopping the CLP. For example, if I run the following set of commands at the command line or put them into a file and run it with the “db2 –tvf <file_name>” command:

INGEST SET commit_count 300;

INGEST SET commit_period 10;

INGEST SET shm_max_size 2 GB;

INGEST FROM FILE my_file1.del FORMAT DELIMITED … ;

INGEST FROM FILE my_file2.del FORMAT DELIMITED … ;

then the both INGEST FROM commands would use the settings configured in the INGEST SET commands. See this link for the complete set of ingest environmental configuration parameters and detailed descriptions of each one.

One important factor in any utility that adds or changes lots of data is the commit frequency. By default, ingest will issue a commit once every second. This means that by default the maximum amount of time that the ingest utility will hold a lock on the inserted rows in one second, so the most time another application that wants to use one of the ingested rows is one second. This default is generally a good trade off between efficiency and concurrency. However, you can change this default and also allow commits to be done after a certain number of rows are processed. You configure the settings using the INGEST SET command as described above with the “commit_count” and “commit_period”.

Restarting Failed Processing

If you are running an ingest for millions of rows, the last thing that you want to have to do is to restart it from the beginning if a failure or cancelation of the job occurs after you have completed hours of work. The ingest utility makes it easy to pick up where you left off at the last commit point, but there are some steps that you have to do to allow this to happen. The main preparation that you have to do is to create one restart table in each database where you want to use the ingest utility and grant access to that table to each user who may run it. The ingest process is described in detail in the “Ingest Utility” section of the Data Movement Utilities and Reference manual. The ingest section of this manual also shows you in detail how to restart the utility if it does fail and provides good examples. If you don’t want to restart the job, you can also terminate it using the “RESTART TERMINATE” parameter. This is also described in the manual. There are a number of restart parameters that are part of the ingest command that you will want to review that are described on the the INGEST Command Info Center Page

Ingest Performance

There are three major phases of the ingest utility and you can change settings that influence the behavior of each phase to change overall performance of the utility. The phases are Transport that reads the file(s) or pipe(s), the Format that formats the input, and the Flush that issues the SQL statements to DB2. You can read more about the phases in the Ingest Utility summary page.

Monitor Ingest

Finally you may want to monitor the progress and view statists of running ingest jobs. Here are some commands that you may find useful:

INGEST LIST Show basic information about ingest jobs running
INGEST GET STATS Show detailed statistics about ingest jobs running
LIST UTILITIES Show all utilities, including INGEST, currently running in the system.

List of Links

Unleashing DB2 10 for LUW	IBM Redbook: Chapter 8 discusses the ingest utility, gives lots of great advice on using it and provides several examples.
Data Movement Utilities and Reference	There is an entire section devoted to the ingest utility that gives lots of great advice on using it and provides several examples.
Ingest Utility Summary	DB2 10.1 Info Center page describing the utility and providing additional links for more reading.
Ingest Command	DB2 10.1 Info Center page showing detailed syntax and all parameters for the command, plus a number of use cases and examples.
Ingest CLP Environment Parameters	DB2 10.1 Info Center page listing all environmental parameters that can be set with links to detailed descriptions of each.

***

I hope that you found this introduction to the ingest utility useful. If Please comment on this article in my Facebook Page or my db2Dean and Friends Community and help the extended community.

HOME | Search