Overview of Big SQL 3.0 Functionality

Anjali Khatri

29 July 2014

 

Apache HadoopTM (Hadoop) has received huge recognition in the Big Data market space because of the great flexibility in processing heterogeneous data structures. However, to add even greater value, it is critical for Hadoop to integrate with various analytical tools that are widely used by different enterprises.  Most of these tools utilize standard Structure Query Language (SQL) and SQL procedural functionality. It is also important to allow the enterprises to utilize their existing SQL expertise, which they have been building for years. 

In the last few years, this notion brought to life multiple SQL-on-Hadoop solutions such as Cloudera Impala®, Teradata-Aster®, Presto®, Pivotol Hawq® and others. The purpose of IBM InfoSphere® BigInsightsTM (BigInsights) - Big SQL 3.0 is to provide users a SQL interface built upon the Hadoop infrastructure that is consistent with the SQL interfaces they’re using in the relational database world. [2]

This article consists of two parts. The first part introduces the Big SQL 3.0 functionality as part of the SQL-on-Hadoop concept.  The second part provides a step by step tutorial of how to get hands on experience with IBM’s InfoSphere BigInsights 3.0.

Introduction to Big SQL 3.0

 

Big Data is derived from multiple sources. It involves not just traditional relational data, but all paradigms of unstructured data sources that are growing at a significant rate. For instance, machine-derived data multiplies quickly and contains rich, diverse content that needs to be discovered. Another example, human-derived data from social media is more textual, but the valuable insight are often overloaded with many possible meanings. Trying to efficiently extract the meaningful insights from such data sources quickly and easily is challenging.[6]

Many people believe that the Hadoop technology is the best answer to those challenges. Hadoop is an open source platform, which many vendors extend in search of robust additional functionality. [4] IBM InfoSphere® BigInsightsTM (BigInsights) is IBM's approach to enrich Hadoop with enterprise class capabilities. The latest release of BigInsights - 3.0, provides a single platform to manage all data types, and obtain insights that integrates the usual silos of data. With such integration, BigInsights is a viable platform for storage, analysis and visualization of wide variety of data sources.[5] Several new features of BigInsights 3.0 concentrate on integration and workload optimizations that bring improvements for storing, processing and querying data. Big SQL 3.0 is one of those capabilities. (See Figure 1)

 

 

                Fig. 1. Logical view of Hadoop ecosystem with Big SQL 3.0

The idea of SQL-on-Hadoop is not new to the industry. In addition to the open source Apache Hive, several vendors like Cloudera Impala®, Teradata-Aster®, Presto®, Pivotol Hawq®, Hadapt®, Stinger® and others are present in the SQL-on-Hadoop landscape today.[1]  With SQL querying, the vendors try to extend the traditional MapReduce functionality built for large scale and complex data processing. However, up until now, it has been especially difficult to provide high performance, full ANSI SQL compliance and enterprise level support of rich SQL language that executes over the heterogeneous Hadoop ecosystem.[4] BigInsights – Big SQL 3.0 is a SQL-on-Hadoop functionality that breaks the dependency with Hadoop’s traditional MapReduce framework.[1] It delivers a new, more comprehensive way to rich SQL support that also includes enterprise level performance, security, management and monitoring - all features traditionally found in the  IBM® DB2® for Linux, UNIX and Windows (DB2 LUW) relational database management system.

 

Big SQL 3.0 achieves it’s capability by leveraging IBM’s leading relational SQL expertise. Big SQL 3.0 introduces IBM SQL compiler and runtime engine built on IBM’s expertise in shared-nothing, low-latency parallel processing database architecture. It has the advantage to deploy directly to the Hadoop cluster and access that data natively while still allowing the data to remain in its original state on the Hadoop Distributed File System (HDFS).[4] It provides the ability to access custom data storage formats through a Java I/O module. It provides a logical view of the data where users use the Hive database catalog (HCatalog) for table definitions, metadata, location, storage format and encoding of input files.

 

There are four main Big SQL 3.0 functionality pillars built and run upon the Hadoop cluster: rich SQL language support, performance, application portability and integration, and enterprise capabilities.[1]

Functionality of Big SQL 3.0

 

Let’s review briefly the four main pillars of Big SQL 3.0 grouped by functional areas. 

 

As the name suggests, the first focus of Big SQL 3.0 is the rich SQL support that leverages IBM’s SQL capability layer to provide the most complete ANSI SQL coverage. It stretches over a wide variety of standard SQL relational functionality that provides support for: variety of data types (tinyint, smallint, int, bigint, boolean, float, double, string, varchar, and others); standard APIs and common client drivers (Java Database Connectivity (JDBC) or Object Database Connectivity (ODBC)); procedural language structures such as in DB2 SQL PL or user defined functions in Java or C++. It has built-in statistical, analytical, and business intelligent functions that encapsulate SQL querying with very few modifications. It is designed for large-scale analytic aggregation and supports some of the main OLAP functions like LEAD, LAG, RANK, DENSE_RANK, CORRELATION, STDDEV, REGR_AVGX, and others. Big SQL 3.0 comes also equipped with an extensive library of more than 250 built-in functions and 25 aggregates.

 

Big SQL 3.0 is a simple, but comprehensive way to apply SQL to existing data processed by Hadoop. It provides a logical view of the data that uses Hive Catalog (HCatalog) for table definitions, metadata, location, storage format and encoding of input files. When creating a table, HCatalog definition and Hive APIs get stored in HDFS warehouse directory within its designated schema. Hive's metastore catalogs table definitions could be read and/or write, and they are shared across the Hadoop ecosystem - they can be accessed by other components (such as Pig, Sqoop, etc.) throughout the Hadoop cluster.[2] To facilitate query execution, the HCatalog stores some of its metadata locally on each node. Based on this, when in the parallel processing mode, the server connects to each specific node that contains the partitioned query-able data and generates a result set of stored data within this specific node of the cluster. The query execution also stores intermediate data into the Big SQL 3.0 engine memory for any future or additional processing. (See Figure 2)

 

Although the rich SQL support is the essence of the BIG SQL 3.0, there are other very important features that compliment and raise the value of the SQL functionality.

 

The second functionality pillar is application portability and integration within BigInsights – Big SQL 3.0. This feature introduces portability that allows data sharing amongst the Hadoop ecosystem and other SQL querying software products as well as usage of SQL language instead of the MapReduce/Java syntax. In some ways, this integration is the ultimate driving force behind the Big SQL 3.0 feature. This is an enhancement that fully integrates with third party SQL application tools that leverage common client drivers like JDBC and ODBC drivers. The benefit is that the existing queries can be run with little to zero modifications. Big SQL 3.0 tables can be accessed using the db2 data server client as well as the Big SQL JDBC driver.  This is the same client that is used to access DB2 on Linux, Unix and Windows.  This makes it possible to easily access Big SQL tables with many popular API’s including PERL, Python and Ruby.  The Big SQL node and database are cataloged using the same commands as for a DB2 LUW database.  In addition, the CONNECT statement and almost all SQL statement are consistent with DB2 LUW.  In this illustration, a query is run to access a table in Big SQL on a Hadoop cluster from the DB2 client command window. Also available is an advanced query federation of query-able archives that allows data to be combined - the data does not need to live in isolation on a single system. This is an additional integration capability that allows us to send a distributed request to multiple data sources with a single SQL statement. These data sources include different relational database management systems like DB2, Teradata, Oracle®, Netezza®, etc.

 

Big SQL 3.0’s third functionality pillar is performance, facilitating monitoring of resources and software management. The performance tooling is built from the ground up to traverse low latency and optimized query execution. Its operations run in memory, and its daemons live directly on the Hadoop cluster. [1] As a result there is less overhead and reduced latency to retrieve data. There is no need to initiate MapReduce jobs on the cluster where data is located.

 

Big SQL 3.0 query optimization has the following forms: query rewrites, query plan optimization, predicate pushdowns and statistics-driven optimization. Query rewrites, is an ideal way to execute queries with minimal resources. Query plan optimization includes many sub queries optimization. For instance, one of the main sub queries is a join amongst data that is spread out throughout the Hadoop cluster. Other queries are shared aggregation and distinct elimination, which leverage key definitions for a cohesive rewrite. Predicate pushdown processes data reads based on static variable definitions. For example, Big SQL 3.0 distinct elimination does its query writes by eliminating the use of the DISTINCT data type if the key values are already unique.[1] Finally, statistics-driven optimization stores its column data in the query optimizer to determine an ideal query execution. These transformations are based upon years of IBM’s relational database experience and greatly improve statistics and heuristics.

 

The enterprise level capabilities are the fourth pillar of the Big SQL 3.0. These capabilities include functionality like monitoring, management and security of the SQL engine running over the Hadoop system. Resource, event, memory and workload management enhance the Hadoop functionality for stored data in BigInsights. The security capabilities include secure user authentication, authorization, auditing capabilities, and fine grained row/column access control. Another key component is the scheduler that eliminates any stray data not relevant for a SQL query (known as partition elimination). It also ensures data processing amongst nodes is done cohesively, as close to the data as possible.

 

 

Fig. 2. Hadoop components interaction with Big SQL 3.0

 

A simple way to get hands-on-experience with Big SQL 3.0

 

This section describes how to get hands on experience with Big SQL 3.0. IBM has provided an easy way for you to get familiar with Big SQL 3.0. We have set up a cloud environment to allow you to get hands on experience with this component of BigInsights.

NOTE: On Windows platform, the URLs can be accessed using Mozilla Firefox. On MAC platforms, use Safari.

 

SETUP

 

1.       Go to http://bigsql.imdemocloud.com/ and submit the request form to gain access to Big SQL 3.0 Technology Preview.[7]

 

·         The IM Demo Cloud Team will email you with instructions on how to setup your account on the cloud.

 

 

·         Once you’ve set up your account, go to http://my.imdemocloud.com and login with your credentials.

 

 

 

·         After you log in, verify you have access to Big SQL 3.0 Technology Preview project.

 

 

2)       Click on the Big SQL 3.0 Technology Preview project.

 

·         Go through the guided tour.

 

·         Once the tour is done, click on the tab called “Systems”.

 

·         Under “Systems” it will include information for BigInsights web console and hostname of the Big SQL 3.0 environment.

 

·         Launch the BigInsights web console by clicking on “BigInsights Web Console” or going through URL: http://hostname:8080. Use IM Demo cloud credentials to login.

 

·         Verify BigInsights services are up and running for Big SQL 3.0 by clicking on “Cluster Status”.

 

BigInsights supports a command line interface for Big SQL 3.0 called Java SQL shell (JSqsh pronounced “J-skwish”). JSqsh is an open source project for querying JDBC databases through a shell-like environment.[3] To issue Big SQL 3.0 commands from JSqsh, launch a terminal window (for instance, PUTTY).

 

3)       Use BigInsights hostname and log in using IM Demo Cloud credentials.

 

4)       Launch the JSqsh shell by issuing command: $JSQSH_HOME/bin/jsqsh.

 

5)       During the initial launch, begin the JSqsh setup wizard by hitting Enter.

 

6)       When prompted, enter C to launch the Connection wizard.

 

·         The Wizard will display a list of JDBC drivers, enter the number corresponding to IBM DB2 db2 (IBM DB2).

 

·         The wizard will display some default values to the connection properties and prompts you to change them.

 

·         Change each of its variables, one at a time, by entering the variable number and specifying it’s designated new value.

 

o        For instance, to change the value of the password variable, type variable 5 and hit Enter. It will prompt you to enter a new password. Type the new password value and hit Enter again.

 

o        Do this and apply changes to all variables.

 

o        Inspect the new settings and verify your change.

 

o        Enter t to test your connection and verify the test succeeded.

 

o        Enter s to save your connection information and specify bigsql as the connection name.

 

o        Once the connection is created, enter q to quit the wizard.

 

7)       Connect to the bigsql engine from the JSqsh terminal – once the Big SQL 3.0 connection is created.

 

·         Type \connect bigsql and hit Enter.

 

·         Once you’re connected, verify JSqsh command prompt starts with 1>.

 

USAGE

 

Now that you’ve created the bigsql engine, you can issue Big SQL 3.0 queries to become familiar with the JSqsh shell.

 

1)       Type \show tables –e | more to display information for all available tables.

 

·         This information will vary depending on your environment.

 

2)       Next, create a simple Hadoop table called demo

 

·         create Hadoop table demo (col1 int, col2 varchar(5));

 

3)       Insert a row into the table.

 

·         insert into demo values (1, ‘one’);

 

4)       To view metadata for the tables, use \describe command with the table name in uppercase.

 

·         \describe USER.DEMO

 

·         Another way to query the system for metadata, run the following SQL command

 

o        select tabschema, colname, colon, typename, length from syscat.columns where tabschema = USER and tabname= ‘DEMO’;

 

5)       Load a data file from a local file system into HDFS.

 

·         Use Hadoop put command to load the data: “hadoop fs –put /local/file/system/f.txt /hadoop/file/system/.

 

6)       For complex Big SQL 3.0 querying:

 

·         Create a Hadoop table with table schema, delimiter value and table format.

 

CREATE HADOOP TABLE IF NOT EXISTS sample

( key INT NOT NULL

, value INT NOT NULL

, total_unique_key DOUBLE

, total_value_per_key DOUBLE

) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'

LINES TERMINATED BY '\n'

STORED AS TEXTFILE;

 

·         Next, load data from HDFS into the newly created table.

 

load hadoop using file url ‘file:///sample/data/file.txt’ with SOURCE PROPERTIES (‘field.delimiter’=’\t’) INTO TABLE SAMPLE overwrite;

 

·         Run a SELECT command to verify data was loaded into the table

 

Select * from sample;

 

Now you are ready to create your custom tables and try to test more complex queries.

 

Note:  Within JSqsh, a query can be split across multiple lines. Query will only execute with a semi-colon at the end. To review the history of commands that were recently executed in JSqsh, type \history and hit Enter.

 

Conclusion

 

As we outlined above, the BigInsights – Big SQL 3.0 introduces rich SQL support in combination with features that provide enterprise level performance, management and monitoring.  IBM also provides an easy way for you to explore this functionality. Give it a try and let us know what you think!

References

[1]  Gary, Scott C. Ozcan, Fatma. (April 2014) SQL-on-Hadoop without compromise. How Big SQL 3.0 from IBM represents an important leap forward for speed, portability and robust functionality in SQL-on-Hadoop solutions. IBM Software Group. IBM.com. Retrieved from: http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?infotype=SA&subtype=WH&htmlfid=SWW14019USEN#loaded

 

[2]   IBM InfoSphere BigInsights Version 3.0. IBM® Big SQL. IBM Knowledge Center. IBM.com Retrieved from: http://www-01.ibm.com/support/knowledgecenter/SSPT3X_3.0.0/com.ibm.swg.im.infosphere.biginsights.product.doc/doc/bi_sql_access.html

 

[3]   Saracco, Cindy. (July 2014). Big SQL on Hadoop Part 2: The Big SQL Command Line Interface. Developers Works. IBM.com. Retrieved from: https://developer.ibm.com/hadoop/docs/853/

 

[4]   White, Tom. (2012). Hadoop: The Definitive Guide 3rd Edition. Sebastopol, CA: O’Reilly Media Inc.

 

[5]   InfoSphere BigInsights: Bringing the power of Hadoop to the Enterprise. IBM.com: Retrieved from: www.ibm.com/software/data/infosphere/biginsights

 

[6]   IBM Big Data and Information Management. Now known as IBM Watson Foundations. IBM.com: Retreived from: http://www-01.ibm.com/software/data/bigdata/

[7]    Katsnelson, Leon (Author). (2013). Welcome to the BigInsights Quick Start on the IM Demo Cloud. United States. Youtube.com

 

Additional Information

 

1)       Additional information on SQL-on-Hadoop at: http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?infotype=SA&subtype=WH&htmlfid=SWW14019USEN#loaded

 

2)       Big SQL 3.0 additional information: http://www-01.ibm.com/support/knowledgecenter/SSPT3X_3.0.0/com.ibm.swg.im.infosphere.biginsights.product.doc/doc/bi_sql_access.html

 

3)       For more information in JSqsh, consult the following documentation. (http://sourceforge.net/apps/mediawiki/jsqsh/index.php?title=Main_Page) and https://developer.ibm.com/hadoop/docs/853/

 

4)       Learn more about InfoSphere BigInsights at: www.ibm.com/software/data/infosphere/biginsights

 

5)       Learn more about IBM Big Data Platform: http://www-01.ibm.com/software/data/bigdata/

 

 

***

If you have any thoughts on this topic please share them on my Facebook Page or the db2Dean and Friends Community