db2Dean’s 2014 Insight Highlights
28 November 2014
This year at Insight I got my picture taken with the History Channel Counting Cars guys and chatted with them a bit. I even gave my flashing BLU button to Roli and if you look closely, you can see it on his belt! So we now know which in-memory, columnar database is his favorite!
Although I didn’t win the twitter contest to meet Gwen Stefani and the No Doubt members, I did get to see an awesome concert. Thank you everyone who retweeted me! Again this year there was tons of information at the conference and I’ll give you a summary of what I learned along with some random tips and tricks. Again this year I concentrated on Big Data, but still had time to get to some great DB2 sessions. If you attend the conference then you can download many of the presentations from the Conference Sessions Page. As time has gone on many more have been added, so try again if something you wanted was not available on the site when you first got back. I look forward to seeing you at the conference next year! Also please add anything else you though was really interesting to my db2Dean Facebook Page or to the “Message Board” section of my page.
Cloud, Database & Big Data
Much of IBM’s Database and big data technology is moving into the cloud. At the conference a number of new cloud offerings were announced like dashDB and existing cloud services were discussed like Bluemix and Cloudant. There are a number of reasons why this is being done. One of the largest reasons is the need to be able to develop and prototype applications quickly and cheaply without building a physical infrastructure yourself. With Bluemix, IBM has put many development tools including our own and open source tools on a platform along with a menu of database choices such as “dashDB Analytics Database” or “Time Series Database” that the user can choose and start using right away. Hadoop is another data store that can be chosen. With these you do not install or build anything, you just start building tables and adding data. There are also ETL tools (IBM DataWorks) that can move data between your cloud data bases or between your on-premise data stores and your cloud data stores. Also while the conference was taking place, IBM announced a partnership with Twitter and one of the big parts of that partnership is that tools for analyzing twitter data will be available in Bluemix that allow you to do the analysis without having to figure out how to get the data from twitter and store it. Many of the analysis tools such as text analytics that allow you to gage sentiment were already available and now you can use them on Twitter data by just choosing to do that. I’m not sure exactly when this will be available. Below, I will show a few of the data centric tools available, some of which are currently in beta.
IBM Bluemix is a platform that combines many development tools, run time tools like web servers and application server, along with database tools like dashDB for analytics and SQLDB and Cloudant for transaction processing. It even has a sophisticated set of tools for Extract/Transform/Load (ETL) that allow you to move data between data sources. To browse through the tools and databases available, use the services tab and scroll through the list of links on the left you will see the Data Management and Big Data tools available along with everything else.
dashDB is the analytics database you can implement as part of your application in Bluemix and from the Cloudant site. When selected, it instantiates a relational database using DB2 BLU technology, with R predictive modeling, SAS, ESRI and other tools built-in like Netezza. That way with many analytics applications you can do the processing local to the data rather than doing a large extract. It also provides a host of tools for ETL and other data manipulation needs. With it you can connect your favorite reporting tools such as Excel, Microstrategy or Cognos. It is a fully managed database so you don’t have to worry about tuning the server or instance, backup or other administration tasks. Also since this is our columnar, in-memory database technology you don’t need to worry about building complex objects like indexes and Materialized Query Tables since they are not needed. If you are using Cloudant as your transactional JSON database, there is even a tool to automatically copy your JSON data into your dashDB relational tables. I you are using it in Bluemix, you can use the DataWorks tools co copy and transform data from any on-premise or cloud source. It also has tools to allow you to easily import data from CSV, Excel and other data sources.
SQL Database and Time Series Database. These are implementations of DB2 for transactional processing and Informix optimized for time series data respectively. Several other databases including open source ones like MySQL are also available. You can browse the list by clicking the services tab and scrolling through the Big Data and Data Management sections on the left side. These databases are also fully managed for you.
Cloudant is a JSON NoSQL data store that can start small and grow extremely quickly as needs dictate. Cloudant has been a cloud only offering, but an on-premise was recently announced. Cloudant is available through Bluemix as well as through its own portal. On Bluemix it a feature is provided to mine the JSON data in Cloudant and automatically create a schema in dashDB and load the cloudant data into that schema.
Big Insights is IBM’s Hadoop based big data analytics system. It includes a non-forked Hadoop distribution along with tools that make it ready for the enterprise and comes with many analytics tools already built in like BigSQL. Through Bluemix it is available as a fully managed cloud offering. This way you can get into big data with a few clicks instead of having to purchase and configure a cluster of your own.
DataWorks provides a number of cleansing, load, ETL and transformation tools to keep data up-to-date in your cloud databases. They can also be used to move data between data sources whether they are in the cloud or not.
There are a number of IBM Cloud offerings and the list above only covers a few that are part of Bluemix. The links are ones that provide information about the topic and do not require you to sign on to any site. .
DB2 10.5 Cancun Release (FP4)
Fix pack 4 or as it is better known, the Cancun release, added several improvements to DB2/L:UW 10.5.
· BLU shadow tables were introduced that allow you to create column organized copies of your transactional row organized tables. This is a great feature for databases that have a mixed OLTP and reporting workload. No queries have to be changed to begin using this feature since the optimizer will use the BLU tables where appropriate instead of the row organized tables automatically even though the query only lists the row organized table. This is a similar behavior to the Materialized Query Tables (MQT) optimization. Under the covers DB2 uses CDC replication to keep the shadow table synchronized. This means that the shadow table does not need to have all of the columns that are in the primary table. It also means that the write to the shadow table is not part of the transaction when the primary table is updated, reducing the performance impact of the shadow table. Further some users have seen improvements to their OLTP updates because they were able to drop indexes on the primary tables that were only used for reporting.
· There were also a number of other improvements to the BLU feature:
o Better insert update, and delete performance
o Even better compression especially for char and varchar data
o Load is simpler since it can use Automatic Sampling
o Now allows MQTs on BLU tables
o HADR supported for BLU tables.
· Improvements were made to the pureScale feature. The biggest one is that you can now use regular Ethernet networks for the interconnect and no longer need an RDMA capable network. This feature is primarily intended to be used when you want to use pureScale for high availability clusters where updates are only fed to one node in the cluster. It is also great for off-loading reporting and other read-only workload to another node. Fix pack 4 also allows you to take incremental backups in pureScale.
· As of this fix pack all backups will be encrypted if you are using Guardium Data Encryption.
· DB2 10.5 with BLU turned on can now run in Oracle Compatibility mode.
Big Insights Updates
· While Big Insights lets you use the traditional Map/Reduce facility in Hadoop, it also provides a feature called Adaptive Map/Reduce that is much faster. It is C++ based rather running a JVM like traditional Map Reduce.
· It comes with Text Analytics built-in. In addition to other things these tools can put information gleaned from text into relational tables or XML formats to facilitate reporting and other activities that rely on formatted data.
· Big-R is a feature that builds several R features into Big Insights so that data can be analyzed by R in the Hadoop cluster rather than extracting it for analysis. This moves the processing to the data which is usually faster than moving the data to the processing.
· Geo-Spatial capabilities are now in beta that will allow that type of processing to be done local to the data.
· The BigSQL feature allows you to write queries that federate various databases with Hadoop and it also allows you to federate CSV, JSON and other file types as well.
· While some queries can be written using HIVE, BigSQL allows you to run any ANSI compliant SQL. As a matter of fact we ran all 99 TPC-DS queries against Big SQL successfully. It is also much faster.
· Big Insights is already on Bluemix and it will soon be available on IBM Softlayer.
· IBM Optim now supports HIVE.
There is now a Guardium Database Activity Monitoring.
I hope that you found at least a few pieces of information in this article to be new and useful. I hope to see you at the conference next year.