db2Dean’s 2016 WoW Highlights
28 November 2016
This year at World of Watson (formerly Insight) I met the feather woman on her way to one of the after-hours events. She was a lot of fun! I also learned a lot of interesting things and got to attend some fun events, like the party at the top of the Mandalay Bay.
Again this year there was a ton of information at the conference, and I’ll give you a summary of what I learned along with some random tips and tricks. This year I concentrated on cloud and on-premises databases. If you attended the conference, you can download many of the presentations from the World of Watson Session Expert site. More have been added as time has gone on, so try again if something you wanted was not available on the site when you first got back. Whether you attended or not, you can watch replays of a number of the sessions through the WoW IBMGO site. I look forward to seeing you at the conference next year! Also, please add anything else you thought was really interesting to my db2Dean Facebook Page or to the “Message Board” section of my page.
A lot of work is going into Data Server Manager (DSM) and it is becoming the primary database monitoring, tuning and administration tool for DB2 and dashDB. It is much more stable than the old Optim Performance Manager and allows you to run SQL and do most administration functions using a browser interface instead of the IBM Data Studio client. At one of the sessions a customer who was managing about 250 DB2 databases described using it to measure resource consumption on the databases and charge the various user organizations for the cost of their usage. This can be done because DSM Enterprise Edition collects Workload Manager data among other measurements and allows you to use the repository to extract and/or report anything you like.
Here are some interesting DB2 BLU tips I got that really WoW’ed me!
· While committing individual inserts is slower for column-organized tables than for row-organized tables, the opposite can be true when many inserts are committed in a single transaction.
· A good rule of thumb for column-organized tables is to have the utility heap set to at least 1 million pages when doing initial table loads so that good compression dictionaries are built. This should be taken care of automatically when you set the DB2_WORKLOAD registry variable to ANALYTICS in DB2 BLU databases.
· When using DB2 BLU, you can save significant log space by committing many updates/inserts in a transaction rather than just a few at a time.
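To make these tips concrete, here is a minimal sketch of the settings discussed above, run from the DB2 command line. The database name, table names, and heap size are hypothetical examples for illustration, not recommendations for your system:

```sql
-- Set analytics-friendly defaults before creating a BLU database
-- (db2set is a CLP/shell command, shown as a comment here for context):
--   db2set DB2_WORKLOAD=ANALYTICS

-- If tuning manually, give the utility heap ~1 million pages for initial loads
-- so that good compression dictionaries are built (CLP command):
UPDATE DB CFG FOR mydb USING UTIL_HEAP_SZ 1000000;

-- Batch many inserts into one transaction to save log space on BLU tables
INSERT INTO sales_col SELECT * FROM sales_stage;
COMMIT;
```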
With DB2/LUW v11.1 released in June, there were several good DB2 sessions this year. I described many new features in my previous DB2 v11.1 article, but I’ll review a few of them here and add a few that I missed in that article.
· BLU can now be used with the Database Partitioning Feature (DPF).
· Introduction of the RANDOM distribution key on the create table statement in DPF environments. You can’t get collocated tables when using it, but it is an option when you just can’t eliminate bad skew any other way.
· Native columnar processing was extended for BLU tables, so several more types of joins, sorts, functions and other operations can now run on column-organized data without decompressing it.
· The encryption feature now allows you to store your database keys in a centralized key store along with all of your encryption keys from other software.
· There were internal efficiency improvements that will significantly speed some workloads such as improved bufferpool latching.
· Now allows reorg of partitioned tables.
· Databases running under AIX on POWER7 and POWER8 servers can now take advantage of hardware compression to compress backups and logs. This can significantly reduce CPU consumption and the time needed to compress these objects. The same capability is being investigated for Linux on Power systems, but is not there yet.
· Can upgrade directly from 9.7 to 11.1, and using an online backup before upgrading to 11 is now fully supported.
· pureScale installs have been simplified a lot.
· A new “dsmtop” command was added that improves on db2top.
· You can now use the OFFSET clause to skip the first several rows of a query result. It can be combined with FETCH FIRST to skip some number of rows and then return only some number of rows.
· You can now use the “create table AS” syntax to create a table based on a query, either creating it empty or creating it and loading it with the data from the query.
· The NOW keyword can be used in place of CURRENT DATE or CURRENT TIMESTAMP.
· The EXTRACT keyword in a query can now be used to select date elements like century, epoch, etc.
· Many, many new date and date calculation functions
· New statistical functions are available for you and the data scientist.
· You can now use the plus sign (+) to indicate an outer join even when not in Oracle Compatibility mode.
· New HASH functions
· WIDTH_BUCKET function to easily create histograms.
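A few of the new SQL features above can be sketched in one short example. This is an illustrative sketch against a hypothetical SALES table, not output from a real system; check the v11.1 SQL Reference for the exact forms:

```sql
-- OFFSET: skip 10 rows, then FETCH FIRST returns the next 5
SELECT * FROM sales ORDER BY sale_date
  OFFSET 10 ROWS FETCH FIRST 5 ROWS ONLY;

-- Create (and optionally populate) a table from a query
CREATE TABLE big_sales AS
  (SELECT * FROM sales WHERE amount > 1000) WITH DATA;

-- NOW in place of CURRENT TIMESTAMP, and EXTRACT for date elements
SELECT NOW, EXTRACT(EPOCH FROM sale_date) FROM sales;

-- WIDTH_BUCKET assigns each amount to one of 10 equal buckets
-- between 0 and 5000, which makes building histograms easy
SELECT amount, WIDTH_BUCKET(amount, 0, 5000, 10) AS bucket FROM sales;
```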
In case you had not already heard, there are now two flavors of dashDB – Transactional and Analytics. Both are available in the public cloud, and the Analytics flavor is currently available in dashDB Local. dashDB Local now has parts of the DSM console embedded in the dashDB console.
You can now run your Spark programs locally on dashDB Local, and I anticipate that this will come to dashDB in the Bluemix cloud in the future. For single-node dashDB instances, there is one Spark engine on the server. For MPP deployments, there are Spark components on each node. This is a big deal if you want to move your analytics processing as close as possible to your data: instead of pulling huge amounts of data over the network from your database into your Spark cluster, your program can extract the data locally on each node of the database cluster and just send you the results. Further, on dashDB Local MPP it is easy to tell Spark which tables are partitioned, and then it knows which data is local to the Spark threads on each node, making MPP processing very fast. If you are unfamiliar with Apache Spark, it is a popular framework for developing and running analytical applications in a cluster of machines, and it typically does most of its processing in memory. The cluster can be on one machine or many. Once you have deployed your Spark application to the dashDB cluster, you can connect to it directly or access it through SQL using dashDB stored procedure calls that you implement.
Docker is being used to make it easy to place dashDB on many platforms, including your local servers and IBM-managed systems on Amazon Web Services (AWS). Docker is an interesting technology that allows you to install software like dashDB on a server; create the instance, database and other objects; and then copy that installation to a set of files called a “container”. You can then take that container, load it onto another server that has Docker installed, and have a running copy of dashDB very quickly. IBM creates the dashDB containers, and when a container is loaded onto a server, Docker determines the system resources available and allows dashDB to use all of them. When installing a dashDB container you give it the mount point where the database files will be put, so the data lives outside of the dashDB container. This makes it quite easy to upgrade dashDB or apply a fix: you just remove the old container, load the new container giving it the same mount point, and your new dashDB version is pointing to the same old database. Since Docker is not a VM, there is almost no performance degradation for the software in the container.
When you have IBM deploy dashDB, either through Bluemix or by purchasing it through other means, IBM loads the Docker container for you. If you purchase dashDB Local, you install Docker on your server in your data center or on a cloud server, and then use Docker to download and install the container. dashDB Local is dashDB for your private cloud. With dashDB Local you can install single-partition dashDB or MPP dashDB across several servers. There are documented minimum and recommended sizes for dashDB Local servers, and it is a good idea to consult them before acquiring a server for dashDB. At the conference I also saw an interesting example where a dashDB Local MPP system with lots of disk was used to replace a Hadoop system. This did not provide the typical in-memory performance of a normally configured dashDB database, because only a small portion of the data would fit into memory, but it still ran a lot faster than Hadoop.
dashDB Local MPP creates multiple database partitions in each container. HA is provided in this configuration by spreading the database partitions from a failed node across containers on the surviving nodes. Since the mount points for all of the database data must be in a clustered file system anyway, every node can see every mount point. You can also grow and shrink your MPP cluster by moving partitions: when you add a server or servers to the cluster and put dashDB containers on them, dashDB will move some partitions from each existing node to the new node(s).
dashDB Local allows you to issue special write-suspend and write-resume commands to the containers that tell them to stop writing to the file systems. This allows you to back up the database by making a snapshot copy of its file systems between the suspend and resume operations.
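Under the covers this maps to the DB2 write-suspend mechanism. Here is a hedged sketch of the flow using the plain DB2 commands; the exact dashDB Local container commands may be named differently:

```sql
-- Quiesce all writes to the database files
SET WRITE SUSPEND FOR DATABASE;

-- ...take the storage or file-system snapshot outside the database here...

-- Allow writes again once the snapshot completes
SET WRITE RESUME FOR DATABASE;
```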
I attended an overview session of non-relational databases by Larry Weber. One interesting point he made is that these data stores have become so cheap that it is frequently worthwhile to have multiple, fit-for-purpose copies of your data. For example, your primary store might be JSON in a document store, while you also keep some portion in a key-value store like Redis for caching for one application and another portion in Elasticsearch for a different application that needs full-text searching of the documents. The downside is that any time you have more than one copy of data, the copies can get out of sync. Larry categorized these databases into four broad groups that really helped me understand them better:
Key-value stores: These store exactly two columns of data: one is a unique key and the other is a value. They are optimized to very quickly look up a row based on the key. The value can be anything, including a list of values in a string, a BLOB, a single value, etc. Ex. Redis
Document stores: These store formatted, self-describing documents such as JSON or XML. Ex. Cloudant and MongoDB
Graph databases: These store data where the most important things are the relationships between objects and quickly finding information related to one object that may be multiple degrees of separation away. Social networking sites that need to quickly tell you the number of connections between you and the person you are searching for make extensive use of graph databases. These databases store objects (called nodes or vertices) and the relationships between them (called edges), along with properties of the nodes and edges. One example is IBM Graph.
Column stores: These store relational data in columns rather than rows. Traditional relational databases put full rows into a data page or block, and when one block is full of rows they start writing the next block. When you write a row to a column store, each column goes into a different page; when you write the next row, each column goes into the same page as the corresponding column of the previous row. This is a huge advantage when you are searching on one or a few columns of a wide table, because you don’t have to read any columns that are irrelevant to your search. An example is column-organized tables in DB2 BLU.
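In DB2 BLU, choosing columnar storage is just a clause on the CREATE TABLE statement. A minimal sketch, with a hypothetical table name:

```sql
-- Each column of this table is stored (and compressed) in its own set of pages
CREATE TABLE sales_col (
  sale_date  DATE,
  store_id   INTEGER,
  amount     DECIMAL(12,2)
) ORGANIZE BY COLUMN;

-- A scan that touches only AMOUNT reads no pages from the other columns
SELECT SUM(amount) FROM sales_col;
```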
There have been a number of useful Cloudant updates. Here are some of them that I found interesting:
· Cloudant Local now runs on Linux on Z.
· There is a free Cloudant Local Developer Edition. It allows you to work offline and replicate to your Cloudant cloud instance.
· Development has been working with the CouchDB project and has made the two very compatible, so you can move your Cloudant data into CouchDB anytime with little trouble.
· The Cloudant browser dashboard now has a feature that allows you to visualize your geo data on a map using Mapbox visualizations.
· The SDP process that replicates data to dashDB now allows you to be more selective about what fields you want to replicate.
· The Cloudant team has added quite a bit of new documentation and several how-to videos on ibm.com/cloudant.
I also attended a number of tools presentations to refresh my memory on things like merge backup, Advanced Recovery, and others.
Bluemix Lift has been updated since I last wrote about it in my article about moving data to the cloud. It is still recommended for copying up to about 25 TB to the cloud, but there is now a cloud interface with a local agent. The local agent is called Slingshot.
I hope that you found at least a few pieces of information in this article to be new and useful. I hope to see you at the conference next year.