db2Dean’s 2015 Insight Highlights
27 November 2015
This year at Insight I didn’t get my photo taken with any celebrities or interestingly costumed people, but I did learn some interesting things and got to some fun events, like a party at the top of the Mandalay Bay where I took the photo below.
Again this year there was tons of information at the conference, and I’ll give you a summary of what I learned along with some random tips and tricks. This year I concentrated on Information Server, Big Data and Cloud. If you attended the conference, you can download many of the presentations from the Conference Sessions Page. As time has gone on, many more have been added, so try again if something you wanted was not available on the site when you first got back. I look forward to seeing you at the conference next year! Also, please add anything else you thought was really interesting to my db2Dean Facebook Page or to the “Message Board” section of my page.
Cloud Databases
A tremendous amount of development is being done for IBM cloud databases, and it just seems to be accelerating. Whether you need DB2, Informix or Cloudant (a JSON store), you can now deploy it in the cloud with a few clicks. You can deploy fully managed databases like dashDB or self-managed databases like DB2 on Cloud.
Fully managed databases like dashDB, SQLDB, Cloudant and TimeSeries Database are ones where you deploy the database with a few clicks and then just create your tables and load your data. You get an IP address and port, and you can connect your favorite client tools, such as Data Studio, your applications, or a report writer like Cognos. There are also web-based tools for creating tables and uploading data. You don’t tune any server or instance configurations, and all fix packs and OS patches are applied by IBM. As a matter of fact, the managed services are being developed with an Agile methodology, so new features and fixes are delivered about every 4 weeks as part of a “sprint”. This is great for developers of agile applications, or for trying out ideas for new data warehouses or data marts where you need a database in a hurry and don’t want to purchase any hardware.
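To make the connection step concrete, here is a minimal sketch of how a client would reach one of these managed databases. The hostname, database name, and credentials below are hypothetical placeholders; the real values come from the service’s connection page, and the `ibm_db` driver (if installed) would accept the resulting DSN string.

```python
# Build a DB2-style DSN string of the kind the ibm_db driver accepts.
# All connection values here are made-up placeholders for illustration.

def build_db2_dsn(host, port, database, user, password):
    """Assemble a keyword-style DB2 connection string."""
    return (
        f"DATABASE={database};"
        f"HOSTNAME={host};"
        f"PORT={port};"
        f"PROTOCOL=TCPIP;"
        f"UID={user};"
        f"PWD={password};"
    )

dsn = build_db2_dsn("dashdb-host.example.com", 50000, "BLUDB", "user1", "secret")

# With the driver installed, you would then open the connection:
# import ibm_db
# conn = ibm_db.connect(dsn, "", "")
print(dsn)
```

From there, your reporting tool or application issues SQL exactly as it would against an on-premises database.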
Self-managed databases like DB2 on Cloud are for those who want a cloud database but need to control when fix packs and other changes are introduced into the environment. Again, you can deploy one of five server sizes with a few clicks. Some choices are VMs on SoftLayer, and some are bare-metal servers that are not virtualized. These servers come with DB2 10.5 installed with a relatively new fix pack and the first instance already created. The server OS is tuned for DB2. From there you get the server IP and a user with sudo root access, and you do everything else. You can log in with SSH or another secure protocol and do all system administration and database administration. You can even drop the initial instance and create one or more of your own. You decide when to apply fix packs and OS patches, and you do them yourself when the time comes. Please see last year’s Insight article for a list of several of the cloud databases available.
You can deploy all of these cloud database services through the IBM Bluemix cloud. Bluemix is IBM’s cloud services site where you can deploy databases, big data sources and other services. It is also a robust development and runtime environment, so you can even develop and run applications using the tools in Bluemix that connect to your databases. You do not need to have any applications on premises if you don’t want to.
Information Server
With the explosion of data in most enterprises, the need to track and govern that data has become very important, so I attended a number of sessions on the information integration and governance tools in the Information Server portfolio. I describe some of the things I learned here.
In September of this year version 11.5, the latest version of InfoSphere Information Server (IIS), was released. One of the biggest new features is the ability to run natively on Hadoop clusters. IIS has had a grid implementation for many years, and that grid technology is one of the areas where it shines compared to its ETL and governance competitors. This same grid technology has now been implemented on Hadoop clusters. Not only can it run on the IBM BigInsights Hadoop distribution, but also on Cloudera, Hortonworks and others. This capability is great because you can land huge files on your Hadoop cluster, and while you are running analytics jobs against those files, IIS can take advantage of Hadoop’s distributed file system and do ETL on various blocks of the file in parallel. Further, for those who want to transform and cleanse the files and keep the output on Hadoop, it eliminates the need to bring the data across a network to the IIS server and back again. The transformations can be done in place, leaving you very quickly with the old and transformed versions of the file, and this is done across all nodes. IIS on Hadoop can also be managed by YARN.
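The grid engine itself is IIS’s own machinery, but the underlying idea, splitting a large file into blocks and transforming each block in parallel where it lives, can be sketched generically. This toy example fakes the “cleanse” step with a trivial transformation; the block size and worker count are arbitrary illustrative choices.

```python
# Generic illustration of block-parallel transformation of a file's lines,
# the pattern an engine colocated with Hadoop data blocks exploits.
from concurrent.futures import ThreadPoolExecutor

def transform_block(block):
    # Stand-in "cleanse" step: strip whitespace and normalize case.
    return [line.strip().upper() for line in block]

def parallel_transform(lines, block_size=2, workers=4):
    # Split the input into blocks, transform each block concurrently,
    # then flatten the per-block results back into one output.
    blocks = [lines[i:i + block_size] for i in range(0, len(lines), block_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        transformed = pool.map(transform_block, blocks)
    return [line for block in transformed for line in block]

result = parallel_transform(["  alpha ", "beta", " gamma"])
```

The payoff in the real product is that each block is processed on the node that already stores it, so nothing crosses the network.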
Under the Bluemix umbrella is a new service called DataWorks Forge that is still in beta. It allows for self-service data shaping and movement. It automatically profiles and classifies files as they are ingested. It also allows the user to easily enrich, join, filter and remove duplicates while putting the data into a cloud data target.
Another component of Information Server is Data Quality. It is part of the IIS Enterprise Edition, and as of v11.5 it is available on Hadoop clusters, where it is called BigQuality. IIS BigQuality has DataStage and Information Analyzer components. With this product you can do everything from evaluating the quality of your data for straightforward things, like ensuring data types and domains of columns, to verifying complex business rules. You can just report on those things and measure the quality of your data, or you can build them into your ETL process, improve data in place and even enforce quality rules. It also has a built-in database to record data quality statistics that are used for the included reports, or you can use it with your favorite reporting tool to build custom dashboards. In recent releases you can even create a workflow so that when new data elements are introduced or problems are detected, you can ensure that processes are followed and sign-offs are done.
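BigQuality’s rules are configured through the product itself, but the shape of a column-level quality check is easy to show in plain code. This hypothetical sketch scores a column against a domain rule, the kind of “ensure the domain of a column” check described above; the age data and 0–130 bound are made-up examples.

```python
# A toy column-level data quality check: score a column by the fraction
# of its values that satisfy a rule.
def check_column(values, rule):
    """Return the fraction of values that pass the rule (1.0 = perfect)."""
    passed = sum(1 for v in values if rule(v))
    return passed / len(values) if values else 1.0

# Hypothetical sample column with two out-of-domain values (-5 and 200).
ages = [34, 27, -5, 41, 200]

# Domain rule: an age must fall between 0 and 130.
score = check_column(ages, lambda v: 0 <= v <= 130)
```

A quality product records scores like this over time so you can trend them on a dashboard or gate an ETL job when the score drops.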
Text Analytics
One of the components of Big Data Analytics getting a lot of attention is text analytics. This is a fascinating area and you hear about it a lot, but what is it really? One way that text is analyzed is by first having algorithms “read” the text and record data about the meaning of that text in a tabular format, so that standard SQL-based report writers and analytics can use the data. For example, you might want to quantify how many people are talking about a particular topic and whether their sentiment is positive, negative or neutral. So in this case you would want to have the software “read” some text and put data into rows and columns. Here is a snippet of a tweet, which arrives in JSON format, and how it might get put into a tabular format:
"created_at": "Mon Nov 9 03:35:21 +0000 2015",
"text": "Yeah Right that Dean guy who writes about IBM Databases knows what he is talking about!",
"name": "Holly Day",
With the appropriate text analytics you can put the data into a tabular format such as the following:

Date                       Name        Buzz      Sentiment
Mon Nov 9 03:35:21 2015    Holly Day   db2Dean   Negative
Extracting the Date and Name is not all that exciting, because those fields are already broken out for you in JSON fields. On the other hand, determining that the tweet text is actually talking about db2Dean (buzz) and that Holly’s sentiment is actually negative (interpreting “Yeah Right” correctly) is where the actual text analytics is used. Once the data is in tabular format, it can be loaded into a database table where report writer applications like Cognos can issue SQL and generate reports as is done with any other tabular data. It can also just be put into delimited files where any number of applications can use it. The point is that the advanced technology part of this is determining what the text means and then putting it into tabular format where existing dashboards and reports can be run on it. While the above example is rather simple, the technology to interpret natural language with its idioms, sarcasm, and other complex forms of speech is not. In addition to social media feeds, text analytics can be applied to call center conversations, e-mail, Word documents, professional journals, books and many other sources. This is one area where technology from the IBM Watson labs is doing amazing research, and those tools are filtering into the IBM analytics tools, including BigInsights and Watson Analytics, which you can try out for free today.
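The flow above can be sketched end to end in a few lines: parse the tweet’s JSON, apply a rule to detect the topic and a sentiment cue, and emit a tabular row. The keyword rule here is only a stand-in; real text analytics recognizes sarcasm like “Yeah Right” with far more sophisticated models.

```python
import json

# Parse the tweet snippet from the article into a Python dict.
tweet = json.loads("""{
  "created_at": "Mon Nov 9 03:35:21 +0000 2015",
  "text": "Yeah Right that Dean guy who writes about IBM Databases knows what he is talking about!",
  "name": "Holly Day"
}""")

def to_row(t):
    """Turn one tweet into a (date, name, buzz, sentiment) tabular row."""
    text = t["text"].lower()
    # Naive sentiment cue: treat the sarcastic "yeah right" as negative.
    sentiment = "negative" if "yeah right" in text else "positive"
    # Naive topic detection: keyword match for the buzz column.
    buzz = "db2Dean" if "dean" in text else None
    return (t["created_at"], t["name"], buzz, sentiment)

row = to_row(tweet)
```

Each row produced this way can go straight into a delimited file or a database table for ordinary SQL reporting.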
So where does this translation of text into tabular happen? Well, it can be done in many places, and the choice is usually determined by what you want to do with the data. For example, if you had a lot of e-mail files that you needed to examine, you might drop them into your Hadoop system and periodically transform new data into tabular files. If you have IBM’s BigInsights Hadoop distribution, you can copy the files to the system and then configure the text tools that come with that product. Some of these tools are accessible enough to be configured by people in your organization, or you can hire experts to get you started. Once the data is in a tabular format, you could use it locally through applications on the Hadoop system, like the Big SQL product that puts a high-performance SQL front end on the data for use by reporting tools, or you can load the files into your favorite database like Informix or DB2. If you are not getting the data translated fast enough, you could call those text analytics functions under Spark and get the job done even quicker.
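Once the output is tabular, any SQL engine can report on it. In the scenarios above that engine would be Big SQL or DB2; the in-memory SQLite database below is only a stand-in so this sketch is self-contained, and the sample rows are invented for illustration.

```python
import sqlite3

# Load tabular text-analytics output into a table and query it with plain SQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE buzz (tweet_date TEXT, author TEXT, topic TEXT, sentiment TEXT)"
)

# Hypothetical rows produced by a text-analytics pass over a tweet feed.
rows = [
    ("2015-11-09", "Holly Day", "db2Dean", "negative"),
    ("2015-11-09", "Sunny Day", "db2Dean", "positive"),
    ("2015-11-10", "May Day",   "DB2",     "positive"),
]
conn.executemany("INSERT INTO buzz VALUES (?, ?, ?, ?)", rows)

# The kind of query a dashboard might issue: sentiment counts per topic.
counts = conn.execute(
    "SELECT topic, sentiment, COUNT(*) FROM buzz "
    "GROUP BY topic, sentiment ORDER BY topic, sentiment"
).fetchall()
```

A report writer like Cognos would issue essentially the same SQL against the real warehouse table.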
With social media feeds it is frequently important to see sentiment trending as it happens, such as when you are doing a new product launch. In this case the text analytics functions can be performed by InfoSphere Streams as the data is being received. You can arrange for feeds from Twitter and other social media providers to arrive in a continuous stream of data, and InfoSphere Streams can ingest that data, convert it to tabular form immediately, and then simultaneously feed it to your favorite analytics dashboard like Cognos, drop the raw data into Hadoop, and load the tabular data into a DB2 warehouse table. There are many other examples of how the text data can be processed, but this gives you a good start.
Letting the Data Guide You
I have heard a number of times that with big data you can let the data guide you instead of you guiding the data to an outcome, but I didn’t really get what this meant in a concrete way until I attended one of the sessions at Insight this year. This session was given by a research organization in the UK that used big data in an IBM BigInsights Hadoop cluster to create a heat map of England showing areas that are more or less likely to contain archeologically significant items. This was done for the construction industry because, in England, digging up something archeologically significant delays a project and adds cost: the company has to stop work and hire archeologists to evaluate the site. A map like this is useful because builders can lower the probability of finding something when they have the flexibility to dig in a different place, or budget more accurately for a find when they don’t.
One way to approach this problem is to reason that ancient people tended to build near ground water sources for drinking water or transportation, and assign higher probabilities to land near water sources than to other places. This is an example of you guiding the data to an outcome. Instead, this organization fed in lots of data from maps, PDF documents, data about finds of significant items from construction, and data about where previous archeology digs found and did NOT find things. Using text analytics tools, R, the Big R functions built into BigInsights, and other advanced technologies, they were able to let the data tell them where the high- and low-probability areas were. Because much of the data they loaded, including maps and text documents, indicated both sites where things were found and sites where they were not, the analytics tools could assign probabilities to specific places and to types of features, like water sources and high or low slopes, and then let those correlations draw the map.
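The project itself used BigInsights, R, and text analytics at scale; as a much-simplified sketch of “letting the data guide you”, the snippet below estimates the probability of a find for each terrain feature purely from counts of past finds and non-finds, rather than deciding up front which features matter. The dig records are invented for illustration.

```python
# Estimate P(find | feature) directly from historical dig outcomes,
# letting the data surface which features correlate with finds.
from collections import defaultdict

def find_probabilities(records):
    """records: (feature, was_found) pairs from past digs."""
    found = defaultdict(int)
    total = defaultdict(int)
    for feature, was_found in records:
        total[feature] += 1
        if was_found:
            found[feature] += 1
    return {f: found[f] / total[f] for f in total}

# Hypothetical dig history: outcomes near water and on steep slopes.
digs = [
    ("near_water", True), ("near_water", True), ("near_water", False),
    ("steep_slope", False), ("steep_slope", False), ("steep_slope", True),
]
probs = find_probabilities(digs)
```

Scaled up across many features and locations, per-feature probabilities like these are what get combined and painted onto a heat map.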
I hope that you found at least a few pieces of information in this article to be new and useful. I hope to see you at the conference next year.