Data Warehouse Augmentation

Using Big Data

Dean Compher

30 June, 2014

 

 

More and more data is becoming available to organizations, and this trend just keeps accelerating.  This means that every year the proportion of data that you are analyzing becomes smaller and smaller in relation to the amount of data available.  This is even true for organizations with very large and growing enterprise data warehouses, because traditional warehouses are not well suited to be a reservoir for massive amounts of semi-structured, unstructured and archived data that has a low value per byte.  This shrinking proportion of data analyzed causes a number of problems, including the warehouse holding a less and less representative sample of your data, which makes analysis results less relevant over time.  Some organizations are taking steps to mitigate this and will have a real advantage in the marketplace.  The good news is that there are ways to augment your existing warehouse at reasonable cost so you can continue analyzing a large portion of the data available to you.

 

In our previous article, Big Data Overview, Carl Douglass and I gave an overview of the various tools that are available from IBM and noted that different tools or sets of tools will be appropriate in different situations.  We do not advocate ripping and replacing – instead we recommend keeping the tools that are working well for the tasks they already handle and adding new tools as needed to store and analyze your data.  For the data warehouse augmentation use case, we will primarily discuss the tools available with Big Insights (including Hadoop), along with Netezza and DB2 BLU for deep analytics.  This article provides a number of examples and use cases, but it does not show all possible uses of these tools.  Rather, it is intended to get you thinking about how you could use them in your organization to analyze more data.

 

In general your existing data warehouse is doing a fine job of financial reporting and other types of analysis.  The data in it has a relatively high value per byte because it has been summarized, transformed, cleansed, indexed and put into structures that do a good job for what they were designed to do.  All of that effort, along with the nature of the hardware the warehouse runs on, also gives the data a relatively high cost per byte.  This is the biggest reason that you would not throw a lot of text files, Facebook, Twitter and blog feeds, machine-generated data and other new sources of data into the warehouse.  That stuff has a low value per byte and would simply be too expensive to keep in your well-oiled data warehouse.  You can think of this low-density data from an interesting perspective:  I once took a tour of the enormous copper mine near my home.  On that tour they said that they get about 6 pounds of copper for every ton of ore they extract.  Even though the ore has a low value per unit, their ability to extract the valuable parts efficiently from an enormous amount of ore allows them to make a lot of money.  Social media and other newer types of data are like that --  the valuable parts are diffuse within the load of data that you have to process, but the end product can be quite valuable when extracted efficiently.

 

This analogy carries into the data warehouse augmentation use case.  Big Insights is a great place to land the enormous amounts of data that you can get.  It provides a low-cost place to put your data while you figure out what you’ve got, analyze it in place, and concentrate it before sending it to your warehouse.

 

Explore the Data

 

The first thing that you will frequently want to do is to determine what is in the data you are getting and what is valuable.  For example, you may want to see if social media has relevant data regarding what people are saying about your organization, products, services, etc.  Big Insights is a great place to put it because you can economically store the huge volumes of data you need and also evaluate it to determine whether it can tell you anything interesting.  Most of the useful information in social media is quite diffuse, so you need to get a large sample to evaluate.

 

The first thing you will want to do after you have landed the data in your Big Insights Hadoop cluster is to see what format it is in and what sort of data it contains.  Many social media feeds like Twitter give you the data in JSON format.  Blog and other types of feeds will likely be in some other format.  Fortunately, Big Insights provides lots of line readers to view the data.  It has readers for JSON, CSV, XML and many other formats.  With these readers you can easily explore your files in various types of views, including tabular views and raw views.  We even provide a GUI browser so that you can look around in Big Insights and browse the directory structure and files much as you would with Windows Explorer on Windows.  This is useful for getting a feel for what is available.
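Sometimes it also helps to pull a small sample file out of the cluster and poke at it by hand before committing to any particular tooling.  The short Python sketch below is one way to do that for a feed that arrives as newline-delimited JSON; the file path and field names are just placeholders for illustration, not part of any Big Insights interface.

    import json

    sample_path = "sample/twitter_feed_part-00000.json"   # hypothetical local sample file

    with open(sample_path, "r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= 5:                                     # only peek at the first few records
                break
            record = json.loads(line)
            print(sorted(record.keys()))                   # which fields does each record carry?
            print(record.get("text", "")[:80])             # a glimpse of the tweet text itself

A quick look like this tells you which fields are worth keeping before you invest in any deeper analysis.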

 

Once you get a feel for what you’ve got, you can start to do more analysis right there on your Big Insights console.  Now you can click on the interesting files and view them in Big Sheets.  Big Sheets lets you see the data in a spreadsheet-like format, sort it, and run spreadsheet functions like sum, count, average, etc.  It works efficiently on extremely large data sets without you ever having to know what Map/Reduce code is.  You can even join multiple related files.  When you get the data the way you like it, you can save it in Big Insights in a delimited format or export it for use in your favorite spreadsheet program.
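To make the join idea concrete, here is a rough Python equivalent of what Big Sheets does for you through its spreadsheet interface: combine two related files on a shared key and write the result back out in a delimited format.  The file and column names are assumptions made up for the example.

    import csv

    # Index the lookup file by its key column.
    products = {}
    with open("products.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            products[row["product_id"]] = row["product_name"]

    # Join each mention to its product name and save a delimited result.
    with open("mentions.csv", newline="", encoding="utf-8") as src, \
         open("mentions_joined.csv", "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(["product_name", "mention_text"])
        for row in csv.DictReader(src):
            writer.writerow([products.get(row["product_id"], "unknown"), row["text"]])

Of course, the whole point of Big Sheets is that it does this over very large files in the cluster without you writing any code at all.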

 

Another very powerful tool is Watson Explorer.  You can point it at your Big Insights Hadoop cluster and let it index and correlate the data across the files in the system.  This indexing and correlation is done very intelligently.  Once it has indexed everything, you can run searches to see what interesting items exist across an array of disparate sources and how they are connected.  It then allows you to drill down into individual feeds.  You may find relationships among your data that you had no idea existed.  You can then quite easily develop applications in Watson Explorer to display this data to the masses.  As new types of data appear in Big Insights, Watson Explorer will continue to find, index and correlate data for your application or search.

 

Analyze the Data

 

Once you have determined what sort of data is available, you will likely want to create analysis reports and dashboards on it.  Many reporting tools prefer an SQL interface to the data.  That is fine, because Big Insights has tools to help you easily put your data into Hive and other formats that allow the Big SQL tool in Big Insights to process queries for you.  That’s right!  You can query extremely large data sets in Big Insights using the SQL skills you already have.  Behind the scenes these queries are converted into Map/Reduce code for you to provide adequate performance.  While there are other SQL interfaces to Hadoop, none provides the robust analytical SQL features, including advanced OLAP, that Big SQL does, and it is the only one that currently comes close to ANSI SQL:2003 compliance.  Like many of our Information Management tools, Cognos is designed to work well with Hadoop and lets you create reports, dashboards and other analyses.  We will discuss Big SQL further in an upcoming article.
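To give a feel for what that looks like from an application, here is a small sketch of running an ordinary analytical query against a Big SQL table from Python.  It assumes the DB2 client driver (ibm_db) can reach the Big SQL head node, which is one common path; the host, port, schema, table and column names are all placeholders.

    import ibm_db_dbi

    # Connection details are placeholders; substitute your own Big SQL head node.
    conn = ibm_db_dbi.connect(
        "DATABASE=bigsql;HOSTNAME=bihead.example.com;PORT=51000;"
        "PROTOCOL=TCPIP;UID=biadmin;PWD=secret", "", "")

    cur = conn.cursor()
    # An ordinary SQL aggregation; Big SQL plans the work over the Hive data for you.
    cur.execute("""
        SELECT product, COUNT(*) AS mentions
        FROM   social.tweets
        WHERE  posted_date >= '2014-01-01'
        GROUP  BY product
        ORDER  BY mentions DESC
    """)
    for product, mentions in cur.fetchall():
        print(product, mentions)
    cur.close()
    conn.close()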

 

Also, after you see what sort of data you are getting and have started reporting on it, you may decide that you would like to do some preprocessing to enhance your reporting.  For example, you may realize that the social media feeds you are getting contain information about what people are saying about products you sell or services you provide.  Further, this social media data arrives as a continuous stream from the service to which you subscribe, so you decide that you would like a real-time dashboard to see where sentiment is trending.  Therefore, you decide to use InfoSphere Streams to receive the data stream from your provider and use its built-in text analytics features to determine which records talk about your product (buzz) and whether the buzz is positive, negative or neutral.  InfoSphere Streams can add those indicators to the records and put them directly into a file in Big Insights so that your Cognos dashboard can show the trends as they develop.  In this way you can get very fast feedback about actions you are taking.  Many other tools and accelerators are available for processing TCP/IP data, video, audio, text and various types of machine data.  Big Insights has these same accelerators available in the system, so if you prefer you could land the raw data first and then augment it with the sentiment.
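Streams jobs are actually written in SPL with the text analytics toolkits rather than in a general-purpose language, so the sketch below is only a Python illustration of the flow just described: take each incoming record, decide whether it mentions your product and whether the tone is positive or negative, and append the enriched record to a file the dashboard can read.  The word lists and field names are stand-ins, not the real text analytics.

    import json

    POSITIVE = {"love", "great", "fast", "recommend"}    # toy lexicons; the real
    NEGATIVE = {"hate", "broken", "slow", "refund"}      # analytics are far richer

    def tag_sentiment(record):
        words = set(record.get("text", "").lower().split())
        record["buzz"] = "mywidget" in words             # assumed product keyword
        pos, neg = len(words & POSITIVE), len(words & NEGATIVE)
        record["sentiment"] = ("positive" if pos > neg
                               else "negative" if neg > pos else "neutral")
        return record

    # Tag each incoming record and append it to the file the dashboard reads.
    with open("incoming_sample.json") as src, open("tagged_feed.json", "a") as out:
        for line in src:
            out.write(json.dumps(tag_sentiment(json.loads(line))) + "\n")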

 

Now let’s take this example further.  The sentiment analysis is so useful that you decide to create an analysis report showing revenue vs. sentiment.  However, the sales and cost data exists in your relational data warehouse.  A number of tools are available, so you have the choice of pointing Cognos at both data sources to do the reporting, using the built-in federation feature of Big Insights, copying the sales data from the database to Big Insights, or copying the sentiment data from Big Insights to the database.  Many factors go into this decision, but let’s say you decide to copy the sentiment data into your warehouse.

 

Data Movement

 

You want to put the data in your warehouse, but you do not want the raw sentiment data.  You really want the sentiment data summarized for each product by month and by region, and you want your rigorous data quality and master data policies applied just as they are for other data put into your warehouse.  Further, even at this summary level there are massive amounts of data to move into your warehouse, and you need it there quickly.  A great tool for this is the extract, transform and load (ETL) capability of InfoSphere Information Server.  It already has best-in-class parallel processing capabilities to move and transform data at extreme speeds, and now it works natively with Hadoop in Big Insights.  Just like the connectors that allow it to talk natively to relational databases without you having to write SQL, it now has connectors that let you point at files in HDFS or GPFS file systems and read them in parallel, exploiting one of the major benefits of Hadoop.  It does this by generating the Map/Reduce code needed to read the files in parallel while doing the data movement and transformations in parallel.  Once the sentiment data is in your warehouse it is easy to join it to the other data residing there.
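Information Server builds this kind of job graphically and runs it in its own parallel engine, so the Python below is only meant to make the shape of the summary concrete: roll the tagged sentiment records up to one row per product, month and region before loading them into the warehouse.  The field names are assumptions carried over from the earlier sketch.

    import json
    from collections import Counter, defaultdict

    counts = defaultdict(Counter)

    with open("tagged_feed.json") as f:
        for line in f:
            r = json.loads(line)
            # Group on product, calendar month (YYYY-MM) and region; all assumed fields.
            key = (r["product"], r["posted_date"][:7], r["region"])
            counts[key][r["sentiment"]] += 1

    # One summary row per product/month/region, ready to be loaded into the warehouse.
    for (product, month, region), c in sorted(counts.items()):
        print(product, month, region, c["positive"], c["negative"], c["neutral"])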

 

InfoSphere Information Server could also have been used to copy sales data from the warehouse into Big Insights, doing any summarization or other transformation along the way just as seamlessly.  As another alternative, you could have modified your InfoSphere Streams jobs to drop the sentiment data into the warehouse at the same time it was being placed into Big Insights.  I point this out to show that there are many tools and combinations of tools available with Big Insights that let you store and process data in whatever way makes the most sense for you.

 

Deep Analysis

 

The next area of data warehouse augmentation, especially in the case of an Enterprise Data Warehouse (EDW), is to use PureData System for Analytics (Netezza) or DB2 BLU for deep analytics.  Many EDWs are humming along quite nicely, doing the work they were built to do and meeting all the service level agreements (SLAs) for getting reports out on time.  However, they frequently have two major deficiencies.  One is that they do not allow big, ad hoc queries, because those can make performance unpredictable and cause other standard queries to miss their SLAs.  The other is that the rigorous change control and testing procedures involved in getting new analysis into the EDW can cause significant delays in getting new queries running, thus delaying possibly important insights for the business.

 

While some of this deep analysis is a good fit for Big Insights, many types of analysis, including gathering statistics and segmenting data, are best run in a relational data warehouse, and the best of these are Netezza and DB2 BLU.  These environments can be created just for the deep or ad hoc analysis the business needs.  They are great platforms for loading data in nearly any schema and then running complex queries, and they are extremely fast due to their respective architectures.  Further, because they do not need indexes, materialized views and the like, they are a great place to try out new data: very little engineering of data structures is needed to get very good performance.  You can just load data in whatever schema you have and get fast query results.

 

Archive for Query

 

Frequently, as data ages in the EDW or other data warehouses it becomes less and less valuable, until it reaches a value per byte where it is no longer worth keeping in the data warehouse.  However, many organizations keep this data in the warehouse because they occasionally need it for historical trending, or simply worry that it might be needed for some reason.  Optim Archiving is a great solution for this.  It allows you to pull relationally intact data from the warehouse and put it into Big Insights in a format that can be queried.  When you do this you can keep data much longer at an economical cost while still allowing any analysis needed.  And with the ability of Cognos and other reporting tools to federate across data sources, you can still use all of the data to get a complete picture over time.

 

Further, there are also other tools to view data from many sources at once without moving it.  With an expanded license of Watson Explorer you can see data not just from Big Insights, but from your warehouses, transactional databases, internal and external web sites and many other data sources, all in one place.  This makes for an easy way to get a 360-degree view of customers.  If you need a relational interface to your varied data sources, Big Insights now has a federation feature.  With one SQL statement you connect to what appears to be a single relational database and join many different tables and file types from Big Insights files, DB2, Oracle, SQL Server and many other databases and data sources.  While your query just thinks it is selecting data from views in a database, the federation feature gets the data from the individual sources behind the scenes.  It does this quite efficiently because it collects statistics about the underlying files and tables, such as row counts and available indexes, and uses that information to optimize the queries.
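From the application’s point of view federation really is just one SQL statement, so a sketch may help.  The query below, submitted over the same kind of Big SQL connection shown earlier, joins a table that lives in Big Insights to a nickname that points at a warehouse table.  Every object name in it is a placeholder I made up for the example.

    import ibm_db_dbi

    # Same placeholder connection as in the earlier Big SQL sketch.
    conn = ibm_db_dbi.connect(
        "DATABASE=bigsql;HOSTNAME=bihead.example.com;PORT=51000;"
        "PROTOCOL=TCPIP;UID=biadmin;PWD=secret", "", "")

    cur = conn.cursor()
    cur.execute("""
        SELECT s.product, s.month, s.positive_mentions, w.revenue
        FROM   social.sentiment_summary s      -- stored in Big Insights
        JOIN   dw.monthly_revenue_nn w         -- nickname pointing at the warehouse
               ON w.product = s.product AND w.month = s.month
        ORDER  BY s.month, s.product
    """)
    for row in cur.fetchall():
        print(*row)
    cur.close()
    conn.close()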

 

 

***

 

While this article describes a number of ideas about how the data warehouse can be augmented or modernized with big data tooling and discusses a number of tools, it by no means covers all augmentation uses or tools.  Instead, my hope is to get you thinking about the possibilities and to let you combine these ideas in unique ways to make the best use of the data that is becoming available to you.  If you would like help in beginning your big data journey, please contact one of your local IBM Big Data specialists.  If you have any thoughts on this topic, please share them on my Facebook Page or in the db2Dean and Friends Community.

 

 
