Big Data Overview

Carl Douglass & Dean Compher

28 April 2014

 

 

What is big data?  Some say it is Hadoop or a data warehouse.  Some say it is the hoards of machine-generated sensor data or the feeds from Facebook, Twitter, blogs and other social media.  Still others say that it is collections of documents, e-mail, videos and other semi-structured and unstructured data.  We say that it is all of this and more, even though each organization will likely use only a subset of these.  We say that it is the ability to economically and securely store all the data you need, with the ability to analyze it in a timeframe when the results are still useful.  It is not a single platform; rather, it is having the infrastructure components that provide the needed storage and analysis tools.  So with this definition, big data is the set of storage platforms, data movement tools, and analysis tools that allow you to make the best use of all of the data that is relevant to you, when you need it.

 

One of the big reasons that we are having this discussion is the ever-growing amount and variety of data that is available to us.  Because many of these data sources are so new, many organizations are still trying to figure out how to use them.  However, the ones who are successful will be in a great position moving forward.  Those who figure out how to store all of this data but do not analyze it well, and those who have great analysis tools but do not have the data when they need it, will not be so well positioned.  In this article we will try to define what big data is and the tools needed to store and process that data.  The opinions expressed here are solely those of the authors and not of IBM.

 

In this first article we will define the categories of tools available and provide examples of them.  In following articles we will discuss use cases relevant to broad cross sections of organizations to show where these tools fit into different big data implementations.  From a high-level perspective there are four broad categories of tools:

 

  1. Data Platform
  2. Analysis
  3. Movement and Transformation
  4. Governance/Security/Privacy

Even though we have attempted to divide these tools into categories, there will be significant overlap. 

 

Data Platform

 

We do not think that big data storage is just Hadoop clusters, but we do believe that Hadoop will be a key piece of infrastructure for many big data platforms.  Big data also includes existing OLTP databases, operational data stores, data warehouses, content management systems and any other place you are keeping data.  Many of these databases are providing tremendous value and are the best tools for the specific jobs they are currently doing.  When new analysis and movement tools are used to combine existing data with newer data sources, such as social media feeds and other data hosted in Hadoop, whole new insights can be gained.  Think of what a flexible travel agency could do if it could correlate snow forecasts at ski destinations with social media sentiment among likely skiers and with historical information from its data warehouse about which discounts drive business to its web site.  With the right tools, the possibilities for combining data to produce unique information are endless.

 

IBM data platforms for Big Data include:

  InfoSphere BigInsights
  DB2
  Informix
  PureData System appliances (including PureData System for Hadoop)

In this big data strategy your IBM databases and other vendors’ databases have an important role.  The list above is not intended to indicate that you need to replace your non-IBM databases.  We will take a leap and assume that you are familiar with your existing database technology, so we will just expand on BigInsights here.

 

As noted earlier, it is critical that you have both appropriate places to store your vast amounts of data and tools to analyze that data.  InfoSphere BigInsights provides both, and that really sets it apart.  There are different editions of BigInsights that provide different sets of tools, but at a high level here are some of the important tools that it provides:

 

Hadoop

Includes a standard (not forked) copy of open source Hadoop that is ready for the enterprise, with features that make it more highly available and secure.  It is also easy to install and provides a number of tools to monitor and configure your cluster, including a management console.

GPFS

The Hadoop Distributed File System (HDFS) that comes with BigInsights and other Hadoop distributions is great, but it has some drawbacks, such as a single point of failure (the NameNode) and no support for standard file-processing commands like mv, cp and pwd.  With BigInsights you can instead use the IBM GPFS file system, which solves these problems.

Social Data Analytics Accelerator

Great tools for loading and analyzing social data from a variety of sources.  With these you can create applications to gauge customer sentiment and many other things.
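
As a toy sketch of the sentiment-scoring idea, here is a plain Java word-list scorer.  This is purely illustrative and hypothetical; the accelerator’s actual analytics are far more sophisticated than a word list.

  import java.util.Arrays;
  import java.util.HashSet;
  import java.util.Set;

  public class ToySentiment {
    private static final Set<String> POSITIVE =
        new HashSet<>(Arrays.asList("love", "great", "awesome", "happy"));
    private static final Set<String> NEGATIVE =
        new HashSet<>(Arrays.asList("hate", "terrible", "awful", "angry"));

    // A positive score suggests positive sentiment; negative suggests negative.
    static int score(String post) {
      int s = 0;
      for (String word : post.toLowerCase().split("\\W+")) {
        if (POSITIVE.contains(word)) s++;
        if (NEGATIVE.contains(word)) s--;
      }
      return s;
    }

    public static void main(String[] args) {
      System.out.println(score("I love the new ski deals, great prices!")); // 2
      System.out.println(score("Terrible service, I hate waiting."));       // -2
    }
  }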

Text Analytics

Gain insights from large amounts of textual data that is in documents, blogs, social feeds and other sources.

Integration Tools

Provides capabilities to integrate your Hadoop data with your other data stores in the enterprise.

Big SQL

Allows you to access data in many types of Hadoop files using SQL that many developers already know how to use.  Behind the scenes this feature creates map-reduce code to read the files.  It is generally SQL-92 compliant.
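
To make this concrete, here is a sketch of querying Hadoop data through Big SQL from ordinary Java code over JDBC.  The host, port, credentials and the sales table below are placeholders for illustration; consult the BigInsights documentation for the exact connection details of your edition.

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class BigSqlQuery {
    public static void main(String[] args) throws Exception {
      // Placeholder URL; the appropriate JDBC driver jar must be on the classpath.
      String url = "jdbc:db2://bigsql-host:51000/BIGSQL";
      try (Connection con = DriverManager.getConnection(url, "user", "password");
           Statement st = con.createStatement();
           ResultSet rs = st.executeQuery(
               "SELECT product_id, SUM(quantity) AS total " +
               "FROM sales GROUP BY product_id ORDER BY total DESC " +
               "FETCH FIRST 10 ROWS ONLY")) {
        // Iterate the top-10 result rows returned from data stored in Hadoop.
        while (rs.next()) {
          System.out.println(rs.getString("product_id") + " " + rs.getLong("total"));
        }
      }
    }
  }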

Big Sheets

Provides a spreadsheet-like interface that allows you to see your data in rows and columns and then organize, sum and run functions on your data like you are used to doing.  You can even export the data into other spreadsheet programs you may know better.

Security 

Sensitive data you copy from your other systems definitely needs to be protected, but what about data you collect from public sources?  If you collect personal information about people from public blogs, are you responsible for securing that too?  We suspect you are.  However, if you use BigInsights for your Hadoop cluster, then you can protect the data regardless of source and have one less headache.

Development Tools and Performance

There are a number of tools that generate MapReduce code for you, but BigInsights also provides tools to help you when you want to develop your own MapReduce code.  We also provide tooling to manage the MapReduce workload so that you get the most out of your system and make it run as fast as possible.

 

To make BigInsights even easier, IBM offers an appliance called PureData System for Hadoop that runs BigInsights with the hardware and software already tuned, configured and ready to load and analyze your data.  Currently it does not run all BigInsights features.

 

Analysis

 

Being able to analyze the data that you have is critical.  That is why IBM is investing heavily in this area as well.  Several analysis tools are included with the BigInsights offerings and you can see more about them in the list above.  A number of IBM analytic tools work across multiple data platforms allowing you to reduce the number of copies you have while still allowing you to maintain great performance on the source databases.  Some of the important analytics tools that perform analysis over many types of data include:

 


InfoSphere Streams

Streams is a platform designed to analyze data as it comes across the wire from many sources simultaneously.  It can send alerts or perform other actions when it sees patterns developing in the streaming data or finds out-of-bounds conditions.  It is highly scalable and allows new servers to be plugged in or removed as workloads change.  One example of where it is used is in a large, complex system like a jet that has thousands of sensors streaming terabytes of data, where you want to watch for patterns across multiple sensors that can indicate a bad outcome even if no single sensor is reading out of bounds.  Hadoop clusters and data warehouses are great Streams companions: data can be stored in them for long periods so that analysis tools can determine how various patterns correlate with good or bad outcomes, and Streams can then watch for those patterns as they develop in real time.
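
To illustrate the multi-sensor idea in miniature, here is a toy sketch in plain Java (not actual Streams code, which is normally written in SPL).  It raises an alert when temperature and vibration readings trend upward together, even though neither value crosses an individual limit; all names and thresholds are hypothetical.

  import java.util.ArrayDeque;
  import java.util.Deque;

  public class MultiSensorWatch {
    private static final int WINDOW = 5;
    private final Deque<Double> temp = new ArrayDeque<>();
    private final Deque<Double> vibration = new ArrayDeque<>();

    // Feed one reading per sensor; returns true when the joint pattern appears.
    public boolean onReading(double t, double v) {
      push(temp, t);
      push(vibration, v);
      return temp.size() == WINDOW && rising(temp) && rising(vibration);
    }

    // Keep a sliding window of the most recent WINDOW readings.
    private static void push(Deque<Double> w, double x) {
      if (w.size() == WINDOW) w.removeFirst();
      w.addLast(x);
    }

    // True when the window is strictly increasing from oldest to newest.
    private static boolean rising(Deque<Double> w) {
      double prev = Double.NEGATIVE_INFINITY;
      for (double x : w) {
        if (x <= prev) return false;
        prev = x;
      }
      return true;
    }

    public static void main(String[] args) {
      MultiSensorWatch watch = new MultiSensorWatch();
      double[][] readings = {{90, 0.10}, {91, 0.11}, {92, 0.13}, {94, 0.16}, {97, 0.20}};
      for (double[] r : readings) {
        if (watch.onReading(r[0], r[1])) {
          System.out.println("ALERT: correlated upward trend on temp + vibration");
        }
      }
    }
  }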

Cognos

Cognos is a business analytics platform that can analyze data from multiple sources including OLTP databases, Hadoop, data warehouses, and others.  It can create reports, dashboards and scorecards, and produce advanced visualizations.  There are several editions to meet a variety of analytics needs.

InfoSphere Data Explorer

Allows you to search, navigate, and discover data from all of your data sources and even data sources on the internet.  It can find data in relational databases, Hadoop, data warehouses, files, intranet sites, content management systems and others.  You can have it index this data as it searches to provide a fast search engine for your team.  It has a development framework that lets you quickly create applications to display the data in ways that are useful to your organization, with links you can click to expand documents, websites and other data so you can see the data in its original source.  It allows you to make connections between sources that you did not even know had related data and to show all hits for a search in one window.  This is also useful when trying to determine all relevant data sources for your new data warehouse or Hadoop cluster.

SPSS

SPSS uses advanced data capture and statistical analysis to help you gather people’s attitudes and preferences or predict trends and forecasts.  It does this with an intuitive visual interface that helps you make better decisions.  Like the other analysis tools it works across a wide variety of your big data sources, and it can be used with InfoSphere Streams.

BigInsights

Please see the analysis tools described in the BigInsights list earlier in this article.

 

 

Movement and Transformation

 

Data used for analysis can be stored in a wide variety of platforms.  We like to characterize the decision of where to put your data as “right placing your data,” according to cost and operational constraints.  You would not want to place business-critical data on a brand-new, unproven data platform, nor would you put data that is 5 years old and infrequently accessed in a data warehouse that uses only costly solid-state drives.

 

BigInsights can participate in “right placing your data” as a platform equal in importance to your Enterprise Data Warehouse (EDW) or Long Term Archive (LTA).  Whether BigInsights is used as a “landing platform” for storing your OLTP data before moving the high-value portion to the EDW, or as a short-term, low-cost queryable archive of data moved out of the higher-cost EDW, is your business decision.

 

No matter what your use case is for BigInsights or another Hadoop distribution, you still need a method to move data to and from Hadoop and among your other data platforms.  Data being moved to Hadoop should be governed just like every other data source in your enterprise (transformed, cleansed, deduplicated, archived, delivered, etc.).  In fact, a BigInsights or other Hadoop environment should be thought of as just one more repository of business data within your enterprise, subject to the same five enterprise-wide data integration requirements as any other data source or target:

 

  1. Flow data efficiently from any source to any target across the enterprise
  2. Unlimited data scalability wherever data integration processing occurs
  3. Codeless creation of data integration logic and jobs, reusable across the enterprise
  4. Metadata capture for lineage and impact analysis
  5. Operational and administrative management and control

 

Apache Hadoop coding operations such as MapReduce can satisfy requirements 1 and 2 (with the restriction that your Hadoop environment is sized to perform both load/transformation of data and analytic operations).  But a MapReduce job has to be manually coded, which can be a very time-consuming and complex operation.  Open-source tools such as Sqoop can satisfy requirement 3 for codeless operation, but not requirements 4 and 5.  Satisfying those last two requirements is what makes your data “trusted,” and it can only be done by a data movement and transformation tool that participates in Information Lifecycle Governance (ILG), an integral component of an overall enterprise information governance process.
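
To see why hand-coding MapReduce is labor-intensive, consider the word-count job below, essentially the classic example from the Apache Hadoop tutorials.  Even this trivial aggregation requires a mapper class, a reducer class and driver boilerplate; a realistic multi-source transformation job multiplies this many times over.

  import java.io.IOException;
  import java.util.StringTokenizer;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCount {

    public static class TokenizerMapper
        extends Mapper<Object, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      @Override
      public void map(Object key, Text value, Context context)
          throws IOException, InterruptedException {
        // Emit (word, 1) for every token in the input line.
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, ONE);
        }
      }
    }

    public static class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
      private final IntWritable result = new IntWritable();

      @Override
      public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        // Sum the counts for each distinct word.
        int sum = 0;
        for (IntWritable val : values) {
          sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
      }
    }

    public static void main(String[] args) throws Exception {
      Job job = Job.getInstance(new Configuration(), "word count");
      job.setJarByClass(WordCount.class);
      job.setMapperClass(TokenizerMapper.class);
      job.setCombinerClass(IntSumReducer.class);
      job.setReducerClass(IntSumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }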

 

A traditional Extract-Transform-Load (ETL) tool such as IBM’s InfoSphere Information Server can satisfy all 5 of these requirements for your BigInsights or other Hadoop environment.  Information Server is also a default ILG component of IBM’s Unified Governance Process.  Information Server can integrate Hadoop data with your existing Data Warehouse, traditional OLTP applications, packaged ERP applications, or any other data source/target.  Information Server has been upgraded to work with BigInsights and other Hadoop distributions as simply another data source or target within your enterprise.

 

All of Information Server’s native scalability, performance, and transformation capabilities that have been developed over the last 20+ years for performing ETL or Extract-Load-Transform (ELT) jobs with a traditional RDBMS, can also be used while moving data to and from Hadoop.

 

Information Server’s 80+ native and ODBC/JDBC data connectors can be used to easily move or copy data to Hadoop (including BigInsights) from a wide variety of data sources.  Information Server’s 100+ data manipulation/transformation “stages” can also be applied to the data being moved to and from Hadoop to ensure consistent data values (e.g., that all Social Security Number fields contain 9 digits), to calculate a field’s new value as derived from two or more fields, and even to concatenate multiple fields into a single field.
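
As a rough illustration of the kinds of row-level logic those stages perform, here is a plain Java sketch.  This is not Information Server code, and the field names are hypothetical:

  public class RowTransforms {

    // Ensure an SSN field contains exactly 9 digits, stripping any dashes.
    static String normalizeSsn(String raw) {
      String digits = raw.replaceAll("[^0-9]", "");
      if (digits.length() != 9) {
        throw new IllegalArgumentException("Invalid SSN: " + raw);
      }
      return digits;
    }

    // Derive a new field from two existing fields (here: a total price).
    static double derivedTotal(double unitPrice, int quantity) {
      return unitPrice * quantity;
    }

    // Concatenate multiple fields into a single field.
    static String fullName(String first, String last) {
      return first.trim() + " " + last.trim();
    }

    public static void main(String[] args) {
      System.out.println(normalizeSsn("123-45-6789"));    // 123456789
      System.out.println(derivedTotal(20.0, 3));          // 60.0
      System.out.println(fullName("Ada ", " Lovelace"));  // Ada Lovelace
    }
  }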

 

Information Server’s native parallel processing capability provides both scalability and high performance for a Hadoop data integration job.  This capability leverages our years of development to seamlessly process an Information Server ETL job in both the Information Server platform and the target Hadoop environment, with no user intervention required.  An Information Server developer simply designs the job, and Information Server determines which operations should be performed on the Information Server servers and which should be “pushed down” to the Hadoop cluster.

 

Codeless creation is a huge advantage for Information Server.  An ETL developer simply uses Information Server’s drag & drop GUI to visually design the job.  Manual creation of MapReduce code is eliminated.

 

For instance, would you rather hand-code a Pig job by writing screens full of script, or design the same job visually in Information Server?  [The original article showed screenshots here comparing a hand-coded Pig script with the equivalent Information Server job design.]

Governance of data stored in BigInsights and other Hadoop distributions is an absolute requirement in order to get business users to utilize the cluster.  Users have to feel confident that the data on which they base business decisions is accurate and complete.  They have to trust the data.  Information Server’s automatic capture of the metadata resulting from a Hadoop ETL job enables business users to visually track the data’s lineage from the data source, through any transformation and cleansing operations, to its delivery into Hadoop.  The business user will have “trusted data.”

 

Utilizing Information Server’s GUI management console provides a number of advantages, including much greater visibility of the ETL operations than MapReduce could provide, easy control of the ETL operations, and logging of security-related events for auditors and for external requirements such as SOX compliance.  Joint development and version control of ETL jobs are also enabled, which is a definite advantage over hand-coded Hadoop ETL.

 

Information Server ETL skills already being used to move data to your data warehouse or other targets can also be reused to build ETL jobs for Hadoop, providing higher ROI for the BigInsights or other Hadoop platform and faster delivery of results from the overall Hadoop environment.  The process for building a Hadoop ETL job is the same as building any other data integration job; a Hadoop stage is simply used to designate the target environment.

 

Industry analysts such as Gartner have even stated that Hadoop is not an ETL platform.  See the Gartner, Inc. report “Hadoop Is Not a Data Integration Solution” by Merv Adrian and Ted Friedman, January 29, 2013.

 

Now, if the primary component of your use case for a data movement/transformation job is to continually move data on a near real-time basis, data replication is the answer.  IBM’s Change Data Capture (CDC) can read changes from a data source and deliver them directly to BigInsights.  CDC can also move data to Information Server for eventual delivery to BigInsights, a process that also captures the data movement metadata, adheres to the perspective of governing your data, and enables you to trust your data.  CDC typically moves small batches of changed rows, which would not be moved efficiently by Hadoop MapReduce operations.
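
As a deliberately naive sketch of the replication idea, here is a plain Java poller.  This is hypothetical and not how CDC works internally (CDC captures changes from database recovery logs rather than polling tables); the connection URLs, credentials and table names are placeholders:

  import java.sql.*;

  public class NaiveChangePoller {
    public static void main(String[] args) throws Exception {
      try (Connection src = DriverManager.getConnection("jdbc:db2://src:50000/SALES", "user", "pw");
           Connection tgt = DriverManager.getConnection("jdbc:db2://tgt:50000/STAGE", "user", "pw")) {
        // Watermark: only rows changed after this point are picked up each pass.
        Timestamp lastSync = Timestamp.valueOf("2014-01-01 00:00:00");
        PreparedStatement read = src.prepareStatement(
            "SELECT id, amount, updated_at FROM orders WHERE updated_at > ?");
        PreparedStatement write = tgt.prepareStatement(
            "INSERT INTO orders_changes (id, amount, updated_at) VALUES (?, ?, ?)");
        while (true) {
          read.setTimestamp(1, lastSync);
          try (ResultSet rs = read.executeQuery()) {
            while (rs.next()) {
              // Deliver each changed row to the target, then advance the watermark.
              write.setLong(1, rs.getLong("id"));
              write.setBigDecimal(2, rs.getBigDecimal("amount"));
              Timestamp ts = rs.getTimestamp("updated_at");
              write.setTimestamp(3, ts);
              write.executeUpdate();
              if (ts.after(lastSync)) lastSync = ts;
            }
          }
          Thread.sleep(5000); // poll every few seconds for near real-time delivery
        }
      }
    }
  }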

 

Moving data out of Hadoop may also need to be performed.  For example, many organizations stage large volumes of incoming data in files on a server.  Hadoop can process different portions of extremely large files in parallel across many servers in the cluster, and Information Server’s ETL engine can apply its own parallel processing to those files in Hadoop, which makes for a fantastic combination when you need to get the data in those files out to your targets quickly.

 

Perhaps the Hadoop data has aged and no longer has any business value, or it may need to be deleted in order to reduce business risk (while still adhering to applicable governmental audit requirements).  ETL tools such as Information Server can be used to move data from BigInsights to an LTA.

 

BigInsights also comes with database connectors for native connectivity to DB2 and Informix and ODBC/JDBC connectivity to other databases.

 

 

Governance/Security/Privacy

 

You would never consider putting sensitive data about your employees, customers or your business in an unsecured database, but many organizations fail to consider security, privacy and other factors like data lifecycle when they start their Hadoop projects.  Even if you are pulling information that is publicly available, like government records or Facebook and Twitter feeds, you can end up with a lot of personally identifiable information.  If your Hadoop system gets hacked, is telling the public that the data there came from public sources going to get you off the hook?  We doubt it.

 

In addition to data security, there are other data governance activities that many organizations are using, such as Master Data Management (MDM), data quality and lifecycle management.  As you add pieces to your big data puzzle, you will want to include new data sources like Hadoop in these strategies.  For example, you will want your MDM system to have access to customer data in Hadoop.  And while Hadoop is a great place to put data archived from data warehouses, you do not want it to stay there forever.

 

To meet the challenge of the rapidly evolving sources of data, IBM governance, security and privacy tools are being updated to work with the various data platforms so that they can be full partners in your big data system. 

 

 

Optim Test Data Management (Data Privacy)

Optim Test Data Management already works with most vendors’ databases and several non-database sources to allow you to copy full sets or subsets of production data into test systems while obfuscating sensitive data.  It can now do this with Hadoop systems when copying data into those systems.  There are also components that allow you to obfuscate data already copied to Hadoop.
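
As a toy sketch of the obfuscation idea, here is a plain Java masker that deterministically maps a real SSN to a fake 9-digit value using a keyed hash, so that masked test data stays consistent across tables without exposing the real value.  This is hypothetical and not Optim’s actual algorithm:

  import java.nio.charset.StandardCharsets;
  import javax.crypto.Mac;
  import javax.crypto.spec.SecretKeySpec;

  public class ToyMasker {
    private final Mac mac;

    public ToyMasker(byte[] secretKey) throws Exception {
      mac = Mac.getInstance("HmacSHA256");
      mac.init(new SecretKeySpec(secretKey, "HmacSHA256"));
    }

    // Mask a 9-digit SSN into another 9-digit value, deterministically.
    public String maskSsn(String ssn) {
      byte[] digest = mac.doFinal(ssn.getBytes(StandardCharsets.UTF_8));
      long n = 0;
      for (int i = 0; i < 8; i++) {
        n = (n << 8) | (digest[i] & 0xFF);
      }
      n = Math.floorMod(n, 1_000_000_000L); // keep exactly 9 digits
      return String.format("%09d", n);
    }

    public static void main(String[] args) throws Exception {
      ToyMasker m = new ToyMasker("test-key-do-not-use".getBytes(StandardCharsets.UTF_8));
      System.out.println(m.maskSsn("123456789")); // same input always yields same masked output
    }
  }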

Optim Data Growth (Archiving)

As data ages, it typically loses value.  However, it tends to retain some value in aggregate for a long time if it can be accessed.  The Optim Data Growth solution allows you to archive data from many vendors’ databases and from non-database sources as well, while still allowing SQL access to it through ODBC and JDBC connections.  Optim can now put archive files in your Hadoop system.  If that Hadoop system is BigInsights, then you can use Big SQL to access them.  Further, as your big data strategy evolves, you may want to retire some systems and place their data in other big data platforms.  Optim archiving lets you keep a copy of the old system in the original schema at a low cost while it remains available to query.

Guardium Database Activity Monitoring

In addition to the many databases that Guardium can already monitor in an enterprise scale environment, it can also audit who is accessing what data in your Hadoop system. 

Master Data Management (MDM)

MDM solutions provide a single trusted view of your critical business objects including customers, suppliers and products whether you keep a single copy or manage data in place.  Hadoop is now a first class citizen for IBM MDM solutions. 

Data Quality

The Data Quality functionality is part of InfoSphere Information Server and is discussed in the Movement and Transformation section above because it is typically done as part of data movement.  It provides capabilities that let you cleanse data and monitor data quality, turning data into trusted information.  By analyzing, cleansing, monitoring and managing data, you can make better decisions and improve business process execution.

 

Other Big Data Articles in the Series

 

***

 

We hope that you found this article useful in your big data journey and that you agree that big data is not just one tool or technology, but instead a set of tools that allow you to use large, diverse sets of data to accomplish your goals.  Please add any thoughts that you have on this topic to the Facebook Page or the db2Dean and Friends Community and help the extended community.

 

 
