Combine Exotic Data Sources in dashDB
28 December 2016
One of the promises of cloud databases is ability to easily gather data from many sources and combine them into a data warehouse where you can easily search for patterns that can help you make decisions. Think about what you could do if you could combine data from your internal systems with weather and Twitter data. Might you be able to spot trends where weather conditions and social media posts about local events were highly correlated with higher or lower rates employee absenteeism? Or what about hospital admissions correlated to those sorts of trends? Now if you discovered those patterns, what could you do if you also put weather forecasts into that data warehouse? Might you be able to predict employee absenteeism or hospital admissions better? These are just a very few possibilities that could be explored. Your imagination is the only limit.
dashDB is a great platform for getting this data together for a number of reasons. First it has built-in utilities you can invoke from the console to get a number of sources of data, and with other Bluemix tools you can get additional sources as well. Here is a screen capture of the console load menu. As you can see it has the Load from Cloud option (for IBM and S3 Object Stores), Load Geospatial Data, Load Twitter Data and Load Open Data options. Open Data is publicly available data sets such as census data that is provided by governments and other organizations. For example, my home state of Utah provides a catalog of files available for download from https://opendata.utah.gov/. However, to get this data into my dashDB database, I don’t need to go to that site and down load it, I can just go to dashDB Open Data option and search for the type of data I want, choose the data set and tell dashDB to copy it into tables in dashDB with dashDB creating the table schema for me based on the data in the files. There are many, many sources of open data just waiting to be explored and analyzed with your internal data. The Load from Desktop feature has a nice option that will create a table based on the data in the file for you including naming the columns after the field headers if they are available and then load the data. It also allows you to load the data into existing tables.
In addition to the options in the dashDB load menu, you can get other sources like weather data quite easily. dashDB is one of the many services offered in IBM’s cloud services portal called Bluemix. One of the very useful Bluemix services is the Weather Company Data service that allows you to get historic weather information by location as well as weather forecasts. With just a little development using other services on the site you can get the weather data into your dashDB database. You can also use the Bluemix Data Connect service (formerly called DataWorks) to copy data from many different databases without doing any coding or having to create the tables yourself. Along with these services you can also load small to extremely large data sets from your workstation or servers using a variety of tools. Of course you can also use your traditional ETL and custom applications to connect to dashDB and load data too.
Once you get all the different types of data into your dashDB system, Bluemix has all sorts of services to help you find patterns in the data and make predictions using those patterns. Currently available services include Geospatial Analytics, IBM Watson Machine Learning and Decision Optimization. Of course you can connect your traditional software business intelligence tools like Cognos and SPSS. Any reporting or BI tool that can connect to DB2 can connect to dashDB.
Over the past few years various researchers have found that they can begin seeing flu outbreaks in localities in the United States about 1-2 weeks before the US Centers for Disease Control and Prevention’s reports based on clinical data show them occurring. The reason is that Twitter can provide a real time stream of what people are thinking a feeling. These researchers were able to filter that river of data to determine what sorts of tweets indicated that people were getting sick and watch for increases in those types of tweets. I find this stuff fascinating and if you do too you can see these articles from Johns Hopkins and the Washington Post or just Google “Twitter Flu Analysis”.
This research did not use dashDB. However, I want to illustrate few points about how you could use dashDB for similar types of analysis. They would have needed to store historical CDC or other clinical outbreak data where the flu outbreak patterns were known. Then they would need the twitter stream that proceeded it. Both of these sources of data would have needed to be analyzed to see if patterns in the tweet stream preceded flu outbreaks. Fortunately, the researchers were able to figure out what patterns of tweets developed that showed up in the CDC data that came out 1-2 weeks after the increase in particular patterns of tweet content. Now when that pattern begins to develop on the stream of Tweets of data from Twitter they know that the incidence of flu is increasing. If it is your job to stock a drug store or staff an emergency room, what could you do with this data? If you were doing this over, dashDB is a platform that could be used to consolidate both of those sources of data and then analyze it along with your own internal data. One advantage that you would have is that when dashDB is ingesting the Twitter data it enhances it with sentiment. That is, it adds a field that indicates if the text of the tweet seems to be positive, negative or neutral. That additional bit of information could be useful in determining which tweets should be kept.
One thing I found interesting in the articles that I read is that they talked about how the scientists spent a lot of time determining what sort of things people say might indicate that they are starting to feel sick. For example, saying “I’ve got the chills” is highly correlated to getting the flu, while saying, “We’re going to watch Netflix and chill” has quite a different meaning. The reason I note this is that it introduces bias. That is because when doing this you make assumptions about what types of things people say that is correlated to getting the flu. Because these were epidemiologists who do know what people talk about when they are getting sick that bias is not so bad. However, could there have been other things that no one would think of that increase or decrease in the Twitter stream as people are getting sick? I don’t know that those patterns exist, but another way to determine the patterns is to tell the analytics tool you are using that you see flu out breaks happening at these times in the historical CDC data and have the tool tell you what patterns are highly correlated to that in the twitter data starting 1 day before the outbreak, two days before, one week before, two weeks before and so on. This way you don’t have to hypnotize what sorts of things people may be saying, you let the system tell you. Several modern analyics tools allow to do this sort of thing and many of them work quite nicely with dashDB as the data store.
This is just one illustration of bringing data from two disparate data sources that could easily be put together in dashDB. CDC Open Data files and twitter data. The sky is the limit for the sources of data that can be put into dashDB with little effort. What data sources would you like to analyze? As I noted in my Different DB Platforms for Different Jobs article, on-premises data warehouses may not be a good candidate for this sort of thing because they are highly optimized for a different task, but cloud databases can be deployed to try things like this with little up-front cost.