My view is that technically Big Data is not a tool to solve a problem, or a solution, but it is the problem to solve. It is the raw material potentially containing hidden gems. These gems are what we are after, and distilling the data into something not so big is what we are trying to do.
Big Data simply means: “a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications” (Wikipedia). In other words, it is a large amount of complicated and interconnected data that you have the need or desire to refine into usable and manageable information.
Saying that you want to “leverage big data” is like saying that you want to leverage raw material collected from a gold mine. What you want to do is to extract the gold from the raw material, and use the gold for some purpose. The rawness of the material is not the feature of the material. The feature is the gold in the material. The rawness is the problem.
This might seem like a semantic argument, but I think it is more than semantics. The term and its popularity imply that we still don’t have tools generally and easily available to manage the amount of data that businesses deal with.
We tend to collect more data than we can deeply analyze at low cost, which makes it difficult to study the data and find the value in it. Without tools to easily analyze large data sets, it is difficult to find patterns. To study patterns you need to be able to observe the data both from a high-level bird’s-eye view and at the detailed level. You need to be able to zoom in and out at will.
With big data this analysis gets difficult. You are required to sample the data and study manageable subsets of it. Then you have to identify patterns in the subsets and form your theories. Then you have to build tools to confirm or debunk the theories on increasingly larger data sets. Once you have a good, solid theory, you need to build tools to apply it and distill the data into usable gems.
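The sample-then-scale workflow above can be sketched in a few lines. This is a hypothetical illustration, not any particular tool: the data, the "theory", and the function names (`theory_holds`, `full_data`) are all made up for the example.

```python
import random

# Stand-in for "big data": one million measurements we cannot eyeball directly.
random.seed(42)
full_data = [random.gauss(100, 15) for _ in range(1_000_000)]

def theory_holds(sample, threshold=5):
    # A hypothetical theory formed by studying a small subset:
    # "the mean of the measurements is close to 100".
    mean = sum(sample) / len(sample)
    return abs(mean - 100) < threshold

# Confirm or debunk the theory on increasingly larger samples
# before committing to an expensive pass over the full data set.
results = {}
for size in (1_000, 10_000, 100_000):
    sample = random.sample(full_data, size)
    results[size] = theory_holds(sample)
    print(size, results[size])
```

Only once the theory survives the larger samples does it make sense to build the heavier machinery that runs it over everything.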
This approach still requires a lot of hand-building of massively parallel systems. Tools have been around for a while to do this kind of work. For example, MapReduce – Google’s named implementation of a programming model for processing large data sets – is one of them.
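To make the model concrete, here is a minimal single-process sketch of the MapReduce idea, using word counting as the canonical example. Real MapReduce frameworks distribute these phases across many machines; the function names `map_phase`, `shuffle`, and `reduce_phase` here are illustrative, not part of any framework's API.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit (key, value) pairs -- here, (word, 1) for every word seen.
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's values into a single result.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data is the problem", "gold is the goal"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)
```

The point of the model is that the map and reduce steps are independent per key, so a framework can run them in parallel over thousands of machines without the programmer hand-building the coordination.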
Once you have the software to run, Amazon AWS today gives you the storage space and computing power necessary to spin up hundreds or even thousands of servers for the duration of the data processing task, at a fraction of the traditional cost. When the processing is done, you release the hardware, which instantly stops costing you money.
So there are some tools. But all of these tools are like nuts and bolts. They are in fact fantastically powerful and at the base of the current technological evolution. However, we live in a world where we have nuts and bolts and are already in need of cars and airplanes. Demand drives the resources, and thousands of people are attacking the problem from various angles.
So, all of this to bring one fundamental message: when you speak about Big Data, keep in mind that the term is always going to be defined in terms of the problem, not the solution. It is great to have lots of data. However, how you use the data is going to determine its value, not the raw data itself.