July 27, 2015 | BY Dr. Kirk Borne
Big data flows from all channels in the modern technological world: social, mobile, networks, sales, machines, sensors, markets, etc. In fact, big data flows so abundantly that we choose water-themed metaphors to describe it: data lake, data flood, data tsunami, oceans of data, streaming data, and even the CD sea of data. As we navigate through these deep waters of data, we need to “mind the wheel”: that is, we need to use exploratory data analytics and advanced data mining methods to find our way from the uncharted seas of bytes to the safe shores of analytics success, which means faster, better, cheaper insights and knowledge discovery.
Let us change the wording of our metaphor from “mind the wheel” to “mine the wheel”, and specifically to “mine the big data wheel.” With this rewording, the goal of our data analytics activities is now explicitly expressed: to mine the data! Since data mining is KDD (Knowledge Discovery from Data), our goal is clear. Here are three ways that we can mine the big data wheel:
1. Data are created and emitted in prodigious quantities from large computational models running on high-performance computers. It is almost impossible for us to keep up with these output data streams. So, it is beneficial (perhaps, imperative) to mine the data as they pass from computational processor to data storage device. In other words, data have inertia – it is very difficult to get the data moving again after they have become stationary (on storage media); and conversely, the data have lots of power for knowledge discovery while they are moving through processors. Therefore, mine the data as they are moving, using embedded in-memory analytics algorithms as part of the computational modeling package. As the wheel of data turns within the modeling process, search for significant patterns, new trends, and anomalous behaviors in real time (not after the model has turned cold). In this way, you may also introduce an autonomous fast-response feedback loop into the model, to iterate, zoom in, or otherwise react to interesting emergent features in the massive streaming data outputs. The big data analytics processing capability of a Hadoop cluster in the cloud offers one approach to this “mining the big data wheel” use case.
2. Data will be collected and transmitted from billions and billions of sensors in the coming years as the Internet of Things (IoT) reaches full bloom. The IoT will “sensor” the world. Data will be streaming from ubiquitous devices, people, processes, supply chains, engines, manufacturing lines, networks (social, financial, computer), and so on. It will be essentially impossible to go back later and mine these data for emergent, anomalous, interesting, profitable, or adversarial patterns. So, again it is beneficial and imperative to mine this big data wheel as the IoT sensors are turning and churning out larger quantities of data than we could ever imagine. We will no longer be talking about petabytes, exabytes, or zettabytes of data, but we will be describing our seas of data with words like yottabytes, brontobytes, geopbytes, and maybe even the Epic Byte! Smart sensors will become the new hot commodity – sensors will be programmable decision engines, containing adjustable machine learning algorithms that learn autonomously what is and what is not interesting in the data that the sensor is collecting. Adaptable decision rules will then act on what the sensor has learned, triggering a coded reaction, message, or action. Mining the big data wheel from the IoT will transform the very meaning of autonomous machines and artificial intelligence in the coming years. The real-time streaming analytics capability of Spark Streaming offers an approach to this particular “mining the big data wheel” use case.
3. Finally, what about our existing massive data collections that already live in storage devices? These data have inertia too – they are very expensive to get moving again. In a case like this, a novel idea is to read all of the relevant (most important) data from the storage system into (and out of) memory in a cyclic manner – around the clock (like a turning wheel, perhaps daily or twice a day or whatever makes sense for your organization). As queries come into the system from business analysts, data miners, data scientists, processes, and other stakeholders, those queries wait in a queue. The queue injects the appropriate queries at the appropriate time into the system, to mine the requested data products at the moment that they pass from storage device into compute memory. If a new data mining request comes in later that needs those data again, then the request need only wait for the next time that those data pass through the memory (as part of the cyclic data-moving process). The specific chunks of data that flow around this big data wheel can be prioritized and filtered according to user and business needs. Consequently, as the wheel turns, data are mined systematically, while disk thrashing (excessive paging) is kept to a minimum. While all of this data cycling is taking place, it would also be an especially good time to task the “data mining wheel” with a schedule of ongoing data quality checks, information assurance compliance, and a variety of data mining (knowledge discovery) algorithms that find new patterns, trends, associations, correlations, and novelties in the data (i.e., hidden treasure within your seas of data). The SQL-on-Hadoop schema-on-read capabilities of Apache Drill offer a scalable low-latency approach to this “mining the big data wheel” use case.
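To make the third idea concrete, here is a minimal Python sketch of a turning data wheel. It is an illustration only, not how Apache Drill or any MapR product works: the `DataWheel` class, its method names, and the in-memory `storage` dictionary standing in for a real storage system are all assumptions made for this example. Each chunk is read once per turn, every queued query for that chunk runs against the shared copy, and standing checks (such as a data quality test) run on every pass.

```python
from typing import Callable, Dict, List, Tuple

Chunk = List[float]
Task = Callable[[Chunk], object]


class DataWheel:
    """Hypothetical cyclic scanner: one read per chunk per turn, shared by all tasks."""

    def __init__(self, storage: Dict[str, Chunk]) -> None:
        self.storage = storage  # stand-in for the real storage system
        self.pending: Dict[str, List[Tuple[str, Task]]] = {}  # one-off queries
        self.standing: List[Tuple[str, Task]] = []  # tasks run on every pass
        self.reads = 0  # counts chunk reads, to show they do not grow with queries

    def submit(self, name: str, chunk_id: str, query: Task) -> None:
        """Queue a query until its chunk next passes through memory."""
        self.pending.setdefault(chunk_id, []).append((name, query))

    def add_standing_check(self, name: str, check: Task) -> None:
        """Register a task (e.g. a data quality check) for every chunk, every turn."""
        self.standing.append((name, check))

    def turn(self) -> Dict[Tuple[str, str], object]:
        """One full rotation of the wheel; returns {(chunk_id, task_name): result}."""
        results: Dict[Tuple[str, str], object] = {}
        for chunk_id in sorted(self.storage):
            chunk = self.storage[chunk_id]  # the single read for this turn
            self.reads += 1
            waiting = self.pending.pop(chunk_id, [])  # queries served are dequeued
            for name, task in self.standing + waiting:
                results[(chunk_id, name)] = task(chunk)
        return results


# Usage: three queries and one standing check share two chunk reads per turn.
wheel = DataWheel({"sales": [3.0, 5.0, 4.0], "sensors": [1.0, -1.0, 9.0]})
wheel.add_standing_check("non_empty", lambda chunk: len(chunk) > 0)
wheel.submit("total", "sales", sum)
wheel.submit("peak", "sales", max)
wheel.submit("peak", "sensors", max)

first_turn = wheel.turn()   # serves the queued queries plus the standing check
second_turn = wheel.turn()  # queue is empty now, so only the standing check runs
```

The design choice to notice is that the number of reads depends on the number of chunks and turns, not on the number of queries. A late-arriving request simply waits for the next rotation rather than forcing its own trip to storage.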
So, if you are facing a flood of data from your processes, devices, sensors, social listening activities, and customer channels, then take a tour of “mining the big data wheel” as one approach to navigating your ocean of data. MapR has the tools and capabilities that you need to make this journey a splashing success.