Scaling Time Series Analysis on the MapR Data Platform

Introduction

A time series is a collection of observations (x_t), where x_t is the event recorded at time t. Common motivations for time series analysis include forecasting, clustering, classification, point estimation, and detection (in the signal processing domain).

With the prevalence of sensor technologies, the Internet of Things (IoT) is growing in popularity. In a highly distributed IoT scenario (autonomous driving, oil drilling, healthcare wearables), timestamped data streams back to your data center and is stored there. Today, the value of data is higher than the value of the IoT technology. If you can use the data as soon as it arrives in your data center, rather than waiting for a certain period, and engage in exploratory analysis on that data, you will gain more value from that information and be able to make an impact faster.

MapR Time Series Quick Start Solution

The aim of the MapR Time Series Quick Start Solution is to solve the time series data collection and forecasting problem at scale. The applications that form the technology stack are MapR Event Store (streaming the event data into your data center), OpenTSDB (storing the data in a high-performance time series database), and Spark (data processing and forecasting). A high-level diagram of the workflow appears below: Picture 1

MapR Event Store for Apache Kafka is the integrated publish/subscribe messaging engine in the MapR Data Platform. Producer applications can publish messages to topics (i.e., logical collections of messages) that are managed by MapR Event Store. Consumer applications can then read those messages at their own pace. All messages published to MapR Event Store are persisted, which allows future consumers to catch up on processing and allows analytics applications to process historical data. In addition to reliably delivering messages to applications within a single data center, MapR Event Store can continuously replicate data between multiple clusters, delivering messages globally. Like other MapR services, MapR Event Store has a distributed, scale-out design, allowing it to scale to billions of messages per second, millions of topics, and millions of producer and consumer applications. Find more information on MapR Event Store here.

OpenTSDB is an open source, scalable time series database with HBase as its main back end. Since MapR Database implements the HBase API, MapR Database serves as the back end in this quick start solution instead. The high performance achieved by OpenTSDB is due to the following optimizations, specifically targeted at time series data:

  1. A separate look-up table is used to assign unique IDs to metric names and tag names in the time series;
  2. The number of rows is reduced by storing multiple consecutive data points in the same row, which makes seeks faster when reading.
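
The two optimizations above can be illustrated with a small Python sketch. This is not OpenTSDB's actual code: the `UidTable` class, the `row_key` and `store` helpers, and the one-hour row span are assumptions made for illustration, and a single ID table is shared by metric and tag names to keep the example short.

```python
from collections import defaultdict

ROW_SPAN_SECONDS = 3600  # assumed row width: one row per series per hour


class UidTable:
    """Hypothetical look-up table that assigns small integer IDs to names."""

    def __init__(self):
        self._ids = {}

    def get_or_create(self, name):
        # Assign the next unused ID the first time a name is seen.
        if name not in self._ids:
            self._ids[name] = len(self._ids) + 1
        return self._ids[name]


def row_key(uids, metric, timestamp, tags):
    """Return (row key, offset of the data point inside that row)."""
    # All points of a series within the same span share one base time.
    base_time = timestamp - (timestamp % ROW_SPAN_SECONDS)
    tag_ids = tuple(
        (uids.get_or_create(k), uids.get_or_create(v))
        for k, v in sorted(tags.items())
    )
    key = (uids.get_or_create(metric), base_time, tag_ids)
    return key, timestamp - base_time


def store(points, uids):
    """Group (metric, timestamp, value, tags) points into wide rows."""
    rows = defaultdict(dict)
    for metric, timestamp, value, tags in points:
        key, offset = row_key(uids, metric, timestamp, tags)
        rows[key][offset] = value  # one column per point within the row
    return rows
```

With this layout, 120 one-second readings from a single sensor that fall within the same hour end up in one row with 120 columns rather than in 120 separate rows, and the row key holds compact IDs instead of repeated strings.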

On MapR, benchmarked ingestion rates have reached 100 million data points per second (link).

Apache Spark lets us consume data from MapR Event Store and provides data processing/parsing functions while training machine learning models with multivariate time series regression algorithms. Our Spark Streaming code picks up the data from MapR Event Store, briefly processes it, and writes it to OpenTSDB; meanwhile, the machine learning model is fit to the data and its predictions are written to OpenTSDB as well.
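
As a rough illustration of the parse-and-write step, the sketch below shapes one record into data points and posts them to OpenTSDB's HTTP `/api/put` endpoint. It is plain Python rather than the Spark Streaming code from the quick start, and several details are assumptions: the record layout (an epoch timestamp in seconds followed by one value per sensor), the metric name `gas.reading`, the `sensor` tag key, and the `localhost:4242` address.

```python
import json
import urllib.request


def to_datapoint(metric, timestamp, value, tags):
    """Shape one reading as a data point for OpenTSDB's /api/put endpoint."""
    return {
        "metric": metric,
        "timestamp": int(timestamp),
        "value": float(value),
        "tags": dict(tags),
    }


def parse_record(line, names):
    """Parse an assumed whitespace-delimited record: timestamp, then one value per name."""
    fields = line.split()
    timestamp = float(fields[0])
    # The sensor name goes into a tag, so all sensors share one metric.
    return [
        to_datapoint("gas.reading", timestamp, value, {"sensor": name})
        for name, value in zip(names, fields[1:])
    ]


def put_datapoints(datapoints, base_url="http://localhost:4242"):
    """POST a batch of data points as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        base_url + "/api/put",
        data=json.dumps(datapoints).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

For example, `parse_record("1493683200 0.5 1.5", ["r15", "r16"])` yields two data points that differ only in value and in the `sensor` tag. In a streaming job, a function like `put_datapoints` would typically be called once per batch or partition rather than once per record, and predictions could be written the same way under a separate tag value.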

In our example, we used gas sensor data from the UCI machine learning repository (link). With this dataset, we try to predict the ethylene level based on 16 sensors that monitor the gas content. The exploratory plot below shows the time series for 16 sensor readings: Picture 2

We use basic linear regression on some autoregressive features as well as some second-derivative features. It is also good practice to look into the seasonality and stationarity of the time series data and apply smoothing/differencing algorithms to prepare the data for processing. For a target with an obvious on/off status, we could also consider combining a regression model with a binary classification model to obtain a better RMSE.
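
A minimal sketch of this kind of model is shown below, using NumPy. The exact features used in the quick start are not listed here, so the choices in the sketch are assumptions: current sensor readings, their discrete second differences (standing in for second-derivative features), and two lagged values of the target (standing in for autoregressive features). The function names are hypothetical.

```python
import numpy as np


def build_features(sensors, target, n_lags=2):
    """Stack sensor readings, their second differences, and lagged targets.

    sensors has shape (T, n_sensors) and target has shape (T,).
    Returns (X, y), where row i corresponds to time start + i.
    """
    sensors = np.asarray(sensors, dtype=float)
    target = np.asarray(target, dtype=float)
    start = max(n_lags, 2)  # first time step with every feature available
    # Second difference x[t] - 2*x[t-1] + x[t-2], a discrete second derivative.
    second_diff = sensors[2:] - 2 * sensors[1:-1] + sensors[:-2]
    second_diff = second_diff[start - 2:]
    # Column k holds target[t - k], the autoregressive part of the model.
    lags = np.column_stack(
        [target[start - k: len(target) - k] for k in range(1, n_lags + 1)]
    )
    X = np.column_stack([sensors[start:], second_diff, lags])
    return X, target[start:]


def fit_linear(X, y):
    """Ordinary least squares with an intercept term."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(coef, X):
    """Apply fitted coefficients (intercept first) to a feature matrix."""
    return coef[0] + X @ coef[1:]


def rmse(y_true, y_pred):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```

A typical use would be `X, y = build_features(sensors, ethylene)` followed by `coef = fit_linear(X, y)`, with `rmse` evaluated on a held-out stretch of the series rather than on the training data.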

The screenshot below gives an example from the OpenTSDB UI: Picture 3

The metric names in the data are stored as tags in OpenTSDB. In this figure, the blue and purple lines are two feature metrics, r15 and r16. The green line is the target time series, and the red line is our prediction: notice how closely the red and green lines track each other. OpenTSDB provides options for automatically refreshing this dashboard.

Summary

The focus of this article is on the workflow; the algorithm applied can be customized to suit the distribution of the data and the business requirements. I have packaged the quick start solution as an extension of the MapR 5.2 Docker container for demo purposes. If you have Docker installed, you can launch the demo from your laptop by following the steps in my Docker hub link.

The demo below shows how the Docker image works. It requires some time to start, because the MapR and OpenTSDB services, as well as the MapR Event Store and Spark applications, all need to come up.

Additional resources

Read blog "Introducing MapR Data Platform for Docker"

Read blog "Getting Started with MapR Client Container" by Tugdual Grall


This blog post was published May 02, 2017.