Real-Time Anomaly Detection Streaming Microservices with H2O and MapR – Part 1: Architecture

Contributed by Mathieu Dumoulin

Editor’s Note: Read Part 2: Modeling and Part 3: Production Deployment

Converged Architecture for Real-time Anomaly Detection for IoT Sensor Data

Industry 4.0 IoT applications promise vast gains in productivity from reduced downtime, higher product quality, and higher efficiency. Modern industrial robots integrate hundreds of sensors of all kinds, generating tremendous volumes of data rich in valuable information. However, some of the most advanced industrial makers in the world are barely getting started making use of this data, with relatively rudimentary, bespoke monitoring systems built at tremendous cost.

It is now possible to successfully deploy Industry 4.0 pilot use cases in a matter of months, at a small fraction of the cost of packaged industrial solutions, using a well-chosen selection of big data enterprise products and open source projects.

This blog series walks you step by step through the process of how we built a working real-time ML-based anomaly detection system on a working industrial robot-analog installed with a wireless movement sensor.

This article is in three parts, which corresponds roughly to the three major phases we worked through from concept to working system.

  • In Part 1, we present our system’s architecture and our technology stack. It serves as a real-world application of streaming microservices next-generation application.
  • Part 2 is about machine learning. We look at how we were able to get a well behaving predictive model from the raw sensor data using the excellent H2O machine learning tool.
  • Part 3 focuses on the deployment of our predictive model to production. We show how to use Kafka streams to serve near real-time predictions to a visualization system such as a dashboard or, in our case, an augmented reality headset.

Problem Statement

SEQ Figure

Figure 1: SEQ Figure __* ARABIC 1: the LPMS-B2 sensor

Our project originated from a small wireless sensor, the LPMS-B2, built by a German-led startup here in Tokyo called LP-RESEARCH.

This sensor is a miniature wireless inertial measurement unit. It includes 3-axis gyroscope, 3-axis accelerometer, and 3-axis magnetometer, temperature, barometric pressure, and humidity sensors.

LP Research asked us if it would be possible to use data generated by this sensor, when attached to an industrial machine such as a robot arm, for the purpose to detecting anomalies for predictive maintenance.

SEQ Figure2

Figure 2: SEQ Figure __* ARABIC 2: A real industrial robot

Modern industrial robots already come installed with a host of sensors, but integrating sensors from different makers can be a huge problem. Furthermore, less modern robots still used in production can lack such sensors. A wireless sensor such as the one made by LP RESEARCH can be added painlessly to any manufacturing plant’s equipment and stream its data wirelessly to a collection point.

Given such raw data, could we analyze it in such a way to detect anomalous movements that may be a tell-tale signal of a machine-tool or robot requiring maintenance?

Conversations with Japanese car-parts manufacturers tell us that sensor-based anomaly detection for predictive maintenance is a topic of the highest interest for all technologically advanced manufacturers.

Predictive Maintenance System Requirements

Our solution had the following requirements:

  • Collect sensor data in real-time and store it in a central location
  • Analyze data to create predictive models
  • Apply the predictive model to the raw data and get predictions in near real-time (in a few seconds)
  • Serve the predictions to some output such as an operational dashboard.

Constraints from the Sensor

The sensor we use, the LPMS-B2, has Bluetooth wireless signal capability to transfer data. The LP-RESEARCH team helped us by helping collect the data into a Raspberry Pi board, which is itself equipped with Wi-Fi capability.

The Raspberry Pi is the point from which we need to collect the raw data.

This type of constraint is typical of IoT use cases, where getting the raw data is already a challenge with no ready-made solutions.

Why It Needs to Scale

A single sensor such as the LPMS-B2 can easily generate as much as 1GB of data a day when all its sensors are turned on.

A real-world deployment of this technology may require processing data from a multiple sensors per robot, on tens to hundreds of robots per factory. Many makers have multiple factories as well.

Furthermore, some systems may not need maintenance for periods of months or possibly years.

It becomes important to keep this data for at least longer than the expected MTBF (mean time between failure) to accumulate data about real-world failure.

Our System

Figure 3- Full Prototype Figure 3: The Full Prototype

Our working prototype does real-time anomaly detection from the small blue wireless sensor attached to the model industrial robot (in red, above). In addition, we use an augmented reality (AR) headset prototype for output visualization, overlaid on the black and white square located on the arm of the robot.

The point of using an AR headset is to allow a factory operator to simply put on the helmet and see, in real-time, the operational status of the factory and its various machines in real time. The operator can check on the status of the various production machinery not just staring at a dashboard while sitting in an office but also while walking around on the factory floor and looking around.

OK and Failure States

The Proposed Architecture

Data Collection Data Collection Figure 6: Data Collection

The code collecting data from the sensor running on the Raspberry Pi is custom made in C by the LP-RESEARCH team. From there, we need a robust solution for data ingest that can be easy to use from code running on a Raspberry Pi, such as C or Python.

Given the hardware constraints, a common issue for edge devices, using a Kafka client – or MapR Client, since we’re using MapR – is out of the question. The best solution is to use the Kafka REST Proxy. Calling Restful APIs is very well supported and requires little to no special knowledge.

StreamSets Figure 7: StreamSets saves streaming data to MapR XD easily

We use StreamSets Data Collector to move the data from Streams to CSV on the distributed filesystem. StreamSets Data Collector is a wonderful GUI that is exactly the right tool for this type of glue logic, managing data flows visually with no code for many use cases. StreamSets supports Apache Kafka, MapR Streams, MapR FS, MapR-DB, Apache HDFS, Elasticsearch and many more. It supports all major enterprise distributed platforms and I can’t recommend it enough.

Scaling Discussion

Our data ingest solution is easy to scale: just add more proxy servers and put a hardware or software load-balancer in front of it.

MapR-ES, or Apache Kafka for that matter, scales linearly with additional servers. Both streaming platforms are known to support millions of events per seconds with a modest number of servers.

StreamSets Data Collector scales as needed as well, it can run on a single node or in distributed mode as a spark streaming application, if the data stream gets too large for a single client to handle.

By using the MapR Converged Data Platform, or even a more traditional Hadoop-based distribution, we can cost effectively store the sensor data for as many sensors as needed, for as long as needed, just by adding nodes. This is a great illustration of the benefits of scaling out as opposed to scaling up.

Building Predictive Models

We use H2O to train models from the data accumulated in MapR. It’s super fast for data ingestion and directly imports CSV data. The full details of the modeling process are discussed in Part 2 of this series.

All the code we created is shared on Github:

Serving Predictions in Real-Time

Building a model from data accumulated on a distributed cluster with H2O certainly presents its own data science difficulties. The data can just be hard to understand sometimes. But we could get on top of it with the use of the autoencoder algorithm and some clever modeling by Mateusz Dymczyk, a machine learning engineer at

H2O Export to POJO

One of the most difficult parts of enterprise ML projects is deployment to production. Most data scientists prefer to work with either R or Python to develop their models. That’s an excellent choice, we also used R for the modeling task of this project. However, it’s not realistic to deploy models created with R (less so for Python) into a production system that needs to scale to handle real-time predictions for potentially thousands of sensors simultaneously.

If the performance/scalability requirements are high enough, this may mean that models developed by the data scientists must be handed off to engineers to reimplement with C++ or Java before final integration with the “real” production systems. That’s potentially a very difficult and specialized task.

Instead, we have H2O’s fantastic “Export to POJO” feature. This feature allows a model created by H2O to be encoded automagically into a regular Java class. Thus, the most difficult part of the integration work is done in a single click.

Real-Time Predictions Figure 8: Real-time predictions

Mateusz developed a simple multi-threaded predictor class in Scala, which consumes data from the raw input and calls the score method to get a prediction. This prediction is used to calculate the predicted label (OK, FAILURE), which is pushed onto the output stream.

Once the predictions are on the output stream, getting them back to a dashboard is easy and depends on the tool of choice. For a Kibana based dashboard for example, we use Streamsets to move the predicted labels from the output stream into a Elasticsearch index that goes back to the dashboard.

Another similar solution is to use MapR-DB + OpenTSDB and Grafana, a solution scales to petabytes without any problem. This is the approach used to handle the metrics data of the MapR cluster itself for real-time monitoring for project Spyglass.

With StreamSets, either of these solutions require no coding at all. Awesome!


Spark Streaming Deployment

In a production system, we’re going to want to scale more easily than using the simple predictor. Spark streaming is one way we could do this as shown in the diagram below.

Spark Streaming

The only difference is to adapt the predictor code to use Spark Streaming, an easy task that allows the deployed model to take advantage of Spark Streaming’s built-in reliability and disaster recovery capabilities.

Of note, as soon as the prediction scales to more than one node, predictions aren’t going to be processed in sequence. This is not a problem, as long as the raw data read for a window size is much greater than the minimum time window to get a prediction. In our system, we use a window of 0.2s of data to make a prediction, so if mini-batches are 10s or more, the predictions are meaningful enough that order is not a huge worry.

We’d expect to miss error conditions that would cross mini-batch boundaries, but again, that’s OK. We’re not trying to detect a small glitch, but rather detect a change in the state of a machine, which produces an unusually large number of detectable anomalies, rather than just point glitch spikes, which we found to be common in the raw dataset.


Our system is up and running, we have a video of it where predictions are streamed back to an augmented reality headset. How cool is that?

Please check out the two minute video:

It took about two months of part-time work by a team of four engineers, a very reasonable amount of work given the nature of the application. This is state of the art technology.

The MapR Converged Advantage

It’s undeniable, our project could not have been implemented in the time we had without the excellent MapR Converged Data Platform.

We had a single little AWS cluster and barely spent any time at all configuring settings or installing software. The platform worked for us with everything we needed working out of the box with zero configuration work. We only needed to install StreamSets, which is easy and well documented with easy to follow documentation specifically for MapR.

Having the Kafka REST Proxy, the streams and the distributed file system as well as YARN (to run H2O in distributed mode) all on a single cluster was a real boost to our productivity.

We could discuss the performance advantages of the platform all day, they are impressive, but for this type of project, where the largest challenge is getting something working, having a simple technology stack with very few moving parts allowed the LP RESEARCH engineers to focus on getting their data up to the REST Proxy and down from the REST Proxy back into their AR headset. Similarly, with the data pipeline so easy to setup, all of our time was spent getting the best possible model working. In other words, we spent our time on the real value-added part of our project, not lost in endless installation and configuration of the two or even three separate clusters we would have needed working with a different platform.

On to Part 2: Machine Learning Modeling

Stay tuned for the next installment of this post where we share the details of the modeling task for anomaly detection of the very, very noisy raw sensor data.

Additional Resources

This blog post was published July 17, 2017.

50,000+ of the smartest have already joined!

Stay ahead of the bleeding edge...get the best of Big Data in your inbox.