Dataware for data-driven transformation

IoT Mashup of Sensors, Cloud Software, and Machine Learning on the MapR Data Platform

Contributed by

10 min read

In this blog, I am going to walk you through a process that I used recently to integrate readily available IoT sensors, cloud offerings, and the MapR Data Platform, ultimately making a device that detects types of motion, similar to a fitness tracker.

With all the buzz around the Internet of Things, I wanted to develop a prototype that would demonstrate most of the features and architecture of a real-world IoT solution, using the infrastructure we had available to us at NTT DATA Business Solutions.

The Vision

What I had in mind at the very beginning (okay, I learned about IoT Gateway a few minutes after) was like the following:

Here, IoT sensors would read or measure some data, the gateway would collect those measurements and communicate them to the cloud, the cloud would store/route/process the data, and... well, do something with the results - machine learning, maybe?


The available sensor turned out to be Texas Instruments SensorTag CC1350. An accompanying smartphone app helped to discover its potential. It’s packed with sensors: for temperature, 3-dimensional acceleration, gyroscope, even luminosity, humidity, and atmospheric pressure. As a weather monitor would not be as interactive, I opted to do something with an accelerometer and gyroscope.

There was no IoT gateway in our company stationary room and apparently industrial gateway solutions could cost quite a bit. I emulated it on Raspberry Pi. The operating system on Pi is Raspbian, a flavour of Linux, which can perfectly host Python programs.

Data Plumbing

For sensor-to-gateway connectivity, my original intention was to use the module pybluez, but then I found sample code already developed for SensorTag using another module, bluepy.

Now to gateway-to-cloud. Now to gateway-to-cloud. In retrospect, I was lucky to have an NTT DATA Azure account already at my disposal. Otherwise, I’d probably have looked in the direction of AWS and had to code the cloud interface myself (yeah, plain MQTT, but still). With Azure, however, I just used an already provided Azure-IoT-SDK for Python.

So, I put both these connectivity functionalities into the same Python program, which would poll my SensorTag every 0.25 seconds, read motion-related measurements (X/Y/Z-acceleration X/Y/Z-gyroscope), and pass the resulting record to IoT Hub in Azure.

Storage and Processing

Now that the data reached the cloud – and in a bigger picture that would be data concentrated from multiple gateways and even more multiples of sensors – we could do something useful with it. For example: use it as a basis for decision-making, preferably in real time and potentially, using machine learning.

I’m sure all major cloud providers have the required functionality for that already. In Azure, there’s message routing (so an IoT message can be sent both to storage and to analysis functions at the same time), storage, machine learning, and data visualization functionality.

Cloud computing, at present, lies out of my skillset, so I just used Azure’s IoT Hub as a messaging proxy to a MapR cluster, which we conveniently had installed a few months prior in Azure.

The receiving part in MapR was simply MapR-ES. To feed the stream from Azure IoT Hub, I had to install a Kafka Connect Azure IoT Hub and set up a connector, using Kafka Rest API. Then, I could see the data coming in, in my Kafka consumer console.

Here's a sample of an incoming record under the microscope:

Struct{deviceId=247189586081,offset=24791616,contentType=,enqueuedTime=2018-08-08T01:04:37.314Z,sequenceNumber=45384,content={"deviceId": "24:71:89:58:60:81","msgID": 1,"utctime": 1533690277,"gyroX": -1.69531,"gyroY": 2.67188,"gyroZ": -0.83594,"accX": -0.29297,"accY": -0.79468,"accZ": -0.45581},systemProperties={correlation-id=correlation_1, iothub-message-source=Telemetry, iothub-enqueuedtime=Wed Aug 08 01:04:37 UTC 2018, message-id=message_1, iothub-connection-auth-generation-id=636606870154605075, iothub-connection-auth-method={"scope":"device","type":"sas","issuer":"iothub","acceptingIpFilterRule":null}},properties={}}

It looks like JSON, but not precisely. Only what’s inside the “content” section is actually a valid JSON schema, and that’s exactly the original message. The wrapping around is added by Kafka Connect.

I could have stored these records raw, but cheap raw storage would have made more sense right in the cloud. Here on the MapR Data Platform, I developed a Spark Streaming program that processed the records to get structured Parquet files. This will also allow the data to be used for analysis and reporting purposes.

Machine Learning on IoT Data

My data storage quest – with a lot of learning about possible options and their cost – was over.

Now I could formulate the problem to solve with streamed data:

How to distinguish (human motion) in real time, between resting, walking, and jumping? In order to do that, I had to attach the SensorTag to my shoe. What high-tech solution could not include a piece of duct tape?

So I recorded three datasets for standing still, walking, and jumping and ran a batch Spark machine learning program to create a Random Forest model. A machine learning record would include all 6 motion measurements of the current moment and 6 motion measurements from 9 records before. With the measurement interval of 0.25 seconds, that would constitute a record of 2.5 seconds of movement. Validation showed precision of 0.99, which was good enough for me.

On Real-Time Visualization

In a productive environment, the output of Spark Streaming could be written into yet another Kafka stream and then consumed by whatever reporting or decision-making application you have. Apache Drill + Tableau? Possibly – I’ve yet to check that out. Just to make the solution demonstrable, I tried Node.js with kafka-node and highcharts modules to consume that output Kafka stream... only to realise that kafka-node does not work with MapR Streams. I implemented that functionality in plain sockets.

What I Finally Got

In the end, this is what I got for my “unglorified fitness tracker” (as a colleague of mine described it):

To test how it all works, I spent 40 seconds (roughly – according to my stopwatch):

  • 10 seconds resting,
  • 10 seconds walking,
  • 10 seconds jumping,
  • 10 seconds resting.

The first results appeared in the live chart ~5 seconds after I started recording – that’s network/cloud latency time combined with the latency imposed by stream “windowing” by the Spark program.

The end result looks similar to the graph below, where values are: 0 for resting, 1 for walking, and 2 for jumping.

The results are shown in the short video below.

Real-Time Reporting on Streamed Results

For the "mainstream real-time reporting tool" mentioned earlier, I picked Microsoft Power BI. I used two components made available by Microsoft Azure:

  1. Azure Event Hub, acting as a Kafka server. My Spark Streaming application then acted as a Kafka client of a local MapR Streams (to obtain IoT readings) and as a Kafka producer for the Event Hub (to send processed data for reporting).
  2. Azure Stream Analytics Functions, which consumes data from an Event Hub, processes it with an SQL-like query, and outputs into a Power BI sink.

There were a few minor hiccups in this process, where the documentation was lacking details:

  1. Azure Event Hub is accessible via SSL, thus I had to do proper SSL authentication configuration for Kafka locally on a MapR cluster for trusted certificates.

  2. Stream Analytics Functions’ input should be either a CSV data, with column names attached to every record, or a valid JSON. I used JSON, as Spark can convert a Spark SQL row to a JSON record.

Apparently, a streaming dataset appears in a Power BI workspace only after some data actually gets pushed through. It's best not to change that dataset on the Power BI side because, in case something goes wrong, error messages would be rather cryptic. Any changes should be done in Stream Analytics Functions or upstream.

Closing the Feedback Loop

Let's say an IoT device measures vibration level on a gearbox – then, if the vibration level exceeds some safe threshold or maybe the vibration itself has a particular pattern (that's where machine learning would come into play), an alert on the IoT device's side might be required. To demonstrate this capability, I combined two functionalities:

  1. Send messages from the Spark application, using Azure SDK for Java. This call as well as "send to Azure as Kafka" were placed in the same "foreachWriter" sink of Spark Streaming as asynchronous calls. (I had to read up on Scala concurrency.)
  2. Light LEDs on Raspberry Pi, using GPIO library (admittedly a basic exercise, but I enjoyed the DIY part of it).

Hope you enjoyed reading this blog on how to create a cool fitness tracker using IoT, cloud, and machine learning technologies directly on the MapR Data Platform, as much as I enjoyed making it.

PS: For my next project, I am working with my colleagues at NTT DATA to apply these learnings to more of a real-world, predictive maintenance scenario. In this scenario, we will focus on measuring vibration, using the power of cloud and MapR to predict machine failure and integrate with SAP ERP Plant Maintenance to demonstrate the “Intelligent Enterprise” in action. Once it’s all done and dusted, I might be tempted to write a blog on this one, too.

This blog post was published October 05, 2018.

50,000+ of the smartest have already joined!

Stay ahead of the bleeding edge...get the best of Big Data in your inbox.

Get our latest posts in your inbox

Subscribe Now