Data Pipelines for Factory IoT - A Practitioner's Guide


Ian Downard

Senior Technical Marketing Engineer, MapR

Arun Sinha

Director of Business Development, Opto 22

One of the challenges with operating complex manufacturing systems is detecting and troubleshooting failures. To tackle this challenge, you need data and the proper analytical tools to analyze that data, but how you connect that data to the tools can make or break the effectiveness of your monitoring strategy. In this webinar you’ll learn about how to build data pipelines for instrumented manufacturing systems.

Topics include:

  • Data ingest. How to accommodate heterogeneous data sources with streaming services capable of aggregating data produced by different devices at different rates and in different formats.
  • Processing Data Streams. How to process data streams with transformations and filters that expose important signals for machine learning and real-time alerts.
  • Persisting Data Streams. How to persist data streams in OpenTSDB and MapR Database with the timeliness required for interactive dashboards in Grafana and data exploration with Apache Drill.
  • Architecture and Operationalization. How to architect and operationalize data pipelines using StreamSets.


Ian Downard: Thank you. Welcome to the webinar today. The topic of this webinar is data pipelines for factory IOT. My name is Ian Downard. I am a technical evangelist at MapR, where my focus is on developer enablement for AI and analytics. In the past, I've worked with the United States Navy and Rockwell Automation doing things related to factory IOT.

Ian Downard: I'm also joined today by Arun Sinha, who is the director of business development at Opto 22. His focus is on industrial automation and instrumentation. In the past, he's done mechanical engineering for Schneider Electric, Emerson, and others.

Ian Downard: Our agenda for today is, first Arun is going to talk to us about the last mile of industrial IOT, talking about how to connect physical assets with the digital world. I will follow Arun with a couple of presentations on the road map for integrating AI into your industrial IOT applications, and a pipeline demo that shows HVAC data going into a data platform that is suitable for AI. Then, we will wrap up with a discussion about the challenges associated with AI and industrial IOT.

Ian Downard: Okay, Arun.

Arun Sinha: Thank you, Ian. I'm very happy to be here today. I'm going to start off with just a few words about Opto 22. Opto 22 is a 44-year-old company. The company was founded by one of the co-inventors of the solid state relay, which is a very fundamental control technology. The company grew from there when mainframes, and later on PCs, wanted to connect to real world IO, and control and monitor it. Our products evolved to interface with those systems, and eventually, around the '90s, became what would be called today a programmable automation controller.

Arun Sinha: We're based in Southern California, in a beautiful town called Temecula, a little north of San Diego. We have about 200 employees. We manufacture all of our products in the facility that I am sitting in, that you see a picture of there. We've been in industrial control and industrial automation for many decades, but there's always been a bit of a bent toward IT with our products. Very early on, we had interfaces to IT systems, and we continue that strategy today.

Arun Sinha: We have customers all over the world, large and small. Some of these names you might recognize. One thing about industrial automation, or control systems, they're not vertical specific. We have customers in all types of verticals. Any application where there is a need to monitor and control, and do something with that data, you'll find Opto 22.

Arun Sinha: So, that's a little background on us. The way I like to think of Opto 22 is that we're kind of the fingers that touch industrial assets. We move that data over IP networks via IP infrastructure into software and enterprise applications.

Arun Sinha: For those of you who know what industrial automation is, I apologize. This will be a very basic intro to what industrial automation is. It will be review for you. I wanted to set the table, or set the groundwork for what Ian is going to present later. Industrial automation can be referred to ... typically, the platform can be called a PLC, programmable logic controller. Sometimes a DCS, distributed control system. A more modern term that's a little more all-encompassing is PAC, or programmable automation controller, which we've traditionally called our products. Today, we are calling our flagship product, which I will talk about a little later, an EPIC: an edge programmable industrial controller.

Arun Sinha: The purpose in life of any automation system is to take inputs and drive outputs, and kind of in the middle sits logic. That's where somebody who understands the process or machine defines, based on what the inputs are doing, the control scheme to go drive those outputs. That's a very basic explanation of what an industrial automation system is.

Arun Sinha: What are inputs? Inputs and outputs both can be three basic types: physical devices, digital devices, or data sources. For inputs, physical devices would be things like sensors: proximity, limit, and inductive sensors. Those are discrete. There are also analog signals for things like pressure, level, temperature, flow, motor speed, et cetera. That's physical input. There are also smart, or semi-smart, devices out there which would be considered digital devices: power monitors, smart devices, protocol converters, gateways, and some variable frequency drives for motor controls. We have physical, we have digital, and then of course, there are data sources: databases, cloud services, APIs, local files, and things like that.

Arun Sinha: The same holds true for outputs. When we are coming out of the PLC or the PAC, we go to things like lights, valves, solenoids, heating elements, motor starters, stack lights, et cetera. Those are physical. They're either discrete or analog. Same as for inputs, sometimes we're driving digital devices on the output side. The same would hold true for data sources, like databases, business apps, APIs, et cetera.

Arun Sinha: That is a very broad brush overview of what an industrial automation system is. I'm going to talk a little bit about our platform called the Groov EPIC. As I mentioned, EPIC stands for edge programmable industrial controller. Again, I wanted to get to what we feel our role is in the IOT, and I can illustrate that with our Groov EPIC product.

Arun Sinha: This is a picture of the Groov EPIC. I've got the covers of the IO modules open right here. You can see the wires that would go out to those sensors and things that I mentioned before, whether they are coming in or going out to an actuator. To the left, with this local display open, you can see there is a couple of ethernet ports, there's some other ports in there for USB and HDMI, et cetera.

Arun Sinha: This is a Groov EPIC platform. Before I talk about the EPIC platform in more detail, I want to talk about the most important slide I am presenting. It's the reason for me being here, and being on this webinar. The challenge that is faced today with data applications, data platforms, IOT platforms, is that they need data from the physical world. Right? Those sensors and those actuators, they don't speak any type of IT protocol. They're a raw electric signal. They're a voltage, they're an amperage. They're not data yet. They need to be converted into a format that the IOT or data platforms can understand.
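To make the physical-to-digital conversion concrete, here is a minimal sketch of turning a raw current-loop reading into an engineering value. It assumes a 4-20 mA transmitter, which is a common industrial convention but is not named in the talk; the function name `scale_current_loop` and the 0 to 10 metre tank range are hypothetical and chosen only for illustration.

```python
def scale_current_loop(milliamps, eng_min, eng_max, lo_ma=4.0, hi_ma=20.0):
    """Linearly map a current-loop reading (mA) to engineering units.

    Assumption: the sensor is a linear 4-20 mA transmitter, so lo_ma maps
    to eng_min and hi_ma maps to eng_max.
    """
    if not lo_ma <= milliamps <= hi_ma:
        # A reading outside the loop range usually points to a wiring or
        # sensor fault, so refuse to convert it.
        raise ValueError(f"reading {milliamps} mA is outside {lo_ma}-{hi_ma} mA")
    fraction = (milliamps - lo_ma) / (hi_ma - lo_ma)
    return eng_min + fraction * (eng_max - eng_min)


# Hypothetical tank level transmitter spanning 0 to 10 metres.
level_m = scale_current_loop(12.0, eng_min=0.0, eng_max=10.0)
print(level_m)  # 5.0
```

Only after a step like this does the value become something a data platform can store and compare across devices.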

Arun Sinha: Another problem that exists today is that industrial control systems, PLCs, PACs, DCSs, have all been talking to each other and sharing data for decades. The problem is, they use very specific industrial automation protocols, and oftentimes, it takes a lot of work in the way of middleware, integration, and gateways to get that data into an IT system or to a software application.

Arun Sinha: The third challenge is, there's a lot of equipment out there that isn't even instrumented. It doesn't have sensors on board. The challenge today with IOT applications in industrial environments, on board machines, is that this real world data doesn't exist until it is converted and conditioned, very importantly, into the format that data and IOT platforms understand. You'll have a process plant that is instrumented, but the data is kind of locked down and not moving into higher systems without some degree of difficulty. You might have a manufacturing plant with OEM machinery, where the same is happening. You might have legacy equipment that is not instrumented at all. That's the challenge that we address here at Opto 22. What we do for data platforms and the IOT, the problem we're solving, is converting that physical to digital.

Arun Sinha: You have the digital world out there, in the way of databases, cloud applications, on-premises software applications, mobile devices, and then you have those physical things I was talking about. You have sensors, or pieces of equipment, or sensors on board machinery and in processes. What we look to do is to bridge those two worlds together.

Arun Sinha: A simple example that I will walk through is we have, let's say, a tank with a level sensor, and then we have, let's say, a pump to fill that tank. What we do, basically, is make those connections and bring that data into software applications. Also, performing some other functionality at the edge, which I will talk about.
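The tank example also illustrates the inputs, logic, outputs pattern described earlier. The following is a minimal sketch, not Opto 22's control scheme: it assumes the level arrives as a percentage and that the pump should start below one threshold and stop above another. The function name `pump_command` and both thresholds are hypothetical.

```python
def pump_command(level_pct, pump_running, start_below=30.0, stop_above=80.0):
    """Decide whether the fill pump should run (input -> logic -> output).

    Assumption: a simple hysteresis band. The pump starts when the tank
    drops to start_below and stops once it reaches stop_above.
    """
    if level_pct <= start_below:
        return True
    if level_pct >= stop_above:
        return False
    return pump_running  # inside the band: keep the current state


pump = False
for level in [50.0, 25.0, 60.0, 85.0, 70.0]:
    pump = pump_command(level, pump)
    print(level, "pump on" if pump else "pump off")
```

The band between the two thresholds keeps the pump from switching on and off rapidly when the level hovers near a single setpoint.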

Arun Sinha: Coming back to the EPIC is ... the purpose in life for the EPIC, again, is to do real time control at the edge to do some processing of data at the edge, and to provide local user interface, and kind of northbound providing that data to data analysis platforms, cloud platforms in an IT friendly protocol and format.

Arun Sinha: A little bit specifically on the EPIC. This, of course, makes those connections to the sensors, and has a really broad tool set to do things like connecting to third party control systems and legacy control systems, doing the real time control, creating user interfaces for mobile, for browsers, or on screens, and building and deploying IOT applications quickly with built-in tools like Node-RED. Also, IOT protocols like MQTT are very important for interfacing with software systems without the typical hassles of poking holes through firewalls, and dealing with IT, and things like that.

Arun Sinha: Just a real quick overview of the tool set on this EPIC, in the context of hey, we've got these sensors. We've got third party PLCs. We've got smart devices that speak industrial protocols, and we connect to them using the EPIC. On board, of course, we like to use what we call Groov Manage to set up the system, to set up the security, to set up user access. The next piece of software we have on there is the fundamental control scheme. That's where we are doing some edge data processing. We're doing some real time control of those sensors, instrumentation, and smart devices.

Arun Sinha: Another tool that is on board is what we call Groov View, which is where the user can develop operator interface screens suited for any browser. Mobile devices, iPads, Androids, things like that. That's becoming very important on the plant floor these days, where BYOD, bring your own device, is increasing in popularity.

Arun Sinha: Another tool that we have on there is Node-RED, if you are familiar with that. It was developed by IBM, but it's open source, and it's basically a tool to get data from the EPIC controller that is talking to the IO, maybe mash that data up with data from a website or an API, and then move that data somewhere else as well.

Arun Sinha: Finally, we have a tool on board called Ignition Edge. It's from a SCADA company called Inductive Automation, and it gives us OPC connectivity to other people's PLCs and other manufacturers' legacy systems, as well as their MQTT engine to move data to cloud applications and IOT applications.

Arun Sinha: That's kind of the summary of what is industrial automation, how do we make those connections to sensors that are on board machines and in processes. Using a platform like EPIC, really the bottom line is we bridge that OT gap that is ... the OT and IT gap, and make that data available to cloud applications and data analysis packages using this broad tool set.

Arun Sinha: Really, what is the promise of the industrial IOT? It is that the data is an asset to the enterprise, right? Acquisition and analysis of data in real time for purposes beyond the plant floor: higher level business decisions, improving processes, reducing costs, increasing profits, tuning systems autonomously, detecting anomalies, decreasing downtime. A really big topic in the industrial space is predictive maintenance. Predicting failures before they occur. Ian is going to build on that.

Arun Sinha: Hopefully, that gave you a good overview of who we are, what we do for data analysis software, what we do for the IOT. Hopefully, that's a segue to Ian, who is going to talk about how a system like the EPIC can provide data for MapR to ingest, and then what can happen through the data pipeline and in the analysis package.

Arun Sinha: Thank you, and with that, I am going to turn it over to Ian.

Ian Downard: Thank you very much, Arun. That was fantastic. Let me just switch the sharing to my ... the rest of the presentation, we're going to be talking about data pipelines, and data pipelines that are specific to AI. What you have to do to IOT data to get it in a place where you can do artificial intelligence on it.

Ian Downard: The particular kind of AI that we're really thinking about is something called predictive maintenance, because it relates so much to factory automation. Let's just do a quick introduction to what is predictive maintenance.

Ian Downard: This is a quote that I think encapsulates it pretty well. This came from a book called Learning to Love Data Science by Mike Barlow. Imagine a world in which machines and systems never break down unexpectedly. That's the goal of predictive maintenance. No surprises. If you are not surprised by failures, then you can be more efficient about how you deal with them. You can do things like run machines at lower speeds or lower pressures to postpone total failure. You can have spare equipment brought on site, which would be really cost effective if you've got remote sites like pipelines, for example. Also, you can schedule maintenance at a more convenient time, rather than having to scramble when there is an imminent failure. That's the attraction of being able to predict failures with predictive maintenance. All of these things can reduce costs.

Ian Downard: Those costs can be pretty big. This is the seminal study on predictive maintenance. It came from the U.S. Department of Energy, in which they said predictive maintenance can reduce equipment failures by 75 percent, downtime by 45 percent, and maintenance costs by 30 percent. That is a pretty big savings, but look at the date for the study. August 2010. Since then, we've had the arrival of IOT and AI, and the opportunities for IOT are huge in manufacturing.

Ian Downard: This is the last quote that I will be showing. From the McKinsey Global Institute: by 2025, the potential value for IOT in factories will be a huge number. In the trillions. The reason this is happening, the reason IOT is so valuable, is because the data it provides is so granular, and learning valuable insights from that data is now possible with big data analytical tools. By big data analytical tools, I'm including artificial intelligence.

Ian Downard: Predictive maintenance steps are ... there's really four steps to doing predictive maintenance. First, you have to collect the data. That could be from the Groov Edge appliance that Arun mentioned. Essentially, you want to collect data anywhere that it's possible, and anywhere where you could have data that correlates to failures. Sensors are an obvious choice. Sensors from machinery on the factory floor, cameras so you can detect changes in infrared heat. Also, you may not think of this, but also operator logs. Or, even weather. All of these things can have metrics that correlate to failures.

Ian Downard: The second step is, you have to store that data for a really long time. Why? The reason for that is because failures are usually not very common. If you think about a hard drive- even a spinning disk hard drive is supposed to last a million hours. Failures are rare. In order to train software to detect patterns of failures, you have to show that software lots of patterns. The combination of failures not happening very often, and the requirement to capture data from many different failures in order to predict when they are going to occur, means you have to keep that data for a long time.

Ian Downard: That really gets to the third point, where you are doing the data science aspect of looking for patterns. Using tools that can find these subtleties that you can't see with your own eyes. These tools can use artificial intelligence, or they can use more traditional mathematical techniques that have been used for ages. Essentially, you are using some sort of data science tool to surface these patterns.

Ian Downard: Once you find those patterns, and they make sense, and they're valuable for predicting the failures that you're seeing, you want to automate them in production. In production, for predictive maintenance, you're usually using two different types of agents. You can use failure agents, or anomaly agents. Failure agents will look for those patterns associated with known failure modes, and you could even automate some behavior in response to that, triggered by a failure ... imminent failure. Anomaly agents typically look for when machines are behaving abnormally, and then notify a human to check it out to see if things are working as they should be.
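The difference between the two agent types can be sketched in a few lines. This is an illustration under stated assumptions, not a production design: the failure rule, the metric names `bearing_temp_c` and `vibration_mm_s`, the action name `reduce_speed`, and the z-score limit are all hypothetical.

```python
from statistics import mean, stdev


def failure_agent(reading, known_failure_rule):
    """Return an automated action when a known failure pattern matches."""
    return "reduce_speed" if known_failure_rule(reading) else None


def anomaly_agent(value, history, z_limit=3.0):
    """Return a message for a human when value is far from recent behavior."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return None  # flat history: this simple check cannot score the value
    z = abs(value - mu) / sigma
    return f"notify operator: z-score {z:.1f}" if z > z_limit else None


# Hypothetical known failure mode: hot bearing combined with high vibration.
def overheating(r):
    return r["bearing_temp_c"] > 90 and r["vibration_mm_s"] > 7


print(failure_agent({"bearing_temp_c": 95, "vibration_mm_s": 8.2}, overheating))
print(anomaly_agent(25.0, [20.0, 20.5, 19.5, 20.2, 19.8]))
```

The failure agent acts on its own because the pattern is already understood, while the anomaly agent only flags unusual behavior for a person to investigate.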

Ian Downard: AI can be really useful for detecting patterns. Let me just describe what artificial intelligence is. AI is software that can detect patterns in data. It is especially suitable for finding patterns in large data sets. The more data you give AI, the better it performs. The image that I am showing here is just one example of what AI can do. This is an image taken from a linear regression study, where we're training a model on a sequence of time series data, and the output of the model that we're training here is going to basically predict a short window of time into the future. This could be a useful technique for something like anomaly detection.

Ian Downard: Let's go into the road map for how businesses can integrate AI into their industrial IOT applications. As we said from the beginning, the first step is instrumenting your machinery. That is often already in place, but not always, especially with legacy machinery. That is the first step you have to do, because you can't do any kind of pattern recognition without data. Establishing data sources is step one. The next step is that you have to move that data into a place where you can do analysis. That is the role of the data pipeline. The third step is that you have to do data science to analyze that data and run AI experiments, maybe to do even better pattern recognition. Finally, you deploy those patterns into production applications like failure agents or anomaly agents.

Ian Downard: I think of this as broken into three different phases. You've got the monitoring aspect, you've got the data exploration aspect, and then you have the applied machine learning aspect. Monitoring, we've ... manufacturers have been doing that for a long time using dashboards that show them the status of the assembly lines. Also, just control interfaces to help them control machinery.

Ian Downard: The role of the data pipeline, as I said, is to move data into a place where you can analyze it. No matter what you plan to do with the data, whether you just want to monitor your status of your factory, or you want to do some advanced analytics, you have to persist. You have to save that data somewhere. That's the focus of this talk, really, of my talk is to describe the challenges associated with saving that data. That's the role that MapR can play, to persist the data and to facilitate the advanced analytics that you want to do on that data.

Ian Downard: You can write custom data pipelines using APIs such as Kafka or HBase. Java and any number of other languages have these interfaces that you can use to tap into RESTful data sources, or the MQTT data sources that Arun mentioned. However, you don't have to write them custom. MapR supports all these open APIs to do that, but there are also tools with which you can graphically create these data pipelines. If you like ladder logic, then you are going to like StreamSets. This is one of my favorite tools for creating data pipelines.

Ian Downard: It really is a great way to create data pipelines by drag and drop. It's production ready as well. You can start and stop pipelines with a REST API. This is where I want to pause and go into a short demo. I have created a pipeline in StreamSets.

Ian Downard: This is the StreamSets GUI. Are you seeing my screen? No, you're not. Hang on. I'm just going to transition to Chrome here. Maybe I need to exit presentation mode. Okay, here we go. Maximize my screen. Okay.

Ian Downard: Here is what StreamSets looks like. It's a web application. You can install it in Docker, or you can install it on an actual cluster node or a stand-alone server. You have what they call stages. Stages for the beginning of a pipeline, stages for processors in the middle, and destinations. I only have one line in my pipeline, but you can fork off different branches here to send data to different places. They have more stages than you can shake a stick at. There are lots of different packages that you can install into StreamSets. MapR is one of them.

Ian Downard: With MapR, our platform has an embedded database, a distributed file system, and distributed streams. You can use any one of those to bring data into a pipeline, or send it out. This D corresponds to a destination. You can just drag these into a pipeline and specify the table name and other parameters that are associated with that.

Ian Downard: In the pipeline that I've already created, I'm consuming from a stream. This is the path to my stream. We've actually ... Opto 22 provided us some data when they- I got to share my whole desktop here. There you go. Okay. You won't see this probably, the font is really small. I'm just starting a simulated MQTT data source. This data is provided to us from Opto 22. It instrumented an HVAC system. When I start this pipeline, we're going to see data populated in this dashboard. This is Grafana, another web application. Grafana picks up the data that is stored in the database that is called OpenTSDB. The role of this pipeline is to consume the MQTT data and to reformat the timestamp, with this stage, into a format that is expected by OpenTSDB. It's common to have to reformat data, especially timestamps.

Ian Downard: It's going to post those. Each record that we are getting in this MQTT data, we're going to post that to OpenTSDB. You'll be able to see the metrics that are available. Let me just start this pipeline. Okay. We're ready. It's just received 26 records from our stream. We can monitor the pipeline with these metrics, as well. StreamSets is great for that.
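The timestamp reformatting stage in the demo can be approximated with a short sketch. The demo does this inside StreamSets, not in Python, and the incoming record layout is not shown in the talk, so the field names `time`, `name`, `value`, and `site`, the ISO 8601 input format, and the metric name are all assumptions. The output shape (metric, timestamp as Unix epoch seconds, value, and at least one tag) follows what OpenTSDB's HTTP `/api/put` endpoint accepts.

```python
import json
from datetime import datetime, timezone


def to_opentsdb_point(record):
    """Convert one decoded MQTT record into an OpenTSDB data point dict.

    Assumption: the record carries an ISO 8601 UTC timestamp string, while
    OpenTSDB expects a Unix epoch timestamp.
    """
    ts = datetime.strptime(record["time"], "%Y-%m-%dT%H:%M:%SZ")
    epoch_seconds = int(ts.replace(tzinfo=timezone.utc).timestamp())
    return {
        "metric": record["name"],
        "timestamp": epoch_seconds,
        "value": record["value"],
        "tags": {"site": record["site"]},  # OpenTSDB needs at least one tag
    }


# Hypothetical record, shaped as it might look after decoding an MQTT payload.
record = {
    "time": "2019-01-15T12:00:00Z",
    "name": "hvac.boiler.supply_temp",
    "value": 71.5,
    "site": "demo",
}
print(json.dumps(to_opentsdb_point(record)))
```

The resulting JSON would then be sent to OpenTSDB in an HTTP POST; that call is left out here so the sketch stays runnable without a server.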

Ian Downard: Whenever I go over here, we can already see data starting to come in. In our MQTT data set, we have 150 metrics corresponding to boilers and chillers and temperatures throughout various ducts. Energy usage and outside air temperature. You can also use Grafana- you can display alerts in Grafana, so we don't have any alerts coming in currently. In the past, we have had alerts, and they can show up as these vertical lines. You can set up email alerts as well. This is just FYI. Grafana is a very useful tool, and it's included inside the MapR platform.

Ian Downard: These alerts could be anomaly detection, or they could be failure detection. They are the outputs of anomaly and failure agents. I think I will pause the demo there, and go back to the presentation.

Ian Downard: You could stop there. You could stop with Grafana. What you would have there would be situational awareness for your factory, and have an idea of what's going on in your factory. A place where you can visually inspect for failures or degradation. There's a lot more you can do, as we mentioned earlier on, with predictive maintenance. You could explore pattern recognition with AI. If you do that, there's a lot more that you should consider.

Ian Downard: For example, to do AI, typically people will use programming libraries like what we see here: Scikit-learn, TensorFlow, Apache Spark, or Keras. They might use IDEs or data science notebooks, like Jupyter, Zeppelin, RStudio, and so on. There are lots of different options for how you do data science. Lots of tools out there. The one thing that they all have in common is that they all have to get data from somewhere. That data could be in a siloed database. It could be in a separate cluster where you are hosting a distributed file system, or it could be in MapR. MapR is attractive for a lot of different reasons, but just high level, you can think of it as a place where all kinds of data can sit. Files, tables, streams. All data can reside in MapR, which makes it much easier to manage and store.

Ian Downard: A lot of times, people feel somewhat intimidated by all these different options for data science. That's understandable. The way I think most organizations will explore AI is through short term pilot programs. Just to experiment with as many different avenues as possible. Just to find low hanging fruit. This way, they kind of build up a portfolio of viable techniques that make sense for their particular domain.

Ian Downard: For example, two particular AI algorithms that are commonly used for predictive maintenance are linear regression and logistic regression. Because failure agents are essentially looking for a probability of failure, when it reaches a certain threshold, it triggers some action. That question of predicting the probability of failure can be thought of as looking at a metric called remaining useful life. This is the metric that you are trying to train an algorithm to predict. Anytime you are trying to predict probabilities, this logistic regression is commonly used. This little screen shot, this is basically the probability of passing an exam versus the number of hours that have been studied. You can see the more hours you study, the higher the probability that you will pass the exam. That could be the probability of failure as well, so these other axes could be different types of correlating metrics. That's one approach for doing failure prediction.
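The thresholded-probability idea behind a failure agent can be sketched with a hand-written logistic function. The weights, bias, feature choice, and the 0.8 threshold below are hypothetical placeholders; in practice they would come from training on labeled failure data.

```python
import math


def failure_probability(features, weights, bias):
    """Logistic model: sigmoid of a weighted sum of the input features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients for two correlating metrics:
# bearing temperature (deg C) and vibration (mm/s).
weights = [0.08, 0.5]
bias = -10.0
threshold = 0.8  # assumed action threshold for the failure agent

for reading in ([60.0, 2.0], [95.0, 8.0]):
    p = failure_probability(reading, weights, bias)
    action = "trigger action" if p >= threshold else "no action"
    print(reading, round(p, 3), action)
```

As with the exam example, the curve is S-shaped: the probability stays near zero for healthy readings and climbs toward one as the correlating metrics rise.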

Ian Downard: Another one, as I mentioned earlier with that screen shot, is linear regression. We have a sequence of values and we are drawing a trend line through there to predict the next value. That could be very useful for anomaly detection. You would alert whenever the actual values are very different from what was predicted.
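A minimal sketch of that trend-line approach follows, assuming evenly spaced samples and a fixed tolerance; the helper names `fit_line`, `predict_next`, and `is_anomaly`, and the tolerance value, are hypothetical. It fits an ordinary least-squares line through a window of values, extrapolates one step, and flags the actual value if it lands too far from the prediction.

```python
def fit_line(values):
    """Least-squares slope and intercept for y over x = 0, 1, ..., n-1.

    Needs at least two values.
    """
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def predict_next(values):
    """Extrapolate the fitted trend line one step past the window."""
    slope, intercept = fit_line(values)
    return intercept + slope * len(values)


def is_anomaly(window, actual, tolerance):
    """Alert when the actual value is far from the predicted one."""
    return abs(actual - predict_next(window)) > tolerance


window = [20.0, 20.5, 21.0, 21.5, 22.0]
print(predict_next(window))                     # 22.5
print(is_anomaly(window, 22.6, tolerance=1.0))  # False
print(is_anomaly(window, 27.0, tolerance=1.0))  # True
```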

Ian Downard: Initially, models are deployed like this, where you have a linear regression model, where the inference request is basically a chunk of time series data. You're asking the model what do you think is going to happen in the next minute, or the next five minutes, or whatever your window of time is. Actually, on that screen shot, the model is predicting the remaining useful life. You're still feeding it data, and you're asking it, what do you think the probability is for failure in the next day, or week?

Ian Downard: This is a very sequential pipeline. These symbols here indicate streams. Streams are useful because they ensure that these requests and the responses are saved, and they're replicated. Streams are a part of the MapR platform, which is a distributed cluster of machines. The data in the streams is replicated across the cluster. If any one server were to go down, you would still have all that data available. Also, the data can be replayed, which could be useful if you needed to go back and change the model, or for a lot of other reasons.

Ian Downard: Another challenge with applied machine learning is you often don't have just one model. You have to have ... people will develop multiple models because there is this process of trial and error with the parameters that you use to create models. In reality, a lot of the time you will have multiple models that you want to deploy to production, have them performing inferences in parallel. You might have logic defined that will select one of these models to actually be your favorite result. That's the role of what is called a rendezvous service, where you are rendezvousing all the results from these different models.
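The selection step of a rendezvous service can be sketched as follows. This is a deliberately simplified illustration, not the architecture from the book: it assumes results for one request have already been collected into a dictionary and that selection is a fixed preference order. A real service would also have to deal with timing, which is not modeled here. The model names are hypothetical.

```python
def rendezvous(results, preference, fallback=None):
    """Pick one answer from the results that arrived for a single request.

    results: dict of model name -> prediction (only models that answered)
    preference: model names, most preferred first
    Returns (model_name, prediction), or (None, fallback) if nothing usable.
    """
    for name in preference:
        if name in results:
            return name, results[name]
    return None, fallback


# Hypothetical case: the preferred model did not answer for this request.
results = {"model_b": 0.71, "model_c": 0.64}
preference = ["model_a", "model_b", "model_c"]
print(rendezvous(results, preference))  # ('model_b', 0.71)
```

Because every model's output is kept alongside the chosen one, the unselected results remain available for comparing models later.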

Ian Downard: Again, there is a new stream here called Model This. It's very important to capture the differences between models, because if you think of enterprise software, a lot of the time you are doing A/B testing. You need to A/B test models because they have a short life span, and they can go stale. It's important to monitor them for quality. When they start going stale, then you re-train them.

Ian Downard: That concept of a rendezvous architecture was actually invented by some colleagues of mine at MapR. They wrote about it in this book called Machine Learning Logistics, which you can download. It's an e-book. You can download with this URL,

Ian Downard: Let's talk briefly about why MapR, and the challenges associated with applying AI and industrial IOT. Now, let's talk about road blocks. What can stop you right away is just too much data too fast. The way people a lot of times deal with too much data too fast is they sample it, or they'll aggregate the data in averages. That really is a bad idea, because when you are doing AI or any type of pattern recognition, you want as much granularity as possible. Any time you are averaging data, you're really missing out on those outliers that are extremely important for pattern recognition.
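A small numeric example shows why averaging discards the outliers that matter. The readings are made up for illustration: one minute of once-per-second samples that sit at 50.0 except for a single spike to 95.0.

```python
from statistics import mean


def summarize(samples):
    """Return (average, maximum) for a window of raw samples."""
    return mean(samples), max(samples)


# Hypothetical minute of data: 59 normal readings and one spike.
samples = [50.0] * 59 + [95.0]
average, peak = summarize(samples)
print(average)  # 50.75
print(peak)     # 95.0
```

If only the one-minute average were stored, the spike would show up as a change of less than one unit, which is easy to mistake for noise.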

Ian Downard: The second challenge is ... really what I just mentioned. Machine learning requires full resolution, high fidelity data. If you are capturing a lot of data very fast, in its full fidelity, you have to make sure you get that data to a place where you are going to do analytics. If you are saving it into a siloed data store, moving that data into another separate cluster that you are using for analytics can be really slow. That friction can kill productivity and really frustrate data scientists.

Ian Downard: A third point is that machine learning requires frequent re-training and re-deployment. Machine learning software, or any software that uses models, has a different lifecycle than enterprise software. These models are trained with data that is usually a snapshot in time. Now, as reality changes, as machine state changes, you need to re-train the model and deploy a new one. They have a shorter lifespan, typically. The existing tools and processes that we have used for deploying enterprise software into production, often called DevOps, don't support the life cycles that are part of machine learning. You have different people required to build these models. You have different tools that they're using, and they all need to access data. Then, the continuous deployment of models is going to be different. These are the three roadblocks that can prevent people from being successful in deploying AI into production.

Ian Downard: The five reasons to use MapR really address all those points. The first is scalable ingest. With MapR Streams (now called MapR Event Store), you can ingest millions of data streams per second. If you need higher throughput, or if you need more storage capacity, you can do that simply by adding cluster nodes. MapR scales linearly to thousands of cluster nodes. What I mean by scaling linearly is that, by adding more nodes, you're not incurring more technical debt. You don't have to go into all these other configurations to reconfigure them. It just works by adding nodes.

Ian Downard: The second point is, there's no data movement. You can do storage and analytics on the same cluster. That makes it much easier, much faster for data scientists to access the data, and immediately start processing and exploring it. Also, there's a variety of MapR, called MapR Edge, which is a distribution of MapR that can be deployed on smaller footprint hardware. That can be deployed on a factory floor to capture and process and apply AI models close to the source of the data. It would be a great way, a great pattern to use if you are trying to apply AI at that edge point.

Ian Downard: The third point is frictionless feature engineering. This is just another way of saying that data science is easier. You don't have to move data. There's no ETL from data silos. We also have technology such as the MapR Database connector for Spark. That makes it very easy to save features for machine learning in the database, and update those on really big tables. By big, I'm just thinking about all the ... take what I showed you earlier. 150 metrics for an HVAC system being reported once per second. Imagine if you had a more complicated machine, where you are dealing with groups of sensors. Each group could have 150 metrics, sampled once per second for a long time, because you have to train these models on multiple failures. That's a very large table. An important part of machine learning is to be able to update all that sensor data. When a failure occurs, you retroactively go back and you apply labels to all these records saying, yeah, we're close to a failure, or we're not.
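The retroactive labeling step described above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the `Reading` class, the `near_failure` field, and the three-second horizon are all hypothetical, and a real pipeline would apply the same update to a large database table rather than an in-memory list.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    timestamp: int              # seconds since the start of capture
    value: float
    near_failure: bool = False  # label applied after the fact


def label_before_failure(readings, failure_time, horizon_s):
    """Mark every reading in the horizon_s seconds leading up to a failure."""
    for r in readings:
        if failure_time - horizon_s <= r.timestamp <= failure_time:
            r.near_failure = True
    return readings


# Ten hypothetical one-per-second readings, with a failure observed at t=8.
readings = [Reading(t, 1.0) for t in range(10)]
label_before_failure(readings, failure_time=8, horizon_s=3)

labeled = [r.timestamp for r in readings if r.near_failure]
print(labeled)
```

Because labels are only set and never cleared, the function can be called once per observed failure to accumulate labels across multiple failures.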

Ian Downard: Also, I didn't mention that data versioning is really useful in MapR. You can take snapshots of terabytes of data, and tables, streams, or even directories in a matter of seconds. That's really useful because one of the first things you do as a data scientist is start monkeying with your data, so you want to back it up as step number one.

Ian Downard: The fourth point is that, as I mentioned with the machine learning logistics, it's just the fact that we have streams as a first-class citizen in the platform. Streams are so useful, not only for micro-service type message exchanges, but also for rendezvous architecture, and monitoring models, and even deploying models. They facilitate that continuous re-training and model deployment in production.

Ian Downard: The fifth, and I think most important, reason to use MapR is because it helps you attract and retain talent. It should come as no surprise to anybody that machine learning specialists and data scientists are some of the hottest jobs on the market today. It can be really challenging to recruit these people into your organization. What's surprising is, as this survey from Stack Overflow revealed, these are the same two job roles of people who are most likely to be looking for a job. Even if you do hire somebody, the odds that they're going to jump ship are actually pretty high. There could be a lot of reasons for this. One of the most likely reasons is that they're often not given the data they need to do their job.

Ian Downard: They're often also not given a suitable infrastructure to do AI. They don't have buy-in from management to have these DataOps tools and processes put into place. Those three things can lead to a lot of frustration with data scientists. They went to college, they learned how to perform magic on data, but they're just not given the data. They're not given the tools. They're not given suitable infrastructure. They have to do ETL, and that's just really frustrating.

Ian Downard: What delights them are features like these. POSIX compliance, meaning you have a file system that just looks and feels like Linux, something they are accustomed to using. You don't have to use any special utilities to just copy and edit files. Also, this easy backup and recovery is very useful. Standard APIs, being able to use open APIs and REST APIs to interface with, to save, and edit data. Also, the ability to use one platform for all data and for analysis. Having one platform that not only stores the data, but is also a place where you can install Spark, or Drill, or Zeppelin. These are all tools that data scientists really like to use. Having it support Docker and Kubernetes is a game changer. There's no better way to do experimentation than doing it inside a containerized environment. It makes backup so easy, and it makes it easy to run other people's experiments. That's really the way of the future, containerized work environments.

Ian Downard: Just having a platform that supports DataOps. All of these different people are going to be delighted by all these different features that are part of the platform. Data scientists, data engineers, software engineers, operations, and business analysts are all going to like having that easy access to data with one platform, without the burden of data movement.

Ian Downard: These are all characteristics of MapR. This is a diagram showing our data platform. We have services for distributed streaming, distributed database, distributed file system, and data science and analytics. We provide tools such as Apache Drill that enable you to use standard SQL, either programmatically from Python or from business intelligence tools, to analyze that data. That really helps ... you don't have to learn some proprietary query language.

Ian Downard: Also, you can deploy MapR on-prem, at the edge, on the cloud. Even if you have multiple MapR deployments, it's smart enough to ... you can connect them so they replicate the data between each other really quickly, and they mirror the file system. Maybe you are collecting and applying models at the edge, but all that data is going back to a central, maybe a cloud cluster, where you have data analysts and data scientists that are going to be accessing that data from that cloud as it is ingesting and replicating with multiple edge deployments.

Ian Downard: That sort of wraps up my presentation. If there are any questions, this is the time to ask. We have another ten minutes, I think.

David: Yeah, just a reminder, if you have any questions, to type them into the chat box in the lower left hand corner. There were a few requests, if we will be sending out a link to the deck. We will be ... we are recording this, and we will be sending out a link to the recording shortly after the event.

David: With that, let's hit some of the questions here. First one is, what are some of the benefits of MQTT in IOT applications?

Ian Downard: MQTT was a protocol that was designed specifically for IOT devices because it's a really pared-down communication protocol. It requires less energy, which makes it friendly for those wireless devices that have to operate on batteries. And, pretty much as an extension of that, it just got deployed into industrial environments.

Arun Sinha: I'd like to add to that-

Ian Downard: Go ahead.

Arun Sinha: I'd like to add to that, and Ian hit the nail on the head. It's a very lightweight protocol in various ways. It's report by exception, number one. If the state of a valve or something hasn't changed for a week, traditional models would be poll, response, poll, response, poll, response to remote access. That's a bit inefficient if data isn't changing. MQTT is report by exception. It's publish-subscribe, so clients and servers publish to a broker. That helps with another important point, which is this idea of having to penetrate a firewall at a remote site to do something like poll-response, or even do a RESTful API call.

Arun Sinha: It's a remote originated connection. It's two way, but it's remote originated. That alleviates some of the IT hassles. Those are just a couple of things about MQTT. It came out of the oil and gas industry.
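Report by exception, as described above, can be sketched without any real MQTT library. In this minimal illustration the `publish` callback is a hypothetical stand-in for a client publishing to a broker, and the topic name `plant/valve1/state` is made up for the example.

```python
class ReportByException:
    """Forward a value to publish() only when it differs from the last one sent."""

    def __init__(self, publish):
        self._publish = publish  # hypothetical callback standing in for a broker publish
        self._last = {}          # last value sent, keyed by topic

    def update(self, topic, value):
        """Return True if the value changed and was published, else False."""
        if topic in self._last and self._last[topic] == value:
            return False         # unchanged, so nothing goes over the wire
        self._last[topic] = value
        self._publish(topic, value)
        return True


sent = []
rbe = ReportByException(lambda topic, value: sent.append((topic, value)))

# Five local observations of a valve, only two of which are state changes.
for state in ["open", "open", "open", "closed", "closed"]:
    rbe.update("plant/valve1/state", state)

print(sent)
```

Compared with a poll-response loop, which would generate five request/response pairs here, only the two state changes produce traffic, and the connection is opened outbound from the remote site.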

Ian Downard: Okay. Thanks, Arun.

David: Another question is, does the team have any sensor data collection from power plants or jet engine use cases?

Ian Downard: There's a really good open data set that was published by ... I think it was used in Azure. Microsoft used it for an Azure Machine Learning tutorial. It's a simulated data set for airplane engine failures. As far as actual, real data, I am not aware of any. Just the simulated data set. I think with a little Googling, you can find it. I could send it to you, if you email me.

David: We will ... we can follow up with each ... with the questions if we didn't fully answer it on the call today. One of the team will follow up with you.

David: Okay. Next question is, can MapR Edge run on the Opto 22 Edge devices?

Ian Downard: No. They would run side by side. The Opto 22 ... the MapR Edge requires an x86-type architecture with larger drives, and I don't believe that's what the Opto 22 Edge device is.

Ian Downard: Does that answer your question?

David: They can't answer.

Ian Downard: Oh, they can't answer a question back. Are there any more questions?

David: Let's see. Let me take one look here. Yes. That did answer his question.

Ian Downard: There's one more thing I need to show. We are going to have a part two of this talk, and in part two, there's going to be a more technical focus around the code behind Spark and just streaming data. We have a demo repository in our MapR demos on GitHub.

Ian Downard: In part two of this webinar, we will be going through this in detail, looking at code to talk about lagging features and dealing with really fast data streams for vibration sensors and audio signals.
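As a rough illustration, and assuming "lagging features" refers to the common practice of adding earlier values of a signal as extra columns on each record, a minimal sketch might look like the following. The function name and column names are hypothetical and are not taken from the demo repository.

```python
def add_lag_features(values, lags):
    """Return one row per sample holding the current value and earlier values at each lag."""
    rows = []
    for i, v in enumerate(values):
        row = {"value": v}
        for lag in lags:
            # No earlier sample exists for the first few rows, so leave those as None.
            row[f"lag_{lag}"] = values[i - lag] if i >= lag else None
        rows.append(row)
    return rows


# Four hypothetical sensor samples with one-step and two-step lags.
rows = add_lag_features([10, 11, 13, 16], lags=[1, 2])
for row in rows:
    print(row)
```

Each row then carries a short history of the signal, which lets a model see a trend rather than a single instantaneous reading.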

David: Okay.

David: Just one last question that came in. How is predictive maintenance different from preventative maintenance?

Ian Downard: Predictive maintenance- I mean, preventative maintenance is time based. It's usually your ... it's like your oil change with your car. You schedule it every certain number of days, or certain number of operating hours. The problem with that is that it tends to disregard the actual condition of the machine. Maybe you never drive your car. Do you still need to do oil changes every five thousand miles? Or six thousand, whatever it is.

Ian Downard: It's kind of wasteful labor. Artificially limits the life of what you're replacing with the maintenance. It's also based on essentially averages. There are always going to be outliers. If you have some policy to replace a machine every two months, there's still going to be cases where you have catastrophic failures within two months.

Ian Downard: It's good. It lessens ... it's better than running the equipment to failure, but you're still going to have failures. It's better to do pattern recognition to predict failures. Maybe doing both is also the right solution for you.
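The difference between the two approaches can be sketched as two decision rules. This is only an illustration: the operating hours, vibration readings, and threshold are hypothetical, and the fixed threshold in `predictive_due` is a simple stand-in for a trained pattern-recognition model.

```python
def preventive_due(operating_hours, interval_hours):
    """Time-based rule: service whenever the scheduled interval has elapsed."""
    return operating_hours >= interval_hours


def predictive_due(vibration_history, threshold, window=3):
    """Condition-based rule: service when each of the last `window` readings exceeds a threshold."""
    recent = vibration_history[-window:]
    return len(recent) == window and all(v > threshold for v in recent)


# A healthy machine that has simply reached its scheduled interval.
healthy = {"hours": 600, "vibration": [0.2, 0.2, 0.3, 0.2]}
# A degrading machine that is still well inside its scheduled interval.
degrading = {"hours": 100, "vibration": [0.3, 1.2, 1.4, 1.6]}

for name, m in [("healthy", healthy), ("degrading", degrading)]:
    print(
        name,
        preventive_due(m["hours"], interval_hours=500),
        predictive_due(m["vibration"], threshold=1.0),
    )
```

The time-based rule services the healthy machine unnecessarily and misses the degrading one, while the condition-based rule does the opposite. Running both rules together corresponds to the "doing both" option mentioned above.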

David: A couple more questions that came in.

David: Arun, this one is for you. How do the solutions of Opto 22 address scalability, both on the software side and the hardware side?

Arun Sinha: That's a great question, Andrea.

Arun Sinha: The EPIC ... groov EPIC platform is scalable. The way that works on the hardware side is, you saw those input/output modules. That's where sensors were wired in, and we go out to actuators and things like that. Those have densities from four up to 24 points. You can get a lot of sensors in and out.

Arun Sinha: Let's say you fill up all the modules. There are 16 available. The way you scale up from there is you can have another rack acting as a remote. You can keep adding on. It could be a few points of IO, all the way up to ... you know, we have full plants with thousands of IO.

Arun Sinha: Same on the software side. There's no tag limit, so every IO point is a tag, and it might be associated with one or multiple variables. There's no limit to that. That's how it's scalable from both the hardware and the software side.

David: Great. Thanks, Arun.

David: One last question that came in, and I think this one is for you, Ian. Can I visualize in Spotfire?

Ian Downard: Yes, you can. I have not used Spotfire, but I know it's an analytical tool. It provides some graphic capabilities, and I know it provides a way to connect to data sources via ODBC. I just happen to have an example of an ODBC driver here that I am using inside of a Zeppelin notebook.

Ian Downard: You can install these ODBC drivers to connect a lot of different tools to data that sits on MapR. In this case, I am using the ODBC driver to connect to MapR Database to do some data science inside Zeppelin.

Ian Downard: So, yeah, you can use ODBC and Spotfire will work with MapR.

David: Great. Thank you.

David: That is all of the time we have today. Thank you Arun, and Ian. Thank you everyone for joining us. For more information on this topic and others, please visit We will be sending out a follow up email with links to the resources that Ian mentioned throughout, as well as a link to the recording and the slides, and the upcoming demo that Ian mentioned.

David: Thank you again, and have a great rest of the day.