Data Pipelines for Factory IoT - A Practitioner's Guide (Part 2)


Ian Downard

Senior Technical Marketing Engineer, MapR

This webinar will help you gain a better understanding of the theory and concepts associated with developing intelligent applications for predictive maintenance. This is the second in a two part series on predictive maintenance. The first part focused on business processes and data pipelines, whereas this part goes deep into the technical aspects of data streaming, feature engineering, and machine learning.

In This Webinar you will learn:

  • Architecting for Fast Data. Anomaly detection on audio and vibration signals requires high speed sampling. We'll talk about how to configure data streams and consumers to process that data in full resolution, without aggregation, so anomaly detection can be more effective.
  • Feature Engineering on Big Data with MapR Database & Apache Spark. The metrics that correlate to important events, such as failures, often come from multiple places and may only be apparent after some kind of data transformation. We'll explain how Spark can be used to derive features by joining datasets, transforming raw data, and retroactively labeling lagging features such as “Remaining Useful Life”.
  • SQL interfaces for Data Science. The most well-known language for querying data is SQL. We'll demonstrate how data scientists can execute SQL queries through Apache Drill to load raw data and derived features from numerous sources into data science toolkits.
  • Neural Nets for Predictive Maintenance. We'll survey neural networks commonly used for predictive maintenance, such as linear regression and logistic regression, and demonstrate them using Keras.


Ian Downard: The topic of today's webinar is the theory and practice of feature engineering for industrial IoT. My name is Ian Downard. I'm a Senior Technical Evangelist at MapR, where I focus on developer enablement for AI and advanced analytics. In the past I've worked for the Navy and Rockwell Automation. I'm familiar with a lot of the challenges associated with IoT and dealing with sensor devices that are remote or at the edge.

Ian Downard: The agenda for this webinar is we'll first talk about architecting for predictive maintenance. Predictive maintenance is a particular kind of application that is frequently encountered in industrial IoT, so we'll be focusing on that. And then we'll look at some code and some interfaces for feature engineering on big data with Spark and MapR Database, as well as feature engineering on fast data, streaming data that is, with Spark and MapR Streams (now called MapR Event Store). And then we'll finally look at data exploration with Apache Drill, using standard SQL to access data in databases or files from data science tools.

Ian Downard: This is the second in a two-part series on predictive maintenance. We did the first part, the first webinar, a couple weeks ago. If you'd like to see that, it's available on demand at the URL shown here. The first part focused on data pipelines and really justified our motivation for looking at predictive maintenance. It set the stage by giving information about how IoT and machine learning have evolved and are creating incredible opportunities for manufacturers to cut costs and achieve competitive advantages. In this webinar, we'll be focusing more on code and the interfaces in Kafka and Spark for data streaming, feature engineering, and machine learning.

Ian Downard: As a quick background though, for those of you that aren't familiar with predictive maintenance, we'll just spend a few slides talking about that. Maintenance generally falls into one of these three different categories. Reactive maintenance is just the pattern of running equipment until it fails and then you replace it. Generally that's not acceptable because it causes disruption. The next best maintenance mode is preventative maintenance, where you schedule maintenance intervals. Replacing the oil in your car every 5,000 miles is an example of preventative maintenance. That's good, but it's kind of not good because it disregards the actual condition of the equipment. It's also a waste of time because you may be changing the oil when you don't need to be, so it's labor intensive. And it artificially limits component life. Furthermore, catastrophic failures can still occur. Even if you're really conservative with these scheduled intervals and replacing equipment frequently, you can still have catastrophic outages.

Ian Downard: People try to do better than that. They do predictive maintenance by looking for evidence of degradation, trying to measure the mean time between failures. Predictive maintenance has been going on for many, many years by just looking for evidence of degradation and measuring the mean time between failures and doing maintenance based on that. But now what we're seeing with applications of AI and advanced analytics is that you don't necessarily have to wait to see evidence of degradation. You can predict it. The steps for predictive maintenance start with collecting data wherever possible. That data can come from sensors or cameras, like infrared cameras that look for heat in equipment. But also, it can come from other places that you may think are [inaudible 00:04:45] to what's happening in the factory, such as operator logs or even weather. All of this data can contain attributes that correlate to failures that you're trying to predict.

Ian Downard: It's important to collect all that data wherever possible and store that data for a really long time. The reason you have to store it for a long time is because in order to look for the patterns of failures, we have to wait for failures to occur. Failures are generally not common, so we have to store enough data to capture the patterns associated with failures, associated with many, many, different failures, so that we can really capture that signature using tools that use math to spot those trends. They're very subtle. Much more subtle than what you can see with your own eye. Once you have these patterns, you can incorporate them into software, such as failure agents or anomaly agents, which you can deploy to production in hopes that they automatically detect these patterns and essentially help you predict when failures or anomalies are about to occur.

Ian Downard: AI is software that's really good at detecting patterns in data. It's especially useful for finding subtle patterns in really large datasets. As you're storing sensor data for many, many weeks or months, it can be very difficult to analyze that with traditional tools. But AI really performs better the more data you give it. The screenshot shown here is an example of what you may use AI for. In manufacturing, we see lots of time series data. In IoT in general, you see time series data frequently. One thing you could use AI for is to predict the next small window of time to see what future values are likely to be. When you train models, the performance of these models is measured in terms of how accurate they are. In the window shown on the right here, the red line shows the values that are predicted by our AI algorithm. And depending on the application, this may or may not be accurate enough, but the error between the actual and the forecast values is essentially how we're gauging the effectiveness of our AI model.

Ian Downard: The process for incorporating AI in factories generally looks like this. You start off by instrumenting machinery and then you are collecting data and storing that data. You store it by moving it from the edge device to a platform that can be used to save it. That process is done by data pipelines. Once you have data in a place where it can be analyzed, you monitor the data. This is what manufacturers have done for years, monitoring their factory with dashboards that show them charts and highlight how machinery is operating. To actually incorporate AI, you then need to do things like feature engineering. You have to clean the data. You have to run AI experiments. Basically a process of trial and error to find viable patterns that actually could be deployed to production. Once you have these patterns and failure agents or anomaly agents, you deploy them to production. And that's what I call applied machine learning.

Ian Downard: Regardless of what you're planning to do, the IoT data has to persist somewhere. That's the role of MapR. You have to persist that data in order to analyze it. In order to do that, you have to have data pipelines that pull from PLCs or access data through gateways like MQTT gateways or RESTful gateways. And these pipelines can be client processes that run on a MapR node, access those interfaces, and save the data on the node. Or, instead of writing these processes yourself, you can also use data pipeline tools such as StreamSets. StreamSets is a really nice utility for creating pipelines.

Ian Downard: It really saves you a lot of the trouble with understanding these coding interfaces. For example, you can drag and drop in an object or a pipeline stage for pulling data off an MQTT data source or a RESTful data source. And then if you wanted to persist that to a NoSQL database, they have other objects you can drag in here and just connect the data source to the destination, and it will save without having to write any code. And it also allows you to start and stop these pipelines through a RESTful API and monitor them through this GUI. If you'd like to learn more about StreamSets, check out the first part of this webinar. We went into detail about StreamSets and showed it in action. Once we have data on a platform where we can analyze it, there are a lot of different libraries and IDEs and data science notebooks that you can use to analyze that data.

Ian Downard: We'll go through a process of cleaning that data and analyzing it, engineering features, and then ultimately training models and measuring their effectiveness. How accurately are they predicting values?

Ian Downard: The process involved with developing models is very much an iterative process. Data cleansing and feature engineering require lots of trial and error, so any friction associated with accessing the data, or moving the data, or defining schemas, or using proprietary query languages, these are all very bad. They kill the productivity of a data scientist. And these are all pain points that MapR addresses. MapR minimizes data movement, provides industry-standard APIs for accessing and saving data, and lets you run any analytical toolkit that you want on top of MapR.

Ian Downard: Furthermore, the results that you get out of these models can be saved on MapR and made available to production applications, all on one platform. So MapR is really a general-purpose data platform that you can use for AI, advanced analytics, and even for exposing data to production applications.

Ian Downard: Deploying machine learning code to production often starts like this: you have an inference request. This is basically the question, such as "What's the next data point in my time series data set?" And the model will tell you that, and that is an inference. An example of something that you might see for predictive maintenance is the remaining useful life for a piece of machinery. Remaining useful life is going to be a number that's counting down to zero. Maybe it's measuring seconds remaining. Maybe it's measuring days. Either way, you're going to be asking the model "How much time do you think is left in the viability of this piece of equipment?" And note, this pipeline is very much a sequential pipeline. You'll note we have streams on either side of the model. The nice thing about streams is that they ensure these requests and the responses are saved. On MapR, they're also going to be replicated, so they're highly available. There's no single point of failure, and if you wanted to change your model and replay all of the requests that were performed, you can do that very easily, because all the requests in streams can be accessed by offset.
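The replay-by-offset idea can be sketched with a toy in-memory log. This is only an illustration in plain Python, not the Kafka or MapR client API; the `Stream` class and both model functions are hypothetical stand-ins.

```python
class Stream:
    """Toy append-only log standing in for one stream partition (hypothetical)."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Store a record and return the offset assigned to it."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset):
        """Return every record at or after the given offset, in order."""
        return self._records[offset:]


def model_v1(x):
    # Hypothetical first version of an inference function.
    return x * 2


def model_v2(x):
    # Hypothetical updated model that we want to evaluate on old requests.
    return x * 2 + 1


requests = Stream()
for value in [3, 5, 8]:
    requests.append(value)

# Because requests are retained, a new model can replay them from offset 0.
responses_v1 = [model_v1(r) for r in requests.read_from(0)]
responses_v2 = [model_v2(r) for r in requests.read_from(0)]
```

Keeping the requests in the log, rather than only the responses, is what makes it possible to rerun history against a changed model.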

Ian Downard: And that's the same approach that you would see with Kafka. In fact, Kafka is the interface for MapR Streams (now called MapR Event Store). The Kafka API is the interface.

Ian Downard: One more slide here about machine learning in production. In reality, production applications, or machine learning based production applications aren't as simple as what that previous slide showed because models tend to have a shorter lifespan. They can go stale. They're often updated based on the priorities of the people that are using them. And you may train them with data that eventually becomes old. When you're training models, you're training them on real world data, and as the real world changes, the models go stale.

Ian Downard: We see these ... the life cycle of these models is generally much shorter than traditional enterprise software, so it's not uncommon at all to have multiple versions of models in production, and it's necessary to be able to compare the results that come out of these.

Ian Downard: There's this pattern known as the rendezvous architecture, where the results of models go into a rendezvous service. It takes the differences between the models and puts them into another stream, as well as saving the inferences into a stream where consumers can access them, maybe for an anomaly agent or a failure agent kind of production application.

Ian Downard: The important thing to realize here is how central streams are. They're very much pertinent to machine learning in production. And the other important aspect is to notice how there's this feedback here. Whenever we're putting machine learning code into production, the life cycle is still going to have us going back to analyze data and update our models accordingly. If you'd like to learn more about the rendezvous architecture, check out this book at this URL:

Ian Downard: And now I'm going to go into more detail about the theory and practice of feature engineering for industrial IoT. First, what is feature engineering? Features are predictors. When we're instrumenting a factory, we're getting lots of data, and when we're trying to detect patterns and failures, or predict patterns and failures, we're going to be looking at certain attributes of that data that predict, or correlate to, those failures.

Ian Downard: Whenever there's a correlation between an attribute and a failure, that's called a feature. These are the important attributes that we want to feed into AI algorithms. Our training data has to contain enough relevant features and hopefully not too many irrelevant ones, because if you do ... if you're out of balance here and you're trying to train with data that doesn't correlate to failures, then you get this garbage in garbage out. The actual groups ... the forecasted values are not going to align with what the actual values are going to be.

Ian Downard: Features can be selected from raw sensor data, but they can also be combined to produce more useful features. For example, maybe by adding temperature values with pressure values, you may come up with ... the result there may be a value that correlates more strongly to the failure that's happening.

Ian Downard: This is an example of what IoT data may look like. You're just dealing with a table of numbers, and this data is coming from very important machinery. This is my attempt at a joke. I've got a vending machine in a factory that's providing us a variety of attributes. These numbers are between zero and one, so I sort of skipped a step, because a lot of times the numbers coming in aren't scaled like that. Maybe it's counting stock remaining inside. Maybe it's measuring the temperature of coffee that's being created, or pressure inside compressors. Whatever the numbers are, you're going to end up with a table like this and, I'm jumping ahead, but one of the steps with AI is you try to get these values between zero and one.
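One common way to get values between zero and one is min-max scaling. The talk does not say which scaling method was used, so treat the following as a minimal sketch of one option; the `coffee_temp_c` readings are made up for illustration.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no signal; map it to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical raw readings from the vending machine, in degrees Celsius.
coffee_temp_c = [60.0, 75.0, 90.0]
scaled = min_max_scale(coffee_temp_c)
```

In practice the minimum and maximum would be taken from the training data and reused at inference time, so that new samples are scaled the same way.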

Ian Downard: Anyway, we're also dealing with time series data, so we're going to have timestamps. We're also going to have device IDs, probably. For feature engineering, we want to again find whatever attributes of data are going to correlate to these failures. When we're saving this data, we need to save it into a table that has a flexible schema. We may discover in the future that as we add the operator, or as we add weather, our feature table contains a much stronger signal and it correlates to those failures.

Ian Downard: It's important to be able to use a database that has flexibility in schemas, which is essentially another way of saying you need a NoSQL database. Other features may be created or derived. As I mentioned earlier, you can combine them. You combine maybe the X and the Y by some kind of mathematical operator to create a new feature. But you can also maybe tap into time libraries.

Ian Downard: For example, to determine weekend, it's very easy to do that with Java time libraries. And the reason you would want to do that is because as we're analyzing this data with data science tools, we may be using SQL to access the data. And to write something like a SQL function to calculate weekend could be pretty difficult. It's much easier to add these attributes into our feature store while we're processing in a Java environment, or Python environment, or wherever you're deriving features.
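As a rough illustration of deriving a weekend flag at ingest time, here is a sketch using Python's standard `datetime` module rather than the Java or Scala libraries used in the tutorial. The record layout and the `_weekend` field name are assumptions, and the timestamp is interpreted in UTC, which a real factory deployment may not want.

```python
from datetime import datetime, timezone


def is_weekend(unix_seconds):
    """Return True if the Unix timestamp falls on a Saturday or Sunday (UTC)."""
    day = datetime.fromtimestamp(unix_seconds, tz=timezone.utc).weekday()
    return day >= 5  # Monday is 0, so Saturday is 5 and Sunday is 6


def add_weekend_feature(record):
    """Return a copy of the record with a derived '_weekend' attribute."""
    enriched = dict(record)
    enriched["_weekend"] = is_weekend(record["timestamp"])
    return enriched


# Hypothetical sample: 2019-01-05 00:00:00 UTC was a Saturday.
sample = add_weekend_feature({"timestamp": 1546646400, "OutsideAirTemp": 0.42})
```

Once the flag is stored in the feature table, a later SQL query can filter on it directly instead of computing the day of week itself.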

Ian Downard: And also, we can add other attributes such as subsystem. This is another way of simplifying the SQL logic that we may be using later in data science tools.

Ian Downard: Another aspect of the database that's nice to have is a secondary index. The reason that's nice to have is because whenever we're doing queries on this large table that may contain millions of rows in its time series data set, we may want to filter on subsystem, for example, show me all of the time records associated with fuel supply or boiler. To do that on such a large table without a full table scan, a secondary index is really necessary, because full table scans can take a long time and, again, anything that adds friction to that data exploration process is going to hurt the productivity of data scientists.
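The benefit of a secondary index can be sketched in a few lines of plain Python. This is a conceptual toy, not how MapR Database implements indexes; the rows and field names are invented.

```python
from collections import defaultdict

# Hypothetical feature table rows.
rows = [
    {"timestamp": 1, "subsystem": "boiler", "value": 0.61},
    {"timestamp": 2, "subsystem": "fuel supply", "value": 0.12},
    {"timestamp": 3, "subsystem": "boiler", "value": 0.64},
    {"timestamp": 4, "subsystem": "chiller", "value": 0.33},
]


def build_index(table, field):
    """Map each distinct value of `field` to the positions of the rows that hold it."""
    index = defaultdict(list)
    for position, row in enumerate(table):
        index[row[field]].append(position)
    return index


# Built once, then reused by every query that filters on subsystem.
by_subsystem = build_index(rows, "subsystem")

# The lookup touches only the matching rows instead of scanning the whole table.
boiler_rows = [rows[i] for i in by_subsystem["boiler"]]
```

The trade-off is that the index has to be maintained on every write, which is usually worth it for columns that exploratory queries filter on often.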

Ian Downard: Another important kind of feature is the lagging feature. I mentioned earlier this concept of remaining useful life, and that is an example of a lagging feature. With lagging features, their values can only be calculated once a future event occurs, which is why they're called lagging. Say our vending machine fails. Once that failure happens, we retroactively go back through our feature table and we assign values to all of these lagging features. For remaining useful life, it's going to be a number that counts down to zero until the failure occurs.

Ian Downard: And then, there's this other attribute called maybe about to fail, or 30 seconds to failure. Maybe 30 seconds is the window in which we're going to consider failure imminent. We would go back 30 seconds based on this timestamp and label all of these true. This is an integer value, and this would be a Boolean value. Both of these are lagging variables.

Ian Downard: When you're going back retroactively, you would expect that to be a very expensive operation. It's necessary to be able to use a distributed database for that feature store, because we're sampling data from sensors that may be providing 10 or more samples per second, for weeks or months or even years until failures happen, and then once a failure happens, we have to retroactively go back and update all of that data. That's one heck of an update.

Ian Downard: The databases that are used are often distributed just to handle the sheer storage capacities. The feature store may exceed the capacity of any one node, so you've got to distribute it across nodes, and that distribution also makes it a much more robust database. Robust to failure.

Ian Downard: MapR Database is an example of a distributed database, and Spark is an example of an execution engine that can also operate in a distributed fashion. You don't want to have to move data off of the database to be able to operate on it, and Spark is nice that way because you can operate on these massive data sets in Spark without moving any of the data, because data movement would really take a long time with such a large table.

Ian Downard: For Spark to do that, it has to have some kind of connector to the database that it's using. And there are a lot of different connectors available for Spark and MapR Database. There's one that is the MapR Database connector for Spark, and this is great because those Spark data connectors allow you to update those feature tables without massive ETL.

Ian Downard: I'm going to go into a couple examples of code now. I love looking at code in presentations, so if I go too fast, go to this GitHub site. All the code that I'm showing is part of the tutorial that's on that GitHub repository. In this tutorial, we're ingesting sensor data and failure events using the Kafka API, we're going to store the data in a feature table in MapR Database, and we're going to derive new features and then save those into a NoSQL database, that being MapR Database.

Ian Downard: At the end of the tutorial, we're analyzing that data using SQL analytics. We're using Drill, or we're using ODBC to connect data science tools such as Tableau or any kind of Python environment to access the data with standard SQL.

Ian Downard: The first step here in the pipeline is to subscribe to the data stream. And this is what you will see in a Spark process. Our Spark process is a Java process. I'm sorry, I meant to say Scala. This is an example of how you would subscribe to a stream with Scala. The first step is that you define a case class that complies with the JSON records that you're consuming. I'm calling this MQTT records. It includes a timestamp and then all these other metrics that are part of the JSON. And there are many, many more. After you define that case class, this is how you subscribe to a stream using the Kafka API.

Ian Downard: And then once you've subscribed to it, use this foreachRDD function to read from the stream. And as you're reading, all the different records that you're reading off the stream get loaded into a variable that's a Dataset. Basically, it's a table, and inside the table you can filter, you can do operations on that table, and so on and so forth.
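The tutorial code is Scala running in Spark, so the following is only a plain-Python analog of the same two steps: declare a typed record that matches the JSON, then turn each batch of raw messages into a table-like list you can filter. The `MqttRecord` class, its three fields, and `process_batch` are hypothetical; the real records carry many more metrics.

```python
import json
from dataclasses import dataclass


@dataclass
class MqttRecord:
    """Typed view of one JSON message (only three illustrative fields)."""
    timestamp: float
    OutsideAirTemp: float
    BoilerReturnTemp: float


def parse_record(line):
    """Parse one JSON message into a typed record."""
    fields = json.loads(line)
    return MqttRecord(
        timestamp=float(fields["timestamp"]),
        OutsideAirTemp=float(fields["OutsideAirTemp"]),
        BoilerReturnTemp=float(fields["BoilerReturnTemp"]),
    )


def process_batch(lines):
    """Stand-in for the per-batch callback: parse every non-empty message."""
    return [parse_record(line) for line in lines if line.strip()]


# Two hypothetical messages as they might arrive on the stream.
batch = [
    '{"timestamp": 100.0, "OutsideAirTemp": 41.5, "BoilerReturnTemp": 120.0}',
    '{"timestamp": 101.0, "OutsideAirTemp": 41.7, "BoilerReturnTemp": 119.5}',
]
records = process_batch(batch)

# With typed records, filtering reads like a query on a table.
warm = [r for r in records if r.OutsideAirTemp > 41.6]
```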

Ian Downard: And I'll show this in action in just a few slides. Actually, I'm going to switch over and show you on GitHub what this looks like. Our MQTT consumer starts off with the fine [inaudible 00:28:07] class, and I have 150 different metrics that are provided by the MQTT data source that I've got data for. This data source describes an HVAC system, so I've got metrics for outside air temperature, boilers and chillers, fans, and power utilization. And then the next step, which I think is optional, is defining a schema so that we know what data types are going to be used in the Dataset that's used for saving the data.

Ian Downard: Here are the functions for subscribing to the stream, and then when we do foreachRDD, this is where we're going to actually save the streaming records into a variable, and then we can operate on that. This is where we can actually add derived features. And in my example, anything with an underscore is a derived feature.

Ian Downard: Here's how we're creating the lagging variables. To that original data set that has 150 metrics, we're going to add some new columns. And these columns are going to be initialized as shown here, so for every sample of data we get, we're going to add an about-to-fail Boolean lagging variable, initialized to false, and we're also going to add remaining useful life, initialized to zero.

Ian Downard: And then later, when we access those attributes, if remaining useful life is zero and about to fail is false, we'll know that it hasn't been updated yet. This was an example of how you would derive a feature called weekend. I'm using the Unix time library for doing that. And then when I want to save this to a feature store, we're going to use the MapR Database connector for Spark, and this is what that looks like.
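The initialization described here can be sketched in plain Python as follows. The column names `_about_to_fail` and `_remaining_useful_life` are assumptions that follow the underscore convention mentioned in the talk; the tutorial's actual names may differ.

```python
def add_lagging_defaults(record):
    """Return a copy of the record with lagging features set to placeholder values."""
    enriched = dict(record)
    enriched["_about_to_fail"] = False        # Boolean lagging feature
    enriched["_remaining_useful_life"] = 0    # continuous lagging feature
    return enriched


def is_unlabeled(record):
    """True if both lagging features still hold their placeholder values."""
    return record["_remaining_useful_life"] == 0 and not record["_about_to_fail"]


# Hypothetical incoming sample.
fresh = add_lagging_defaults({"timestamp": 100, "OutsideAirTemp": 0.42})
```

Using zero and false as placeholders means a record that genuinely fails at that exact instant would look unlabeled, which is why both fields are checked together.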

Ian Downard: Save to MapR Database with a table name. The index is going to be the timestamp, and we'll create the table prior to doing this, so I don't need to do that. Now, let's see this in action. I have some scripts already set up here. I have inside the [inaudible 00:30:59] refill, we've provided a sample data set that's MQTT.JSON and if I look at the first row of that, you'll see JSON records of 150 different metrics.

Ian Downard: I'm going to basically use the Kafka console producer. I'm just going to cat the JSON and add a timestamp according to what the time is now. I'm going to add that timestamp, and I'm going to publish to this stream, and MQTT is the topic. With MapR Streams there's this notion of not only the stream, but also the topic. It makes it easier to organize data and publish data faster, which we'll talk about. And then that's how that works.

Ian Downard: This is the output of the code that we just talked through. Look at this Spark submit. We're calling the MQTT consumer, and we're consuming from the same stream that we just published to, and we're going to persist to this table. As that is running, you'll see it's consuming roughly one message per second, and it's telling us the current size of the table and just giving us a couple examples of metrics it's consumed.

Ian Downard: And finally, as that's saving to MapR Database, we have another process here that's going to consume off that stream. Again, apps factory MQTT, and it's going to write all those JSON records using curl. It's going to write them to OpenTSDB, and this is the RESTful API for OpenTSDB. OpenTSDB is expecting the timestamp in a certain format, so we do a little bit of Bash scripting here to put it in that format. And the reason we're using OpenTSDB is because we want to be able to visualize the data in Grafana, which looks like this.
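The demo does this reshaping with Bash and curl. As a hedged sketch of the same idea in Python, the function below turns one sensor record into the list of data points that OpenTSDB's HTTP `/api/put` endpoint accepts, as I understand its format: each point has `metric`, `timestamp` (integer Unix epoch), `value`, and at least one tag. The `device` tag and the sample field names are assumptions, and nothing is actually sent over the network here.

```python
import json


def to_opentsdb_points(record, device):
    """Convert one flat sensor record into a list of OpenTSDB-style data points."""
    # Assumption: the record's timestamp is in Unix seconds; truncate to an integer.
    ts = int(record["timestamp"])
    return [
        {"metric": name, "timestamp": ts, "value": value, "tags": {"device": device}}
        for name, value in record.items()
        if name != "timestamp"
    ]


# Hypothetical record consumed from the stream.
record = {"timestamp": 1546646400.75, "OutsideAirTemp": 41.5, "BoilerReturnTemp": 120.0}
points = to_opentsdb_points(record, device="hvac-1")

# This JSON string is what would go in the body of the HTTP POST.
payload = json.dumps(points)
```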

Ian Downard: And I just stopped it, so see, there's no data currently. If I refresh it, it'll probably come back. There it is. Grafana is a great dashboard. You can see not only metric values, but we can also see alerts. In this example, I'm showing alerts from a backdated stream that I'm detecting anomalies in, so it's an anomaly agent. I also have a failure agent that's adding alerts into Grafana using a RESTful API.

Ian Downard: These vertical bars are associated with those alerts. And this is how predictive maintenance has been done for years and years: you just have a dashboard that you use for monitoring your factory. To go through the process of updating lagging variables, we have another example to show.

Ian Downard: For updating these lagging variables, we first have to receive a failure notification. We're going to subscribe to a new stream. This is another Java process, subscribing to a stream for failure notifications. And then when we receive these notifications, we're going to load the feature store from MapR Database, and even if that store is gigantic, we're not actually moving any data here. We're just getting essentially a handle on the data, or the data frame.

Ian Downard: And then we're going to retroactively go back. We have to calculate a failure window, so we're going to calculate that 30-second window. And then we're going to update the binary lagging feature like this, so we'll go back to find all those records within the last 30 seconds. And then we're going to pull up the about-to-fail derived feature and we're going to set it to true. And then for updating the continuous lagging feature, remaining useful life, we're filtering that feature store by timestamp for those records older than 30 seconds, because 30 seconds is our window that we're calling about to fail. So for the records older than 30 seconds, we're going to label them as not about to fail, but there's a remaining useful life that needs to be documented.

Ian Downard: This is an example of how we do that, and then we union the data frame that has the continuous lagging feature with the one that has the binary lagging feature, and then we're going to write those back into the MapR Database feature store. And this is what that looks like, so it's receiving failure events on this stream. And then it's going to load the feature table, which in this example is 18,884 records long. And then we're going to update 30 seconds of about to fail to true, and then for all of the prior records going back to the prior failure, we're going to update the remaining useful life.
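The retroactive update walked through above (filter the 30-second window, filter the older records, label each, then union) can be sketched on plain Python lists. This is not the tutorial's Spark code. The column names are assumptions, remaining useful life is measured in seconds, and for simplicity the sketch labels every record older than the window rather than stopping at the prior failure.

```python
def label_lagging_features(rows, failure_time, window=30):
    """Return relabeled copies of the rows at or before `failure_time`.

    Rows after the failure are left out of the update.
    """
    window_start = failure_time - window

    # Binary lagging feature: everything inside the failure window is "about to fail".
    imminent = [
        dict(r, _about_to_fail=True)
        for r in rows
        if window_start <= r["timestamp"] <= failure_time
    ]

    # Continuous lagging feature: older rows get the time left until the failure.
    earlier = [
        dict(r, _about_to_fail=False,
             _remaining_useful_life=failure_time - r["timestamp"])
        for r in rows
        if r["timestamp"] < window_start
    ]

    # Union of the two labeled sets, ready to be written back to the feature store.
    return earlier + imminent


# Hypothetical feature table with placeholder lagging values, and a failure at t=200.
table = [
    {"timestamp": t, "_about_to_fail": False, "_remaining_useful_life": 0}
    for t in (100, 160, 185, 200, 230)
]
updated = label_lagging_features(table, failure_time=200)
```

Because the function returns copies, the original rows are untouched until the caller decides to write the update back.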

Ian Downard: And here's a small excerpt of the feature table and those records that got updated. You can see going back 30 seconds everything was true, and then prior to that the remaining useful life was documented. And then finally we send an annotation to the Grafana dashboard using their API.

Ian Downard: And what I'm starting here is the Spark job that's going to update these lagging features. Now it's waiting for failures on that stream. I'll just pull that out here, and then this window ... all I'm doing is sending JSON to a stream that looks like this. Sending a timestamp and the name of the device that failed. And after I send that, we will see the failure notification get picked up and lagging variables will be updated.

Ian Downard: This is the failure that I just sent. The feature store is currently about 17,000 records. Based on the timestamps, it just updated seven records within the prior 30 seconds, and it's trying to update 3,298 records for remaining useful life. When that's complete, we should see an excerpt of the feature table. Okay, this is what we just updated.

Ian Downard: Now, we'll go into talking about feature engineering for fast data. The reason this is important is because continuous time signals are common with sensors. Maybe in measuring vibration or audio. And it's really important to capture the full resolution of this data, because when you start averaging data just to ... a lot of times people will average or skip data samples to be able to keep up with fast data, but when you do that, you really hurt your ability to capture those signals that are going to predict failures. And really, high fidelity makes AI more effective.

Ian Downard: Like I mentioned earlier, the more data you give AI, the better it performs. However, there are really two challenges that can happen. You can have congestion, because if there's too much data and it's too fast, the pipeline just can't keep up. You can also have bottlenecks due to expensive transformations. In your pipeline, you may be transforming the data from, for example, the time domain to the frequency domain, or doing other things that require computations that slow down the pipelines.

Ian Downard: An example of fast data for industrial IoT is vibrations. This is really important because vibrations often give the first clue that a machine is starting to fail. Vibration sensors measure physical displacement. What you're getting from the sensor is a measurement of distance. A very small distance, but it's distance. And as this distance is changing, you can measure a frequency through a transformation called the fast Fourier transform.

Ian Downard: For example, if you have a vibration that's a 10 kilohertz vibration, in order to measure that, you need 20,000 samples per second. If you've ever taken a signals course, this is called the Nyquist rate. It's the minimum number of samples per second that you need to capture that frequency. That's a lot. That's 20,000 samples per second. That's pretty fast data.
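The arithmetic behind that figure can be written out in a few lines of Python. The function names are made up for illustration. The second function shows what goes wrong when sampling is too slow: the vibration shows up at a lower, false frequency (aliasing). In practice one usually samples somewhat above twice the highest frequency of interest rather than exactly at it.

```python
def nyquist_rate(max_frequency_hz):
    """Minimum sampling rate needed to capture a given frequency."""
    return 2 * max_frequency_hz


def apparent_frequency(signal_hz, sample_rate_hz):
    """Frequency a sampled sinusoid appears to have after sampling.

    Frequencies above half the sampling rate fold back into the
    range 0 .. sample_rate_hz / 2.
    """
    folded = signal_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)


required_rate = nyquist_rate(10_000)                    # 10 kHz vibration
undersampled = apparent_frequency(10_000, 12_000)       # sampled too slowly
adequately_sampled = apparent_frequency(10_000, 25_000) # sampled fast enough
```

With a 12 kHz sampling rate the 10 kHz vibration would be indistinguishable from a 2 kHz one, which is one reason skipping samples to keep up with fast data can hide the signal that matters.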

Ian Downard: What we're trying to do with this for the purposes of predictive maintenance, is measure what are the frequencies that we see throughout the day and we're going to hope that AI has this ability to see these patterns and detect long term patterns in the frequency domain.

Ian Downard: There's an example in the GitHub repository that shows how to take simulated time series data and apply a fast Fourier transform to it so you get frequencies that can be used for machine learning.
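As a rough idea of what such an example involves, here is a minimal sketch in pure Python. It is not the code from the repository: the signal, the sampling rate, and the function names are assumptions, and the transform is a textbook recursive radix-2 FFT rather than an optimized library routine.

```python
import cmath
import math


def fft(samples):
    """Recursive radix-2 Cooley-Tukey FFT; length must be a power of two."""
    n = len(samples)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def dominant_frequency(samples, sample_rate_hz):
    """Return the frequency (Hz) of the strongest non-DC component."""
    spectrum = fft(samples)
    half = len(samples) // 2
    magnitudes = [abs(c) for c in spectrum[1:half]]  # skip the DC bin
    peak_bin = magnitudes.index(max(magnitudes)) + 1
    return peak_bin * sample_rate_hz / len(samples)


# Simulated displacement signal: a strong 50 Hz component plus a weaker 120 Hz one.
SAMPLE_RATE_HZ = 1024
signal = [
    math.sin(2 * math.pi * 50 * t / SAMPLE_RATE_HZ)
    + 0.5 * math.sin(2 * math.pi * 120 * t / SAMPLE_RATE_HZ)
    for t in range(1024)
]

peak_hz = dominant_frequency(signal, SAMPLE_RATE_HZ)
```

With 1,024 samples at 1,024 samples per second, each frequency bin is 1 Hz wide, so both simulated components land exactly on a bin. The resulting magnitudes per frequency are the kind of derived features that could be saved for machine learning.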

Ian Downard: As I mentioned, there's two challenges to this. The fast Fourier transform that's running inside Spark could be too slow, or maybe there's just too much data coming in and it can't keep up. Well, Spark has no problem with that, because Spark can spawn these Java processes and the Spark executors can run them in parallel, and the stream records, all that data that's coming in, will be load balanced across all of the Spark stream consumers, as long as they're in the same consumer group, which in this tutorial they are. That's one way you can address that first concern.

Ian Downard: The second concern is that you have too many producers, too many sensor data sources. In that case, you can scale by having all these different data sources sending to their own topic. This is going to significantly improve the throughput through those streams. And Spark consumers can consume from multiple topics, so you can just have them all in the same consumer group, subscribed to all of the topics, and this way you're going to be able to scale in terms of not only the compute, or the fast Fourier transforms, but you'll also be able to scale according to the sheer number of data sources sending data quickly.
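To make the scaling idea concrete, the following toy simulation spreads the partitions of several topics across the members of one consumer group. It is only a picture of the outcome: in a real deployment the streams client library performs this assignment, and the round-robin strategy, topic names, and consumer names here are assumptions.

```python
from itertools import cycle


def assign_partitions(topics, partitions_per_topic, consumers):
    """Round-robin every (topic, partition) pair across a consumer group."""
    assignment = {consumer: [] for consumer in consumers}
    members = cycle(consumers)
    for topic in topics:
        for partition in range(partitions_per_topic):
            assignment[next(members)].append((topic, partition))
    return assignment


# Three hypothetical sensors, each publishing to its own topic.
topics = ["sensor-1", "sensor-2", "sensor-3"]
consumers = ["consumer-a", "consumer-b"]

assignment = assign_partitions(topics, partitions_per_topic=2, consumers=consumers)
```

Each partition is read by exactly one member of the group, so adding consumers spreads the transform work, while adding topics and partitions spreads the ingest load.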

Ian Downard: There's an example also in here which I'll just run quickly. I know we're just about out of time. That's seven ... This is sending to a new topic called fast data. This is simulating those displacement values that are coming from a vibration sensor. And then we have another Spark job that's going to apply the fast Fourier transform. It's basically an anomaly agent, and when those frequencies change beyond some threshold, we'll send an alert to Grafana.
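The thresholding step of such an anomaly agent might look roughly like the sketch below. The comparison rule (change in magnitude relative to the strongest baseline component), the threshold value, and the sample spectra are all assumptions, and the function simply returns the alerts instead of sending them anywhere.

```python
def detect_anomalies(baseline, current, threshold=0.25):
    """Compare two spectra given as {frequency_hz: magnitude} dictionaries.

    A frequency is flagged when its magnitude changed by more than
    `threshold` times the largest baseline magnitude.
    """
    scale = max(baseline.values())
    alerts = []
    for freq in sorted(set(baseline) | set(current)):
        before = baseline.get(freq, 0.0)
        after = current.get(freq, 0.0)
        if abs(after - before) > threshold * scale:
            alerts.append({"frequency_hz": freq, "before": before, "after": after})
    return alerts


# Hypothetical spectra: a new 310 Hz component appears in the latest window.
baseline_spectrum = {50: 1.0, 120: 0.5}
current_spectrum = {50: 0.98, 120: 0.52, 310: 0.8}

alerts = detect_anomalies(baseline_spectrum, current_spectrum)
```

Small drifts in the existing 50 Hz and 120 Hz components stay under the threshold, while the new component is reported.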

Ian Downard: I will show you where that is here. Here's the example of what we just ran, consuming from the stream and applying ... here's the operations that it's performing. For every record that's on the stream it's doing all of it. All these attributes could be saved to your feature store.

Ian Downard: I also wanted to spend a moment to mention the use cases for Apache Drill. Drill is a distributed SQL engine. It's an open source project, primarily fostered by MapR, so it is included in the MapR platform. It supports querying data through ANSI SQL, and it also enables you to perform these queries without predefining schemas, which is nice.

Ian Downard: You can use it to join data from different formats. For example, you could do a SQL join on a field in a JSON file. You could join that with a field in a relational database. And you could also join that with a field in a NoSQL database, so there's a lot of power in being able to combine data sets this way.

Ian Downard: And also, you can use ODBC to submit SQL queries through Drill, so that you can analyze data in AI tools or business intelligence tools, BI tools.

Ian Downard: Here's an example of what that looks like in Apache Zeppelin. This is a data science notebook. It's part of the MapR Data Science Refinery. I'm using Python, in this example, to evaluate some of the MQTT data that I'm streaming. We're loading an ODBC configuration called Drill 64, which I've created on my node, and then I'm submitting a SQL query that looks like this. And this is what our 150 metrics of HVAC data look like. The actual Apache Drill web UI looks like this. You can submit queries this way as well.
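The notebook code itself is not reproduced in this transcript, but the general pattern of submitting SQL from Python and getting rows back can be sketched with the standard DB-API cursor interface. To keep the sketch runnable anywhere, an in-memory SQLite database stands in for the Drill connection; the table, columns, and values are invented. With Drill, the connection would instead come from an ODBC package (for example `pyodbc.connect(...)` pointed at the data source name configured on the node), which is an assumption about the setup rather than something shown here.

```python
import sqlite3


def query_rows(connection, sql):
    """Run a SQL query on a DB-API connection and return a list of dicts."""
    cursor = connection.cursor()
    try:
        cursor.execute(sql)
        columns = [description[0] for description in cursor.description]
        return [dict(zip(columns, row)) for row in cursor.fetchall()]
    finally:
        cursor.close()


# Stand-in data source (assumption): a tiny HVAC table in SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hvac (device TEXT, ts INTEGER, outlet_temp REAL)")
conn.executemany(
    "INSERT INTO hvac VALUES (?, ?, ?)",
    [("Chiller1", 1, 40.0), ("Chiller1", 2, 42.0), ("Chiller2", 1, 55.0)],
)

rows = query_rows(
    conn,
    "SELECT device, AVG(outlet_temp) AS avg_temp "
    "FROM hvac GROUP BY device ORDER BY device",
)
conn.close()
```

Returning plain dictionaries keeps the result easy to hand to whatever data science toolkit is used in the notebook.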

Ian Downard: And finally, I just wanted to say ... answer the question ... we've gone through a lot here. We've shown ... talked about the architecture of data pipelines for IoT. We've talked about Spark and Kafka interfaces. And all of this has been running on MapR, I haven't really talked about why we're doing that. The reason that MapR is really useful for predictive maintenance applications is because there's several challenges with this type of application.

Ian Downard: First of all, the sheer scale of IoT data is ... it's huge, so you need to be able to save that data on a platform that has linear scaling. You can't do that unless the platform is built to scale from the beginning. It's not really an add on feature. MapR is exactly that, linear scalability. Also machine learning requires full resolution of data, so we don't wanna have to move those huge data sources, that really kills the productivity of data science.

Ian Downard: And also, being able to deploy these models to production requires lots of different life cycles for ... we need to be able to access data frequently to retrain models, and we also need to be able to have the streams as a way to monitor the differences between models.

Ian Downard: Those are the roadblocks to predictive maintenance, and these are the five reasons I think MapR is useful for industrial IoT. First, it's scalable. There's no data movement. Frictionless feature engineering, so you don't have to do ETLs and data [inaudible 00:49:33]. You can version data with instant snapshots, so if you're concerned about corrupting your raw data, you don't need to be with MapR, because you can back that data up instantly with a snapshot.

Ian Downard: And you can easily save and update big feature tables with the MapR Database connector for Spark. Also, we've got streams built into our platform, so implementing something like the rendezvous architecture is a piece of cake.

Ian Downard: And finally, all those things that are challenges for data science are addressed with MapR. For example, data scientists like to have POSIX access to the files on all systems. They don't have to use cryptic Hadoop commands, they don't have to run command line interfaces that they're not accustomed to. It just looks like a regular Linux file system in MapR. Also, that ability to take snapshots to easily back up and recover data is very nice. Standard APIs for SQL, REST APIs for the files, REST APIs for the database, REST APIs for streams, they're all part of the platform, usable for all types of data: files, tables, and streams. With integrated tooling, for example Spark, Drill, and Zeppelin. And we support Docker and Kubernetes as part of our platform.

Ian Downard: If you'd like to learn more, check out part one of this webinar, where we talk about the data pipelines, we talk about StreamSets, and we talk about our justification for even looking at predictive maintenance and what it means for manufacturers. And we also have this tutorial on GitHub where you can download the code and see what I was showing you today.

Ian Downard: David, that's all I have. Are there any questions?

Speaker 2: Yeah, thanks Ian. Just a reminder, if you have any questions, please enter them now into the chatbox and we will try to address them here live before the top of the hour. Just a couple questions that have come in already, Ian.

Speaker 2: The first one is, what machine learning algorithm would you recommend for predictive maintenance?

Ian Downard: What machine learning algorithms for predictive maintenance? I think they're talking about neural network algorithms. For training models that can predict time series values, there's often two that people use. You'll see in a lot of published tutorials that people are using recurrent neural networks, RNNs, and also long short-term memory networks, LSTMs. Those are in our GitHub repository. We've basically included a survey of neural networks that are currently used for predictive maintenance, and they're shown at the bottom of the README file, along with some example data science notebooks that go into LSTM and RNN examples.

Ian Downard: Are there any more questions?

Speaker 2: Yeah. Another one, how fast can the data be ingested into the data store?

Ian Downard: Data can be ingested into MapR at millions of records per second. It depends on the hardware that you're using. It depends on the type of disks: are you using SSDs, are you using spinning disks? I mean, that's gonna be the case no matter whether you're using MapR or not. The important thing with MapR is to realize the architecture is simpler than other pub/sub services such as Kafka, for example. Kafka has dependencies on ZooKeeper. It can't handle the same number of topics, it can't handle the same throughput, just in terms of bytes per second, or records per second.

Ian Downard: If you're curious about how MapR compares with Kafka, you can search for MapR Streams versus Kafka and find a lot of posts. The first one talks about that. But at the end of the day, a lot of times it's not the performance that is the barrier to success. The barrier to success for machine learning in production is often management. It's often the effectiveness of data science, so all those advantages that I mentioned, like no data movement, the built-in support for Docker [inaudible 00:55:01], built-in support for ANSI SQL, those are saving people time. You can ingest faster by just spending more on hardware, but you can't create time. With MapR, we have these features that make management easier and reduce the frictions of data science.

Speaker 2: Okay. One more here. Are the techniques you described specific to industrial IoT, or are they applicable to more general IoT applications?

Ian Downard: We were talking about predictive maintenance in this webinar. That is pretty much unique to industrial IoT, but any other IoT application that's dealing with time series data is going to be able to use the same patterns that I discussed for data pipelines. All those aspects of the feature store, namely that you need a NoSQL feature store, you need secondary indexes, you need database connectors for Spark, those are all patterns that you're going to want to have generally for any IoT application.

Ian Downard: Unless you're trying to predict patterns that are frequent, but anyway I think that this webinar is primarily focused on industrial IoT because we're talking about predictive maintenance.

Speaker 2: Okay. And we got one last one, and then we'll close it out. Can the techniques be used in a relational database or NoSQL? [crosstalk 00:56:55] Or is NoSQL needed? Sorry.

Ian Downard: Okay. I'm sure you can use relational databases, they're just less convenient. If you want to add attributes to a feature table and you're restricted by a schema, that's much more difficult. But with NoSQL, the schemas are flexible. So if you ever want to add attributes that correlate to failures, such as weekend ... or not necessarily ones that correlate to failures, but ones that make analyzing that data inside data science tools easier, you need to be able to add those attributes, and you don't want to burden the people who do the data exploration with any hurdles to creating those fields that are going to simplify analysis later. That's the advantage of NoSQL.
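A small sketch of what that flexibility means in practice: with schema-free documents, a derived attribute such as a weekend flag can simply be added to each record, with no table migration first. The document layout and the field names (`is_weekend`, `hour_of_day`) are assumptions made for illustration, and plain dictionaries stand in for documents in a NoSQL table.

```python
from datetime import datetime, timezone


def add_derived_fields(doc):
    """Return a copy of a sensor document with extra calendar attributes."""
    ts = datetime.fromtimestamp(doc["timestamp"], tz=timezone.utc)
    enriched = dict(doc)
    enriched["is_weekend"] = ts.weekday() >= 5  # Saturday is 5, Sunday is 6
    enriched["hour_of_day"] = ts.hour
    return enriched


# Hypothetical documents (timestamps are Unix seconds, UTC).
saturday_reading = {"deviceName": "Chiller1", "timestamp": 1546646400, "temp": 41.5}
tuesday_reading = {"deviceName": "Chiller1", "timestamp": 1546333200, "temp": 39.0}

enriched_saturday = add_derived_fields(saturday_reading)
enriched_tuesday = add_derived_fields(tuesday_reading)
```

Documents written before the new attributes existed can sit alongside the enriched ones, which is the convenience being contrasted with a fixed relational schema.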

Speaker 2: Okay. Great. Alright, thank you Ian and thank you everyone for joining us. That is all the time we have for today. For more information on this topic or others, please visit We will be sending out a link to the recording as well as links to the assets that Ian mentioned throughout the presentation. Thank you again and have a great rest of your day.