How Data Science Will Play a Pivotal Role in the Future of Equipment Maintenance


Ralf Klinkenberg

Co-founder & Head of Data Science Research, RapidMiner

Rachel Silver

Sr. Product Manager - Data Science & Analytics, MapR

As predictive maintenance converges with IoT, data science will play a pivotal role in driving innovation in equipment maintenance and related business outcomes. Just think of the billions of data points we capture across devices from infrared thermography, sonic and ultrasonic analysis, motor current analysis, vibration analysis, etc. Yet according to Gartner, 72% of manufacturing industry data goes unused due to the complexity of today’s systems and processes. Why? Historically, it has been a challenge to reliably and cost-effectively collect and manage IoT big data. Further, it takes data science and mathematics to help a human identify patterns, derive insight, and take action on the data. Together, MapR and RapidMiner are radically simplifying the picture and helping deliver the future of maintenance operations today.

In this webinar you will learn how to:

  • Enrich your IoT data with data from the edge
  • Perform data science directly on your IoT data in place
  • Identify equipment in need of maintenance
  • Optimize the balance between repair costs and uptime
  • Deliver economic impact to maintenance operations


Rachel Silver: Hello. I'm Rachel Silver. I'm the Product Manager for Machine Learning and AI at MapR. Today we're going to cover how MapR and RapidMiner join together to solve predictive maintenance use cases. To begin, I'm going to start with a high-level overview of what predictive maintenance is and how businesses benefit from it. Then I'm going to cover how the MapR platform supports IoT machine learning technologies and enables the predictive maintenance workflow. As soon as I'm done, which should be maybe five minutes, I'm going to pass the mic to Ralf Klinkenberg. He's going to walk through a predictive maintenance workflow using RapidMiner against data stored in MapR.

Rachel Silver: To begin, predictive maintenance is intended to recognize equipment failure patterns in a way that is predictive. This is in order to determine the condition of operating equipment and predict failure before it occurs. Lately, this has emerged as a primary advanced analytics use case as manufacturers have sought to increase operational efficiency and as a result of technical innovations like IoT and edge computing. The ultimate aim of predictive maintenance is to provide cost savings over schedule-based preventive maintenance or unplanned reactive maintenance, both of which result in machines being unavailable during critical periods.

Rachel Silver: Predictive maintenance models are typically built using anomaly detection algorithms trained on the data generated by IoT sensors. That data is commonly a streaming feed of metrics that predict systemic health, like infrared thermography and ultrasonic, current, vibration, and oil analysis.

Rachel Silver: The cost savings and the business benefits are enormous because predictive maintenance allows companies to schedule planned interventions, which reduce downtime and operating cost while improving production yield. Without predictive maintenance, repairs and updates are performed in an unplanned and reactive way that can lead to machine downtime or unnecessary parts and labor expenditures. With predictive maintenance, tasks are only performed when necessary, which increases operational efficiency, minimizes production hours lost to maintenance, and lowers the overall cost of spare parts and supplies.

Rachel Silver: To understand how MapR supports predictive maintenance and integrates as a data layer for our partners, let's quickly review what capabilities the platform natively provides.

Rachel Silver: MapR is a multi-cloud data platform that can support any kind of data (sensor data, machine data, images, graphs, tables, streaming, et cetera) and provides fast access to all of that data from any software application. We ship our own multi-model NoSQL database and event streaming components in addition to many open source ecosystem products, like Apache Spark and Apache Drill.

Rachel Silver: The MapR platform is highly available, secure, multi-tenant, and operates with a global namespace to support all types of workloads from batch to streaming. MapR supports a wide range of APIs to enable data access from any tool or library, including the ability to map the global namespace to containers in Kubernetes or Docker. MapR can be deployed anywhere, from small footprint edge clusters to on-prem and to any cloud.

Rachel Silver: Where MapR differentiates in its support for machine learning is in our data access capabilities. Many vendors are limited to one type of data access or API; this limits the tools and libraries that can be supported where machine learning is concerned. By supporting many APIs, including POSIX, S3, and HDFS, MapR allows access to your data from any tool, allowing data scientists to do their machine learning on the data where it sits.

Rachel Silver: Finally, here's an example of how a predictive maintenance workflow could run on MapR that includes MapR Edge. MapR Edge, to define this, is a small footprint edition of the MapR Converged Data Platform and addresses the need to capture, process, and analyze IoT data close to the source. In this example, we start at the edge, where we transmit sensor data from the equipment sensors to the MapR Edge cluster. Some data cleansing operations and aggregations can be done in MapR Edge to scale back the amount of data you need to stream to the primary data cluster.
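To make the idea of aggregating at the edge concrete, here is a minimal Python sketch. It is not MapR Edge code: the function name `aggregate_hourly`, the tuple layout of the readings, and the sample values are all assumptions chosen for illustration. It only shows how several raw readings per hour can be collapsed into one summary row before being sent on.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def aggregate_hourly(readings):
    """Collapse raw (machine_id, timestamp, value) readings into one
    summary row per machine and hour (hypothetical record layout)."""
    buckets = defaultdict(list)
    for machine_id, ts, value in readings:
        # Truncate the timestamp to the start of its hour.
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[(machine_id, hour)].append(value)
    return [
        {"machine_id": m, "hour": h, "count": len(v),
         "mean": mean(v), "min": min(v), "max": max(v)}
        for (m, h), v in sorted(buckets.items())
    ]


# Made-up vibration readings for one machine.
raw = [
    (1, datetime(2015, 1, 1, 6, 0), 40.0),
    (1, datetime(2015, 1, 1, 6, 20), 42.0),
    (1, datetime(2015, 1, 1, 6, 40), 44.0),
    (1, datetime(2015, 1, 1, 7, 5), 50.0),
]
summary = aggregate_hourly(raw)
for row in summary:
    print(row)
```

Here four raw readings become two summary rows, which is the kind of reduction that cuts the volume streamed to the primary cluster.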

Rachel Silver: Let's assume this happens in many locations across the globe that could represent factories, with one central maintenance processing hub. In this hub, the predictive models are built, in this case using RapidMiner, on the data that is transmitted from these edge clusters. When a trusted model has emerged, the model itself is sent back to the edge cluster, which powers the scoring of the model against the real-time data coming off of the sensors and generates the alerts to indicate when maintenance is needed.

Rachel Silver: At this point, I'm going to pass this over to Ralf. Mr Ralf Klinkenberg is co-founder of RapidMiner and Chief of Data Science Research.

David: Ralf, I think you're on mute.

Ralf K.: Now we're going to focus on the application of predictive maintenance. We'll try to predict machine failures before they happen. Before we do that, I'll give you a very quick overview of RapidMiner as a platform and how it integrates with MapR, and then we'll dive into how data is loaded from various sources from the MapR Hadoop cluster and how we can build predictive models from this data, and how we can then use those predictive models to score machine sensor data to predict whether the machine will fail or not, and how likely the failure actually is.

Ralf K.: Let's get started with the RapidMiner platform. It covers the overall data science or data mining process, starting with data ingestion from various sources: it can be structured data like Excel sheets and SQL databases, it can be Hadoop clusters or NoSQL databases, it could be text data, Twitter feeds, or other Internet sources.

Ralf K.: Then you typically have the data integration, data preparation, and data cleaning as a very important step to prepare the data for the actual machine learning. Machine learning algorithms then try to automatically find patterns in the data that can be used for predictions, or to find anomalies, that is, unusual situations that are also worth an alert.

Ralf K.: Once a model has been trained and validated and considered sufficiently accurate and reliable, it can be brought into deployment. Then after that, obviously, you want to integrate the results in one way or the other, for example, by sending out email alerts, by feeding output to a Hadoop cluster on MapR, or by integrating it into other systems via web services, a Java API, or some other means. Basically, the platform covers the whole analytics process from data ingestion to model deployment, from interactive modeling to automated modeling and model deployment.

Ralf K.: The key tool that is used to design data mining processes is called RapidMiner Studio. It's a visual data mining process designer, and every little box in the graphical user interface corresponds to one step in the data analysis. This can be interactively executed on your desktop, but it can as well be automated on a server, or run in parallel on a Hadoop cluster like MapR.

Ralf K.: The advantages are: it can very easily integrate the data from various sources and thereby accelerate your data preparation steps; it allows you to automate the modeling process, the model selection, and parameter optimization process; and it can be extended easily if needed, since the whole platform is based on our open source solution, for which we have different flavors: the open source version, community version, and enterprise version. Interfaces are open and more than 90% of the source code is available, which means it's also extendable by third parties, for which we have a marketplace and a community of more than 380,000 users in more than 100 countries worldwide.

Ralf K.: For seamlessly integrating Hadoop distributions, especially MapR which is particularly fit for the purpose of IoT due to its fast processing, we have an extension to RapidMiner called Radoop, which seamlessly integrates Hadoop and Spark into the visual process design interface of RapidMiner. It allows you to quickly design processes you want to execute in parallel on your Hadoop cluster using RapidMiner as a design tool. You can both use basic commands that you have available in Hadoop clusters, particularly machine learning libraries available for Hadoop. Or you can actually also execute complete RapidMiner processes in parallel using SparkRM.

Ralf K.: Now from an architectural point of view, the advantage for you is that you don't have to bother too much about bringing your process into a parallel, distributed execution. You can simply design it in a visual way and then RapidMiner will automatically push it down into the Hadoop cluster. Looking at the architecture, or how this principle works, you have RapidMiner Studio to design your workflow. For example, some part is for the data preparation, another part is maybe for training some models, doing some clustering, anomaly detection, or running some other scripts.

Ralf K.: Then, this will be automatically mapped to the appropriate libraries in your Hadoop environment, for example: Spark, in particular Spark machine learning libraries like MLlib; or if you're using Python scripts within your Hadoop scripts, then they would be automatically mapped to PySpark; or to SparkR, if it's R scripts; or to SparkRM, if it's RapidMiner processes.

Ralf K.: This can be used to map your visual design process into your MapR environment to automatically execute your processes there, distributed and in parallel. RapidMiner also integrates with the authentication so that the security and data protection aspects are taken care of using Kerberos and the MapR user login. That can be done very seamlessly. Once you've configured it, you don't have to worry about those details anymore. RapidMiner Radoop automatically addresses the underlying libraries like Mahout, MLlib, Spark, and so on.

Ralf K.: For Hadoop authentication, here's a little diagram going into a little bit more detail about that. RapidMiner Radoop would request the authentication from the Kerberos server. The Kerberos server would then grant an authentication ticket. There would be a request for a service session. The service would then be confirmed, and only then can you access the Hadoop cluster. That means without authentication, there is no access to the underlying Hadoop cluster.

Ralf K.: The integration of MapR into RapidMiner supports a lot of things. Like I said before, you have the MapR user login and authentication handled, including the Kerberos authentication and server-side impersonation. We use the MapR file system as a data source. You can retrieve data from that into RapidMiner using Hive or Read operators. Then you can use all existing Radoop operators, including SparkRM, within the MapR environment.

Ralf K.: This is just to give a little explanation. If you have a RapidMiner process, you can take the whole process and distribute it on the various nodes, using all the RapidMiner operators available for both structured and unstructured data.

Ralf K.: In a slightly wider view, besides RapidMiner Studio and RapidMiner Radoop there is also a server for collaboration, sharing resources, user and user group management, password handling, et cetera. The server also provides a process scheduler for automatically executing processes periodically. It provides the ability to design web apps and web services. Any RapidMiner process can also be made available as a web service for integration.

Ralf K.: We have cloud offerings also in the marketplace with extensions by third parties. Let me just skip that.

Ralf K.: This is another architecture view. It has the same components, but you can see that you can basically also integrate the results easily into other visual applications.

Ralf K.: Enough of the architecture and the more technical integration aspects, let's now look at the actual use case. Predictive maintenance in the context of the industrial Internet of Things. Here, we are looking at data about machines, telemetric data, and machine data.

Ralf K.: Just to recap quickly what we've heard before, what's predictive maintenance about? It's about predicting machine failures or component failures of the machines before they happen so that you can prevent them, for example, by doing timely maintenance.

Ralf K.: How can that be done? Well, machine learning is leveraged to automatically find patterns in the historic data, the observations from the past, and uses them to generate predictive models. This can be solved as a classification task and then be deployed as a scoring task. This applies if you want to find patterns from the past where you have known failure types.

Ralf K.: If you don't have these known failure types but you still want to find failures, then you have to do something that's called unsupervised learning. That means there are no examples from the past, but still you want to be able to find them. This can be done if you look at anomalies or outliers. That means you try to automatically detect situations that significantly deviate from what has been observed in the past, and that can then also trigger alerts or maintenance actions.
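As a rough illustration of the anomaly-based approach, the following Python sketch flags a new reading when it deviates strongly from past observations. The z-score rule, the threshold of 3 standard deviations, the function names, and the sample values are assumptions made for this example; they are not the method used in the webinar, and real anomaly detection would typically consider many sensors at once.

```python
from statistics import mean, stdev


def fit_baseline(history):
    """Summarize past 'normal' readings by their mean and standard deviation."""
    return mean(history), stdev(history)


def is_anomaly(value, baseline, threshold=3.0):
    """Flag a reading that lies more than `threshold` standard deviations
    away from the historical mean (threshold is an assumed default)."""
    mu, sigma = baseline
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


# Made-up vibration history with no labeled failures.
history = [0.50, 0.52, 0.49, 0.51, 0.50, 0.48, 0.53, 0.50]
baseline = fit_baseline(history)

new_readings = [0.51, 0.49, 0.95]
alerts = [v for v in new_readings if is_anomaly(v, baseline)]
print(alerts)
```

No failure examples are needed here: only the last reading is far enough from the historical range to raise an alert.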

Ralf K.: The advantage of performing predictive maintenance in such a data-driven way leveraging machine learning is that it's individual for each machine. That means it takes into consideration the load of the machine, the stress the machine has been exposed to, the maintenance it has experienced in the past, any kind of problems or issues it had in the past. Some machines may need less maintenance, so you can save money there by not doing it more often than necessary. But other machines that are exposed to high load changes, high machine stress, et cetera, may need more frequent maintenance to prevent failures.

Ralf K.: Hence, it's important to consider this as a machine-individual, load-dependent, and wear-dependent prediction. In the end, this allows you to reduce your downtimes, make your process more reliable, and lower overall costs: the costs for machine failures and repair, the costs for production stops and maybe even broken contracts if you can't deliver on time, as well as the cost for the maintenance itself, because you can plan much further in advance.

Ralf K.: This is the overall perspective of what we want to do. From a business perspective, what's the question? It could be something like, what's the probability that a machine goes down due to failure of a component within a certain time interval, let's say, two weeks, two hours, whatever is relevant and technically feasible?

Ralf K.: This can be treated, for example, as a multi-class classification problem, if you want to predict which particular component is going to fail. You can use machine learning algorithms to build such predictive models. There's a wide variety of hundreds of algorithms available, both native implementations in RapidMiner as well as seamlessly integrated third-party implementations, like Python libraries, ML libraries, H2O, and other publicly available libraries.

Ralf K.: The training happens on historical data, data that has been collected about the particular machines in the past, both from normal operations as well as some failure situations. The target is to avoid unplanned maintenance due to unexpected machine failure. We want to be able to predict when the machine failures are likely to happen and then do the maintenance before those predicted failures.

Ralf K.: The overall workflow that we are going to look at now is this: we connect to the data sources, namely MapR; do the data discovery; then feature engineering, which means data preparation and generating features that are meaningful enough for the machine learning algorithms to find patterns; then we do the actual machine learning, the predictive modeling, including some validation; and then at the end, we also see how this can be deployed for real-time scoring. That is, when new measurements from the machine come in, we try to predict as quickly as possible, and as far ahead as possible, whether the machine will fail or not.

Ralf K.: Let's start with a look at the data. What kind of data sources are we looking at here? In this particular use case, we look at real-time telemetry data collected from machines. We look at error messages the machines have issued, historical maintenance records, and then we also have data about failures of the machines. Possibly also other machine information, like what type of machine do I have, how old is the machine, how old are critical components, et cetera.

Ralf K.: Of course, there can be more sources of information, but due to the limited time, we'll focus on this for today. But in many use cases, for example, it will be helpful to also consider textual information. For example, if the machine operators keep a log book of things they observe, this can be very valuable information and can also be integrated in such processes. Okay.

Ralf K.: Let's look a little bit closer. What kind of data sets do we have here? For the first data set, we look at telemetry data. That means it's time series data which has information about the sensor measurements, like voltage, rotation, pressure, vibration measurement, et cetera. It's collected from the machines in real time, for every hour since 2015. The data is stored on a MapR cluster.

Ralf K.: We then have error messages. In this case, we mean messages that are issued by the machine without the machine really breaking. So there may be things that are not perfect in operations, and there's a message pointing this error out, but it doesn't mean the machine has to be stopped or the machine is breaking. But still it may be very valuable information. Along with the particular error, we obviously also have information like the error date and time, so that we can connect that to the corresponding telemetry data.

Ralf K.: Then we have a log of the scheduled and unscheduled maintenance. That means which component of which machine has undergone maintenance. This is obviously also relevant information, because if you have a machine with very recent maintenance, the chance that something breaks is probably much lower than if you have a machine that is overdue for maintenance.

Ralf K.: Then we have some data about the machines themselves, such as age, model, et cetera.

Ralf K.: Then, of course, information about the failures that we have observed in the past. Those may have triggered, for example, the replacement of certain components, which means the machine had to be stopped, the component had to be replaced, and we had a loss of production time. This information is kept together with the machine ID, some component type information, the replacement date and time, and possibly additional information.

Ralf K.: Now you can see already we have different data sources here, which somehow need to be aligned and combined to provide the basis for successful machine learning.

Ralf K.: Now we want to go through the process of taking a closer look at those data sets as well as using those in the pre-processing stage to come up with the final data set that we will present to the machine learning.

Ralf K.: Let me see if I can switch to RapidMiner and show it live. I hope you can see my RapidMiner screen now. Let me see.

Moderator: Yes, we can.

Ralf K.: Okay, great. Thanks. What you see here is the startup screen of RapidMiner Studio. That is the desktop tool used to visually design data mining processes. When you start RapidMiner Studio you can start with a blank piece of paper and design your data mining process from scratch. You can use tools like Turbo Prep to help you in the data preparation by using a wizard that asks you a few questions and quickly guides you through the process of doing that. Or you can use Auto Model once you have your data prepared to automatically apply various learning algorithms, compare their performance, and optimize the parameters on the given data set.

Ralf K.: These are always interesting approaches to take if you start with a new data set. If you have a particular application like Predictive Maintenance in mind, we also provide application templates. Let me just quickly give you a look at that one. If you look at this, you see a complete RapidMiner process. Every box corresponds to a particular step in the process, like for example, retrieving reference data from the past and using that to automatically identify the most important indicators for machine failures.

Ralf K.: Then this is also used in an automated optimization for a predictive model, and this model is then applied on a new data set that is fed in here. Then there's some post-processing going on to basically rank the prediction results so that you see the most urgent maintenance recommendations at the top.

Ralf K.: Let me give you a very quick walk through this one before we look at the data that I showed you on the slide. Here we have information about machines given by their IDs. We have failure information, whether the machine fails, yes or no. Then we have various sensors. Obviously, as a human, it's very hard to detect a pattern here. We can take a look at the metadata to, for example, see what the value distribution of particular sensors is. We could also look at charts to see how different sensors might be related to the target variable. Since we have so many sensors, we may also want to take a look at a chart that captures more of them.

Ralf K.: In the parallel plot, you can see that I plotted all the sensors on the X-axis. Then maybe I go one plot back here. Each line here corresponds to one line in the data table. My hope is to see the features or sensors that help distinguish failures and non-failures. Unfortunately, there are too many data points, so I can't see it here. If I switch to the average per category, failure versus non-failure, I can see the blue line, average non-failure; the red line, average failure; the shaded blue area is the standard deviation of the non-failures; and the shaded red area is the standard deviation of the failures.

Ralf K.: You can see that for a lot of sensors, like sensors 15 to 25, for example, there is no significant difference in average either. So we don't see much of a chance to find something there. But here in the sensors one to, let's say, 10 or 11, there's some difference in the average of the categories failure versus non-failure. So there might be a chance for a learning algorithm to pick it up.
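The comparison of per-class averages and standard deviations can also be computed directly. The Python sketch below is an illustration under assumptions: the column names (`failure`, `sensor_1`, `sensor_20`), the tiny made-up table, and the `separation` score (difference of class means divided by the average class standard deviation) are all hypothetical and are not taken from the template shown in the webinar.

```python
from statistics import mean, stdev


def class_profile(rows, sensors, label_key="failure"):
    """Per sensor, compute (mean, standard deviation) separately for
    failure rows (True) and non-failure rows (False)."""
    profile = {}
    for sensor in sensors:
        profile[sensor] = {}
        for label in (True, False):
            values = [r[sensor] for r in rows if r[label_key] is label]
            profile[sensor][label] = (mean(values), stdev(values))
    return profile


def separation(profile, sensor):
    """Assumed heuristic: gap between class means relative to their spread."""
    (m_fail, s_fail), (m_ok, s_ok) = profile[sensor][True], profile[sensor][False]
    spread = (s_fail + s_ok) / 2
    return abs(m_fail - m_ok) / spread if spread else float("inf")


# Made-up reference table: sensor_1 differs between classes, sensor_20 does not.
rows = [
    {"failure": False, "sensor_1": 1.0, "sensor_20": 5.0},
    {"failure": False, "sensor_1": 1.2, "sensor_20": 5.4},
    {"failure": False, "sensor_1": 0.8, "sensor_20": 4.6},
    {"failure": True, "sensor_1": 3.0, "sensor_20": 5.1},
    {"failure": True, "sensor_1": 3.4, "sensor_20": 4.7},
    {"failure": True, "sensor_1": 2.6, "sensor_20": 5.3},
]
profile = class_profile(rows, ["sensor_1", "sensor_20"])
for name in profile:
    print(name, round(separation(profile, name), 2))
```

A sensor with a large score shows clearly different averages for failures and non-failures, which corresponds to the lines that visibly diverge in the chart.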

Ralf K.: This is basically what happens then once we pass the data on to the further process. Maybe let me add a step to show you how easily you can add, for example, a modeling technique. We add a decision tree and just feed in the same data that we also feed to the other parts of the process.

Ralf K.: Just to show you how quickly that can be done. Now if I run the complete process, you will see we have a particular decision tree that helps to automatically predict whether we'll have a failure or not. There are a lot of cases where it seems it's easy to decide that there's no failure. For example, if sensor two has a very large value, there should be no problem; if sensor three has a small value, no problem; but there's a sensor combination where we do have problems on this path. Obviously, this is a very simplified example in the template.

Ralf K.: Okay. Let me show this. These are the predictions generated by the optimized model in this particular process. We have cases where we predict a failure and we have a confidence. These predictions are now sorted by the confidence that the machine will fail and it will give me a priority list for my maintenance services. Okay.
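The priority list described here is essentially a sort of the scored rows by failure confidence. A minimal Python sketch, with hypothetical field names (`machine_id`, `prediction`, `confidence_failure`) and made-up scores:

```python
def maintenance_priority(predictions):
    """Keep rows predicted as failures and order them by descending
    confidence, so the most urgent machine comes first."""
    flagged = [p for p in predictions if p["prediction"] == "failure"]
    return sorted(flagged, key=lambda p: p["confidence_failure"], reverse=True)


# Made-up scoring output for four machines.
scored = [
    {"machine_id": "M_0041", "prediction": "failure", "confidence_failure": 0.71},
    {"machine_id": "M_0007", "prediction": "no_failure", "confidence_failure": 0.12},
    {"machine_id": "M_0113", "prediction": "failure", "confidence_failure": 0.93},
    {"machine_id": "M_0020", "prediction": "failure", "confidence_failure": 0.58},
]
priority = maintenance_priority(scored)
print([p["machine_id"] for p in priority])
```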

Ralf K.: In the optimization that happens in this particular process, we use a learner called K-nearest neighbors, which has a parameter K that describes how many of the most similar data points from the past should be considered when creating a prediction for the new data coming in. Depending on this parameter, the model is more or less accurate.

Ralf K.: Here, there was an optimization for the parameter K, and then we can plot K versus accuracy. In this case, we see a larger neighborhood helps to more robustly predict failure, but towards the end, for very large K, it starts to drop again. So the optimal K is somewhere in the mid-range.

Ralf K.: This is what you can already see in the table view. If you sort it by the biggest accuracy, you can see it's somewhere around 30 to 39, roughly in that range. This is just a template process that you can use to start your exploration into Predictive Maintenance. But you can already see we assume we have one reference table.
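The search over K can be sketched in plain Python. This is an illustration, not the RapidMiner process: the synthetic two-sensor data, the class centers, the single train/test split (the template may well use a different validation scheme), and the candidate values of K are all assumptions.

```python
import random
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training points closest to `query`
    (squared Euclidean distance)."""
    nearest = sorted(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], query)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train, test, k):
    """Fraction of held-out rows whose label is predicted correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)


# Synthetic stand-in for the reference table: two sensor values per row.
rng = random.Random(0)


def sample(label, center, n):
    return [((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label)
            for _ in range(n)]


data = sample("no_failure", 0.0, 120) + sample("failure", 1.5, 40)
rng.shuffle(data)
train, test = data[:110], data[110:]

# Try several neighborhood sizes and keep the most accurate one.
results = {k: accuracy(train, test, k) for k in (1, 5, 15, 35, 75, 109)}
best_k = max(results, key=results.get)
for k, acc in results.items():
    print(k, round(acc, 3))
print("best K:", best_k)
```

With K close to the size of the training set, the vote is dominated by the majority class regardless of the query, which is one reason accuracy tends to fall off again for very large K.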

Ralf K.: Now if you look back at our example, we actually had five different tables. We have telemetry data, which also happens to be time series data. So it's not as simple as having one machine ID and nicely prepared label failure, non-failure and already nicely prepared sensor values next to it, we still have to create that from the time series.

Ralf K.: Then, we have error information, we have machine information, we have maintenance information, and we have information about the failures. Taking a quick look at those, this is the telemetry data, we have for each entry, machine ID, the voltage, rotation, speed, pressure, vibration, and we have a timestamp date and time.

Ralf K.: One more thing, you may want to look at, maybe you also want to see some metadata, like what is the value range typically for the voltage, or for the rotations, pressures. Of course, you can take any of those, double-click on them and you can see that enlarged. This way you have a chance to quickly explore your data.

Ralf K.: Yeah. This was the telemetry data. Looking at the error information, we have error codes for the particular machines. We have again the timestamps.

Ralf K.: Looking at the machine data: we have a machine ID, we have the model, and we have the age, that is, how long ago this machine was either installed or last maintained. It depends a bit on what you think is more critical in your case. Of course, you can have multiple of those attributes.

Ralf K.: Then we have maintenance information, like what machine, what component needed to be replaced, and then the timestamp again.

Ralf K.: Then we have the failure information. For each machine, you have the component that failed and the particular timestamp.

Ralf K.: These are the various tables that we have. In this case, I just pulled them from my local repository as some samples. Obviously, since we're here looking at larger data volumes in the real application, we want to actually pull this data from a MapR Hadoop cluster. In this case, we define which cluster we want to use and then in that cluster we specify which are the data sets we want to pull.

Ralf K.: Then you can do the required pre-processing, for example, duplicate removal. If you want to execute a particular RapidMiner process, you can also do that. For example, transform the column that contains the date and timestamp into an actual date and not just a text column, so that you can sort by the date. Then you can, for example, also use a loop so that you can create information for each machine: we loop over the machine IDs, and then we loop over the particular attributes we want to look at.

Ralf K.: We also generate new attributes from the given ones. I don't want to go into all the details; I just want to give you a high-level feeling that you can define a RapidMiner process, like we saw before in the application template, but now you can do the same in a distributed and parallel way on your MapR cluster. In this case, what we do here is look at the aggregates of the telemetry data over time. We have a 24-hour time lag that we can use later on if we want to do the prediction, so that we can predict 24 hours ahead if the machine is going to break.
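To illustrate what aggregating telemetry over a 24-hour window means, here is a small Python sketch for a single machine and a single sensor. The function name, the feature names (`mean_24h`, `min_24h`, `max_24h`), and the synthetic voltage series are assumptions for this example; the actual process runs as RapidMiner operators on the cluster.

```python
from datetime import datetime, timedelta
from statistics import mean


def rolling_features(series, window=timedelta(hours=24)):
    """For each reading time t, aggregate the values observed in the
    window (t - window, t]. `series` is a list of (timestamp, value)."""
    series = sorted(series)
    rows = []
    for t, _ in series:
        values = [v for ts, v in series if t - window < ts <= t]
        rows.append({
            "time": t,
            "mean_24h": mean(values),
            "min_24h": min(values),
            "max_24h": max(values),
            "n": len(values),
        })
    return rows


# Synthetic hourly voltage readings for one machine over two days.
start = datetime(2015, 1, 1, 0, 0)
series = [(start + timedelta(hours=i), 170.0 + i) for i in range(48)]

features = rolling_features(series)
print(features[-1])
```

Each feature row only uses readings up to its own timestamp, so it can later be paired with a label describing what happens in the following 24 hours.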

Ralf K.: At the end, once we have done our pre-processing, we'll just store this as a new data set. Once we have done that, we can do the same thing for the other data sets, for example, for the errors, and do the transformations we need, for example, to look at the error ID and do aggregations for each error code if we want to count how many errors of certain types occur for the particular machine. We can then combine them with the telemetry data using a join once we identify the ID attributes, and do further pre-processing, for example, replacing missing values.

Ralf K.: Again, you can integrate complete RapidMiner processes here, for example, to count the different numbers of error types in the last 24 hours. For example, if you see a certain increase of particular error types, that might be an indication of a failure happening soon. Then at the end again, we will store that as another data set.
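Counting error types in the last 24 hours and joining the counts onto the telemetry rows could look roughly like the Python sketch below. The record layouts, the error code names (`error1`, `error2`), and the column naming scheme are hypothetical; filling a missing count with 0 stands in for the "replace missing values" step.

```python
from collections import Counter
from datetime import datetime, timedelta


def error_counts(errors, machine_id, at, window=timedelta(hours=24)):
    """Count errors per code for one machine in the window (at - window, at].
    `errors` is a list of (machine_id, timestamp, error_code)."""
    return Counter(
        code for m, ts, code in errors
        if m == machine_id and at - window < ts <= at
    )


def join_errors(telemetry_rows, errors, codes):
    """Attach one count column per error code to every telemetry row,
    matching on machine ID and timestamp."""
    joined = []
    for row in telemetry_rows:
        counts = error_counts(errors, row["machine_id"], row["time"])
        extra = {f"{c}_count_24h": counts.get(c, 0) for c in codes}
        joined.append({**row, **extra})
    return joined


# Made-up error log and telemetry rows.
errors = [
    (1, datetime(2015, 1, 1, 3, 0), "error1"),
    (1, datetime(2015, 1, 1, 20, 0), "error1"),
    (1, datetime(2015, 1, 2, 2, 0), "error2"),
    (2, datetime(2015, 1, 1, 5, 0), "error1"),
]
telemetry = [
    {"machine_id": 1, "time": datetime(2015, 1, 2, 6, 0), "voltage": 171.2},
    {"machine_id": 2, "time": datetime(2015, 1, 2, 6, 0), "voltage": 168.9},
]
joined = join_errors(telemetry, errors, ["error1", "error2"])
for row in joined:
    print(row)
```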

Ralf K.: For the maintenance data, we do a similar thing. Retrieve again telemetry data and maintenance data. Select the attribute that we want to use from the data table. Then once we bring those together, we can actually calculate for each component type what has happened in terms of maintenance. At the end, we can store the overall result again.

Ralf K.: Once we have done that for all the, let's say, raw data sets, we can start integrating them, generating, for example, features from those. Again, we can bring those sets together, perform the required joins across the machine ID, for example, and the timestamp, and then come up with the final data set that we have. The only thing that is missing for that is the label ... I just noticed a little typo there.

Ralf K.: For that, we obviously need the failure information, which we collect as well from the MapR cluster, and generate the appropriate target variable from it. Again, replace missing values. Finally, define what the roles are, for example, that the machine ID should be treated as an ID and that the failure information should be treated as a label, the target variable we want to predict.

Ralf K.: Only now we have a single table that has everything including the label for the prediction, which is a failure that we want to predict, given the data with the 24-hour time lag. So once we have done that, let me quickly switch back to the slides. We've seen the data. Now I'll skip a few slides because I always show the data live. So we have seen the machine data, we have seen telemetry data, the machine failure information, the maintenance data.

Ralf K.: Now, the last step, the feature engineering. Like I said, we typically pick a time window. We want to allow some lag between the things we observe and the thing we want to predict because the prediction is only useful if you have enough time to react. Doesn't help if I tell you your machine is going to fail in five seconds if the maintenance team needs, let's say, one hour to be there. So we need to consider the relevant time window.

Ralf K.: Obviously, in a real-world application, we would apply the same logic, the same process, in a loop across multiple time horizons, so that we would derive predictions for different time intervals, like one hour, two hours, one day, two days, one week, two weeks, et cetera, to predict the likelihood of failure so that you can do short-term as well as long-term planning.
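
A minimal sketch of labelling one observation for several prediction horizons, assuming a simple failure log. The log layout, the sample dates, the `HORIZONS` names, and the helper `fails_within` are hypothetical and only illustrate the idea of looping the same logic over multiple time intervals.

```python
from datetime import datetime, timedelta

# Hypothetical failure log: (machine_id, time of failure)
failures = [(1, datetime(2024, 1, 3, 6, 0))]

# A few of the horizons mentioned in the talk; any set of intervals could be used
HORIZONS = {
    "1h": timedelta(hours=1),
    "1d": timedelta(days=1),
    "1w": timedelta(weeks=1),
}

def fails_within(machine_id, observed_at, horizon):
    """True if the machine has a recorded failure in (observed_at, observed_at + horizon]."""
    return any(
        mid == machine_id and observed_at < failed_at <= observed_at + horizon
        for mid, failed_at in failures
    )

# One label per horizon for a single observation time
observed_at = datetime(2024, 1, 2, 12, 0)
labels = {name: fails_within(1, observed_at, h) for name, h in HORIZONS.items()}
```

In this made-up example the failure occurs 18 hours after the observation, so only the one-day and one-week labels are positive. A separate model could then be trained per horizon.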

Ralf K.: In RapidMiner, you can use a time series extension for the time windowing on the data and then extract rolling aggregates of the measures that you want to look at. Those could be, for example, average vibration level or average voltage, average rotation speed. It could be minimum or maximum rotation speed or vibration. It could be the number of errors, so you can do counts as well, and sums of things, and whatever you consider relevant.
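
The rolling aggregates can be illustrated in plain Python. This is a sketch under assumptions, not the RapidMiner time series extension: the hourly sampling, the synthetic vibration values, and the function name `rolling_aggregates` are invented for the example.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical hourly vibration readings for one machine: (timestamp, value)
telemetry = [
    (datetime(2024, 1, 1, 0, 0) + timedelta(hours=h), 40.0 + h)
    for h in range(48)
]

def rolling_aggregates(readings, as_of, window=timedelta(hours=24)):
    """Aggregate all readings in the window (as_of - window, as_of]."""
    values = [v for ts, v in readings if as_of - window < ts <= as_of]
    if not values:
        return None  # no data in the window
    return {
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
        "count": len(values),
    }

# Features describing the last 24 hours before 23:00 on 2 January
features = rolling_aggregates(telemetry, as_of=datetime(2024, 1, 2, 23, 0))
```

The same function would be applied per machine and per measurement (voltage, rotation speed, and so on) to build the aggregate feature columns.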

Ralf K.: For example, the days since the last replacement of a component was one thing that we computed in one of the processes. I already showed you the processes for the corresponding data tables that hold this information. I then again combined them into a single table where we add the machine information as well as the label. But this was a process for machine feature aggregation. Then we added the label, that is, we had a look at the failure data that is used to create the label column to tell us: does this machine fail after this 24-hour window or not? Then we have aggregate features from the previous 24 hours to give us some hints on whether this is valuable or not.
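
The "days since the last replacement" feature can be sketched as follows, assuming a maintenance log of `(machine_id, component, replaced_on)` records. The layout, the dates, and the function name are hypothetical.

```python
from datetime import date

# Hypothetical maintenance log: (machine_id, component, replaced_on)
maintenance = [
    (1, "comp1", date(2024, 1, 5)),
    (1, "comp2", date(2024, 2, 20)),
    (1, "comp1", date(2024, 3, 1)),
]

def days_since_replacement(log, machine_id, component, as_of):
    """Days between `as_of` and the most recent replacement on or before it."""
    past = [d for m, c, d in log if m == machine_id and c == component and d <= as_of]
    return (as_of - max(past)).days if past else None

as_of = date(2024, 3, 11)
comp1_age = days_since_replacement(maintenance, 1, "comp1", as_of)
comp2_age = days_since_replacement(maintenance, 1, "comp2", as_of)
```

A component with no recorded replacement yields `None`, which would then go through the same missing-value handling as the other columns.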

Ralf K.: Then, this is the final table. Resulting from that, for each machine, we have the failure information given the particular measurements and this is what we can actually use for the machine learning algorithm.

Ralf K.: What we want to do now is predictive modeling: apply machine learning to actually predict the failures. We also want to validate how accurate the model is on previously unseen data. This can be done by splitting the overall historic data into training and test data, then training the model only on the training data, and then testing its performance on the previously unseen test data.
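
A single random split into training and test data might look like the sketch below. The 70/30 ratio, the seed, and the function name are assumptions chosen for illustration.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows and cut it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Stand-in for ten rows of the final feature table
rows = list(range(10))
train, test = train_test_split(rows)
```

The model is fitted on `train` only, and `test` is held back until the final performance check.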

Ralf K.: Obviously, there are also other approaches for validation instead of a single split into a training and a test set. You can do what's called cross-validation. You randomly split the data set into, for example, ten parts, use nine parts for training and one for testing, and iterate in such a way that each of the ten parts is the test set once, with the other nine as the training set for that particular test set. That way you get the average performance and the standard deviation of that performance.
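
The ten-fold scheme can be sketched as below. To keep the example self-contained it scores a trivial majority-class baseline instead of a real learner. The label distribution and all function names are hypothetical.

```python
import random
from collections import Counter
from statistics import mean, stdev

def k_fold_indices(n_rows, k=10, seed=0):
    """Yield (train_indices, test_indices) so that each row is tested exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def cross_validate(labels, k=10):
    """Score a majority-class baseline; returns (mean accuracy, standard deviation)."""
    scores = []
    for train, test in k_fold_indices(len(labels), k):
        majority = Counter(labels[i] for i in train).most_common(1)[0][0]
        scores.append(sum(labels[i] == majority for i in test) / len(test))
    return mean(scores), stdev(scores)

# Hypothetical labels: 90 rows without failure, 10 with a failure
labels = ["none"] * 90 + ["failure"] * 10
avg_accuracy, spread = cross_validate(labels)
```

The baseline reaches about 90% accuracy here simply because failures are rare, which is one reason to look at more than accuracy when judging a failure model.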

Ralf K.: Or if you have situations where you think the patterns change over time, then you probably want to do a more sophisticated validation called sliding-window validation, which actually simulates how time passes: you always train on a certain data set, then apply the model to the next set of data, train it on the newer data, apply it to the next data, and so on.
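
Sliding-window validation can be sketched as an index generator over time-ordered rows. The window sizes and the function name are assumptions for illustration.

```python
def sliding_window_splits(n_rows, train_size, test_size, step=None):
    """Yield (train_indices, test_indices) over rows sorted by time.

    Every test window lies strictly after its training window.
    """
    step = step or test_size
    start = 0
    while start + train_size + test_size <= n_rows:
        train = list(range(start, start + train_size))
        test = list(range(start + train_size, start + train_size + test_size))
        yield train, test
        start += step

splits = list(sliding_window_splits(n_rows=10, train_size=4, test_size=2))
```

With ten rows this produces three splits: train on rows 0 to 3 and test on 4 to 5, then train on 2 to 5 and test on 6 to 7, then train on 4 to 7 and test on 8 to 9.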

Ralf K.: All these different validation schemes are provided by RapidMiner. You can simply choose which one you want to use. In this case, we'll use that to build various multi-class classification models, validate the performance of each model, optimize model parameters, and then eventually also compare the performance of multiple models.

Ralf K.: I already showed you the application template and I added some machine learning algorithms, like the decision tree learner. Following this logic, you can basically retrieve the data either from MapR or from your local repository. Depending on the size, typically, you would probably rather retrieve it from the MapR cluster. Then you do the actual modeling on the training data only so you can use the split operation to basically create training and test splits.

Ralf K.: Then, you have one table where you have the information whether it was a failure or not, plus the telemetry data and the aggregates that you computed, and the system can then try to learn something. You have a second table, the test data, on which you can use the model for prediction and compare the predicted label with the real label that you've already seen in the past, whether the machine did fail or not. Then you can compute the number of errors, the predictive accuracy, or other metrics like precision and recall.
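
Comparing predicted and real labels can be sketched as below for the binary case (failure or no failure). The label lists are made up and `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(actual, predicted, positive="failure"):
    """Accuracy, precision and recall for one positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    correct = sum(a == p for a, p in pairs)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical test-set labels and model predictions
actual    = ["failure", "none", "none", "failure", "none", "failure", "none", "none"]
predicted = ["failure", "none", "failure", "none", "none", "failure", "none", "none"]
metrics = binary_metrics(actual, predicted)
```

Precision answers "how many of the predicted failures were real", and recall answers "how many of the real failures were caught".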

Ralf K.: One way of quickly building a lot of different models is using RapidMiner's Auto Model feature. This way you can select a broad range of learning algorithms for the same task, and then RapidMiner runs all of them in parallel to basically find out for you which model is the best. Then for each model, you will see the predictive accuracy, that is, how accurate this particular model is when it makes predictions. You also see other information, like the run time in milliseconds needed for the prediction and for the model training. You can use this information to either refine your search for a good model or to make a decision and say, "Okay, this is a model I want to deploy."

Ralf K.: If you want to dive a little bit more into the individual models, and we will also take a look at that in RapidMiner, maybe quickly switch back to RapidMiner to show that. I already prepared it so we don't have to wait for it. For example, here, we have the so-called confusion matrix of a particular problem that we have seen before. We have the data from the past. Sometimes in the past there was no failure. Sometimes there was a failure in component four, sometimes a failure in component one, sometimes a failure in component two, and sometimes at three.

Ralf K.: Now the question is: in a situation where there was no failure, how often did the model predict correctly that there's no failure? We see in this case 100% of the no failures were recognized correctly. In other words, there was no false alarm on this small test set. If you look at, for example, the failure of component four, we only had one case and this was not recognized, so this was not so good for this particular model. Looking at the component failure of type two, three of them were predicted correctly as component two failures and, hence, we have three out of four correct, which is 75% accuracy.
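
Reading per-class rates off a confusion matrix can be sketched as follows. Only the figures for component two (three of four) and component four (zero of one) and the absence of false alarms come from the talk. The number of no-failure cases, and which class the missed failures were assigned to, are assumptions made so the example is complete.

```python
from collections import Counter, defaultdict

def confusion_matrix(actual, predicted):
    """matrix[actual_class][predicted_class] -> number of cases."""
    matrix = defaultdict(Counter)
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

def per_class_recall(matrix):
    """Share of each actual class that was predicted correctly."""
    return {cls: row[cls] / sum(row.values()) for cls, row in matrix.items()}

# Assumed test set: 5 no-failure cases, 4 comp2 failures, 1 comp4 failure
actual    = ["none"] * 5 + ["comp2"] * 4 + ["comp4"]
predicted = ["none"] * 5 + ["comp2"] * 3 + ["none"] + ["none"]
recall = per_class_recall(confusion_matrix(actual, predicted))
```

The per-category figure quoted in the talk corresponds to this per-class recall: correct predictions for a class divided by all actual cases of that class.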

Ralf K.: For each category, you can recognize how accurate the model is and you can make a decision, is this model robust enough? Is it accurate enough? Can I trust it? This particular case, GBT model type, stands for Gradient Boosted Decision Trees. It's not a single tree, but it's an ensemble of many trees that are combined to give more accurate predictions.

Ralf K.: Here is a look at some of the individual trees, for example, how long has it been since the last time component two was replaced? How long has it been since the last time component one was replaced? Obviously, the different trees may look at the same or different features; they're created from different sub-samples of the data.

Ralf K.: This is just to give you an idea of what this looks like in the tool. Let me go back to the slides because then it's easier to compare multiple learners across the same metrics. In this case, we see the gradient boosted decision trees and we see a Naive Bayes classifier, and they both have the same task: to predict the failure. Then we can see which of the two actually makes the better prediction.

Ralf K.: Once I have decided which particular model to use for my predictions, I want to deploy it and use it for real-time scoring. How do I do that? Well, the idea is, I have to, first of all, store the model. I have to do the same kind of pre-processing on the new data as I did on the data from the past. Then I can use the so-called Apply Model operator of RapidMiner to take this trained model and apply it to the new data. This can be done on RapidMiner Server or it can be done on MapR, wherever you want to do the prediction.
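
The point about repeating the same pre-processing at scoring time can be illustrated with the missing-value step. This sketch assumes missing readings are filled with the training mean and that the parameters are persisted as JSON next to the trained model. Both choices, and the function names, are illustrative rather than what RapidMiner does internally.

```python
import json
from statistics import mean

def fit_preprocessing(train_values):
    """Learn the fill value for missing readings from the training data only."""
    known = [v for v in train_values if v is not None]
    return {"fill_value": mean(known)}

def apply_preprocessing(values, params):
    """Replace missing readings using parameters learned at training time."""
    return [params["fill_value"] if v is None else v for v in values]

# Training time: learn the parameters and persist them alongside the model
params = fit_preprocessing([40.0, None, 44.0])
stored = json.dumps(params)

# Scoring time: reload the parameters and apply the identical transformation
restored = json.loads(stored)
new_data = apply_preprocessing([None, 50.0], restored)
```

Recomputing the fill value on the new data instead would give the model inputs that differ from what it was trained on.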

Ralf K.: Obviously, in an IoT scenario, you probably want to move the decision as close as you can to the machine. Maybe you want to do it on the edge or close to the edge. That probably means you will want to deploy it on your MapR clusters.

Ralf K.: Well, once you use the model, the result you get is again very similar: you have a prediction. Will the machine fail or not? For each of the components, you have the confidence with which the model thinks that component will fail. That means, first of all, you can sort by this confidence and see what the highest risks of component failures are and where you need to do maintenance first. You can also use that in an automated way and feed it into your maintenance scheduling and planning system, or you can use it for alerts, depending on how high the confidence level is.
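
Ranking by confidence and raising alerts can be sketched as below. The scoring output, its tuple layout, and the `ALERT_THRESHOLD` value of 0.75 are all assumptions chosen for illustration.

```python
# Hypothetical scoring output: (machine_id, component, confidence of failure)
predictions = [
    (17, "comp2", 0.91),
    (4, "comp1", 0.35),
    (23, "comp4", 0.78),
    (8, "comp3", 0.12),
]

ALERT_THRESHOLD = 0.75  # assumed cut-off; would be tuned against repair and downtime costs

# Highest-risk components first: a simple maintenance priority list
ranked = sorted(predictions, key=lambda p: p[2], reverse=True)

# Only sufficiently confident predictions trigger an alert
alerts = [p for p in ranked if p[2] >= ALERT_THRESHOLD]
```

The `ranked` list could feed a maintenance scheduling system, while `alerts` is the subset that warrants immediate notification.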

Ralf K.: I hope that gave some insights on how you would typically tackle such a predictive maintenance use case. Thank you very much for your attention. The MapR team and the RapidMiner team are online to answer questions. I'm almost sure we will not be able to answer all of the questions live, but we can follow up with those questions that we cannot answer live afterwards via email. Thank you very much for your interest and attention. Now, it's your stage for your questions.

David: Thanks, Ralf. Just a reminder to submit your questions in the chat box in the lower left-hand corner of your browser. Let's get started, guys. We have a few questions in here. Let's start with "What kind of models and algorithms are supported?"

BP: I think I'll just take that one. If you remember, Ralf was building some simple decision trees. Now RapidMiner, in the context of the design environment, allows you to use more than 150 different algorithms that are available as drag-and-drop boxes. Some of these are classic algorithms that have existed for decades, or centuries in some cases; obviously, we have implemented them to be performant and scalable.

BP: But along with that, as you noticed on the screen, there are operators that are based on the Spark ML library. We also have certain algorithms that come from the H2O library, which is another popular data mining package out there. There are about 150 to 200 different algorithms that you can typically use. We have quite a rich list of ensemble modeling techniques. A data science problem definitely requires you to explore some of these to get the best out of your data.

BP: Lastly, the platform, and I'll mention this a couple of times, also supports things like the ability to write your scripts in R or Python that you can push down as PySpark or SparkR scripts. That pretty much gives you access to almost 90 to 100% of the algorithms available out there.

BP: I think what RapidMiner provides in this context is a way to try these algorithms without having to worry about the code and the syntax and so on. But obviously, when you want to use something that we do not provide in core RapidMiner, you can explore algorithms supported by Python, so your possibilities are pretty endless there.

BP: Hope that answers your question.

David: Great. Thanks, BP. I'm going to skip down a little bit, guys. "How can I move from MTBF to a ML model for failure prediction?"

Ralf K.: I think I would need some light on what MTBF stands for to answer that one. If you don't mind, please put that in the chat. In the meantime, we will answer other questions.

David: Okay. No problem. Let's go to another. "Can I use my scripts in this environment?"

BP: Yes. Just like I answered previously, you're able to use R and Python scripts, and you can write your own Pig scripts. We also support Hive queries. You can do some scripting in Groovy, which is basically a Java-based scripting language. I think that there are a lot of options when it comes to scripting.

David: Okay.


Ralf K.: Yeah. Would you like me to take this one?

BP: Yeah, please.

Ralf K.: The mean time before failure basically means you have historic information about how long it typically takes before you have to replace a particular component. But this is not individual for the load that the machine has, or the stress the machine has; it is the average, or the mean. Which means it's not as accurate as it could be; it's treating all machines the same. Some machines may have a constant low load and may last very long, while other machines have high stress and a high frequency of changing requirements, which makes components wear out faster.

Ralf K.: The advantage of a machine-learning-based, data-driven approach is that you can do more machine-individual forecasts and, hence, tailor your maintenance more to the actual need instead of doing the standard thing, treating every machine the same.
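
The difference between a fleet-wide mean and machine-individual figures can be shown with a small calculation. The lifetimes and machine names below are invented, and per-machine means are only a stand-in for what a trained model would provide from sensor data.

```python
from statistics import mean

# Hypothetical observed component lifetimes in days, grouped by machine
lifetimes = {
    "low_load_machine": [400, 420, 380],
    "high_load_machine": [150, 170, 130],
}

# One number for the whole fleet: every machine is treated the same
fleet_mean = mean(d for days in lifetimes.values() for d in days)

# Per-machine means already show how misleading the fleet figure can be
per_machine = {machine: mean(days) for machine, days in lifetimes.items()}
```

Scheduling both machines at the fleet mean of 275 days would service the low-load machine far too early and the high-load machine far too late.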

Ralf K.: Now the question was, how can I move from the mean time before failure maintenance model to this machine-individual, data-driven model? Well, obviously, you have to first collect data to be able to do this. Like you saw in the example in the presentation, data about the machine failures from the past, you need data about sensor measurements, something that allows you to distinguish how the machines are treated differently to give you a chance to recognize, to say it in layman terms: which machines suffer more than others?

Ralf K.: So you can recognize from those metrics: okay, these machines probably need maintenance earlier than others. Obviously, we don't want to rely on "probably" or "I believe", so that's why we collect the data and then use statistics and machine learning to actually help us quantify, for each machine and for each component, how likely the failure is in a given time horizon. I hope this answers the question.

David: Thanks, Ralf. "How does the solution scale with data volumes?"

Ralf K.: It's a very good question. Obviously ... Okay. Sure, go ahead, please.

BP: No, no, go ahead. Okay. We both are eager to answer this one. Thanks to our partner MapR here, right? One of the beautiful things that Hadoop platforms and MapR provide is the scalability that comes from adding nodes and having a management layer on top that distributes the workload. What RapidMiner and MapR do together here is, from an end-user perspective, take away the complexity of Hadoop: you design a visual workflow by dragging and dropping these boxes and, if need be, adding some of your Spark scripts there. Right?

BP: At the end of the day, when that workflow is done and I execute it, whether it's triggered from Studio, where the user is building the workflows, or it's scheduled, or even if it's triggered by, let's say, a web service, what actually happens is that we take the workflow the user designed and translate it into jobs for the underlying Hadoop frameworks. Right?

BP: For example, some of it might become a Hive query, some of the objects might be converted into a Spark job, and so on. Basically, Studio automatically generates these jobs that get pushed to the cluster. Now the reason this is great is that, as your MapR cluster is where you store the data, we are actually pushing the processing of the data to that cluster, and it also gives you the flexibility to add more capacity.

BP: A typical use case you see with MapR is that you just start accumulating data, and over a period of time obviously that grows, and so does the computing power needed to go through it. As the MapR cluster grows, the same workflow that you designed on day one will continue to work, leveraging the distributed computing framework that MapR provides.

BP: Purely from a scaling perspective, it is going to be as scalable as your MapR cluster is. If it is able to handle, let's say, two petabytes of data and you have enough computing power to build a model using that, we are more than happy to let MapR do the heavy lifting here. We provide the computational logic and the design environment for the user, but MapR is the framework that does the heavy lifting for us here. So you scale as you scale the cluster, typically.

David: Well, thanks, BP. "What other use cases can RapidMiner handle?" [Crosstalk 00:53:46]

Ralf K.: There are hundreds of use cases. If you're in a manufacturing environment, those include, for example, automated quality prediction, optimizing mixtures of ingredients, optimizing production lines, like predicting bottlenecks, and doing what's called process mining to find bottlenecks and time piece in processes. This can be done for production processes but it can also be done for business processes. You have a lot of use cases around customer analytics, like churn prevention, direct marketing optimization, and customer lifetime value prediction. You have a lot of risk-related use cases, like fraud detection and prevention, and credit risk prediction.

Ralf K.: There are a lot of use cases in different industries. A few of them are actually visible within RapidMiner when you look at the application templates, but that's just the tip of the iceberg. If you want to find more, reach out to us; we can share slide decks with you listing a lot of the use cases that we have. Obviously, we cannot share every use case.

Ralf K.: Another way of finding out more about the use cases is to go to our community on the web. But some of the very popular use cases you will find here already as application templates. I think you can see we can also analyze unstructured data like textual data and automatically classify it by content or by sentiment and so on.

David: Great. Thanks, Ralf. Ralf, BP, and Rachel, so there's still a lot of questions and I know we're not going to have enough time to answer them all. All right. Which ones would you like to ... Is there anything you would like to jump down to, if you want to take a quick look?

BP: There are a couple of questions; I think I'll just summarize an answer for three or four questions in there. What we saw here was a supervised learning problem: we had some historical data, we learned from that and, obviously, we are using that to predict the future. But then there are questions around: can we do market segmentation, can we do frequency pattern analysis, so that we can find marketing segments, those kinds of things.

BP: The good news is the platform is the same. We focused earlier on a particular use case, but as Ralf is showing right now, there are dozens of techniques available for segmentation. There are techniques available for time series. There are techniques available for association rules or basic statistics. So everything is basically an operator away.

BP: I think that answers a few of the questions, and obviously more details are available on the website, and we can have a follow-up call there. But hopefully, that gives you a little bit more clarity. The platform is not designed for just supervised learning; pretty much any kind of machine learning problem you have, you can more than likely solve it with RapidMiner by building workflows using operators.

David: Okay. BP, Ralf, Rachel, anything else you want to add? Okay. Thank you, Ralf, Rachel, and thank you everyone for joining us. That is all the time we have for today. For more information on this topic and others, please visit Thank you again and have a great rest of your day.