MapR and RapidMiner: The Right Solution for Predicting Machine Failure


Ronak Chokshi

Product Marketing, MapR Technologies

Andries Engelbrecht

Partner Solution Engineering, MapR Technologies

Bhupendra Patil

Partner Solution Engineering, RapidMiner

Predictive maintenance is a technique used in various industries to reduce machine downtime by predicting its failure. It is fair to say that most enterprises consider this a difficult technique to deploy in production. The right implementation uses a combination of the following steps:

  • Real-time data ingestion from IoT devices
  • Extract-transform-load (ETL) of this data and writing it into a data store
  • Developing machine learning (ML) algorithms to extract insights into failure events and training these algorithms using the stored data
  • Deploying the final algorithm(s) in production onto the target environment
  • Monitoring the performance of the system and tuning the implementation as physical conditions change over time

MapR has partnered with RapidMiner to bring a holistic solution to help manufacturers predict machine failure accurately. The solution offers unique flexibility in design, multiple deployment options for convenient transition to production, and an easy-to-use UI.

Watch the recorded webinar to view a demonstration of our joint solution.


Ronak: 00:00 All right. Good morning, afternoon, evening, depending on where you are. We are going to take this hour to talk about predictive maintenance, our perspective of what the industry requires of an ideal solution to predict machine failure, and why you will find the MapR Data Platform to be the ideal platform for this use case. For this, we are partnered with RapidMiner, who we believe has the best data science platform. Let's dive in.

Ronak: 00:37 All right. This is the agenda for the webinar today. I will start with an executive summary and walk through the evolution of AI as it relates to predictive maintenance, a few challenges that we see manufacturers face, and our approach to building a solution that addresses those challenges, followed by a live demo and the reference architecture for predictive maintenance. We'll also have a few questions along the way to help us better understand where you are in the journey of predicting machine failure.

Ronak: 01:17 The three presenters are on the right of the slide. I am Ronak, and I'm joined by Andries and by Bhupendra Patil from RapidMiner. The two of them will walk us through a live demonstration.

Ronak: 01:41 All right. We know that organizations everywhere are embarking on their digital transformation journey and embracing machine learning, IoT, and big data. Right? Specifically, the intersection of these three technologies presents a very interesting set of business opportunities: developing the data pipeline that allows users across functional boundaries of an organization to build actionable insights and to apply machine learning in applications. All of these are cool, but at the same time, organizations face challenges when they deploy these applications as use cases into production. And predictive maintenance is one such application.

Ronak: 02:41 The focus though, is on metrics and driving tangible business outcomes. Whether it is improving uptime, improving productivity of machines, of your people, improving response time or keeping costs under control, right. Predictive maintenance like I said is one such technique, and it's deemed complex.

Ronak: 03:06 Today, along with RapidMiner, we will show you how we have been helping our customers use the best data platform and the best data science platform to successfully deploy predictive maintenance in production. We will also give real customer examples to illustrate these points.

Ronak: 03:31 I pulled this two-by-two from a 451 Research report. It should give you a glimpse of how the various use cases stack up against being simple versus complex on one hand, and whether a human is considered a better source of intelligence than a machine or otherwise. As you can see, predictive maintenance, along with use cases such as autonomous driving, image recognition, and so on, falls in the top right. It's considered not just complex, but essentially better handled by machines.

Ronak: 04:16 What we're going to show you today is that having the right solution matters when implementing these complex use cases and can get you faster time to market.

Ronak: 04:34 All right. Before getting into the specific challenges that make predictive maintenance complex, let's try to understand the alternatives. First of all, reactive maintenance. If you are a manufacturer, reactive maintenance means that you'll have to act after the fact, after the failure. It's just too expensive. Any kind of downtime, it actually impacts revenue and customer satisfaction. The other alternative is preventive maintenance, which really means that you're wasting dollars in repairs and replacement that you don't really need to. Again, expensive.

Ronak: 05:14 Predictive maintenance is really what brings together those three technologies that I mentioned earlier, machine learning, IoT, and big data, and accurately tells the plant manager, or whoever, exactly when to pay attention to the machine. Now, that said, it is generally perceived as complex, but through a demo of our joint technologies today, we'll hopefully make it clear that it really isn't.

Ronak: 05:51 All right. Let's dig a little deeper into some of the techniques used generally across industries today, right. This slide explains why we think a new approach to predicting machine failure is needed. You could be doing HVAC maintenance, vibration monitoring, oil analysis, corrosion control, things of that nature. It basically comes down to a few key imperatives.

Ronak: 06:22 You need a system that can handle high-frequency sampling, some sort of instrumentation. You need something that can handle multivariate data. You need the ability to iterate, so you can really fine-tune and accurately predict failure time. Lastly, you need some level of automation in the system, so that as you add more machines to the mix and across your facilities, the underlying solution that is predicting failures just works. Right?

Ronak: 06:58 These imperatives lead to key fundamental technical requirements for a solution to address this use case. Things such as a modern data platform that can provide persistence of data across files, tables, and messages. You need a streaming system that can capture data across a variety of streams and a variety of data sets and data types, and become the system of record. You need a system that does not require or mandate that you move data from one place to another when you move from analytics to machine learning and build new applications there. You need a system that plugs into your existing applications and augments them, and doesn't require you to develop new applications. Lastly, it should offer you the flexibility to develop, monitor, and tune machine learning implementations.

Ronak: 08:08 That's where we come in. That's where our joint value proposition shines. At MapR, our claim to fame really is that we are a leading data platform for AI and analytics. It's difficult to gather and manage data from the edge. We offer an exabyte-scale data store, the ability to create data pipelines, and lastly, a database that can collect and store all the data reliably, all abstracted within a single data platform. There is no data movement across multiple clouds, on-premises, edge, et cetera, and it gives developers the flexibility to choose the machine learning toolkits of their choice.

Ronak: 08:58 There's unified security and governance built right into the platform. In addition to that, it's fair to say that 80% of the data science process is spent basically in data wrangling. There is, unfortunately, a real need to improve the efficiency of the data science process. In RapidMiner, we see a very valuable partner that provides the number one data science platform. It really enables productivity for data scientists and engineers.

Ronak: 09:35 BP, anything you would like to add here?

Bhupendra: 09:39 You know, I think so. Quickly, a couple of sentences, right. We are basically the number one rated data science platform according to Gartner, Forrester, KDnuggets, and a bunch of other third-party companies. What we basically do is enable data scientists, your subject matter experts, and your business analysts to do data science and have a lightning-fast business impact. The RapidMiner ecosystem basically enables you to leverage the full data that's on the MapR cluster, whether distributed, cloud, or on-prem. However the setup is, we leverage the distributed computing power of MapR, and all of this is done in a very visual, code-optional framework. Essentially, you don't need programmers to do data science. But it's a code-optional platform, so you can bring your code if need be.

Bhupendra: 10:28 The bottom line is it's a very simplified and automated platform for everyone to use, with a low barrier to entry, but it also doesn't compromise on the deep needs of specialist data scientists, if that's what your role is. Together with MapR, we are making sure that we are helping data scientists and business analysts build predictive models that are easy to tune, easy to explain, and easy to trust. You don't want to put in systems that act as black boxes; you really want to put systems in place that you can run your business on confidently. At the end of the day, we are a collaboration platform. We like to think of ourselves like that in the context of a data science project, and here, in the context of our excellent partner, MapR, the two of us can provide a long-lasting benefit to your business.

Bhupendra: 11:20 We'll show some of the value today in the demo, which will follow in a few minutes from now. Until then, thank you for joining us, and back to you, Ronak.

Ronak: 11:28 All right. Awesome. In terms of the value that you get as a joint customer, you get shorter time to market. From the point you start working with us to the point you see economic impact from a predictive maintenance application, it's much, much shorter. You also get reduced TCO and project risk, since you are streamlining and eliminating data movement, like I was alluding to earlier.

Ronak: 12:03 All right. Best practices for a successful AI solution. I wanted to show how the two technologies come together to deliver a use case. In particular, in this case, predicting the remaining useful life of a piece of equipment. Right? For a use case like that, you would require machine learning models running in production, right. Very simple. Then, you need to partition the deployment of that model across wherever your data and sensors are, basically. Right? At the edge, in the cloud, on-premises, et cetera.

Ronak: 12:50 For that, you need to do a few additional things, such as feature extraction, development of the model itself, model selection, and deciding on the sort of learning that you want to implement for the use case, and this is all, obviously, an iterative process. Now, that is where RapidMiner shines.

Ronak: 13:16 Doing all of this requires a robust data platform, and that's where MapR comes in. MapR allows access to all the multivariate data, so users can get a holistic picture of the data set from the get-go, from all the silos, et cetera. It provides a system of record for streams, like I was saying earlier, which forms the underpinnings of a predictive maintenance platform in production, along with data security and governance, and essentially becomes your flexible and extensible data framework.

Ronak: 13:56 I have a poll in the next slide, here. I'd love to get your feedback and probably give you 10-15 seconds to go over the poll and select.

Ronak: 14:23 All right, I see responses trickling in. I'll give a few more seconds.

Ronak: 14:37 All right. Awesome. Thank you so much. We'll jump into the demo now, and I'll hand this over to Andries.

Andries: 14:49 Thank you, Ronak. I am Andries Engelbrecht, a senior solution architect at MapR, and I'll start by giving an outline of the pipeline for the demo that we're going to show. This diagram is actually just a subset of what customers are facing in reality. The left-hand side is what we refer to as the edge. Depending on the industry, this might be a mining site in the mining industry, a drilling platform or drilling site in oil and gas, or a manufacturing site in manufacturing. Typically there'll be a lot of these edge sites in a customer's environment.

Andries: 15:36 If I look at this environment from a data management or data logistics point of view, there are a number of challenges that we need to address to provide a proper data platform that allows IoT to work with all the data that's available.

Andries: 15:54 In this diagram, what you'll see on the left-hand side is various equipment in the edge environment, with lots of sensors connected that collect sensor telemetry data in real time. We need to be able to capture that data reliably at the edge so we don't lose any data that's available. To do that, we can deploy what we refer to as MapR Edge, which is a small deployment of the MapR Data Platform running on small-footprint commodity hardware at the edge. That allows us to reliably capture this information in real time as it streams in, using a process that then writes the streaming data onto the MapR Event Store.

Andries: 16:38 In this case, it will be telemetry data that gets captured on the MapR Event Store. The other advantage of doing this is that we can do some lightweight processing at the edge to potentially capture any data errors that can happen in this environment, or even do some cleaning up of data or some aggregation if need be. Then, we use a technology that we refer to as MapR Replication. It allows us, in real time, to transport data reliably over whatever network is available to a core site. Whether the core site is cloud-based or on-prem doesn't really matter in the MapR context.
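A minimal sketch of the kind of lightweight edge processing described here: dropping bad readings and aggregating the rest per time window before the data is shipped to the core site. The function name, the 60-second window, and the valid range are illustrative assumptions, not part of the MapR Data Platform or its APIs.

```python
from statistics import mean


def clean_and_aggregate(readings, window_seconds=60, valid_range=(0.0, 1000.0)):
    """Drop invalid readings, then average the rest per time window.

    readings: iterable of (timestamp_seconds, value) tuples.
    Returns a dict mapping window start time -> mean value in that window.
    The window size and valid range are hypothetical defaults.
    """
    low, high = valid_range
    windows = {}
    for ts, value in readings:
        # Treat missing or out-of-range values as sensor errors and skip them.
        if value is None or not (low <= value <= high):
            continue
        window_start = int(ts // window_seconds) * window_seconds
        windows.setdefault(window_start, []).append(value)
    return {start: mean(values) for start, values in sorted(windows.items())}


# Five raw readings: one missing value and one negative (invalid) value.
readings = [(0, 10.0), (30, 20.0), (45, None), (61, -5.0), (90, 40.0)]
summary = clean_and_aggregate(readings)
print(summary)
```

Aggregating at the edge reduces what has to cross the network, at the cost of losing the raw samples unless they are also kept locally.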

Andries: 17:19 If there's a network fault, which can happen very often in this environment, the data still gets reliably captured at the edge, and when network connectivity is restored, you can resume streaming data over the network to the core site. The data then gets consumed, again, in a MapR Event Store at the core site itself. Then the records can be further processed if needed, and the data lands in the MapR Database. What's powerful about this is that it gives us a lot of flexibility, because data can change a lot over time: additional equipment can be added, additional sensors can be placed, and the data format can change. The MapR Database is very flexible in this regard and allows different types of data formats to be stored, but it also allows us to store lots of historical data that is very useful to data scientists. As we know from speaking to any data scientist, the more data they have access to, the more reliable the models they can build to do predictions, et cetera.
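The store-and-forward behavior described here can be sketched as a local queue that keeps records until the link to the core site accepts them. This is only an illustration of the idea under stated assumptions; it is not how MapR Replication is implemented, and the `StoreAndForward` class and `send` callback are hypothetical names.

```python
from collections import deque


class StoreAndForward:
    """Buffer records locally and forward them, in order, when the link is up."""

    def __init__(self, send):
        # `send` is a hypothetical callable that raises ConnectionError on failure.
        self.send = send
        self.pending = deque()

    def publish(self, record):
        # Always persist locally first, then try to drain the queue.
        self.pending.append(record)
        self.flush()

    def flush(self):
        while self.pending:
            try:
                self.send(self.pending[0])
            except ConnectionError:
                return False  # Link is down: keep everything for a later retry.
            self.pending.popleft()  # Remove only after a successful send.
        return True


# Simulated link to the core site that starts out down.
delivered = []
link = {"up": False}


def send(record):
    if not link["up"]:
        raise ConnectionError("network down")
    delivered.append(record)


edge_buffer = StoreAndForward(send)
edge_buffer.publish("r1")   # buffered, link is down
edge_buffer.publish("r2")   # buffered, link is down
link["up"] = True
edge_buffer.publish("r3")   # link restored: r1, r2, r3 delivered in order
print(delivered)
```

Removing a record only after a successful send gives at-least-once delivery and preserves ordering, which is the property the transcript relies on when the network drops.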

Andries: 18:28 With this, the MapR platform is a single data platform, which allows us to manage data in a very reliable fashion with a single view from a security point of view. Typically, and I'll show that in a little bit in the GUI itself, we do that by creating volumes, so we can apply system-level governance depending on the type of data that comes in, to allow access control but also quota management, so we don't have one area consuming all the available resources. It's a single data platform holding the streaming data, the database, and even data available in file format. We can make this data available through Spark and Hive to RapidMiner, and when I hand it over to BP, he will go into how they can leverage this data to run it through various models and look at the results, et cetera.

Andries: 19:24 As these elements run through it and get processed, we have a feedback loop where data can come back from RapidMiner. In a lot of cases, that will be based on the probability of a prediction. That data comes back, and depending on the probability, we can route it to different places. If there is a high probability of failure of a system component within the machinery, that data can be placed on another event store. An advantage of that is that you can have various consumers listening to that data.

Andries: 20:03 As data streams onto the event store, it can either be picked up at the core site to trigger specific alarms, or it can send specific alerts. Similar to how we capture all the sensor data coming in from the edge, we can also send this type of information back to the event store at the edge. We can then trigger certain alarms or alerts at the edge site, to immediately warn that certain machines have a high probability of failure, so they can be taken offline for whatever maintenance needs to be done. It is very powerful to have that feedback loop.

Andries: 20:43 In addition, if we can take some other data that may not have as high a probability of failure, we can capture that data back into the MapR database that then allows us input for future models as we go forward and keep on learning and training the models as things move forward to just make the environment much richer.

Andries: 21:06 An advantage of this is that we can keep scaling and keep adding more and more data that gets fed back into these models. As new things are learned, and as new algorithms become available, that information is immediately available throughout.

Andries: 21:25 I just want to make a quick mention: you'll see StreamSets in the slides. StreamSets is another partner of ours, and we use it to manage some of the streaming data pipelines from a MapR point of view. With this, I'll just do a quick... Let me just share my desktop, and I'll be able to show you.

Andries: 21:46 This is a view of the MapR Control System at the core cluster [inaudible 00:21:51] environment. As I mentioned previously, the way we manage data from a MapR point of view is that we create different volumes for different types of data. In these volumes, we can do different types of data placement, and we can manage them from a quota as well as from a scale perspective. From a data governance point of view, when somebody needs audits, lineage, et cetera, it allows us to do those types of things from a volume perspective, but it also allows us to scale tremendously, because for different types of use cases you can create additional volumes, and we manage some metadata within that.

Andries: 22:34 In addition, as you see, on the same data platform, I mentioned the MapR Event Store to deal with streaming data, so we can add the event store within this, and then for each event store... I'm just scrolling down to show you. You have the stream itself, and a single stream, in this case the maintenance stream, can have various topics in it. From a management perspective, you can create additional topics that you can stream data into, so you can produce to the topic and also consume from the topic. Each of these topics can then be partitioned to allow for scale. In some of the work we've done with HP and others, we were able to manage millions of messages a second onto a single topic as need be. We don't have any scaling issues, and it becomes very powerful. It allows lots of producers to write to similar topics.
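To make the idea of partitioned topics concrete, here is a small sketch of key-based partitioning in the style commonly used by Kafka-like systems: each message key is hashed to pick a partition, so messages from one machine stay in order within one partition while load spreads across all of them. This is an assumption for illustration; the transcript does not say which partitioner MapR Event Store uses, and `partition_for` and `route` are hypothetical helpers.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Pick a partition from a stable hash of the message key."""
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(messages, num_partitions):
    """Distribute (key, payload) messages across partitions, preserving order."""
    partitions = [[] for _ in range(num_partitions)]
    for key, payload in messages:
        partitions[partition_for(key, num_partitions)].append((key, payload))
    return partitions


messages = [
    ("machine-1", "vibration=41.2"),
    ("machine-2", "vibration=39.8"),
    ("machine-1", "vibration=44.0"),
    ("machine-3", "vibration=40.1"),
    ("machine-1", "vibration=47.5"),
]
partitions = route(messages, num_partitions=4)
for index, contents in enumerate(partitions):
    print(index, contents)
```

Keying by machine ID is one possible choice; keying by edge site would instead keep each site's data together, which mirrors the stream-per-site or topic-per-site options discussed next.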

Andries: 23:33 By using different topics, you can then separate the data coming in. If you have different edge sites, you can potentially say, maybe I have a different stream for each edge site, or a different topic for each edge site, depending on how you want to structure your data.

Andries: 23:52 In addition to that, on the same platform, we can put MapR Database tables. As we mentioned, on these tables we can add the telemetry data that gets streamed in in real time, but we can also have various maintenance or reference data available. That makes the environment so much richer for data scientists, who can immediately access all the historical as well as real-time data coming in at the same time to feed their different models, which allows them to build much more accurate models.

Andries: 24:27 At the same time, from the same environment, we can view the typical metrics from a system administration point of view to see how busy the system is, what is going on within the platform, what types of operations are going on, and so forth. It gives you one platform to deal with the various data logistics issues. In a lot of cases, when speaking to data scientists, you find that if they don't have the proper platform, they spend a disproportionate amount of their time just doing data management versus focusing their time on data science itself.

Andries: 25:08 With this, I'll hand this over to BP.

Bhupendra: 25:16 Excellent. Thank you Andries. Let me share my screen first, and then I will get started here.

Bhupendra: 25:27 All right. As Andries mentioned, he set up an environment for me where we have all the sensor data, all the factory streaming data. As a data scientist, obviously, I want to leverage it, but I don't really want to get into the hassle of managing and maintaining a data system. That is what Andries has helped us out with today. In the interest of time in this webinar, we have already done a few things. What I have done here is taken the telemetry data, the errors data, some of the maintenance logs, and all the other data, and prepared the profile that you are seeing on your screen right now.

Bhupendra: 26:05 To quickly get you familiarized with my original data: my machines are streaming into my MapR cluster. I have some voltage information, some RPM rotation, some pressure, some vibration that's literally coming off my sensors, you know, thousands of times per second. I also have a summary of errors: how many times a particular error occurred. Errors one through five are the ones that I am really focused on. These are the ones that can cause machine failures and so on. Then, obviously, just because a component fails, I don't have to change the whole machine; I change that component, and there are a few components that I'm targeting right now: components one, two, three, and four. These particular columns are capturing the time since I last changed each component. Then there is the overall age of the machine: how long it has been, not particularly in my factory, but since the machine was delivered, I guess, right.
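One of the prepared columns described here, the time since each component was last changed, could be derived from a maintenance log roughly as follows. This is a sketch under assumptions: the log layout, the component names (`comp1` to `comp4`), and the dates are made up for illustration and are not the demo's actual data or RapidMiner's implementation.

```python
from datetime import datetime


def days_since_last_replacement(replacements, component, as_of):
    """Days between `as_of` and the most recent replacement of `component`.

    replacements: list of (datetime, component_name) tuples.
    Returns None if the component has no replacement on or before `as_of`.
    """
    dates = [when for when, comp in replacements if comp == component and when <= as_of]
    if not dates:
        return None
    return (as_of - max(dates)).days


# Hypothetical maintenance log for a single machine.
maintenance_log = [
    (datetime(2015, 1, 10), "comp1"),
    (datetime(2015, 3, 1), "comp2"),
    (datetime(2015, 4, 5), "comp1"),
]
as_of = datetime(2015, 4, 15)

features = {
    comp: days_since_last_replacement(maintenance_log, comp, as_of)
    for comp in ["comp1", "comp2", "comp3", "comp4"]
}
print(features)
```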

Bhupendra: 27:05 Obviously, we are looking at the same sort of machines, but they have different models and SKUs. I think the version that's sold in the German factories might be a little different than the one that has been sold in the U.S. factories. That's what model one through four indicates, right. These are just machine I [inaudible 00:27:20]. Then, you may clarify each one of those.

Bhupendra: 27:22 Really, what I have here is historical information about which component failed. So, component one, two, three, and four, these are the components I mentioned we are focusing on today. Obviously, there are a lot of nones. This is all time series data. Really, I have used the distributed computing power that MapR provides and RapidMiner's capabilities with the Radoop platform to build this particular data set here.

Bhupendra: 27:44 Again, we'll skip that data preparation part. The focus of today is how I can quickly build meaningful models out of this and then use them in production to prevent failure and also do predictive maintenance. Now that you guys are familiar with the data set, right, what we're really going to do is follow this wizard-like interface, which we call Auto Model. The problem belongs to the prediction kind of problem. I'm trying to build a model that'll help me identify my failures, if there's any failure, and then hopefully go one step further and tell me what kind of component is going to fail, right.

Bhupendra: 28:21 I've done that here, I've seen the data, I am now at the target variable. We go next, this is simply showing the distribution. As expected, I have a lot more nones, which is... you know, failures don't happen every day. That's irrelevant, then different failures of component one, two, and three, and so on, right. That looks good enough to me. This is what I expected out of the data preparation steps.

Bhupendra: 28:44 In this step, RapidMiner is trying to help me, as a data scientist, pick out the right set of columns. One of the challenges that I have to deal with is that we are picking up all this sensor information and making it all available to me, but as a data scientist, I have to make sure I am using the right columns and not overburdening my learning exercises. RapidMiner helps me identify the right columns here, right? A simple red/orange/green indicator tells me the overall quality of each column. It also gives me more meaningful information, like stability, whether there are a lot of missing values, whether there is free text, those kinds of things.

Bhupendra: 29:22 It also empowers me as a data scientist. I don't have to blindly trust this. It's not a black box. I can override it and say, hey, you know what? I know the total number of errors is going to be important. Let's capture that. Right? Let's make sure that's included while I build the models and so on. Maybe the date-time stamp I don't need, because I've already captured the age of the machine, and so on.

Bhupendra: 29:43 There you go. Out of dozens of columns, I have now selected the right set of columns to use. I hit next. In this part of the screen, RapidMiner is offering me a simple selection of which algorithms I can try on this data set. Now, many machine learning data scientists or practitioners watching this will know this. You spend a considerable amount of time on data prep, then a lot of iterations trying various algorithms, various parameters, various techniques, to see what fits your data well: what models, what algorithms are going to work well for that particular problem. With RapidMiner, it's just a matter of simply flipping the switch on or off to say let's try that model or not.

Bhupendra: 30:27 Over here on the right, we'll skip some of this, but quickly to highlight: if I enable the date extraction, it will automatically extract the age, that is, the time between now and the date-time column, as well as the month of the date, the year of the date, and so on. So, if you're having trouble on, let's say, Saturdays, because we are low staffed, or maybe there are voltage problems and those kinds of things, it will be important that you can extract that information. It is similar for text information; maybe our maintenance records are textual logs. We want to extract tokens or keywords from them. That's what this will automatically do for you.
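The date extraction described here can be sketched with the Python standard library: from one date-time value, derive the age relative to now, the calendar parts, and the day of the week. This only illustrates the idea; it is not RapidMiner's implementation, and the feature names are assumptions.

```python
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def extract_date_features(timestamp, now):
    """Derive model-friendly features from a single date-time value."""
    weekday_index = timestamp.weekday()  # Monday is 0, Sunday is 6
    return {
        "age_days": (now - timestamp).days,
        "year": timestamp.year,
        "month": timestamp.month,
        "day_of_month": timestamp.day,
        "day_of_week": DAY_NAMES[weekday_index],
        "is_weekend": weekday_index >= 5,
    }


# A hypothetical reading taken on 2 June 2018, evaluated ten days later.
features = extract_date_features(
    datetime(2018, 6, 2, 14, 30),
    now=datetime(2018, 6, 12, 14, 30),
)
print(features)
```

A `day_of_week` or `is_weekend` column is what would let a model pick up a pattern such as more failures on low-staffed Saturdays.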

Bhupendra: 31:02 As the industry evolves, as people are moving more towards IoT, there are many, many, many things that we are trying to capture: an overwhelming number of attributes per asset. This will help me automatically select the right list of features. Along with that, if I enable feature generation, it'll also automatically find relationships between existing columns and generate new columns for me.

Bhupendra: 31:26 This is basically a simple workbench, a cockpit, for the data scientist to leverage all the powerful features that RapidMiner provides by simply flipping switches on and off. At this point, all it takes for me to try the various algorithms is to select them and hit run. RapidMiner starts building the models... one is already built. I'll quickly move to a different screen. This is what it'll look like eventually, right. Now, overall, we have built about 180 different models belonging to the various algorithm families that you see here. What you see are the best of each of the families. Right?

Bhupendra: 32:01 So, for logistic regression, my best model had 85%. My decision tree had 97%. Now, this is all geeky data scientist stuff, right? I can spend days and days at length understanding accuracy and standard deviation. I can go into my weights and understand the factors that are influencing the predictions here. I can look into my performance and see, hey, how is the test data behaving? Am I seeing a particular class getting more precision, lower recall, those kinds of things.
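Because most rows in this data set are "none", a single accuracy number can hide weak performance on the rare failure classes, which is why per-class precision and recall are worth checking. A minimal sketch of those calculations, with made-up labels rather than the demo's results:

```python
from collections import Counter


def per_class_metrics(actual, predicted):
    """Return overall accuracy plus precision and recall for each class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for a, p in zip(actual, predicted):
        if a == p:
            tp[a] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[a] += 1  # actual class a was missed
    metrics = {}
    for label in sorted(set(actual) | set(predicted)):
        predicted_count = tp[label] + fp[label]
        actual_count = tp[label] + fn[label]
        metrics[label] = {
            "precision": tp[label] / predicted_count if predicted_count else 0.0,
            "recall": tp[label] / actual_count if actual_count else 0.0,
        }
    accuracy = sum(tp.values()) / len(actual)
    return accuracy, metrics


# Hypothetical test-set labels: mostly "none", a few component failures.
actual = ["none", "none", "none", "none", "comp2", "comp2", "comp4", "none"]
predicted = ["none", "none", "none", "comp2", "comp2", "none", "comp4", "none"]

accuracy, metrics = per_class_metrics(actual, predicted)
print(accuracy)
print(metrics)
```

In this toy example the overall accuracy is 75%, yet only half of the actual component two failures are caught, which is the kind of gap the performance view is meant to surface.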

Bhupendra: 32:32 I can geek out in this particular panel to understand which of these models are great. But you know what? My business is not going to understand them, so I need a way to have that communication back with my business first, right. For that, RapidMiner enables me with a couple of additional things. If I go to my predictions tab here, along with the prediction from the test data set, it actually tells me, in a simple color-coded manner, which columns are more influential. Obviously, the darker the shade, the more influential toward a positive outcome. The redder it is, the more influential in the negative direction, right.

Bhupendra: 33:05 That helps me have a more basic understanding of what the model is, beyond the math of the model, right? Basically, I can leverage this to have a discussion and dialogue with my business, saying, hey, these models are tending to predict in a certain way. But I can also use my simulator here, which is a very cool feature that allows me to select particular inputs to the model and see how it will behave. For example, if I change, let's say, my mean rotation, you'll notice my prediction quickly flips from component two failing to none. Maybe there is something going on here, right. If I change my rotation again, maybe my machinery operates well, and component two doesn't fail if I have rotation at a certain level. It also tells me the other factors that are influencing the prediction.

Bhupendra: 33:55 Now, we are beyond the realm of what the model is mathematically; here it's actually telling me what the factors are and how the model will behave in the case of various input changes and so on, right. Looking at all this rich information I have, I will eventually come to a conclusion with my business. What's the right model to use? What's the right model for it? Sometimes, you know, this might not give me the final answer, so at that point, I may have to go into it further, so I can click on open process here and look at the RapidMiner solution. This is an executable, modifiable workflow. This is not just a pretty picture, but really a solution that we can modify and edit. If you wish to bring in your R or Python scripts at this point, totally bring them in here. If you want to switch the algorithm with any of our other 200+ algorithms, absolutely do that.

Bhupendra: 34:42 The idea is, auto model got you that quick early win. You were able to leverage all the excellent power that MapR brings to the table, and were able to build the full set of models. For argument's sake, let's say I want to go ahead and deploy that model.

Bhupendra: 34:57 Pretty straightforward stuff with RapidMiner. Once the models are built, I'm actually going to open a quick process here, yep. You design a workflow like this. You are simply dragging operators and combining them to build a solution, so you're again not coding or scripting anything. You are simply taking the output of all these models, and so on, combining it with the current sets of data, and applying the model to get the prediction out.

Bhupendra: 35:22 Here's the nice thing, right. A typical airline engine has more than 100,000 parts. If I just tell you the machine is going to fail, good luck finding which part is going to fail among those 100,000 components, right, or what the reason for it is. What RapidMiner can do is, after your model application, you can augment that with explained predictions. I'll quickly show what the output of that looks like, here.

Bhupendra: 35:46 Essentially, it's now telling me which component is going to fail. In this case, component four is failing. That's the component failure, but then it also augments the whole output with the reason why it's going to fail. It thinks it's going to fail because the mean vibration is above a certain threshold, the rotation is, again, at this level, and so on. Not only is it telling me what component is failing, but also the reasons behind it. That way, when you're sending that engineer to fix that problem, he'll know exactly where to focus. He can understand, hey, there is a higher vibration than normal, or maybe the temperature is lower than normal, so he knows exactly where to focus, and that way, he saves some time and solves problems as they happen here.

Bhupendra: 36:28 Obviously, this is all still in my fancy little RapidMiner workbench. I could be sitting in my office solving the world hunger problem, but if nobody is consuming that solution, it is still useless. To deliver the solution, what we have done is put one quick solution on the MapR side. On the MapR side, we have set up a RESTful endpoint, which is now set up to handle incoming streaming data from RapidMiner over the REST protocol. What the process does here is, as the data comes in, it evaluates the prediction and the confidence values and so on. If something is predicted to be a highly likely failure with a high confidence, we can alert someone. If it is somewhere in the middle zone, let's say between 70-95% confidence, then we're going to set it up for review so that my site engineers can have a look at it and still have a man-in-the-loop kind of solution in place.

Bhupendra: 37:27 Obviously, where the confidence values are low, or if nothing is going to happen and we don't expect any failure, we just discard that, right? This is how we have set up the solution on the MapR side. Obviously, we could also take this one step further. The alert could be as simple as sending out an email saying, hey, this machine, this particular component is bound to fail, or maybe even calling another web service which stops the machine or pauses the machine, or raises the fire alarm or whatever. Again, that will depend on the context of what the error is and what we are trying to do. But we have now set up a solution on the MapR side using StreamSets to get some incoming data from RapidMiner. Basically the predictions, and the support, and so on. Right?
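
The routing just described can be sketched in a few lines. This is a minimal Python illustration, not the StreamSets pipeline shown in the demo: the function name `route_prediction`, the `"none"` label for "no failure expected", and the exact cut-offs (0.70 and 0.95, taken from the 70-95% review band mentioned above) are assumptions.

```python
def route_prediction(predicted_failure: str, confidence: float) -> str:
    """Decide what to do with one scored record: 'alert', 'review', or 'discard'."""
    # Nothing is expected to fail, so there is nothing to act on.
    if predicted_failure == "none":
        return "discard"
    # High-confidence failure: alert someone (email, web service call, ...).
    if confidence > 0.95:
        return "alert"
    # Middle zone: queue for a site engineer to review.
    if confidence >= 0.70:
        return "review"
    # Low confidence: drop the record.
    return "discard"
```

Keeping the decision in one small function makes it easy to change the thresholds later, or to swap the alert action for a call to another service.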

Bhupendra: 38:11 So far we have actually built that. We've applied a model. We have explained the predictions. Then, I simply have to process the output into the data format that I want. This is what it'll look like. I'm basically sending, for machine I81, my prediction and confidence. The confidence for component four failing is very high, one, that is, 100 percent, along with the reasons supporting and, as I said, contradicting it. Rather than sending all the data back, I'm going to send this meaningful information back. RapidMiner provides a quick way to make an HTTP request, so, with an operator, you call out to the MapR web service endpoint that's been hosted. I'm going to pass this data over in JSON format, and that's all it takes.

Bhupendra: 38:56 This could be processing a million rows, a billion rows, whatever. RapidMiner will make that number of requests out, and on the other side, MapR will process all of this data in a streaming fashion.
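
As a rough illustration of the hand-off described above, the following Python sketch builds one JSON prediction message and wraps it in an HTTP POST request. The endpoint URL and all field names are hypothetical, since the talk does not show the actual payload schema, and in the demo the request is made by a RapidMiner operator rather than by code.

```python
import json
import urllib.request

# Hypothetical endpoint; the real URL of the hosted MapR-side service is not given.
ENDPOINT = "http://mapr-endpoint.example.com/predictions"


def build_prediction_request(machine_id, component, confidence,
                             supporting, contradicting):
    """Package one prediction and its explanation as a JSON POST request."""
    payload = {
        "machine_id": machine_id,          # assumed field names throughout
        "predicted_failure": component,
        "confidence": confidence,
        "supporting": supporting,          # reasons that support the prediction
        "contradicting": contradicting,    # reasons that speak against it
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_prediction_request(
    "I81", "component 4", 1.0,
    supporting=["mean vibration above threshold"],
    contradicting=[],
)
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
```

Only the prediction, its confidence, and the reasons travel over the wire, which keeps each request small even when millions of them are made.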

Bhupendra: 39:08 As you noticed, what happened here was, Andries set up the data for me. He set up the infrastructure for me. I was able to leverage all of that computing power, all of that rich information to build that model. I was able to then leverage... if it will actually let me use this slide here, then I was able to actually leverage the complete power of the cluster to build my models. Once I had identified the right set of models, I was able to deploy them, and then send the results of the predictions back to my cluster for, again, EDGE processing and so on.

Bhupendra: 39:46 We were able to really complete the loop here. Not only did we build a predictive model, but we deployed it and made the results available in the place where it matters, that is, back to the StreamSets solution hosted on top of the MapR cluster. Now, all of this sounds good in theory, and I'm showing you some examples here. But the key question is, have we done this in the real world? The short answer is yes.

Bhupendra: 40:10 One of our largest customers is Lufthansa Industry Solutions, which is a provider of maintenance, transportation, and logistical services for aircraft companies like Lufthansa and Eurowings. They have built a solution on top of RapidMiner for predictive maintenance. They are trying to predict, for which aircraft, what component is going to fail. Then, they take it a step further so they can allocate downtime for this aircraft, making sure things are serviced in time and component failure is kept to a minimum. The goal is not only to do predictive maintenance, but also to make sure that those planes are available for maximum service capabilities and so on.

Bhupendra: 40:51 Overall, we are very, very proud to say that using a solution on top of a platform like ours, they have been able to successfully reduce the downtime on several of the major issues, down by almost 20% there. That's a very, very big win for RapidMiner in this particular domain and in this particular use case, and we bring the expertise in along with our partners here at Lufthansa Industry Solutions to solve real world [inaudible 00:41:16]. For a couple of other use cases that we have here, I'll hand it back to Ronak, but thank you again.

Ronak: 41:24 All right, awesome. Thank you, BP. In addition to Lufthansa, Sanchez is another great story that we wanted to talk about. They are a U.S.-based exploration and production company. Their focus was basically building a solution for predictive maintenance for onshore rigs all across the U.S. They really had a hard time bringing data up from these rigs and doing anything useful with it without it taking days, basically. We were able to implement MapR in their environment, bring in data from literally hundreds of rigs, help them with monitoring the performance of the rigs, and essentially build a solution for them to do predictive maintenance on those rigs.

Ronak: 42:34 The third example we have here is Tupras, which is Turkey's largest oil refinery. This is a classic case where we were able to deploy the MapR Data Platform at the main refinery site, and MapR Edge, just as Andries was explaining in an earlier slide. At the same time, they are also using the MapR Event Store to bring data from the EDGE sites, the remote refinery sites, to the core, the main refinery site, and do predictive maintenance, essentially to improve productivity and reduce downtime for the refinery.

Ronak: 43:24 All right. After this, I have one more poll. I'd like your feedback again. There you go. I would appreciate it if you could take a few seconds here and enter your responses. I'm going to wait for a few... 10-15 seconds here. I see responses trickling in, so thank you so much.

Ronak: 44:04 Give it a few more seconds.

Ronak: 44:12 All right awesome. Thank you for that. Just to summarize the discussion today, and a few key takeaways. Hopefully, this was a good overview of the technologies that the two companies have to offer. The idea here is that you, as an organization, shouldn't need to take your equipment offline, right? There are technologies at our disposal. All you need is the right solution, like we've been saying in this webinar.

Ronak: 44:49 The crucial components for a solution of this kind basically boil down to four things. First, the ability to bring all of that IoT data from an EDGE site to a primary cluster or platform, and to access that data, all the multimedia data, across on-premises, in the cloud, or at the EDGE. Then, the flexibility of developing machine learning models and deploying them in production near the target environment, so hopefully you saw that in BP's demo, and essentially doing all of this in a very industrialized way. By that, I mean if you go from one use case or application to the other, then to the next, you shouldn't have to restart your effort. You should be able to use the platform and do more with it. That's the idea here.

Ronak: 45:52 All right. I just wanted to close the session saying that this is not the first time we are getting together with RapidMiner. I just wanted to pass over a link to a previous joint webinar that we did back in July. The focus there was a little bit different. I won't go into those details at this point.

Ronak: 46:20 All right. Hopefully, this was interesting. This got you interested. Feel free to connect with any of us here, the presenters, or reach out to us. At this point, we can start taking questions. I see them queued up here.

David: 46:45 Yeah. At this time, just a friendly reminder, you can submit a question in the chat window in the bottom left-hand corner of your browser. The first question is for Andries. How does MapR manage network failure between EDGE sites and core sites?

Andries: 47:05 Typically, when the application is set up with the MapR Event Store for Apache Kafka, as data streams in, it will be asynchronously replicated to the core site. But in the case of a network failure, the data will still be produced into the MapR Event Store on the EDGE side and keep on being collected. As soon as the network connection is reestablished, the data will then continue to be replicated from where it left off on to the core site, where it can then be consumed from the event store on the MapR core site. In addition, MapR can do encryption and compression over the network as well, which most customers actually do.
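
The behavior described above, where the edge keeps collecting during an outage and replication resumes from where it left off, can be modeled with a toy Python class. This is only an illustration of the idea under simplifying assumptions (in-memory lists, a single replication offset); it is not how MapR implements stream replication, and the class and method names are made up.

```python
class EdgeStream:
    """Toy model of an edge-side log that is replicated asynchronously to a core site."""

    def __init__(self):
        self.log = []        # messages persisted at the edge
        self.core = []       # messages that have reached the core site
        self.replicated = 0  # offset of the next message still to replicate

    def produce(self, message):
        # Producers always write locally, whether or not the network is up.
        self.log.append(message)

    def replicate(self, network_up):
        """Copy pending messages to the core; return how many were sent."""
        if not network_up:
            return 0
        pending = self.log[self.replicated:]
        self.core.extend(pending)
        self.replicated = len(self.log)
        return len(pending)
```

Because the replication offset only advances when messages actually reach the core, nothing produced during the outage is lost or duplicated once the link returns.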

David: 47:53 Great. Thanks, Andries. BP, which one would you like to go for?

Bhupendra: 48:01 Okay. I think I'll answer the first one here. Pretty straightforward, how does the solution scale? First of all, thanks to technologies like MapR, which provide the distributed processing capabilities, we're able to scale during the learning phases, during the model training phases, simply by pushing our workloads into the cluster. A big thanks to the Hadoop ecosystem in MapR here. On the deployment side of the house, if RapidMiner is responsible for doing real-time web service based scoring, then our solution can obviously be scaled both horizontally and vertically, as it is put in a Dockerized, containerized environment and then managed with Kubernetes or something of that sort.

Bhupendra: 48:43 Really, depending on the scale of things happening, we can auto scale up or down as the demand requires. Enterprises of all sizes have used RapidMiner for solving real-time problems. Then, along with partners like MapR, we have pretty much solved most of the scaling challenges that are out there. I think with our joint solution, the challenge around how much data we can handle and how fast we can process it should not be a major concern for any of our users. Hopefully, that answers your question. Thank you. Back to you, Ronak.

David: 49:19 Thank you BP. One more for Andries. How does MapR manage multiple EDGE clusters replicating to the same stream, and the core cluster?

Andries: 49:29 Typically, what happens in environments where we have multiple EDGE sites is that you will normally use the same stream in the MapR Event Store to have data coming in, but for the telemetry data coming in, you'll normally add a topic for each one of those EDGE sites. That way, for the different sites, you don't have the offsets stepping on each other. That's the way we typically deal with it. That's why it's very powerful with MapR that you're really not limited in the number of topics you can add to the stream. As your EDGE sites grow and tend to expand, you can keep on increasing the number of topics that you have available to you.
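
A small Python sketch of the one-topic-per-EDGE-site layout follows. It is a toy in-memory model, not a MapR client: the stream path, the `telemetry-<site>` naming scheme, and the class name are assumptions, although the `stream-path:topic` form follows the way MapR Event Store topics are usually addressed.

```python
from collections import defaultdict


class SharedStream:
    """Toy model of one stream that holds a separate topic per EDGE site."""

    def __init__(self, path):
        self.path = path                 # e.g. "/apps/telemetry" (hypothetical)
        self.topics = defaultdict(list)  # topic name -> ordered messages

    def topic_for(self, site_id):
        # Assumed naming scheme: one telemetry topic per site.
        return f"{self.path}:telemetry-{site_id}"

    def publish(self, site_id, message):
        """Append to the site's own topic and return (topic, offset)."""
        topic = self.topic_for(site_id)
        self.topics[topic].append(message)
        return topic, len(self.topics[topic]) - 1
```

Since each site appends to its own topic, offsets are counted independently per site, which is the point made above about offsets not stepping on each other.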

Ronak: 50:14 I see a question here. It asks, what are the key attributes for predicting machine failure? It depends on what kind of machine it is and on the nature of the failure you are predicting. It could be alarms, it could be historical data, and it depends on what is important to monitor in the machine. Before it goes down, which components are important to you? BP showed that in his demo. You could track the individual components, you could track oil levels, you could track corrosion levels, you could track temperature. All of these are good attributes, and as long as data on an attribute is available in a historical fashion, it forms a good attribute.

David: 51:29 BP, is there another question that you would like to address?

Bhupendra: 51:33 Yeah. I think we have a couple of students on the line. The question is, is there a student version without substantial charges to learn how to use the platform?

Bhupendra: 51:42 The answer is yes, but I think broadly, for everybody on the call, and anybody who is going to watch the recording later on, the RapidMiner platform is available for trial purposes as well, obviously. If you are in an academic setting, you can have special education licenses. Again, please visit our website as you see on the screen, here. You'll find appropriate download links that should get you a free trial to explore our platform and go from there.

Bhupendra: 52:07 MapR guys, you want to add something to that?

Andries: 52:13 Yes. From my perspective, we have MapR community edition available, or a MapR sandbox, but typically, you can just download a community edition. Then, you can deploy it either on the cloud or on a very small environment, wherever you need to.

Ronak: 52:33 I see one more question. It says is it possible to use the data set that you used in the demo along with a MapR demo? So, Robert Harding, just connect with us offline, and we'll help you through your question. Thank you.

Andries: 52:52 Then, I see from Frank, there's a question. Can you allocate alarms to monitor in real time on the EDGE? As we showed in the demo, if you have to go through a trained model, there will obviously be some latency from the data getting collected, getting replicated to the core site, and being run through the models. But in a lot of cases, some of the output of these models can be specific triggers on certain sensor data, which can actually be deployed on the EDGE site itself. So you can have a consumer sitting on the telemetry stream itself, and in real time, basically monitor for those specific triggers on the EDGE itself. Then, alert according to that.

Andries: 53:37 But if it has to go through the full cycle, there will be some latency involved.
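
The idea of a consumer sitting on the telemetry stream and watching for specific triggers can be sketched as below. This is a minimal Python illustration over an in-memory list of readings; the sensor names, threshold values, and the function name `edge_alerts` are hypothetical, and a real deployment would read from the stream with a consumer client instead.

```python
def edge_alerts(readings, thresholds):
    """Yield (machine_id, sensor, value) for every reading that crosses a trigger.

    `readings` is an iterable of dicts such as {"machine_id": "I81", "vibration": 55.0};
    `thresholds` maps a sensor name to its upper limit (assumed to come from model output).
    """
    for reading in readings:
        for sensor, limit in thresholds.items():
            value = reading.get(sensor)
            # Skip sensors missing from this reading; fire only above the limit.
            if value is not None and value > limit:
                yield (reading["machine_id"], sensor, value)
```

Because the check is a plain comparison per reading, it can run next to the machines with very little latency, while the full model cycle continues at the core site.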

Bhupendra: 53:44 I see a couple of questions here about comparing either RapidMiner or MapR to either a specialized platform for predictive maintenance or some of the other well-known names in this space. I think in the interest of time, we will not answer the question. Obviously, in a public forum, we don't want to be bashing our friendly foes, but I highly recommend you guys reach out to us. We definitely want you to understand how we are different from some of the other vendors in this space, especially the ones who offer one particular solution, and what other limitations you might face over there. Please reach out to both MapR and RapidMiner on how the two of us can put together a joint solution for you in a shorter time frame and probably at a better cost all the same.

Bhupendra: 54:29 Back to you guys.

Ronak: 54:31 I see one more question from Salvi. He is asking, how does a customer deal with... security of data in multi-cloud environments? Great question. We have several customers who have deployed MapR in multiple cloud environments. In fact, I shared the customer story of Sanchez. They're actually deployed with two cloud service providers. That is just one example. From a platform perspective, we have unified security. It's not bolted on, it's built in. There's a difference there. It's unified security built into the platform, so the data governance, security, encryption, authentication, and authorization to individual volumes as you bring data into the platform, into the MapR cluster, all of that is accounted for as part of that function. All right.

David: 55:50 Maybe two more.

Ronak: 55:52 Two more. BP do you have any that you want to go for? We have two more. Time for two more questions.

Bhupendra: 56:03 Okay, okay. There's one question around how the predictive model is put into production after this process, and another question along the same lines, how is it managed?

Bhupendra: 56:20 The short answer is that the models built by RapidMiner can be deployed in more than one way. Obviously, you can do the classic batch-based approach where we could schedule things, or you could trigger things for batch processing, and so on. Along with that, any solution we build with RapidMiner is made available as a RESTful endpoint, and that means you can communicate with RapidMiner from external systems. So your EDGE devices, your machinery can actually talk to RapidMiner, saying, hey, this is the sensor data, tell me if I'm going to fail or not, and that can happen in real time, thousands and thousands of times per second.

Bhupendra: 56:57 Obviously, what we showed today was another approach to do the same thing, but rather than somebody calling RapidMiner, we have some sort of a micro batch going here, where for a small window of time, we look at the machinery information, and if anything is predicted to fail, we are actually pinging StreamSets to do the rest of the processing when things are going bad. Depending on what kind of machinery you're looking at, you could do a batch, you could do a micro batch, you could do real-time processing, stream processing, and so on.
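
To make the micro-batch idea concrete, here is a minimal Python sketch that groups timestamped records into fixed time windows, so that each window can be scored as one small batch. The window length and the `(timestamp, payload)` record shape are assumptions; the demo builds this with RapidMiner operators rather than code.

```python
def micro_batches(records, window_seconds):
    """Group (timestamp, payload) records into consecutive fixed-size time windows.

    Returns a list of batches ordered by window start; empty windows are skipped.
    """
    batches = {}
    for timestamp, payload in records:
        # Align each record to the start of the window it falls into.
        window_start = int(timestamp // window_seconds) * window_seconds
        batches.setdefault(window_start, []).append(payload)
    return [batches[start] for start in sorted(batches)]
```

A shorter window moves the behavior closer to real-time scoring, while a longer one approaches classic batch processing, which is the trade-off described above.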

Bhupendra: 57:26 There is more than one way to deploy the solution. The good news is, as you noticed, there was no coding or scripting required. I was not really handing off things to another software engineer to write code for me. When the model is built, you're pretty much a few clicks away from getting it into production, if that is what you want to do immediately next. Obviously, with MapR in the scene, we can scale it up to as many clusters and as many EDGE devices as need be here. Hopefully, that answers your question. Thank you.

Ronak: 57:58 One more question there. Okay. There's a question that says, does the data on MapR have to be sampled so that it can be used in RapidMiner for auto model building?

Ronak: 58:10 Great question. There is a little bit of work involved. Not a whole lot. BP, I don't know if you want to add to that, but the question is from Chia Lee. Again, feel free to connect with us offline, and we can explain this in greater detail.