Senior Technologist, MapR Technologies
Director, Data Engineering, MapR Technologies
Customers and prospects of Cloudera and Hortonworks are confused, and rightly so. With numerous redundant projects, it is unclear which ones will stay and which will go once the two companies merge. Offerings will be “rationalized” over time, as Cloudera promises a Unity release sometime in the months and years ahead. Regardless, neither company has any single-platform, production-ready offering in the areas that matter most to organizations today - that of AI/ML, Hybrid Cloud, Containers, Operational Analytics, and IoT.
Fortunately, there is no need to wait. MapR provides Clarity today.
Learn more about the MapR Clarity Program, including:
Mitesh Shah: 00:00 All right, thank you David. As David mentioned, I am joined here in the room by Dmitry, and together we are very excited, thrilled in fact, to talk to you about the MapR Clarity program, the program that we announced just a few weeks ago. By way of overview, we're gonna spend just a few minutes up front talking about a little bit of the background and context and why we created the program in the first place. And then dive into the detail around the program itself and, most importantly, how it might benefit you as a customer or as an organization. And then we'll conclude, as David mentioned, with 10 to 15 minutes for Q and A. So with that we'll dive right in.
Mitesh Shah: 00:44 All right. So by way of background, why create this program in the first place? Well, really as we see it in the industry, there are two evolving dilemmas. On the one hand we have a merger, as you probably all have heard by now, that was announced in the industry between Cloudera and Hortonworks. This was announced in early October, I believe. And our belief is that this merger is actually going to cause a lot of confusion, both internally within the organizations of Cloudera and Hortonworks, but also for their customers and prospects. The primary reason for that is that these organizations have many duplicate offerings, along many different dimensions that we'll talk through today: around security with Sentry and Ranger, for example, governance between Navigator and Atlas. And there's going to be some rationalization happening within these two organizations over time. It's unlikely that they're going to support both of these products over time, and so which will they choose? Do they know, and if they don't know, where does that leave the customers and the organizations? So there's gonna be a lot of confusion there.
Mitesh Shah: 01:59 That's on the one side. On the other side, you've got existing concerns about how you can evolve from analytics to AI. A lot of organizations wanna drive initiatives around AI and machine learning and deep learning. How do you do that and evolve really from an analytics framework into AI? What about Multi-cloud and Hybrid cloud initiatives? How do you support that? And what about Kubernetes and containers? How do you take advantage of that and really make sure that you're able to support stateful containerized applications? So these are the real business challenges that are already happening and that legacy technologies, and certainly technologies from Cloudera and Hortonworks, aren't able to solve today.
Mitesh Shah: 02:42 So you've got that on one end and you've got the confusion on the other with the merger, and that's really creating what we call a perfect storm of confusion. And our belief is that MapR really provides that clear path forward with our data platform and a couple of other angles to our program, the Clarity program, that we'll talk about in just a minute. So let's double click first on this notion of rationalizing. Why do we believe that these organizations, as they merge in the coming months, will need to rationalize? Well, as I mentioned, there are a lot of duplicate offerings within these two organizations, and from a business perspective, it simply makes no sense to support duplicate projects and duplicate offerings in the long term. And so by their own admission, their executives are out there talking now about the so-called Unity release.
Mitesh Shah: 03:38 This Unity release really has two problems with it as we see it. Problem number one is that there are these duplicate offerings that will need to be rationalized, and we don't know which ones will need to be rationalized, and likely they don't either at this point. They'll probably figure it out over some time. But the second problem is that the timeframe is unclear, right? They say that the first phase of that will come out in the next few months. And then in the longer term it will come out in the months and years to follow, and the timeframe is extremely uncertain. So really two problems with this.
Mitesh Shah: 04:13 Well, in the meantime, MapR has built an enterprise-grade, production-ready platform, and we've done that since day one, almost 10 years ago. We've supported our customers in their production needs, meeting their production SLAs. And second, we've supported containers, and specifically stateful containerized applications, for more than two years, right? These are areas where our competitors, Cloudera and Hortonworks, have never caught up to MapR and frankly, given their underlying architecture, it is likely they never will. But I digress, right? I'll just leave you with a quote here at the bottom. Again, this is in their own words: this is Hortonworks CTO Scott Gnau, speaking about creating this so-called frictionless Hybrid environment and being able to support Big Data management whether data is stored on-premises or in the public cloud. Well, his statement is, "but there is still a lot of engineering work to be done."
Mitesh Shah: 05:12 So again, this is by their own admission here. They don't have a solution for this today and MapR does. So with that as context, right? Let's dive into the program itself. So the program is really in three pieces, three parts here. Part one is the platform itself, the MapR Data Platform. That platform is available now. It's available today and it's solving our customers' needs around the data initiatives that matter most to them: initiatives around AI and ML, Hybrid and Multi-cloud, containers and of course operational analytics. So we'll talk in much more detail here over the course of the next 20 to 25 minutes about the MapR Data Platform relative to Cloudera and Hortonworks, but there are two other pieces to the program that we really wanna call attention to.
Mitesh Shah: 06:01 And the motivation behind these remaining two pieces is that we recognize that there's a lot of confusion out there around these technologies, right? Not just around the technologies themselves and how to use them, but once you're ready to make the leap to MapR, you probably have some concerns and questions around what use cases are best suited, et cetera. So we really wanna take this opportunity to clear up that confusion, and we're doing that with the remaining two parts, parts two and three.
Mitesh Shah: 06:35 Part two is Step Up with On-Demand Training. If there is confusion around these technologies, the on-demand training that MapR offers is a great way to get trained up on these things, right? So we have higher-level, sort of business-level courses around things like Kubernetes and AI/ML. And then if you're interested, as a developer or data scientist, in doing a deeper dive or even getting certified against technologies that are trending these days like Spark, or technologies around SQL Analytics, or even getting certified with the MapR Administration courses, you can do that with the Step Up with On-Demand Training program.
Mitesh Shah: 07:15 And the third piece is the one that Dmitry will talk about in much greater detail later this hour on this webinar: the free Data Assessment offering. So I won't steal the thunder at this moment. I'll just point out that if you're thinking about migrating off of another vendor, specifically Cloudera or Hortonworks, we've got professional services there that will help you in that process. We'll do a free data assessment.
Mitesh Shah: 07:45 So three parts of the program. Now, let's dive into part one, the MapR Data Platform, and use this as a framework, right? There are many different dimensions here that we can look at. And if I thought hard, I could probably come up with a half dozen more, but these are probably the top eight. Along all of these different dimensions there are duplicate offerings that Cloudera and Hortonworks offer. In Data Science, Cloudera has got the Data Science Workbench, and Hortonworks, through their partnership with IBM, has IBM DSX. In the world of SQL, it's Impala versus Hive LLAP and more. In security, it's Sentry versus Ranger, and so on. So this is sort of the overall framework, and I won't go into each of these, but I will go through a handful of them to explain not only where there are duplicate offerings that will need to be rationalized, but also to call out where there may be a shaky foundation or no foundation at all with these vendors, and where MapR really shines as it relates to these dimensions.
Mitesh Shah: 08:47 So with that, let's dive right in to Data Science and AI, right? Again, as I mentioned, Cloudera has got the Data Science Workbench, and Hortonworks, with their partnership with IBM, has IBM DSX. The more fundamental issue here, as you see on the left hand side, is that if you want to take advantage of these more cutting edge, newer Python machine learning libraries, you've actually got to move the data over to a separate cluster entirely. In the case of Cloudera, with the Cloudera Data Science Workbench, you've actually gotta move it to the cloud to take advantage of it. The reasons for that are numerous, but one of the primary reasons is simply that Cloudera, via HDFS, does not support POSIX. And increasingly these Python-based machine learning libraries require POSIX as an interface to work out of the box.
Mitesh Shah: 09:42 So that's a big problem, right? It's a big problem because now you've got data that you need to copy from one area to another, and that has downstream implications on a lot of different areas. In particular, as you're copying data, now you've got to secure it in multiple places. Now you've got to manage the lineage in multiple places, figuring out where it came from and where it's going to. And of course the biggest issue here is that you've got poorer time to value. You're increasing your time to value by now having to copy data over to somewhere else to then do your Data Science. So there are big fundamental challenges with the approach to begin with, and then as I mentioned, you've got two duplicate offerings that will need to be rationalized over time, and who knows which route they're going to go. Maybe they know today, maybe they don't, but they certainly haven't announced it as far as I know.
Mitesh Shah: 10:33 On the other hand, you've got MapR and you'll see this by the way, I just wanna kind of give you a quick lay of the land with these slides. With these wordy slides on the left hand side, you're going to see where we call out the deficiencies with Cloudera and Hortonworks, as well as where their offerings need to be rationalized, where the duplicate offerings need to be rationalized. On the right hand side, you're going to see where MapR really provides clarity in the given dimension, in this case, Data Science. So I've got that list on the right hand side here, but I've also got it in pictures, right here, right?
Mitesh Shah: 11:08 The key is, with MapR we make doing Data Science easier by having everything on one cluster, by being able to do AI and analytics on a single cluster. And we're able to do that because we uniquely support not just HDFS, which things like Spark might understand, but also POSIX, which increasingly, as I mentioned, these Python-based machine learning libraries like TensorFlow, scikit-learn and PyTorch need to have in order to conduct their Data Science. So everything on one cluster just makes it easier.
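To make the POSIX point concrete: a library that expects a regular filesystem reads its input through ordinary file calls, so data exposed as a mounted path can be used in place without a copy step. The following is a minimal sketch under stated assumptions. The mount path `/mapr/my.cluster.com/data/training.csv` and the helper name `load_rows` are hypothetical and do not come from the webinar.

```python
import csv
from pathlib import Path


def load_rows(dataset_path):
    """Read a CSV file with ordinary POSIX file I/O and return its rows as dicts."""
    # Plain open(): works on any path the operating system exposes as a
    # filesystem, which is the property the speaker says Python ML
    # libraries rely on.
    with open(dataset_path, newline="") as handle:
        return list(csv.DictReader(handle))


if __name__ == "__main__":
    # Hypothetical mount point (an assumption for illustration only);
    # replace it with wherever the cluster is actually mounted.
    path = Path("/mapr/my.cluster.com/data/training.csv")
    if path.exists():
        print(len(load_rows(path)), "rows loaded")
    else:
        print("mount not available at", path)
```

The same function works unchanged on a local disk, which is the point: the code does not need to know whether the path is backed by a laptop drive or a cluster.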
Mitesh Shah: 11:50 Let's move on to security, right? The fundamental problem here, which you see on the top left, is that really both Cloudera and Hortonworks have chosen to have Add-on tools to handle and address security. And it's easy to see why they've gone down that route. Because you've got not just their so-called platform, you've also got all of these different compute engines on top that need to speak to the platform and speak to it in a secure way. So I guess they felt like the best approach would be to have an Add-on tool to be able to do that: in the case of Cloudera it's Sentry and in the case of Hortonworks it's Ranger. Unfortunately, that's not the right approach from a security perspective, right? They're not handling security where it needs to be handled and addressed, which is at the platform level, right? They're handling things like authorization and auditing through these Add-on tools. And that's fundamentally a misguided approach.
Mitesh Shah: 12:54 The second issue here is that, as a result of all of that, they've also got no support for impersonation. Cloudera in particular, along key frameworks like Impala, does not support impersonation. And so what does impersonation mean? It means, well, if I'm going through Impala to do a query, for example, does it appear that Mitesh is querying the files, or does it appear like some privileged user, like an Impala or a Hive, is actually looking at the file? Without support for impersonation, it actually appears as if this privileged user is looking at the file. And that has downstream implications from an audit perspective and from an encryption perspective. So it's really not a wise approach, but that's the route they went because of their choice to go with the Add-on tools here, Sentry and Ranger. And of course you've got the issue of having to rationalize between Sentry and Ranger. Which are they going to support long term? We don't know.
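The audit consequence of impersonation described above can be sketched in a few lines. This is a toy model, not any vendor's implementation: the `AuditRecord` type, the `read_file` helper, and the user names are all hypothetical, chosen only to show which identity ends up in the audit trail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditRecord:
    acting_user: str  # identity the storage layer sees and logs
    path: str


def read_file(end_user, service_user, path, impersonate):
    """Return the audit record a storage layer would write for one read."""
    # With impersonation, the query engine accesses the file as the end user.
    # Without it, every access is attributed to the engine's service account.
    acting = end_user if impersonate else service_user
    return AuditRecord(acting_user=acting, path=path)


if __name__ == "__main__":
    for flag in (True, False):
        record = read_file("mitesh", "impala", "/data/sales.parquet", flag)
        print(f"impersonate={flag}: audit log shows {record.acting_user}")
```

When `impersonate` is false, every query by every person collapses into the same service identity, which is why the speaker ties the gap to auditing.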
Mitesh Shah: 13:56 MapR really provides clarity as it relates to security. The biggest way we do that is by handling security at the platform level, right? There is no Add-on tool to handle and address security. But the other major and important piece here is that we're actually secure by default. We've done the hard engineering work to actually make sure that all of the connections between the various components in MapR are both authenticated and encrypted on the wire, so that out of the box everything is secure by default. That really takes the guesswork out of having to secure things and having to monkey around with configuration files, hoping that you're getting it right. It takes all the guesswork out of that and makes sure that you are secure by default out of the box. Now of course you've gotta apply permissions to your files, et cetera. It's obviously a shared responsibility, but out of the box MapR is secure by default. We've done the hard work, the engineering work, to do that for your benefit.
Mitesh Shah: 15:02 Beyond that, right? This is our trust model along the four pillars of security: data protection, authentication, auditing and authorization. We've got very unique capabilities that you can take advantage of from a security perspective. So from a data protection standpoint, you can encrypt data on the wire and at rest. From an authentication perspective, we support not just Kerberos but also username- and password-based registries. And we do that because our customers have told us loud and clear that Kerberos is a nightmare to manage. And so they really wanted this alternative to be able to tie to AD, to be able to tie to LDAP, and we allow for that flexibility from an authentication perspective. And number three, auditing. We've got high performance auditing, which as of version 6.1 is also streamed to the MapR Event Store for Apache Kafka, which makes sure that all of your audit logs are kept safe and secure and can't be removed or monkeyed around with. And so you can consume those logs with whatever downstream applications you'd like and take advantage of them.
Mitesh Shah: 16:08 And number four, we've got authorization. There are many differentiators here, but the key one is really the Access Control Expressions, which are Boolean expressions that you can apply to files, tables, streams, or all three of them at the volume level, so you can secure all of your data at once with extremely expressive permissions. So security is far superior to Cloudera and Hortonworks, and we really provide clarity there. From a management perspective, yes, there are duplicate offerings here with Cloudera and Hortonworks. In the case of Cloudera it's a tool called Manager, and in the case of Hortonworks it's a tool called Ambari.
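Returning to the authorization point in this turn: the idea of a Boolean expression as a permission can be sketched with a small evaluator. This is an illustration of the concept only. The notation used here (`u:` for a user, `g:` for a group, with `&`, `|`, `!` and parentheses) is an assumption for the example and should be checked against the product documentation; the `allowed` function is hypothetical and is not the platform's API.

```python
import re

# One token: a parenthesis, an operator, or a u:/g: term.
TOKEN = re.compile(r"\s*([()&|!]|[ug]:[\w.-]+)")


def tokenize(expr):
    """Split an expression into tokens, rejecting anything unrecognized."""
    tokens, pos = [], 0
    expr = expr.rstrip()
    while pos < len(expr):
        match = TOKEN.match(expr, pos)
        if not match:
            raise ValueError(f"unexpected text at position {pos}: {expr[pos:]!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def allowed(expr, user, groups):
    """Evaluate a Boolean access expression for one user and their groups."""
    tokens = tokenize(expr)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        token = peek()
        if token is None:
            raise ValueError("unexpected end of expression")
        pos += 1
        return token

    def parse_or():  # lowest precedence
        value = parse_and()
        while peek() == "|":
            take()
            rhs = parse_and()  # always parse the right side, then combine
            value = value or rhs
        return value

    def parse_and():
        value = parse_not()
        while peek() == "&":
            take()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():  # highest precedence
        if peek() == "!":
            take()
            return not parse_not()
        return parse_atom()

    def parse_atom():
        token = take()
        if token == "(":
            value = parse_or()
            if take() != ")":
                raise ValueError("expected ')'")
            return value
        if token.startswith("u:"):
            return token[2:] == user
        if token.startswith("g:"):
            return token[2:] in groups
        raise ValueError(f"unexpected token {token!r}")

    result = parse_or()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return result


if __name__ == "__main__":
    rule = "u:alice | (g:finance & !u:bob)"
    print(allowed(rule, "carol", {"finance"}))
    print(allowed(rule, "bob", {"finance"}))
```

A single expression like the one in the usage example grants access to one named user plus a whole group while carving out an exception, which is the kind of expressiveness the speaker is describing.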
Mitesh Shah: 16:55 The bigger issue here is that they're not really set up or equipped or architected to handle management in a seamless, simplified manner. And the reason for that is quite simple. It's their architecture, right? They've got this NameNode architecture that requires one or two points of failure called NameNodes that ultimately hold the keys to the kingdom, that ultimately tell you where all the files and information and data are within your cluster. These, again, are points of failure that you will have to manually manage yourself and make sure are always up and available, and that's a big headache. Not only a big headache, but it's also a bad approach from a resilience standpoint.
Mitesh Shah: 17:40 And the second thing is that there are really no advanced capabilities around multi-tenancy, storage quota management, or data placement controls. There really is no concept of a logical unit of management to which you can apply policies like that. In the case of Cloudera and Hortonworks, they have no concept of that because, again, they're architected with HDFS underneath and there is no real concept for these capabilities. So it's really just fundamentally harder to administer in the first place, but they cover some of that up with tools like Manager and Ambari, which I view as nice GUIs, which is all well and good. But again, at the fundamental level, these production-ready capabilities around high availability and disaster recovery are extremely difficult to manage 'cause they're simply not architected to be enterprise grade in the first place.
Mitesh Shah: 18:33 Well, with MapR of course, we provide clarity because we were architected since day one to be that production-ready, enterprise-grade data platform. We support volumes, this concept of logical units of management to which you can apply policies like multi-tenancy, right? You can have files, tables and streams all grouped together so that only the marketing department can see them or only the finance department can see them. This is something that you simply can't do in the case of Cloudera and Hortonworks. And beyond that, you've got quota management capabilities, so you can say that a given volume should only go up to a terabyte or two, for example. You've got data placement control capabilities, where you can say that all the data within a given volume should be placed within a certain node or set of nodes. And you've got snapshot and mirroring capabilities, right?
Mitesh Shah: 19:27 Key to this is MapR's ability to do point-in-time, consistent snapshots with our files. And we're able to do that because we've got a read-write file system underneath. Cloudera and Hortonworks do not. They've got HDFS, which is simply an append-only file system that does not support consistent point-in-time snapshots. And as a result, we're able to support these production-ready, enterprise-grade capabilities around disaster recovery, with true consistent snapshots and replication and mirroring capabilities with tables and streams. And of course, if you want the nice GUI, you've got that with MapR as well. I think as of version 6.0, we've introduced a brand new MCS, the MapR Control System. You see a screenshot of it over here. The key to this is really not just how nice the GUI is, which is all well and good, but your ability to now manage all of your data at once. Because you can store files, tables and streams all in one cluster, one platform, which MapR can uniquely do, you can manage all of it at once as well, including permissions and your disaster recovery needs and so on.
Mitesh Shah: 20:44 That was management. Let's move on now to governance. The big issue here is that both Cloudera and Hortonworks really treat governance only within the walls of, call it Big Data, within the borders of their own platforms. They don't treat governance as an enterprise-wide challenge, which is really what it needs to be thought of as. It is not the case that you've got just data sitting in Cloudera and Hortonworks. Obviously that data is coming in from somewhere: perhaps relational databases, perhaps the cloud, perhaps some other systems or legacy systems. Your enterprise has many different systems in it, and governance is truly an enterprise-wide challenge that needs to be thought of in that manner.
Mitesh Shah: 21:35 Well, the tools Navigator and Atlas really have that limited view of only treating governance within their own borders. So that's the basic problem, but beyond that, these tools really lack capabilities around being able to classify data automatically. Imagine having terabytes or even petabytes of data that you need to go through and actually classify to see which files might contain sensitive data: PCI data, for example, or personally identifiable information. Well, that's simply not possible with the tools that Cloudera and Hortonworks provide out of the box. And the other issue here is that there is no real capability to rate or review data.
Mitesh Shah: 22:23 So the key to this is that data stewards within organizations obviously have the most knowledge about which data is a golden set and how data should be used. Well, you want some ability for them to be able to go in and say, well, this file is the most correct set of data, it's the one that's been most enriched, it's the one that should be used in x way or y way, right? The ability to do that is lacking with Navigator and Atlas. With MapR, we actually offer a data catalog that solves many of those needs and, most importantly, treats governance as an enterprise-wide challenge, so that you can govern not just the data within MapR, but data within RDBMS systems and cloud systems as well.
Mitesh Shah: 23:11 So that's governance. We move on quickly here to cloud, specifically Hybrid and Multi-cloud. So I called out SDX and DataPlane: SDX in the case of Cloudera, DataPlane in the case of Hortonworks. Really, I'm probably being generous here by using those names in the context of Hybrid and Multi-cloud. I think those are mostly marketing stories, and even if they weren't, as they're built today they're mostly about managing and governing your data across the Hybrid cloud environment. I will point out that just a month or two ago, Hortonworks announced what they're calling an Open Hybrid Architecture Initiative. They're calling it an initiative, which to me signals that they really have no Hybrid cloud story today in the first place, right? Why would you announce an initiative if you already had a product that could solve for these Hybrid and Multi-cloud needs? Well, I guess they felt like they needed to announce the initiative to manage that.
Mitesh Shah: 24:22 MapR, on the other hand, solves for your Hybrid and Multi-cloud challenges today. We do that through this three pillar approach: integration, portability and inter-cloud. You see a picture of that on the right hand side here, but let me walk through each of these in just a few moments, right? So starting with integration, you've got files, tables, streams and a whole host of open APIs, so that you can run your existing or newer applications against the MapR Data Platform, and you can do it in the cloud. Obviously the MapR Data Platform works in the cloud just as easily as it works in on-premises environments or in fact edge environments. And in some cases, like AWS, you can even get MapR and use it by the hour, if you would like, through their marketplace.
Mitesh Shah: 25:18 So we've got integration with the cloud vendors themselves. We've also got portability, and this is really the key. We have these open standard APIs, not just HDFS, not just Kafka, not just HBase, but also POSIX, so that you can run your existing and newer applications on top of the MapR Data Platform. And we've also got synchronization capabilities where you can move your data around easily and seamlessly across different environments: across different clouds if you would like, to on-premises environments if you would like, from the edge to the cloud. So you see, again, the picture of that on the right, and this is made possible through our production-ready, enterprise-grade capabilities around replication and mirroring. Uniquely made possible by MapR. And that, of course, is a better approach than the competition here, but it also has the added benefit of reducing switching costs, should you choose to go from Amazon, for example, to Google over the next year. You can do that easily by simply replicating your data through the MapR Data Platform from one cloud to another and then continuing operations on the other cloud.
Mitesh Shah: 26:36 And then finally, Inter-cloud processing, right? We're approaching holiday season here, we're just coming off of Black Friday as well as Cyber Monday, and obviously many retailers have a lot of data they need to sift through and analyze. In some cases, you simply may not have the compute capacity in your on-premises environment to do all that, and you really want to maybe spin up ad hoc instances on a short term basis in the cloud. With MapR that's easy to do: you can simply mirror and replicate your data to the cloud, burst to on-demand instances in the cloud, do your analytics and then shut it down, on an hourly basis if you would like in these cases. So again, integration, portability and inter-cloud, only possible with the MapR Data Platform.
Mitesh Shah: 27:35 Okay, and then finally, I'd like to touch very quickly on containers. Obviously there's a big trend in the market now to use containers, to use Kubernetes. In this case, it's actually a pretty easy answer. Neither Cloudera nor Hortonworks has any clear current offering to support containerized applications. There is certainly no offering to support stateful containerized applications, and so virtually no rationalization to be had. But even if they were planning to do something, this merger that they announced is probably going to impact or delay whatever plans they had to support containers. With MapR, we've been supporting containers now for over two years and we've actually evolved in a big way.
Mitesh Shah: 28:27 So starting with support for stateful containerized applications. If you've got applications in your containers, it's great to have them there, but now where are you gonna store your data? Are you gonna bring that data to your container? Well, in that case, if a container goes away, then your data is lost. So with the MapR Data Platform and with the Persistent Application Client Container, we've actually bundled up into a Docker image the POSIX client and other libraries that you can use to access all of the data, including files, tables and streams, within the MapR Data Platform. That's true support for stateful containerized applications. And by the way, you also have support for microservices because of our support for events through the MapR Event Store for Apache Kafka, so that you can actually send messages from one container to another to support these newer microservices-based applications.
Mitesh Shah: 29:30 The second is really around the Data Science Refinery, a product offering that we have, where we bundle up a notebook like Zeppelin as well as data exploration capabilities through Apache Drill, as well as Spark and other compute engines, to be able to access data in the MapR Data Platform, so that your data scientists have access to the data that they need to conduct their AI and Data Science. And then finally, over the past few months we've announced the MapR Data Fabric for Kubernetes, which is basically support for the MapR Kubernetes volume drivers. So if you are using Kubernetes, you can use the MapR Data Platform as your persistent data store.
Mitesh Shah: 30:10 So that was about six or so different dimensions along which, again, Cloudera and Hortonworks have either a shaky foundation or no foundation at all from an architectural perspective. And even if they do have some offerings, they're going to have to be rationalized over time. And that, again, as we say, will cause a lot of confusion for customers and prospects over time. So those were six. There are a few more that we won't have time to cover over the course of this webinar, but obviously we'd be more than happy to drill into any of these topics or the remaining ones with a follow-up conversation.
Mitesh Shah: 30:51 With that, let me just talk about the second pillar of the MapR Clarity program. As I mentioned, a lot of the technologies I just covered can be confusing, and you really wanna make sure that you learn about these technologies and how they can be useful to your organization. Again, we offer Free On-Demand Training in many of these technologies, both at the higher level, the business level, with courses around AI and ML and Kubernetes, but also at the deeper technical level with courses around Spark version 2.1, SQL Analytics and even administering the MapR Data Platform. So go to mapr.com/training, click on on-demand training, and you'll see a list of three dozen or so courses. And I'll point out again, these are free courses, but they're also built in a way that is conducive to your training needs, right? These are not people just talking through slides. They're not just an hour-long presentation where we talk about technology. These courses have quizzes, these courses have games in some cases, and they have very rich content around all of these technologies. And again, it is free.
Mitesh Shah: 32:08 So with that I'm now going to turn it over to Dmitry, to talk about the third pillar of the program which is the Step up with a Free Data Assessment led by our esteemed professional services team.
Dmitry Gomerman: 32:23 Excellent, thanks Mitesh. My name is Dmitry and I run the US data engineering team. Let's say you've decided to explore your alternatives. I'd love to talk to you about our Step Up program. PS is really here to help. Anytime you approach a migration, if you want it to be successful, it involves the collaboration of many talents. It involves operations to get your cluster up and running, installed and configured. It involves workflows to get data flowing into your cluster in the most optimized fashion. It involves BI and Analytics to get the most use out of your data as quickly as possible. And it even involves custom solution development and design, not only to migrate your applications onto the MapR platform, but to take advantage of the unique features that we offer.
Dmitry Gomerman: 33:26 MapR professional services has all of these talents under one roof. And we're happy to help with a successful migration. In fact, we do these migrations on a daily basis. If you think about the larger picture, we rarely build a cluster just to build it. In the majority of cases, every single cluster installation involves a component of data migration. For example, customers look to offload their data, build a data warehouse that typically involves data living elsewhere that needs to be migrated. It involves applications and use cases residing on other platforms that need to be consolidated within MapR. Even for new use cases, you're typically dealing with data residing elsewhere that needs to be migrated onto our platform. Our migration philosophy is simple. We shoot for zero downtime, zero risk of data loss, we push for a gradual use case based adoption, minimizing the impact to your existing applications and workflows, we always provide a rollback and recovery strategy and, whenever possible, avoid the big bang approach.
Dmitry Gomerman: 35:05 Our goal is to partner with you to expedite the adoption of MapR software, minimize the dependence on your resources and provide subject matter expertise in Hadoop and Big Data technologies. We leverage not only PS, but customer service, educational services and other departments to ensure the short term and long term success of your migration. We always design for rapid deployment and user adoption, keeping your SLAs in mind. The migration timeline is driven by your needs and we can support a gradual or an aggressive approach. We have a proven five step methodology that we use for each migration. It starts with an assessment, which I'll talk about shortly. We then move into an initial build out of the clusters and we always start with two at a minimum, a development cluster and a prod cluster. We'll talk about that in a second. Once the initial clusters have been built, we move into a repetitive cycle of use case migration and expansion.
Dmitry Gomerman: 36:31 You may be wondering what is a use case? For us, a use case is a logical separation of workflows, applications, data and the dependencies that can be migrated independently of other use cases. We shoot for a use case based migration approach in order to avoid a risky big bang theory. Each use case starts with a plan and design session. Once the design has been signed off by all parties, we move into the cluster preparation, which may involve the installation of additional ecosystem components or the configuration of shared features such as data governance, et cetera.
Dmitry Gomerman: 37:28 We then test and validate the migration in the lower environment such as dev, we document it and once it's signed off, we migrate it into production. The entire process is documented and we provide cross training to your team. Optionally, we then recommend decommissioning some of your existing nodes from your current environment and adding them into the MapR clusters, to expand capacity for additional use cases. Steps three and four continue to cycle until you've onboarded all of the use cases and decommissioned your old environments. This approach saves on hardware costs and total cost of ownership. At the end of the deal, we provide enablement and support for your resources.
Dmitry Gomerman: 38:31 Specifically regarding the data migration assessment, it's a one week free exercise with no obligation to buy MapR software or additional PS services, it's completely free, where we review your current cluster architecture, workflows and applications as well as use cases and usage patterns. We evaluate your queues, jobs, review your data associated to the various use cases, keeping in mind that the majority of your applications will run unmodified on MapR because generally they're uncompiled code such as Hive queries and Pig queries, et cetera. We identify any issues you're having with your current environment and define any shortcomings. We assess the product support needs, performance and security requirements. We then discuss any expected business outcomes and service level requirements. We deliver a migration presentation, which includes the high level architectural changes, performance and security planning as well as MapR PS recommendations. We create a feature mapping to address any defined shortcomings. We lay out your new cluster and its services and we create a migration plan to move to the MapR platform, whether it's on-prem, in the cloud or a hybrid thereof. Finally, we scope the migration via a mutual service plan and include the price to migrate.
Dmitry Gomerman: 40:23 As part of the deliverables, we create a detailed queue analysis. We evaluate each of the queues and the jobs running in the queues. We create a risk classification and determine the jobs most ready for migration. We always like to start with a quick win to show you the value of the MapR platform. Obviously we work with you to define the appropriate use cases that are most critical to the business and we migrate them first. As part of the assessment, we evaluate the Network Throughput, which we then use to estimate the transfer duration. We break it up by use case and we give you the total estimate of how long it would take you to transfer the data.
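As a rough illustration of the transfer estimate described above (measured network throughput turned into a per-use-case and total transfer duration), here is a minimal Python sketch. The use case names, data sizes, link speed and the 0.7 efficiency factor are all hypothetical placeholders, and this is not MapR's assessment tooling; it only shows the arithmetic involved.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    name: str       # hypothetical use case label
    data_tb: float  # data to move, in terabytes (decimal: 10**12 bytes)


def transfer_hours(data_tb: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Estimate the hours needed to move data_tb terabytes over a link.

    link_gbps is the measured or nominal throughput in gigabits per second.
    efficiency is an assumed fraction of that throughput actually achieved
    once protocol overhead and competing traffic are taken into account.
    """
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be > 0 and efficiency in (0, 1]")
    bits = data_tb * 1e12 * 8                      # terabytes -> bits
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


def estimate(use_cases, link_gbps, efficiency=0.7):
    """Return (hours per use case, total hours) for a list of use cases."""
    per_case = {
        uc.name: transfer_hours(uc.data_tb, link_gbps, efficiency)
        for uc in use_cases
    }
    return per_case, sum(per_case.values())


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    cases = [UseCase("clickstream", 50), UseCase("warehouse-offload", 200)]
    per_case, total = estimate(cases, link_gbps=10)
    for name, hours in per_case.items():
        print(f"{name}: {hours:.1f} h")
    print(f"total: {total:.1f} h")
```

Breaking the estimate up by use case, as in the talk, makes it possible to schedule each transfer separately rather than treating the whole data set as a single move.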
Dmitry Gomerman: 41:30 In visual terms, this is what our migration strategy looks like. You start with the initial cluster build out where you validate the hardware for both clusters. You install and configure the MapR environment and its ecosystem components, along with any shared services. And then you move onto a cyclical use case migration phase, where you go through discovery, you migrate into the dev environment, test, validate, document and then promote to production. As an optional step, you decommission old hardware, old nodes, and add them to your existing MapR environments and the cycle repeats itself with additional use case migrations.
Dmitry Gomerman: 42:29 As I mentioned earlier, we do this day in and day out. We've migrated many a use case, many a cluster. One particular one I'd like to point out is the AADHAAR migration that we performed. For those unfamiliar, AADHAAR is the largest biometric database in the world with over 20 billion biometrics. They serve over a trillion ID verifications per week for about a billion enrolled residents. The system requires high performance, low latency, and zero data loss. We performed the migration with no downtime and no impact on production systems. Three MapR PS engineers were dedicated for six months, three months of which were spent designing, testing, and validating the migration methods. The actual migration took about five weeks to complete.
Dmitry Gomerman: 43:39 In summary, we have a proven migration strategy to get you from any distribution onto the MapR Data Platform. Professional services has a turn-key solution and we're here to help. The timeline is always dependent upon the selected approach and the availability of your resources. Post migration, we'd be delighted to help you explore some of the new and unique features of MapR such as easier access via POSIX and NFS, hardware and cost savings via compression and erasure coding, high availability and disaster recovery via snapshots and mirroring. We'll help you with the unified, secure-by-default platform security as well as take advantage of the MapR database and the event store for Apache Kafka. We'll also help you explore some of the new features released in MapR 6.1 such as object tiering. There are many other features and functionality available to explore and we would be delighted to work with you to make that happen. Thank you. Back to you, Mitesh.
Mitesh Shah: 44:57 Okay. Excellent. Thank you, Dmitry. Sounds like an exciting, fantastic offer. I just wanna clarify. You mentioned several times that this data assessment is, in fact, free. Is there some obligation then to move onto the data platform or can the customers and prospects simply walk away after that?
Dmitry Gomerman: 45:19 No, there's zero obligation. It's a one week free engagement, where we'll showcase the features and functionalities and evaluate the migration for your specific environment and your use cases and at that point we leave it to the customer to make the decision whether they want to move forward.
Mitesh Shah: 45:41 Excellent. Sounds like a great deal. With that, I see a bunch of questions coming in and we're gonna get to them in just a couple of minutes. Before we do that, I think with this incredible offer, maybe now is a good time to just do a quick poll. David, is that something we can do to kinda gauge interest in whether the folks on this webinar might be interested in this free data assessment offering?
David: 46:03 Definitely. So we have a quick poll on the next slide here. So what we would love to know is if you're interested in taking advantage of the free step up with a data assessment offering from MapR. Yes, no or not sure. If you could go ahead and submit your entries now.
Mitesh Shah: 46:28 Great. So we'll take just a few seconds here to make sure people have time to populate their responses. Again, no obligation here. Just wanted to kinda gauge folks' reaction to this incredible offer. [crosstalk 00:47:03] Oh, no, you could still answer on the poll. So we're still getting a number of results. Thank you for taking the poll. I appreciate the feedback. With that, we'll move on to one last slide and then we'll address some of your questions.
Mitesh Shah: 47:27 Yeah. So one last slide, right. Let's end on a positive note here with the MapR Data Platform itself and how we can benefit you and your organization. So you see a picture of our product diagram on the right, along with, on the left, really how we can enable you along the full spectrum of workloads from ML to AI. And as we talked about over the course of this presentation, we can really solve for your hybrid multi-cloud needs as well as your stateful containerized application needs and operational analytics needs, which we did not have time to cover in this call. But we can solve all of your needs there, but the key here is that the MapR Data Platform really makes it easy to create, manage and orchestrate all of your data, right? So this is the key. Now we see obviously with the rise of Kubernetes and containers, this ability to orchestrate your applications across disparate environments, but what about orchestrating your data and being able to seamlessly and easily do that? Well, really the MapR Data Platform is uniquely suited to be able to orchestrate your data across these different environments.
Mitesh Shah: 48:40 So, I see some questions here about what MapR Edge is and deployment options. I'm gonna actually just spend a minute here on the product diagram itself and see if that answers some of these questions, starting from the top, right? These are existing applications that you might have that are already built. These are maybe some newer analytics or ML and AI applications that you're building. And that's all from the organization. Like, these are the applications that you are building. The next level down is the various compute engines that would work with and on top of the MapR Data Platform, including all the components here within Hadoop and Hive and Spark and Drill as well as newer, more cutting edge Python machine learning libraries like TensorFlow, PyTorch and scikit-learn.
Mitesh Shah: 49:30 Now many of these compute engines are included with the MapR distribution. Some are not, but in both cases they will work with MapR directly. And in the case of TensorFlow, PyTorch, scikit-learn, it will work with the MapR Data Platform out of the box because of our unique support here for POSIX. So you see that level beneath there, the APIs under the MapR Data Platform, and POSIX is one of those APIs, along with HBase and Kafka and S3, where you can actually use the MapR Data Platform as your S3 object store. So many open APIs to be able to have your compute engines and your applications speak to the MapR Data Platform.
Mitesh Shah: 50:13 And then underneath that is really the core offering within the MapR Data Platform. It's three integrated offerings. There is the distributed file and object store, where you can actually store files and objects at exabyte scale if you would like. There is the MapR database, which is a multi-model database that can store data in wide-column, key-value, or document format. And there is the event store for Apache Kafka, a publish-subscribe mechanism that is compliant with the Kafka API. And so that is really the core of the MapR Data Platform, those three offerings right there.
Mitesh Shah: 50:53 And the key, again, is that these services are integrated so that you don't have to deploy each of these, the distributed file and object store, the database, the event store for Apache Kafka, on separate clusters. You can do it all on one cluster. That is really the key. And then below that you've got the deployment options: on-premises, multi-cloud and IoT edge. I spent a lot of time here talking about hybrid, multi-cloud and that's really what that is showing, where you can tie it all together through replication and mirroring with the MapR Data Platform. You can have your data sitting in all these environments at once and synchronized through the magic of the MapR Data Platform. So that's a quick overview of the product diagram.
Mitesh Shah: 51:36 I know that there was a question here about what is MapR Edge? Well, let's define what edge is in the first place, right? Edge is really, let's call it remote locations. Maybe it's an oil rig, maybe it's a remote office, maybe it's a retail office, maybe it's a hospital environment. These are locations where potentially you have limited space and potentially areas that are bandwidth constrained. And in those cases you really wanna make sure that you're collecting data from all the sensors and devices in that location and potentially doing the analytics and AI and data science in those locations. In the very real case of autonomous driving, for example, you've got both: the case where you've got a space-constrained environment, a very small vehicle for example, maybe just a trunk to be able to deal with the data collection, and you've got bandwidth constraints potentially.
Mitesh Shah: 52:30 So you wanna be able to do all of your analytics and AI right then and there in the vehicle. You simply don't have the time to be able to move that data to the cloud, for example, or to an on-premises environment, and so, from a sort of latency perspective, you need the speed to be able to collect that data and react to it right then and there. So MapR offers a product called MapR Edge, which is really a packaging mechanism where, in a three to five node cluster that you could actually install on basically Intel NUCs if you would like, you can deploy the MapR Data Platform and collect all that data and analyze all that data right then and there in those edge environments. And then to do the deeper analytics, you can actually downsample some of that or replicate all that data to an on-premises or cloud environment for deeper data science and analytics if you so choose, and then report back the results to the edge environment.
Mitesh Shah: 53:31 So really that's what the MapR Edge offering is. I'll walk through just a couple of the other questions here. I think we answered what the MapR Data Platform is and what it looks like. So on the question here around whether the current MapR deployments are on-premises or in the cloud, it's a fairly good mix actually, and increasingly we are seeing a bigger trend towards deploying MapR in the cloud. So that's the one piece of the puzzle right here, our ability to seamlessly integrate with the cloud offerings and the cloud infrastructure, but the other piece is that we are increasingly living in this hybrid and multi-cloud environment. So MapR, again, uniquely supports this hybrid multi-cloud world where you can have your data really wherever it needs to be, whether it's on-premises or in one cloud or in multiple clouds or the edge or all of the above.
Mitesh Shah: 54:26 Hopefully that answers that question, and then there was a question on the main components of the platform; I think I answered that. Maybe Dmitry, I'll turn this one over to you here. Is it possible to design dev, QA and prod environments on a single cluster, and how to logically separate each environment, using a container group with a few containers for each of dev, QA and prod?
Dmitry Gomerman: 54:49 Yeah, thanks, Mitesh. That's an excellent question. It's absolutely possible to separate a single cluster into multiple environments or multiple tenants, and we take advantage of unique MapR features such as volumes, such as storage placement, such as topology, to isolate the storage or physical or virtual nodes or even cloud instances for specific environments or tenants. And we can absolutely employ the container strategy, for example Kubernetes, to run certain environments containerized and others not, or run all of the environments in containers, using data via volumes that are separated across instances. So great question. Absolutely possible.
Mitesh Shah: 55:46 Thank you, Dmitry. I think that answers most, if not all, of these questions. David, I'll turn it back to you.
David: 55:54 Yeah. Just there was a couple more questions for Dmitry. So is the free data assessment only available to existing customers of Cloudera or Hortonworks?
Dmitry Gomerman: 56:05 So the free data assessment is targeted towards Cloudera and Hortonworks customers, but we'll make it work for you. So if you have a migration that you're looking into for a different environment, for example, Teradata or another legacy system, talk to us. We'll make it work.
David: 56:24 Okay. One more, does the free data assessment include implementation?
Dmitry Gomerman: 56:31 The free data assessment does not include any implementation.
Mitesh Shah: 56:37 Actually, I see one of the questions here: I want to know more about the MapR training and certification. That's an easier response, just go to mapr.com/training and you'll see all the offerings there. And then also just wanna point out we did a quick poll here on interest in the data assessment and for those that expressed interest, we'll certainly follow up with you on next steps.
David: 57:00 I'm seeing one last question here. How long does a typical migration take?
Dmitry Gomerman: 57:05 Great question. So we typically split the migration into one of three categories. There's the simple migration, which takes anywhere from a month to three months to complete. There's the medium complexity migration, which takes anywhere from three to six months. And then there's the complex migrations, that take six months and above. Typically, most migrations, even the complex ones, are wrapped up in under nine months, but it does depend on the environment, the use cases, the availability of customer resources, availability of hardware and things like that. We work very closely with the customer to come up with a timeline that works for them.
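The three categories above can be written down as a small lookup table. The following Python sketch is only an illustration of the ranges quoted in the answer; the names `Complexity`, `DURATION_MONTHS`, `describe` and `classify` are hypothetical, and assigning the shared boundaries (exactly three or six months) to the shorter category is an assumption, since the talk does not say which side they fall on.

```python
from enum import Enum


class Complexity(Enum):
    SIMPLE = "simple"
    MEDIUM = "medium"
    COMPLEX = "complex"


# (min_months, max_months) as quoted in the talk; None means open-ended
# ("six months and above").
DURATION_MONTHS = {
    Complexity.SIMPLE: (1, 3),
    Complexity.MEDIUM: (3, 6),
    Complexity.COMPLEX: (6, None),
}


def describe(category: Complexity) -> str:
    """Render the quoted duration range for a category."""
    low, high = DURATION_MONTHS[category]
    if high is None:
        return f"{category.value}: {low}+ months"
    return f"{category.value}: {low} to {high} months"


def classify(estimated_months: float) -> Complexity:
    """Map an estimated duration to a category.

    Assumption: a boundary value (3 or 6 months) goes to the shorter category.
    """
    if estimated_months <= 3:
        return Complexity.SIMPLE
    if estimated_months <= 6:
        return Complexity.MEDIUM
    return Complexity.COMPLEX


if __name__ == "__main__":
    for category in Complexity:
        print(describe(category))
```

As the answer notes, the actual timeline depends on the environment, the use cases and resource availability, so a table like this is a starting point for a conversation rather than a commitment.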