Sr. Director of Product Management, MapR Technologies
Today's AI/ML and analytics workloads can be bursty and unpredictable. Provisioning infrastructure for worst-case (maximum load) compute scenarios is costly and increases administrative overhead. Kubernetes solves for this, partially, by letting organizations orchestrate and spin up containers as compute needs arise. Yet challenges persist.
Learn how new innovations from MapR simplify the independent scaling of compute and storage, and discover how these innovations can help your organization.
Speaker 2: 00:00 Suzy, it's all yours.
Suzy V: 00:01 Thank you. Hello, welcome, and thank you for attending the session. We are going to have a good one here, talking about the benefits of separating compute and storage for today's analytical workloads.
Suzy V: 00:16 Before we get started, let me set the context as to why separating compute and storage has suddenly become important.
Suzy V: 00:25 Many of you are probably thinking that this is something that has been done before, especially in the world of virtualization if you have VM expertise. Even cloud vendors do it today, letting you subscribe to compute and storage separately.
Suzy V: 00:42 So we'll take a look at why this is important for analytics workloads, and also articulate why this is important when you have an existing analytics deployment in an on-premises data center.
Suzy V: 00:59 So with that said, let's take a look at where we are in terms of big data analytics.
Suzy V: 01:05 Over the past decade or so, we have definitely come a long way in terms of big data analytics, haven't we? If you did a quick Google search today, you'd get a litany of write-ups and blogs about where we are in big data analytics and what the market share is; you'd see things like IDC reports, and so on and so forth.
Suzy V: 01:34 What we can all agree on is the fact that data has grown exponentially and will continue to do so. There is quite a plethora of information around how much data a single person generates and how much data is being analyzed on a day-to-day basis, and so on and so forth. But we can all agree that data will continue to grow.
Suzy V: 01:58 One other interesting tidbit of information that IDC has shared is the fact that revenue for big data and business analytics alone has definitely exceeded expectations.
Suzy V: 02:14 What does that mean for us? It means that there are quite a number of solutions available, indicating that there is quite a lot of demand for solutions targeting big data analytics alone.
Suzy V: 02:32 And finally, one other point we can all agree on is that AI has definitely become a concept, and a hype, that everyone talks about, considers, and debates. What I do tend to see is AI somewhat engulfing the concept of analytics, and there is research that says a lot of new applications will come out in the future with built-in AI capabilities.
Suzy V: 03:08 So all of this is context setting, mainly for me to drive home the point that big data analytics is evolving. But while data is exploding, and while there is a plethora of solutions available, there are still certain considerations that you as organizations need to keep in mind.
Suzy V: 03:31 So what do I mean by that? What are we talking about when I say big data has evolved? There are several areas worth expanding on and looking at.
Suzy V: 03:43 So first, let's take a look at how analytical data has changed.
Suzy V: 03:48 So if you have been involved in the Hadoop world a decade ago, you would have dealt with batch mode, which then moved over to realtime analysis. Now a new category is emerging which is being termed fast data.
Suzy V: 04:07 This indicates a category where data is analyzed, and value is applied, almost immediately as the data is created. This is where I would love to talk about edge use cases.
Suzy V: 04:21 For those of you who are in the business of having edge clusters and edge sites, this goes beyond IoT devices as well. The trend I'm seeing as I speak to customers and industry experts is that the data produced at the edge site requires analytics to be applied then and there, which removes the requirement of moving the data from the edge site over to a central data center.
Suzy V: 04:53 So this concept of fast data is somewhat overtaking realtime and batch mode. And that is one change that we are starting to see in terms of how the data is being categorized in itself.
Suzy V: 05:10 The other aspect is how the data is being analyzed. Many of your applications, and your environment in general, are quite steady; however, I do see changes in the way the analytics is being performed. Many of you have bursty or spiky periods. It could be a seasonal set of analytics, where perhaps there are organizations in your company that do, say, quarterly reports, or perhaps you're in a business where you have peaks in the way your data comes in, and therefore you have analytics that burst. That means your compute is bursting; suddenly the compute resources you need have to increase for that set of applications.
Suzy V: 06:01 So this trend of having a bursty set of applications is also increasing in organizations. So I would say that the way in which the data is being processed and how it is being processed has definitely changed quite significantly than how it was almost a decade ago.
Suzy V: 06:25 Now, having said that, we talked about how the analytical model has changed; now let's talk about the hardware aspects of it. This is important mainly because when Hadoop started off, there was an adage that in order to process big data, in order to get the most performance, you have to have compute where the data is. So place compute and storage together and create a cluster around them, so that you achieve a couple of things. One is data locality, achieved by simply placing these two together; the second is that you reduce the latency and the bandwidth bottleneck. However, that area has improved quite significantly as well.
Suzy V: 07:18 One of the biggest things that has come up in the recent past is the emergence of GPU-based servers. We all know, and we saw, the surge in GPU-based Bitcoin mining.
Suzy V: 07:34 But what has happened with vendors like NVIDIA and others is that they have been busy releasing GPU-based servers focused on highly performant analytics. If you have use cases that need massive parallelism and sub-microsecond latency, then GPU-based servers, and solutions built to take advantage of them, are starting to show up in the market today.
Suzy V: 08:08 What about the network? We saw the example with the edge use case: when you have massive data being collected at the source, how do you make sure that the data is migrated or moved to a central repository? This is where the classic edge-to-cloud example comes into the picture.
Suzy V: 08:33 How do I transport petabytes of data from the edge to the cloud? Or petabytes of data from an on-premises data center to the cloud?
Suzy V: 08:42 Now, there have been significant improvements in the network aspect as well. Not just the hardware, but the solutions surrounding the hardware have come a long way, so that the band [inaudible 00:08:57] the latency requirements are almost a given, and organizations don't have to worry about them these days.
Suzy V: 09:07 So we are seeing a lot of trends in this aspect as well. Now, having said that, so what about the workflow?
Suzy V: 09:17 So this is where I see a huge, stark distinction. If you were to take a look at the slide and literally cut it down the middle vertically, each one of you would have a use case where you could relate to the details shown on one half or the other.
Suzy V: 09:38 So let's take a look at the analytics workflow. If you were a long-time MapReduce or Hadoop user, yes, you bring in data and you write MapReduce jobs in batch mode; perhaps you have transitioned over to realtime analytics and your MapReduce jobs are analyzing the data.
Suzy V: 09:58 However, what is happening around us in the AI world? Now we have several categories. First off, there is a separate category of people, of personas, who are determining: okay, let's take a look at what the business value is; let us understand the data; let us profile the data.
Suzy V: 10:19 In fact, I'll give an existing customer example. There are quite a number of customers today, long-standing MapR customers, who are doing IO profiling of their applications.
Suzy V: 10:34 These are applications that have been long standing and running in their organizations. However, they are trying to do a pilot on those same applications to understand their IO profiles, understand the benefits out of them if they were to adopt newer technologies.
Suzy V: 10:54 So there's a whole set of investment, and a whole set of effort, being made in understanding the data.
Suzy V: 11:03 Then, the application of AI/ML has definitely increased within organizations, meaning: how do I mine the data?
Suzy V: 11:17 And once you have run these training jobs, how do I do the modeling and immediately apply the model to the data that is coming in, so that I get quick ROI?
Suzy V: 11:28 So the analytics workflow is slowly merging into a data science workflow. I can almost say that every one of you probably has a data science team that is independent of your IT organization, or independent of your current analytical organization.
Suzy V: 11:50 So these are all coming together. Now, while I've been talking about this, you would have noticed that I did not say anything about the data itself. So as you can see here I'm heading towards explaining this, right?
Suzy V: 12:06 What you can see is that this workflow, how you orchestrate the applications, and how you make sure that the solutions you choose today give you the maximum benefit tomorrow, is almost independent of how the data itself is evolving.
Suzy V: 12:22 So now let's take a look at the data itself. We talked about fast data; the trend I'm seeing today is that the concept of tiered analytical data is slowly becoming the norm.
Suzy V: 12:36 What that means is that fast data is typically something even beyond realtime: it requires the analysis to be applied immediately, and the output applied immediately at the source, so that it creates a continuous process.
Suzy V: 12:54 Then there is the realtime and batch-mode data, which is typically kept as historical data. Perhaps you are learning from it and applying it to different applications in house. So there is a sort of tier forming around historical data as well.
Suzy V: 13:15 Then, since we have accumulated so much data, at least in the last four or five years, some of it is being stored for long-term use: archival data, historical data.
Suzy V: 13:28 Now, if you have been in the storage business specifically, if you are a storage admin or you're the person designated in your organization to choose the best storage solution, you can totally agree with this archival concept.
Suzy V: 13:45 So there is a certain set of categorization that is happening in the data which is almost independent of the way your application is evolving.
Suzy V: 13:59 And if you have been keeping track of what is happening in Hadoop 3, when the release was announced last year, it came with erasure coding technology as well.
Suzy V: 14:10 Now, why is that? Because there is a real need to optimize for capacity: data is growing, and it almost makes sense that you have to categorize it so that you get the most efficiency out of the storage investment you have made.
Suzy V: 14:32 Okay. So I have talked about how the data has evolved, how the data workflow has evolved, how the categorization of data has also changed, and we talked about how the hardware vendors have done their part in going along with this journey.
Suzy V: 14:53 So what does this all mean for your business? There is a whole bunch of articles around this as well; you see articles around the top five tips and tricks to keep in mind. Likewise, I would like to share what I have seen in organizations like yours, and what I know from talking to industry experts. But I would like to keep it really simple here.
Suzy V: 15:22 So there are four rules, or best practices, that I would like to convey. The first and foremost is that there is so much change happening in the ecosystem, in just the tools being produced on a day-to-day basis, that it is almost imperative that you keep your solution simple, flexible and elastic.
Suzy V: 15:48 What that means is: break it down into smaller pieces, and separate your applications from how your data is being managed. If you've not already done so, take a second look and see if you can do that.
Suzy V: 16:06 The second aspect is: don't purchase and provision for your worst-case scenarios, because I see a lot of customers doing that even today, mainly because it is much easier to do so. You may have been given the budget for this year, and it is much easier to just continue with the buying process you've been familiar with over the past years, and simply purchase and provision for your worst-case, burst-compute scenarios.
Suzy V: 16:47 We have reached a point in analytics where that is no longer the most prudent method.
Suzy V: 16:56 So make the effort to analyze your own business and your own applications, and choose loosely coupled solutions: if you have compute bursts, then perhaps you want to look for a solution that caters only to compute.
Suzy V: 17:16 If you have steady growth in storage: I know customers who have a steady increase in the data they bring in; you could chart their quarterly and even yearly growth as an ever-increasing graph. If that is the case, then this is definitely the time to bring predictable storage growth into your planning phase and make a solution choice that will scale well.
Suzy V: 17:50 One thing that MapR does very well, and that we have proven in many customer examples, is that you can bring petabytes of data into the same MapR cluster, and we will manage it very efficiently for you; you will see a negligible performance impact.
Suzy V: 18:10 So you want to take solutions like that into consideration, so that once you orchestrate for today, it is just a matter of keeping up with your storage growth.
Suzy V: 18:23 Store and manage your data wisely as well. If you do not have a tiered data solution today, this is the time to take that into consideration. And don't just relegate yourself to on-premises. Even though it is in my best interest to advise everyone to keep everything on-premises, the right thing to do here is: if you have a need to burst to the cloud, or if there is a set of applications siloed enough that you can easily move them, say for long-term archival, do take a look at the cloud vendors.
Suzy V: 19:01 MapR has very good partnerships with all the cloud vendors, and we have a seamless way for you to move data and applications to the cloud as well. So this is something I would highly recommend you consider.
Suzy V: 19:15 And the fourth point, and this is where I think many organizations struggle: you have to take a look at your own ability to adopt newer technologies. So many new items are being shared today; if you were to just go take a look at AI, there are so many tools, so many algorithms, so many best practices shared by everybody that it is mind-boggling.
Suzy V: 19:45 But you would have to build up your own skillset. If you are a data engineer, let's say, or you are heading a team of data engineers, this microservices architecture is something that everybody gets confused about.
Suzy V: 20:02 Understanding what microservices architecture means is one thing; applying it to your legacy applications is another. This takes time; there is a learning curve around it. So take a look at what technology you would need. Would that mean embracing new data science tools? If you're a data scientist, then perhaps this is the time to take a look at that as well.
Suzy V: 20:27 So these are the four overarching guidelines that I would say you have to keep in mind if you're going to keep up with the trends in the analytics and AI space.
Suzy V: 20:41 Okay. So now having shared with you what it means for your business, now let's actually drill deeper into what it means to choose a flexible or an elastic solution.
Suzy V: 20:56 So what are the benefits of choosing something like that? Your overall goal, at the very end, should be whether you can actually lower your costs or not.
Suzy V: 21:09 Now, costs means many different things for many different people. It could be your upfront cost, it could be an overall operational cost, or it could just be the ease with which you bring your products to market so that your time to market gives you an edge over your competitors, and thereby you have revenue coming in.
Suzy V: 21:33 So it's very important for you to identify which cost you're trying to lower. It could be all three of the above as well.
Suzy V: 21:41 Now, once you've identified that, the biggest thing that choosing a flexible solution gives you is efficient resource usage. If you have bought a certain set of infrastructure and a certain set of solutions to run on top of it, it is the combination of those solutions, making maximum use of your infrastructure, that is going to give you the efficiency and agility that you need.
Suzy V: 22:15 Now, if you are someone who has a bursty set of workloads, simply buying more additional servers doesn't cut it anymore. So what are some of the solutions that you would need to take care of bursty workloads?
Suzy V: 22:30 Maybe you wanna add more applications but not necessarily more storage capacity, so separate them out.
Suzy V: 22:38 If you can independently scale compute from storage, that gives you flexibility; choosing a solution that runs on top of such an environment gives you a way to elastically scale with demand; and all of that converges into the ROI of lowering your cost.
Suzy V: 23:01 So these are the three guidelines for choosing a flexible solution, which I would like to drill deeper into when talking about what we at MapR, as a company, are doing towards it.
Suzy V: 23:18 Okay, so now let's talk a little about the product itself. For those of you who are new to MapR, there's quite a plethora of information about our MapR data platform and how well we can scale. We have quite a number of customer case studies out on the web page. So if you're not familiar with our technology and architecture, I would highly recommend that you get familiar with it, because once you understand the fundamentals of how we can scale to petabytes of data, you will understand some of these newer products and features we are bringing to the market.
Suzy V: 24:03 So how is MapR approaching this flexible solution? How are we approaching and providing features for you to separate compute from storage?
Suzy V: 24:16 Let's walk through the slide from left to right. The first thing we have done is embrace Kubernetes and containers in a big way.
Suzy V: 24:30 We have established leadership in that area. We are continuing to invest in that integration, and you will hear from us about a whole bunch of solutions integrated with Kubernetes and containers.
Suzy V: 24:46 The first step we are taking is to focus primarily on the big data framework tools. We have picked the most active ones and natively integrated them with Kubernetes.
Suzy V: 25:04 And in this aspect, we bring in the Kubernetes value add, we add MapR's value prop on top of it, and we enable you to run tools such as Spark, Hive Metastore, and Apache Drill natively integrated in Kubernetes.
Suzy V: 25:27 The second aspect: while we are taking a closer look at compute, we have done a lot of innovation on the storage side as well. We introduced MapR data tiers a few months ago.
Suzy V: 25:45 We give you a way to have three different tiers. To keep it simple, we use the concepts of hot, warm and cold. This is of course subject to change based on your business requirements.
Suzy V: 26:00 We have a hot tier where we recommend that you place your mission-critical applications. It is backed by replica copies so that they're highly available.
Suzy V: 26:13 We have a warm tier, targeted towards capacity optimization. It has erasure coding on the backend, and it is primarily focused on how you can get the maximum out of your storage capacity.
Suzy V: 26:29 And of course, if you have a need for long-term archival, we have a cost-optimized tier, where we tier the data out for you to any object storage, that is, the cheaper storage options.
Suzy V: 26:42 So we are focusing on the data categories as well. Then, bringing it all together, we offer MapR persistent storage if you're running stateful applications in Kubernetes.
Suzy V: 27:02 So you can see a trend here. First and foremost, we are offering a Kubernetes-first sort of deployment model for customers who want this flexibility and elasticity.
Suzy V: 27:17 Now, please don't take this to mean that we are abandoning all our existing customers, or new prospects who have a huge bare-metal footprint; that will continue as well.
Suzy V: 27:32 What we are offering is an easy way to, one, migrate towards a container world, while we integrate with Kubernetes and extend the MapR data platform into this world as well.
Suzy V: 27:48 So let's take a closer look as to what we have done for deploying Spark and Drill jobs in Kubernetes.
Suzy V: 27:55 So we are introducing a new concept called a tenant. We have a widely expanded tech preview underway as I speak to you, and this tech preview has shown that the concept of a tenant resonates very well with customers.
Suzy V: 28:17 This tenant concept basically separates and isolates compute resources. Say you are a customer running Spark jobs, with sets of users: a user Peter, a user Joe, a user Sally.
Suzy V: 28:33 Now, if these three users have varying needs in how they run their Spark jobs (or it could be your custom applications, not necessarily Spark; maybe you're running Drill jobs), today it is extremely difficult to run all three varying users and their workloads on the same platform. It requires constant management and constant hands-on configuration to host them.
Suzy V: 29:05 What we are giving you with this concept of a tenant is a way of cleanly separating out these groups of users. The segregation could be based on users, on projects, or even on types of access mechanism if you want. It is entirely up to your admins how they want to create tenants.
Suzy V: 29:30 And this tenant concept will let you have dedicated compute resources for those tenants.
Suzy V: 29:38 So as I speak, it should occur to you that we're talking about efficient use of compute resources among different users.
Suzy V: 29:49 Now, many of you ask what kind of design pattern we use when we integrate Spark and Drill within these tenants. We are embracing the concept of Kubernetes operators. If you're familiar with Helm charts, this is an alternative; we find that Kubernetes operators are becoming an industry standard, so I would highly recommend that you take a look at Kubernetes operators as well.
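To make the operator pattern concrete: with an operator such as the open-source spark-on-k8s-operator, a Spark job is declared as a Kubernetes custom resource, and the operator reconciles it into running driver and executor pods. The manifest below is a rough sketch in that operator's style; the names, image, and paths are hypothetical, not MapR's product syntax.

```yaml
# A Spark job declared as a Kubernetes custom resource, in the style of
# the open-source spark-on-k8s-operator. Names, image and paths are
# hypothetical examples.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: tenant-a-etl
  namespace: tenant-a          # the tenant's isolated namespace
spec:
  type: Scala
  mode: cluster
  image: myrepo/spark:2.4.0
  mainClass: com.example.ETLJob
  mainApplicationFile: "local:///opt/jobs/etl.jar"
  driver:
    cores: 1
    memory: "1g"
  executor:
    instances: 3               # the operator spins up three executor pods
    cores: 2
    memory: "2g"
```

Applying this with `kubectl apply -f` is all a user needs to do; the operator handles pod creation, restarts, and cleanup, which is what makes it an alternative to hand-managed Helm releases.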
Suzy V: 30:16 Now, why am I focusing on this tenant concept? One thing Kubernetes has left everyone with is a whole bewildering set of terminology. There is way too much terminology overload, and understanding it all, and mapping it to your existing environment, is extremely complex.
Suzy V: 30:39 And this is where MapR comes in. We are embarking on this journey with you: we want to encapsulate all of these different terminologies and concepts into the tenant concept, and we will provide you with example templates so that it is much easier for your engineers, or yourself, to understand how to deploy your applications in a Kubernetes environment.
Suzy V: 31:10 We have already had a lot of success in our tech preview: customers find that it reduces their learning scope and makes it easier for them to deploy in a Kubernetes environment. So it is all about accelerating your go-to-market, accelerating your Spark jobs and Drill jobs, and even your custom jobs, within this environment.
Suzy V: 31:35 What is the value add MapR brings to the table? To understand this, you have to understand what Kubernetes actually does.
Suzy V: 31:46 Kubernetes builds on the portability of containers, and Kubernetes gives you a way to auto-scale your containers.
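As a concrete illustration of that auto-scaling, Kubernetes' own HorizontalPodAutoscaler can grow or shrink a set of containers based on observed load. The manifest format below is standard Kubernetes; the deployment name, namespace, and numbers are hypothetical.

```yaml
# Standard Kubernetes auto-scaling: grow a worker deployment from 2 to
# 10 replicas when average CPU utilization exceeds 70%. Names and
# numbers are hypothetical.
apiVersion: autoscaling/v2beta2   # autoscaling/v2 in newer Kubernetes
kind: HorizontalPodAutoscaler
metadata:
  name: drill-workers
  namespace: tenant-a
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: drill-workers
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

This is the mechanism that makes "compute bursts" practical: the replica count follows demand instead of being provisioned for the worst case.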
Suzy V: 31:56 So while Kubernetes does that, what is MapR doing? One of the two main things we bring into the picture you already saw: the resource calculation encapsulated around the tenant. What you see on this slide is a way for you to specify the CPU and memory resources required for each tenant.
Suzy V: 32:21 Now, we give you soft limits and maximum limits. With these quotas in place, you have the freedom to increase up to your limit if you have a spike period. Or, if you are say in a POC, or you have a test environment, you can always set a lower limit and a maximum limit in your quotas, and thereby address your compute bursts, your compute spike overloads, if you have them.
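The soft-limit/maximum-limit idea maps naturally onto Kubernetes resource semantics, where requests act as a guaranteed baseline and limits as a burst ceiling. A per-tenant quota could be expressed roughly as below; the tenant name and numbers are hypothetical examples, not MapR's actual configuration.

```yaml
# A per-tenant Kubernetes ResourceQuota: requests.* act as the tenant's
# baseline ("soft") reservation, limits.* as the burst ceiling
# ("maximum"). Names and numbers are hypothetical.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-a-quota
  namespace: tenant-a
spec:
  hard:
    requests.cpu: "8"        # guaranteed baseline CPU across the tenant
    requests.memory: 32Gi
    limits.cpu: "16"         # burst ceiling for spike periods
    limits.memory: 64Gi
```

Pods in the tenant namespace can then burst between their requested and limit amounts, while the quota keeps the tenant as a whole inside its allocation.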
Suzy V: 32:55 Now, if you're sitting here thinking, "Well, I don't have a spike period, mine is quite steady," it is still an efficient way to work, because when you specify these resources, they are dedicated to you.
Suzy V: 33:09 So if there is a rogue application running, it will not affect your own application and your own environment. There are a lot of benefits to these value adds that MapR brings to the table; applying them in a tenant model lets you easily address things that you couldn't before.
Suzy V: 33:32 The other aspect, as I hope each one of you is able to see, is that by introducing this encapsulation, you can now have multi-tenancy on the same platform with real resource isolation. This is one of the challenges many of you face today when you are trying to deploy, not just on premises but in the cloud as well.
Suzy V: 33:58 This gives you true resource isolation, and with that, once orchestrated, it will just run on its own. Okay?
Suzy V: 34:09 Now, having said that, what about security? You will find, as you start learning about Kubernetes if you haven't already, that this is one area that is quite complicated, because there are two paradigms here: Kubernetes gives you security to run your apps, but you also need that same security extended to accessing the data, and that's exactly where MapR fits in.
Suzy V: 34:37 What we do is automatically generate your Kubernetes Secrets. Secrets are Kubernetes' terminology for ensuring secure access for your applications. We also give you example LDAP pods to run, so that you use the same secure tenant to run your containers and also to access the data.
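For illustration, a Kubernetes Secret is just a namespaced object holding base64-encoded data, and an authentication ticket could be stored in one roughly as below. The Secret name, key, and contents here are hypothetical placeholders, not the exact objects the product generates.

```yaml
# A Kubernetes Secret holding a (placeholder) authentication ticket.
# The name, key and base64 payload are hypothetical; tooling such as the
# ticket-creator utility mentioned later would generate the real one.
apiVersion: v1
kind: Secret
metadata:
  name: mapr-ticket-secret
  namespace: tenant-a
type: Opaque
data:
  CONTAINER_TICKET: cGxhY2Vob2xkZXItdGlja2V0LWNvbnRlbnRz
```

Pods in the tenant then reference this Secret (as a mounted file or environment variable), so the same credential governs both running the container and accessing the data.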
Suzy V: 35:01 We take care of mapping your Kubernetes paradigm onto accessing the data. And we give you example templates: when customers come to us and say, "Hey, I would like to test drive this product," we walk them through these steps by giving them the example templates and a utility called the ticket creator.
Suzy V: 35:27 Once we walk you through these steps, all of this is hidden from you, so that you actually get a unified, seamless, secure advantage that you wouldn't have normally.
Suzy V: 35:38 And this leads automatically to the next step: if you can do it on-premises, you can do it at the edge. If you can do it at the edge and in an on-premises data center, you can also do it in the cloud. So you see the trend as we go through the discussion today.
Suzy V: 35:56 Once you set up the platform and have a unified management layer, you can pretty much take the same experience with you regardless of your environment choice.
Suzy V: 36:12 Okay. Now that we've talked about the integration with Kubernetes, how do you access the data? Say you have separated compute from storage and embraced this concept of running everything in Kubernetes, you have Spark jobs running in Kubernetes, and you have different users, but they all want to share and access the data.
Suzy V: 36:37 The way you do that is using our volume driver plugin: you mount MapR volumes, and you mount them to the tenants that you created earlier, which we saw in the previous slides.
Suzy V: 36:52 Now, you have a couple of choices here. You can mount these volumes exclusively to a tenant, or you can share them across tenants.
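As a sketch of what mounting such a volume can look like: MapR's plugin has been delivered as a Kubernetes FlexVolume driver, and a shareable volume might be declared roughly as below. The driver name, option fields, and values here are illustrative assumptions, not exact product syntax; consult the product documentation for the real fields.

```yaml
# Illustrative PersistentVolume backed by a MapR volume via a FlexVolume
# driver. Driver name, options and values are assumptions for the sketch.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: shared-models-pv
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany            # many pods, even in different tenants, share it
  flexVolume:
    driver: "mapr.com/maprfs"  # the volume driver plugin
    options:
      volumePath: "/projects/shared-models"
      cluster: "my.cluster.com"
```

Tenants then bind to this PersistentVolume through their own PersistentVolumeClaims, which is what makes the exclusive-versus-shared choice a configuration decision rather than a hardware one.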
Suzy V: 37:02 The sharing part, separating compute and storage this way, is a huge benefit for data scientists especially. Let's say you have a person building a model. Once the model is up and running, and if that model needs to be consumed by, say, 10 different users, this is the exact solution and configuration I would recommend. You have the models running, you place them in a set of volumes, and now the volumes can be shared across tenants, so your 10 data engineers or data users can have their own compute resources but share the data among them.
Suzy V: 37:46 This is the true power of separating compute and storage, this is not simply just saying, "Okay, let me buy different compute servers and storage servers." You actually have to then choose a solution that will give you the benefits of separating the infrastructure.
Suzy V: 38:06 Now, let's talk about the data tiers themselves. You have separated the compute, you have moved your applications into the Kubernetes world, you are sharing volumes across your different tenants; now let's say you have accumulated data, and you want to separate it out so you can continue to use your storage resources efficiently as well.
Suzy V: 38:31 This is where the three different tiers come into play. There are a couple of things you can do here. If you have streaming data, very active data, then I recommend you keep it replicated, backed by NVMes or SSDs if you choose to, so that you get the maximum performance out of it.
Suzy V: 38:57 But if you have data that has accumulated over years, and you still want control over it, then the recommendation is to move it into a tier that is erasure coded, so that you can keep using that same volume to place more and more data.
Suzy V: 39:15 But if you really have requirement where you just need to keep the data for a long-term but you do not envision ever having to access them, some of the ... If you are especially in the financial industry, you have compliances and regulations to meet. In those cases, if this is something that you simply have to have data on, say for the next 7 years or 10 years, then the recommendation here is steer it out of your on-premise data center. Place it in a cheap storage. It could even be a public cloud, and we will allow you to tier them out of MAPR into an object store.
Suzy V: 39:55 Now, all of these can be automated using policies, MAPR will offer you policies as well. Policies could be something as simple as if the data is not being written to for the last 60 days or has not been accessed for this many days, move it out. It could be based on type as well.
Suzy V: 40:16 Say if you are just an engineer, just going on having your development files over there, you could even tier them out based on your file extensions.
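The policy logic just described can be sketched roughly like this. The 60-day threshold, the extension list, and the tier names are illustrative assumptions, not MapR's actual policy syntax:

```python
# Hypothetical sketch of a tiering policy: hot (replicated, NVMe/SSD-backed),
# warm (erasure-coded), cold (tiered out to an object store / public cloud).
from datetime import datetime, timedelta

def choose_tier(last_accessed, extension, now,
                cold_after=timedelta(days=60),
                dev_extensions=(".log", ".tmp", ".bak")):
    """Pick a storage tier for a file based on access age and extension."""
    if extension in dev_extensions:       # tier out development debris by extension
        return "cold"
    if now - last_accessed > cold_after:  # not accessed in the last 60 days
        return "warm"
    return "hot"                          # active data stays replicated

now = datetime(2019, 6, 1)
assert choose_tier(datetime(2019, 5, 20), ".parquet", now) == "hot"   # 12 days old
assert choose_tier(datetime(2019, 1, 1), ".parquet", now) == "warm"   # >60 days
assert choose_tier(datetime(2019, 5, 20), ".tmp", now) == "cold"      # by extension
```

In practice the policy engine, not application code, evaluates these rules; the point is only that the decision inputs are simple attributes like last-access time and file type.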
Suzy V: 40:29 So these are some of the things that you might want to consider if you're not already using data tiers.
Suzy V: 40:36 Okay, I have talked a lot about how the benefits come together if you separate compute and storage. It is not simply separating them from a hardware perspective; do take a look at your business ROI, and take a look at how you want to restructure your applications.
Suzy V: 41:01 So if we go back to the three benefits of separating compute and storage, let's take a look at how the features we just saw fit into those three benefits.
Suzy V: 41:14 So independently scaling compute and storage, how can you do that? If you run Spark or Drill jobs in containers, in Kubernetes, what can you do when you have spike periods?
Suzy V: 41:28 Using your quotas, you can spin up compute independent of your storage. You can segregate your storage independent of the compute jobs you're running.
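One way to picture those per-tenant quotas is a standard Kubernetes ResourceQuota on each tenant's namespace. The tenant name and the limits below are made-up values for illustration:

```python
# Minimal sketch: cap each tenant's compute with a standard Kubernetes
# ResourceQuota, so compute can grow or be reclaimed independently of the
# storage cluster. Tenant name and limits are illustrative assumptions.

def tenant_quota(tenant, cpu, memory, pods):
    """Build a ResourceQuota manifest for a tenant namespace."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{tenant}-compute", "namespace": tenant},
        "spec": {
            "hard": {
                "requests.cpu": str(cpu),   # total CPU the tenant may request
                "requests.memory": memory,  # total memory the tenant may request
                "pods": str(pods),          # cap on concurrently running pods
            }
        },
    }

quota = tenant_quota("data-science", cpu=16, memory="64Gi", pods=50)
assert quota["spec"]["hard"] == {
    "requests.cpu": "16", "requests.memory": "64Gi", "pods": "50"
}
```

During a spike period, a larger quota can be applied to the same namespace without touching the storage cluster at all; deleting the jobs afterwards frees the compute back up.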
Suzy V: 41:41 The same features we saw a couple of minutes ago apply here, and hopefully that connects the dots for how you can independently scale.
Suzy V: 41:50 Now, if you're still asking yourself, "Why do I need this independence?" this is precisely why you need it. If you want to focus on multi-tenancy, if you want to focus on adding new applications, if you have compute burst loads, then I highly recommend separating your compute cluster from your storage cluster.
Suzy V: 42:15 But if you are one of those who say, "Well, I never have a spike period, mine is pretty stable, and I really think I need to keep compute and storage together," by all means continue to do so. We are not saying that if you don't separate compute and storage, it's the end of the world.
Suzy V: 42:33 So these are the benefits you will get, or the challenges you can address, if you apply these features and scale out compute independent of storage.
Suzy V: 42:47 What about the second benefit? How can you use resources efficiently? Looking back at the features we just covered, this should be intuitive by now.
Suzy V: 43:00 So this is where the quotas and limits for your compute jobs really come in handy.
Suzy V: 43:06 By combining them with Kubernetes, you have the power to activate your jobs, and when you don't need them, you can delete them or even shut them down. Okay?
Suzy V: 43:18 That way you can auto-scale your compute resources up and down; when you delete the compute jobs, those resources are freed up and you can run other jobs.
Suzy V: 43:29 Now, you can also burst to the cloud. This is a powerful tool: you can not just burst to the cloud, we also give you a way to bring the data back.
Suzy V: 43:39 So what I've seen, especially in the data science area, is that many of you still have data on your laptops, but when you really want to experiment, or are thinking about production, you run a tool in Amazon or GCP or Azure, and then bring it back on-premise and run it in production there. In those cases, this is exactly where you bring in your quotas and limits and run it in the Kubernetes environment. That will cut down your time to market and time to production significantly.
Suzy V: 44:16 And finally, lower costs. I really don't have to explain this at all: you have separated compute and storage, you have made your storage usage efficient, you've also made your compute resources efficient, you have created the concept of tenancy, you have given tenants dedicated resources, and you're sharing the data.
Suzy V: 44:39 If you were to run, say, a standard FIO benchmark in such an environment, you should absolutely see a percentage reduction in the cost of running compute and storage separately, and you should be able to quantify that percentage in production.
Suzy V: 45:02 There have been many experiments we have done internally to showcase that, and our tech preview customers are doing the same.
Suzy V: 45:11 So to wrap this up: hopefully I have first drawn out the need for separating compute and storage; then what MapR is doing to make that easier to deploy and easier to manage, in the process embracing newer technology and making it easy for you to embrace it too; and finally tied it all together to the benefits of the features we are bringing.
Suzy V: 45:44 So hopefully this was very helpful for you. Before I open it up for questions, I would leave you with a couple of points. The concept of having persistent storage for Kubernetes already exists. We have a plethora of information on our website; go read it and familiarize yourself with it.
Suzy V: 46:05 If this concept of running everything, from Spark jobs to Drill jobs, in Kubernetes interests you, or you simply have questions around data science, I would be more than happy to engage with you one on one, maybe in a roadmap session. Or if this interests you and you want to do a tech preview of our features, I would be more than happy to fold you into our different phases of tech preview, where you can get hands-on support and experience.
Suzy V: 46:35 So with that, I'd be happy to take on questions here.
Stephen: 46:39 Okay, thank you Suzy for that very nice presentation. We did have a few questions come through, and I see we still have around 10 minutes to get through those. So if you're still here, we're moving into the Q&A section now.
Stephen: 46:55 So the first one that came in is: is MapR offering its own Kubernetes version, or can I use an open source version of Kubernetes?
Suzy V: 47:04 Yeah, very good question, and actually this is the most asked question at this point. So no, we are not offering our own version of Kubernetes. You can bring your own open source Kubernetes. Now, there is a subtle point I would convey here.
Suzy V: 47:21 Some of you choose a really old version of Kubernetes; in fact, yesterday I was talking to a customer who was using Kubernetes 1.9, I believe. Keep in mind that because Kubernetes has a large community, improvements come so fast, with new versions shipping at high speed, that certain features, like the persistent storage APIs, are not available in the old versions.
Suzy V: 47:54 So be prudent about the version of Kubernetes you choose from open source. If you want to use persistent storage, then start with a version of Kubernetes that supports it, so you can get those benefits.
Suzy V: 48:08 And one thing I also want to mention: upgrading from one open source Kubernetes version to another is not that easy, so keep that in mind as well. And the final point around that: if you are using layers like Red Hat OpenShift, they bring another round of complexity, as they are usually behind on the Kubernetes version they support. So watch out for all those aspects when you choose a Kubernetes version.
Stephen: 48:41 Okay. Another one that came in is: what if I don't have workload peaks and have more of a steady state of apps, would you still recommend separating compute from storage?
Suzy V: 48:52 Yes, absolutely. And I hope I touched on that point a couple of times in my presentation.
Suzy V: 48:59 Listen, it's not only for spiky workloads, right? If you really want to use your resources efficiently, do separate them out, because then you get the advantages of having dedicated CPU resources and multi-tenancy.
Suzy V: 49:17 There is value in that; maybe you don't need to spin jobs up and tear them down, but all the other benefits still apply.
Stephen: 49:27 Okay. Another one: do tenants only work with the Kubernetes structure and only shared MapR services, or could they also be used to mount a MapR snapshot in the same cluster with separated resources?
Suzy V: 49:42 Yeah, very good question. The short answer is yes, you can mount a MapR snapshot into the same cluster as well. But one thing to keep in mind when you ask whether tenants only work with the Kubernetes structure: the tenants are for your compute, okay?
Suzy V: 50:01 So when you separate the tenants ... and I saw a question earlier in the chat, which is: where is your Kubernetes cluster? So visualize this, right? Your Kubernetes cluster has to be independent from the cluster where your data is. Only then will you truly benefit from this.
Suzy V: 50:22 Now, your tenants will work in the Kubernetes structure, but when you mount those volumes, you can use the MapR snapshots on the same cluster, with the resources separated out.
Stephen: 50:34 Okay. Another one that came through was: I could move to the cloud to get separation of compute and storage, why would that not work?
Suzy V: 50:43 It will work, and as I mentioned in the talk, bursting into the cloud is not something you should shy away from. We don't. We partner with all the cloud vendors, and quite a few of you deploy us in the cloud as well.
Suzy V: 51:00 But again, there are best practices there, right? Ask yourself: is the cloud really the answer for you? What do you want to achieve by moving to the cloud? Like in the data science use case I talked about, many of you will run your sandboxes or your pilots in the cloud. But if you really want to control your data and have the security you need, then you will likely end up bringing it back on-premise, to your own cluster.
Suzy V: 51:32 So if you're looking into bursting to the cloud, choose a solution that gives you a hybrid cloud.
Suzy V: 51:39 What I mean by a hybrid cloud solution: ask yourself whether the application that runs a certain way on-premise will run the same way in the cloud as well.
Suzy V: 51:50 And also keep in mind that when you move to the cloud, there will be additional cost. Cloud is not that cheap anymore, we all know that. So keep that in mind as well when you burst to the cloud.
Suzy V: 52:03 Okay, I have a lot of questions in the Q&A box, so let me address them one by one.
Suzy V: 52:10 Okay, one of the questions, which hopefully I have already addressed, is: where does the Kubernetes cluster reside? It should be independent of your MapR cluster.
Suzy V: 52:19 The second question is: what happens to YARN? Yes, this is a very interesting debate that we all have.
Suzy V: 52:28 What I have seen is that Kubernetes is fast catching up with YARN. Okay? There are certain things that YARN does that Kubernetes doesn't do yet, but it is only a matter of time.
Suzy V: 52:42 Now, can I run YARN in containers? I personally don't see the value in it, because Kubernetes does everything YARN does in terms of resource management. So if you can move away from YARN to Kubernetes, this is the time to take a look at it. But if you really have subscribed to features like setting user quotas, things YARN does very well, and you can't move away from that, then look at solutions where you can use Kubernetes alongside YARN. That is the recommendation I would give today, but I do envision a lot of workloads moving towards Kubernetes by the end of 2019, so if we are having this conversation again around the end of this year, perhaps I would have a stronger recommendation to move away from YARN and go to Kubernetes.
Suzy V: 53:36 But again, it's on a use-case basis. If you have more questions, I'd be happy to have meetings with you and your team and walk through that. Okay?
Suzy V: 53:46 Next question: does it work with OpenShift or OSS Kubernetes? Yes, absolutely; I already mentioned the OpenShift aspect of it.
Suzy V: 53:54 Second question: how do you reconcile NFS access with MapR ACEs? Excellent question.
Suzy V: 54:01 So I had a slide around security, remember? That is where we take the Kubernetes Secrets and convert them into MapR tickets. We will take care of that for you, and if you have existing ACEs, we will fold them into the Kubernetes Secrets, so you will see that on both ends, actually.
Suzy V: 54:23 So the ACEs on your data will continue to be applied on the volumes. Okay?
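That Secrets-to-tickets handoff can be pictured roughly as packaging a MapR ticket into an opaque Kubernetes Secret that the volume plugin reads at mount time. The key name and the demo ticket bytes here are assumptions for illustration, not the exact on-disk format:

```python
# Illustrative only: wrapping a MapR ticket in a standard Kubernetes Secret.
# Kubernetes stores Secret data base64-encoded; the plugin would decode it
# and present the ticket to the MapR cluster for authentication.
import base64

def ticket_secret(name, namespace, ticket_bytes):
    """Build an opaque Secret manifest carrying a MapR ticket."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": {
            # "CONTAINER_TICKET" is an assumed key name for illustration.
            "CONTAINER_TICKET": base64.b64encode(ticket_bytes).decode("ascii"),
        },
    }

secret = ticket_secret("mapr-ticket", "tenant-a", b"demo-ticket-contents")
# Decoding round-trips the ticket exactly, so the ACEs tied to the ticket's
# identity keep governing access to the volumes.
assert base64.b64decode(secret["data"]["CONTAINER_TICKET"]) == b"demo-ticket-contents"
```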
Suzy V: 54:29 Next question: how will you migrate customers from YARN to Kubernetes?
Suzy V: 54:34 Okay, now this is another aspect that is very important to understand. There is no easy way where you can just flip a switch and suddenly move from YARN to Kubernetes. It does require a well-thought-out migration strategy.
Suzy V: 54:54 What I see with the customers who are early adopters is that when they do their data classification and their I/O profiling, they end up doing one of two things. One is they just lift and shift their apps into Kubernetes.
Suzy V: 55:16 But what I see more often is that they create an independent cluster with Kubernetes. They run the same application in that Kubernetes cluster, compare it with their YARN cluster, and see the benefits that come out of it. Once they see that, they keep the Kubernetes cluster and envision moving from the YARN cluster to the Kubernetes cluster over time.
Suzy V: 55:41 Right now, that is the cautious way I would recommend as well, because the complexity of understanding Kubernetes is very, very high at this point. So instead of lifting and shifting overnight, I would recommend the same strategy I'm seeing in these early adopters.
Suzy V: 56:01 We are at the top of the hour. So for those questions I have not answered, I will take them outside of this call and follow up one-on-one to give you the answers.
Suzy V: 56:14 So hopefully this was quite helpful for all of you. If you have more questions, feel free to reach out to me, and I'll be happy to follow up.
Suzy V: 56:23 Thank you very much for taking the time.