Ted Dunning PhD
Chief Application Architect, MapR Technologies
Ellen Friedman PhD
Principal Technologist, MapR Technologies
Value from big data depends on successful production deployments of AI and analytics applications at scale. Is your system ready?
In this webinar we explore 5 specific questions you should ask about your own system in order to set up for success. Ted Dunning, Chief Application Architect for MapR Technologies, and Ellen Friedman, Principal Technologist with MapR, will take you through a live (and lively) discussion of these key criteria along with practical suggestions for how to meet them.
Ted Dunning: 00:00 Yeah, well, I think that the first thing is for us to introduce ourselves and each other. Ellen, go.
Ellen Friedman: 00:09 Hi, thank you all for joining us today, we have some contact information here on the slide and we will repeat that at the end, we would love to hear comments and feedback from all of you. Ted?
Ted Dunning: 00:22 Yeah, so that's me. MapR and Apache just like Ellen, O'Reilly just like Ellen, and contact information just like Ellen. You can get ahold of us all kinds of ways.
Ellen Friedman: 00:34 Okay. Let's jump right in. We want to talk to you today about Kubernetes, data fabric, and multi-cloud because these are three emerging technologies. These are three trends that people are really putting to work to be able to get real value out of their large-scale data projects. They're making a big difference. But, I think the best way to avoid a problem is to never have the problem, and so we want to focus you in on some gotchas, some potential gotchas, using these technologies so that you can be set up and not run into problems when you're using them. We pose those as five questions.
Ellen Friedman: 01:20 But, before we get into the questions, want to take a moment just for a little bit of background, particularly with regard to Kubernetes. Some of you are already using this, for some of you it would be new, and so we just want to put you a little bit into the context of what Kubernetes is and why people use it and how they use it. This starts with the idea of why you would want to use containerized applications. Containerization is really helpful, it lets you run different applications in definable, repeatable, predictable environments. That gives you a lot of flexibility, especially because you can be running different applications basically side by side on the same cluster but in very different environments. It also makes it easier for you to be able to deploy those applications.
Ellen Friedman: 02:12 But, containerization also needs some way to orchestrate what you're doing. Ted, why don't you explain a little more about-
Ted Dunning: 02:21 Yeah, and that's really the key advance that you can see in systems like Kubernetes. That is that you can take a few containers that need to run on a single machine and make them into what's called a pod, and then you can group pods together so that they work together as a single service. Kubernetes provides the ability to define pods, to define affinities so that pods live near each other or far from each other. Then the ability to name the pods as services, and that naming is actually a really important aspect of it. It isn't just starting, stopping, and positioning, it's starting, stopping, positioning, and naming. We'll get into how names are magical a little bit later in the talk.
Ellen Friedman: 03:16 Okay, so let's jump right in to that first question. How will I store data from stateful applications that are orchestrated by Kubernetes? Ted?
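As an illustration of the pod, affinity, and naming ideas Ted describes, here is a minimal sketch that builds a Pod manifest and a matching Service as plain Python dictionaries. The field names follow the standard Kubernetes manifest layout; the pod name `ingest-0`, the label `ingest`, the image `example.com/ingest:1.0`, and port 8080 are hypothetical values chosen only for the example and do not come from the webinar.

```python
import json


def pod_with_anti_affinity(name, app_label, image):
    """Build a Pod manifest that asks the scheduler to keep pods with the
    same app label on different nodes (pods living 'far from each other')."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"app": app_label}},
        "spec": {
            "containers": [{"name": name, "image": image}],
            "affinity": {
                "podAntiAffinity": {
                    "requiredDuringSchedulingIgnoredDuringExecution": [
                        {
                            "labelSelector": {"matchLabels": {"app": app_label}},
                            # One pod per node: the node hostname is the topology domain.
                            "topologyKey": "kubernetes.io/hostname",
                        }
                    ]
                }
            },
        },
    }


def service_for(app_label, port):
    """Build a Service manifest that gives a stable name to every pod
    carrying the given app label (the 'naming' part of orchestration)."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": app_label},
        "spec": {"selector": {"app": app_label}, "ports": [{"port": port}]},
    }


# Hypothetical example values, not taken from the talk.
manifest = pod_with_anti_affinity("ingest-0", "ingest", "example.com/ingest:1.0")
service = service_for("ingest", 8080)
print(json.dumps(manifest, indent=2))
print(json.dumps(service, indent=2))
```

Swapping `podAntiAffinity` for `podAffinity` would express the opposite preference, pods that should live near each other.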
Ted Dunning: 03:29 Yeah, well, that is a huge issue. It is the number one listed sore point for users, current users of Kubernetes in the Cloud Native survey from last year. The survey's only done every other year. But it's interesting to point out that this isn't a source of doubt for people who are thinking about Kubernetes, this is the major problem for people who have Kubernetes in production. The key takeaway here is you don't wanna put storage in containers. You can do a little bit of that, but you really want to avoid that. The other key takeaway is that simple solutions are really not satisfactory for the people who are sophisticated users of Kubernetes already. The idea you buy a storage appliance, the idea that you'll use specialized storage things like S3 isn't really the answer.
Ted Dunning: 04:41 The basic idea here is that Kubernetes is controlling the execution of your application. Here we've got the applications drawn as squares. Really, they're not single applications, single containers. They're typically a small fleet. Kubernetes of course starts and stops them and positions them. But you need data. You need for them to read files, you need streams, you need log files, and data that's produced by one app is read by another. That means that the life cycle of this data outlasts any particular application. Ultimately, what that means is that you need a data platform that stores this data and does essentially the same things as Kubernetes: it creates the data as applications try to create it, it positions it, and ultimately deletes it. Very, very importantly, it provides naming that's accessible to those applications.
Ellen Friedman: 05:47 Now, another aspect of using Kubernetes that we touched on earlier, and it's not just Kubernetes, it's containerization in general, is that it gives you a huge advantage in terms of multi-tenancy. It lets you run many different applications at the same time on the same cluster, so it's a great way to optimize resources. It's a great way to share data and to do that safely. Multi-tenancy, we see, is a big advantage at all stages in development and particularly in production. That brings us to our second question.
Ellen Friedman: 06:25 Second question is will my data platform complement the multi-tenancy that's offered by Kubernetes? In other words, you set up this orchestration of a number of containerized applications but if your data and data platforms aren't also giving you flexibility you basically have tanked the advantages that you're getting from Kubernetes and containers. Ted, what should people look for to see if they are set up to complement what's going on with Kubernetes?
Ted Dunning: 07:03 Yeah, this is a really key thing, and I think this is one of the major topics that is causing people in the Kubernetes world, people who have Kubernetes in production, to be dissatisfied with their storage solutions. The capabilities you need are exactly analogous to the things that Kubernetes does for multi-tenancy, and that is Kubernetes controls the locations and where things are started and stopped. It will even move them if necessary. The data platform should do exactly that. It should control the location where data is stored, which machines and such. It should handle the creation and the deletion and naming of that data. Kubernetes arranges those services in a cluster in order to improve ... to enable that multi-tenancy.
Ted Dunning: 08:02 Noisy neighbors get moved apart. The data platform should do exactly the same thing. It should arrange the data so that I/O loads are split across the cluster. Multi-tenancy for different storage types all in one cluster has to be done as well, and in order to do that properly you really need to have a single namespace that refers to all kinds of these objects. As we see it, and now this is not news for us, we started MapR many years ago. It seems like many, it's only 10. We started it a long time ago, but our vision was that we needed to provide this multi-tenancy for data across one cluster or many. We needed to supply standard APIs so that you could hit it from legacy applications, big data applications, whatever the next generation of applications is. AI applications.
Ted Dunning: 09:07 You should be able to hit the data from anything you want and still take advantage of the naming and the multi-tenancy. Now, you can actually see this. Here is my home directory for instance. If I do LS in my home directory I see files, directories, that's like any file system, but then I see streams and a table. S1, S2 are the streams, and T1 is the table. Those appear in my directory. That is a crucial organizing property of modern file systems. The fact that it's extended to include streams and tables is a big deal.
Ted Dunning: 09:52 Then if you look down in the lower left there, you can see the path to my home directory, /mapr/se1/user/tdunning. That /mapr/se1 specifies which cluster we're talking about. The rest of it is just a normal path thing. There's a thing called a volume that's key to doing this management. I'm gonna have Ellen talk a bit about volumes.
Ellen Friedman: 10:21 Yeah, I wanted to tell you a little more about volumes, what a MapR volume is. Because this is a great example of how you can do the data orchestration that is the parallel for what Kubernetes is doing for your containerized applications. Now, as Ted mentioned, a MapR volume is basically like a directory with superpowers. It's a really convenient tool for doing management. You will have multiple volumes on your cluster basically spanning the cluster, and as Ted has already pointed out you can see that within a volume you have files, tables, and streams all within the same volume. Because, I'll just remind you, MapR is a fundamentally different technology where files, the database, and the stream transport are not separate components that are talking to each other through connectors and then being called a data platform.
Ellen Friedman: 11:18 This is literally one code, it's the same process, it's the same technology, and that puts it all under the same management and administration, and it puts it under the same security. That's a really convenient way to be able to orchestrate data. You're doing this with different types of data structures. The volume is also the key point where you apply certain policies, including giving you fine-grained control over who has access and indeed who does not have access to certain data. Now you can also apply access at a finer-grained level. But at the volume level you can control access by users, by user groups. The volume is also the basis for mirroring. MapR has really efficient mirroring.
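To make the structure of that path concrete, here is a minimal sketch that splits a global-namespace path into the cluster name and the ordinary path inside that cluster. It assumes the path Ted reads out is written `/mapr/se1/user/tdunning` and that `/mapr` is the mount root; the function name `split_global_path` and the second cluster name `se2` are hypothetical, introduced only for illustration, and nothing here calls a MapR API.

```python
from pathlib import PurePosixPath


def split_global_path(path, mount_root="/mapr"):
    """Split '/mapr/<cluster>/<rest...>' into (cluster, '/<rest...>').

    Raises ValueError if the path is not under the mount root or does not
    name a cluster.
    """
    relative = PurePosixPath(path).relative_to(PurePosixPath(mount_root))
    if not relative.parts:
        raise ValueError("path does not name a cluster: %r" % path)
    cluster = relative.parts[0]
    # Everything after the cluster name is a normal path within that cluster.
    inner = PurePosixPath("/", *relative.parts[1:])
    return cluster, str(inner)


print(split_global_path("/mapr/se1/user/tdunning"))
# The same inner path on a different (hypothetical) cluster differs only
# in the cluster component of the name.
print(split_global_path("/mapr/se2/user/tdunning"))
```

Because the cluster is just one component of the name, an application can point at data in another location by changing that component and leaving the rest of its code alone.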
Ellen Friedman: 12:05 The volume is also the basis for making real point in time snapshots, a good way to do data versioning. You can see all of this is a huge advantage in being able to orchestrate, to manage data in a way that supports multi-tenancy, and that complements what's going on.
Ted Dunning: 12:26 Actually Ellen, this is just a really key point, I want to just amplify a little bit. Control and management is ironically something that helps the cluster, helps the storage system, the data platform provide multi-tenancy? Right? That's what you're saying there.
Ellen Friedman: 12:45 Right.
Ted Dunning: 12:47 In the little diagram there each color is a different data object, could be a stream, could be a file. The triangles are the volumes spread across multiple racks of computers. But those don't have to be racks that are in the customer's facility, right?
Ellen Friedman: 13:04 That's true. But we'll get to that later, I want to point out one more cool thing that you can do with volumes, and that is people do have situations where they need to control data placement, they need to control data locality. This might be for something like using specialized hardware, such as GPUs, we see that increasingly with people who are doing AI and deep learning applications where they want to run certain applications on GPUs and on other ones elsewhere on the cluster. This is still all being handled by management via volumes. Compliance requirements also can make it important for you to keep certain data at a particular location or on a particular machine.
Ellen Friedman: 13:50 This just gives you the flexibility to have this done in an automated way or to be able to step in and do it in a configured way. But what Ted was beginning to allude to is that these sorts of situations, and indeed using containerization, using Kubernetes to orchestrate those containers, are not just something that you do on machines that you have right on premises. People are increasingly looking at going to cloud deployments, or they're going to a hybrid of on premises and cloud, or even multi-cloud deployment. They want to be able to use these management things like containers and Kubernetes in their cloud deployments. That brings us to the next question I think, yeah. Oh, we just wanted to emphasize that what we're saying here is that you need something that orchestrates data in the same way that Kubernetes is orchestrating the application. That is what is meant by the term dataware: it's doing more than just storage.
Ted Dunning: 14:53 Yeah, and that's a really crucial thing to underline. That the orchestration of data is necessary in the cloud, it's necessary on premises, they're exactly the same issues whether you rent or buy the hardware. There really isn't that much difference in the physics. You need that control, that optimization for real multi-tenancy. Here we go.
Ellen Friedman: 15:21 For our next question, you should be asking yourself, "Can I run Kubernetes applications not just in the cloud, but can I run them in any cloud?"
Ted Dunning: 15:31 Now that sounds weird, but this is a really really key point. If you start running your business in The Cloud, if you think of the cloud as a singular thing you have lock-in, inevitably. That's a problem. The reason the lock-in occurs is because cloud systems have idiosyncratic APIs. One cloud doesn't have the same APIs as another, so that if you write to those specific and specialized APIs you're gonna have a system that's locked to one of the clouds. But, if you build a system instead on a single data plane ... so if your data plane goes across that then the most critical APIs can be made open and consistent. That means that you can run code with the same API calls on premises, on the edge, in the public cloud, or in any other public cloud.
Ted Dunning: 16:42 This makes your data portable, and it makes you portable. That puts the cloud vendors in a commodity situation. We've seen this for instance, this uniform computing environment allows people to move data around. This is where the data platform becomes more than a data platform, it starts verging into becoming what's called a data fabric. That you can manage data globally across multiple clouds, across a hybrid setting. This is a huge advantage. This is where the promise of cloud really becomes real, having platform level data replication, and having uniform access is a big, big thing.
Ellen Friedman: 17:26 That leads us right into question number four, which is, "Ask yourself how can applications in different clouds access the same data?"
Ted Dunning: 17:38 The exact same data?
Ellen Friedman: 17:39 The exact same data. Before you answer it, let's think for a moment why that's an important question for people to ask themselves. What are the situations where they would want to be able to do that? One thing that occurs to me is that they may be trying to move from one cloud vendor to another, they don't want to be locked in. But, more importantly they may actually have applications that they want to run with different vendors because they're, maybe, taking advantage of different services. But, those applications are actually, ultimately, going to address the same data.
Ted Dunning: 18:17 Yeah. So the move, of course, you can't move all at once. Nothing moves all in one instant. So, you have to move things bit by bit. And so, some things have to move before other things and you're going to have two clouds during that time.
Ted Dunning: 18:33 And, we have customers, for instance, who think that Google's machine learning is far, far better for them than anybody else's. But, they also think that they can do a better job of managing their cluster than any of the cloud vendors because they control the costs that way and they're committed and solid. So, they use on-premises and the Google cloud with a little bit of the data platform in both places.
Ted Dunning: 19:05 This allows you to use this capability of the global namespace where you can use the path name. That's what we're showing here to refer to data that lives in any given location. It can be in the cloud, it can be in any of the clouds, it can be in any on-premises data center, or it can even be out at the edge. If you're doing manufacturing, or if you're trying to push data centers very, very close to your customers, then you may want to push this data way out there.
Ellen Friedman: 19:42 Now, before you go on, let's just dig into this a little bit. I think it's pretty obvious why being able to do what you just described is an advantage. That sounds pretty attractive. But, if people don't have these sorts of capabilities and let's list those off, the ones that you feel are essential in the data platform, then what; what's the contrast? So, what are the capabilities that you just described that need to be there in order to do this in such a seamless way?
Ted Dunning: 20:13 One is open APIs.
Ellen Friedman: 20:15 Okay.
Ted Dunning: 20:15 So, that all kinds of programs can get to it. And so, if your data is in this cloud, that cloud, on-premises, on the edge, the programs could be identical and they can access the data in exactly the same way.
Ellen Friedman: 20:29 Okay.
Ted Dunning: 20:30 That's one.
Ellen Friedman: 20:32 And, MapR actually has open APIs for all of these different data structures and different data access.
Ted Dunning: 20:39 Absolutely, and I like to joke about the previous slide where we showed the results of LS. For most people, LS stands for List Directory. I think it stands for Legacy Software 'cause it was written in the '80s and it's an example of a program that was written many years ago, before many programmers today were born.
Ellen Friedman: 21:00 Okay.
Ted Dunning: 21:00 And, yet it works perfectly because of open APIs.
Ellen Friedman: 21:04 So, MapR as a platform, essentially, almost becomes invisible is what you're saying? You don't have to translate things into some special structure. They don't all have to be translated into, say, Hadoop [inaudible].
Ted Dunning: 21:19 Right.
Ellen Friedman: 21:19 We're actually using modern APIs, open source enterprise stuff like the code that you have. It basically all just runs on MapR.
Ted Dunning: 21:32 Yeah but the key thing is the right tool for the right thing.
Ellen Friedman: 21:36 Okay.
Ted Dunning: 21:36 Yeah. Now, a second thing is that you need to be able to manage this data so that you get true multi-tenancy. And, the platform and you have to be able to cooperate.
Ted Dunning: 21:51 And, the third capability is this ability for data to move, not just within a cluster, not just under your control, but from cluster to cluster as needed so you can access it either remotely, via the path name that specifies the location, or the data can actually be ubiquitous and updatable in a multi-master sort of situation. That's really, really common with streams, for instance, that people want the same stream to be in multiple places.
Ellen Friedman: 22:22 And so you're saying you would have this in the way that MapR is acting as dataware, that you would have the ability to, basically, have choice. To customize when you want to replicate things, when you want to place computation next to data, when you want to access it remotely depending on what's... You look at each individual situation and figure out what's optimal for that situation. But, at the same time you also have automated data management going on. You don't have to stop and do all of it point by point and, indeed, usually you wouldn't want to.
Ted Dunning: 23:01 And that's really, really crucial. You have to have control. You have to have choice. But, you also have to have the platform-level capabilities that handle that data motion. That's a really, really complicated thing to handle by hand because it doesn't work out all of the way. So, if you can name it, you can control it. You have to have those names. You have to have the system do replication, failure tolerance. You have to be able to name it so you can access it from anywhere. You have to name it so you can set permissions on it in a completely uniform way.
Ted Dunning: 23:40 Most of the systems in most of these clouds are totally idiosyncratic to themselves, not just to the cloud, but every service in each one of these clouds typically has a different security model. You need a single global namespace, a single platform, a single data fabric.
Ellen Friedman: 23:58 And, you've introduced the term data fabric. Let's go back to the last slide for a minute please. To the previous slide.
Ellen Friedman: 24:06 You talked about having a global data fabric. But, that's a term, again, that people may not be familiar with and may not understand, mainly, the context in which we're using it. So, I just want to clarify for a moment what we mean by a data fabric. It's not... the term doesn't matter. The context is very important. So, you know, call it a data fabric, call it a data party, call it whatever you want, but the idea is what really matters here. What we mean by this, by data fabric, is not a product that you buy. A data fabric is something that you build. You basically use your data, your architecture, your infrastructure, the technology such as your data platform, your dataware, in combination with something like Kubernetes to build this seamless access to data across wherever you want it.
Ellen Friedman: 25:03 Now, it has the advantages of things like a data lake, a data hub, and these kind of older ideas that still have value in the sense that you want to have a comprehensive view of data, or you want to potentially have a comprehensive view of data; that is, you don't want unwanted silos. But then, there are situations where not everybody should have access to all data and so you should be able to easily manage who does and doesn't have access to particular data. You should be able to find that data easily and Ted has just talked a lot about that, about the namespace, and why that's so important. So you can actually find what you want.
Ellen Friedman: 25:46 And, as a system administrator, it is useful to be thinking about where data is, are you meeting compliance, are you working with specialized hardware in a way that's most efficient? But, as the developer, as the data scientist, you shouldn't have to worry about where data is. You want to think about, for example, the models that you're doing, the analysis that you're doing, the insights that are coming out of that, what data you want to use, but not necessarily where the data is or how to place it where you need. So, that separation of concerns is very important.
Ellen Friedman: 26:22 You want to be able to do all of this, obviously, with seamless security and without a lot of big overhead of system administration. And so, we see people who have built a data fabric that use the MapR platform, using some of these other approaches and technology who have amazingly small overhead for system administration.
Ellen Friedman: 26:45 I'm thinking of one very, very large retailer, and as they come up on seasonal boosts in customer activity around the Christmas holiday season it's been traditional for people to basically set up a war room to try to handle all of the additional traffic and pressure on their clusters. But, when they started using MapR, in fact, they didn't need a war room. They have, what, three or four people handling a very large... [crosstalk 00:27:15]
Ted Dunning: 27:15 We have three people handling a multi-thousand-node cluster. Multiple clusters, actually. And this really is important, you know, we have people who have an on-premises-only policy. Oh, whoopsie. There's something in the cloud they need. We have people who have an in-cloud-only policy. Oh, whoopsie. It was much better for them to get their own GPUs.
Ted Dunning: 27:40 And so, even within one cloud you end up with multiple availability zones. Even if you have cloud-only policy or a no-cloud policy, you wind up with a little bit of each. Nobody can stay... Well, no cluster is an island I guess. Although some are on islands. But, no single one is totally isolated.
Ellen Friedman: 28:03 So the point here is that we're using the term data fabric as just a kind of an image where the whole line up of goals of making the storage situation. The reason that we call it a fabric is that like a fabric you have this various edge, you can reach in, touch any thread, find that data, control access to single entities, but all of those threads together act as a single fabric. And so, as a single thing, you can have very simplified system administration and again, this unified security, this core security that's so important for these systems.
Ellen Friedman: 28:42 So we find that people are really moving toward this idea of building a global data fabric and that brings us to the next question. We jumped ahead a minute ago, but just wanted you to have that background of why people want to build a data fabric. But the question is, if you want to build a data fabric how can you tell? Ask yourself, how can I tell that I have the right technology and, indeed, the right architecture to be able to build a data fabric that's going to function in the way that we mentioned. Ted?
Ted Dunning: 29:14 Yeah, and so the real key is that a lot of this stuff, the orchestration, the positioning, the splitting, the repositioning and the data motion, possibly, between different clusters, possibly within a single cluster, has to be done at the platform level. We've alluded to this a little bit. Coding these at an application level is, and not to put too kind a word on it, it is disastrous. We have seen enormous schedule leaks when people try to implement this sort of capability at the application level. And the problem is really if you don't have a foundation at the bottom then you can kind of dig forever on more and more complex failure modes.
Ted Dunning: 30:03 You wouldn't think that copying files would be so complex but copying hundreds of millions of files can be enormously complex especially when that stuff happens. You have lots of different links between clusters in order to get enough [inaudible 00:30:20] and so you've got lots of things in flight, and some of these links go down. Or some of them get slow. Some of them get fast. And, the failure modes are both very, very subtle and very, very complex. The result is you need, if you're going to code it yourself, some really sharp people to be dedicated to managing that and to handling it. And those people are novelty seeking and they do not like having to do that kind of work on a long-term basis. So, these complex application-level data motions are not sustainable. It's very dangerous to do that.
Ted Dunning: 31:04 The data platform itself, starting at the foundation, at the absolute core and below the level that's visible to applications, should be doing transactionally correct snapshots and mirroring so that you can move very, very large volumes really quickly and know that it's exactly correct to the bit level. And, to not only know what bits moved, but exactly when. So, you can know changes happened after the copy or before, and no changes crossed over. There's no doubt.
Ted Dunning: 31:38 It needs to be at platform level for table and stream replication as well. And you're gonna have to do transparent load and space balancing on both the data storage and the network itself. These are complex issues that application people just don't address when they start trying to do this.
Ellen Friedman: 32:00 So, before you go on, those three bullet points really matter. As you look at your own system and you ask yourself these questions, those are the things you want to look for. Transactionally correct, transparent mirroring. Do you handle this at platform level and do you have stream and table replication as well? And, transparent load and space balancing really matter. So whatever system you're using ask yourself if you see those capabilities and if the capabilities are there in a way that's reliable? They're available, you can depend on them, they don't take an army of people to be able to administer because otherwise, they're not really worth it.
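The point about changes landing cleanly before or after the copy can be illustrated with a toy model. This is only a sketch of the general snapshot-then-mirror idea, assuming an in-memory dictionary stands in for a volume; the `ToyVolume` class and `mirror` function are hypothetical names and say nothing about how MapR implements snapshots or mirroring.

```python
import copy


class ToyVolume:
    """In-memory stand-in for a volume: a live file map plus named snapshots."""

    def __init__(self):
        self.files = {}
        self.snapshots = {}

    def write(self, name, data):
        self.files[name] = data

    def snapshot(self, label):
        # Freeze the exact state at this moment; later writes cannot alter it.
        self.snapshots[label] = copy.deepcopy(self.files)
        return label


def mirror(source, label, target):
    """Copy a frozen snapshot, never the live files, to the target volume."""
    target.files = copy.deepcopy(source.snapshots[label])


source, replica = ToyVolume(), ToyVolume()
source.write("a.log", "v1")
source.write("b.log", "v1")
snap = source.snapshot("t0")

# A write that arrives while the mirror is notionally in flight.
source.write("a.log", "v2")
mirror(source, snap, replica)

# The replica matches the snapshot instant: the later write did not cross over.
print(replica.files)
print(source.files)
```

Copying from the live file map instead of the snapshot would let the replica mix states from different moments, which is the subtle failure the discussion warns about when data motion is coded at the application level.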
Ted Dunning: 32:48 And particularly they don't take an army of PhD qualified people.
Ellen Friedman: 32:51 Yes, absolutely.
Ted Dunning: 32:53 Yeah. The dataware that you use has to handle all that motion and orchestration at the platform level.
Ellen Friedman: 33:00 And so that last comment we think is so important, we said it twice. We're going to use that as our closing slide for the content here just to remind you that with dataware, you're not just handling storage, you're handling motion and orchestration of data and data access, data replication. Much of that should be handled at the platform level, not just at the application level, and that gives you the full advantage of the rest of what we talked about: Kubernetes orchestration, containerized applications, working with cloud and multi-cloud environments, and so forth. And so, that's what underlies a really successful data fabric.
Ted Dunning: 33:44 And this isn't just a random thing to come up with on Tuesday, this is a long-term vision that you have to have to make this really work. And that's what we've been working on for a long time. We talked about MapR here because that's how we know how to do this. We don't know of any other system that can do this. It's a new kind of problem. It's a new kind of capability. But it follows directly from the things that people need to do.
Ellen Friedman: 34:14 One of the nice things about moving to these kinds of approaches is that, you've heard a recurring theme of the need for flexibility and having these designs and technologies that really support flexibility and agility. Those are very important for the way modern businesses work. But, it's also good because by their very nature of being flexible it's easier to transition to them from existing systems than more models with existing when you make a change. And so, it's the sort of thing where you don't necessarily have to transition the whole, everything you do, all at once. Although that's not that hard to do either. You could do this in stages. So these are really practical systems and these are systems that we've looked at.
Ellen Friedman: 35:04 If you go to the next slide Ted. Just want to remind people, over the last several months, we [inaudible 00:35:12] people ask how to get their machine learning, their AI systems, their analytics systems, these large-scale systems and applications into production because it's in production that you really start to get that value back. We started this webinar on a slide talking about how to get value from data across a variety of industries. It's once you get them into production that you're really reaping the benefits.
Ellen Friedman: 35:41 We've found from various reports coming from industry that people working in some large-scale systems, particularly that are working on HDFS or [inaudible 00:35:51] platforms, that only about seventeen or eighteen percent of them felt that they had these systems in production and in production successfully. We started looking at what our MapR customers are doing, and we know that over 90% of MapR customers have these systems in production, and in many cases, they've done this. I know of a financial institution in Europe that started a new project, actually for an internal goal, and they got that up and going in about three or four months. Had that in production, and realized that they had a broader value that they could sell a modification of it as an offered service. And I think the end goal for that is in about six to eight months from the beginning of working on this MapR platform they have this into production.
Ellen Friedman: 36:39 So, these systems can move very fast. [inaudible 00:36:44] had an idea, we looked at a number of MapR customers, we looked at what they are doing, what their habits are even their habits in terms of the cultural organization of their businesses. How they connected these applications into final business goals. In other words, what were they doing and why were so many of them successful at getting these systems into production?
Ellen Friedman: 37:08 And then we put together what we found, this observation, into this new book that was published about the middle of September of this year, AI and Analytics in Production, and MapR makes that available to you as a free PDF and so we've left the link there where you can download it and we encourage you to take a look at that. Even some of the material we talked about today is in it, but it goes into a lot more. And we would love to hear back from you if you do look at the book, if you have comments or questions about it.
Ted Dunning: 37:39 And frankly, Ellen has copies of the book right here, hold it up so that people can see. But when we're at shows, conferences, we often sign these books and you can even get a physical copy that way.
Ellen Friedman: 37:54 That's right.
Ted Dunning: 37:54 Which is fun.
Ellen Friedman: 37:56 Well that's true, we're going to be in London next week at Big Data London and I think we are doing some book signings there. I had forgotten that part.
Ted Dunning: 38:05 I'm afraid we are.
Ellen Friedman: 38:10 Yeah. Always, Ted and I encourage you to support women in technology. Women and people of all diversity. This isn't just good for those groups, it's good for society itself and we thank you very much for coming to this webinar. I think we're going to slide this over to David if there are any questions.
David: 38:32 Thanks Ellen, thanks Ted. Just a reminder, everybody: if you have any questions, submit them in the bottom left-hand corner of your browser through the chat window.
Ted Dunning: 38:45 Yeah, let's go ahead and look. I'm trying to get that thing back open to take a look. Yeah, okay. That's actually a really, really good question. Does MapR follow open industry standards? And the answer is absolutely, we follow standards. And that's really key. One of the great innovations of computer science is the idea of standard interfaces, standard APIs. And the reason that that is powerful is that a standard API allows for different implementations that interoperate with other applications in the same way, but provide certain advantages. So, you ask which open standards. There are several, because of course different programs want different standards. HDFS is one; this is useful for big data systems. Another one is POSIX. So when I showed my home directory using the standard Linux facility called ls, that was using the POSIX API. And we provide that with a couple of different technologies. We also provide the HBase table API. We provide an open source API called OJAI for document databases. And we offer the Kafka API for streaming data. These are all standards, either de facto or de jure. POSIX, of course, is subject to standards organizations; the HDFS and Kafka APIs are subject to community de facto standards via the corresponding Apache projects.
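As a minimal sketch of the POSIX point above: a program that lists a directory through the ordinary operating-system interface does not need to know what storage sits underneath. The temporary directory here is only a stand-in for a mounted data directory, and the file names and the `list_directory` helper are illustrative assumptions, not anything shown in the webinar.

```python
import os
import tempfile


def list_directory(path):
    """Return the sorted entry names in `path` using plain OS-level calls."""
    with os.scandir(path) as entries:
        return sorted(entry.name for entry in entries)


# Stand-in for a mounted data directory (hypothetical; any directory works).
with tempfile.TemporaryDirectory() as data_root:
    for name in ("events.csv", "model.bin"):
        with open(os.path.join(data_root, name), "w") as handle:
            handle.write("example\n")

    listing = list_directory(data_root)
    print(listing)
```

The same function would work unchanged against a local disk, an NFS mount, or any other file system exposed through the standard interface, which is the interoperability argument being made here.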
Ted Dunning: 40:54 But yeah, open standards are absolutely critical. Also, I have a question here about edge computing. We mentioned that but we didn't say very much about it, and you know, like, what's that? So we have customers who have manufacturing facilities or extraction industries where they have remote machinery. We have people who have media systems, so they have lots and lots of data centers. Like 100 data centers that are very, very close to their customers. These are all forms of edge computing. And in an ideal case, the fabric extends all the way to the edge. And we're even seeing people embedding the data fabric in cars, or in other mobile machines. Having the data fabric go all the way to the edge like that means that you have a really simple open API at the edge to say, put things in a stream.
Ted Dunning: 42:03 And if that stream exists across your entire data fabric, then it can exist in the cloud as well. So all of these little edge things, I don't know what they are, they are things, right? An internet of things. They can all be putting data in a stream which also exists in your cloud deployment. Now most of the hardware that you own, or that you're actually working with can be in the cloud. That was part of the question, and yet it can see all of the data that lives at the edge, or at least starts at the edge.
Ted Dunning: 42:39 And that's really, really cool because it makes it really, really simple. And Ellen talked ... the reason why it gets simple is what Ellen talked about for a moment. She said separation of concerns. The people who are programming the part at the edge, it should be a really simple task and it is, if you have just the task of just putting stuff in a stream. The people in the core programming should have a simple, simple task of just reading from the stream. The actual data motion between many many edges and core systems should be handled at the platform level. And that's what makes that simple.
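The separation of concerns described above can be sketched roughly as follows. The `InMemoryStream` class is a hypothetical stand-in for a platform-managed stream, not a MapR or Kafka API, and the device names and readings are made up for illustration. The point is only that the edge code writes, the core code reads, and neither deals with moving data between locations.

```python
from collections import deque


class InMemoryStream:
    """Hypothetical stand-in for a platform-managed stream."""

    def __init__(self):
        self._messages = deque()

    def put(self, message):
        self._messages.append(message)

    def read_all(self):
        # Hand back everything written so far and empty the buffer.
        drained = list(self._messages)
        self._messages.clear()
        return drained


def edge_report(stream, device_id, reading):
    """Edge-side code: its only job is to put a record in the stream."""
    stream.put({"device": device_id, "reading": reading})


def core_average(stream):
    """Core-side code: its only job is to read from the stream and compute."""
    readings = [message["reading"] for message in stream.read_all()]
    return sum(readings) / len(readings) if readings else None


stream = InMemoryStream()
edge_report(stream, "pump-1", 10.0)
edge_report(stream, "pump-2", 20.0)
average = core_average(stream)
print(average)
```

In a real deployment the stream object would be provided by the platform and replicated between edge and core, so neither function would change when the data starts crossing data centers.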
Ted Dunning: 43:24 Here's a question about Marathon instead of Kubernetes for orchestration. That's a really good question. Marathon is a Mesos-based system for long-lived processes. And right now, I mean this is a fact of life, right now the vast momentum is in favor of Kubernetes over systems like Mesos. It's like 80 to 90% of new deployments are happening on Kubernetes. That said, Mesos has announced support for Kubernetes, and so essentially, Mesos can be a control plane for Kubernetes. And so, we don't officially support it, but we have clusters that use Mesos to coordinate containers. And they can still use the Kubernetes capabilities to access data.
Ellen Friedman: 44:28 Before you go to the next question, that was a very good question and I just want to mention, we talked about Kubernetes, we talked about Kubernetes in multi-cloud. We talked about the role of the data platform, the dataware, in giving you that uniform computing environment for multi-cloud. We think that's a nice combination, but what we were saying about the need for dataware, what we said about the need for building a data fabric and having that kind of access to multi-cloud, doesn't mean you have to do it with Kubernetes, right Ted? I mean, a lot of what we talk about is based on platform capabilities even if you're using a different orchestration.
Ted Dunning: 45:13 Yeah, you don't have to use Kubernetes, but I think that Kubernetes is right now by far the most powerful system for putting containers together into actual systems. Most container orchestration systems other than Kubernetes think of containers as individuals far, far too much. Kubernetes takes the lessons learned in Google's Borg, the huge orchestration system they've used for many years, and Kubernetes is essentially the open sourcing of Borg. That's a big, big deal.
Ellen Friedman: 45:50 I recall reading a report the week before last, and I'm sorry I can't quote it, because I read several things, I can't give you the exact reference, but it was looking at the way people are using containers and cloud and Kubernetes. And one of the observations is, for example, for people who are using Google as their cloud, and that means they are using Kubernetes in --
Ted Dunning: 46:17 [crosstalk 00:46:17] because Kubernetes is the native language of Azure and Google, yeah.
Ellen Friedman: 46:21 But 85% of them were also using Kubernetes outside of cloud for their on premises-
Ted Dunning: 46:28 [crosstalk 00:46:28] Yeah, so it's a very-
Ellen Friedman: 46:31 [crosstalk 00:46:31] So people that have been using it are really happy with it. It's not something they've gotten forced into, and there seems to be a voluntary move towards using it.
Ted Dunning: 46:38 And it's very common to do it in a hybrid architecture. We have another question about standardization. And this is an interesting point: they point out that we've been talking a lot about standardizing the actual data access API through HDFS or POSIX or HBase or Kafka and so on, and they ask ... well, they assert that they don't see that as sufficient for metadata access and discoverability and so on. And is there an open standard that we would embrace? And right now there are no good widespread open standards there. There are different systems that are available. Some of these include, for instance, the Hive metadata store. And it is distinctly unsatisfactory in some respects, but it is fairly widely used. And we support that.
Ted Dunning: 47:41 Another widely used system is Onminishell. And Omninishell actually runs directly on MapR largely because of that standard API. So that's a cool thing too. And then there's another standard which is the Kafka metadata repository. And we support that too, but you know, right now it's a little bit like the Yogi Berra-esque thing: I love standards because there are so many to choose from. And with metadata access there is not yet a dominant de facto standard like there is with Kubernetes or POSIX. So, there's a question about Red Hat OpenShift, whether or not we support that. Until quite recently, OpenShift was lagging pretty far behind the Kubernetes standard. And that meant that some things like the Container Storage Interface were not available. It was only available, I think, as 1.8, and some of the later stuff we'd also like to have, like 1.9. I believe that OpenShift is catching up on that, and so we have some very good capabilities coming up on that. And we will, obviously, be using OpenShift, or supporting it. In house, according to the CNCF survey, it has something like 15% market share. The major cloud vendors, Azure, Google especially, but also recently AWS, all support Kubernetes. And that's where most of the deployments are. There are ... well, still a lot of on-premises deployments, and then there are secondary players, Oracle Cloud and so on. And also OpenShift. OpenShift is not yet a dominant way to run Kubernetes, but I think it will be very, very important, especially with the merger with IBM. I think there will be some meat behind that, with that being declared as the major strategy for them.
Ted Dunning: 50:18 That, of course, is increasing the momentum for Kubernetes. But I think OpenShift is going to be important. And we've committed to supporting that as well. Looks like, sorry, just doing a quick check. Oh, somebody asked what they need to do to change their applications in order to use MapR with Kubernetes. That depends a little bit on what your application is, but in general, there should be zero change. If you access data via open standards, Kafka, HBase, POSIX, HDFS, then you should be able to use that exactly. Especially with the POSIX API, because it's a standard at the operating system level, there could be no change at all even to the existing containers. And this is kind of bizarre, when most people who talk in the big X space talk about how much change you have to make, but you can take the vanilla, totally standard Postgres containers from the container repositories and run those with MapR under Kubernetes. And that's mind-blowingly wonderful.
Ted Dunning: 51:41 For other systems, like HDFS and Kafka, you'll have to replace some JARs, because it's not at the operating system level. So things like sidecar containers can't be used to provide that capability. We'd love it if it were possible, but it's just not quite there yet.
Ellen Friedman: 51:59 And once you're set up, I have a question. Once you're set up to do that and you're running on MapR, then if you move to the cloud, or if you move from cloud to cloud, what modifications do you have to make?
Ted Dunning: 52:12 Yeah, that's a great follow-up. Right now there are kind of small dialect differences between the Kubernetes on the different clouds or Kubernetes on premises. And we and partners are working to paper over those dialect differences. And of course, with these storage platforms that we provide, we're doing more than just papering over the differences, say, in the YAML or Helm charts when you define your systems. We're making it so that the same data can be accessed across that with exactly the same code.
Ted Dunning: 52:48 So there are currently still some small changes that you need to make in the Kubernetes specification of your service.
Ellen Friedman: 52:55 Okay.
Ted Dunning: 52:56 To make it live in that slightly different Kubernetes environment. People are working to make that smaller.
Ellen Friedman: 53:03 But do you feel at this point MapR is offering an advantage over the amount of changes they would be making if they weren't running?
Ted Dunning: 53:13 I would defer to my customers to say. Several of them have said to me they couldn't do this without a common data fabric.
Ellen Friedman: 53:22 Okay.
Ted Dunning: 53:22 Well, that phrase, "like Kubernetes but for data," comes from a customer. And it's really, really critical there that you get the benefits of Kubernetes but for data. Kubernetes is awesome for computation. And you need exactly the same thing for your data. And when you have it, you have amazing portability.