One Data Fabric to Rule Them All


Howard Marks

Founder & Chief Scientist,

Suzy Visvanathan

Director of Product Management, MapR

As technologies like cloud computing and big data analytics go mainstream, IT managers have realized that the independent storage silos they built to support each of their major applications are now islands of data. They’ve stood up HCI environments for VDI and/or server virtualization, and a Hadoop cluster with local disks for HDFS, only to discover that they now need to come up with ways to load data from the VDI cluster to the Hadoop cluster, or to a public cloud provider for further analysis. A data fabric, like MapR’s MapR XD, allows user organizations to store their data in a single logical repository that serves up their data not only across application silos but also between their data centers and public cloud providers.

In this webinar we’ll explore three examples of how an integrated data fabric simplifies IT processes and increases agility:

  • Bringing data services to cloud storage. Most public cloud storage offerings are severely limited, lacking many of the data services enterprise applications have relied on. The data fabric allows users to lift and shift their current applications to the public cloud without sacrificing the data protection those applications were designed for.
  • Eliminating the analytics data silo. Hadoop has traditionally used HDFS to provide storage from local drives on each node. Since other applications couldn’t write directly to HDFS, users were forced into batch load, analyze, unload workflows. With a data fabric, the Hadoop cluster can analyze the data in the same place it was originally written.
  • Simplifying data management within the data center. By presenting a single namespace across multiple nodes and multiple storage tiers, a data fabric lets users replace multiple scale-up and scale-out filers with a single point of management.


David: Hello, and thank you for joining us today for our webinar, One Data Fabric to Rule Them All, featuring Howard Marks from and Suzy Visvanathan from MapR Technologies. Our event today will run in total approximately one hour, with the last 10 to 15 minutes dedicated to addressing any questions. You can submit a question at any time throughout the presentation via the chat box in the lower left-hand corner of your browser.

David: With that, I'd like to pass the ball over to Howard to get us started. Howard, it's all yours.

Howard Marks: Thank you. To start off, I thought I'd introduce myself. I'm your not-so-humble speaker. I spent 30 years as a consultant and writing for publications from PC to Data Communications. I now run an independent test lab and analyst firm, DeepStorage, LLC, where we do things, as you see in the photo, like proving that scale-out storage systems can survive a node failure by causing a node failure by the most dramatic means possible, a flower pot full of thermite. I'm also co-host of the GreyBeards on Storage podcast with my fellow graybeard, Ray Lucchesi.

Howard Marks: The problem we're here to talk about today is one that I'm sure you all understand deeper than I could explain. That's that most of the data in corporate America, whether it's on premises in your data center or in the public cloud, it's trapped in some sort of silo. Databases and virtual machines are stored on the VMAX or some other block device, the user data is in filers. In addition, filers are holding data that comes from devices and loggers and factory floor instrumentation. All of those things speak NFS.

Howard Marks: Even if we think about filers as a silo, it turns out they're multiple silos. We have one set of NetApp filers that supports the QA department. We have another set of Isilon systems that our marketing and entertainment departments use. If we're doing big data analytics, and of course we should be, because all of the data in the other silos has so much hidden value, then HDFS, the Hadoop Distributed File System, becomes yet another data silo.

Howard Marks: Management, having listened to too many commercials that tell them that the cloud fixes everything, tells you, "Let's just move things to the public cloud and we'll get rid of all of our silos instead of having a VMAX silo and a NetApp silo and a data logger silo. We'll put everything into the Amazon cloud and we'll get rid of all of the silos."

Howard Marks: That's true if you look at it from the 427,000 foot view from the International Space Station, but down on the ground where IT guys like me live, all you've done by moving to the public cloud is replace one set of silos with another set of silos.

Howard Marks: I have EBS for block storage. It provides LUNs for the EC2 AMI VMs to run, but any given EBS device can usually only be accessed by one VM at a time. There are restrictions and costs associated with EBS. If I want to move data or have data that's accessible from multiple locations, then I might fire up the Amazon file services, but that has limited performance, and both of those solutions are substantially more expensive than the product that comes to mind when we think of cloud storage. That's the object store, S3.

Howard Marks: Compared to the file services on EBS, S3 costs as little as a 10th as much. If I go to the persistence or low access tiers with products like Glacier, the amount that I pay goes down and therefore the difference between EBS or Amazon file services and the object store increases, but that causes SLA implications. If it takes eight hours to recover data from Glacier, is that really where I want to put files my users may ask for at a moment's notice?

Howard Marks: The problem is that more silos mean more management, more fragile transports from silo to silo, and more copies. The workflow may theoretically be to move the data from place A to place B and process it there, so we may be moving data from SAP into HDFS so that we can run MapReduce or Spark jobs against it and then move the results back. The truth is that, much more frequently, because we are storage guys and our job is not to lose data, we copy rather than move.

Howard Marks: I'll export data from SAP and move it into HDFS so I can analyze it but the workflow doesn't always delete the data from HDFS and delete the export files that got created from SAP. We end up with more and more copies of the data, especially if we're doing any kind of redundant or time series-style analysis where we need to keep the data, even though it is a duplicate, in the analytics platform.

Howard Marks: What we really need to do to simplify data management in the public cloud, in the on-premises data center, and most significantly, as the transport between those different sets of silos, we need to have some overall intelligence that provides a common pool of storage accessible by multiple applications the way those applications want to access that data. That means that I need file protocols like SMB and NFS so that users can create files, so that devices that only know how to speak to NFS can create files, so that we can export data from something like our ERP system and have a script that runs on the ERP system and creates the export and writes it directly to the place where the next application's going to pick it up. We need file protocols for new containerized applications that need to read and write data that passes between containers.

Howard Marks: We want to have an object API. The truth is, thankfully the object API wars are over. That means the S3 API for cloud-native applications. We're hiring a bunch of young programmers to write their new code in Go and Erlang. They don't want to deal with ancient things like NFS. They want to use the new hotness that is web or cloud protocols and therefore S3.

Howard Marks: Then, since much of the reason that we're storing more and more data is that big data analytics like Hadoop have shown us that there is long tail value in that data, we need to provide access for that big data analytics application via the HDFS API, so that Hadoop or Spark can think it's still talking to HDFS.

Howard Marks: We need to be able to migrate data to appropriate storage locations for tiering and archiving. We've had partial solutions like HSM and ILM or hybrid storage systems that all think about this function of putting data in the appropriate place, but because they've all been either limited to a single storage platform or external, none of them have satisfied the requirement as well as we would like.

Howard Marks: For example, in most HSM/ILM systems, files are replaced by stubs. If somebody does something stupid like scan the entire marketing department folder for files containing Smith, all the files that were stubs get recalled to the performance tier, which there may not be enough room for. We want this tiering and archiving to be much more transparent. We need to be able to replicate data for access so that we could create data on-premises, replicate it to Amazon S3, and then analyze it using AMI images, because we only run that analysis occasionally and maintaining a Hadoop cluster for that is excessive.

Howard Marks: All of this, of course, needs to be a single scale-out repository with a common namespace so that all of these applications cannot just store their data but share it amongst themselves.

Howard Marks: For the rest of my conversation today, I want to talk about a few example use cases. I want to talk about how using a Data Fabric in the public cloud lets you take EBS and S3 and the other services available and build a single higher performance repository out of them. I want to talk about using a Data Fabric to replace HDFS so that we have a single place that data can be accessed both for analytics and for other applications.

Howard Marks: I want to talk for a minute or two about simply simplifying data management within the data center and using a Data Fabric to eliminate the internal silos.

Howard Marks: When we look at the public cloud, we see a huge need for something to tie these functions together. AWS, Azure, Google, the public cloud providers offer a series of services but these services are very basic. For example, when you use EBS, it is block storage but it's not enterprise storage. Enterprise storage systems talk about availability in the five-nines and EBS is designed for a 0.2% annual failure rate, which isn't even three-nines.
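To make the "nines" comparison concrete, here is a minimal sketch of the arithmetic. It treats the quoted 0.2% annual failure rate as if it were an unavailability fraction, which is a simplification (failure rate and availability are not the same measure), and the helper name `nines` is ours, not part of any AWS tooling.

```python
import math


def nines(failure_fraction: float) -> float:
    """Convert a failure/unavailability fraction into a 'number of nines'.

    0.01 (99% good) is two nines, 0.001 (99.9% good) is three nines, and so on.
    """
    if not 0 < failure_fraction < 1:
        raise ValueError("failure_fraction must be between 0 and 1 (exclusive)")
    return -math.log10(failure_fraction)


# Figures quoted in the talk.
ebs_failure_fraction = 0.002    # 0.2% annual failure rate
five_nines_fraction = 0.00001   # 99.999% availability

print(f"0.2% failure rate -> {nines(ebs_failure_fraction):.2f} nines")
print(f"five-nines target -> {nines(five_nines_fraction):.2f} nines")
```

Under that simplification, 0.2% lands between two and three nines, well short of the five nines enterprise arrays advertise.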

Howard Marks: EBS can't really be considered as resilient or as performant as enterprise storage, unless I'm paying for provisioned IOPS, in which case the provisioned IOPS start becoming very expensive. If I'm going to have 2,000 IOPS on a relatively small SSD partition in EBS, I end up paying 10 to 100 times as much for the IOPS as I do for the storage. 2,000 IOPS works out to $130 a month. This means that I have to now manage performance and capacity in the application.
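The cost claim can be checked with a few lines of arithmetic. In this rough sketch the per-IOPS rate is simply what the talk's own figures imply ($130 for 2,000 IOPS); the per-gigabyte storage rate is an assumed placeholder that does not come from the talk, so check current pricing before relying on the ratios.

```python
# Rate implied by the talk: $130 per month for 2,000 provisioned IOPS.
PRICE_PER_IOPS_MONTH = 130 / 2000  # $0.065 per IOPS per month

# ASSUMPTION: placeholder storage rate for illustration only, not from the talk.
ASSUMED_PRICE_PER_GB_MONTH = 0.125


def monthly_cost(size_gb: float, iops: int) -> tuple:
    """Return (storage_cost, iops_cost) in dollars per month."""
    storage_cost = size_gb * ASSUMED_PRICE_PER_GB_MONTH
    iops_cost = iops * PRICE_PER_IOPS_MONTH
    return storage_cost, iops_cost


for size_gb in (10, 100):
    storage_cost, iops_cost = monthly_cost(size_gb, 2000)
    ratio = iops_cost / storage_cost
    print(f"{size_gb:>4} GB volume: storage ${storage_cost:.2f}, "
          f"IOPS ${iops_cost:.2f}, ratio {ratio:.0f}x")
```

With the assumed storage rate, a small volume of 10 to 100 GB puts the IOPS charge at very roughly 10 to 100 times the capacity charge, which is consistent with the range quoted above.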

Howard Marks: In addition, things like Amazon file services are limited to availability in a single zone and it's only for NFS 4. AFS starts at 30 to 36 cents per gigabyte per month compared to two cents per gigabyte per month for S3. What we can do to make this set of cloud storage services look more like what we do in the data center is to add the Data Fabric. A Data Fabric here is software-defined storage. It runs on a series of cloud instances. Because Amazon is relatively large in this field, I tend to think and talk about these things in terms of AWS services. It can use the block SSD EBS as a performance tier with or without reserved IOPS. Then, it can use S3 as the capacity tier.

Howard Marks: S3 has a very high degree of resiliency. Amazon talks about twelve or fourteen-nines of resiliency but it really only has two or three-nines of availability because one S3 zone goes offline every several months. Of course, we read about it on the web. The Data Fabric can run in multiple zones in AWS and replicate all of your data to multiple zones in S3 so that if Amazon East goes offline, you can keep your applications running in the west zone. It provides multi-protocol access to this data so that applications that we've simply lifted and shifted from the corporate data center into Amazon, can access these services because they look like the enterprise data services we offer in our data center.
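The benefit of replicating across zones can be sketched with basic probability. This assumes zone outages are independent, which real outages may not be, and uses the "two or three nines" per-zone figures from the talk; the function name is ours.

```python
def combined_availability(per_zone: float, zones: int) -> float:
    """Probability that at least one of `zones` independent copies is reachable."""
    if not 0 <= per_zone <= 1 or zones < 1:
        raise ValueError("per_zone must be in [0, 1] and zones must be >= 1")
    # The data is unavailable only if every zone is down at the same time.
    return 1 - (1 - per_zone) ** zones


for per_zone in (0.99, 0.999):  # two nines and three nines per zone
    both = combined_availability(per_zone, 2)
    print(f"one zone {per_zone:.3%} -> two zones {both:.6%}")
```

Under the independence assumption, a second zone roughly doubles the number of nines: two nines become four, three become six.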

Howard Marks: We talk a lot about lifting and shifting, and customers expect to be able to lift and shift applications from their data center into the public cloud, but they don't realize that most public cloud providers have designed their systems for greenfield applications. Those applications are written in a cloud-native manner that's quite different from the way applications are written to run in the enterprise data center, where the assumption is that the infrastructure is perfect.

Howard Marks: In our second case, I want to talk about analytics. Traditional analytic storage is yet another silo. Because analytics applications like Hadoop are looking specifically for an HDFS data silo, we have to have complex workflows that move data from the source to some NFS data store and then transport it into the HDFS cluster via a script. We then run MapReduce, rinse and repeat, or we run Spark jobs, rinse and repeat. Then, finally, when that job's finished, we have to add another script that copies whatever the results are to where our users or the applications that are going to consume those results live.

Howard Marks: All this means we need more storage because we need to store multiple copies. We need to spend more time writing our scripts. Anybody who's written a script knows they're not perfect. They're never perfect. Once I've scripted some application, I also need to babysit it and make sure that the scripts all ran properly, so that I know whether the reason we got no result out of last night's Spark job is that I gave it no data to work with. GIGO, garbage in garbage out, is the oldest acronym in the computer business because too often, complex workflows simply break.

Howard Marks: The reason that analytics has become yet another storage silo is that HDFS, the Hadoop Distributed File System, was designed to solve a very specific problem, which it does relatively well: it was designed to make this big data analytics solution run on commodity hardware. That means that data protection is performed by replicating files across multiple nodes. We have a minimum of three times the amount of data in storage. It's got dedicated metadata NameNodes.

Howard Marks: The advantage is that because HDFS and Hadoop are so tightly tied, Hadoop knows where data is being stored. When a MapReduce job accesses, say, the yellow files in my little graphic there, the jobs will actually run on the nodes that hold that data. This was a big deal when we were talking about one gigabit data networks or if we're talking about Amazon-size scale. At the scale most organizations run Hadoop, with 10- or 100-Gigabit Ethernet, a good external file system will provide better performance than a 7,200 RPM local drive.
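The locality argument comes down to comparing link speed with drive speed. The sketch below converts Ethernet line rates to megabytes per second and compares them with a single local drive; the 150 MB/s drive figure is an assumed ballpark for a 7,200 RPM disk doing sequential reads, not a number from the talk, and protocol overhead is ignored.

```python
# ASSUMPTION: rough sequential throughput of one 7,200 RPM drive (not from the talk).
ASSUMED_DRIVE_MB_PER_S = 150.0


def link_mb_per_s(gigabits_per_s: float) -> float:
    """Raw line rate in MB/s for an Ethernet link (8 bits per byte, no overhead)."""
    return gigabits_per_s * 1000 / 8


for gbps in (1, 10, 100):
    link = link_mb_per_s(gbps)
    ratio = link / ASSUMED_DRIVE_MB_PER_S
    print(f"{gbps:>3} GbE: {link:>8.0f} MB/s, {ratio:.1f}x one local drive")
```

With that assumption, a 1-Gigabit link is slower than the local disk, so running the job where the data lives matters; at 10 or 100 Gigabit the network is no longer the limiting factor.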

Howard Marks: Because HDFS is a special-purpose file system, it's not fully POSIX compliant. In fact, HDFS is what we call an append-only file system. You can create a new file. You can append to the tail of an existing file but HDFS doesn't support opening a file and changing data in the middle. That means that we can't access an HDFS cluster via NFS. We either have to use a FUSE library or write code to specific HDFS APIs, all of which means that HDFS simply becomes a bottleneck in the process. If our Data Fabric implements a scale-out file system, then we can write data into the Data Fabric via NFS. That gives us full POSIX semantics for the applications that are used to them.
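The difference between append-only and full POSIX semantics is easy to see in code. This minimal sketch runs against an ordinary local file system using only the Python standard library; it does not talk to HDFS, it just shows the three operations discussed above and marks the one an append-only file system would refuse.

```python
import os
import tempfile


def demo_in_place_update() -> bytes:
    """Create, append to, and then modify the middle of a file."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # 1. Create a new file: fine on both append-only and POSIX file systems.
        with open(path, "wb") as f:
            f.write(b"AAAA-BBBB-CCCC")

        # 2. Append to the tail: also fine on an append-only file system.
        with open(path, "ab") as f:
            f.write(b"-DDDD")

        # 3. Seek into the middle and overwrite: this needs POSIX random-write
        #    semantics, which the talk says HDFS does not provide.
        with open(path, "r+b") as f:
            f.seek(5)
            f.write(b"XXXX")

        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


print(demo_in_place_update())
```

Applications written for filers routinely do step 3 (databases, log rotators, office documents), which is why they cannot simply be pointed at an append-only store.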

Howard Marks: Hadoop can then continue to access its data via HDFS. This gives us higher storage efficiency because we now have one logical copy, and even higher efficiency because the Data Fabric is so much smarter than HDFS and doesn't insist on storing all of its data via three-way replication. It's a hybrid storage pool so that active data can be on flash instead of spinning disks. This completely eliminates the load and remove workflows, simplifying my job as an administrator. Of course, simplifying my job as an administrator is at the top of my list.

Howard Marks: We, like SpongeBob SquarePants, are drowning. We're drowning in unstructured data. Not only are we having more and more objects or files and, at this point, the terms are pretty much synonymous, we have more and more objects and files being created but in addition, the average object size is growing substantially. We're using more rich media and the media that we're creating is getting richer and therefore bigger. We've transitioned from standard definition video to HD video. Now, we're going to 4K video. I imagine an organization like State Farm Insurance, where they send adjusters out with digital cameras. Those cameras every year are sending images that are four times as large.

Howard Marks: Scale-out NAS has been a partial solution to this problem but it's been relatively expensive if you're talking about solutions like Isilon. It still requires a bunch of hand-holding especially if we're talking about the other general class of scale-out NAS solutions, the kind of file systems that are generally used in high-performance computing.

Howard Marks: By adding a Data Fabric, we create the endless NAS. Instead of a single filer or a traditional scale-out NAS like an Isilon, we have software-defined storage nodes that have local SSDs that they use as a cache or performance tier, spinning disks, and a policy-based file location system that not only migrates data from flash to internal spinning disk but also migrates data to an external object store like S3 to preserve it for archives. We create a single namespace: the same file folder structure is visible in our San Francisco, New York, and Miami offices where we have instances of the fabric running, but it's also available in the cloud.

Howard Marks: If we run AMI instances of the software-defined storage stack, then we can allow other Amazon tasks or other Amazon AMIs to access that same set of data. We can replicate data to the cloud store for security and resiliency. We can run analytics or other jobs in cloud compute. This is a common solution in the media and entertainment business, where we might be transcoding or creating thumbnails or doing rendering in the public cloud, where compute is available that we only have to pay for by the hour.

Howard Marks: Now that I've discussed in general how a Data Fabric can improve our lives, I'm going to hand control over to Suzy, who is going to tell you about how MapR's converged data platform or MapR XD can help you with your problems.

Howard Marks: Suzy, the floor is yours.

Suzy V.: Thank you, Howard. That was an excellent segue to a bunch of items that I wanted to share with everyone. Okay. You heard from Howard about why a fabric is needed. He also stressed some of the requirements of what a fabric should be doing.

Suzy V.: We'll talk a little bit about how MapR approaches this. Many folks still think MapR is a Hadoop distribution vendor. I have done several such webinars with the hope that the more I talk about our converged data platform, everyone interested would understand that we are much more than that. Hopefully, in the next four or five slides, I will be able to reiterate that point again.

Suzy V.: If you look at our flagship, our foundational aspect, it has always been our MapR converged data platform. Why did we call it the converged data platform? This is very essential to understanding how this fits into the Data Fabric story.

Suzy V.: If you think about this converged data platform, we cater to multiple diverse sets of applications. First and foremost is definitely analytics and ML engines. Because MapR's history started off with the Hadoop Distributed File System, or rather as a distributor of Hadoop, we definitely have a lot of expertise and customer install base in analytics. There is no doubt about that. We have a very good, rich ecosystem of big data tools. We have a very diverse portfolio of customers who use our platform to run all kinds of analytics on it. Long gone is the segregation between batch and real time, if you ask me, because at the onset of analytics it was very much batch, but right now many customers have data that is constantly being generated from different sources. They need results right away. We can pretty much envision how the transition has happened.

Suzy V.: That is predominantly one use case that our platform caters to. However, if you look at all the other verticals, there are also portfolios or customer segments that we address. Our foundation is actually our underlying cloud-scale data store. This is probably a little less known, but we introduced and launched this as a product in itself, which is what Howard alluded to as MapR XD. Ever since we explicitly called it out, we have been having quite tremendous success with customers just using us as a data store. In the next few slides, I'll touch upon the very high-level salient points of what it is that we do in the data store.

Suzy V.: In addition to that, we have two important verticals. We allow you to run your applications through our MapR Database. If you have a structured, database kind of way to do it, then we have extended our platform to also allow you to run those applications. One thing you probably are wondering: Okay. These guys started off addressing unstructured data. However, now they are telling me, "This is a data store, so that means I can use it as a platform to just store my data? And then, I can even bring in structured data to this?"

Suzy V.: That's precisely the concept or the essence of a converged data platform. It is much more than just Hadoop or just analytics. The fourth main element here is what we call our stream product. This is very essential because when you are bringing in a high volume of data and handling not just the high volume but the speed with which the data changes, you need an effective product or feature to handle that. That is entirely addressed by our stream product.

Suzy V.: If you look at it, everything has a connection. You need a platform to store the high volume of data. You need a big data ecosystem to actually run tools on them and extract the value out of the data. Not only that, I, as a customer, have structured data as well. Do I want to be running it on a different platform? No. We allow you to do that as well. We give you an efficient way to bring in high volume diverse data into the system. This is our concept behind converged data platform.

Suzy V.: How does this fit into the fabric? The fabric is actually a fancy word, if you will, to just say a platform can address more than one single use case. A fabric should not only address different types of applications. It should also offer a way for you to deploy the applications in different environments. We'll see that in the ensuing slides.

Suzy V.: But first, I wanted to talk a little bit about the Cloud-Scale Data Store. What is it that we are doing there? These are the essence of the MapR converged data platform. Our fundamental foundation is nothing but a highly scalable, a distributed system, or a file system that can store files. It can store the same data in an object. It can also store data for containers. This is a fundamental aspect that we build our product on.

Suzy V.: Addressing the basics such as scale and performance, together with the fact that it is distributed, gives you the first aspect of a Data Fabric. Hopefully, everyone in the audience is with me so far.

Suzy V.: Having this distributed, scaled environment allows us to do one main thing, which is a global namespace. This global namespace is the catalyst or the reason why you can bring in different kinds of data. You can place them anywhere. You could have it in an on-premise data center. You can run them in a cloud. You could run it across an edge cluster. For instance, many customers have a remote site that needs to be self-contained and that generates its own data, which is very much particular to that site. We offer a solution for that as well.

Suzy V.: Because of this global namespace, an end user can access and view the files regardless of where that file exists. Think of it as a data platform where you can store data. You can access data regardless of where it is. You can bring in the data now using different kinds of protocols as well. If you have been listening to what Howard was saying, he was talking about NAS quite extensively, and the fundamental protocol around that is NFS. You can bring in data using NFS. We support NFS v3, and NFS v4 support will be coming right around the corner.

Suzy V.: We support POSIX-based APIs as well, and REST APIs. We support HDFS as well. Hopefully, you're getting the gist here. We are addressing bringing in data from different sources, which may or may not be using different kinds of protocols to bring it in. We are allowing you to store the data. We then allow you to access or store this data or bring in the data regardless of where or how your organization decides to place it: in a data center, in a cloud, or at the edge.

Suzy V.: This is a fundamental concept that you would need if you are considering deploying an environment like this in production. Extending more on this, you may be asking: What is this? What are you guys selling? Are you selling a hardware appliance? No. We are not. We are an entirely software-based solution. We run on any commodity hardware. Like I mentioned, and I'll reiterate it again, you can run it in the cloud, you can have it on an edge device, or you can have your own on-premise data center.

Suzy V.: That takes the complication out of it. Since everything is an entirely software-based solution, deploying and installing at the edge becomes very easy. We support all kinds of hardware underneath it. We have OEM partnerships. I would say four or five years ago, choosing the hardware and the solution for that was a main decision factor that many organizations used. I've been in this industry for quite a while and have gone through that journey, but nowadays it's more about data. What does the platform give me in terms of data?

Suzy V.: Let's talk a little bit about … Okay. We talked about the file system or rather the distributed aspect of it. What is it that we do with it? Let's get a peek into what is it we do. For us, this red box actually, if you see, is very key. As part of supporting all these different APIs or rather, standard protocols, we then offer you high availability, mirroring aspects, snapshots, data protection and, more importantly, a concept of tiers.

Suzy V.: Since we are a software-based solution, it is easy for us to give you a way to segregate something called as a hot tier, something called as a warm tier, and because of our integration with all the public cloud vendors, we offer you a cold tier as well. These are just generic terms that I'm using. I have talked to several customers. Everyone has their own way of calling it. Some call it high performance density, capacity optimized and archival. Some just call it tier one, tier two, tier three but the fundamental concept is the same.

Suzy V.: We offer you different ways to do that as well. One neat feature MapR has is something called topology. With topologies, nothing bad of [inaudible 00:39:22], where to place the data. You can actually designate certain data to be, say, in rack one, node three, where your node three and rack one could be entirely backed by SSD devices, and you can assign policies where you then place or provision data into a different set of nodes that are entirely backed by HDD. We give a lot of these very subtle, easy features for an end user to distribute their data, store it, and bring it in through different kinds of standard protocols. By the way, we are extending this particular protocol layer in the near future. You will hear a lot more about us supporting other protocols as well.

Suzy V.: All right. We talked about the foundational layer. Now, let's talk about all the other aspects that make it a converged data platform. By offering a foundational package here, we are almost saying that you bring in any data. You bring in all kinds of data. We give you a scalable platform where you can store it. One example is our customers. We have customers today who have surpassed 100 petabytes on MapR. I have worked at several other storage companies and other companies, and this is quite commendable, I would say. That size of installation is only growing as we speak.

Suzy V.: Let's talk about the analytics aspect of it. Now, we have brought the data in. Now what? We offer a plethora of ways for you to use any kind of open-source tool to actually run analytics on the data. Look at the right-hand side of this slide. For those of you on audio only, I'm talking about tools like Spark, Hive, and TensorFlow. These are predominantly used by customers to run on our platform. We have several customers who do that and we have made investments in those. If you keep up with our releases, you will notice how much investment we are making to keep up with this ecosystem.

Suzy V.: You would see a logo here that talks about Kubernetes. That is something that we released about, I would say, five to six weeks ago. We have a full-fledged platform for you to run your applications on containers. That can be managed with Kubernetes. If you or your organization is contemplating containers and Kubernetes, I would highly recommend that you either talk to me or talk to your sales account representative at MapR. We'll be more than glad to educate you on this.

Suzy V.: Since releasing it, we have been very pleasantly surprised to see a lot of customers already starting to deploy it on MapR. We have a strong roadmap coming along in this area as well. I'd be more than glad to have a follow-up if you're interested in it.

Suzy V.: Now, talking about the insights aspect of this, we have built a strong product called the Data Science Refinery. Much like Kubernetes and containers are running on their own trajectory and getting their own install base, DSR, the Data Science Refinery, is one area where we have recently seen a huge uptick. People bring in their data and then just deploy the Data Science Refinery. All of a sudden, you're running machine learning pipelines in all kinds of intelligent ways to get the value out of that data.

Suzy V.: Apache Drill is something MapR has been contributing to for a very long time. This is also one of our other tools that we develop on a day-to-day basis and invest in for you to be able to get some insights out of the data that you brought in.

Suzy V.: One thing I want to leave you with is why customers choose the MapR Data Fabric. By far, this is my favorite slide since I've been here because in just a little nutshell, we leave you with everything that we want to convey. If you look at all the features that we offer for a customer, it truly is a converged data platform. If you are a customer who's looking to just store data, we provide you with full-fledged enterprise storage features.

Suzy V.: To reiterate on the global Data Fabric, we offer you a way to not only view and access them, regardless of where the data is, we also let you deploy us regardless of the disparate environment. We offer you a way to store both structured and unstructured data. You can scale, if you recall, the customer example I gave. We are talking about petabytes of data and of course, since we are keeping up or ahead of the curve in many of the technology trends, if you're looking at multi-tenancy or micro services and those kind of new applications, we offer you a way to do that as well.

Suzy V.: Hopefully, with what I've spoken about so far, you get a pretty fair idea of what it is that we are doing. By no means are these the only details. There are many, many topics that I could talk about for many hours, but I'm going to leave you with this. If you are interested in a follow-up, definitely reach out to me, reach out through our coordinator, or to your sales representative, and I'll be more than glad to do a follow-up.

Suzy V.: Back to you, Howard.

Howard Marks: Thank you, Suzy. To explain … There's too much. Let me sum up. Most data today is trapped in silos. IT spends way too much time, and therefore the organization spends way too much money, moving and copying data from silo to silo. A Data Fabric allows applications to access data the way they want while other applications access that same data in the different way that they want. We can then take that data and manage its location via policy so that files that are more than a year old are migrated to an object store where we don't need to back them up every night. Therefore, we simplify your life as the administrator. This works on site in your data center, in the cloud for your analytics, and between your data center and the cloud as an on-ramp.

Howard Marks: Now, I see we've accumulated an interesting set of questions. The first question I want to address is, "Is MapR XD or the converged data platform a data virtualization/federation product?" A decade ago, we used the term storage virtualization to describe products like FalconStor and IBM's SVC and HDS's VSP/USP, all of which were really block storage virtualization products. They allowed you to add a third-party layer of virtualization and data services on top of your existing SAN storage. Conceptually, MapR does similar things, but there's a big difference between abstracting file and object access and creating a global namespace, and virtualizing block storage within the data center.

Howard Marks: Block storage virtualization had limited success in the market, and therefore, storage vendors tend to avoid calling their solutions virtualization. So, conceptually yes, but not really.

Howard Marks: Okay. Suzy, I've got one for you. That's, "Can we use it in a hot/warm/cold storage idea?" I think we've seen the answer to that already.

Suzy V.: Yes, and I can reiterate it. Yes, absolutely. We do have a concept of storage tiers. We offer storage policies where customers can apply those policies or SLAs to the data and have it move around the tiers automatically. Yeah, call it hot, warm, or cold, or whatever name you choose.

Howard Marks: Right, although it's important for the listeners to remember that this is somewhat different from what would happen in a block storage hybrid system, where data migrates to the performance tier as it's used. You guys are doing it by policy, right?

Suzy V.: Yes. Yeah, so that the end user can choose what his or her policy should be. It is not pre-baked by us.

Howard Marks: Right. The first question, which must have come up before we got into some details, "Does MapR have a play in the cloud?" There's the install-specifically-in-the-cloud play that I talked about, using EBS and S3 to create a single fabric. I think even more interesting is the fact that because MapR creates a global namespace, that namespace can refer not only to multiple data centers in my organization but also to instances of the MapR platform running in a public cloud. That, as I mentioned, makes it practical to run things like periodic data modifications, which require large amounts of compute for small amounts of time, in the cloud.

Howard Marks: Then, to answer the question that just popped up, "Can data automatically move between different storage tiers based on aging?" Yes, that's exactly how the policy engine works. That can mean not only that files should be migrated from Flash to S3 as they reach ages that indicate that their frequency of access is going to be substantially lower, but also that you can write policies that recall data from the storage tier when you know it's going to be needed for processing tomorrow or the next day.
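
[Editor's illustration] The age-based migration and scheduled recall that Howard describes can be sketched in a few lines of Python. This is only a toy model of the idea, not MapR's policy engine or API: the names `FileRecord`, `TieringPolicy`, and `decide`, the tier labels "flash" and "object", and the `needed_on` field are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class FileRecord:
    """Hypothetical metadata for one file in the fabric."""
    path: str
    tier: str                              # "flash" or "object" (assumed tier names)
    last_modified: datetime
    needed_on: Optional[datetime] = None   # when a scheduled job will read it, if known


@dataclass
class TieringPolicy:
    """Hypothetical user-defined policy; nothing here is pre-baked."""
    offload_after: timedelta   # age at which a file should leave the flash tier
    recall_window: timedelta   # how far ahead of a scheduled job to recall data


def decide(record: FileRecord, policy: TieringPolicy, now: datetime) -> str:
    """Return "recall", "offload", or "stay" for a single file."""
    needed_soon = (
        record.needed_on is not None
        and now <= record.needed_on <= now + policy.recall_window
    )
    if needed_soon:
        # Data that a job needs shortly belongs on the fast tier.
        return "recall" if record.tier == "object" else "stay"
    age = now - record.last_modified
    if record.tier == "flash" and age >= policy.offload_after:
        # Old, rarely accessed data moves to the cheaper object tier.
        return "offload"
    return "stay"


if __name__ == "__main__":
    now = datetime(2024, 1, 10)
    policy = TieringPolicy(offload_after=timedelta(days=365),
                           recall_window=timedelta(days=2))
    files = [
        FileRecord("/logs/2022.csv", "flash", datetime(2022, 1, 1)),
        FileRecord("/logs/2021.csv", "object", datetime(2021, 1, 1),
                   needed_on=datetime(2024, 1, 11)),
    ]
    for f in files:
        print(f.path, decide(f, policy, now))
```

The point of the sketch is that movement in both directions is driven by rules the administrator writes (age thresholds and known processing dates), rather than by observed access heat as in a block storage hybrid array.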

Howard Marks: Then, the final question I … Go for it. Go ahead, Suzy.

Suzy V.: Sorry. There was one question where it says, "Can I use MapR for distributed processing on multiple clusters?" I guess the continuation of that is conducting machine learning over multiple stations at the same time. Yes. You can absolutely do that. In fact, that's one of the main reasons why the global namespace is a huge hit among customers such as yourself.

Howard Marks: All right. Folks, thank you for coming. We're coming to the top of the hour and the end of our time.

Suzy V.: Thank you very much from me and thank you for …

David: Yeah. Thank you, Howard. Thank you, Suzy. Thank you, everybody for joining. That is all the time we have for today. For a list of upcoming webinars or other useful information, please visit Thank you again and have a great rest of your day.