SVP of Product Management and Marketing, MapR
Senior TME, MapR
MapR's goal has always been to build the world's best data platform. That's why enterprise-grade capabilities such as security, high-availability, and disaster recovery are built in instead of bolted on. MapR customers have reaped the benefits. The vast majority have moved their workloads to production, many under extremely stringent SLAs. As customer needs evolved, new data-hungry AI and analytics use cases demanded a fresh round of platform innovations. With the 6.1 release, MapR delivers.
MapR 6.1 is a groundbreaking release with features geared toward speeding AI and analytics while lowering total cost of ownership, simplifying the development and deployment of AI and analytics applications, and streamlining security and critical data asset protection.
Join Anoop Dawar, SVP of Product Management and Marketing, and Vadiraj Hosur, Senior TME, for a deep dive into these capabilities and more.
Included in this webinar is a demonstration of how you can tier data in MapR, giving you ultimate flexibility in balancing cost and performance.
Anoop Dawar: I want to start off by saying that MapR 6.1 is scheduled to be released in this calendar quarter of 2018. I hope this short update will help you understand the new capabilities that it offers.
Anoop Dawar: MapR 6.1 builds on a solid foundation of stability and scalability. The AI and analytics space is changing constantly. What stays the same is the need to use the tools of yesterday, today, and tomorrow, on the data of yesterday, today, and tomorrow, without creating tons of silos for each combination thereof.
Anoop Dawar: The MapR data platform is built for this work, so you can rapidly evolve the tools you use and unleash them on the data, regardless of where the data is: central, private or public cloud, edge, or even containerized environments.
Anoop Dawar: What typically starts as a Hadoop journey quickly switches to newer tools like Spark and now embraces artificial intelligence.
Anoop Dawar: Talking about AI, one of the things we have noticed is that 30% of early adopters, those that use AI at scale in their core processes, say that they have seen revenue increases from leveraging AI and are the first to gain market share and expand their products and services.
Anoop Dawar: Furthermore, early AI adopters are 3.5x more likely than others to say that they expect to grow their profit margin by up to 5 points more than industry peers.
Anoop Dawar: As customers look to adopt artificial intelligence, it is important to note the key mistakes that get in the way of success. Among the three key mistakes: adopting a siloed approach; picking a platform that is only able to handle part of the data and then moving the data between an analytics cluster and an artificial intelligence cluster; and only applying artificial intelligence to new use cases. In fact, two thirds of the opportunities to use AI are in improving the performance of existing analytics use cases.
Anoop Dawar: Applying AI to existing use cases also allows you to build close to your expertise, for better odds of success, and to create a strong foundation from which to launch into future AI use cases.
Anoop Dawar: Additionally, there is still an aversion to, an avoidance of, new technology geared toward data scientists that diverges from standard paradigms. For example, containerization can allow data scientists to iterate and enable things really quickly and use the tools of their choice. However, that requires IT to adopt containers as first-class citizens in their infrastructure.
Anoop Dawar: Conversely, there are some key traits of a successful AI strategy. According to McKinsey, the three key enablers are seamless data access, the technical capabilities of the platform, and leadership from the top. Adopters, roughly the 20% of enterprises that are using AI and succeeding with it, are using these key traits to form a successful strategy.
Anoop Dawar: MapR's approach of enabling secure and seamless access to all data, through diverse APIs and an open approach, is what allows data scientists and data analysts to all work on the same data sets, even live production datasets, without having to create separate silos.
Anoop Dawar: However, the big advantage of this approach is that now everyone works on the same common understanding of the data. This is critical as organizations start to rely heavily on data-driven decisions. Having different fragments of data, for instance all of the data with the data scientists but only a sample, or later data, with the BI analysts, may lead to contradictory data-driven insights from data scientists and BI analysts. Reconciling these insights after the fact is an expensive exercise. Avoiding it in the first place is the pragmatic approach.
Anoop Dawar: Therefore, in order to avoid that in the first place, both the data scientist and the data analyst have to operate on the same data set; a way to ensure this is to drive the analytics and AI use cases off the same data. This is traditionally a challenge with HDFS-only analytical systems, because new machine learning and AI libraries often do not work out of the box with HDFS. These new libraries, often built by research scientists in universities and large organizations, are built to work on laptops and other computers, and are therefore POSIX compliant. Later, if popular, they may get ported to HDFS.
Anoop Dawar: With POSIX, HDFS, and now, as part of 6.1, the S3 API, customers are able to use all tools on the same cluster without having to copy samples of data around for data science use cases.
Anoop Dawar: Which brings me to the MapR 6.1 release. In 6.1 there are three key areas of focus. The first is core data service innovations that help speed AI and analytics while lowering the total cost of ownership. The second is simplifying the development and deployment of AI and analytics. And the third is streamlining security and adding critical data asset protection capabilities. Together these innovations are focused on creating a solid data platform for all our customers to run AI and analytics workloads, as well as for our partners to build the tools and applications that our customers need.
Anoop Dawar: So with that, let's get into the details of it.
Anoop Dawar: Let's start with the core innovations to speed AI and analytics and lower the total cost of ownership.
Anoop Dawar: One of the first large services that has been added is the object data service built into the MapR data platform. This object data service means that developers can now simply use MapR for S3-compatible applications. It extends the set of standard interfaces that MapR supports beyond NFS, POSIX, and HDFS to also include the full S3 protocol for data access.
Anoop Dawar: What is unique about MapR's S3 support is the ability to scale to trillions of objects, due to the architecture of the underlying data platform itself. The S3 service also delivers extremely high performance, and it can be geographically located where you need it so latencies are low and predictable. And finally, as data is added or modified in buckets, bucket notifications can be pushed out over MapR streams to applications that subscribe to them. This capability helps with building ETL jobs, including remote ETL jobs, and helps machine learning applications know that a decision needs to be made when a new document has just arrived on the platform.
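The notification-driven pattern described here can be sketched in a few lines of Python. This is a conceptual mock: an in-memory queue stands in for a MapR stream, and the event names and function names are illustrative, not the actual MapR notification API.

```python
import queue

# Stand-in for a MapR stream carrying bucket notifications.
# A real consumer would use the Kafka API against the notification stream.
notifications = queue.Queue()

def on_object_created(bucket, key):
    """Publish a notification event when an object lands in a bucket."""
    notifications.put({"event": "s3:ObjectCreated", "bucket": bucket, "key": key})

def etl_worker(processed):
    """Drain pending notifications and 'process' each new object."""
    while not notifications.empty():
        event = notifications.get()
        processed.append(f"{event['bucket']}/{event['key']}")

on_object_created("checks", "img-001.jpg")
on_object_created("checks", "img-002.jpg")

results = []
etl_worker(results)
print(results)  # ['checks/img-001.jpg', 'checks/img-002.jpg']
```

The same shape applies whether the subscriber is an ETL job or a model-serving process that scores each newly arrived document.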
Anoop Dawar: 6.1 also introduces a new capability called storage tiering. The challenge of storing all the data required by data analysts and data scientists on one platform is the requirement to scale and be resilient. This is something MapR has supported from day one. But as clusters get extremely large, cost becomes a factor. Not all data is equally valuable, nor equally accessed. As a result, it doesn't make sense to store all the data on the same type of storage device or medium. Customers are demanding flexibility to balance performance with storage costs while taking into consideration the frequency and pattern of data access.
Anoop Dawar: So one way to carve out your data sets is to create a frequently accessed data set. This may be three-way replicated, like traditional HDFS, and stored on either high-performance spinning disks or solid-state disks.
Anoop Dawar: Next, you probably have infrequently accessed data that you want to store in a capacity-optimized way, probably on slower disks or even in an erasure-coded form.
Anoop Dawar: And finally, you may have rarely accessed data that still needs to be retained for compliance purposes or occasional analytics and needs to be cost optimized, so you can recall the data when requested while enjoying a lower total cost of ownership.
Anoop Dawar: These last two storage mechanisms, the capacity optimized and the cost optimized, are what we'll cover in a moment. But before we do, a reminder that MapR already has a policy-based mechanism for moving data, and the policy engine is now able to move data between these new tiers as well.
Anoop Dawar: So let's talk about object tiering for rarely accessed data. MapR worked with several large customers to develop what we call object tiering, a capability for keeping rarely accessed data on a lower-cost platform. This feature gives the ability, automatically through a policy-driven interface, to steer data from MapR into any third-party or MapR S3 object store. This could be Amazon's S3 service, any other cloud, or, for that matter, an appliance that supports the S3 protocol.
Anoop Dawar: The key benefits are a massive reduction in the total cost of ownership of your storage, as well as simplification of your data recovery processes, since this feature works alongside existing volume mirroring capabilities and requires no changes. As the primary beneficiaries of the feature, data scientists and analysts will want to access all the data regardless of its current temperature or location.
Anoop Dawar: Data that has been tiered out to S3 is still completely visible and usable, since it is transparently recalled on demand back into the cluster. In this implementation, all the metadata for the data being tiered to the third-party object store is still retained inside the MapR global namespace, and therefore iterating over the data is seamless, as is the recall.
Anoop Dawar: When an application tries to read the data, the data is automatically recalled into the cluster. Recalls can also be scheduled ahead of time so that the speed penalty of the first recall is not noticed by the application.
Anoop Dawar: All the data that is tiered to the third-party object store is completely encrypted in flight and at rest when it's stored in the cloud, to provide complete security. In addition, since the metadata is retained in MapR and not shipped to the cloud, any potential breach of the data in the cloud will not yield the actual data: because the metadata is missing, nobody will be able to make any sense out of the data blocks stored in the cloud.
Anoop Dawar: Now let's talk about the capacity-based tier. In 6.1 we are introducing erasure coding for your volumes. With erasure coding you can reduce your storage overhead by roughly half. In the diagram shown here, it takes about three petabytes to store one petabyte of data when using 3x replication, which is a common practice in HDFS analytical systems. With erasure coding you need only about 1.3 petabytes of storage to store one petabyte of data. This is a dramatic reduction in overhead and associated storage costs.
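The overhead figures can be checked with a quick calculation. The 10+3 layout below is an assumed example chosen because it matches the roughly 1.3x figure; the actual erasure coding geometry is configurable and not specified in this webinar.

```python
def raw_storage_pb(logical_pb, scheme):
    """Raw capacity needed to store `logical_pb` of logical data."""
    if scheme == "3x-replication":
        # Three full copies of every block.
        return logical_pb * 3.0
    if scheme == "erasure-coding":
        # Assumed example: a 10+3 layout stores 13 fragments
        # for every 10 fragments of actual data.
        data, parity = 10, 3
        return logical_pb * (data + parity) / data
    raise ValueError(f"unknown scheme: {scheme}")

print(raw_storage_pb(1.0, "3x-replication"))  # 3.0 PB of raw storage
print(raw_storage_pb(1.0, "erasure-coding"))  # 1.3 PB of raw storage
```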
Anoop Dawar: What is different about MapR's erasure coding approach is that it is still optimized for extremely high-speed ingestion, and it preserves snapshot and compression capabilities. In addition, erasure-coded data remains completely readable and writable. Therefore, for high-speed ingest use cases, data can still be brought in as fast as it is today and then erasure coded in the background.
Anoop Dawar: So, as you have these three tiers, very quickly the question comes up: "How do I know what data to put in which tier? And how do I decide when to move it?" To help with that, MapR has started releasing usage analytics dashboards that allow you to save on costs and optimize infrastructure. They will make it easy for you to manage your storage based on your business requirements and your cost objectives. You'll be able to see all the volumes, how much data is being offloaded and at what speed, what the recall speed is, and which are the top volumes by storage utilization and access.
Anoop Dawar: With this, let's move on to simplified development and deployment of AI and analytics applications. This second set of capabilities is really focused on allowing you to run any and all AI and analytical ecosystems and tool sets.
Anoop Dawar: First off, with the MapR 6.1 release there's a major uplift of the popular compute engines and APIs that will help enable the next generation of use cases. The MapR Data Science Refinery notebook container is updated with Zeppelin 0.8 for better security and ease of deployment. Apache Drill 1.14 is enhanced with broader SQL support, so that it now runs 72 TPC-DS queries out of the box compared to 40-plus in the previous version. It has also gotten tighter integration with the MapR database and utilizes the native secondary indexes built into the MapR database for extremely high-performance analytics on read/write data, as well as significant performance improvements for [inaudible 00:16:54] data.
Anoop Dawar: Apache Spark is also being updated, to version 2.3.1, including Structured Streaming and full integration with the Kafka 1.1 API. The traditional batch analytics toolkit, Apache Hive, is also upgraded, to Apache Hive 2.3 with the underlying Tez engine. The open source Hue platform, version 4.2, is now integrated with Livy 0.5 and has tight integration with Drill, so business analysts can use Hue to issue Drill queries for interactive data exploration.
Anoop Dawar: The Apache Kafka API version is also getting an upgrade, from 0.9 to 1.1, along with all the associated ecosystem components, including [inaudible 00:17:46] Kafka, Kafka REST APIs, Kafka Connect, Kafka SQL, and Kafka Streams.
Anoop Dawar: In addition, there are other innovations to simplify development and deployment. For instance, there are now Python and Node.js language bindings for the MapR database with a very lightweight client architecture. This is really useful for data scientists who need native Python support, as well as anybody else who wants to write applications rapidly against the MapR database.
Anoop Dawar: The MapR database has been shipping with a change data capture capability, which allows you to tap into the changes happening in the database using an event streaming system. This change data capture is now available with JSON, making it extremely easy and intuitive for developers to listen in or tap into these changes, typically in microservices-based architectures.
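Conceptually, a microservice consuming the JSON change stream applies each change record to its own view of the data. This sketch uses an invented record format for illustration; it is not MapR's actual CDC wire format.

```python
import json

def apply_change(replica, change):
    """Apply one change-data-capture record to a downstream replica (a dict).

    The record shape here ({"op", "_id", "doc"}) is illustrative only.
    """
    op = change["op"]
    if op in ("insert", "update"):
        replica[change["_id"]] = change["doc"]
    elif op == "delete":
        replica.pop(change["_id"], None)
    return replica

# A microservice consuming the change stream might see records like these:
stream = [
    json.loads('{"op": "insert", "_id": "u1", "doc": {"name": "Ada", "visits": 1}}'),
    json.loads('{"op": "update", "_id": "u1", "doc": {"name": "Ada", "visits": 2}}'),
    json.loads('{"op": "delete", "_id": "u1"}'),
]

replica = {}
for change in stream[:2]:
    apply_change(replica, change)
print(replica["u1"]["visits"])  # 2

apply_change(replica, stream[2])
print(replica)  # {}
```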
Anoop Dawar: The MapR database is also getting a very powerful database monitoring capability for mission-critical applications. This is because the database has started to be used in extremely high-performance, mission-critical use cases where sub-second monitoring of the database is critical to ensure high SLAs for customers.
Anoop Dawar: The JSON data stored in the MapR database can now also be queried at rapid speed with built-in indexes.
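To see why built-in secondary indexes speed these queries up, compare a full scan with a field-to-ids lookup table. This is a conceptual illustration in plain Python, not the MapR DB implementation; the documents and field names are made up.

```python
# A handful of JSON-style documents keyed by document id.
docs = {
    "d1": {"state": "CA", "amount": 120},
    "d2": {"state": "NY", "amount": 80},
    "d3": {"state": "CA", "amount": 95},
}

def build_index(documents, field):
    """Map each value of `field` to the set of document ids containing it.

    A query on the indexed field then consults this map directly
    instead of scanning every document.
    """
    index = {}
    for doc_id, doc in documents.items():
        index.setdefault(doc[field], set()).add(doc_id)
    return index

state_idx = build_index(docs, "state")
print(sorted(state_idx["CA"]))  # ['d1', 'd3']
```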
Anoop Dawar: Let's dig into some of these features in detail.
Anoop Dawar: Starting with the expanded native language support. The MapR database client architecture has been re-architected to support a broad set of languages. What we have developed is a data access server which opens up the world of languages to MapR DB. The two new languages supported in 6.1 are Node.js and Python, and more languages will be added in the future.
Anoop Dawar: This has also been done in an open fashion, so that the language bindings, as well as the Node.js and Python code bases, will be openly available to all developers. This also allows customers, as well as enthusiasts, to write their own language drivers, mimicking the capabilities of the Node.js and Python code bases. Next on the list are C/C++, C#, and Go language drivers.
Anoop Dawar: Several of our customers are using MapR DB to run extremely important mission critical applications. These apps have critical SLAs and it's essential that the customers get deep visibility into how the database is performing. With 6.1 we made enhancements to the MapR DB to provide granular monitoring capabilities, at the table level and even at the operations level.
Anoop Dawar: This allows the IT teams and the dev-ops teams of our customers to run extremely high SLA applications that are mission critical to their business. Not only analytical applications but also operational analytical applications as well as operational applications that run their businesses.
Anoop Dawar: The third category is streamlining security and critical data asset protection.
Anoop Dawar: In MapR 6.0 we released what we called a single-click security feature. But we didn't go as far as saying it was secure by default, because there were still some components that needed to be locked down. With 6.1 we are now claiming a secure-by-default release. The main benefits are reducing risk and maintaining compliance, as well as simplifying security setup while reducing the chance of misconfiguration.
Anoop Dawar: However, there's also a broader philosophical reason for this. As analysts and data scientists all operate on the same cluster, the amount of data and the criticality and the [inaudible 00:22:38] of the data in the cluster grow. It all of a sudden becomes a honeypot for hackers. Instead of a hacker trying to attack multiple siloed databases to get access to critical information, now they will look to access a single cluster and get all the information.
Anoop Dawar: In order to prevent this from happening, it is crucial that clusters are run securely. MapR wants to change the default behavior of MapR, as well as of its customers, to start with security in mind first, and therefore secure by default in 6.1 will install clusters with security on by default. A customer can choose to turn the security system off; however, we highly recommend against it.
Anoop Dawar: This capability also enforces authentication of all connections within the cluster, and it enforces encryption of the data on the wire. It applies to the full ecosystem of components that MapR ships, and it uses MapR Security instead of Kerberos, so that it's easier and simpler to configure and manage.
Anoop Dawar: We are also introducing native data-at-rest encryption with this release. MapR already supports multiple ways of encrypting data at rest, from self-encrypting drives to Linux crypto drivers and third-party partners. However, many of our customers asked MapR to build native at-rest encryption capabilities.
Anoop Dawar: With the 6.1 release, you can easily and automatically encrypt all the data in an entire volume, at the volume level. This level of encryption is enabled when secure by default is turned on and you select volumes to be encrypted. MapR will automatically handle the creation and rotation of encryption keys to protect against breaches, and therefore also eliminates the need for third-party tools or services for key management. The whole idea here is to simplify the setup of secure volumes and increase data protection.
Anoop Dawar: In the spirit of making things more and more secure, with the MapR 6.1 release we are also adding an NFSv4 data service, which enables us to address the security shortcomings typically found in the NFSv3 standard. NFSv3 has been supported for quite some time, has enjoyed broad customer adoption, and will probably continue to do so. But NFSv4 provides a richer security mechanism. So now we support real-time read/write access to MapR data over the latest version of the NFS protocol.
Anoop Dawar: I want to talk about one more feature, which wasn't originally shipped in 6.1 but actually shipped in 6.0, and is critical enough because it forms the foundation of the next-generation data platform. This capability is to publish audit events onto MapR streams via the Kafka API, so that downstream subsystems can subscribe to those events and act upon them.
Anoop Dawar: Customers can choose to turn on auditing and then selectively pick the events they want enabled, and all of the audit events are then published into MapR streams via the Kafka API. The advantage of doing this in a standard format is that it allows any and all third-party systems to listen in to what's happening in the cluster in a standard, robust, reliable, and scalable way.
Anoop Dawar: Some use cases that feed off this are the ability to create reports on new data access in your cluster, and being able to monitor the audit events in your cluster and build an anomaly detection system that flags anomalous data access or behavior. For instance, authentication failures could be tracked here, and an abnormal number or type of authentication failures could be highlighted. Access patterns of different data scientists and data analysts could be observed and modeled, and any statistically significant deviation from that model could be highlighted. ETL processes could be built more robustly using these event streams as triggers. And a wide array of security, as well as analytics, use cases could be turned on.
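One of the use cases just mentioned, highlighting an abnormal number of authentication failures, can be sketched with simple statistics over audit-event counts. The threshold and the data are illustrative, and the input would in practice come from aggregating the audit stream.

```python
from statistics import mean, stdev

def flag_anomalies(daily_failures, threshold=2.0):
    """Return indices of days whose auth-failure count deviates from the
    mean by more than `threshold` sample standard deviations."""
    mu, sigma = mean(daily_failures), stdev(daily_failures)
    return [i for i, n in enumerate(daily_failures)
            if sigma > 0 and abs(n - mu) / sigma > threshold]

# Six ordinary days, then one day with a burst of failures.
failures = [4, 5, 3, 6, 4, 5, 97]
print(flag_anomalies(failures))  # flags the last day (index 6)
```

A production system would use a rolling window per user or per host rather than a single global series, but the shape of the check is the same.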
Anoop Dawar: To bring it all together, all these features together, consider a hypothetical application. Many of you are probably used to depositing your checks through an online mobile app. Let's think of that kind of application in the context of these capabilities.
Anoop Dawar: Imagine multiple clusters spread all over the world for rapid access and fast response times. Each of these clusters could be running a database and an S3 store, as well as the event APIs. When somebody takes a photo of their check on a mobile phone, that photo is immediately submitted back to the cluster and stored as an S3 object. As soon as that S3 object is stored, you could imagine triggering an ETL job that examines the image for fraud, as well as for accuracy and completeness of information.
Anoop Dawar: That's the activity happening on the edge clusters, on the left-hand side. MapR's mirroring capabilities, as well as its data migration capabilities, allow you to instantly, or on a periodic basis, ship data from the edge clusters up into a centralized data cluster. Once the data is in this cluster, you can immediately use tools like Hive, Spark, and Apache Drill to run batch, interactive, or streaming real-time analytics on it. This would allow you to monitor the number of check images coming in, per minute, per second, per day, per day of the week, and create trending statistics for real-time analyses.
Anoop Dawar: Machine learning applications could be used for image processing, detection, and validation, as well as fraud prevention, and classic batch analytics could be used for monitoring business growth.
Anoop Dawar: Having said that, since there is an object notification for every object deposited into MapR's S3 server, each notification could quickly go to a Kafka event store and then be published through Spark for real-time streaming analytics, as well as aggregation. Aggregated data, per day, per week, per month, could then be recorded as output into the JSON database.
Anoop Dawar: And so this gives you a way of running this application, right from operations to analytics to machine learning, using a single architecture on MapR. Over time, it is conceivable that the data stored in the S3 object store gets extremely large and needs to be tiered to a low-cost platform. When that happens, a simple policy will allow object tiering from the S3 system, or any other data in MapR, to a cold data store.
Anoop Dawar: With this, I'm going to hand over the rest of the presentation to Vadiraj who's going to show a quick demo of the object tiering capabilities that are part of MapR 6.1.
Anoop Dawar: Vadiraj, it's all yours.
Vadiraj Hosur: Thank you, Anoop.
Vadiraj Hosur: So, Anoop has already covered the object storage tiering capability that MapR brings with 6.1. What we're going to do for this demo is demonstrate the tiering capability from the performance tier to the cost-optimized cold tier.
Vadiraj Hosur: For our use case, we're going to look at a scenario where an application generates surveillance videos at a rate of about 7TB per day, and six months' worth of that data is to be kept active on the local cluster. That adds up to about 1.2PB. One of the needs for this use case is the ability to archive to the cloud and the ability to recall volumes, or individual videos, back for analytics purposes.
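The sizing here is simple arithmetic, which a couple of lines confirm (using decimal units, 1 PB = 1000 TB):

```python
TB_PER_DAY = 7
RETENTION_DAYS = 180  # six months of active data

active_tb = TB_PER_DAY * RETENTION_DAYS
active_pb = active_tb / 1000  # decimal units: 1 PB = 1000 TB
print(f"{active_tb} TB = {active_pb} PB")  # 1260 TB = 1.26 PB
```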
Vadiraj Hosur: So having said that, let's look at how this can be done with MapR 6.1.
Vadiraj Hosur: To do that we log into the MapR Control System. So let's start the demo and share the screen.
Vadiraj Hosur: Alright, so we're going to log into the MapR Control System.
Vadiraj Hosur: And once you are here, all of the tiering capability can be achieved through this page which is for creating volumes. I'm going to go ahead and create a hot tier.
Vadiraj Hosur: One thing to note here is with MapR 6.1 you have the ability to also create mirrored volumes and apply all of the tiering capability on the mirrored volume. But for this demo I'm going to use the standard volume and enable tiering on that.
Vadiraj Hosur: Specify a mount path; this is the mount path that an application, in our case the video-generating application, would use to generate all the videos on the MapR cluster. And we have the ability to specify your replication factors and the encryption capabilities that are also coming with 6.1; this is encryption at rest and on the wire. For this demo, I've just kept the defaults.
Vadiraj Hosur: Another thing to note here is how you can use both warm tiering and cold tiering from this single page. As a first step, we're going to specify a remote target; since we're doing cold tiering, this would be one of your cloud providers, like AWS, Google, or Azure. You can browse through existing targets or create a new one.
Vadiraj Hosur: So I'm going to create a new one and give it a name. One thing to note here is the lookup topology; like we mentioned with security, the lookup topology is the topology where we're going to keep all of the metadata related to the objects that are going to be on the third-party target. In this case we picked AWS, and these are all standard S3 parameters that you would specify.
Vadiraj Hosur: Yeah, and then just go ahead and create the target.
Vadiraj Hosur: As the next step, we're going to create a storage policy. You can use an existing policy or create a new one. For our use case, we mentioned that we want to keep up to 180 days of active data, with anything older than that offloaded to the cloud, in this case AWS. So we're going to create a policy for that.
Vadiraj Hosur: You could alternatively use a different policy, for example one based on the size of the file, or on the user or group of users creating the files. Let's use the modification time and create a policy based on that.
Vadiraj Hosur: The retention duration is an interesting one here. Let's say you have to recall your videos from the cloud, for analytics or just to look at them. You could have made modifications, or uploaded a different video based on them, or you could have made no changes at all. After the retention duration has passed, depending on whether the file was modified or not, we either offload it again, based on the schedule, or we get rid of the local copy altogether so that it frees up space for more relevant, current data.
Vadiraj Hosur: The schedule is the schedule at which the automatic storage tiering gateway works, and it tells how often you want to offload to the cloud. Depending on the nature of your data, you could offload critical data more frequently, and so on in decreasing order.
Vadiraj Hosur: There is another option, the automatic tiering scheduler in MapR, and what that does is give MapR the option to decide the best time to offload to the cloud, based on the current load on the system and other factors. So I'm going to pick that.
Vadiraj Hosur: At this point we have specified everything that we need: the schedule, the policy, and the nature of the tiering. And that's it; that's all that's required to set up object tiering in MapR.
Vadiraj Hosur: From this point on, I'm just going to upload some data to mimic the application that generates the videos, and then do two kinds of recalls: one an implicit recall, and the other an explicit recall. So let's go ahead and load some data.
Vadiraj Hosur: So the application would get a path to this volume, and I'm going to copy over some sample video files. You can see that all of these dates are current, and the storage policy we specified was to move files that are older than 180 days. Before that, we're going to take a look, with the maprcli command, at how much storage is on the hot tier, that is, on the current volume. So let me look for the tier-local usage. It's about 102MB; right now everything's in the hot tier.
Vadiraj Hosur: In order to mimic the files being older, I'm going to change the timestamp on two of these files to make them older than 180 days; I just picked January 1st of this year. So now we can see that two of those files are eligible for offloading.
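The timestamp trick used in the demo can be reproduced generically. This sketch backdates a file and checks offload eligibility by modification time; the 180-day cutoff mirrors the demo's policy, but the code is an illustration, not MapR's actual mechanism.

```python
import os
import tempfile
import time

DAY = 86400
CUTOFF_DAYS = 180  # mirrors the demo's 180-day offload policy

def backdate(path, days):
    """Set a file's access and modification times `days` days into the past."""
    past = time.time() - days * DAY
    os.utime(path, (past, past))

def eligible_for_offload(path, cutoff_days=CUTOFF_DAYS):
    """A file qualifies for offload when its mtime is older than the cutoff."""
    return (time.time() - os.path.getmtime(path)) > cutoff_days * DAY

with tempfile.NamedTemporaryFile(delete=False) as f:
    sample = f.name

print(eligible_for_offload(sample))  # False: just created
backdate(sample, 200)
print(eligible_for_offload(sample))  # True: now "older" than 180 days
os.unlink(sample)
```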
Vadiraj Hosur: Now, when I offload this volume, those are the two files that should be moved to AWS. So you pick the volume, you pick the offload action, and then, looking at the tier volumes, it shows that the offload is already running.
Vadiraj Hosur: Now let's quickly open the AWS console and see what happened. I chose to create a new bucket; you could also have used an existing bucket, assuming you have the right privileges on it. You can see that MapR immediately starts to write files to that bucket. In this case the files were in the 55MB and 35MB range, so it was fairly quick.
Vadiraj Hosur: Now you can see that after we offloaded these files, we made more room on the hot tier, and the usage there is only about 17MB. There is also a Hadoop way of looking at the status of these individual files. You can see that you can do that with ... see the status on these files; these files should be remote now, so it says the file does not have local data, although all of the metadata for the file is still local. The same goes for the second file. All the remaining files should still be local because they were not eligible for offload.
Vadiraj Hosur: So at this point I have this cluster, and AWS, mounted from my Mac, and I'm going to try to access the object that is in the cloud through my AWS mount on the Mac. These are some of the surveillance videos.
Anoop Dawar: So Vadiraj, at this point some of these files are actually tiered, but because the metadata is in MapR, the Mac Finder view looks and feels as if there's no difference.
Vadiraj Hosur: Exactly.
Anoop Dawar: That's why you can actually see the name of the file, you can even see the image thumbnail pop up with the file.
Vadiraj Hosur: Absolutely.
Vadiraj Hosur: Now what we're going to do is an implicit recall, which means that my client is requesting a cold file. What MapR does, once it gets the request, is start to download the file from AWS and keep it on the hot volume, and then the client that is trying to read it gets the data. So you can see that it's fairly [inaudible 00:42:52].
Anoop Dawar: So essentially what happened here is the file was remote, the customer clicked on a file stub not knowing whether it was a stub or the full data, and in the background MapR transparently auto-recalled the file.
Vadiraj Hosur: Exactly. So this could be used for bill collection and things like that.
Vadiraj Hosur: Now, what we'll do next is an explicit recall. Let's say there is a need for us to run analytics on the entire volume. In that case, you go back to the same console and select the recall action, and this will recall the entire volume. This is the equivalent of explicitly downloading the files from AWS onto the hot volume.
Vadiraj Hosur: It's already running and should get both of the files back to the local tier. There we go. And we did it at a speed of about 36 MB per second. You get all of these details here on this console. Now let's go back, and we should see that our usage on the hot volume is back to about 102 MB, which is what it was before we offloaded the data. And you can see that the local usage is 102 MB, both of these files are now local, and everything is basically local at this point.
Vadiraj Hosur: So, yeah, that was the demo. Essentially, what we saw was MapR's object tiering capability: security built in, all of the metadata staying local with only the data objects in the cloud, very simple offload rules and schedules that you can set up on a single page, and both implicit and explicit recalls with the ability to specify aging on recalled files. And with this you can scale endlessly; there is no limit on how much data you can store.
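The lifecycle shown in the demo can be sketched in a few lines. This is an illustration only, not a MapR API: the `TieredVolume` class and its methods are hypothetical stand-ins for the offload and recall behavior just described, where data moves to a cold tier but metadata stays local and a read transparently recalls the file.

```python
# Minimal sketch of the demo's tiering lifecycle (hypothetical names,
# not MapR APIs): offload moves file *data* to a cold tier while the
# metadata stays local, and reading an offloaded file triggers an
# implicit recall back to the hot tier.

class TieredVolume:
    def __init__(self):
        self.metadata = {}   # always local: name -> size
        self.local = {}      # hot-tier data
        self.cold = {}       # simulated object store (e.g., an S3 bucket)

    def write(self, name, data):
        self.metadata[name] = len(data)
        self.local[name] = data

    def offload(self, name):
        """Explicit offload: data moves to the cold tier, metadata stays."""
        self.cold[name] = self.local.pop(name)

    def read(self, name):
        """Implicit recall: reading offloaded data pulls it back first."""
        if name not in self.local:          # file is remote
            self.local[name] = self.cold.pop(name)
        return self.local[name]

vol = TieredVolume()
vol.write("cam1.mp4", b"x" * 55)
vol.offload("cam1.mp4")
assert "cam1.mp4" in vol.metadata and "cam1.mp4" not in vol.local
data = vol.read("cam1.mp4")                 # transparent auto-recall
assert "cam1.mp4" in vol.local and len(data) == 55
```

The key property the demo illustrates is the last one: the reader never has to know whether the file was local or tiered.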
Vadiraj Hosur: Back to you Anoop.
Anoop Dawar: Thanks, thanks Vadiraj. So let's summarize the release. After MapR 6.1, here's how the world looks for customers using the MapR platform.
Anoop Dawar: You now have autonomous and intelligent data placement between performance-optimized, capacity-optimized, and cost-optimized tiers. You have long-term data retention capabilities and lower TCO, both because of object tiering to cost-optimized external tiers and because of the built-in erasure coding capability. MapR's erasure coding is unique in that it's optimized for high ingest, unlike classic erasure coding, which would make ingest extremely slow.
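The TCO benefit of erasure coding comes down to storage overhead. A back-of-the-envelope comparison, assuming a 4+2 erasure-coding layout purely for illustration (the exact scheme a deployment uses may differ):

```python
# Raw bytes stored per logical byte: 3x replication vs a 4+2
# erasure-coding scheme (4 data fragments + 2 parity fragments).
# The 4+2 layout is an assumption for illustration only.

def replication_overhead(copies):
    return float(copies)

def ec_overhead(data_frags, parity_frags):
    return (data_frags + parity_frags) / data_frags

rep = replication_overhead(3)   # 3.0x raw storage
ec = ec_overhead(4, 2)          # 1.5x raw storage
print(f"3x replication: {rep:.1f}x, EC 4+2: {ec:.1f}x")
# For 1 PB of logical data: 3 PB of raw capacity vs 1.5 PB.
```

Halving the raw capacity needed for cold data is where much of the "lower TCO" claim comes from; the ingest-speed trade-off is what MapR's implementation targets.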
Anoop Dawar: 6.1 is secure by default and automatically enables security upon installation, including wire-level encryption. You can also turn on disk-level, or data-at-rest, encryption so that sensitive data is encrypted once written to disk, which typically helps customers meet regulatory requirements. Cloud application and tool portability is now possible with the S3 API, which allows you to write next-generation applications and tools. And if you think about the key data APIs now dominating the AI and analytics world, you end up with three: the HDFS API for classic analytics, the POSIX API for next-generation ML/AI tools, and the S3 API, which is also becoming a very strong contender for AI and analytics workloads. With MapR you now have all three in a standard way, can run across any and all clouds, and are able to migrate from one to the other as needed.
Anoop Dawar: You also get simplified deployment and development of real-time applications, plus a complete upgrade of the ecosystem toolkits to the latest versions, including Apache Spark, the Kafka API, Apache Drill, and Apache Hive.
Anoop Dawar: With that, I'm going to pause and start taking questions. David?
David: Yeah, so just a reminder, we're going to be doing some questions here. Submit your question through the chat window in the bottom left corner of your browser. We already have a number of questions here, so I'll get started, and maybe Anoop and Vadiraj, you can read through some of the questions that've come in.
David: Let's start with "Why would I use object tiering to a third-party store if MapR now provides an S3 interface?"
Anoop Dawar: So that's a good question, right? If MapR already has an S3 interface and has erasure coding, why would you ever want to use a third-party object store? The answer is twofold.
Anoop Dawar: One is, when operating in the cloud, oftentimes you might have decided on the cloud object store as your default standard, so you want to be able to use that object store for storing persistent data.
Anoop Dawar: The second is that the total cost of ownership of cloud-based storage is going to continue to become more and more cost-effective, as the cloud vendors invest a lot of time, energy, and resources into reducing the cents per gigabyte stored. So customers will want to take advantage of that cost going down.
Anoop Dawar: And therefore, in those cases, using object tiering is an amazingly effective way of writing to the cloud without locking yourself into that particular cloud's implementation.
Anoop Dawar: The second question is: does Apache Drill 1.14 support SRM-type sign? I know this was a popular question in the past. I'm happy to report that it's happening in Drill 1.14.
Anoop Dawar: The next question is the reverse of the first one. The first question was why I would use object tiering if I have an S3 interface on MapR; this one is why I would use MapR's S3 if I have a cloud object store.
Anoop Dawar: Well, I think the answers are sort of interrelated. Using MapR's S3 allows you to have independence in a multi-cloud environment. It also allows you to have the same deployment on-prem as well as in the cloud, while still using an S3-style object store.
Anoop Dawar: Third, we have found that, depending on the analytical workloads you're running and the amount of data you're accessing, especially when MapR is running in the cloud, if the data is brought into MapR first and then object-tiered to the cloud object store, you may have a lower total cost of ownership than natively using the cloud's analytics tools and cloud object store. This happens because we can aggressively cache the hot data in the MapR layer, store the rest in the object tier, and reduce the compute hours you might have to use for analytical systems.
Anoop Dawar: The next question is: is the archived data tier counted as part of the MapR license? How does the license cost work in this case? When you fetch the data back, does it count against your storage license?
Anoop Dawar: This is a great question. First of all, the data that's tiered out of MapR is not priced under the standard MapR license. So for instance, if you had a PB of data in MapR and then you moved half a PB off into a cloud tier, now you only have half a PB of data in MapR, right? Maybe because of that you reduce the number of nodes, so when you come up for your subscription renewal, you don't have to pay MapR for the full PB; you only pay for what's stored in MapR, plus a small fee for the data tiered to the object store, and that fee is significantly smaller, an order of magnitude smaller than for the data in MapR.
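The arithmetic behind that answer is straightforward. In this sketch both the per-PB rate and the 10x discount for tiered data are hypothetical numbers chosen purely to illustrate the "order of magnitude smaller fee" point, not actual MapR pricing:

```python
# Illustrative license math: full rate applies only to data resident in
# MapR; tiered-out data pays a much smaller fee (assumed 10x cheaper
# here for illustration; not a real MapR price).

def license_cost(local_pb, tiered_pb, rate_per_pb, tier_discount=0.1):
    return local_pb * rate_per_pb + tiered_pb * rate_per_pb * tier_discount

rate = 100_000  # hypothetical $/PB/year
before = license_cost(1.0, 0.0, rate)   # 1 PB, all in MapR
after = license_cost(0.5, 0.5, rate)    # half a PB tiered to the object store
print(before, after)                    # 100000.0 55000.0
```

So in this toy example, tiering half the data cuts the bill from 100% to 55% of the original, because the tiered half is billed at the smaller fee.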
Vadiraj Hosur: I can take the next question, Anoop. The question is: is the object tiering really active-active replication? It seems like there would be a delay for large datasets stored on S3.
Vadiraj Hosur: About the active-active replication: replication applies only to the MapR cluster. When you create the volume, you can select whatever replication policy you choose, and that's applied to the volume. We do not control anything that happens in the cloud. So that-
Anoop Dawar: So to go beyond that, right? I think the question behind the question is: is it active-active between the volume inside MapR and the object store? The answer is no, it's not active-active; you schedule it based on a policy. Your policy may be based on the size of the data set, or on how long the data has been in MapR or since it was modified in MapR, and things like that. When that policy triggers and identifies an object to be tiered, that's when the tiering happens, and of course it happens in the background, so what's in your S3 will typically be behind what's in your MapR cluster, and therefore it's not active-active.
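A policy-driven offload of this kind can be sketched as a simple background selection pass. The age-based rule shape below is an assumption for illustration; real tiering rules can also key on size or other attributes, as mentioned above:

```python
# Sketch of a policy-driven (not active-active) offload pass: a
# background job selects files whose age exceeds the rule's threshold
# and tiers only those. The rule shape is illustrative, not a MapR API.

def select_for_offload(files, min_age_days):
    """files: {name: age_in_days}; returns names eligible to tier out."""
    return sorted(name for name, age in files.items() if age >= min_age_days)

files = {"logs-2017.tgz": 400, "report.pdf": 90, "scratch.tmp": 2}
eligible = select_for_offload(files, min_age_days=180)
print(eligible)   # only the file older than the threshold is tiered
```

Because only the selected files move, and only when the schedule fires, the object store naturally lags the cluster; that lag is exactly why this is batch tiering rather than active-active replication.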
Anoop Dawar: Chris Crawford's question: with MapR security instead of Kerberos, is the MapR security ticketing system going away? No, MapR's ticketing system is built into the MapR native fabric and is always there. Even when interacting with Kerberos, MapR can take Kerberos tickets and adapt them internally to its own ticket representation for internal re-certification.
Anoop Dawar: What are the data-at-rest [inaudible 00:53:41] offered? Is the volume-level encryption type pluggable? It is not pluggable at this time; it will be AES-256 encryption by default, I think. We may look at that in the future, depending on customer requests.
Anoop Dawar: Hasir has a question: "Let's say your replication factor is 2x for a mirrored volume. Can you configure a cluster in a way that one copy will be in cold storage, just to reduce the cost?"
Anoop Dawar: So that's a great question. I would answer it this way: it sounds like what you want is one copy in the mirror and one copy in the object tier, to reduce the cost. If that's what you want, you can certainly achieve it, but you won't do it by setting the replication factor on the mirror to 2x and hoping that one copy makes it to the object tier. You would do it by creating the mirror with a replication factor of 1x and then creating an object tiering policy on the mirror to take that data and tier it to the object store. So you can certainly achieve what you're looking for, but not necessarily the way you're thinking about it.
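The cost intuition behind that answer can be put in numbers. Both per-TB prices below are hypothetical, chosen only to show why one on-disk copy plus one cold-storage copy is cheaper than two on-disk copies:

```python
# Rough cost comparison for the mirrored-volume question: a 2x-replicated
# mirror keeps both copies on cluster disks, while a 1x mirror plus an
# object-tier policy keeps the second copy on cheaper cold storage.
# Prices are hypothetical, for illustration only.

def cost_2x_mirror(tb, disk_price):
    return 2 * tb * disk_price

def cost_1x_plus_tier(tb, disk_price, cold_price):
    return tb * disk_price + tb * cold_price

tb, disk, cold = 100, 20.0, 4.0   # TB, $/TB/month (illustrative)
print(cost_2x_mirror(tb, disk))           # 4000.0
print(cost_1x_plus_tier(tb, disk, cold))  # 2400.0
```

As long as the cold tier is cheaper per TB than cluster disk, the 1x-mirror-plus-tier layout wins, which is why the recommended configuration uses a tiering policy rather than a second replica.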
Anoop Dawar: Does data tiering to Amazon also work with MapR binary tables?
Anoop Dawar: That's a great question. The data tiering feature today only works on files; it does not work for the Kafka event store or MapR Database. Now, having said that, MapR Database supports both active-active and active-passive table replication, and the event store for Kafka also supports global streams, with the capability to push the stream out to the other end.
Anoop Dawar: We are looking at use cases where you might want to make object store copies of MapR Database. We need to understand what kind of policy you would use for a table, because with files it's easy: if the file is old, you can move it. Aging a table row is very different, because some rows in a table might be new and some rows may be old, and so on.
Anoop Dawar: I'm going to skip some questions and read one more in the interest of time, and then we can always take the remaining questions and answer them in a blog at a later time.
Anoop Dawar: Encryption at rest is available across the platform, on drives as well as in the cloud; that was another question.
Anoop Dawar: Then there were two or three questions related to monitoring, so I'll take them together. MapR Database monitoring is available in 6.1, and for MapR Event Store with Kafka there is also monitoring available today; we are continuing to enhance it and add more detailed monitoring as use cases evolve. We are also looking at, in the future, more configuration ease of use for our core and ecosystem components. One of the biggest challenges with configuration management in the ecosystem was setting up security, and we've taken that burden away by making security the default. However, we are also looking at radically new ways of changing how configuration is done for the ecosystem components.
Anoop Dawar: With that I think I will give it back to David.
David: Thank you Anoop, thank you Vadiraj, and thank you everyone for joining us; that is all the time we have for today. We recorded this session, and we will send out a link to the recording. And if we were not able to get to your question, we will follow up with you directly and try to address it.
David: Thank you again, and for more information on this topic and others, please visit mapr.com/resources. Have a great rest of your day.