The 6.1 release for the MapR Data Platform is coming out over the next few months. Being a point release, you might think that this release is minor, but the implications for users of MapR are immense. Indeed, some of the technology that landed in the product with the 6.0 release is only now being revealed for use. There are new capabilities in this release, and there are substantial extensions of previous capabilities. But the real story is not what our platform can do, but what you can do with it. We are working hard to provide the best possible platform for storing, moving, and analyzing data. We are working with our customers and partners to build the future of data.
With version 6.1, MapR becomes the premier platform for the full range of tasks needed for machine learning/artificial intelligence as well as the best data platform to augment Kubernetes. If you are adopting either of these hot systems, you need to look at MapR or risk being leap-frogged by your competition. Our 6.1 release makes it easier to access data stored in MapR by providing NFSv4 support and allowing access via an S3-compatible API (in addition to existing support for POSIX, NFSv3, and HDFS). This means that even more existing programs can access data stored in MapR. To decrease the cost of storage, especially at scale, that data can now be transparently aged out and stored using erasure coding or in external object stores, all without changing any path names. That way, your programs don't have to change.
The new 6.1 release also takes some important steps that make it easier to build secure data systems. It now offers encryption at rest, both for data stored in MapR and for data that is in tiered storage. It is now possible to install a system that is "secure by default" from the very beginning, with a data platform and a wide ecosystem of open source software pre-configured to allow data to be protected at every step.
Let's walk through some of what you can do better in 6.1 and why in a bit more detail.
It is a big wakeup call for anybody who is just getting into machine learning to find out that much of the effort you put into these systems has nothing much to do with machine learning and everything to do with extracting features, figuring out what you knew and when, moving data here, and moving data there. Even once you have a model, getting it reliably into production is a major challenge. The news here is that the MapR Data Platform, more than ever, cuts through this mess and makes it all so much easier. Having a single platform that can serve as a system of record, supports all kinds of tools for feature extraction, and supports training and deployment of models is a big deal – and it even supports models in production. It simply isn't good enough to have different systems for different parts of a large-scale model development data pipeline. Copying data from system to system is error-prone and simply a waste of precious time that is right on your critical path.
Underneath each of these steps, you need a solid platform, and you need a platform that works across the entire value chain from archiving to analytics to live operations. To take just one example of how MapR helps with machine learning, large-scale learning often requires that training data be extracted from very large historical records. The tools that are best at this are typically systems like Spark that read data from S3 or HDFS. Training a model, however, requires the use of special-purpose machine learning software that is happiest reading data using standard file access methods and often needs to run on specialized GPU machines. With MapR, Spark and GPU programs run in the same cluster and can access the same data. You can deal with the scale of the raw data and the raw speed of the GPUs in the same system.
MapR users are taking advantage of MapR, combined with the power of machine learning with GPUs, right now to build new kinds of intelligent applications. They are building autonomous vehicles; they are controlling amazing machines; they are stopping fraud; and they are finding new business. You can, too.
Many machine learning systems run best in containers managed by Kubernetes. In fact, Kubernetes is a candidate for the software having the fastest widespread enterprise adoption in quite some time. Kubernetes, however, only manages the compute side of the problem. That computation is done by programs running in containers, and containers work best when they aren't bogged down by tons of data. What you have is something like in the following figure, where applications managed by Kubernetes need to use persistent data to communicate.
Kubernetes can orchestrate the execution of applications (1, 2, and 3, here), but these applications need to have state stored in the form of files, streams, and tables that are outside these applications.
That leaves you with a quandary in real-world systems. Where's that data going to live? How can you manage it? The containers forming the applications can be managed just fine, but who's going to manage the data?
Well, the answer is actually incredibly simple. We build a data platform that can manage all kinds of data. You get the sophisticated multi-tenancy and performance that you need, and using a consistent data platform lets you build a data fabric that spans from edge to cloud to other cloud to on-premises cluster, any way your business requires. What you need is a system that looks like this:
The state in the previous diagram should be in a data platform that applications can access, using a variety of methods.
That is, Kubernetes should manage your containers, and MapR should manage your data.
That's a great vision, and with the 6.1 release, you can now make it real. In fact, you can run existing Kubernetes-based applications transparently on MapR without changing the container images that make up the applications. Check out the MapR Data Fabric for Kubernetes.
One of the key reasons that people adopt MapR is that it provides incredibly high performance. We recently were able to demonstrate data access on a small GPU cluster at 18 GB/s. But what happens when data gets older, and you don't need absolutely maximum performance? With the 6.1 release, data in a MapR cluster that doesn't need maximum performance can be stored using erasure coding. This allows the data protection provided by the normal triplication of data to be achieved with about half the storage. The cost of storage can be cut even more by moving cold data entirely out of the cluster into an object storage system. Such systems are optimized for low cost, so moving data there decreases cost. This process of moving data to less expensive, lower-performance alternative storage is known as object tiering. Performance of these lower-cost tiers can be much lower than that of a MapR system, but reading data that has been moved to a low-cost tier causes it to be recalled, which allows high-performance access again.
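The "about half the storage" figure can be checked with a little arithmetic. The sketch below compares raw storage per byte of user data for three full copies against a k+m erasure code. The 4+2 and 10+2 layouts are illustrative assumptions chosen for the example, not a statement of which schemes the 6.1 release supports, and the function names are hypothetical.

```python
from fractions import Fraction


def replication_footprint(copies: int) -> Fraction:
    """Raw bytes stored per byte of user data when keeping N full copies."""
    return Fraction(copies)


def erasure_footprint(data_fragments: int, parity_fragments: int) -> Fraction:
    """Raw bytes stored per byte of user data with a k+m erasure code."""
    return Fraction(data_fragments + parity_fragments, data_fragments)


triple = replication_footprint(3)  # triplication survives the loss of 2 copies

# Assumed example layouts: (data fragments, parity fragments).
for k, m in [(4, 2), (10, 2)]:
    ec = erasure_footprint(k, m)
    print(
        f"{k}+{m}: {float(ec):.2f}x raw storage, "
        f"{float(ec / triple):.0%} of triplication, "
        f"survives the loss of {m} fragments"
    )
```

Under these assumptions, a 4+2 layout stores 1.5 bytes per byte of user data, which is half of what triplication needs, while still surviving the loss of any two fragments.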
Note that we are talking about completely transparent tiering here. Users and applications accessing files will not have to take any special action to take advantage of tiering. Files, tables, and streams can even be tiered in whole or in part, and they remain fully read/write after tiering. Amazingly, this holds even when the data has been tiered to a read-only object store. This is a massive difference from, say, erasure coding in HDFS, where the only way to convert to erasure coding is to explicitly copy a file into a directory with erasure coding enabled, and where even the limited mutation operations supported by HDFS are lost when erasure coding is used. The way that the MapR platform separates the decisions about detailed storage for data from the use of the data by applications is an example of the separation of concerns that is so vital to team efficiency.
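To make the idea of transparency concrete, here is a toy model of one namespace sitting in front of two tiers. It is only a sketch of the behavior described above (paths never change, reads recall cold data, writes still work after tiering) and says nothing about how MapR implements it. The class `TieredStore`, its methods, and the example path are all hypothetical.

```python
class TieredStore:
    """Toy model: a single namespace with a hot tier and a cold tier behind it."""

    def __init__(self) -> None:
        self.hot: dict[str, bytes] = {}
        self.cold: dict[str, bytes] = {}

    def write(self, path: str, data: bytes) -> None:
        # Writes always land in the hot tier; any stale cold copy is dropped.
        self.cold.pop(path, None)
        self.hot[path] = data

    def offload(self, path: str) -> None:
        # A storage policy, not the application, decides when data goes cold.
        self.cold[path] = self.hot.pop(path)

    def read(self, path: str) -> bytes:
        # Reading cold data recalls it to the hot tier under the same path.
        if path not in self.hot:
            self.hot[path] = self.cold.pop(path)
        return self.hot[path]

    def tier_of(self, path: str) -> str:
        return "hot" if path in self.hot else "cold"


store = TieredStore()
path = "/projects/logs/2017.csv"  # hypothetical path
store.write(path, b"old records")
store.offload(path)
print(store.tier_of(path))  # cold
print(store.read(path))     # b'old records'
print(store.tier_of(path))  # hot
```

The application code only ever calls `read` and `write` with the same path. Which tier holds the bytes is an operational decision made elsewhere, which is the separation of concerns the paragraph above describes.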
On a high-performance system like a MapR cluster, object tiering is not nearly as easy as it appears. In particular, you can put more files onto a single MapR installation than all of Amazon S3 already stores (estimated to be about 64 trillion objects as of 2018). Some MapR users already store trillions of objects in a single cluster. Furthermore, single objects in a MapR system can be much larger than the maximum S3 object, and there can be (literally) billions or trillions of smaller objects, which would be ruinously slow to access on S3 if they were stored as individual objects. In addition, all data inserted into S3 must be encrypted. The new object tiering in the 6.1 release deals with all of these issues, and selected customers are already using this new capability to provide a previously impossible mix of scale, performance, and cost.
To help users decide how and when tiering should be done, the 6.1 release introduces new metrics that show how much and what kind of I/O is being done on the different tiers of data.
In the 6.1 release, you get more options for encrypting data at rest. In particular, each MapR volume has the option to be stored in encrypted form on disk. Importantly, the key used for encrypting data is changed frequently so that even if some keys are intercepted, the damage will be very limited. When the system is very active, keys may be changed thousands of times per second.
Similar techniques are used to encrypt network traffic. Individual servers in a cluster negotiate short-term session keys with each of the other servers that they communicate with and manage these keys as they rotate.
A big difference between wire-level encryption and encryption at rest is that keys for data in motion only need to be maintained while the data is still in motion. Keys for data at rest, on the other hand, have to be managed for as long as the data might still be read. Ideally, users don't have to set up any additional services in order to make this work.
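The difference in key lifetime can be illustrated with a small sketch. The key ring below rotates to a fresh key after every block and tags each stored block with the ID of the key that protected it, so old keys must be kept for as long as the blocks they protect might be read. This is not MapR's implementation: the class and function names are hypothetical, and the keystream function is a stand-in built from the standard library purely for illustration. Real systems should use a vetted cipher.

```python
import hashlib
import secrets


def _keystream_xor(key: bytes, data: bytes) -> bytes:
    """Stand-in cipher for illustration only; do not use for real data."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


class AtRestKeyRing:
    """Hypothetical key ring: every key ever used is retained by ID."""

    def __init__(self) -> None:
        self.keys: dict[int, bytes] = {}
        self.current_id = -1
        self.rotate()

    def rotate(self) -> None:
        # Start using a fresh random key; earlier keys stay available for reads.
        self.current_id += 1
        self.keys[self.current_id] = secrets.token_bytes(32)

    def encrypt(self, plaintext: bytes) -> tuple[int, bytes]:
        key = self.keys[self.current_id]
        return self.current_id, _keystream_xor(key, plaintext)

    def decrypt(self, key_id: int, ciphertext: bytes) -> bytes:
        # The key ID stored with the block selects the right historical key.
        return _keystream_xor(self.keys[key_id], ciphertext)


ring = AtRestKeyRing()
stored = []
for block in [b"block one", b"block two", b"block three"]:
    stored.append(ring.encrypt(block))
    ring.rotate()  # rotate after every block in this sketch

# Much later: every block is still readable because its key was retained.
print([ring.decrypt(key_id, ct) for key_id, ct in stored])
```

Because each block in this sketch is protected by a different key, an intercepted key exposes only one block. A session key for data in motion, by contrast, could simply be discarded once the connection closes.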
Encryption is also used in the object tiering system. As files and parts of files are designated as cold data and are written to objects, an additional layer of encryption is done. Again, key rotation is handled completely automatically.
Even though these requirements sound fairly intimidating, the MapR Data Platform meets them, and the user experience is actually very straightforward. Wire-level encryption is automatically turned on if the "secure by default" installation is selected, and encryption of object tiering is automatically done no matter what. Encryption of hot data at rest can be enabled on a per-volume basis as volumes are created. In any case, users have a hard time even telling that encryption is enabled.
This quick summary of how the new release supports machine learning, AI, and Kubernetes better than ever before, and of how operations and security are enhanced by object tiering and encryption, just scratches the surface of what is new in the 6.1 release. There are also advances in metrics (percentile latencies per table, audits and metrics as streams, and more), ease of development (lightweight clients for DB with simple pip and npm installation, end-to-end sample applications, change-data-capture as an easily parsed JSON stream, secondary indexes on complex objects), easier access for files (NFSv4, S3, and such), an updated ecosystem (KSQL support, Kafka API updates), and more.
There has never been a better time to try out MapR.