Editor’s Note: Jim Scott, MapR Director of Enterprise Strategy and Architecture, gave a presentation entitled “Cloudy with a Chance of On-Prem” at Strata + Hadoop World 2017 in San Jose. You can watch Jim’s presentation, or read his blog post below to learn more.
IT budgets are shrinking, and the move to next-generation technologies is upon us. The cloud is an option for nearly every company, but just because it is an option doesn’t mean it is always the right solution for every problem.
Most cloud providers would prefer that every customer be tightly coupled with their proprietary services and APIs to create lock-in with that cloud provider. The savvy customer will leverage the cloud as infrastructure and stay loosely bound to a cloud provider. This creates an opportunity for the customer to execute a multi-cloud strategy or even a hybrid on-premises and cloud solution.
In this post, I explore different use cases that may be best run in the cloud versus on-premises, point out opportunities to optimize cost and operational benefits, and explain how to get the data moved between locations. Along the way, I’ll discuss security, backups, event streaming, databases, replication, and snapshots across use cases that run in most businesses today.
When we look at the development of the cloud from the 1980s until today–from early databases and enterprise applications to the Internet and now digital transformation–what has driven these changes is data. As data has become the core of our businesses, we want to be able to focus on what brings value to our company. The cloud provides the infrastructure that’s necessary for running a successful business.
Today, data drives market disruption. Looking at the market and the types of disruption that are happening, we find three companies that started out without any effective physical capital in their hands: Amazon, Airbnb, and Uber. They had software, and their market was data.
These market disruptors have had an enormous business advantage, simply because they had access to more data. And they have used data to impact business outcomes in real time.
Of course, companies need a place to save their data; they need a place to process it. And they need to be able to focus on real-time activity. The cloud is the obvious answer.
When companies are dealing with 60 billion transactions a day running through their platform, and they start transitioning from batch to real time, they can’t get there fast enough. Once they have a taste of how quickly data can actually be processed, they want to know why they can’t have it now. It typically comes down to a company’s IT budget. Industry leaders, however, are investing in disruptive technology now.
If we look at the technology and costs in the industry, that red line is the IT spend budget. It's pretty flat; it's not really going anywhere. So in order to do anything new, you must figure out a way to give up something old or reduce the costs on it. You don’t have to drop everything you’ve already built, and go with all new stuff. You can live with the things you already have to run your business, while starting to move toward the new, thereby picking up the benefits from both sides.
Here’s a common scenario: starting off with a system, but then, suddenly, your boss comes along and says, “We need to deliver more for our customer base.” And now you have pressure to deal with the three Vs, which are the hallmark of big data: volume, variety, and velocity. In addition, you have likely been tasked with reducing the cost of your legacy business systems. You are told: “We pay for square footage. We pay for electricity. IT costs money. So why can’t we make sure we don’t lose any of our legacy data and make it all work together?” Now you have pressure to figure out the cloud and containers, choose a cloud provider (Amazon Web Services, Microsoft Azure, etc.), and decide whether a hybrid, on-premises or cloud-only solution would work for you.
One of the technologies that everybody looks at when they start moving in this direction is Hadoop. The problem is that it was not built to solve these problems. It was built for batch analytics.
To gain more knowledge about cloud capabilities, let’s review some terminology:
Data Gravity. Getting data into the cloud is easy, and it’s generally free. Of course, to save it for you, cloud providers charge a monthly fee. And if you want to get your data out of the cloud, then you have to pay extra. In a nutshell, this is the concept of data gravity. The more data you put in, the heavier it gets. It behaves like a black hole; it’ll suck everything in once you start going there. It’s not necessarily a bad thing, but it’s something to be keenly aware of. Never let this concept out of your sight; it is critical in every decision you make involving the cloud.
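To make data gravity concrete, here is a minimal sketch of the arithmetic. The `CloudPricing` class and the prices used with it are hypothetical placeholders, not any provider's real rates; the point is only that storage spend and the cost to leave both grow with the amount of data you have accumulated.

```python
from dataclasses import dataclass


@dataclass
class CloudPricing:
    # Hypothetical per-GB prices for illustration only, not real provider rates.
    storage_per_gb_month: float
    egress_per_gb: float
    ingress_per_gb: float = 0.0  # assumption: inbound transfer is free


def cost_to_leave(stored_gb: float, pricing: CloudPricing) -> float:
    """One-time transfer cost to move everything currently stored back out."""
    return stored_gb * pricing.egress_per_gb


def cumulative_cost(monthly_ingest_gb: float, months: int, pricing: CloudPricing) -> float:
    """Total ingest plus storage spend after `months`, assuming nothing is deleted."""
    total = 0.0
    stored = 0.0
    for _ in range(months):
        stored += monthly_ingest_gb
        total += monthly_ingest_gb * pricing.ingress_per_gb
        total += stored * pricing.storage_per_gb_month
    return total
```

Plugging in your own provider's published rates and your expected ingest volume gives a rough sense of how "heavy" the data will be after a year or two, before you commit to a location.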
Cloud Neutrality. Cloud providers are in a heated competition to steal you away from each other. They would have you believe there is a freedom of choice; in reality, the concept of cloud neutrality is to prevent vendor lock-in. The problem is, if you’ve written software to run on Amazon, you can’t move it to Azure without significant changes; it doesn’t work. If you’ve written software on Azure, you cannot move to Google Cloud. The same goes for all the rest of the cloud providers. While you’re considering this restriction, think back to mainframes. Mainframes are heralded as one of the biggest vendor lock-in opportunities to ever exist. People had no choice but to go with them, if they wanted real, pure power. Cloud is considered the BIGGEST vendor lock-in attempt since the mainframe. But it doesn’t mean you have to suffer from it. You do, however, need to be painfully aware of it.
Container Image and Container Repository. Docker is the most pervasive and popular container product out there. It runs on multiple operating systems, but we are going to focus on Linux. A single server with Linux can run dozens of containers. Containers have the ability to take the core of the Linux operating system and build up little layers, making it extremely lightweight. It’s very much like virtual machines, but instead of having to carve off exact amounts of storage, memory, and compute, you basically say, “Let everything run, and I will make sure that however much workload I put there, I don’t overload or oversubscribe the capacity that I have.”
Recently, Amazon S3 failed. As the original tweet below points out, DevOps is difficult. But the takeaway lesson is not “don’t rely on S3.” The takeaway is “be prepared that anything can go wrong.”
Very simply put: prepare for the worst, hope for the best. Just because you’re in the cloud doesn’t mean something won’t ever break.
Personally, I advocate for a multi-cloud, multi-data center strategy. From my personal experience working in digital advertising technologies, I learned the importance of running multiple data centers. The company would hemorrhage money if we had a complete outage, so we ran an east coast data center and a west coast data center. One Saturday afternoon at three o’clock, I got a call, telling me that we had lost our east coast data center. The data center provider had accidentally turned the power off. Fortunately, we had the west coast data center, and we had load balancing between the two of them and were prepared for this disaster scenario. As a result, we did not lose anything more than a few thousand requests.
Let’s say I want to run my website in the cloud, and I deploy it on Amazon’s cloud. I want to capture the click stream that occurs as well as all the logs that come in from the web server, so I can perform analytics and build user profiles based on this information. Ultimately, the goal is to have a “sticky” user and drive the brand of my company. If someone hits a single page on my website and then bounces, I haven’t achieved that goal. But if someone visits a page and likes what is there, then clicks on some good recommendations for other content on the website and sticks around to explore further, it reinforces the brand. It helps to make the customer more “sticky” and more likely to spread the word to others. Effectively, it’s digital advertising for your website, so it’s a business advantage to build a user profile and recommendation engine for customers.
What happens, though, if my server on Amazon dies? My website would be done, plain and simple. I could load balance, and I could put multiple instances up, and that would help, but in the case of S3, there was an outage across a massive span of customers on Amazon. I’m in one of their data centers, so if something disastrous happens, I’m done.
To protect the company and the brand in that scenario, I want to run in another data center. But that’s not typically as easy as it sounds. When you choose the right technologies, you can make plans to replicate everything from the first data center to the second. However, when you have created a user profile with recommendations attached to it, you have to ensure that it is in both data centers and kept in sync. I’ve seen firsthand, in ad tech, what happens when a user comes in through one data center, and the ad-serve request comes into the second data center. Because of how internet routing works, we had to handle the request at both data centers. We synced the user profile between both data centers with a few hundred milliseconds of latency across the country. This was enough to ensure we didn’t have problems and knew what the user was supposed to be getting. On your own website, if you’re recommending additional content to your users, you want to ensure that you don’t recommend a page they just visited.
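The profile-sync requirement can be sketched in a few lines. This is an illustration only, assuming each data center keeps a map of page to last-visit timestamp for a user; the function names and the one-hour window are hypothetical, and a real replication service would handle the transport and conflict resolution for you.

```python
def merge_visits(local: dict[str, float], remote: dict[str, float]) -> dict[str, float]:
    """Merge two page -> last-visit-timestamp maps, keeping the newest timestamp."""
    merged = dict(local)
    for page, ts in remote.items():
        if ts > merged.get(page, float("-inf")):
            merged[page] = ts
    return merged


def recommend(candidates: list[str], visits: dict[str, float],
              now: float, window_s: float = 3600.0) -> list[str]:
    """Drop candidate pages the user visited within the last `window_s` seconds."""
    return [
        page for page in candidates
        if now - visits.get(page, float("-inf")) > window_s
    ]
```

Without the merge step, the data center that did not see the latest click would happily recommend the page the user just left.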
In order to get that data back and forth between the data centers, you could write a bunch of code or, when you’re starting to use streaming technologies, you’ve got the opportunity to replicate your stream of data and database between the two sides. MapR offers this replication service, which saved me (as a customer, twice) from having to code a lot of extra software. For the click streams, we used an open source piece of software, Divolte. With it, you can determine what types of actions you want to capture, as a customer is browsing the website, and it will stream them in.
What’s really important is the fact that I can load balance across these different cloud providers, which means that I have solid disaster recovery and business continuity in place, because I am across multiple clouds. They are, in fact, in different data centers in different parts of the country. But I like to take it a step further because, quite frankly, that’s not good enough for me. I don’t want people doing analysis up in my cloud providers because I have to operate and manage against a budget. I have to maintain the service level of operations for the website. As people in the business want to do more analysis on the data, I say, “Well, it just so happens that for my business, we already have an on-premises data center, and we already have the capital expenditure in place so I should use the investment I have when possible.”
Now, this isn't for everybody, but the point here is that it doesn't matter if it's in the cloud, or if it's on-premises, or a mix, or any combination therein. I have a server somewhere else, so I actually replicate all of that data down to my on-premises data center. I can pick and choose which workloads to run in which places, based on cost. If my on-premises data center is already paid for, that's a pretty good cost. But if not, then for doing heavy workloads at intermittent intervals, I might want to go with some private cloud provider. All of these things are options. You're not locked in to any specific cloud provider.
When you look at the image above, you are in fact seeing a picture of cloud neutrality: there is no vendor lock-in; you pick and choose where you want to run, how you want to run, and what meets your needs.
Of course, it is extremely important to a Customer 360 use case that I can enable the sales team. On the on-premises side, I actually run analytics and build company profiles. I enrich individual profiles that have clicked on content on the website, given us their contact information, or downloaded a resource. We then have the ability to perform householding on groups of users from behind corporate firewalls and find all of the people from this company to identify which of my products they are interested in and then push that data to Salesforce.
There are point solutions available for all of the cloud processing scenarios shown in the previous image. You can build those out; you will get them to work if you put enough time into them. The problem, though, is that it is very expensive to stitch together. And it is rather fragile because you have a lot of “lines” between components in the system. Every connection between components in the system adds to the fragility of the platform. When you start looking at limitations for speed, scale, and reliability, whatever is the single weakest link in the chain will drive all of your scaling in this platform. If, for example, your database with user profiles has a 600-millisecond response time, but everything else has a 5-millisecond response time, you are going to have to scale the platform for the 600-millisecond server. It’s the nature of the beast. Keep this reality in mind because the circles are easy and, quite frankly, the lines are as difficult as anything you can imagine.
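The weakest-link effect follows from Little's law: the number of requests in flight at a component equals the arrival rate multiplied by the time each request spends there. The sketch below applies that to the 600-millisecond and 5-millisecond figures above; the 1,000 requests-per-second load is an assumed number chosen purely for illustration.

```python
def required_concurrency(requests_per_second: float, latency_seconds: float) -> float:
    """Little's law: average in-flight requests = arrival rate * time in system."""
    return requests_per_second * latency_seconds


def slowest_component(latencies: dict[str, float]) -> str:
    """Name of the component with the highest latency, i.e. the one that drives sizing."""
    return max(latencies, key=latencies.get)


# Assumed load for illustration; latencies are the ones from the text.
load_rps = 1000.0
latencies = {"profile_db": 0.600, "everything_else": 0.005}
in_flight = {name: required_concurrency(load_rps, s) for name, s in latencies.items()}
```

At that load, the slow database has to hold roughly 600 requests open at once while every other component holds about 5, so connection pools, threads, and hardware end up being sized around the one slow link.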
When we go from on-premises with point solutions to the cloud, it's the same thing. Some people have a magical-utopian view that the cloud converges all of these circles into one, but it doesn't. If I'm on Amazon, and I put my data in S3, it's in S3. If I have it in Elastic Block Store, it's in Elastic Block Store. If I have it in Redshift, it’s in that data warehouse. You start seeing data being placed in different locations and being modeled in different ways for different use cases. They are completely disconnected; you still have to connect the lines. They may be within one data center, or one virtual environment, but they are not converged data sets. You still have to move data around; you still have to perform ETL processes; you still have to build and execute the work. The lines are the hard part; I cannot stress it enough. You can build point solutions any day of the week and meet basic needs. Putting it in one distribution does not converge anything! To put it a different way, just because you put everything in a box doesn't mean they are actually connected to each other.
When you start looking at cloud processing requirements, you should be aware: location awareness is a really big deal, and global awareness is an even bigger deal. If I am considering multiple data centers, and I want to have a view of all of my data, be able to access it, and know its location, a single global namespace helps tremendously. Otherwise, I might have three data centers that don’t really know about each other, which may be fine for some, except there will not be continuous, coordinated data flows without a single global namespace. There will not be strong consistency by default. Consider omnidirectional replication between data centers. When in one data center, no big deal, but as soon as you go to another data center, the complexity level jumps up an order of magnitude or two. It’s a lot of extra work and a lot of extra overhead. Not having to think about those complexities is a big timesaver.
Organizations use containers to improve resource utilization, increase developer efficiency, and deploy microservices.
As you can see above, 35% of people in the 2016 Docker survey want to avoid cloud vendor lock-in. I suspect that it is probably higher at this point. Docker can run in your local data center, and it can run in the cloud, so it helps create application portability. When you’re on-premises, you can control everything and make it work the way you want, but when you go to the cloud, how do you move your software to the cloud from on-premises? Unfortunately, you don’t simply pick everything up, move, and be done. It’s a slow, long process of migrating things over. If you have your applications Dockerized, or containerized, it becomes very easy to make your applications portable, move them around, and have a good, solid DevOps process, so you can build them in one place. You could mirror your entire Docker repository to the cloud from on-premises, or even inter-cloud, so you don’t have to duplicate all of your DevOps efforts in multiple locations.
But there are challenges with containers:
Fortunately, there is a solution.
MapR recently announced the MapR Data Platform for Docker, which supports the containerization of existing and new applications by providing containers with persistent data access from anywhere. It solves the problem of persistence from containers. Previously, the rule of thumb was: if you have an application that needs to persist data, do not ever persist that data in the container, or it’s as good as gone; you’ll never get it back if the container dies. Since that’s the general expectation, it becomes very important to have a reliable place to put all of your data.
As you can see in the image below, MapR expands and simplifies container usage.
If you have a background in Hadoop and Spark, you’re accustomed to big data analytics on one side, but the MapR Data Platform opens up the door to all of the enterprise applications on the other side. You don’t have to pick and choose how you leverage the platform anymore. It works for everything. It can run on-premises and in the cloud; it simplifies how you move from one to the other. And all the APIs are standard, so there's no lock-in.
The MapR Persistent Application Client Container (PACC) has the following advantages:
The simplicity is evident in this diagram:
Security is at the core of MapR PACC. Depending on the person or even the company, security is often an afterthought. Some companies cannot afford for it to be an afterthought, so it has to be step one. MapR PACC accounts for all of these contingencies upfront and removes the burden from the engineering and administration teams.
Solution: MapR PACC allows existing applications to store state in the MapR Data Platform, ensuring resilience and availability across application and infrastructure.
Let’s consider a very simple example of an existing application. I like to use NGINX or Apache as a web server. Web servers can be painful, operationally, in that when they get launched on a server, they generate logs. People want access to the logs to do analytics, so they have to resort to doing log shipping, which is how they get the logs off of that server to the server where they're going to do analytics. They might be moving 60 billion or more log records per day. It is a lot of work; it requires a lot of overhead; and, quite frankly, it’s an administrative nightmare to focus solely on that problem.
If, instead, you could write all of your logs directly through to the distributed platform, you could save all of the time, previously spent on log shipping. The logs get written directly to the MapR Data Platform, and line-by-line as they are being written, you have immediate access from the analytics side. This is a tremendously simple and powerful capability that is possible because of open API standards, like NFS and POSIX, and containerization tools like Docker.
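As a rough sketch of the idea, the writer below logs through ordinary file I/O to a path, and a reader picks up new lines from wherever it last stopped. The assumption is that the path sits on a shared POSIX or NFS mount visible to both sides; the function names are hypothetical, and a local temporary directory works just as well for trying it out.

```python
import logging
from pathlib import Path


def make_access_logger(log_path: Path) -> logging.Logger:
    """Logger that appends one line per request to `log_path` using plain file I/O."""
    logger = logging.getLogger(f"access.{log_path}")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:
        handler = logging.FileHandler(log_path, encoding="utf-8")
        handler.setFormatter(logging.Formatter("%(message)s"))
        logger.addHandler(handler)
    return logger


def read_new_lines(log_path: Path, offset: int = 0) -> tuple[list[str], int]:
    """Return lines written since `offset`, plus the new offset to resume from."""
    with open(log_path, "r", encoding="utf-8") as f:
        f.seek(offset)
        data = f.read()
        new_offset = f.tell()
    return data.splitlines(), new_offset
```

The analytics side simply calls `read_new_lines` on a schedule, passing back the offset it received last time. No log-shipping agent sits in between, because both sides are looking at the same file.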
Solution: MapR PACC establishes a secure connection to the MapR Data Platform, which provides all three types of persistence.
If we look at next-generation technologies, we're starting to see a migration over to microservices: creating smaller services and smaller components, then Dockerizing and deploying them.
It becomes very easy, but you are required to have a state facility to maintain the data. If you don't have it, this approach won't succeed. You've got to be able to drive the approach with decoupled messaging. As a result, streaming is absolutely critical. If you can't handle the scale for 50 services, what makes you think that when you turn those 50 services into 500, you're going to be able to handle the scale? Things will fall apart. Streaming, therefore, becomes critically instrumental.
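Decoupled messaging is easier to see in miniature. The toy `Topic` class below is an in-memory stand-in for a stream, not the Kafka API: producers append to a log, and each consuming service tracks its own read position, so adding a new service never requires touching the producers or the other consumers.

```python
from collections import defaultdict


class Topic:
    """Toy append-only log; each named consumer keeps an independent offset."""

    def __init__(self) -> None:
        self._messages: list[str] = []
        self._offsets: dict[str, int] = defaultdict(int)

    def publish(self, message: str) -> None:
        # Producers only append; they know nothing about who reads.
        self._messages.append(message)

    def poll(self, consumer: str, max_messages: int = 10) -> list[str]:
        # Return the next batch for this consumer and advance only its offset.
        start = self._offsets[consumer]
        batch = self._messages[start:start + max_messages]
        self._offsets[consumer] = start + len(batch)
        return batch
```

Because a consumer that falls behind affects only its own offset, going from 50 services to 500 means adding readers rather than rewiring point-to-point connections. A real streaming system adds durability, partitioning, and replication on top of this basic shape.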
Clearly, containers are a massive opportunity, and organizations are rapidly moving to containers. It is very telling that Docker has become as pervasive as it has, despite not having a standard solution for saving data in a safe, respectable way, so people can easily get back to the data if a container crashes. And now, with the MapR Data Platform for Docker, that issue will no longer be a problem. MapR provides a high value, highly differentiated solution for stateful application support. Our tactical solution is part of a much larger strategic initiative.
When you look at the MapR Data Platform, analytics and operations are the two parts of your business:
The analytics side is traditionally the space that Hadoop would fill; the operational side is where we would see point solutions with tools like Kafka and NoSQL databases like Mongo and Cassandra. They are all disparate solutions that meet different needs. But when you bring them all into one platform, where they have a common data fabric, they all have access to the same services, same reliability, and are linearly scalable. Instead of having to move all of this operational data out of your database management systems by extracting the data, transforming it, and loading it into some other schema like a data warehouse, you don't have to go through that process anymore. All the latency drops out; the extra overhead disappears.
As a result, you now have the opportunity to start creating applications that can leverage the long-tail history of a customer. You don’t have to put the old history in one database and the previous 30-day history in another database. That problem is solved: you have complete access to real-time and historical data in one platform.
On the developer side, when you start getting away from the relational database for new applications and start moving towards something like a document database, you’ll understand why document databases have been on a massive upswing. That reason is simplicity. One line of code is all it takes to get an entire data structure into the database, and one line of code is all it takes to get it back out, as compared to the hundreds of lines of code typically needed to persist data into a relational database.
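The "one line in, one line out" claim looks like this in practice. `ToyDocumentStore` is a hypothetical in-memory stand-in written for this example, not a real database client; it only shows that a nested structure goes in and comes back whole, with no table design or object-relational mapping in between.

```python
import json


class ToyDocumentStore:
    """In-memory stand-in for a document database; stores each document as JSON."""

    def __init__(self) -> None:
        self._docs: dict[str, str] = {}

    def insert(self, doc_id: str, document: dict) -> None:
        self._docs[doc_id] = json.dumps(document)

    def find_by_id(self, doc_id: str):
        raw = self._docs.get(doc_id)
        return None if raw is None else json.loads(raw)


store = ToyDocumentStore()
profile = {
    "name": "Pat",
    "interests": ["streaming", "containers"],
    "last_visit": {"page": "/blog", "seconds_on_page": 42},
}
store.insert("user-1", profile)         # one line to persist the whole structure
restored = store.find_by_id("user-1")   # one line to get it back
```

Persisting the same profile relationally would typically mean a users table, an interests table, a visits table, and the code to split the object apart and join it back together.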
From a platform perspective, we pride ourselves on the entire platform being completely open. The APIs are the most important piece. You write software, and your company builds software against this platform. All of the APIs are standard and accepted by the industry. We innovate below the API.
We took the HDFS API, and we built a better underlying implementation to support it. We also have standards for file systems. All of the software your businesses have written for the last 30 years that works on a standard file system can actually work directly on top of the MapR Platform. You do not have to change any code; it simply works. When you look at the NoSQL and the SQL-on-Hadoop technologies, the standards are there. When you look at our document database, it's an open JSON API. And then, there’s the Kafka API for streaming. All of these components are built on open APIs.
The reason I stress this so much is that when you look at the cloud, the service offerings from the cloud providers are not really built on standard APIs. The closest we see is with S3, because the S3 API has been opened, and there are implementations of it out there now. Aside from that, it's not really a standard. You can get the APIs, and you can program against them, but that's it.
In addition to its open API architecture, which assures interoperability and avoids lock-in, the MapR Data Platform has many benefits, as seen below.
The most important benefit, to me, is the last one: you can take snapshots of your operational data that’s constantly changing. This creates a point-in-time consistent snapshot, which you can then use to build data models. You can see the data without anything else changing. When you’re done with it, you can return to the real-time data. It works equally well with files, streams, or databases.
Wherever you are on your big data journey, I would encourage you to take a look at the opportunity here. Put the power of MapR to work for your business.
Do not shy away from cloud. It delivers a significant value proposition if you leverage it as infrastructure-as-a-service (IaaS). Infrastructure-as-a-service has a great value to your business, which can converge, transform, and grow with MapR.