Editor's Note: This is the 3rd blog post in a series on Cloud, Kubernetes and Dataware. The previous article is titled "Who Are You? Cloud Advantages Differ Depending on Your Needs."
You're going to cloud! The motivators to go are powerful and the potential rewards are large. To ensure the experience is what you're expecting, here are some things it's good to know before you go.
Working in a cloud environment changes some familiar assumptions. A big part of dealing with an on-premises data center is the lead time it can take to modify or expand a system. Typically it takes weeks or months to get new machines online. For a cloud cluster, the lead time could instead be measured in minutes. This immediacy is a great motivator to use cloud systems. But are you ready for that?
The familiar delay in getting new machines online on premises is an inconvenience, and in certain situations it is a tangible disadvantage: when there is urgency to get a new project underway, to respond to a growing customer base, or to take on new workloads. But a long lead time does have at least one advantage: it forces you to plan ahead and to adjust code and resources for the new situation.
In contrast, the ability to spin up a cloud cluster almost instantly can take people by surprise. It's a good thing, but it's also a good idea to develop the habit of taking this speed into account ahead of time. Deploying systems in minutes when there should have been hours or days of planning can, in some cases, result in costly mistakes. Of course, you should absolutely take advantage of the speed and convenience cloud offers, but you will need to develop new habits to ensure that adequate planning is still part of your workflow. In aviation, there's a saying that the right response in almost any emergency is "don't just do something, sit there!" The same can be true in cloud operations: just because you can react in minutes doesn't mean you should react before planning what to do.
Multi-tenancy is what makes cloud really pay off. Huge cloud providers let you share in their economies of scale. It is a great convenience to be able to rent just a part of a cluster if that is all you need at the moment and to give it up when you don't need it.
This attractive aspect of cloud also involves some risks. Multi-tenancy has tradeoffs even for your own organization's data center, but remember: when you use a cloud deployment, there are others outside your organization sharing the same resources, and you don't get to see who they are.
From your point of view, your cloud cluster or part-of-a-cluster is a convenient system in which you are renting resources for only as long as you need them. But you don't see who else is sharing the same compute, storage and network resources, and generally you don't have control over when and what level of workloads others run on these shared resources.
Any time you share resources with others, there is some risk of leakage. That leakage can be information extracted via advanced threats that exploit vulnerabilities like Heartbleed, Meltdown, Spectre, or Rowhammer. Or it could simply be somebody else's instability leaking into your system. Cloud vendors do a very impressive job of isolating you from these different kinds of leakage, but sharing resources is inherently riskier in this respect than having a completely isolated set of resources.
But in some ways cloud has less risk: the basic architecture of a cloud-native application is likely to be fundamentally more secure than a traditional application architecture simply because network access is more limited in cloud-native systems by design. Similarly, cloud vendors routinely initialize disk space and generally encrypt data by default. Both of these basic measures are often neglected with on-premises systems.
Another, often overlooked, way in which cloud architectures can reduce risk is that they often involve virtual machines and containers that are rebuilt and restarted fairly often. This constant flux (sometimes known as "plowing the field") makes it harder for attackers to maintain persistent threats without being detected, because they have to keep infecting new instances. As a result, this practice can actually make a cloud architecture more secure. You also can get this benefit as a result of cloud-native design, regardless of cloud vendor, or even when working on premises.
It is hard to provide stable systems if you don't have access to stable resources. There is a large incentive for cloud providers to over-commit their physical resources by betting that most customers will not fully utilize the resources that they are asking for. Customers win because they (usually) get the resources they want for a lower price. Cloud vendors win because they can (effectively) sell the same hardware more than once. On average, things work out pretty well. Taken to the extreme, this leads to serverless computing where the idea of directly renting physical systems evaporates entirely.
This is fine as long as your application isn't trying to get the same resources at the same time as somebody else's application. All cloud vendors, however, allow you to reverse the economic incentives: you can pay for access to an entire machine rather than shared access to part of a physical machine, or pay for guaranteed I/O bandwidth. This doesn't have to be a large cost if you have enough scale, because you can then build systems that share resources well among just your own applications by using frameworks like Kubernetes.
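Frameworks like Kubernetes make this kind of intra-organization sharing explicit: each container declares resource requests (what the scheduler reserves for it) and limits (a hard cap that protects neighbors from you, and you from them). A minimal sketch of such a pod spec follows; the names and numbers here are purely illustrative, not from any particular deployment:

```yaml
# Hypothetical pod spec; names and sizes are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: analytics-worker
spec:
  containers:
  - name: worker
    image: example/worker:latest
    resources:
      requests:        # what the scheduler guarantees this container
        cpu: "2"
        memory: 4Gi
      limits:          # hard cap; the container cannot exceed these
        cpu: "4"
        memory: 8Gi
```

Setting requests below limits is the same over-commitment bet the cloud vendors make, but applied only among your own applications, where you control the workloads.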
In short, you generally can get some guarantees in terms of resources, but you should plan accordingly for the possible additional costs of doing so.
A common symptom of the way that cloud vendors use shared resources comes up when you hear people talk about inconsistent benchmarking results in the cloud: the inconsistency usually is the result of sharing. Part of the problem may be attributed to selecting the lowest-cost instance types, which are, of course, heavily shared. Or it can be due to the fact that disk storage has to be initialized block by block before you can use it, thus making first writes on a block appear slower. Or it can be due to the cloud vendor allowing high I/O performance for short periods of time but throttling it to limit the average rate at which data can be accessed (unless you pay extra for consistent high performance). In general, the performance constraints imposed by cloud vendors are quite complex, and how those constraints interact with benchmark applications is even more complex. Forewarned is forearmed: be aware that it can be hard to reliably interpret benchmark results in cloud environments, and don't be caught by surprise. Conversely, make sure you put good monitoring into your cloud systems so you can determine whether your (cloud) neighbors are causing problems.
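One practical habit that follows from this: never trust a single benchmark number in the cloud; run the measurement repeatedly and look at the spread. A small, self-contained Python sketch of that idea, where `touch_data` is just a hypothetical stand-in for your real workload:

```python
import statistics
import time

def time_once(task, *args):
    """Wall-clock a single run of a benchmark task."""
    start = time.perf_counter()
    task(*args)
    return time.perf_counter() - start

def benchmark(task, runs=20, *args):
    """Repeat a task and report the spread, not just one number."""
    samples = [time_once(task, *args) for _ in range(runs)]
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
        "stdev": statistics.stdev(samples),
    }

def touch_data(n=100_000):
    """Stand-in workload: allocate a buffer and touch one byte per 'block'."""
    buf = bytearray(n)
    for i in range(0, n, 4096):
        buf[i] = 1

stats = benchmark(touch_data)
print(f"median {stats['median']:.6f}s, spread {stats['max'] - stats['min']:.6f}s")
```

If the max-to-min spread is large relative to the median, sharing (or throttling, or first-write initialization) is a likely suspect, and the median or a percentile is a far more honest summary than any single run.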
The question of cloud isn't just an issue of going to a public cloud vendor versus staying on premises or doing a combination of both. It's also an issue of geo-location: where do you need data and services? Is this up to you or driven by regulatory issues? Cloud, like on-premises computing, has geographic considerations you'll want to explore in advance.
It can be helpful to place a cluster in a particular geographical location. One reason is to have the cluster near a data source or data consumer for low latency applications. This proximity can also reduce the risk of losing data before it is ingested. IoT systems are a classic example of this situation. How, then, does public cloud address the issue of geo-location?
With cloud vendors, you generally can specify where the machines that will handle your data are actually located. Amazon Web Services (AWS), for instance, offers dozens of availability zones, grouped into geographic regions, from which to choose. With any public cloud service, however, you don't have a full range of options for where data and applications will be located, so it's useful to investigate the options ahead of time. If cloud is available where you need it to be, it can be a huge convenience.
Geo-distributed data and computation, whether on-premises or in cloud deployments, is important to many modern enterprises. You should be able to easily and efficiently place data and computation where you need to, and a data platform with the right capabilities can handle this natively. (Image based on Figure 1 in "Data Where You Want It: Geo-Distribution of Big Data and Analytics" by Ted Dunning and Ellen Friedman. Used with permission.)
Sometimes regulatory requirements are a driver for setting up a new data center or locating it in a particular geographical location. Financial institutions are a typical example: there often are regulatory requirements for data to be located, and remain, in a particular country. If the desired location is one provided by the particular cloud vendor, cloud can make it easier to meet this requirement. Keep in mind, however, that regulations often require a financial company not to depend solely on a single cloud provider; that may be considered too risky. This requirement makes multi-cloud a desirable option. This is a different strategy from being able to switch vendors: it's about using more than one cloud vendor at the same time. Consider, however, how using multiple vendors will interact with data sovereignty requirements. We will talk more about how to have an effective multi-cloud strategy in a later article in this cloud series.
Planning for disaster recovery can be a strong motivator to use cloud services but, as with specialized hardware, disaster recovery can be a reason to also keep an on-premises data center and/or not to limit yourself to just one cloud provider. Here's why.
To start with, we can separate the issue of individual machine failures from full-on disaster recovery. Modern data architectures can handle losses of individual machines, but it is still important to design with the knowledge that, at some point in a large system, one or more machines will fail. If you manage your own system and have a data platform with good failover capability, you won't be in danger of losing data and won't even see much of a disturbance. The MapR Data Platform is one such technology. Somebody still has to deal with replacing the bad disk or bad machine, but with a cloud service, that maintenance will be handled for you.
But what about disaster recovery plans for larger problems? You don't know whether or not a disaster will ever occur, but you do know that the very high impact of such a disaster means that you need an advance plan so that you are protected in case of a catastrophic event that damages your data center. This need is still true if you go to cloud. Remember: the term "cloud" is an attractive marketing term, but with a cloud vendor, your data and applications are still running on hardware located somewhere. If a hurricane or earthquake or fire damages a cloud vendor's data center, your data and applications will be affected. As with on-premises data centers, it's good in a cloud-based design to have a secondary data center in a different geographical location to mitigate the impact of a disaster at one location. Planning for and executing this is largely your responsibility. While doing this, make sure that you don't inadvertently build a system that will fail if either data center is compromised. If you do, you are actually worse off than with a single data center in the first place.
A cloud-based cluster could serve as the back-up cluster for an on-premises data center in your disaster recovery plan. Or, if you choose a cloud-only strategy, set up a secondary cluster in a different availability zone of your cloud vendor. Continuing the example with AWS, you might choose to locate your main cluster in the New York area, another cluster in the central US, and a third cluster in South America. You do incur extra costs for these additional clusters, but this burden can be mitigated by limiting the size of the secondary clusters. If you are using one or more of these clusters primarily for disaster protection, you could take advantage of the elasticity of cloud by setting up just a limited-size cluster with capacity for the most critical data and workloads, expanding it only when you need to handle bigger workloads. You could also do this expansion if you find at some point that you want to use a secondary cluster not only as part of your disaster planning but also as a sandbox for experimental projects. Elasticity makes it fast and fairly easy to expand to the size and computing power you will need. Once again, a potential "gotcha" need not be a problem if you have sufficient information beforehand and plan accordingly.
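The routing logic behind such a plan can be very simple: prefer the primary cluster and fall back to the next healthy one in order of preference. A toy Python sketch of that failover decision, where the cluster names and the health map are hypothetical stand-ins for real endpoints and real health probes:

```python
# Hypothetical cluster endpoints, ordered by preference.
CLUSTERS = ["us-east-primary", "us-central-secondary", "sa-east-dr"]

def is_healthy(cluster, health):
    """Stand-in for a real health probe (ping, API check, etc.)."""
    return health.get(cluster, False)

def pick_cluster(health):
    """Return the most-preferred healthy cluster, or None if all are down."""
    for cluster in CLUSTERS:
        if is_healthy(cluster, health):
            return cluster
    return None  # total outage: time to page a human

# Normal operation: the primary is up and gets the traffic.
print(pick_cluster({"us-east-primary": True, "us-central-secondary": True}))
# Disaster at the primary site: traffic shifts to the secondary.
print(pick_cluster({"us-east-primary": False, "us-central-secondary": True}))
```

Notice that the fallback chain only helps if each cluster can actually carry the critical workload on its own; that is exactly the "don't build a system that fails if either data center is compromised" warning above.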
Do keep in mind that even if you set up separate clusters in different availability zones of a cloud provider, you still may have an inadequate disaster recovery plan in the case of a huge attack or outage against the vendor. This is a different situation from a localized problem, such as a flood, at a particular geo-location where your cloud vendor has machines housed. A large-scale attack or problem with the vendor could affect multiple or even all of that particular provider's availability zones. While this situation is not a high probability event, it can be a very serious one. The constant threat of cyber attacks makes it important to take this issue seriously. This argues once again in favor of a multi-cloud or hybrid approach so that you do not rely on a single vendor. In fact, regulatory requirements in certain sectors may require that you not be tied to a single cloud vendor. That's one more reason to look for designs and key technologies that support portability plus a standard deployment architecture. We talk more about a powerful duality of key technologies, Kubernetes for container orchestration and the MapR Data Platform to orchestrate data, in a later article in this series.
Portability is a key issue to weigh as you decide if a cloud strategy is right for you. In the world of cloud deployments, portability means whether or not your application or service will run in a different cloud or on premises without substantial changes. In addition to the portability of programs, there's also the question of the portability of data. Think of it this way: if you are a bricklayer, you cannot do your work without bricks; if you have an application or service, you cannot run it without access to the appropriate data. One reason these different aspects of portability matter is freedom. Will you be free to take advantage of the offerings of a different cloud vendor? Will it be difficult to move applications and data from an on-premises data center, or from an edge data source such as an IoT sensor, to cloud? It's important to have the freedom to build a multi-cloud system without extreme inconvenience and without delays caused by programs having to be rewritten. Let's take a look at these challenges in more detail.
The pragmatic question of application portability breaks down into several aspects. One aspect is whether or not you can move your program to a different cloud, even in theory. This ability has to do with the extent to which the other services and software that your program depends on will even be available with a different vendor. And, if available, will they be close enough to exact compatibility to allow your program to run correctly? You need for data to be accessible through standard means and in a timely manner. For the issue of time, it can be helpful to adopt a streaming microservices style of architecture that provides temporal decoupling of services or to run batch programs for which an instant response is not necessary. But in some cases you do need a rapid response, and when you do, you should be able to move data where it is needed without large delays, expense and hassle. In best practice, you need to have a system that provides automated and efficient data transfer and stream replication to support streaming microservices.
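Temporal decoupling simply means that producers and consumers never call each other directly; a message stream sits between them, so each side runs on its own schedule. An in-memory Python sketch of the idea, using `queue.Queue` as a stand-in for a durable, replicated stream (Kafka-style topics in a real system):

```python
import queue
import threading

# Stand-in for a durable, replicated message stream between services.
stream = queue.Queue()

def producer(n):
    """Publish events without knowing whether anyone is listening yet."""
    for i in range(n):
        stream.put({"event_id": i})
    stream.put(None)  # sentinel marking the end of this stream

def consumer(results):
    """Process events whenever this service is ready to run."""
    while True:
        msg = stream.get()
        if msg is None:
            break
        results.append(msg["event_id"])

results = []
t = threading.Thread(target=consumer, args=(results,))
producer(5)   # the producer finishes before the consumer even starts
t.start()
t.join()
print(results)  # -> [0, 1, 2, 3, 4]
```

Because the producer completes before the consumer starts and nothing is lost, the two services are decoupled in time; with a durable stream, they can also be decoupled in place, which is what makes it practical to move either side between clouds.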
Also consider vendor-specific services that vary from cloud to cloud. These vendor-specific cloud services are attractive for the convenience they may provide, but they also can be the enemy of portability. The more you use services specific to a particular cloud vendor, the more you are locked in to that vendor. Some of these offerings may be special services that meet a particular need you have and thus are worth what you give up in portability. In other cases, however, the services are for fundamental functions you might expect to be standardized, yet they vary from cloud vendor to cloud vendor. For example, you find differences in object storage such as S3 (Simple Storage Service), in local files, and in Kubernetes (the orchestration framework for containerized applications) between different cloud vendors. Even if you are using a system like Kubernetes that is intended to provide a very standardized environment for computation, there are a number of places where low-level differences between cloud environments leak through. As an example of such an issue, simply assigning temporary file storage to a container in Kubernetes is done differently in AWS, Microsoft Azure and Google Cloud Platform (GCP). There are cumbersome ways to work around these differences in many cases, but they add delays, wasted effort and expense, making switching sometimes unfeasible.
In summary, it is an advantage to choose a system with a dataware layer that provides portable infrastructure and capabilities that reduce switching costs in order to support global portability.
Try these free resources for more information on these and related topics:
Blog post: "Rent or Buy: Should You Go to Cloud (or Not)?" by Ellen Friedman, 1st article in the Cloud, Kubernetes and Dataware series
Blog post: "Who Are You? Cloud Advantages Vary Depending on Your Needs" by Ellen Friedman, 2nd article in the Cloud, Kubernetes and Dataware series
Whiteboard Walkthrough video with Ted Dunning "Big Data in the Cloud"
Blog post: "Unifying Streaming Data Using MapR" by Gokhan Simsek
eBook: Streaming Architecture: New Designs with Apache Kafka and MapR Streams by Ted Dunning and Ellen Friedman
eBook: AI and Analytics in Production by Ted Dunning & Ellen Friedman
Product description MapR Event Store for Apache Kafka