This is the first in a series of articles about observing, understanding, and troubleshooting the behaviors of large-scale distributed systems.
At a high level, computing is simple. You have input; you apply some computations; you have output. Unless you're writing code in assembly, that high-level logic is built on many layers of abstraction, making computing a complex task in practice. The scale of computing problems being solved today results in exceptional complexity.
While attending university, I recall spending more than a little time discussing the nuances and efficiency of various sorting algorithms. Barely a decade later, those considerations seem largely irrelevant. The differences between the sorting algorithms still very much exist, but the significance of those differences has decreased drastically. If sorting a data set takes a non-trivial amount of time for a single system or process to complete, I split the sort amongst many systems and processes. If the number of data sets to be sorted would collectively take a non-trivial amount of time for a single system or process, I split the sorts amongst many systems and processes.
Splitting a sort amongst many systems and processes still involves all the complexity of sorting a single data set on a single system in a single process, but it also introduces the complexity of distributed data and computing systems. And that added complexity barely scratches the surface of what's being done in practice today, as sorting data sets is likely to be just one small step to the solution of a real-life computing problem.
Our computing capabilities have undergone constant evolution. By abstracting lower-level details, we can focus on higher-level concepts. Repeating that cycle over and over has now brought us to a point where a single, simple line of code can accomplish something that would have been beyond comprehension in the not-too-distant past.
First, the good. We can build massively scalable, distributed computing systems with relative ease. We have APIs that allow us to leverage that massive scale while keeping our application code concise and simple.
The bad: the APIs may be simple, but the complexity still exists; it's just abstracted behind layers of libraries and backend services. Built into those libraries and services is a vast number of design assumptions and configuration switches whose default values won't suit every possible situation. With this complexity, it is a significant challenge to estimate how long a computing task should take and an even more significant challenge to determine how and why a computing task is taking longer than that estimate.
Understanding and estimating the performance of a single process running on a single server can be fairly straightforward. We can open a few terminals and monitor things like system CPU, memory, disk, and network utilization as well as the per-thread CPU utilization of the target process, sample stack traces over time, etc. And because we are monitoring just one process and just one server, we can do these things fairly thoroughly in real time.
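As a small illustration of stack sampling, the same idea used by tools like `jstack` or `py-spy` can be sketched in-process with Python's `sys._current_frames()`. The function names here are invented for the example, and a real profiler would record full stacks rather than just the innermost frame:

```python
import collections
import sys
import threading
import time

def sample_stacks(thread_id, duration_s=0.5, interval_s=0.005):
    """Periodically grab the target thread's current frame and count
    which function it was executing at each sample."""
    counts = collections.Counter()
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        frame = sys._current_frames().get(thread_id)
        if frame is not None:
            counts[frame.f_code.co_name] += 1
        time.sleep(interval_s)
    return counts

def hot_loop():
    # Deliberately CPU-heavy inner function, so most samples land here.
    total = 0
    for i in range(5000):
        total += i * i
    return total

def worker(stop):
    while not stop.is_set():
        hot_loop()

stop = threading.Event()
t = threading.Thread(target=worker, args=(stop,))
t.start()
samples = sample_stacks(t.ident, duration_s=0.5)
stop.set()
t.join()
print(samples.most_common(3))  # hot_loop should dominate the samples
```

The frequency with which a function appears in the samples approximates the fraction of time the thread spends executing it, which is exactly the signal we want when deciding where to look next.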
In a distributed computing environment, a single process on a single server can gate the progress of an application. Yet tracking down that single process and server amongst the hundreds, thousands, or tens of thousands of moving parts is a challenge. Doing so in real time, using the same techniques as those used to troubleshoot a single process on a single system, can be difficult.
While the challenge is more significant, luckily the same computing principles apply in a distributed computing environment as for a single process on a single server. While there are many more servers and processes, the things we need to review to understand how each individual piece of the system is functioning are the same now as they were in the past. The challenge today is that while there may be 100x or 1000x more servers and processes, we cannot spend 100x or 1000x more time inspecting those parts. We must adjust our tools and techniques to gain orders of magnitude more efficiency in our own work.
"In every chain of reasoning, the evidence of the last conclusion can be no greater than that of the weakest link of the chain, whatever may be the strength of the rest."
- Thomas Reid, 1786
"... to seek out one line in all his books were to go look a needle in a meadow."
- Thomas More, 1532
The age-old idiom of the "weakest link" is quite relevant to our modern-day distributed computing systems. A single environment may include a distributed storage subsystem, a distributed processing subsystem, a distributed resource scheduling subsystem, a query language and data warehouse abstraction, and so on. The time it takes to complete a single computing task can be significantly affected by any layer or component in the distributed system. Further, each subsystem in the distributed computing system can include many servers and processes, and the time it takes to complete a single computing task can be significantly affected by just one server or process in just one subsystem.
Similarly, the even older idiom of the needle in the haystack still aptly describes the challenge we face in tracking down performance bottlenecks in our modern distributed computing systems.
Combine both idioms, and you end up somewhere in the ballpark of the scale of the challenge. Any link in the chain could be the problem, and determining why it's the problem is finding the needle in that haystack. But fear not, while the scale of the challenge has grown, the liberal use of disciplined technique still allows those needles to be found with a very reasonable amount of effort.
As a basic abstraction, computers consist of a central processing unit (CPU), memory, and input/output streams. The computer's operating system (OS) controls the work that the CPU is asked to execute. The design of general-purpose operating systems, such as many based on Linux, is such that the OS can be asked to execute arbitrary programs.
From a human perspective, computers are asked to perform abstract tasks. We might consider a computing task to start by issuing a command at the Linux shell and end once the shell indicates the command has returned. Or, we might consider a computing task to start when we hit a "submit" button in a web browser and end once the browser loads a subsequent page, indicating the results of the submission. In any case, the human definition of a computing task will generally not directly correlate with the simpler concepts conveyed in computer architecture and organization. However, when computing tasks take too long to complete from our human perspective, the cause is often rooted in simple computer architecture and organization concepts. Thus, a strong understanding of basic computer architecture is critically important to determine why a computing task is taking longer than we'd prefer.
At a high level, a program being executed by a computer will be in one of these states at any given time:

- executing instructions on a CPU, acting upon memory
- waiting for an input/output (I/O) operation to complete
- runnable, but waiting for the operating system to schedule it onto a CPU
When considering why a program takes a certain amount of time to execute, it is extremely useful to determine how much time was spent in each of these three states and why.
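As a rough illustration of that breakdown, comparing wall-clock time against CPU time separates "executing" from "waiting" for a single-threaded program. This Python sketch uses an artificial task, with a sleep standing in for time spent waiting on I/O:

```python
import time

def timed(fn):
    """Return (wall_seconds, cpu_seconds) for a call to fn."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    fn()
    return time.perf_counter() - wall0, time.process_time() - cpu0

def simulated_task():
    time.sleep(0.2)                      # stands in for waiting on I/O
    sum(i * i for i in range(200_000))   # stands in for real computation

wall, cpu = timed(simulated_task)
# CPU time excludes the sleep, so the difference approximates waiting time.
print(f"wall={wall:.3f}s cpu={cpu:.3f}s waiting={wall - cpu:.3f}s")
```

Real tools report the same split at the system level (e.g. user/system/iowait CPU accounting), but the principle is identical: account for where the wall-clock time went before deciding what to optimize.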
For example, if a program ran for 100 seconds, and a total of 1 second each was spent waiting to read input and write output, that indicates 98% of the execution time was spent with the computer executing program instructions, acting upon memory. When considering how to reduce the execution time of the program, it would be of little use to attempt to reduce the amount of input or output data the program requires, since it spends only 2% of the execution time on those parts. Rather, it would be more useful to profile the execution of the program to determine how much time is spent executing the various sections of the program code. For example, upon profiling the program code, one might find significant time being spent iterating over a list while looking for a specific element and subsequently determine that use of a map would dramatically reduce the number of instructions that need to be executed to complete the lookup.
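The list-versus-map improvement described above is easy to demonstrate. The data set and keys below are invented purely for illustration:

```python
import time

def lookup_list(pairs, key):
    # Linear scan: may examine every element before finding the key.
    for k, v in pairs:
        if k == key:
            return v
    return None

def lookup_map(mapping, key):
    # Hash-based lookup: roughly constant work regardless of size.
    return mapping.get(key)

pairs = [(f"key{i}", i) for i in range(100_000)]
mapping = dict(pairs)

t0 = time.perf_counter()
for _ in range(20):
    lookup_list(pairs, "key99999")  # worst case: the last element
t_list = time.perf_counter() - t0

t0 = time.perf_counter()
for _ in range(20):
    lookup_map(mapping, "key99999")
t_map = time.perf_counter() - t0

print(f"list scan: {t_list:.4f}s  map lookup: {t_map:.6f}s")
```

The absolute numbers will vary by machine, but the map lookup wins by orders of magnitude because it avoids re-scanning the collection on every lookup.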
Note that the condition where the program is neither waiting to complete I/O nor being executed can occur when the operating system decides not to schedule the program for execution. For instance, if there are 100 processes/threads that need to be run on the computer, but the computer only has 32 cores, then at any given time there would be 68 processes/threads that are capable of execution but for which there is no core available on which to execute. This is a condition relevant to understanding the performance of an application.
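That arithmetic generalizes into a crude fair-share model: at any instant, runnable threads beyond the core count are waiting, and over time each runnable thread executes for roughly cores/threads of the time. A minimal sketch:

```python
def scheduling_model(runnable_threads, cores):
    """Crude fair-share model of CPU scheduling: how many threads are
    waiting for a core at any instant, and the fraction of time each
    runnable thread actually spends executing."""
    if runnable_threads == 0:
        return 0, 1.0
    waiting = max(0, runnable_threads - cores)
    run_fraction = min(1.0, cores / runnable_threads)
    return waiting, run_fraction

waiting, frac = scheduling_model(runnable_threads=100, cores=32)
print(waiting, f"{frac:.0%}")  # 68 threads waiting; each runs ~32% of the time
```

Real schedulers use priorities, affinities, and time slices, so this is only a first-order estimate, but it explains why a saturated run queue stretches the wall-clock time of every process on the box.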
Generally, a performance problem is first observed at a high level and then isolated down to specific hardware and software.
For example, once some particular data becomes available, a job is started to process that data and generate output that is consumed by another application. Users of that application may notice that the most recent data available was generated a couple of hours ago, while data is generally expected to be updated a couple of times an hour.
Upon being notified of this situation, an administrator may find that an instance of the job that generates the application's data updates has been running for hours, significantly longer than historic trends. The job may be designed to execute hundreds of individual processes spread across the nodes of the distributed computing system, and while most of those processes finished long ago, a few have been running for hours.
By sampling and reviewing stack traces, it's determined the processes are spending most of their time in calls to write data to the distributed computing system. Diagnostics are gathered for those processes to detail the write operations being issued to the distributed computing system. The diagnostics show a clear pattern, where a large percentage of the running time of each process is spent with write operations outstanding to one particular cluster node, and the response times for those operations are hundreds of milliseconds each, whereas similar write operations issued to other cluster nodes complete in less than ten milliseconds.
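Spotting that pattern amounts to aggregating response times per node. In this sketch, the record format, node names, and latency values are all invented for illustration:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical client-side diagnostic records: (cluster_node, write_latency_ms).
write_samples = [
    ("node-a", 4), ("node-b", 6), ("node-c", 310),
    ("node-a", 5), ("node-b", 7), ("node-c", 290),
    ("node-a", 6), ("node-b", 5), ("node-c", 305),
]

latencies = defaultdict(list)
for node, ms in write_samples:
    latencies[node].append(ms)

# Flag nodes whose mean write latency is far above the median node's mean.
means = {node: mean(vals) for node, vals in latencies.items()}
typical = sorted(means.values())[len(means) // 2]
outliers = {n: m for n, m in means.items() if m > 10 * typical}
print(means)
print("outliers:", outliers)
```

In a real system the records would come from logs or a metrics pipeline and the outlier threshold would need tuning, but the shape of the analysis (group by node, compare distributions) is the same.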
The distributed computing system utilizes a process that runs on each node of the cluster; that process is responsible for accepting requests from client processes and using the attached disk storage to service those requests. Diagnostics from that process indicate it is receiving the requests from the clients, performing the disk operations required to process them, and sending responses back to the clients. The response times in the hundreds of milliseconds, as observed on the client side, are corroborated by diagnostics collected from the server side process: the elapsed time from a request arriving to its response being sent matches that observed in the client process diagnostics.
After receiving a client request, the server side process will perform the required disk operations and then respond to the client. Given the timestamps of when the client request was received, when the disk operation(s) were issued, when the disk operation(s) finished, and when the response was sent to the client, we can determine how much time was spent waiting for responses from the disk hardware vs. the time spent in the software. Upon gathering these diagnostics, it is determined that the vast majority of the time spent handling the client requests is during the period from when disk operations were issued to when they were finished. Thus, the long response times are likely due to a hardware problem.
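Given such timestamps, the split between disk wait and software time is simple subtraction. The field names and values here are hypothetical:

```python
# Hypothetical server-side timestamps for one client request, in ms
# since some common epoch; the field names are invented for illustration.
request = {
    "client_request_received": 1000,
    "disk_ops_issued": 1002,
    "disk_ops_finished": 1298,
    "response_sent": 1300,
}

total_ms = request["response_sent"] - request["client_request_received"]
disk_ms = request["disk_ops_finished"] - request["disk_ops_issued"]
software_ms = total_ms - disk_ms

# Here nearly all of the 300 ms was spent waiting on the disk hardware.
print(f"total={total_ms}ms disk={disk_ms}ms software={software_ms}ms")
```

Concurrent or overlapping disk operations complicate the accounting in practice, but the core move, bracketing each layer with timestamps and subtracting, is how blame gets assigned between software and hardware.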
Upon gathering performance metrics for the relevant disk(s), it's observed that utilization is ~100%, indicating the disk is always processing operations, and the average service time for operations served by the disk is in the hundreds of milliseconds range. The specifications for the disk drive, as published by the manufacturer, indicate this average service time is far outside the range within which the disk is designed to operate.
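Utilization and average service time can be derived from cumulative disk counters over a sampling window; this is roughly what tools like iostat compute from /proc/diskstats on Linux. The counter values below are invented:

```python
def disk_stats(ops_completed, busy_ms, interval_ms):
    """Utilization = fraction of the window the disk had work in flight;
    average service time = busy time per completed operation (a rough
    proxy that ignores queueing inside the device)."""
    utilization = busy_ms / interval_ms
    avg_service_ms = busy_ms / ops_completed if ops_completed else 0.0
    return utilization, avg_service_ms

# Invented sample: over a 10-second window the disk was busy 9,950 ms
# yet completed only 40 operations.
util, svc = disk_stats(ops_completed=40, busy_ms=9950, interval_ms=10_000)
print(f"util={util:.1%} avg_service={svc:.2f}ms")  # ~100% busy, ~249 ms/op
```

A healthy disk at 100% utilization would be completing thousands of small operations in that window; near-saturation combined with service times this far above the drive's specifications points at the hardware.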
This example is contrived for the purpose of this article, but similar scenarios commonly manifest in real distributed computing systems. The key to finding that needle in the field of haystacks is isolating the problem down to specific software and hardware. In addition to understanding how to identify and characterize standard hardware problems, we must understand the nuances of software operations that utilize layers of libraries and services. It is often critically important to characterize response times at all appropriate layers and locations in a distributed system.
A problem may first be characterized as calls to a particular software library taking a long time to return. In a highly layered and distributed computing system, the call to that software library may in turn issue a call to a subsequent library, which in turn may issue multiple requests to a variety of remote systems/services, each of which may repeat this process many times. To understand why a call takes a long time to return, we often need to trace that through many layers and systems, using observations of response times to eliminate pieces of software and hardware and thus isolate that which needs further inspection.
I've provided support for distributed computing and storage systems for about the last 14 years, during which I've worked on a huge variety of problems that required many different approaches and knowledge of innumerable technical topics. There were many times when I needed information and couldn't find it. There would be metrics required to characterize a behavior, and I wouldn't be able to find any information on how to gather them. There would be unanswered questions about the design and functionality of bits of software code.
If I was lucky, I'd come across a well-written article by someone who'd had a similar problem before and could clearly explain which metric to monitor, how to gather it, and what the results indicate. When the issues were complex or seemingly obscure, more often than not, I wouldn't find anything helpful.
The purpose of this series of articles is to provide clear, high-quality information on a variety of troubleshooting techniques and metrics that I use to understand the functionality of distributed computing systems. I'm sure much of this information already exists somewhere, but given the challenges I've had finding that information in the past, I'd like to contribute yet another resource that I hope others will be able to find in their time of need.
The first entries in this blog series will focus on understanding the performance of the hardware on a single server. Subsequent articles will build on that foundation, covering the additional considerations, techniques, and topics related to running distributed computing software as layers on top of many servers.