29 min read
Note: This article references commands, behaviors, and outputs generated by Linux-based operating systems, such as CentOS or Ubuntu. Some information will not be relevant to other operating systems, such as Windows.
This is a group of four blog posts that provides an introduction to processor technologies, especially in the context of distributed computing environments. Part one covered instruction set architecture and instruction set representation; part two covered processor design and efficiencies. These blog posts are part of a larger in-depth series called A Study of Performance in Distributed Computing Environments.
With the Linux OS, a process scheduler decides which cores run which processes and the times at which that happens. When a process is assigned to run on a core, the core will either complete execution of the process, at which time the scheduler is notified that the process has relinquished the core, or the scheduler will allow the process to run for a period of time and interrupt it so that another process can be run on the core. When there are no processes that can be assigned to a core, the core is considered idle. Thus, the Linux process scheduler can track the amount of time that each core had a process assigned to it and the amount of time each core was idle.
The amount of time the cores were running processes or were idle is tracked in units called clock ticks. This unfortunate naming convention results in confusion with regards to what clock is being referenced. Up to this point in this article, I've referred to clock rate and clock cycles in relation to the processor clock. However, for a piece of software like the Linux kernel, it is impractical to use the fine granularity of the processor clock to track the time each CPU core was in use or idle. Instead, the Linux kernel tracks CPU usage in clock ticks that represent fractions of a second, for instance, 1/1000th of a second.
A common technique for monitoring CPU usage in Linux is through the
mpstat utilities. These utilities function by sampling the number of clock ticks spent by each core in each of the states that Linux tracks, such as time spent running userspace code, kernel space code, waiting for I/O to complete, idle, etc. These counter values are sampled periodically, and the amounts by which the counters change over time can be used to calculate the percentage of time spent in various states.
Unfortunately, these utilities are rudimentary. I don't mean that as an insult to their design or the people who wrote them. I mean that the details they report are not sufficient to understand a number of important aspects of processor functionality. The utilities can track the amount of time a processor core was assigned to execute a process, but it doesn't indicate the amount of time the various components of the core were busy; it doesn't indicate the amount of time the core was stalled while waiting for data to arrive from main memory, etc. The utilities do not (and can not) have an understanding of how simultaneous multithreading is affecting performance. There are numerous aspects of processor functionality that are impractical or impossible to track using software that runs on top of the processor.
To get these types of details, the processor itself has to track much of this information and provide an extended instruction set to expose this information to the operating system. In fact, this functionality already exists for some processors. A number of Intel processors support the Process Counter Monitor (https://github.com/opcm/pcm) to expose relevant details to the OS and in turn to users of such servers, and similar data is available when using various AMD processors through μProf (https://developer.amd.com/amd-uprof/).
Monitoring the percentage of time that cores are assigned processes to run is still useful. For instance, we can conclude that we can not ask the processors to do any additional work during a period of time where all the cores on a system have processes assigned to run on them. We may not know how efficiently each core is being used, but they are in fact all being used and hence can't be asked to do anything else during that time. That in and of itself is sufficient to conclude that the system has no idle cores, and hence even if there was a very efficient program that could be run, there is no core on which to run it. We can determine that the number of cores on the system vs. the number of processes represents a bottleneck.
However, in order to resolve such a bottleneck, additional details furnished directly by the processors may be necessary.
In common computer architecture, RAM chips (i.e., main memory) are connected to a memory bus, which in turn connects to a northbridge, which in turn connects to the front-side bus, which in turn connects to the CPU sockets. Almost universally, an additional cache exists inside the CPU with some of that cache space dedicated to storing program instructions and some dedicated to storing data upon which the instructions act. In CPUs with multiple processing cores, an additional cache shared amongst all cores is commonly used. Further, additional caches may exist at the per-core or whole CPU level.
For instance, the Intel Xeon E5-2650 v2 CPU used for the testing for this article is an 8-core chip that supports Intel's Hyper-Threading feature, which presents the OS with 16 cores per chip. Each of the 8 physical cores includes 32 KB of layer 1 instruction cache (L1i) and three 2 KB of layer 1 data cache (L1d) and 256 KB of layer 2 cache (L2). The chip as a whole includes 20 MB of layer 3 cache (L3) that's shared among all 8 cores. Note that these cache sizes can be obtained from a Linux system using a command like
When executing an instruction for which the content is present in the L1 cache, the core will require 4-5 cycles to retrieve the data. If the content is not present in L1 cache but is present in L2, the core will require 12 cycles to retrieve it. Similarly, it will require ~30 cycles to retrieve data from the L3 cache, and well over 100 cycles to retrieve data from main memory. Note that these specifications are dependent upon the processor design and are generally provided by the manufacturer.
These cache and memory access times are extremely important because a processor core will essentially be idle while waiting for data to arrive. Of course, going back to simultaneous multithreading (e.g., Hyper-Threading), when a cache miss occurs, if processes are assigned to execute on both logical cores, the underlying physical core may preempt execution of the current instruction and switch to executing the other program. Of course, the core will still be idle if the other program also encounters a cache miss or if the other logical core doesn't have a process assigned to it.
The reason for highlighting these cache and memory access times is to illustrate the difference between a true CPU bottleneck and a cache/memory bottleneck. It is entirely possible to have processes scheduled to every core, and for those cores to be idle for the vast majority of the time while waiting for data to arrive from main memory. Thus, while utilities like
iostat will show high CPU utilization, the true bottleneck of the system is cache and memory access times and bus speeds to transport data from one part of the system to another. To understand the nature of that type of bottleneck requires the use of tools like the Process Counter Monitor or μProf.
The cacheEffects.c sample program attached to this article is a very simple test that demonstrates the effects of memory caching in the processor on program performance. First, the program allocates memory to hold 256 M integers of 4 bytes each, which corresponds to a contiguous 1 GB chunk of memory.
The remainder of the program consists of two loops that access each position in the allocated array. The first loop accesses each position in the array sequentially and increments the value stored at that memory location. The second loop, "line skipping," accesses every 16th position in the array, starting from position 0 of the array, and increments the value stored at that memory location, then repeats this while starting from position 1, then from position 2, etc., until all positions in the array have been accessed.
Here is an example of the output the program produces:
Sequential: 5626089 us, Line skipping: 17117442 us
Despite accessing the same number of memory locations, the loop that increments by 16 integer positions at a time takes more than three times as long to complete.
Recall that the server used to generate the test results in this article uses a particular Intel Xeon processor that has 3 levels of cache built into the processor. All 3 cache layers will cache memory in units referred to as lines, with each line representing 64 bytes of memory.
When the first loop runs, it accesses the integers in the array in sequential order. The very first position in the array that the loop attempts to access will result in a cache miss in the processor, and thus the processor will retrieve a line's worth of content from main memory. The 64 bytes in the retrieved line is sufficient to store 16x integers of 4 bytes each. Thus, each of the 15 subsequent loop iterations will access a position in the array that is represented by the line that is in the processor cache.
When the second loop runs, each iteration accesses a position in the array that is 16 positions greater than the previous iteration. Thus, the memory location being accessed in each iteration is 64 bytes greater than the previous iteration. This causes each iteration to need to fetch content from main memory. In other words, every iteration of the second loop triggers a cache miss and a line fetch from main memory.
Both loops perform the same number of mathematical calculations; the second loop takes significantly longer to do so. The
mpstat programs will show the core assigned to run this program as 100% utilized for the duration of both loops, despite the core spending most cycles idle as the second loop executes. In this case, the simple metric of CPU utilization is insufficient to describe the bottleneck at hand.
The processor core will have few idle cycles while executing the first loop, and many idle cycles while executing the second loop. While that first loop can be aptly described as a CPU bottleneck, the second loop is better described as a memory bottleneck caused by cache churn, wherein the processor needs to retrieve a line from memory and place it in the processor cache, the line is later evicted without any further access, and then some time after it is pulled back into the cache again, and this process repeats.
The test program is a contrived example intended to demonstrate the worst-case scenario, and it shows a 3x degradation. Most applications won't be structured quite so haphazardly. However, if you consider the execution time of a CPU-intensive application to be too long, it can be worthwhile to profile the application execution and determine if optimizations can be made before addressing the problem by throwing more hardware at it.
On a final note, I mentioned the CPU used for testing in this article would take in excess of 100 cycles to retrieve content from main memory while a L1 cache hit would use 4-5 cycles. The first loop in the test program should have at least 15 L1 cache hits for every instance where content needs to be retrieved from main memory, whereas the second loop should never have any L1 cache hits, and each loop iteration should require a fetch from main memory.
The second loop only takes ~3x as long to complete, rather than 10x or more. This discrepancy can be largely accounted for by the cache prefetching function of the processor used for testing. In the case of cache misses, the processor is retrieving more than just the single line related to the cache miss.
We often use the terms process, thread, and light-weight process to describe things that can be run on a CPU core. While discussing CPU bottlenecks in the following sections, I will solely use the term process to refer to things that can be assigned to execute on a CPU core.
Processes are assigned states. Processes start in the running state, indicating to the operating system's process scheduler that the process can be assigned to a core for execution. Depending upon the instructions the process executes, it will either remain in the running state or transition to a state where it will not be scheduled to a core for execution until an event happens at a later time. Examples of states where a process will not be scheduled include sleeping, waiting, stopped, and various others.
As long as a process remains in the running state, the process scheduler will consider scheduling the process on a core for execution. For a given set of CPU cores and the set of processes that are allowed to be scheduled on those cores, we can consider a CPU bottleneck to be present while the number of processes in the running state exceeds the number of cores.
I'm intentionally using that type of awkward wording because, as mentioned previously, a system can be configured to restrict the processes allowed to execute on specific cores. Thus, a system may have 32 cores and 32 processes in the running state, but those processes may be restricted such that they can only run on core 0 through core 15, in which case only 16 cores are available to run 32 processes. Progress of the processes will be gated by a CPU bottleneck since some processes will be in a state where they could be executed but will not be, as there is no core available on which to execute.
In the context of a single process, CPU starvation occurs whenever the process is in the running state but is not assigned to execute on a core. The degree of starvation can be described as the amount of time a process is in the running state vs. the amount of time the process was actually executing on a core. For instance, let's consider that over the course of one second, a process remained in the running state the entire time, yet the process was only scheduled to execute on a core for 0.2 seconds during that period, with 0.8 second spent waiting for a core to be assigned to execute it. In this case, we can describe the process as experiencing a CPU starvation factor of 4 (i.e., 0.8 seconds spent waiting in running state is 4x the actual CPU time consumed).
NOTE: I'm unaware of any standard term used to describe the magnitude at which a process experiences CPU starvation, and so I've defined it here as the elapsed time in the running state but not scheduled on a core divided by the elapsed time assigned to a core, and I've named this "CPU starvation factor." Please let me know in the comments if there is an industry standard definition to describe CPU starvation experienced by a process.
CPU bottleneck and CPU starvation occur at the same time, with the distinction being that a CPU bottleneck happens at a system level, while CPU starvation occurs at a process level. Whether or not a CPU bottleneck or CPU starvation represents a problem is entirely dependent upon use case. We'll put some consideration into this aspect later in this article.
Let's consider a process with a design such that it maintains a queue that accepts events and processes these events serially, and processing events is a matter of CPU usage; no I/O is required to process the events. The events arrive at the queue at a rate of 1000 per second. In a given second, if 1000 events arrive, but only 900 events are processed, the backlog of events will grow by 100. In order to clear the backlog, 1100 events will need to be processed in the next second. The backlog grows whenever the rate that events processed falls below the rate of events arriving to the queue.
When no CPU bottleneck occurs, the process is able to complete processing of events as they arrive, with no events remaining in the queue for longer than a second. Once a CPU bottleneck occurs, the ability of the process to complete its work without generating a backlog depends upon the degree of starvation it encounters.
The starvationEffects.c program, attached to this article, provides a logical representation of this type of architecture. The program accepts two arguments. First, specify the target number of events per second the program should process. Second, specify a cost associated with processing each event, which controls the amount of CPU cycles required to process each event. We will use this program to demonstrate the effects of CPU starvation.
On my otherwise idle test system, running the program like
./starvationEffects 1000 100000 yields output like the following:
1000 events processed in 1.00 seconds, backlog: 0 events, time runnable: 31.42% 1000 events processed in 1.00 seconds, backlog: 0 events, time runnable: 31.31% 1000 events processed in 1.00 seconds, backlog: 0 events, time runnable: 31.25%
The program tracks the amount of time it spends in the POSIX nanosleep(2) function and subtracts that from the total elapsed time in order to determine the percentage of time it spent in the running state. In this case, the program runs for just a bit over 31% of the time, handling 1000 events of cost, 100,000 each.
As mentioned earlier in this article, the kernel tracks the running time of processes and how time was spent by each core, such as being idle or running user space code. The
mpstat common programs monitor CPU consumption by processes and core utilization. You may be asking, why did I have this test program track its sleeping/running time if the kernel already tracks this information? You can read a description of a metric and think you understand it, but that understanding is based on assumptions until you've tested it.
Let's consider for a moment: what does the
%CPU column in
top output indicate about a process' CPU consumption? What do the various columns of output from
mpstat indicate about how a core spent its time?
Here is an excerpt of
top output coinciding with the same period of time as the output of the
starvationEffects test program:
And here is
mpstat output for core 31, which was isolated from running any process other than this test process:
The test program's self-reported running time of ~31.3% and the ~32.3% reported by
top differ from each other by ~3%. That's not bad, but it is not particularly precise. Of course, that discrepancy pales in comparison to the user space utilization of core 31 reported by
mpstat, which is ~90% higher than what the test program self-reports.
The process level accounting performed by the Linux kernel does not have the same level of precision nor same type of design as the test program. That test program can sample the clock immediately before and after sleeping, whereas the kernel has to track utilization in units of clock ticks and does so by periodically looking at what each process is doing and what each core is doing. The kernel just doesn't have the same level of precision nor does it account in the same way as the test program.
But why is the kernel indicating core 31 is almost twice as busy as it actually is? This is an artifact of the
CONFIG_HZ kernel parameter being set to 1000 on my test system. Because the test program was sleeping 1000 times a second at regular intervals, and the kernel does performance accounting at the same multiple, a very skewed view of utilization occurs. In fact, if the test program is started at the right time, the kernel shows 0% CPU utilization on core 31 because every time it looks at how the core is being used, the core is actually idle, as the time coincides with when the test program is sleeping.
What's the takeaway? I'm not implying
mpstat or other similar utilities are doing anything wrong, nor am I implying the utilization accounting in the Linux kernel is broken. Precision has a cost, and the cost of high precision accounting is often unpalatable. While output from performance monitoring utilities has to be taken with a grain of salt, we can still reach useful inferences from it.
If you want to see the utilization reported by the test program align better with the utilization shown by the kernel for the process and the core, run the test program with a number of events set to a number that is not cleanly divisible by
CONFIG_HZ value. For instance, with
CONFIG_HZ=1000, you could use 1111 events per second for the test program, in which case my test system generates output like:
1111 events processed in 1.00 seconds, backlog: 0 events, time runnable: 31.04% 1112 events processed in 1.00 seconds, backlog: 0 events, time runnable: 30.76% 1111 events processed in 1.00 seconds, backlog: 0 events, time runnable: 31.09%
In this case, the
mpstat utilities show similar CPU usage to what is self-reported by the test program.
Back to the topic of CPU contention and starvation, when multiple processes need to run on the same core, the process scheduler will assign one process to run on the core, then after a period of time, it will halt execution of that process and assign another process. The scheduler decides the amount of time each process is allowed to run based on the per-process priority value, with lower numeric values receiving more time than those with higher values. The full details of process scheduling are complex and need to be discussed in a larger context.
Since the test program needs to run about ~31% of the time in order to handle 1111 events per second, if it needs to share a core with 3 equal weighted processes, then it should still be able to keep up, since the scheduler would allow each process to use ~1/3rd of the running time. However, if we have 4 equal weighted processes competing for running time on the core, then each will receive ~25% of the running time and the test program will generate a backlog every second.
For testing purposes, you can use the
testLongDivision program, referenced earlier in this article, to generate CPU load. I started 3 instances of
testLongDivision and 1 instance of
starvationEffects and assigned all to run on the isolated core 31. The
top utility reports each of the 4 processes consume 25% CPU, while the
starvationEffects program generates output like the following:
979 events processed in 1.02 seconds, backlog: 340 events, time runnable: 100.00% 996 events processed in 1.00 seconds, backlog: 473 events, time runnable: 100.00% 989 events processed in 1.01 seconds, backlog: 602 events, time runnable: 100.00%
Note how the test program is runnable for 100% of the elapsed time but is only scheduled to run on core 31 for 25% of the time. This is an indication of CPU starvation for the test process. That process has work to do, so it never sleeps; it stays in the running state for 100% of the time and can not catch up and clear the backlog.
For servers that run mixed-use workloads, some processes may be more important than others. In such a scenario, we could adjust the priority of the
starvationEffects program such that it is numerically lower than the
testLongDivision program instances. This will instruct the process scheduler to give
starvationEffects more running time.
After running the following command,
renice -n -10 -p `pidof starvationEffects`
The test program was scheduled to execute ~75% of the time and generated output like the following:
2932 events processed in 1.00 seconds, backlog: 6277 events, time runnable: 100.00% 2930 events processed in 1.00 seconds, backlog: 4460 events, time runnable: 100.00% 2938 events processed in 1.00 seconds, backlog: 2629 events, time runnable: 100.00%
After clearing its backlog, the test program was scheduled to execute ~30% of the time and maintained a backlog of 0. By lowering the numeric priority of the test process, the scheduler gave it the running time it needed to keep up with its workload, at the expense of the
testLongDivision program instances, which received less running time.
My next blog post will cover real-time vs. batch-oriented processing.
These blog posts are part of a larger in-depth series called A Study of Performance in Distributed Computing Environments.
Stay ahead of the bleeding edge...get the best of Big Data in your inbox.