Note: This article references commands, behaviors, and outputs generated by Linux-based operating systems, such as CentOS or Ubuntu. Some information will not be relevant to other operating systems, such as Windows.
This is the second post in a group of four that provides an introduction to processor technologies, especially in the context of distributed computing environments. Part one covered instruction set architecture and instruction set representation. This blog post is part of a larger in-depth series called A Study of Performance in Distributed Computing Environments.
The instruction set architecture defines what a computer can do, but says nothing about how it will do it. In fact, the details of how a processor is implemented are often well-guarded trade secrets. You might be able to find a few patents from AMD or Intel, but whether they are used in any given processor, or where and how, isn't going to be spelled out in those documents.
Consider our simple adding example, for which we generated assembly that asks the processor to execute the 0x48 0x01 0xd0 instruction. While this type of instruction is virtually universal among processors, the exact circuitry used to carry it out can vary from one processor to another.
There are various circuitry designs for digital adders, including ripple-carry, carry-lookahead, carry-save, dual-rail domino, etc. It's also not simple to evaluate the merit of each circuit design in the context of a processor. Each design requires different components, which will take different amounts of space on a chip, consume different amounts of power, and generate different amounts of heat. And a processor will likely have many adders for different purposes. One adder might be used to handle the actual addq instruction, while instructions that pertain to multiplication may have an entire dedicated set of adders.
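To make the tradeoffs concrete, the logic of a ripple-carry adder can be modeled in C. This is an illustrative software model of the hardware behavior, not code from any real processor: each simulated full-adder stage consumes one bit from each operand plus the carry from the previous stage.

```c
#include <stdint.h>

/* Software model of a 64-bit ripple-carry adder. In hardware, each of
 * the 64 full-adder stages is a small circuit, and the carry must
 * "ripple" through all of them before the final sum bit is valid. */
uint64_t ripple_carry_add(uint64_t a, uint64_t b) {
    uint64_t sum = 0, carry = 0;
    for (int i = 0; i < 64; i++) {
        uint64_t abit = (a >> i) & 1;
        uint64_t bbit = (b >> i) & 1;
        sum |= (abit ^ bbit ^ carry) << i;               /* sum bit for stage i */
        carry = (abit & bbit) | (carry & (abit ^ bbit)); /* carry into stage i+1 */
    }
    return sum;  /* the final carry out is discarded, matching addq's wraparound */
}
```

The serial carry chain is exactly why designs like carry-lookahead exist: they compute carries in parallel at the cost of more gates, more chip area, and more power.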
Since we don't know what circuits are used in a processor, we generally can't determine exactly how long a processor will take to complete a given instruction. Many processor manufacturers provide details about how long each instruction takes to complete, with some instructions taking a constant amount of time and others a variable amount.
On the subject of time, when talking about processors, two time scales are used. First, and most obviously, we can very generally measure time with processors in the same way we measure time with everything else. Without getting too philosophical, let's just say: that version of time is the one put forth by Einstein's Annus Mirabilis papers.
Second, we can measure time, specifically in relation to computer processors, in units of clock cycles. In relation to a computer processor, a clock cycle is the period it takes for the processor to perform a single unit of work. A large amount could be written on this topic. There is quite complex physics that becomes highly relevant when designing circuitry that is just a few dozen atoms wide and is meant to carry signals that can change at a rate of billions of times per second.
For the purpose of this article, let's just say that computer processors have a signal that oscillates and that we refer to as the processor clock. The oscillation of that signal triggers the processor to begin a new cycle of computation. To relate the cycle of the processor clock to our conventional usage of time, we commonly operate the processor clocks in the gigahertz frequency range (i.e., billions of cycles per second). The design of a processor and clock is such that the processor will be able to complete one unit of work in each clock cycle. Manufacturers of computer processors provide specifications for the clock frequency at which their processors will operate reliably. We can adjust the frequency of the clock outside of those specifications, and the processor will be able to complete more or fewer units of work per second. If we raise the clock frequency sufficiently beyond the design specifications, the processor will cease to run reliably and may produce inaccurate computation results or may undergo thermal damage.
It is important to understand what constitutes a single unit of work for a processor. The instruction set architecture defines what a computer/processor can do, but not how it does it. The clock cycle represents the smallest unit of work a processor does, but that doesn't imply that unit is an instruction. In fact, many instructions require multiple clock cycles to complete. The number of clock cycles required to complete an instruction is a result of the processor design.
For example, if a processor was designed to handle the addq instruction using a ripple-carry adder circuit, then the addition will always complete in a single clock cycle. However, algorithms and associated circuitry for performing division involve an iterative process, and that circuitry can not produce an accurate result in a single clock cycle. Further, processors are generally designed such that an instruction is read in during one cycle; decoding the instruction may take another cycle; the actual work to perform the instruction, such as having the division circuitry calculate the quotient, may take one or more cycles; and in a final cycle the result is stored somewhere. In all, four or more clock cycles may elapse from when a processor begins processing an instruction to when it finishes. This is the case even when the instruction is a simple addition for which a ripple-carry adder could complete the calculation in a single clock cycle, because it takes time to determine which circuitry is required to handle the instruction, get the inputs into the right places, and store the output in the right place. The number of clock cycles required to complete an instruction is commonly referred to as instruction latency or cycles per instruction.
Let's consider a processor designed such that a 64-bit integer addition instruction takes four clock cycles to complete, starting from the instruction being read by the CPU to the result being stored in the appropriate location. If the processor's clock operates at 1000 cycles per second (i.e., 1 kHz), we can deduce the processor will complete 250 addition instructions per second.
However, processors are commonly designed to perform pipelining of instructions, which is to say: if the processor needs four clock cycles to complete an addition, it does not follow that all the circuits in the processor are busy for all four of those clock cycles. In fact, modern processors are designed such that the circuits needed for each of those four clock cycles are independent. For example, in clock cycle 1, instruction 1 may be read into the CPU. In clock cycle 2, instruction 1 may undergo a decoding process while instruction 2 is read into the CPU. In clock cycle 3, instruction 1 may reach the adder circuitry, instruction 2 may enter the decoding process, and instruction 3 may be read into the CPU, and so on. Thus, an instruction may complete every clock cycle despite each individual instruction taking four cycles from start to finish.
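The throughput difference can be expressed as simple arithmetic. The sketch below assumes a hypothetical 4-stage pipeline (read, decode, execute, store); the stage count is illustrative, not that of any particular processor.

```c
#include <stdint.h>

/* Without pipelining, each instruction occupies the core for all of its
 * stages before the next one starts. */
uint64_t cycles_unpipelined(uint64_t n_instructions, uint64_t stages) {
    return n_instructions * stages;
}

/* With pipelining, the first instruction takes `stages` cycles to pass
 * through, and then one instruction completes every cycle after that. */
uint64_t cycles_pipelined(uint64_t n_instructions, uint64_t stages) {
    if (n_instructions == 0)
        return 0;
    return stages + (n_instructions - 1);
}
```

For 1,000,000 instructions on a 4-stage pipeline, the unpipelined design needs 4,000,000 cycles while the pipelined design needs 1,000,003, approaching the ideal of one completed instruction per cycle.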
Pipelining can greatly increase the rate at which a processor completes instructions. However, pipelining is not always possible. In some cases, multiple instructions at various stages in the CPU pipeline might require the same resource. For instance, a particular CPU register may hold the output of one instruction in the pipeline while also serving as the input of another instruction in the pipeline. The behavior of the processor must be deterministic, and hence the instruction pipeline may be stalled so that one instruction can complete using the register before a subsequent instruction begins to use that same register.
Pipelining is not the only type of optimization that processors perform. While the programs we write may appear as a series of sequential instructions with strict control flow, this does not imply that processors need to execute those instructions in the same sequence, nor that they will execute only the instructions the programmer intended to be executed.
Modern processors frequently implement a mechanism that allows instructions to be executed in a different order than they were written in the program code. For instance, consider an instruction that places its result in register rax, a second instruction that uses the result in register rax as input, and a third instruction that does not use the register rax at all. If these instructions were executed in the order in which they were written, then due to the register dependency, the second instruction would stall in the CPU until the first instruction completed, and hence the third instruction would also stall until the second was allowed to progress. This is a waste of cycles.
With out-of-order execution, the processor can notice that the second instruction will stall until the first has completed, and hence it will opt to introduce the third instruction into the pipeline immediately after the first, thus allowing the third instruction to complete before the second instruction.
Another common programming construct is a control flow branch such as an if-then-else statement. Until the if condition is evaluated, the processor can not determine whether it needs to execute the instructions in the then branch or the else branch. However, doing nothing while waiting for the if condition to be evaluated is a waste of cycles. Thus, many modern processors will begin to execute instructions from one of the resultant branches before the if condition has been evaluated. In the worst case, the processor guesses wrong, and the additional work it completed while the if condition was being evaluated must be discarded. This technique is referred to as speculative execution, as the processor is speculating on which instructions it should execute.
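A classic way to observe this behavior from C is to time a loop with a data-dependent branch. The function below is a sketch of that experiment, not code from any particular benchmark suite: on random data the processor's branch predictor guesses wrong roughly half the time and the speculated work is discarded, while on sorted data it is almost always right.

```c
#include <stddef.h>

/* Sum only the elements >= 128. The `if` is the interesting part: the
 * processor speculates on whether the branch will be taken before the
 * comparison completes. Timing this over a large random array, then
 * sorting the array and timing it again, typically shows the sorted
 * pass running several times faster. */
long sum_above_threshold(const int *data, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] >= 128)   /* unpredictable on random data */
            sum += data[i];
    }
    return sum;
}
```

Note that an optimizing compiler may replace the branch with a conditional move or vectorize the loop, which removes the effect entirely, so this demonstration is best compiled without aggressive optimization.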
Add in simultaneous multithreading, memory cache design aspects, and various other optimizations, and it is extremely difficult to determine a precise number of clock cycles required to complete a given series of instructions.
Note that up until this point, I have very generally used the term processor to refer to the component that executes instructions. This definition is not sufficient for the subsequent discussions in this article. A single modern processor will commonly include multiple processing cores, with each core capable of executing different instructions concurrently and independently from the other cores. Aside from cores, a processor will commonly contain hardware dedicated to handling certain types of I/O and caching memory (in addition to the memory cache directly built into each core).
When discussing things like the number of cycles required to execute an instruction, the discussion is in reference to a single core of a processor.
While we generally can't calculate precisely how many clock cycles it will take to execute a series of instructions, we can usually get an estimate using empirical evidence from running the instructions. A simple way of estimating the required clock cycles is to measure the elapsed time as the instructions execute and multiply that by the clock rate of the core that performed the execution. There are some caveats to this approach.
Operating systems generally allow users to start an arbitrary number of processes. The process scheduling mechanism in the OS will make decisions about which process runs on which core and for how long. Thus, with default behavior, it's possible that the series of instructions we are trying to time will be run on different cores over the course of execution. It is also possible that the OS will stop running the instructions we wish to time, schedule something else to run on the core for a period of time, then schedule our instructions to run again. In such a case, the elapsed time we measure does not accurately represent the time it takes to run the instruction set since it includes time spent performing unrelated work.
Many modern processors support dynamic scaling of clock rate and voltage. Power consumption of a processor core is directly related to the clock rate and voltage at which it operates. While processors are advertised with particular maximum clock rates, running a core at that clock rate when it is idle much of the time can be considered a waste of power. We can determine the elapsed time from when a set of instructions starts to when it finishes; however, on systems that support dynamic scaling, the clock rate of the core(s) that ran the instructions may have varied during that time. Since the clock rate varies, we can not determine how many clock cycles occurred over the elapsed time.
Many modern processors also support simultaneous multithreading (SMT). Intel refers to this feature as Hyper-Threading. When enabled, a single physical processor core is exposed to the operating system as two distinct, logical cores. Operating systems can only request a core to run one process at any given time. However, the process may need to perform various instructions that can not be immediately executed by the core. For instance, a process may need to access values stored in the main memory of the computer. Transferring bytes from main memory to a processor core takes a significant number of clock cycles. Hence, a core that needs to read from main memory can spend a significant amount of time idle, waiting for the data to arrive. By presenting each physical core as two logical cores to the operating system, the operating system can provide each physical core with two distinct processes to run. When one process is stalled waiting for data to arrive from main memory, the core can execute instructions from the second process, thus keeping the core busy in a situation where it would have otherwise been idle.
While SMT is useful for reducing idle time for cores, it also makes it more challenging to determine how many clock cycles are needed to complete a set of instructions. Just like the process scheduler in the operating system can switch between running the instructions we wish to time and an unrelated workload, the physical core will switch between executing the programs assigned to each of the logical cores it presented to the OS. Hence, we can not determine how many of the elapsed cycles were used to execute the instructions we wish to time vs. the instructions for the process assigned to the other logical core sharing the physical core.
For scenarios where we really want to measure the clock cycles required to execute a series of instructions, various steps can be taken to compensate for or disable these functions. In most cases, SMT can be disabled through a BIOS setting. Dynamic scaling can usually be disabled through a BIOS setting or through the OS. We can use an otherwise idle system to avoid resource conflicts introduced by running multiple processes concurrently. We can specifically restrict the cores on which processes are allowed to run.
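On Linux, restricting a process to one core can be done from inside the program with sched_setaffinity (or externally with the taskset command). A minimal sketch, with the helper name being mine:

```c
#define _GNU_SOURCE   /* exposes cpu_set_t macros and sched_setaffinity on glibc */
#include <sched.h>

/* Pin the calling process to a single core so the scheduler cannot
 * migrate it mid-measurement. Returns 0 on success, -1 on failure. */
int pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof set, &set);   /* pid 0 = this process */
}
```

Equivalently, `taskset -c 2 ./program` pins from the shell; pairing either approach with the isolcpus kernel parameter keeps unrelated work off that core entirely.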
Even after all these tweaks, there are still things that will affect our ability to measure the clock cycles required to run specific instructions. For instance, a hardware interrupt signal may be delivered to the core on which our instructions are running, or the scheduler code may need to run, even if solely to confirm that no process other than the one we are trying to time should be run. In most cases, on our otherwise idle system, these additional sources of interruption will have only a small impact.
To get the most accurate estimates for the number of clock cycles required to execute a series of instructions, you would need to address all of the following and likely more:
- Restrict the process scheduler so the instructions run on a single, otherwise idle core
- Disable dynamic scaling of clock rate and voltage, or fix the clock at a known rate
- Disable simultaneous multithreading
- Account for hardware interrupts delivered to the core
- Account for time spent running scheduler and other kernel code
In most cases, however, we can get a reasonably close estimate if we ensure our system is otherwise idle and the program we wish to profile is run for a sufficient period of time, even when we don't go through all the steps listed above.
Most modern processors can complete operations like addition and subtraction of integers in a single clock cycle. More specifically, it may take 4 or 5 clock cycles to execute a single integer addition or subtraction instruction, but with pipelining, a single core can complete one such instruction each clock cycle because the actual step of performing the mathematical calculation takes a single clock cycle.
On the other end of the spectrum, operations like division of integers or longs require an iterative approach to execute and the math takes numerous clock cycles. As a general rule, a single CPU core will not be able to complete a long division in a single clock cycle. However, the number of clock cycles required to complete an instruction like division of long values is highly dependent upon processor architecture and can vary significantly between processor designs.
In fact, in the late 1980s and through the 90s, computers were heavily marketed to consumers based on the clock rate of their processors, and yet that had little to do with the amount of useful work those processors could complete in a given period of time. This distortion, often referred to as the "megahertz myth," was often used by Intel to give the impression of technical superiority over AMD and PowerPC processors.
The real metric we want is something along the lines of how many 64-bit division operations can be performed per second. Let's consider a processor core that can complete such an operation in 10 clock cycles at a clock rate of 3 GHz (i.e., 3 * 10^9), yielding 300 million (i.e., 3 * 10^8) operations per second. A different processor core may be able to complete such an operation in 20 clock cycles at a clock rate of 4 GHz, yielding 200 million operations per second. Thus, while the second processor has a clock rate 33% greater than the first, the first processor completes 50% more 64-bit division operations in the same period of time. Clock speed alone tells us very little about processor functionality.
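The comparison reduces to one line of arithmetic, worth writing down explicitly:

```c
#include <stdint.h>

/* Throughput for a fully serialized operation: clock rate divided by
 * the cycles each operation occupies. (Pipelining, discussed earlier,
 * can raise this; dependencies between operations can lower it.) */
uint64_t ops_per_second(uint64_t clock_hz, uint64_t cycles_per_op) {
    return clock_hz / cycles_per_op;
}
```

Here ops_per_second(3000000000, 10) yields 300 million while ops_per_second(4000000000, 20) yields 200 million, reproducing the comparison above.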
The testLongDivision.c sample program attached to this article provides a simple method for estimating the time it takes to complete a 64-bit signed integer (i.e., long) division. There are a few key aspects to this program which need to be understood.
First, the program generates random values to use for the division. This is important as processor designs will often require varying numbers of cycles to complete division instructions based on the divisor and the dividend. Processors implement division using an iterative approach, with each iteration requiring a clock cycle, and the division being considered complete once certain conditions are met. The number of iterations it takes to reach a state where the required conditions are met will vary based on the divisor and dividend.
Second, the program generates a set of random longs of a particular size. The set size used in the sample program is 10,000. The size of a long in the system used for testing is 64 bits (i.e., 8 bytes), and hence the random set consumes approximately 80 KB. This size is important as we want the entire random set to fit in the L1 cache of the core that runs the program. If the set size exceeds that which can be stored in the L1 cache, then the test program will be at least partially measuring the speed of one or more memory busses.
Note: CPU cache will be discussed in further detail later in this article.
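A sketch of how such a random set might be generated follows. The attached program's actual code may differ; the function name and construction here are mine.

```c
#include <stdlib.h>

enum { SET_SIZE = 10000 };   /* 10,000 longs * 8 bytes = ~80 KB */

/* Build an array of random non-zero longs to serve as dividends and
 * divisors. Keeping the whole set small enough to remain resident in
 * the core's L1 data cache ensures the test times the divider rather
 * than the memory hierarchy. */
long *make_random_set(size_t n) {
    long *set = malloc(n * sizeof *set);
    if (set == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++) {
        long v = ((long)rand() << 32) | (long)rand();
        set[i] = (v != 0) ? v : 1;   /* a zero divisor would fault */
    }
    return set;
}
```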
Third, the program must perform a large enough number of operations to amortize the cost of setting up the test, such as taking timestamps, making calls into the test functions, and getting relevant data into the CPU cache. The sample program is coded to perform 1 billion operations, which requires approximately 1 minute to complete a single iteration of the test on the system I used for this article. You can adjust this, but you should ensure the program runs for at least multiple seconds; a minute would be a good starting point.
Lastly, the program times execution of a baseline function and compares that to the execution time of two variations on that baseline function. The longDivisionBase function performs iterations of this line of C code (all values are long ints):
v = v1 / v2;
Compilation of that statement generates the assembly code:
movq -16(%rbp), %rax
cqto
idivq -24(%rbp)
movq %rax, -32(%rbp)
The first two instructions place one of the long values in the rax register and convert it to the appropriate format needed to perform signed division. The idivq instruction performs the division operation, and the final movq instruction saves the quotient of the division to a location on the stack.
The longDivisionV1 function differs from the base function by executing the C code:
v = v1 / v2 / v2;
Which corresponds to the assembly code (the second cqto and idivq are the changes from the base function):
movq -16(%rbp), %rax
cqto
idivq -24(%rbp)
cqto
idivq -24(%rbp)
movq %rax, -32(%rbp)
The changes are such that the quotient of the first division, stored in register rax, is again converted to the appropriate format needed to perform signed division by the cqto instruction. Then, the idivq instruction is executed a second time before the final movq instruction saves the quotient to the appropriate location on the stack.
The longDivisionV2 function uses the C code:
v = v1 / v2; v = v1 / v2;
Which corresponds to the assembly code:
movq -16(%rbp), %rax
cqto
idivq -24(%rbp)
movq %rax, -32(%rbp)
movq -16(%rbp), %rax
cqto
idivq -24(%rbp)
movq %rax, -32(%rbp)
As you can see, we repeated the same line of C code, and this effectively repeats the same 4 assembly instructions as seen with the base function.
Essentially, the V1 function differs from the baseline by performing an extra division, and the V2 function differs by performing an extra division and an extra assignment. Thus, the base function uses 4 instructions, the V1 function uses 6 instructions, and the V2 function uses 8 instructions.
It seems reasonable to expect that the baseline function completes in the least time, followed by the V1 function, followed by the V2 function. As we are about to see, that expectation does not hold, and this is precisely why we use empirical measurements to understand processor performance. It is easy to make incorrect assumptions without detailed schematics of processor circuitry (and a thorough understanding of computer and electrical engineering).
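To make the discussion concrete, here is a sketch of what the three timed functions could look like. The attached testLongDivision.c is the authoritative version; the loop structure and operand selection below are my reconstruction. Note that such a program must be compiled without optimization (the stack-relative addressing in the assembly shown earlier suggests -O0), or the compiler will eliminate the redundant divisions.

```c
#include <stddef.h>

/* Baseline: one long division per iteration. */
long longDivisionBase(const long *set, size_t n, long iters) {
    long v = 0;
    for (long i = 0; i < iters; i++) {
        long v1 = set[i % n], v2 = set[(i + 1) % n];
        v = v1 / v2;                  /* one idivq */
    }
    return v;
}

/* V1: the second division consumes the first one's quotient, so the
 * two idivq instructions cannot overlap in the pipeline. */
long longDivisionV1(const long *set, size_t n, long iters) {
    long v = 0;
    for (long i = 0; i < iters; i++) {
        long v1 = set[i % n], v2 = set[(i + 1) % n];
        v = v1 / v2 / v2;
    }
    return v;
}

/* V2: the two divisions are independent, so the processor is free to
 * pipeline them. */
long longDivisionV2(const long *set, size_t n, long iters) {
    long v = 0;
    for (long i = 0; i < iters; i++) {
        long v1 = set[i % n], v2 = set[(i + 1) % n];
        v = v1 / v2;
        v = v1 / v2;
    }
    return v;
}
```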
To run the test program, a server was configured with an isolated CPU core and the sample program was assigned to run solely on that core. The CPU clock rate was fixed at ~3.482 GHz. The program produced the following output:
avg v1opTime: 10.175772 ns, avg v2opTime: 8.365086 ns, base time: 12507242 us, v1 time: 22683014 us, v2 time: 20872328 us, num ops: 1000000000
As shown in the output, the time required to execute the V1 function exceeds the time required to execute the V2 function. To understand this behavior, we must remember that modern processors use instruction pipelining to maximize the rate at which instructions are completed. In the case of the V1 function, the output of the first division is required as input to the second division. Hence, despite any pipelining, the second division can not begin to be calculated until the first has completed. In the V2 function, the second division operation has no dependency on the output of the first division, and hence pipelining works effectively and the 8 instructions of the V2 function complete in fewer cycles than the 6 instructions of the V1 function.
With regards to the clock cycles required to complete the operations: at a clock rate of 3.482 GHz, each clock cycle corresponds to ~0.287 nanoseconds. The difference in execution time between the base function and the V1 function was ~10.18 seconds, during which an extra 1 billion cqto/idivq instruction pairs were executed. Thus, each iteration required an extra ~10.18 nanoseconds to complete those two instructions. Given the 0.287 nanosecond clock cycle, this yields ~35.48 clock cycles per iteration. The V2 function requires an additional ~8.37 seconds over the base function, which corresponds to ~29.15 clock cycles per iteration. Both V1 and V2 perform the same number of long division instructions, but the instructions of the V2 function can be more effectively pipelined by the processor and hence complete the same number of calculations in a shorter period of time.
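The cycle arithmetic above can be reproduced with a one-line helper, since a clock rate expressed in GHz is conveniently also cycles per nanosecond:

```c
/* Extra cycles per iteration = extra nanoseconds per iteration
 * multiplied by cycles per nanosecond (the clock rate in GHz). */
double cycles_per_iteration(double extra_ns, double clock_ghz) {
    return extra_ns * clock_ghz;
}
```

Plugging in the measured per-operation times, cycles_per_iteration(10.175772, 3.482) gives roughly 35 cycles and cycles_per_iteration(8.365086, 3.482) gives roughly 29, consistent with the figures above.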
My next blog post will cover monitoring CPU usage and bottlenecks.
These blog posts are part of a larger in-depth series called A Study of Performance in Distributed Computing Environments.