34 min read
Note: This article references commands, behaviors, and outputs generated by Linux-based operating systems, such as CentOS or Ubuntu. Some information will not be relevant to other operating systems, such as Windows.
This blog post is the second in a series of blog posts that will go over distributed computing and high-performance computing areas. The series will describe the challenges of running these systems, ways of troubleshooting these systems and will eventually describe how the MapR Data Platform solves key issues in these areas.
To understand how drive performance affects program execution, it is important to consider the design and function of storage systems.
The drives used in our computers execute single operations at a time. Many programs may issue read or write requests to a single drive concurrently, but that single drive will only execute one of those operations at any given time with the other requests being queued. This is obviously different from how CPUs function today, with multiple cores capable of executing instructions from multiple programs concurrently. Thus, the I/O capabilities of a drive can be fully utilized regardless of whether the read/write queues are almost always empty or filled with requests, so long as the drive is always executing a request.
When executing a read or write request, the number of bytes to be read or written will affect the time it takes for the drive to complete the request, with the elapsed time increasing as the number of bytes increases. Just like a network interface has a defined rate at which bytes can be received or transmitted (such as 10 gigabits per second), so too does a drive. With a 10 gigabit per second network interface, reading 1 gigabit off the interface would take no less than one tenth of a second. Various standards define the rates at which data can be transferred to or from a drive. For instance, the SATA III standard defines a maximum data rate of 6 gigabits per second. Thus, reading 1 KB (8 kb) from a SATA III connected drive will take no less than 1.27 microseconds, while reading 128 KB (1 Mb) will take no less than 162 microseconds. In practice, there is non-trivial overhead involved in executing a read or write request beyond the number of bytes being read or written, so those 1 KB and 128 KB read operations will generally take much longer than the 1.27 and 162 microseconds, respectively. But the premise that requests with larger read or write payloads will take longer to execute is very much still true.
With spinning disk drives, the difference in the offsets at which subsequent read and/or write operations are applied will have a significant impact on the time it takes to complete the request. Which is to say, reading some bytes from the beginning sectors of the disk and then subsequently reading some bytes from the sectors near the end of the disk will result in longer time taken to complete the requests because the head (the disk component used to read/write bytes) must physically reposition itself (seek) to the other edge of the disk platter. The time it takes to reposition the head often exceeds the time it takes to read or write the number of bytes in the request. Thus, seek times are exceedingly important when performing any type of random I/O load with a spinning disk.
With solid state drives, there are no physical components, and hence there is no seek time. Whether an I/O workload is sequential or random has no bearing on the performance of an SSD.
Two types of metrics are used to describe drive performance: the number of operations completed over a period of time, often expressed as operations per second or op/s, and the number of bytes or bits of payload in the operations completed over a period of time, often expressed as a rate of throughput, such as megabytes per second or MB/s. In general, the throughput rate will have a direct relationship with the payload size of operations within a certain range of payload sizes. As the payload size decreases, so too will the throughput rate. As the payload size increases, so too will the throughput rate. For instance, a drive may be able to complete reads at a rate of 2,000 operations per second of 128 KB each (e.g., 250 MB/s). However, when issuing reads for 1 KB each, a higher rate of 20,000 operations per second may be completed but that will amount to a throughput rate of ~20 MB/s.
For data-driven applications, storage-based bottlenecks can occur, based on the rate of operations per second, the throughput rate, or both. For instance, applications that read back the entire contents of large files will generally bottleneck on the throughput rate of the underlying storage system. Other applications, such as databases, will often need to perform many small reads at disparate offsets, in which case the number of operations per second the storage system can complete will likely be the most significant factor gating performance.
Data-driven applications can generally be described using the following characteristics:
Some applications will only perform reads, some will only perform writes, and some will perform both. The rate at which data can be read from a drive is not necessarily the same rate at which data can be written. Therefore, it is important to understand the rates and sizes of reads and writes that an application will need to complete as well as the capabilities of the storage hardware to complete those operations.
The offsets against which consecutive operations are executed can be sequential or random. Consider a 1 TB disk drive against which a read request is issued for 200 bytes at the 1000th byte offset of that disk. Immediately after issuing the first request, a second read request is issued for 200 bytes at the 1200th byte offset of a disk. Those two read requests are considered sequential, and the disk will be able to complete the second request with minimal effort. However, if the second read request was for 200 bytes from a position 999 GB in, the disk will need to reposition mechanical components over a non-trivial physical distance. Those two read requests are considered random, and the disk will require significantly more time to complete the second operation compared to the sequential example. The repositioning of the mechanical components is referred to as seek time and modern disk drives will generally have seek times under 20 ms.
Solid state drives have no physical components to reposition, and hence there will be little to no discernible difference to complete consecutive requests, regardless of whether the requests are sequential or random.
Applications make system calls to the OS to read or write data to drives. The OS generally provides generic and flexible APIs for reading and writing arbitrary amounts of data, from a single byte up to gigabytes or more. Generally, the rate of bytes per second that can be transferred to/from a drive will increase as the payload size of requests issued by the application increases. For example, an application will generally be able to read a 1 MB extent from a disk using 128 KB read requests in less time than if it issued 1 KB read requests.
While the OS may provide APIs to read and write arbitrary payload sizes, disk drives have minimum and maximum payload sizes for operations. As a result, an application may issue a single read or write call to the OS, and the OS may need to handle that by making several calls to the drive.
Disks are organized into units referred to as sectors, and the OS reads or writes disk data as sectors, not individual bytes. As an effect of the Integrated Drive Electronics (IDE) standard, from the late 1980s to the early 2010s, the vast majority of hard drives were manufactured with a set sector size of 512 bytes. As an effect of the Advanced Format, most disk drives manufactured this decade support a 4096 byte sector size. Even with the new standard, these disks generally provide abstractions that allow OS/software to interact with them as if they used a 512 byte sector size, and many systems in use today still use 512 byte sector sizes.
Sector size is important because it represents the units in which the operating system will need to read and write data to a drive. While an application may ask the OS to read a single byte from a drive, the OS will need to read an entire sector from the drive. Similarly, when an application asks the OS to write a single byte to a drive, the OS may need to handle this by reading an entire sector from the drive into memory, setting the content of the appropriate byte in memory as requested by the application, then writing the memory content back to the sector of the drive.
Application performance can be significantly affected by how the application software was designed. For instance, the OS may need to issue an excessive number of requests to a drive if an application issues multiple distinct read or write requests for ranges of bytes in a single sector. While an application may appear to be bottlenecked based on drive performance, it is important to consider whether the application’s design generates efficient read/write requests. In some cases, the number of read or write requests the OS issues to a drive can be significantly reduced by changing a design aspect of an application, and making an application change can be more cost-efficient than operating additional hardware.
In order to understand drive performance, it is useful to observe how various metrics are affected by distinct workloads. There are various sample programs available alongside this article that can be used to generate those workloads.
All the sample programs use the POSIX functions defined in the unistd library to open/read/write/seek/close. The sample programs are single threaded, and the POSIX functions they call are synchronous, which means, for instance, the program may issue a read call to the disk and will not issue a subsequent read call until the first has returned. The programs also use the O_DIRECT flag with the open function and hence use DMA to transfer data to/from the devices. Note that DMA imposes requirements on memory alignment and data transfer sizes. As such, the number of bytes requested in read and write calls will always be a multiple of the device’s logical sector size, regardless of the exact number of bytes the user has requested when running the program. For instance, for drives with 512 byte sector sizes, all read and writes will be for multiples of 512 bytes. A user may ask the program to use 500 byte reads, and the program handles this by issuing a 512 byte read.
This program utilizes the openand read functions from the unistd library to generate fixed size read requests at sequential offsets for a single device. The program can be compiled using the command "gcc sequentialRead.c -o sequentialRead" and then run like "./sequentialRead /dev/
This program utilizes the open, lseek,and read functions from the unistd library to generate read requests at random offsets to a single device. The amount of data requested in each read call will be calculated based on lower and upper bounds provided at run time. The program can be compiled using the command "gcc randomRead.c -o randomRead" and then run like "./randomRead /dev/
This program utilizes the open and write functions from the unistd library to generate fixed size write requests at sequential offsets for a single device. The program can be compiled using the command "gcc sequentialWrite.c -o sequentialWrite" and then run like "./sequentialWrite /dev/
This program utilizes the open, lseek ,and write functions from the unistd library to generate write requests at random offsets to a single device. The amount of data written in each write call will be calculated based on lower and upper bounds provided at run time. The program can be compiled using the command "gcc randomWrite.c -o randomWrite" and then run like "./randomWrite /dev/
My preferred tool for monitoring drive performance on a single computer is iostat as developed through the sysstat (https://github.com/sysstat) project.
The following is an excerpt of output generated by the command "iostat -d -t -m -x 1 /dev/sde" run on a system with CentOS 7.4.1708 and sysstat package version 10.1.5-12. The "-d" flag instructs iostat to print device utilization statistics. The "-t" flag instructs iostat to print timestamps. The "-m" flag instructs iostat to print field values in megabytes per second where relevant. The "-x" flag instructs iostat to print what are referred to as extended statistics. The argument "1" instructs iostat to print statistics once a second. The argument "/dev/sde" instructs iostat to print statistics only for that device path.
The output was generated while a test program was running to generate load on the device. Specifically, there were 4 instances of the randomRead program running concurrently, with the read size set to 1024 bytes for each instance.
01/18/2019 10:10:05 AM Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sde 0.00 0.00 27617.00 0.00 26.97 0.00 2.00 3.80 0.14 0.14 0.00 0.04 100.00
01/18/2019 10:10:06 AM Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sde 0.00 0.00 27328.00 0.00 26.69 0.00 2.00 3.79 0.14 0.14 0.00 0.04 100.00
These lines of output show 27,617 and 27,328 read requests per second completed against the device, respectively, as indicated by the "r/s" field. During both periods, 100% of the elapsed time was spent with requests executing against the device as shown by the "%util" field, and hence no additional operations beyond those ~27K could be completed, regardless of how many additional system calls that programs running on the system could have made against the device during this time. Those statistics represent the maximum throughput the device was capable of delivering.
Note that the term utilization is overloaded in relation to drives. In the context of iostat, utilization refers to the percentage of a period of time during which the drive was servicing requests – e.g., if a total of 9 seconds was spent by a drive, servicing read and write requests over a 10 second period of time, then we would refer to the drive as being 90% utilized in terms of throughput. In the context of data storage, a drive may be able to store 1000 gigabytes and may have 600 gigabytes of values that were explicitly written to various locations on that drive, and hence we may describe it as being 60% utilized in terms of storage capacity. For the purposes of this article, all references to drive utilization are in relation to throughput, not storage capacity.
The "avgrq-sz" field shows the average number of logical device sectors accessed per request, aggregated from both read and write operations, though there were no write operations at the time this output was generated. To retrieve the logical sector size, the ioctl system call can be used with the BLKSSZGET flag. A common command to run that will print the value returned by this system call is "fdisk -l" while providing the path to the device of interest. In the case of the system that generated this output, the logical sector size of the /dev/sde device is 512 bytes. The randomRead program issued read requests for 1024 bytes each, which corresponds to the value of 2 sectors as shown in the iostat output for this field.
The "rMB/s" field shows the rate at which data was read from the drive in megabytes per second. While this value is being printed explicitly in the output, it can also be inferred from the reads per second and writes per second values and the average request size value. In the case of the first output, there are 27,617 read requests per second, 0 write requests, and an average request size of 2 logical sectors of 512 bytes each. Thus, 2 sectors multiplied by 512 bytes each multiplied by 27,617 requests per second equals 28,279,808 bytes per second, or ~26.97 megabytes per second, matching the value shown for the "rMB/s" field. This math works in this case because only read requests were issued to the device during this time. The avgrq-sz field includes both reads and write.
The "svctm" field shows the average time in milliseconds the device took to complete each request. With an average service time of 0.04 milliseconds per request, the device would be capable of completing approximately 25,000 such requests per second. Working backwards, if the device is 100% utilized and completed 27,617 reads and 0 writes during a 1 second period, then the average time spent per request is ~0.0362 milliseconds. Note that the value of the "svctm" field is rounded to hundredths of a millisecond and does not have good precision for the types of statistics that can be generated by an SSD. In this case, that rounded value of 0.04 is more than 10% off from the actual value.
The "await" field shows the average amount of time in milliseconds from when a request to read or write the device was passed to the kernel to when the kernel returned the response. Similarly, the "r_await" and "w_await" fields show the same statistics but exclusively for read and write requests, respectively. In the case of this output, only read requests were issued for the device, and hence the w_await field is shown as 0, while the r_await and await fields show the same value. The value of the await field will always be greater than or equal to the value of the svctm field, as the await field includes the time that requests were waiting in the queue for their turn to be executed and the time it took to actually execute the requests. In the case of my test program, with an await/r_await of 0.14 ms, it indicates that for each of the 4 threads, the average time elapsed from when the read call was made to when it returned was 0.14 ms.
The "avgqu-sz" field shows the average number of outstanding requests to the device. The average number of outstanding requests as shown in the iostat output is 3.8 and 3.79, respectively. The logic of the test program is such that it spawned 4 threads; each thread enters a loop in which it calls lseek to adjust the offset to a random position of the device and then calls read to read 1024 bytes from that position. With 4 threads issuing reads to the device, if there are an average of 3.8 read requests outstanding for the device over the period of time, we can conclude that ~95% of the running time of each thread was spent waiting for the read call to return, with the other ~5% of the running time of each thread doing other work, such as calculating a random offset to use for the subsequent read call it will make.
Lastly, the "rrqm/s" and "wrqm/s" fields show the number of read and write requests that were merged per second. These fields can be non-zero when the kernel receives requests for overlapping/sequential extents of the device that can be serviced by treating the requests as a single request instead of executing them independently. For instance, if a request was issued to read bytes 1 through 10 from a device, and before that request could be executed, a subsequent request was issued to read bytes 11 through 20 from that same device, then the kernel may decide to issue a single read request to the device for bytes 1 through 20 instead of using two individual requests, thus merging the requests. In such a scenario, the counter to track the number of merged read requests would be incremented.
iostat, like other utilities in the sysstat project, works by periodically reading content from procfs and comparing that content to content that was read previously. For instance, at 12:00:00, a value of 100 may be read for a particular field, then at 12:00:01, the value of 105 may be read for that same field, and hence it can be determined that the value of the field incremented by 5 over the period of time from 12:00:00 to 12:00:01. If, for instance, that field represented the number of read requests that were issued to a drive, with 1 second elapsed over that period of time, we can conclude the rate of read requests issued to the drive was 5 per second.
iostat in particular will read per device statistics from the diskstats entry of procfs, which is generally available by reading the content of the path /proc/diskstats. The values read from that entry are recorded in memory, and upon the next polling iteration, the amounts by which those values change will be used to generate the lines of iostat output.
As discussed earlier, there are a variety of factors that affect how many operations per second and bytes per second of throughput a drive can deliver. To qualify and quantify how these factors affect performance, you can run a variety of tests using various combinations of:
The sample programs linked in this article can be used to generate the read/write requests, and the iostat utility can be used to monitor the disk performance.
First, let’s look at how the payload size of the read(2) function from the unistd library affects the number of operations per second and bytes per second of throughput that an SSD can deliver.
For this test, the randomRead program was used to generate read requests of fixed sizes at random offsets. The iostat output was monitored to ensure the disk was 100% utilized for the duration of each test iteration.
In this graph, the x-axis represents the payload size requested when calling the read(2) function in the unistd library. The left y-axis represents the number of read requests sent by the kernel to the drive per second. The right y-axis represents the number of megabytes per second read from the drive per second. Note that, in this environment, the maximum payload that can be read from the disk by the kernel in a single request is 128 KB, regardless of the payload size requested by the read(2) function. Thus, if the read(2) function is called with a 256 KB payload, the kernel will need to handle this request by issuing two read requests to the disk for 128 KB each. Thus, once the payload size for the read(2) function goes beyond 128 KB, the number of operations per second completed by the disk stays roughly the same as those disk operations will always be for 128 KB payloads.
When issuing read requests for small payloads, such as 1 KB, the SSD is able to complete over 25,000 reads per second, which equates to ~25 MB/s. When issuing read(2) requests for 128 KB each, the SSD completed ~2,455 operations per second, which equates to ~307 MB/s. This demonstrates that as the size of the payload being requested increases, the bytes per second returned by the disk increases while the operations per second decreases. This behavior holds true until the payload size of the read(2) calls exceeds the maximum size the kernel can request from the disk in a single call (e.g., 128 KB). Notice that when issuing read(2) requests for 256 KB each, the SSD completed ~2722 operations per second, which equates to ~340 MB/s. This represents about a 10% increase over the test using 128 KB read(2) sizes.
You may be wondering, if the kernel simply executes the 256 KB read(2) as two 128 KB disk reads, why would the throughput differ from the test where the read(2) calls were made for 128 KB each?
The answer lies with the kernel’s imprecise method of measuring the disk performance. The utilization statistic indicates the percentage of the running time that the kernel had one or more operations that are executing or waiting to be executed against a disk. It does not indicate the actual percentage of time the disk spent executing requests. The kernel will send a request to a disk; the disk will execute the request and then notify the kernel of its completion through an interrupt request on an IRQ (or rather, the disk controller will issue the interrupt request, but that’s a topic for another article). When the kernel receives the IRQ, it knows the disk is available to service another request. There can be varying amounts of delay from when a disk signals an operation has completed via an interrupt request to when the kernel submits the next request to the disk. There will always be some amount of delay, even when multiple requests for the disk are queued in the kernel. The longer the delay, the lower the disk throughput.
In the case of the 256 KB test, the kernel has to issue twice the number of disk read operations per read(2) call vs. the 128 KB test. The effect of the increased number of operations is decreased turn around time from receipt of an interrupt request to dispatching the subsequent disk request.
Just how significant is this effect? With a single instance of the randomRead program running against the SSD using a read(2) size of 128 KB, the disk shows ~95% utilization and a throughput of ~172 MB/s. Extrapolating, if 172 MB/s represents 95% of the disk’s ability to service 128 KB read requests, then 100% utilization should correlate to 181 MB/s. With 2 instances of that randomRead program running, both issuing 128 KB reads, the disk utilization shows as 100% and a throughput of ~240 MB/s, which is almost 33% higher than what we would expect the maximum throughput to be. However, as we’ve already observed in the test results shown in the previous graph, this disk is capable of delivering ~400 MB/s of throughput.
With 4 instances of the program running, the disk reaches ~310 MB/s. Eight instances yield ~340 MB/s, 16 instances yields ~371 MB/s, 32 instances yields ~385 MB/s. In all these test cases, from 2 instances up to 32 instances, the disk utilization shows 100%. And in all cases, the requests being issued by the kernel to the disk are 128 KB reads. Thus, while the kernel may indicate a disk was 100% utilized, the kernel may not actually have I/O requests issued to the disk 100% of the time or even 60% of the time. The design of the kernel has profound effects on what we can observe about utilization of hardware devices like disk drives.
But, I digress; we’re here to discuss the effects of workload characteristics on disk performance, not get mired in these types of idiosyncrasies.
Let’s look at how SSD performance differs from HDD. For the first test, we will issue sequential read requests for varying payload sizes and observe the throughput the disks can deliver.
The graph demonstrates that SSDs and HDDs provide similar throughput for sequential read workloads up to the point where either drive reaches its maximum throughput. In this case, the slowish HDD has a maximum read rate of 200 MB/s that can be reached by issuing read(2) calls for 32 KB each. The SSD provides very similar performance up to that point as well, but read(2) size increases.
However, while the SSD and HDD deliver similar performance for sequential reads, performing random reads highlights the significant difference between these types of storage devices.
SSDs can complete random read requests at a rate hundreds of times greater than HDDs.
Similar to the read results, SSDs can complete random write operations at significantly higher rates than HDDs, and sequential writes generate comparable performance.
The takeaway here is that with any notable degree of randomness to an I/O workload, an SSD will provide significantly better performance. For workloads that generate significant amounts of sequential I/O, HDDs are a cost-effective solution.
Disks seem simple: it’s all just reads, writes, and seeks. However, in practice, user space applications make one type of I/O request, kernels make a different type of I/O request, and different types of disks have vastly different abilities to handle those requests. The tools and techniques we use to monitor disk performance are imperfect. It’s not as simple as it seems on the surface, but being aware of all these caveats and behaviors will greatly increase your chances of constructing a meaningful understanding of real-life storage performance.
In later articles in this series, we will build upon the basics laid out here. Among other things, we will look at how in-memory caching adds complexity to understanding storage performance and how the performance of individual disks manifests in distributed storage systems.
Stay ahead of the bleeding edge...get the best of Big Data in your inbox.