The Deceptively Simple Act of Writing to Disk
Author: ScyllaDB
Originally Sourced from: https://www.scylladb.com/2026/02/18/the-deceptively-simple-act-of-writing-to-disk/
read_bw/read_bw_max + write_bw/write_bw_max + read_iops/read_iops_max + write_iops/write_iops_max <= 1
The internal mechanics of how the IO Scheduler works are detailed in the blog post linked above.
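As a sketch, the scheduler's capacity check above can be expressed directly. The io-properties maxima below are made-up illustration values, not real measurements:

```python
def fits_capacity(read_bw, write_bw, read_iops, write_iops, props):
    """Check whether a dispatched IO mix fits the disk model.

    `props` holds the iotune-measured maxima; the IO Scheduler keeps
    the normalized sum of bandwidth and IOPS fractions <= 1.
    """
    total = (read_bw / props["read_bw_max"]
             + write_bw / props["write_bw_max"]
             + read_iops / props["read_iops_max"]
             + write_iops / props["write_iops_max"])
    return total <= 1.0

# Hypothetical io-properties, for illustration only.
props = {"read_bw_max": 10e9, "write_bw_max": 6e9,
         "read_iops_max": 800_000, "write_iops_max": 400_000}

print(fits_capacity(2e9, 3e9, 100_000, 50_000, props))  # True: sum is 0.95
```

The point of the normalized form is that any single dimension hitting its maximum, or any combination adding up to it, counts as a fully utilized disk.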
Peak and sustained throughput
We observed some bursty behavior in the SSDs under test. It wasn’t much: IOPS/bandwidth would be around 5% lower than the values measured by running the benchmark for the default 2 minutes. The IOPS/bandwidth values would start stabilizing at around 30 minutes, and that’s what we call the sustained io-properties.
We suspected that our IOTune runs had recorded peak disk io-properties (i.e., we ran the tool for too short a duration; the default is 120 seconds). A realistic workload exercises the disks at their sustained throughput, so the IO Scheduler would build an inflated model of the disk and end up sending more requests than the disk can actually handle. This would cause the overload we saw in the charts.
We then tested with newly measured sustained io-properties (with the IOTune duration configured to run 30 minutes, then 60 minutes). However, there wasn’t any noticeable improvement in the throughput degradation problem…and the charts were still spiky.
Disk saturation length
The disk saturation length, as defined and measured by IOTune, is the smallest request length that’s needed to achieve the maximum disk throughput.
All the disks that we’ve seen so far had a measured saturation length of 128K. This means that it should be perfectly possible to achieve maximum throughput with 128K requests.
We noticed something quite odd while running tests on these performant NVMes: the Seastar IOTune tool would report a saturation length of 1MB. We immediately panicked because there are a few important assumptions that rely on us being able to saturate disks with 128K requests.
The issue matched the symptoms we were chasing. A disk model built with the assumption that saturation length is 1MB would trick the IO Scheduler into allowing a higher number of 128K requests (the length Seastar uses for sequential writes) than the disk controller can handle efficiently.
In other words, the IO Scheduler would try to achieve the high throughput measured with 1MB request length, but using 128K requests. This would make the disk appear overloaded, as we saw in the charts.
Suppose you’re trying to reach the maximum throughput on a common disk with 4K requests, for instance. You won’t be able to do it. And since the throughput stays below the maximum, the IO Scheduler keeps stuffing more and more requests into the disk, hoping to reach a maximum that never arrives. The side effect is that the IO Scheduler ends up overloading the disk controller with requests, increasing in-disk latencies for all the other processes trying to use it.
As is typical when you’re navigating muddy waters like these, this turned out to be a false lead. We had stepped on some bugs in our measuring tools, IOTune and io_tester: IOTune was running with lower parallelism than these disks needed for saturation, and io_tester was measuring overwrites of an existing file rather than writes to a pre-truncated new file.
The saturation length of these disks was still 128K, just as we had seen in the past. Fortunately, that meant we didn’t need to make architectural changes in Seastar to accommodate larger requests.
A nice observation from the tests we ran while trying to prove or disprove this theory is that extent allocation is a rather slow process. If the allocation group is already quite busy (for instance, a few big files already exist under the same directory), the effect on throughput when appending to and extending a file is quite dramatic.
Internal parallelism
Another interesting finding was that the disk/filesystem seemed to suffer from internal parallelism problems. We ran io_tester with 128K requests and 8 fibers per shard, first with 8 shards and then with 64 shards. The results were very odd: the expected bandwidth was ~12.7GB/s, but it dropped when we increased the number of shards.
Generally, the bandwidth vs. latency dependency has two regimes. If you measure latency and bandwidth while increasing the application’s parallelism, latency stays constant and bandwidth grows as long as the internal disk parallelism is not saturated. Once the disk is saturated, bandwidth stops growing (or grows very little), while latency scales almost linearly with the increase in input.
However, in the table above, we see something different. When the disk is overloaded, latency increases (because it holds more requests internally), but the bandwidth drops at the same time. This might explain why IOTune measured bandwidths around 12GB/s, while under the real workload (capped with IOTune-measured io-properties.yaml), the disk behaved as if it was overloaded (i.e., high latency and the actual bandwidth below the maximum bandwidth).
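The two healthy regimes described above can be captured with a toy model. All numbers here are illustrative, not measurements of any real disk:

```python
def disk_model(parallelism, sat_parallelism=32, base_latency_us=100):
    """Toy bandwidth/latency model of a healthy disk.

    Below saturation: bandwidth scales with parallelism, latency flat.
    At/after saturation: bandwidth is capped, latency grows linearly
    as requests queue up inside the device. The anomaly we observed
    (bandwidth *dropping* past saturation) breaks this model.
    """
    per_req_bw = 100  # MB/s contributed by one in-flight request (made up)
    if parallelism <= sat_parallelism:
        return per_req_bw * parallelism, base_latency_us
    bw = per_req_bw * sat_parallelism
    latency = base_latency_us * parallelism / sat_parallelism
    return bw, latency

print(disk_model(16))  # (1600, 100): below saturation, latency flat
print(disk_model(64))  # (3200, 200.0): capped bandwidth, latency grows
```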
When IOTune measures the disk, shards load the disk one-by-one. However, the real test workload sends parallel requests from all shards at once.
The storage device was a RAID0 with 4 NVMes and a RAID chunk size of 1MB. The theory here was that since each shard writes via 8 fibers of 128k requests, it’s possible that many shards could end up writing to the same disk in the array.
The explanation is that XFS aligns files on 1MB boundaries. If all shards start at the same file offset and move at relatively the same speed, the shards end up picking the same drive from the array. That means we might not be measuring the full throughput of the raid array.
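A minimal sketch of the chunk-to-drive mapping this theory rests on, assuming a plain striped RAID0 layout (a simplification of what the real array does):

```python
CHUNK = 1 << 20   # 1MB RAID chunk size
NDISKS = 4        # 4 NVMes in the array

def raid0_drive(offset):
    """Which member drive a byte offset in a striped RAID0 maps to."""
    return (offset // CHUNK) % NDISKS

# If every shard starts at a 1MB-aligned file offset and advances at
# roughly the same speed, offsets with the same chunk index modulo 4
# keep landing on the same drive.
print(raid0_drive(0), raid0_drive(1 << 20), raid0_drive(4 << 20))  # 0 1 0
```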
The measurements confirm that the shards do not consistently favor the same drive. Single disk throughput was measured at 3.220 GB/s while the entire 4-disk array achieved a throughput of 10.9 GB/s. If they were picking the same disk all the time, the throughput of the entire 4-disk array would’ve been equal to that of a single disk (i.e. 3.2GB/s).
This lead ended up being a dead end. We tried to prove it in a simulator, but all requests ended up shuffled evenly between the disks. Sometimes, interesting theories that you bet can explain certain effects just don’t hold true in practice. In this case, something else sits at the base of this issue.
XFS formatting
Although the previous lead didn’t get us very far, it opened a very nice door.
We noticed that the throughput drop is reproducible:
- on 64 shards, rather than 8 shards
- on a RAID mounted with the scylla_raid_setup script (a ScyllaDB script which does preparation work on a new machine, e.g., formats the filesystem, sets up the RAID array), but not on a raw block device and not on a RAID created with default parameters
Comparing the running mkfs.xfs commands, we spotted a few differences.
In the table below, notice how the XFS parameters differ between default-mounted XFS and XFS mounted by scylla_raid_setup. The 1K vs 4K data.bsize difference stands out.
We also spotted (and this is an important observation) that truncating the test files to some large size seems to bring back the stolen throughput. The results coming from this observation are extremely surprising. Keep reading to see how this leads us to the actual root cause of the problem.
The table below shows the throughput in MB/s when running tests on files that are being appended and extended and files that are being pre-truncated to their final size (both cases were run on XFS mounted with the Scylla script).
We experimented with Seastar’s sloppy_size=true file option, which doubles the file’s size (via truncate) every time it sees an access past the current size of the file. While it improved the throughput numbers, it unfortunately still left half of the missing throughput on the table.
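The doubling policy behind that option can be sketched as follows. This is our simplified sketch, not Seastar's actual implementation, and the minimum size here is a hypothetical constant:

```python
def sloppy_truncate_size(current_size, write_end):
    """Sketch of a sloppy-size policy: when a write reaches past the
    current (truncated) size, grow the file by doubling, so that
    subsequent appends rarely need to extend it again."""
    size = max(current_size, 1 << 16)  # hypothetical 64K minimum
    while size < write_end:
        size *= 2
    return size

# A write just past a 1MB file triggers a truncate to 2MB.
print(sloppy_truncate_size(1 << 20, (1 << 20) + 1))  # 2097152
```

The trade-off is fewer (slow) extent allocations at the cost of transiently over-reporting the file size.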
RWF_NOWAIT and non-blocking IO
The first lead that we got from here was by running our tests under strace. Apparently, all of our writes would get re-submitted out of the Seastar Reactor thread to XFS threads.
Seastar uses RWF_NOWAIT to attempt non-blocking AIO submissions directly from the Seastar Reactor thread. It sets aio_rw_flags=RWF_NOWAIT on IOCBs during initial io_submit calls from the Reactor backend.
On modern kernels, this flag requests immediate failure (EAGAIN) if the submission may block, preserving reactor responsiveness.
On io_submit failure with RWF_NOWAIT, Seastar clears the flag and queues the IOCBs for retrying. The thread pool then executes the retry submissions without the RWF_NOWAIT flag, as these can tolerate blocking.
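The submit-then-retry dance can be sketched like this. Python stands in for the C++ reactor code; RWF_NOWAIT's value is taken from the Linux uapi headers, and `submit_fn` is a hypothetical stand-in for io_submit:

```python
RWF_NOWAIT = 0x8  # from the Linux uapi headers; illustrative here

def submit(iocb, submit_fn):
    """Try a non-blocking submission from the reactor thread; on
    EAGAIN, clear the flag and retry from the thread pool, where
    blocking is acceptable."""
    try:
        return submit_fn(iocb, RWF_NOWAIT), "reactor"
    except BlockingIOError:
        return submit_fn(iocb, 0), "thread_pool"

def busy_fs(iocb, flags):
    """Simulated filesystem that would block on this submission."""
    if flags & RWF_NOWAIT:
        raise BlockingIOError  # the kernel reports EAGAIN
    return 0

print(submit(b"iocb", busy_fs))  # (0, 'thread_pool')
```

The strace observation above corresponds to nearly every write taking the second path.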
We thought that reducing the number of retries would increase throughput. Unfortunately, disabling the flag didn’t actually do that.
The cause of the throughput drop is uncovered next. As for the RWF_NOWAIT issue, it’s still unclear why it doesn’t affect throughput. However, the fix was a kernel patch by our colleague Pavel Emelyanov that adjusts the inode timestamp update when IOCBs carry the RWF_NOWAIT flag. More details on this would definitely exceed the scope of this blog post.
blktrace and offset alignment distribution
Returning to our throughput performance issue, we started running io_tester experiments with blktrace on, and we noticed something strange.
For around 25% of the requests, io_tester was submitting 128K requests and XFS would queue them as 256-sector requests (256 sectors × the reported 512-byte physical sector size = 128K). However, it would split the requests and complete them in 2 parts. (Note the Q label on the first line of the output below; this indicates the request was queued.)
The first part of the request would finish in 161 microseconds, while the second part would finish in 5249 microseconds. This dragged the latency of the whole request up to 5249 microseconds (the 4th column in the table is the timestamp of the event in seconds; the latency of a request is max(ts_completed - ts_queued)).
The remaining 75% of requests were queued and completed in one go as 256 sector requests. They were also quite fast: 52 microseconds, as shown below.
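Deriving per-request latency from the trace follows directly from that max(ts_completed - ts_queued) rule. A sketch using simplified (timestamp, action) events with made-up timestamps matching the numbers above:

```python
def request_latency_us(events):
    """Latency of one logical request from blktrace-style events.

    `events` is a list of (timestamp_sec, action) tuples, where 'Q'
    marks the queueing of the request and 'C' a (possibly partial)
    completion; the request is done when its last part completes.
    """
    queued = min(ts for ts, action in events if action == "Q")
    completed = max(ts for ts, action in events if action == "C")
    return (completed - queued) * 1e6

# The split request: first half completes after 161us, second after
# 5249us; the whole request therefore takes 5249us.
split = [(10.000000, "Q"), (10.000161, "C"), (10.005249, "C")]
print(round(request_latency_us(split)))  # 5249
```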
The explanation for the split is related to where the 128K requests land within the file as XFS lays it out on disk, given the 1MB RAID chunk size.
The split point occurs at sector 21057335224 + 72; multiplied by the 512-byte sector size, that translates to 0x9CE3AE00000 in hex, which is, in fact, a multiple of 0x100000: the 1MB RAID chunk boundary.
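The arithmetic is easy to verify (sector numbers taken from the trace above):

```python
SECTOR = 512       # reported physical sector size, in bytes
CHUNK = 1 << 20    # 1MB RAID chunk

start_sector = 21057335224   # the 256-sector request from the trace
split_sector = start_sector + 72

split_byte = split_sector * SECTOR
print(hex(split_byte))           # 0x9ce3ae00000
print(split_byte % CHUNK == 0)   # True: exactly on a 1MB chunk boundary
```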
We can discuss optimizations for this, but that’s outside the scope of this article. Unfortunately, it was also out of scope for our throughput issue.
However, here are some interesting charts showing how request offsets alignment looks based on the blktrace events we collected.
default formatted XFS with 4K block size
scylla_raid_setup XFS: 1K block
scylla_raid_setup XFS: 1K block + truncation
This result is significant!
For XFS formatted with a 4K block size, most requests are 4K aligned.
For XFS formatted with scylla_raid_setup (1K block size), most requests are 1K or 2K aligned.
For XFS formatted with scylla_raid_setup (1K block size) and with test files truncated to their final size, all requests are 64K aligned (although in some cases we also saw them being 4k aligned).
It turns out that XFS lays out files on disk very differently when the file’s size is known in advance compared to when the file is grown on demand. The latter results in IO unaligned to the disk block size.
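The alignment buckets in the charts above can be computed from an offset's trailing zero bits. A small sketch with hypothetical offsets:

```python
def alignment(offset):
    """Largest power-of-two boundary a byte offset sits on
    (offset & -offset isolates the lowest set bit)."""
    return offset & -offset if offset else None  # 0 aligns to everything

print(alignment(3 * 1024))    # 1024: a "1K aligned" request
print(alignment(192 * 1024))  # 65536: a "64K aligned" request
```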
Punchline
Now here comes the explanation of the problem we’ve been chasing since the beginning. In the first part of the article, we saw that doing random writes with 1K requests produces worse IOPS than with 4K requests on these 4K-optimized NVMes. This happens because, to land the data into the flash chips, the disk needs to perform a Read-Modify-Write (RMW) for each 1K request.
When we submit 128K requests (as ScyllaDB does) that are only 1K or 2K aligned (see the alignment distributions in the charts above), the disk is forced to do RMW on the head and tail of each request. This slows down all the requests (unrelated, but similar in concept to the RAID chunk alignment split we’ve seen above).
Individually, the slowdown is probably tiny. But since most requests are 1K or 2K aligned on XFS formatted with a 1K block size (no truncation), the throughput hit is quite significant.
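A sketch of the head/tail RMW count for a single write, assuming a 4K internal unit on these 4K-optimized NVMes (the constant is our assumption):

```python
NVME_BLOCK = 4096  # assumed internal (indirection) unit of the drive

def rmw_blocks(offset, length, block=NVME_BLOCK):
    """How many read-modify-write operations a write incurs on a disk
    with `block`-sized internal units: one for an unaligned head and
    one for an unaligned tail."""
    head = 1 if offset % block else 0
    tail = 1 if (offset + length) % block else 0
    return head + tail

print(rmw_blocks(4096, 128 * 1024))  # 0: fully 4K-aligned write
print(rmw_blocks(1024, 128 * 1024))  # 2: 1K-aligned head and tail both RMW
```

Two extra internal operations per 128K write looks negligible in isolation, but applied to nearly every request it compounds into the throughput loss we measured.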
It’s very interesting to note that, as shown in the last chart above, truncation also improved the alignment distribution quite significantly, and also improved throughput. It also appeared to significantly shorten the list of extents created for our test files.
For ScyllaDB, the right solution was to format XFS with a 4K block size. Truncating to the final size of the file wasn’t really an option for us because we can’t predict how big an SSTable will grow. Since sloppy_size’ing the files didn’t provide great results, we agreed that 4K-formatted XFS was the way to go.
The throughput degradation we got using the higher io-properties numbers seems to be solved. We initially expected to see “improved performance” compared to the original low io-properties case (i.e., a higher measured write throughput). The success wasn’t obvious, though. It was rather hidden within the dashboards, as shown below.
Here’s what disk utilization and IO flow ratio charts look like. The disk is fully utilized and clearly not overloaded anymore.
And here are memtable and commitlog charts. These look very similar to the charts we got with the initial low io-properties numbers from the “When bigger instances don’t scale” article. Most likely, this means that’s what the test can do.
The good news was hidden here. While the test went full speed ahead, the compaction job filled all the available throughput, from ~10GB/s (when commitlog+memtable were running at 4.5GB/s) to 14.5GB/s (when commitlog and memtable flush processes were done).
The only thing left to check was whether the filesystem formatted with the 4K block size would cause read amplification on older disks with a 512-byte physical sector size. It turns out it didn’t. We were able to achieve similar IOPS on a RAID with 4 NVMes of the older type.
Next up: Part 3, Common performance pitfalls of modern storage I/O.