Bcache with direct writes and fsync: why are IOPS limited?

Hello guys.

I'm trying to set up an NVMe flash disk as a cache for two or three individual spinning disks (I will use 2 TB disks, but in these tests I used a 1 TB one) on Linux 5.4.174 (a Proxmox node).

Testing the NVMe with fio, it performs well at 4K random writes, even with the direct and fsync flags.

root@pve-20:~# fio --filename=/dev/nvme0n1p2 --direct=1 --fsync=1 --rw=randwrite --bs=4K --numjobs=1 --iodepth=1 --runtime=10 --time_based --group_reporting --name=journal-test --ioengine=libaio

  write: IOPS=32.9k, BW=129MiB/s (135MB/s)(1286MiB/10001msec); 0 zone resets
  lat (nsec)   : 1000=0.01%
  lat (usec)   : 2=0.01%, 20=0.01%, 50=99.73%, 100=0.12%, 250=0.01%
  lat (usec)   : 500=0.02%, 750=0.11%, 1000=0.01%
  cpu          : usr=11.59%, sys=18.37%, ctx=329115, majf=0, minf=14
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,329119,0,329118 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=129MiB/s (135MB/s), 129MiB/s-129MiB/s (135MB/s-135MB/s), io=1286MiB (1348MB), run=10001-10001msec

But when I run the same test on the bcache device in writeback mode, performance drops a lot. It is still better than the spinning disks, of course, but much worse than accessing the NVMe hardware directly.

root@pve-20:~# fio --filename=/dev/bcache0 --direct=1 --fsync=1 --rw=randwrite --bs=4K --numjobs=1 --iodepth=1 --runtime=10 --time_based --group_reporting --name=journal-test --ioengine=libaio

  write: IOPS=1548, BW=6193KiB/s (6342kB/s)(60.5MiB/10001msec); 0 zone resets
  lat (usec)   : 50=0.41%, 100=31.42%, 250=66.20%, 500=1.01%, 750=0.31%
  lat (usec)   : 1000=0.15%
  lat (msec)   : 2=0.20%, 4=0.08%, 10=0.08%, 20=0.15%
  cpu          : usr=3.72%, sys=11.67%, ctx=44541, majf=0, minf=12

Run status group 0 (all jobs):
  WRITE: bw=6193KiB/s (6342kB/s), 6193KiB/s-6193KiB/s (6342kB/s-6342kB/s), io=60.5MiB (63.4MB), run=10001-10001msec

Disk stats (read/write):
    bcache0: ios=0/30596, merge=0/0, ticks=0/8492, in_queue=8492, util=98.99%, aggrios=0/16276, aggrmerge=0/0, aggrticks=0/4528, aggrin_queue=578, aggrutil=98.17%
  sdb: ios=0/2, merge=0/0, ticks=0/1158, in_queue=1156, util=5.59%
  nvme0n1: ios=1/32550, merge=0/0, ticks=1/7898, in_queue=0, util=98.17%

As we can see, the same test on the bcache0 device reached only 1,548 IOPS, yielding only 6.3 MB/s.

That is much more than any spinning HDD could give me, but many times less than the NVMe achieves on its own.

In several tests, varying the number of jobs and the block size, I've noticed that the larger the blocks, the closer the bcache device gets to the physical device's throughput. But the IOPS always seem capped somewhere around 1,500-1,800. Increasing the number of jobs yields better totals, but dividing total IOPS by the number of jobs shows that each job is still limited to the same 1,500-1,800 IOPS range.
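For reference, a loop like this reproduces the per-job ceiling (just a sketch; the grep only pulls the IOPS line out of each run):

# for j in 1 2 4 8; do fio --filename=/dev/bcache0 --direct=1 --fsync=1 --rw=randwrite --bs=4K --numjobs=$j --iodepth=1 --runtime=10 --time_based --group_reporting --name=sweep-$j --ioengine=libaio | grep IOPS; done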

The commands used to configure bcache were:

# echo writeback > /sys/block/bcache0/bcache/cache_mode
# echo 0 > /sys/block/bcache0/bcache/sequential_cutoff
##
## Then I also tried the commands below, but there was no improvement.
##
# echo 0 > /sys/fs/bcache/<cache set>/congested_read_threshold_us
# echo 0 > /sys/fs/bcache/<cache set>/congested_write_threshold_us
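For completeness, these are the other writeback- and journal-related knobs I know of; the values below are only examples, and I have not confirmed that any of them changes the fsync behavior:

# echo 10 > /sys/block/bcache0/bcache/writeback_percent
# echo 0 > /sys/block/bcache0/bcache/writeback_delay
# echo 0 > /sys/fs/bcache/<cache set>/journal_delay_ms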

Monitoring with dstat while fio runs, I can see that all writes go to the cache device (the second NVMe partition) until the end of the test. Only after some time is the spinning disk written to: reads then appear on the NVMe and writes on the spinning disk, which is the data being transferred in the background.
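For reference, the monitoring was along these lines (device names as in my setup; -D restricts dstat to the two disks):

# dstat -td -D sdb,nvme0n1 1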

This means that the writeback cache mechanism appears to be working as it should, except for the performance limitation.

ioping also shows the limitation: latency on the bcache0 device is around 1.5 ms, while the same test on the raw device (an NVMe partition) averages only 82.1 us.

root@pve-20:~# ioping -c10 /dev/bcache0 -D -Y -WWW -s4k

--- /dev/bcache0 (block device 931.5 GiB) ioping statistics ---
9 requests completed in 11.9 ms, 36 KiB written, 754 iops, 2.95 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 968.6 us / 1.33 ms / 1.60 ms / 249.1 us

-------------------------------------------------------------------

root@pve-20:~# ioping -c10 /dev/nvme0n1p2 -D -Y -WWW -s4k

--- /dev/nvme0n1p2 (block device 300 GiB) ioping statistics ---
9 requests completed in 739.2 us, 36 KiB written, 12.2 k iops, 47.6 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 63.5 us / 82.1 us / 95.1 us / 10.0 us

The cache was configured directly on one of the NVMe partitions (in this case, the first). I ran several fio and ioping tests: on an NVMe partition, on the raw block device without any partition, on the first partition, on the second, with and without bcache configured, all to remove any doubt about the method. The results of tests run directly on the hardware device, without going through bcache, are always fast and consistent.
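To double-check the attachment itself, the superblocks can also be inspected with bcache-tools (assuming, as described, that the cache is the first NVMe partition and the backing device is sdb):

# bcache-super-show /dev/nvme0n1p1
# bcache-super-show /dev/sdb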

But tests through bcache are always slower. With writethrough it is much worse, of course, since performance is then equal to the raw spinning disk.

Using writeback improves a lot, but still doesn't use the full speed of NVMe.

I've also noticed a limit on sequential writes: a little more than half of the maximum write rate the NVMe device shows in direct tests.

CPU load doesn't seem to rise along with the tests either, so processing does not look like the bottleneck.

Does anyone know what could be causing these limits?

Thanks

Comments

  • mwadriano

    I think it must be some fine tuning.

I suspect this kernel facility (bcache) is not used much, at least not in this way, because I'm having a hard time finding feedback on the Internet; I didn't even know where to ask for help.

One curious thing I noticed is that the writes always land on the flash, never on the spinning disk. That is expected, and it should give the same fast response as the flash device itself. However, that is not what happens when going through bcache.
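For reference, this can be checked quickly through bcache's sysfs: state reports whether the backing device has dirty data in the cache, and dirty_data shows how much is waiting for writeback.

# cat /sys/block/bcache0/bcache/state
# cat /sys/block/bcache0/bcache/dirty_data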

But when I remove the fsync flag from the fio test (the flag that makes the application wait for each write to be confirmed), the 4K writes run much faster, reaching 73.6 MB/s and 17k IOPS. That is half the device's performance, but more than enough for my case. The fsync flag makes no significant difference when testing directly on the flash disk. The fact that bcache speeds up once fsync is removed makes me believe bcache is not slow to write; rather, for some reason, it takes a while to acknowledge that the write is complete. I think that must be the point!
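For reference, the no-fsync variant is just the same fio command with the --fsync=1 flag dropped:

# fio --filename=/dev/bcache0 --direct=1 --rw=randwrite --bs=4K --numjobs=1 --iodepth=1 --runtime=10 --time_based --group_reporting --name=journal-test --ioengine=libaio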

Without fsync, the ioping tests also speed up, albeit less: latency drops to around 600-700 us.

That is nothing compared to the 84 us (4K ioping write) obtained when writing directly to the flash device (with or without fsync), but it is still much better than the 1.5 ms that bcache shows once the fsync flag is added to wait for the write acknowledgment.
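For reference, those no-fsync ioping numbers come from the same command as before with the -Y (per-request data sync) flag dropped:

# ioping -c10 /dev/bcache0 -D -WWW -s4k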

In other words, it looks like the bcache layer inserts a wait between receiving the write, waiting for the disk's response, and returning the acknowledgment to the application. That increases latency and consequently reduces performance. I still think it must be some fine tuning (or not?).
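If it really is an extra wait in the acknowledgment path, I imagine a block-level trace of the bcache device during the fio run could show where the time goes (a sketch with blktrace; it needs debugfs mounted):

# blktrace -d /dev/bcache0 -o - | blkparse -i -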

Admittedly, small-block writes with the fsync and direct flags are not very common. They are typical of database servers and other data-center storage tools that must be sure the data is physically on the device immediately after each write operation. The problem is that these applications need a guarantee that the writes really happened, while disk caches are volatile memory: a power failure would lose any data that existed only in the cache. That is why each operation asks for the data to be written directly, bypassing the cache, with the acknowledgment coming back only after the write is done.

This makes write operations inherently slow.

Everything gets even slower when each operation is as small as 4K. For every 4K write requested, an instruction goes along with it telling the device not to hold the data in its (presumably volatile) cache but to write it immediately, with the confirmation coming back from the device only afterwards. This significantly increases latency.
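Just to illustrate the pattern, the same per-write sync behavior can be reproduced from the shell with dd (careful: this overwrites the start of the device under test; oflag=dsync forces a data sync after every 4K block):

# dd if=/dev/zero of=/dev/bcache0 bs=4k count=1000 oflag=direct,dsync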

That is why, in these environments, battery-backed RAID cards with cache are recommended: they can ignore the direct and fsync instructions and still guarantee the data, even on power failure, precisely because of the batteries.

Still, nowadays, with enterprise flash devices containing tantalum capacitors that act as a true built-in UPS, RAID cards, besides being expensive, are no longer considered that fast.

In the same way, flash devices with built-in supercapacitors can effectively ignore fsync/flush requests while still guaranteeing the write, even on power failure.

Writes on these devices thus become so fast that it hardly seems a physical write confirmation was requested for each operation. Synchronous database writes end up as fast as the plain cached writes that would naturally happen on a consumer flash disk.

But enterprise data-center flash disks are very expensive! So the idea was to use spinning disks for capacity and an enterprise data-center NVMe flash disk as a cache with bcache. In theory, bcache would always divert writes (especially small ones) straight to the NVMe drive, and I would benefit from the drive's low latency, high throughput, and high IOPS for most reads and writes.

Unfortunately, it is not working out as I imagined: something is limiting IOPS and adding far more latency than the enterprise flash disk itself produces.

I think it might be something I'm doing wrong in the configuration, or some fine tuning I don't know how to do.

Can someone help?

Thanks,
