Bcache with direct writes and fsync: why are IOPS limited?
Hello guys.
I'm trying to set up an NVMe flash disk as a cache for two or three individual spinning disks (I will use 2 TB disks, but in these tests I used a 1 TB one) on Linux 5.4.174 (a Proxmox node).
Testing the NVMe with fio, it performs well at 4K random writes, even with the direct and fsync flags.
root@pve-20:~# fio --filename=/dev/nvme0n1p2 --direct=1 --fsync=1 --rw=randwrite --bs=4K --numjobs=1 --iodepth=1 --runtime=10 --time_based --group_reporting --name=journal-test --ioengine=libaio
  write: IOPS=32.9k, BW=129MiB/s (135MB/s)(1286MiB/10001msec); 0 zone resets
  lat (nsec)   : 1000=0.01%
  lat (usec)   : 2=0.01%, 20=0.01%, 50=99.73%, 100=0.12%, 250=0.01%
  lat (usec)   : 500=0.02%, 750=0.11%, 1000=0.01%
  cpu          : usr=11.59%, sys=18.37%, ctx=329115, majf=0, minf=14
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,329119,0,329118 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=129MiB/s (135MB/s), 129MiB/s-129MiB/s (135MB/s-135MB/s), io=1286MiB (1348MB), run=10001-10001msec
But when I run the same test on the bcache device in writeback mode, performance drops a lot. It is still better than the spinning disks, of course, but much worse than accessing the NVMe device directly.
root@pve-20:~# fio --filename=/dev/bcache0 --direct=1 --fsync=1 --rw=randwrite --bs=4K --numjobs=1 --iodepth=1 --runtime=10 --time_based --group_reporting --name=journal-test --ioengine=libaio
  write: IOPS=1548, BW=6193KiB/s (6342kB/s)(60.5MiB/10001msec); 0 zone resets
  lat (usec)   : 50=0.41%, 100=31.42%, 250=66.20%, 500=1.01%, 750=0.31%
  lat (usec)   : 1000=0.15%
  lat (msec)   : 2=0.20%, 4=0.08%, 10=0.08%, 20=0.15%
  cpu          : usr=3.72%, sys=11.67%, ctx=44541, majf=0, minf=12

Run status group 0 (all jobs):
  WRITE: bw=6193KiB/s (6342kB/s), 6193KiB/s-6193KiB/s (6342kB/s-6342kB/s), io=60.5MiB (63.4MB), run=10001-10001msec

Disk stats (read/write):
    bcache0: ios=0/30596, merge=0/0, ticks=0/8492, in_queue=8492, util=98.99%, aggrios=0/16276, aggrmerge=0/0, aggrticks=0/4528, aggrin_queue=578, aggrutil=98.17%
  sdb: ios=0/2, merge=0/0, ticks=0/1158, in_queue=1156, util=5.59%
  nvme0n1: ios=1/32550, merge=0/0, ticks=1/7898, in_queue=0, util=98.17%
As we can see, the same test on the bcache0 device reached only 1548 IOPS, about 6.3 MB/s.
That is much more than any spinning HDD could give me, but many times less than the NVMe delivers directly.
In several tests, varying the number of jobs and the block size, I've noticed that the larger the blocks, the closer bcache gets to the physical device's performance. But the number of IOPS always seems capped somewhere around 1500-1800 (maximum). Increasing the number of jobs gives better totals, but if you divide the total IOPS by the number of jobs, each job is still limited to roughly 1500-1800 IOPS (a multi-job run of the kind I mean is sketched below).
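For reference, this is the sort of multi-job variant I used: the same command as above, only the numjobs value changes (4 here is just an illustrative choice):

fio --filename=/dev/bcache0 --direct=1 --fsync=1 --rw=randwrite --bs=4K \
    --numjobs=4 --iodepth=1 --runtime=10 --time_based --group_reporting \
    --name=journal-test --ioengine=libaio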
The commands used to configure bcache were:
# echo writeback > /sys/block/bcache0/bcache/cache_mode
# echo 0 > /sys/block/bcache0/bcache/sequential_cutoff
##
## Then I tried everything also with the commands below, but there was no improvement.
##
# echo 0 > /sys/fs/bcache/<cache set>/congested_read_threshold_us
# echo 0 > /sys/fs/bcache/<cache set>/congested_write_threshold_us
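For what it's worth, a quick way to double-check that these settings actually took effect is to read them back from sysfs (these are the standard bcache attributes; the paths match my setup):

# verify the backing-device settings
cat /sys/block/bcache0/bcache/cache_mode         # should show "[writeback]" selected
cat /sys/block/bcache0/bcache/sequential_cutoff  # should show 0.0k
cat /sys/block/bcache0/bcache/writeback_percent  # dirty-data target for background writeback
cat /sys/block/bcache0/bcache/state              # "clean" or "dirty" means the backing device is attached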
Monitoring with dstat, I can see that when the fio command starts, all the writing goes to the cache device (a second NVMe partition) until the end of the test. The spinning disk is only written to some time later, when reads appear on the NVMe and writes on the spinning disk (that is, the background transfer of dirty data). The kind of per-disk view I was watching is shown below.
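(Device names here are the ones from my setup; dirty_data is a standard bcache counter:)

dstat -d -D sdb,nvme0n1                     # per-disk read/write throughput
cat /sys/block/bcache0/bcache/dirty_data    # dirty data still waiting to be flushed to the spinning disk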
This means that the writeback cache mechanism appears to be working as it should, except for the performance limitation.
ioping also shows the limitation: the latency of the bcache0 device is around 1.5 ms, while the same test against the raw device (an NVMe partition) shows only 82.1 us.
root@pve-20:~# ioping -c10 /dev/bcache0 -D -Y -WWW -s4k

--- /dev/bcache0 (block device 931.5 GiB) ioping statistics ---
9 requests completed in 11.9 ms, 36 KiB written, 754 iops, 2.95 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 968.6 us / 1.33 ms / 1.60 ms / 249.1 us

root@pve-20:~# ioping -c10 /dev/nvme0n1p2 -D -Y -WWW -s4k

--- /dev/nvme0n1p2 (block device 300 GiB) ioping statistics ---
9 requests completed in 739.2 us, 36 KiB written, 12.2 k iops, 47.6 MiB/s
generated 10 requests in 9.00 s, 40 KiB, 1 iops, 4.44 KiB/s
min/avg/max/mdev = 63.5 us / 82.1 us / 95.1 us / 10.0 us
The cache was configured directly on one of the NVMe partitions (the first one). To rule out any problem with the method, I ran fio and ioping against an NVMe partition, against the raw unpartitioned device, against the first partition and the second, with and without bcache configured. Tests performed directly on the hardware device, without going through bcache, are always fast and consistent.
Tests through bcache, however, are always slower. With writethrough it is of course much worse, since performance equals that of the raw spinning disk.
Writeback improves things a lot, but still doesn't come close to the full speed of the NVMe.
I've also noticed a limit on sequential writes: a little more than half of the maximum write rate the NVMe device reaches in direct tests.
CPU usage doesn't seem to climb along with the tests either, so it doesn't look like a processing bottleneck.
Does anyone know what could be causing these limits?
Thanks
Comments
I think it must be some fine tuning.
I think this kernel feature (bcache) is not widely used, at least not in this way, because I'm having trouble finding feedback on the Internet. I didn't even know where to ask for help.
One curious thing I noticed is that the writes always land on the flash, never on the spinning disk. That is expected, and it should give the same fast response as the flash device itself; however, that is not what happens when going through bcache.
But when I remove the fsync flag from the fio test (the flag that makes the application wait for each write to be acknowledged), the 4K writes get much faster, reaching 73.6 MB/s and 17k IOPS. That is half the device's performance, but more than enough for my case. The fsync flag makes no significant difference when testing directly on the flash disk. The fact that bcache speeds up when fsync is removed makes me believe that bcache is not slow at writing; rather, for some reason it takes a while to acknowledge that the write is complete. I think that is the key point (the command I mean is shown below).
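For reference, this is the same fio command as before, just without --fsync=1; the run behind the numbers above was of this form:

fio --filename=/dev/bcache0 --direct=1 --rw=randwrite --bs=4K --numjobs=1 \
    --iodepth=1 --runtime=10 --time_based --group_reporting \
    --name=journal-test --ioengine=libaio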
Without fsync, the ioping tests also speed up, though less. There I can see the latency drop to somewhere around 600-700 us.
Nothing compared to the 84 us (4K ioping write) obtained when writing directly to the flash device (with or without fsync), but still much better than the 1.5 ms you get on the same bcache device once the fsync flag is added to wait for the write acknowledgement. The ioping variant without the sync flag is shown below.
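(Same ioping command as earlier, only without -Y, which is the flag that requests synchronous writes:)

ioping -c10 /dev/bcache0 -D -WWW -s4k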
In other words, it looks like the bcache layer inserts a wait between receiving the write, waiting for the disk's response, and then returning the completion to the application. That increases latency and consequently reduces performance. I think it must be some fine tuning (or not?).
In fact, small-block writes with the fsync and direct flags are not very common. They are typical of database servers and other datacenter storage tools that need to be sure the data is physically on the device immediately after each write operation. The problem is that these applications must guarantee the writes really happened, and disk caches are volatile memory that cannot guarantee them: a power failure would lose whatever existed only in the cache. That is why each operation asks for the data to be written directly, bypassing the cache, with the confirmation returned immediately.
This makes write operations inherently slow.
And everything is even slower when each operation is as small as 4K: for every 4K write, an instruction goes along with it asking that the data not stop in the (presumed volatile) disk cache but be written immediately, and the confirmation only comes back from the device afterwards. This significantly increases latency. A rough way to reproduce that pattern outside fio is sketched below.
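For illustration, a dd run like this issues the same kind of direct, per-block-synced 4K writes (it overwrites data on the target, just like the fio runs above; the device path is simply the one from my tests):

# each 4K block is written with O_DIRECT and synced before the next one starts
dd if=/dev/zero of=/dev/bcache0 bs=4k count=1000 oflag=direct,dsync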
That's why, in these environments, battery-backed RAID cards with cache are usually recommended: they can ignore the direct and fsync instructions and still guarantee the data survives a power failure, precisely because of the battery.
But nowadays, with enterprise flash devices containing tantalum capacitors that act as a true built-in UPS, RAID cards are not only expensive but also no longer considered that fast.
Flash devices with built-in supercapacitors work on the same principle: they can ignore fsync flags and still guarantee the write, even in case of power failure.
Writes to these devices therefore become so fast that it hardly seems a physical write confirmation was requested for each operation. They are as fast for databases as the ordinary writes that would naturally land in the cache of a consumer flash disk.
But enterprise datacenter flash disks are very expensive! So the idea was to use spinning disks for capacity and an enterprise datacenter flash disk (NVMe) as a cache with bcache. In theory, bcache would always divert writes (especially small ones) straight to the NVMe drive, and I would benefit from the drive's low latency, high throughput and high IOPS on most writes and reads.
Unfortunately, it isn't working out as I imagined: something is limiting the IOPS and adding far more latency than the enterprise flash disk itself produces.
It might be something I'm doing wrong in the configuration, or some fine tuning I don't know how to do.
Can anyone help?
Thanks,