Welcome to the Linux Foundation Forum!

Bcache writeback - cache all used.

Hey guys,

I'm using bcache to support Ceph. Ten cluster nodes each have a bcache device consisting of an HDD backing device and an NVMe cache. But I am noticing what I consider to be a problem: my cache is 100% used even though 80% of the space on my HDD is still free.

It is true that more data has been written than would fit in the cache. However, I would expect most of it to live only on the HDD and not in the cache, as it is cold data that is almost never used.

I noticed a significant drop in write performance on the disks and went to check. Benchmark tests confirmed it. I then saw that the cache was 100% full and 85% evictable, with a small amount of dirty data. I found a post online that mentioned the garbage collector, so I tried the following:

echo 1 > /sys/block/bcache0/bcache/cache/internal/trigger_gc

That doesn't seem to have helped.
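
To judge whether the trigger changes anything, it helps to read the same counters before and after it. The following is a minimal sketch, not a definitive tool: the function name `bcache_snapshot` is made up for illustration, and the default path assumes the device is `bcache0`. It reads `dirty_data` from the backing device and `cache_available_percent` from the cache set, and reports a file as not available rather than failing if it is missing.

```shell
# bcache_snapshot [DIR]: print a few bcache counters for one bcache device.
# DIR is the sysfs directory of the bcache device; the default below is an
# assumption (device bcache0) and may differ on your system.
bcache_snapshot() {
    dev_dir="${1:-/sys/block/bcache0/bcache}"
    for f in dirty_data cache/cache_available_percent; do
        if [ -r "$dev_dir/$f" ]; then
            printf '%s: %s\n' "$f" "$(cat "$dev_dir/$f")"
        else
            printf '%s: (not available)\n' "$f"
        fi
    done
}
```

Calling `bcache_snapshot` once before and once after writing to `trigger_gc` gives two snapshots that can be compared directly.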

Then I collected the following data:

--- bcache ---
Device /dev/sdc (8:32)
UUID 38e81dff-a7c9-449f-9ddd-182128a19b69
Block Size 4.00KiB
Bucket Size 256.00KiB
Congested? False
Read Congestion 0.0ms
Write Congestion 0.0ms
Total Cache Size 553.31GiB
Total Cache Used 547.78GiB (99%)
Total Unused Cache 5.53GiB (1%)
Dirty Data 0B (0%)
Evictable Cache 503.52GiB (91%)
Replacement Policy [lru] fifo random
Cache Mode writethrough [writeback] writearound none
Total Hits 33361829 (99%)
Total Misses 185029
Total Bypass Hits 6203 (100%)
Total Bypass Misses 0
Total Bypassed 59.20MiB
--- Cache Device ---
Device /dev/nvme0n1p1 (259:1)
Size 553.31GiB
Block Size 4.00KiB
Bucket Size 256.00KiB
Replacement Policy [lru] fifo random
Discard? False
I/O Errors 0
Metadata Written 395.00GiB
Data Written 1.50 TiB
Buckets 2266376
Cache Used 547.78GiB (99%)
Cache Unused 5.53GiB (0%)
--- Backing Device ---
Device /dev/sdc (8:32)
Size 5.46TiB
Cache Mode writethrough [writeback] writearound none
Readahead
Sequential Cutoff 0B
Sequential merge? False
state clean
Writeback? true
Dirty Data 0B
Total Hits 32903077 (99%)
Total Misses 185029
Total Bypass Hits 6203 (100%)
Total Bypass Misses 0
Total Bypassed 59.20MiB

The dirty data has disappeared, but cache utilization remains at 99%, down just 1%. Meanwhile, the evictable cache has increased to 91%!
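
As a quick sanity check of the percentages in the status output above, here is a small sketch that recomputes them from the raw figures. The helper name `pct` is made up for illustration; the numbers are copied from the output as posted.

```shell
# pct PART TOTAL: print PART as a whole-number percentage of TOTAL.
pct() {
    awk -v part="$1" -v total="$2" 'BEGIN { printf "%.0f\n", 100 * part / total }'
}

# Figures copied from the status output above (sizes in GiB).
total=553.31
used=547.78
evictable=503.52
hits=33361829
misses=185029

echo "used:      $(pct "$used" "$total")%"
echo "evictable: $(pct "$evictable" "$total")%"
echo "hit ratio: $(pct "$hits" "$((hits + misses))")%"
```

The recomputed values match the tool's own rounding (99% used, 91% evictable, 99% hits), so the used and evictable figures overlap: most of the used space is counted as evictable.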

My impression is that this hurts write caching. That is, if I need to write again, the data goes straight to the HDDs, as there is no space available in the cache.

Shouldn't bcache remove the least used part of the cache?

Does anyone know why this isn't happening?

I may be talking nonsense, but isn't there a way to tell bcache to automatically keep a certain percentage of the cache free for writes? Or even manually, by some command that I could trigger at times of low disk activity?

Thanks!

Comments

  • Whenever the cache decides to keep some new data and there is no free space left in the cache, it makes space according to the current replacement policy. Your output shows that you use the "lru" policy. This means that the least recently used item is evicted from the cache when space is needed.

    Having lots of cold data in the cache is not a problem for speed as long as the data you consider hot is actually hotter than the cold data, i.e. used more often. The LRU policy then ensures that cold data is evicted first when new data arrives. If the new data is itself cold, the idea is that LRU picks it for eviction fairly soon and keeps the hot data, which is requested frequently. This fails, however, if the amount of cold data written is very large and the hot data is not requested often enough: the hot data then becomes less recently used than the cold data and is evicted. There is also the problem that writing a lot of cold data to the cache wears down your SSD unnecessarily, so this may be undesirable even if the hot data manages to stay in the cache, e.g. with 20 GB of hot data, a 500 GB cache and only 450 GB of cold data written. You could address the eviction problem by switching to the "random" policy. As most data in your cache is cold, new data is then likely to replace cold data.
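
    For reference, a minimal sketch of how the policy could be inspected and switched through sysfs. The helper names are made up for illustration, and the default path is an assumption (device bcache0, first cache device of its cache set); check where cache_replacement_policy lives on your system before using it.

    ```shell
    # Print the active entry of a bracketed sysfs selection,
    # e.g. "[lru] fifo random" -> "lru".
    active_choice() {
        sed -n 's/.*\[\([^]]*\)].*/\1/p'
    }

    # show_policy [FILE]: print the active replacement policy.
    # The default path is an assumption, not a guaranteed location.
    show_policy() {
        active_choice < "${1:-/sys/block/bcache0/bcache/cache/cache0/cache_replacement_policy}"
    }

    # set_policy POLICY [FILE]: select lru, fifo or random.
    # Writing to the real sysfs file requires root.
    set_policy() {
        printf '%s\n' "$1" > "${2:-/sys/block/bcache0/bcache/cache/cache0/cache_replacement_policy}"
    }
    ```

    With these defined, running show_policy, then set_policy random, then show_policy again should show the switch, provided the path assumption holds.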

    To reduce the amount of new data that is added to the cache, you can set the sequential_cutoff parameter of your bcache device. Your current cutoff of zero means that the cutoff is disabled and all data is added to the cache, which explains the high cache utilisation. Depending on your load, a low but non-zero cutoff can keep some or most of your hot data out of the cache, since some hot data may arrive in one large contiguous read or write operation, e.g. if Ceph bundles metadata writes. Furthermore, bcache tries to detect tasks, such as backups, that read lots of hot and cold data, based on the average request size, and does not add their data to the cache. This task detection may be a problem if bcache treats all of the Ceph system as a single task and cannot see the underlying tasks, which may be running on a different cluster node.
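
    A minimal sketch of changing the cutoff, assuming the device is bcache0. The helper name set_cutoff is made up for illustration, and 4M in the usage below is only an example value to experiment with, not a recommendation for a Ceph workload.

    ```shell
    # set_cutoff SIZE [FILE]: write a new sequential cutoff and read it back.
    # SIZE may be 0 (cutoff disabled) or a size such as 4M.
    # The default path assumes device bcache0; writing to it requires root.
    set_cutoff() {
        cutoff_file="${2:-/sys/block/bcache0/bcache/sequential_cutoff}"
        printf '%s\n' "$1" > "$cutoff_file" && cat "$cutoff_file"
    }
    ```

    For example, set_cutoff 4M enables a cutoff and set_cutoff 0 restores the current behaviour of caching everything.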

    To experiment, you'd want to detach, stop, reset and re-attach the caches and reset the Ceph nodes, in particular CephFS metadata nodes and compute nodes (which also cache some CephFS filesystem data), and you'd need a realistic test load. This can be quite a challenge. A more practical solution is likely to be to leave the cutoff at 0 and to ensure the SSDs are large enough not to lose substantial amounts of hot data when lots of cold data arrives or is being read.

    Also, if you haven't done so already, you should look into your Ceph configuration to separate the storage of CephFS metadata and file data so that you can use different bcache settings for each (or use SSDs exclusively for metadata).
