There have been a couple of recent inquiries from the field about OneFS read cache persistence, and it seemed like a topic that might be of interest to a wider audience.
Firstly, a quick OneFS caching architecture review…
The Level 1 cache (L1), or front-end cache, is memory that is nearest to the protocol layers (e.g. NFS, SMB, etc) used by clients, or initiators, connected to that node. The primary purpose of L1 cache is to prefetch data from remote nodes.
Level 2 cache (L2), or back-end cache, refers to local memory on the node on which a particular block of data is stored. L2 cache is globally accessible from any node in the cluster and is used to reduce the latency of a read operation by avoiding a seek directly from the disk drives.
Also known as SmartFlash, the level 3 cache (L3) refers to a subsystem which caches evicted L2 blocks on one or more SSDs on the node owning the L2 blocks. Unlike L1 and L2, not all nodes or clusters have an L3 cache, since it requires solid state drives (SSDs) to be present and exclusively reserved and configured for caching use.
L3 serves as a cost-effective method of extending a node’s cache from gigabytes to terabytes. This allows clients to retain a larger working set of data in cache, before being forced to retrieve data from higher latency spinning disk (HDD). The L3 cache is populated with “interesting” L2 blocks that are being dropped from memory. Since L3 is based on persistent flash storage, once the cache is populated, or warmed, it’s highly durable and persists across node reboots, etc.
Having an efficient eviction and replacement policy is absolutely vital for cache performance. This is evident in OneFS, where each level of the cache hierarchy employs a different strategy for eviction, tailored to the attributes of that cache type.
For L1 cache in storage nodes, cache aging is based on a drop-behind algorithm. The L2 cache uses a Least Recently Used algorithm (LRU), since it is relatively simple to implement, low-overhead, and performs well in general. By contrast, the L3 cache employs a first-in, first-out eviction policy (or FIFO), since it’s writing to what is effectively a specialized linear filesystem on SSD.
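The contrast between the L2 and L3 policies can be illustrated with a toy sketch. This is not OneFS code; the class names and capacities are hypothetical, purely to show how an LRU cache refreshes a block's position on every hit, while a FIFO log evicts strictly in insertion order:

```python
from collections import OrderedDict, deque

class LRUCache:
    """Toy L2-style cache: evicts the least recently used block."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def access(self, key, value):
        if key in self.blocks:
            self.blocks.move_to_end(key)     # a hit refreshes recency
        self.blocks[key] = value
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used

class FIFOCache:
    """Toy L3-style cache: evicts in insertion order, like a linear log on SSD."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.order = deque()
        self.blocks = {}

    def insert(self, key, value):
        if key not in self.blocks:
            self.order.append(key)
        self.blocks[key] = value
        if len(self.order) > self.capacity:
            oldest = self.order.popleft()    # evict the first-in block
            del self.blocks[oldest]
```

With a capacity of two, accessing blocks a, b, a, c in the LRU cache evicts b (a was refreshed), whereas inserting a, b, c into the FIFO cache evicts a regardless of any re-access.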
Note: One drawback of LRU is that it is not scan resistant. For example, a OneFS Job Engine job or backup process that scans a large amount of data can cause the L2 cache to be flushed. This can be mitigated to a large degree by the L3 cache.
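The scan-resistance problem can be demonstrated with a miniature LRU, sketched here under hypothetical names and sizes (a real L2 cache is vastly larger, but the behavior is the same):

```python
from collections import OrderedDict

class LRU:
    """Hypothetical miniature LRU cache to show scan behavior."""
    def __init__(self, capacity):
        self.capacity, self.d = capacity, OrderedDict()
    def touch(self, key):
        if key in self.d:
            self.d.move_to_end(key)          # refresh recency on a hit
        self.d[key] = True
        if len(self.d) > self.capacity:
            self.d.popitem(last=False)       # evict least recently used

cache = LRU(capacity=4)
for block in ["hot1", "hot2", "hot1", "hot2"]:  # working set stays warm
    cache.touch(block)
for block in range(100):                        # backup-style one-pass scan
    cache.touch(f"scan{block}")
print("hot1" in cache.d)  # False: the scan has flushed the hot data
```

The one-pass scan touches each block exactly once, yet still evicts the frequently accessed working set, which is exactly the case where an L3 cache can catch the fallout.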
Effective caching is all about keeping hot data hot! So the most frequently accessed data and metadata on a node should just remain in L2 cache and not get evicted to L3.
For the next tier of cached data that’s accessed frequently enough to live in L3, but not frequently enough to always live in RAM, there’s a mechanism in place to keep these semi-frequently accessed blocks in L3.
To maintain this L3 cache persistence, when the kernel goes to read a metadata or data block, the following process occurs:
1) First, L1 cache is checked. Then, if there's no hit, L2 cache is consulted.
2) If a hit is found in memory, the read is satisfied there.
3) If not in memory, L3 is then checked.
4) If there's an L3 hit, and that item is near the end of the L3 FIFO (the last 10%), a flag is set on the block, which causes it to be written into L3 again when it is evicted out of L2.
This marking process helps guard against the purely chronological eviction of blocks that are still being accessed while they sit in the last 10% of the cache, and serves to keep the most useful data in cache.
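The lookup flow and the last-10% re-admission flag can be sketched as follows. This is a simplified illustration, not OneFS internals; the function and class names are hypothetical, and "near the end of the FIFO" is modeled as the oldest tenth of the queue:

```python
from collections import deque

class L3Fifo:
    """Toy L3: a FIFO of block ids; the left end is oldest (next to be evicted)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.fifo = deque()

    def insert(self, block):
        self.fifo.append(block)
        if len(self.fifo) > self.capacity:
            self.fifo.popleft()              # FIFO eviction

    def hit_near_eviction(self, block):
        """True if the block sits in the oldest 10% of the FIFO."""
        if block not in self.fifo:
            return False
        return self.fifo.index(block) < max(1, len(self.fifo) // 10)

def read_block(block, l1, l2, l3, readmit_flags):
    # 1) check L1, then 2) L2 -- a hit in memory satisfies the read
    if block in l1 or block in l2:
        return "memory hit"
    # 3) check L3
    if block in l3.fifo:
        # 4) a hit near the FIFO's eviction end flags the block so it is
        #    re-cached in L3 when it later falls out of L2
        if l3.hit_near_eviction(block):
            readmit_flags.add(block)
        l2.add(block)                        # promote back into memory
        return "l3 hit"
    return "disk read"
```

In this sketch, a hit on the oldest block sets its re-admission flag, a hit in the younger 90% does not, and a block absent from every tier falls through to a disk read.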
[Figure: Read process on the local node]
[Figure: Read process on a remote node]