NOTE: This topic is part of the Uptime Information Hub.
The caching of data and metadata is a common practice used to deliver high performance and low latency in storage systems. Caching typically involves keeping the most frequently accessed data in memory, and providing faster access by intelligently predicting how content will be accessed and the parts of a dataset that will be required.
For versions of OneFS prior to OneFS 7.1.1, the EMC Isilon OneFS caching infrastructure exclusively leveraged the system memory (RAM) and nonvolatile memory (NVRAM) in each node. However, with OneFS 7.1.1, it's now possible to enable the caching subsystem—known as EMC Isilon OneFS SmartFlash or level 3 cache (L3 cache)—to use solid-state drives (SSDs) and make use of the increased capacity, affordability, and storage persistence that they offer.
OneFS Caching Overview
OneFS uses up to three levels of read cache, plus an NVRAM-backed write cache, or write coalescer, and endurant cache. These caching subsystems, and their high-level interaction, are illustrated in the following OneFS caching hierarchy diagram.
The first two types of read cache, level 1 (L1) and level 2 (L2), are RAM-based and analogous to the cache used in CPUs. These two cache layers are present in all Isilon storage nodes.
The optional third tier of read cache, L3 cache, is configurable on nodes that contain solid-state drives (SSDs). L3 cache contains file data and metadata blocks evacuated from L2 cache through L2's least recently used (LRU) cache eviction algorithm, effectively increasing L2 cache capacity. Like L2 cache, L3 cache is node-specific and caches only data associated with the specified node.
All metadata reads and nonsequential data reads of 128 KB or less are cached in L3. Unlike L2 cache, L3 cache resides on SSD rather than in RAM, so L3 cached data is durable: it survives a node reboot without needing to be repopulated.
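The L2-to-L3 interaction described above can be sketched as a simple two-level LRU model. This is a conceptual illustration only: the class and method names are invented for this sketch, and the real OneFS implementation is far more sophisticated.

```python
from collections import OrderedDict

class TwoLevelCache:
    """Conceptual model: blocks evicted from L2 (LRU) are demoted to L3."""

    def __init__(self, l2_size, l3_size):
        self.l2 = OrderedDict()  # RAM-backed, volatile
        self.l3 = OrderedDict()  # SSD-backed, survives reboot
        self.l2_size, self.l3_size = l2_size, l3_size

    def read(self, block, load_from_disk):
        if block in self.l2:                # L1/L2 hit: refresh recency
            self.l2.move_to_end(block)
            return self.l2[block]
        if block in self.l3:                # L3 hit: promote back into L2
            value = self.l3.pop(block)
        else:                               # miss: go to HDD
            value = load_from_disk(block)
        self._insert_l2(block, value)
        return value

    def _insert_l2(self, block, value):
        self.l2[block] = value
        if len(self.l2) > self.l2_size:
            old_block, old_value = self.l2.popitem(last=False)  # LRU eviction
            self.l3[old_block] = old_value                      # demote to L3
            if len(self.l3) > self.l3_size:
                self.l3.popitem(last=False)                     # drop oldest
```

In this model, a re-read of a block that has aged out of L2 is served from L3 rather than from disk, which is the latency benefit the sections below describe.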
L3 Cache Benefits
L3 cache delivers good out-of-the-box performance over time for a wide variety of workloads. Because L3 cache is persistent and survives reboots, it can also help repopulate L2 cache after a reboot without first having to go to disk.
Although the benefit of L3 caching is highly workflow-dependent, the following advantages are universal:
- L3 cache typically provides more benefit for random I/O workflows than for sequential workloads. Random I/O workflows benefit most from L3 cache because these workflows contain large numbers of repeated random read I/Os smaller than 128 KB.
- Repeated random read workloads—such as VMDK virtualization access, database file access, web hosting (for example, news sites: previous week's news in L3 cache, today's headline in L2 cache), and home directories—will typically benefit most from L3 cache through latency improvements.
- The file system-based SmartPools SSD metadata-write strategy may be the better choice for heavy namespace write workloads, for example, EDA and some HPC workloads.
- L3 cache can help workloads that contain streaming, sequential, and random I/O by serving the random I/O from L3. This will free up the spindles to serve the streaming, sequential I/O.
L3 Cache Sizing
Using SSDs for L3 cache makes sizing SSD capacity more straightforward and less error-prone. This approach requires considerably less management overhead compared with configuring certain portions of a dataset to use SSDs as a storage tier within the file system.
The first step in an L3 cache/SSD sizing exercise is to determine the size of the active data, or working set, for your environment: that is, the amount of SSD space required to hold the working dataset. You can measure this by using the isi_cache_stats command to periodically capture L2 cache statistics on an existing cluster.
Run the following commands based on the workload activity cycle: once at the start of your cluster's workload jobs and again at job end. First run isi_cache_stats -c to reset, or zero out, the counters. Then run isi_cache_stats -v at workload activity completion and save the output. The L2 cache miss rates for both data and metadata on a single node indicate the size of the working dataset.
L2 cache miss counters are reported as 8 KB blocks, so an L2_data_read.miss value of 1024 blocks represents 8 MB of actual missed data.
The formula for calculating the working set size is:
(L2_data_read.miss + L2_meta_read.miss) = working_set size
When the working set size has been calculated, a good rule of thumb is to size L3 SSD capacity per node according to the following formula:
L2 capacity + L3 capacity >= 150% of working set size
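Putting the two formulas together, a sizing calculation might look like the following sketch. The miss counter values and the L2 capacity are made up for illustration; real values come from the isi_cache_stats -v output captured above.

```python
BLOCK_SIZE_KB = 8  # L2 miss counters are reported in 8 KB blocks

# Hypothetical counter values captured at the end of a workload cycle.
l2_data_read_miss = 1_310_720   # blocks
l2_meta_read_miss = 262_144     # blocks

# Working set size = L2_data_read.miss + L2_meta_read.miss, converted to GB.
working_set_gb = (l2_data_read_miss + l2_meta_read_miss) * BLOCK_SIZE_KB / (1024 * 1024)

# Rule of thumb: L2 capacity + L3 capacity >= 150% of working set size.
l2_capacity_gb = 4.0            # example RAM available to L2 on this node
required_l3_gb = 1.5 * working_set_gb - l2_capacity_gb

print(f"Working set: {working_set_gb:.0f} GB")
print(f"L3 SSD capacity needed per node: at least {required_l3_gb:.0f} GB")
```

With these example counters, the working set is 12 GB, so L2 plus L3 should total at least 18 GB; with 4 GB of L2, that means at least 14 GB of L3 SSD capacity on this node.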
L3 Cache Tips and Best Practices
- All new clusters with SSDs running OneFS 7.1.1 and later enable L3 cache by default. Note that if you upgrade nodes to OneFS 7.1.1 from an earlier version, you'll need to enable L3 cache manually.
- When L3 cache becomes full and new metadata or user data blocks are loaded into L3 cache, the oldest existing data blocks, followed by metadata blocks, are flushed from L3 cache.
- Analyze workflows that currently use SmartPools to place user data onto SSDs before using the SSDs for L3 cache. Make sure to understand the read/write ratio of access to the data placed on SSD by SmartPools. L3 is a read cache, so it can provide similar read performance for that data, but will not cache writes.
- As a rule, L3 cache performs better with more available SSD space. For best performance, use a small number of large-capacity SSDs (ideally, no more than two) rather than multiple smaller SSDs.
- L3 cache uses all available SSD space over time. This means that the benefits of L3 caching will take some time to be fully realized. How long depends on your workflow.
- There are diminishing returns for L3 cache after a certain point: once the SSD capacity comfortably exceeds the working set size, additional capacity no longer increases the cache hit rate. On the other hand, compared with SmartPools SSD strategies, using SSDs for L3 cache means that performance degrades much more gradually if metadata does happen to exceed the available SSD capacity.
- Because of these diminishing returns, adding more SSD space once you have enough to hold the working set will not further increase the cache hit rate. For this reason, L3 cache is not applicable for nodes containing 16 or more SSDs. Additionally, all-SSD node pools (for example, X200 or other 12-drive nodes populated entirely with SSDs) are not eligible for L3 cache, because L3 relies on HDD storage to cache from.
- If your workload requests chunks of a file larger than 128 KB, that data will not be cached in L3 cache.
- If your workload requests 128 KB or less at a time, but those requests eventually read the entire file sequentially in small chunks, L3 cache will detect this and stop caching the reads.
- Workloads that include a high level of latency-sensitive metadata read activity can benefit from configuring the SSDs for use as Metadata Read or L3 cache. Although Metadata Read guarantees that there will be a copy of metadata on SSDs, L3 is a cache, so depending on the workflow, metadata may move on and off SSDs over time. Yet even with very small data hit rates, L3 cache can reduce HDD accesses significantly. This is especially true for mixed workloads.
- You also need to consider the metadata working set size. If the available SSD capacity is larger than required to hold a complete copy of the metadata, then L3 cache will enable the use of that additional capacity to accelerate data reads. If the available SSD capacity is smaller than is required to hold a complete copy of the metadata, L3 will almost always outperform Metadata Read. This is why HD400 and certain NL configurations are forced to use SSDs for L3.
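The decision logic in the bullet above can be summarized in a small sketch. The function name and parameters are illustrative only, not a OneFS API; the inputs would come from your own measurement of SSD capacity and metadata size.

```python
def recommend_ssd_strategy(ssd_capacity_gb: float, metadata_size_gb: float) -> str:
    """Compare available SSD capacity to the metadata working set size."""
    if ssd_capacity_gb < metadata_size_gb:
        # SSDs cannot hold a full copy of the metadata:
        # L3 will almost always outperform Metadata Read.
        return "L3 cache"
    # SSDs can hold all metadata; L3 uses the spare capacity to
    # accelerate data reads, while Metadata Read guarantees a
    # metadata copy on SSD. Either strategy can work here.
    return "L3 cache (spare capacity accelerates data reads) or Metadata Read"
```

This mirrors why nodes with small SSD-to-metadata ratios, such as the HD400, are forced to use L3.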
- You can leverage the cache statistics reports in InsightIQ, or run the following isi_cache_stats command on the cluster, to help monitor L3 performance.
isi_for_array -s isi_cache_stats -v | grep -A3 l3
To learn more about L3 cache in OneFS, see EMC Isilon OneFS SmartFlash.