The freshly minted OneFS 8.2.1 release introduces in-line deduplication to Isilon’s portfolio as part of the in-line data reduction suite, and is available on a cluster with the following characteristics:
- F810 cluster or node pool
- 40 Gb/s Ethernet backend
- Running OneFS 8.2.1
When in-line data reduction is enabled on a cluster, data from network clients is accepted as is and makes its way through the OneFS write path until it reaches the BSW engine, where it is broken up into individual chunks. The in-line data reduction write path comprises three main phases:
- Zero Block Removal
- In-line Deduplication
- In-line Compression
If both in-line compression and deduplication are enabled on a cluster, zero block removal is performed first, followed by dedupe, and then compression. This ordering allows each phase to reduce the scope of work for each subsequent phase.
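The three-phase ordering can be sketched as follows. This is a minimal illustrative model, not OneFS internals: blocks that are all-zero drop out first, surviving blocks are checked against an in-memory dedupe index (with verification, since the fingerprint alone is not trusted), and only what remains is compressed.

```python
# Simplified sketch of the in-line data reduction pipeline ordering:
# zero block removal, then dedupe, then compression. All names are
# illustrative assumptions, not OneFS internals.
import zlib

BLOCK_SIZE = 8192

def reduce_blocks(blocks, dedupe_index):
    """Return (kept_blocks, stats) after the three phases."""
    stats = {"zero": 0, "deduped": 0, "compressed": 0}
    kept = []
    for block in blocks:
        # Phase 1: zero block removal - nothing further to do for this block.
        if block == b"\x00" * BLOCK_SIZE:
            stats["zero"] += 1
            continue
        # Phase 2: in-line dedupe against an in-memory index
        # (verify the data, not just the hash).
        h = hash(block)  # stand-in for the real 128-bit fingerprint
        if dedupe_index.get(h) == block:
            stats["deduped"] += 1
            continue
        dedupe_index[h] = block
        # Phase 3: in-line compression of whatever survives.
        kept.append(zlib.compress(block))
        stats["compressed"] += 1
    return kept, stats
```

Each earlier phase removes blocks from consideration, so the later, more expensive phases touch less data.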
The in-line data reduction zero block removal phase detects blocks that contain only zeros and prevents them from being written to disk. This both reduces disk space requirements and avoids unnecessary writes to SSD, resulting in increased drive longevity.
Zero block removal occurs first in the OneFS in-line data reduction process. As such, it has the potential to reduce the amount of work that both in-line deduplication and compression need to perform. The check for zero data does incur some overhead. However, for blocks that contain non-zero data the check is terminated on the first non-zero data found, which helps to minimize the impact.
The following characteristics are required for zero block removal to occur:
- A full 8KB block of zeroes
- A partial block of zeroes being written to a sparse or preallocated block
In the latter case, the write will convert the block to sparse if it is not already. A partial block of zeroes being written to a non-sparse, non-preallocated block will not be zero eliminated.
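The eligibility rules above, together with the early-exit scan described earlier, can be sketched as follows. The function names and the boolean flag are illustrative assumptions.

```python
# Illustrative sketch of zero block elimination eligibility; not OneFS code.
BLOCK_SIZE = 8192

def is_zero_block(block: bytes) -> bool:
    """Scan a block for non-zero data, stopping at the first
    non-zero byte so blocks holding real data pay almost nothing."""
    for byte in block:
        if byte != 0:
            return False  # early exit on first non-zero byte
    return True

def should_zero_eliminate(block: bytes, target_is_sparse_or_prealloc: bool) -> bool:
    """Apply the two eligibility rules described above."""
    # A full 8KB block of zeroes always qualifies.
    if len(block) == BLOCK_SIZE and is_zero_block(block):
        return True
    # A partial run of zeroes qualifies only when the destination
    # block is sparse or preallocated.
    return target_is_sparse_or_prealloc and is_zero_block(block)
```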
While Isilon has offered a native file system deduplication solution for several years, until OneFS 8.2.1 this was always accomplished by scanning the data after it has been written to disk, or post-process. With in-line data reduction, deduplication is now performed in real time as data is written to the cluster. Storage efficiency is achieved by scanning the data for identical blocks as it is received and then eliminating the duplicates.
When a duplicate block is discovered, in-line deduplication moves a single copy of the block to a special set of files known as shadow stores. OneFS shadow stores are file system containers that allow data to be stored in a sharable manner. As such, files on OneFS can contain both physical data and pointers, or references, to shared blocks in shadow stores.
Shadow stores were first introduced in OneFS 7.0, initially supporting Isilon OneFS file clones, and there are many overlaps between cloning and deduplicating files. The other main consumer of shadow stores is OneFS Small File Storage Efficiency. This feature maximizes the space utilization of a cluster by decreasing the amount of physical storage required to house the small files that comprise a typical healthcare dataset.
Shadow stores are similar to regular files but are hidden from the file system namespace, so cannot be accessed via a pathname. A shadow store typically grows to a maximum size of 2GB, which is around 256K blocks, with each block able to be referenced by 32,000 files. If the reference count limit is reached, a new block is allocated, which may or may not be in the same shadow store. Additionally, shadow stores do not reference other shadow stores. And snapshots of shadow stores are not permitted because the data contained in shadow stores cannot be overwritten.
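The reference-count behavior can be modeled with a toy container: each block may be referenced up to 32,000 times, a store holds at most ~256K blocks (about 2GB of 8KB blocks), and once a block's limit is reached a fresh copy is allocated, possibly in another store. This is a sketch of the bookkeeping only, not the on-disk format.

```python
# Toy model of shadow store block reference limits; illustrative only.
MAX_REFS_PER_BLOCK = 32000
MAX_BLOCKS_PER_STORE = 256 * 1024   # ~2GB of 8KB blocks

class ShadowStore:
    def __init__(self):
        self.blocks = []  # list of [data, refcount] pairs

    def add_reference(self, data: bytes) -> int:
        """Reference `data`, reusing an existing block when its
        reference count allows; return the referenced block's index."""
        for i, (blk, refs) in enumerate(self.blocks):
            if blk == data and refs < MAX_REFS_PER_BLOCK:
                self.blocks[i][1] += 1
                return i
        if len(self.blocks) >= MAX_BLOCKS_PER_STORE:
            # In the real system a new shadow store would be used.
            raise RuntimeError("store full: allocate in a new shadow store")
        self.blocks.append([data, 1])
        return len(self.blocks) - 1
```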
When a client writes a file to an F810 node pool on a cluster, the write operation is divided up into whole 8KB blocks. Each of these blocks is then hashed and its ‘fingerprint’ compared against an in-memory index for a match. At this point, one of the following operations will occur:
1) If a match is discovered with an existing shadow store block, a byte-by-byte comparison is performed. If the comparison is successful, the data is removed from the current write operation and replaced with a shadow reference.
2) When a match is found with another LIN, the data is written to a shadow store instead and replaced with a shadow reference. Next, a work request is generated and queued that includes the location for the new shadow store block, the matching LIN and block, and the data hash. A byte-by-byte data comparison is performed to verify the match and the request is then processed.
3) If no match is found, the data is written to the file natively and the hash for the block is added to the in-memory index.
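A simplified sketch of this decision is below. It collapses outcomes 1 and 2 into a single "shared" path, uses MD5 purely as a stand-in fingerprint, and models the shadow store as a plain dict; all of these are assumptions for illustration, not OneFS internals.

```python
# Sketch of the write-path dedupe decision: look up the block's
# fingerprint, verify byte-by-byte on a hit, otherwise write natively
# and index the new block. Illustrative only.
import hashlib

def fingerprint(block: bytes) -> bytes:
    return hashlib.md5(block).digest()  # placeholder for the real 128-bit hash

def write_block(block, index, shadow_store, file_blocks):
    fp = fingerprint(block)
    match = index.get(fp)
    if match is not None and match == block:  # byte-by-byte verification
        # Outcomes 1/2: store one shared copy, record a shadow reference.
        shadow_store.setdefault(fp, block)
        file_blocks.append(("shadow_ref", fp))
        return "shared"
    # Outcome 3: no match - write the data natively and index it.
    index[fp] = block
    file_blocks.append(("data", block))
    return "written"
```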
In order for in-line deduplication to be performed on a write operation, the following conditions need to be true:
- In-line dedupe must be globally enabled on the cluster.
- The current operation is writing data (i.e. not a truncate or write zero operation).
- The ‘no_dedupe’ flag is not set on the file.
- The file is not a special file type, such as an alternate data stream (ADS) or an EC (endurant cache) file.
- Write data includes fully overwritten and aligned blocks.
- The write is not part of a rehydrate operation.
- The file has not been packed (containerized) by SFSE (small file storage efficiency).
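The checklist above amounts to a single predicate over the state of the write. A hedged sketch, with illustrative field names (these are not OneFS flag names):

```python
# The in-line dedupe eligibility conditions as one predicate.
# All field names here are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class WriteOp:
    cluster_inline_dedupe_enabled: bool
    is_data_write: bool                       # not a truncate / write-zero op
    file_no_dedupe_flag: bool
    is_special_file: bool                     # ADS or endurant cache (EC) file
    blocks_fully_overwritten_and_aligned: bool
    is_rehydrate: bool
    file_packed_by_sfse: bool                 # containerized small file

def eligible_for_inline_dedupe(op: WriteOp) -> bool:
    return (op.cluster_inline_dedupe_enabled
            and op.is_data_write
            and not op.file_no_dedupe_flag
            and not op.is_special_file
            and op.blocks_fully_overwritten_and_aligned
            and not op.is_rehydrate
            and not op.file_packed_by_sfse)
```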
OneFS in-line deduplication uses the fast 128-bit CityHash algorithm. CityHash is not a cryptographic hash, which is why candidate matches are always verified with a byte-by-byte comparison before blocks are shared. This is in contrast to post-process SmartDedupe, which uses SHA-1 hashing.
Each F810 node in a cluster with in-line dedupe enabled has its own in-memory hash index that it compares block ‘fingerprints’ against. The index lives in system RAM and is allocated using physically contiguous pages and accessed directly with physical addresses. This avoids the need to traverse virtual memory mappings and does not incur the cost of translation lookaside buffer (TLB) misses, minimizing deduplication performance impact.
The maximum size of the hash index is governed by a pair of sysctl settings, one of which caps the size at 16GB, and the other which limits the maximum size to 10% of total RAM. The strictest of these two constraints applies. While these settings are configurable, the recommended best practice is to use the default configuration. Any changes to these settings should only be performed under the supervision of Isilon support.
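The effective cap is simply the stricter of the two limits, i.e. the minimum of the 16GB ceiling and 10% of total RAM. The sysctl names are not public, so this only models the arithmetic:

```python
# Effective hash index cap: the stricter of a 16GB absolute ceiling
# and 10% of total RAM. Illustrative arithmetic only.
ABSOLUTE_CAP_BYTES = 16 * 1024**3   # 16GB

def max_index_size(total_ram_bytes: int) -> int:
    return min(ABSOLUTE_CAP_BYTES, total_ram_bytes // 10)
```

For example, a node with 64GB of RAM is limited by the 10% rule (~6.4GB), while a node with 256GB of RAM hits the 16GB ceiling first.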
Since in-line dedupe and SmartDedupe use different hashing algorithms, the indexes for each are not shared directly. However, each dedupe solution can leverage the work performed by the other. For instance, if SmartDedupe writes data to a shadow store, the read hashing component of in-line dedupe will see those blocks when they are read and index them.
When a match is found, in-line dedupe performs a byte-by-byte comparison of each block to be shared to avoid the potential for a hash collision. Data is prefetched prior to the byte-by-byte check and then compared against the L1 cache buffer directly, avoiding unnecessary data copies and adding minimal overhead. Once the matching blocks have been compared and verified as identical, they are then shared by writing the matching data to a common shadow store and creating references from the original files to this shadow store.
In-line dedupe samples every whole block written and handles each block independently, so it can aggressively locate duplicate blocks. If a contiguous run of matching blocks is detected, in-line dedupe will merge the results into regions and process them efficiently.
In-line dedupe also detects dedupe opportunities from the read path, and blocks are hashed as they are read into L1 cache and inserted into the index. If an existing entry exists for that hash, in-line dedupe knows there is a block sharing opportunity between the block it just read and the one previously indexed. It combines that information and queues a request to an asynchronous dedupe worker thread. As such, it is possible to deduplicate a data set purely by reading it all. To help mitigate the performance impact, the hashing is performed out-of-band in the prefetch path, rather than in the latency-sensitive read path.
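The read-path mechanism can be sketched as a hash-index insert that, on a hit against a different location, queues work for an asynchronous dedupe worker rather than doing anything in the read path itself. Function and variable names are illustrative assumptions.

```python
# Sketch of read-path dedupe opportunity detection; illustrative only.
from collections import deque

def on_read(block_addr, block, index, work_queue: deque):
    """Hash a block as it is read into cache; on an index hit against a
    different location, queue an async request instead of blocking the read."""
    fp = hash(block)  # stand-in for the real fingerprint
    prev_addr = index.get(fp)
    if prev_addr is not None and prev_addr != block_addr:
        # Two locations likely hold identical data: the async worker
        # will verify byte-by-byte before sharing the blocks.
        work_queue.append((prev_addr, block_addr, fp))
    else:
        index[fp] = block_addr
```

Reading a data set twice through such a path is enough to surface its sharing opportunities, which matches the observation above that a data set can be deduplicated purely by reading it.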
Since in-line deduplication is configured globally, across the entire cluster, it can be easily controlled via the OneFS command line interface (CLI). For example, the following syntax will enable in-line deduplication and verify the configuration:
# isi dedupe inline settings view
# isi dedupe inline settings modify --mode enabled
# isi dedupe inline settings view
Note that in-line deduplication is disabled by default on a new F810 cluster running OneFS 8.2.1.
While there are no visible userspace changes when files are deduplicated, if deduplication has occurred, both the ‘disk usage’ and the ‘physical blocks’ metrics reported by the ‘isi get -DD’ CLI command will be reduced. Additionally, at the bottom of the command’s output, the logical block statistics will report the number of shadow blocks. For example:
Metatree logical blocks:
zero=260814 shadow=362 ditto=0 prealloc=0 block=2 compressed=0
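When scripting a quick efficiency check across many files, the `key=value` pairs on that line are easy to pull apart. A minimal illustrative parser:

```python
# Parse a 'Metatree logical blocks' line from 'isi get -DD' output
# into a dict of counters. Illustrative helper, not a supported tool.
import re

def parse_logical_blocks(line: str) -> dict:
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", line)}
```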
OneFS in-line data deduplication can be disabled from the CLI with the following syntax:
# isi dedupe inline settings modify --mode disabled
# isi dedupe inline settings view
OneFS in-line data deduplication can be paused from the CLI with the following syntax:
# isi dedupe inline settings modify --mode paused
OneFS in-line data deduplication can be run in assess mode from the CLI with the following syntax:
# isi dedupe inline settings modify --mode assess
Problems with in-line dedupe may generate the following OneFS events and alerts:
- Inline dedupe index allocation failed
- Inline dedupe index allocation in progress
- Inline dedupe not supported
- Inline dedupe index is smaller than requested
- Inline dedupe index has non standard layout
In the event that in-line deduplication encounters an unrecoverable error, it will restart the write operation with in-line dedupe disabled. If any of the above alert conditions occur, please contact Isilon Technical Support for further evaluation.