The new OneFS 8.2.2 release extends the in-line data reduction suite to the H5600 deep-hybrid platform, in addition to the all-flash F810.


[Figure: inline-dedupe_0-1.png - OneFS in-line data reduction]

 

With the H5600, both in-line compression and deduplication are performed in software within each node. This is in contrast to the F810 platform, where each node includes an FPGA hardware adapter that offloads compression from the CPU. The software compression code path is also used as a fallback in the event of an F810 hardware failure and, in a mixed cluster, for nodes without hardware offload capability. Both the hardware and software compression implementations are DEFLATE compatible.

 

The H5600 deep-hybrid chassis is available in the following storage configurations:

 

Hard Drive Capacity | SSD Capacity   | Encryption (SED) | Chassis Capacity (Raw)
10 TB               | 3.2TB SSD      | No               | 800 TB
12 TB               | 2 x 3.2TB SSDs | No               | 960 TB
10 TB SED           | 3.2TB SSD      | Yes              | 800 TB

 

Similarly, the F810 all-flash chassis is available with the following storage options:

 

Drive Capacity | Storage Medium          | Encryption (SED) | Chassis Capacity (Raw)
3.8 TB         | Solid state drive (SSD) | No               | 228 TB
7.7 TB         | Solid state drive (SSD) | No               | 462 TB
15.4 TB        | Solid state drive (SSD) | No               | 924 TB
15.4 TB SED    | Solid state drive (SSD) | Yes              | 924 TB

 

When in-line data reduction is enabled on a cluster, data from network clients is accepted as is and makes its way through the OneFS write path until it reaches the BSW engine, where it is broken up into individual chunks. The in-line data reduction write path comprises three main phases:

 

[Figure: inline-dedupe_1.png - the three phases of the in-line data reduction write path]

 

If both in-line compression and deduplication are enabled on a cluster, zero block removal is performed first, followed by dedupe, and then compression. This ordering allows each phase to reduce the amount of work that each subsequent phase needs to perform.

 

The in-line data reduction zero block removal phase detects blocks that contain only zeros and prevents them from being written to disk. This reduces disk space requirements, reduces the amount of work that both in-line deduplication and compression need to perform, and avoids unnecessary writes to SSD, resulting in increased drive longevity.
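To make the phase ordering and the zero-block short-circuit concrete, here is a minimal sketch. The 8KB block size matches OneFS, but the function and variable names are purely illustrative and are not OneFS internals; dedupe and compression are left as placeholders.

# Conceptual sketch of the in-line data reduction pipeline ordering.
# Block size is OneFS' 8KB; everything else here is illustrative only.

BLOCK_SIZE = 8 * 1024  # OneFS file system block size

def is_zero_block(block: bytes) -> bool:
    """Zero block removal: detect blocks containing only zeros."""
    return block.count(0) == len(block)

def reduce_write(blocks):
    """Apply the three phases in order; each phase shrinks the
    working set handed to the next one."""
    # Phase 1: drop all-zero blocks (they are never written to disk)
    survivors = [b for b in blocks if not is_zero_block(b)]
    # Phase 2: in-line dedupe would run here on 'survivors'
    # Phase 3: in-line compression would run here on whatever dedupe left
    return survivors

# Example: two data blocks and one zero block
blocks = [b"A" * BLOCK_SIZE, bytes(BLOCK_SIZE), b"B" * BLOCK_SIZE]
print(len(reduce_write(blocks)))  # -> 2; the zero block never reaches dedupe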

 

Next in the pipeline is in-line dedupe. While Isilon has offered a native file system deduplication solution for several years, this was always performed post-process, by scanning the data after it had been written to disk. With in-line data reduction, deduplication is now performed in real time as data is written to the cluster. Storage efficiency is achieved by scanning the data for identical blocks as it is received and then eliminating the duplicates.

 

When a duplicate block is discovered, in-line dedupe moves a single copy of the block to a special set of files known as shadow stores. These are file system containers that can hold both physical data and pointers, or references, to shared blocks. Shadow stores are similar to regular files but are hidden from the file system namespace, so they cannot be accessed via a pathname. A shadow store typically grows to a maximum size of 2GB, or around 256K blocks, with each block able to be referenced by up to 32,000 files. If the reference count limit is reached, a new block is allocated, which may or may not be in the same shadow store. Additionally, shadow stores do not reference other shadow stores, and snapshots of shadow stores are not permitted because the data contained in shadow stores cannot be overwritten.
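As a rough illustration of the limits just described, the following sketch models reference counting for a shared block. The class and constant names are hypothetical; only the 8KB block size, the 2GB store size and the 32,000 reference limit come from the text above.

# Hypothetical model of shadow store limits; not an OneFS data structure.

BLOCK_SIZE = 8 * 1024
MAX_STORE_BYTES = 2 * 1024 ** 3                       # ~2GB per shadow store
MAX_BLOCKS_PER_STORE = MAX_STORE_BYTES // BLOCK_SIZE  # 262144, i.e. around 256K blocks
MAX_REFS_PER_BLOCK = 32_000                           # files referencing one shared block

class ShadowBlock:
    """Hypothetical model of a shared block inside a shadow store."""
    def __init__(self):
        self.ref_count = 0

    def add_reference(self) -> bool:
        # When the reference limit is hit, the caller must allocate a new
        # block, which may or may not live in the same shadow store.
        if self.ref_count >= MAX_REFS_PER_BLOCK:
            return False
        self.ref_count += 1
        return True

print(MAX_BLOCKS_PER_STORE)   # -> 262144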

 

When a client writes a file to a node pool configured for in-line deduplication on a cluster, the write operation is divided up into whole 8KB blocks. Each of these blocks is then hashed and its ‘fingerprint’ compared against an in-memory index for a match. At this point, one of the following operations will occur (a conceptual sketch of this decision flow follows the list):

 

1. If a match is discovered with an existing shadow store block, a byte-by-byte comparison is performed. If the comparison is successful, the data is removed from the current write operation and replaced with a shadow reference.

2. If a match is found with another LIN, the data is written to a shadow store instead and replaced with a shadow reference. Next, a work request is generated and queued that includes the location of the new shadow store block, the matching LIN and block, and the data hash. A byte-by-byte data comparison is performed to verify the match before the request is processed.

3. If no match is found, the data is written to the file natively and the hash for the block is added to the in-memory index.
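The sketch below walks through the three cases above for a single 8KB block. It is a conceptual illustration only: the in-memory index, shadow store, and work queue are simple Python stand-ins, and hashlib's blake2b is used purely as a placeholder for the 128-bit block fingerprint, since the actual hashing and data structures live inside the OneFS write path.

# Conceptual sketch of the write-path dedupe decision flow; not OneFS code.
import hashlib
from collections import deque

BLOCK_SIZE = 8 * 1024
index = {}              # fingerprint -> ("shadow" | "lin", location)
shadow_store = {}       # fingerprint -> block data (stand-in for shadow stores)
dedupe_queue = deque()  # queued work requests for case 2

def fingerprint(block: bytes) -> bytes:
    # Placeholder for the 128-bit block fingerprint
    return hashlib.blake2b(block, digest_size=16).digest()

def write_block(lin: int, offset: int, block: bytes):
    fp = fingerprint(block)
    hit = index.get(fp)
    if hit is None:
        # Case 3: no match - write natively and index the new fingerprint
        index[fp] = ("lin", (lin, offset))
        return ("native", block)
    kind, where = hit
    if kind == "shadow":
        # Case 1: match against an existing shadow store block - verify
        # byte-by-byte, then replace the data with a shadow reference
        if shadow_store[fp] == block:
            return ("shadow_ref", fp)
        return ("native", block)  # hash collision: keep the original data
    # Case 2: match against another LIN - write the block to a shadow store,
    # reference it, and queue a request to verify and convert the other copy
    shadow_store[fp] = block
    index[fp] = ("shadow", fp)
    dedupe_queue.append({"shadow_block": fp, "matching_lin": where})
    return ("shadow_ref", fp)

# Writing the same block from two different files shares it via the store
print(write_block(100, 0, b"A" * BLOCK_SIZE)[0])  # -> native
print(write_block(200, 0, b"A" * BLOCK_SIZE)[0])  # -> shadow_ref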

 

In order for in-line deduplication to be performed on a write operation, the following conditions need to be true:

 

  • In-line dedupe must be globally enabled on the cluster.
  • The current operation is writing data (i.e., not a truncate or write-zero operation).
  • The ‘no_dedupe’ flag is not set on the file.
  • The file is not a special file type, such as an alternate data stream (ADS) or an EC (endurant cache) file.
  • Write data includes fully overwritten and aligned blocks.
  • The write is not part of a ‘rehydrate’ operation.
  • The file has not been packed (containerized) by SFSE (small file storage efficiency).

 

OneFS in-line deduplication uses the 128-bit CityHash algorithm, which is very fast; since CityHash is not a cryptographic hash, matches are verified with a byte-by-byte comparison before blocks are shared, as described below. This contrasts with OneFS’ post-process SmartDedupe, which uses SHA-1 hashing.

 

Each F810 or H5600 node in a cluster with in-line dedupe enabled has its own in-memory hash index that it compares block ‘fingerprints’ against. The index lives in system RAM and is allocated using physically contiguous pages and accessed directly with physical addresses. This avoids the need to traverse virtual memory mappings and does not incur the cost of translation lookaside buffer (TLB) misses, minimizing deduplication performance impact.

 

The maximum size of the hash index is governed by a pair of sysctl settings, one of which caps the size at 16GB, and the other which limits it to 10% of total RAM. The stricter of the two constraints applies. While these settings are configurable, the recommended best practice is to use the default configuration. Any changes to these settings should only be made under the supervision of Dell EMC support.
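As a worked example of the sizing rule above (the sysctl names are omitted here, and the RAM figures are illustrative):

# The stricter of the two constraints applies: a 16GB absolute cap,
# or 10% of the node's total RAM.

def max_index_gib(total_ram_gib: float) -> float:
    return min(16.0, round(total_ram_gib * 0.10, 1))

# On a node with 256GB of RAM, 10% of RAM (25.6GB) exceeds the cap: 16GB applies
print(max_index_gib(256))   # -> 16.0

# On a node with 96GB of RAM, the 10% limit is the stricter constraint
print(max_index_gib(96))    # -> 9.6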

 

Since in-line dedupe and SmartDedupe use different hashing algorithms, the indexes for each are not shared directly. However, the work performed by each dedupe solution can be leveraged by the other. For instance, if SmartDedupe writes data to a shadow store, when those blocks are read, the read hashing component of in-line dedupe will see those blocks and index them.

 

When a match is found, in-line dedupe performs a byte-by-byte comparison of each block to be shared to avoid the potential for a hash collision. Data is prefetched prior to the byte-by-byte check and then compared against the L1 cache buffer directly, avoiding unnecessary data copies and adding minimal overhead. Once the matching blocks have been compared and verified as identical, they are then shared by writing the matching data to a common shadow store and creating references from the original files to this shadow store.

 

In-line dedupe samples every whole block written and handles each block independently, so it can aggressively locate duplicate blocks. If a contiguous run of matching blocks is detected, in-line dedupe will merge the results into regions and process them efficiently.
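The region merging can be pictured with a small sketch; the helper below is hypothetical and simply coalesces consecutive matching block offsets into (start, length) regions.

def merge_matches(matching_offsets):
    """Coalesce consecutive block offsets into (start, length) regions."""
    regions = []
    for off in sorted(matching_offsets):
        if regions and off == regions[-1][0] + regions[-1][1]:
            start, length = regions[-1]
            regions[-1] = (start, length + 1)   # extend the current region
        else:
            regions.append((off, 1))            # start a new region
    return regions

# Blocks 4-7 and block 10 of a file matched the index:
print(merge_matches([4, 5, 6, 7, 10]))   # -> [(4, 4), (10, 1)]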

 

In-line dedupe also detects dedupe opportunities from the read path: blocks are hashed as they are read into L1 cache and inserted into the index. If an entry already exists for that hash, in-line dedupe knows there is a block sharing opportunity between the block it just read and the one previously indexed. It combines that information and queues a request to an asynchronous dedupe worker thread. As such, it is possible to deduplicate a data set purely by reading it all. To help mitigate the performance impact, the hashing is performed out-of-band in the prefetch path, rather than in the latency-sensitive read path.
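A simplified sketch of the read-path opportunity detection described above (the structures and names are hypothetical; in OneFS the hashing happens in the prefetch path and the sharing work is handed to an asynchronous dedupe worker thread):

# Conceptual sketch of read-path dedupe opportunity detection; not OneFS code.
import hashlib
from collections import deque

read_index = {}               # fingerprint -> (lin, offset) of a previously seen block
async_dedupe_queue = deque()  # requests handed to an asynchronous worker

def on_prefetch(lin: int, offset: int, block: bytes):
    # Hash the block as it is read into L1 cache (out-of-band, in prefetch)
    fp = hashlib.blake2b(block, digest_size=16).digest()
    seen = read_index.get(fp)
    if seen is not None and seen != (lin, offset):
        # A different location already holds matching data: queue a sharing
        # request rather than doing the work in the latency-sensitive read path
        async_dedupe_queue.append({"existing": seen, "new": (lin, offset)})
    else:
        read_index[fp] = (lin, offset)

on_prefetch(100, 0, b"A" * 8192)
on_prefetch(200, 8, b"A" * 8192)
print(len(async_dedupe_queue))   # -> 1 queued sharing opportunity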

 

Compression, the third and final phase of the in-line data reduction pipeline, occurs as files are written to a node in the cluster via a connected client session. Similarly, files are re-inflated on demand as they are read by clients.

 

Unlike the H5600, each F810 node includes an FPGA-based hardware offload engine, which resides on the backend PCI-e network adapter and performs real-time data compression. It uses a proprietary implementation of DEFLATE with the highest level of compression, while incurring minimal to no performance penalty for highly compressible datasets.

 

The compression engine comprises three main components:

 

Engine Component     | Description
Search Module        | LZ77 search module that analyzes in-line file data chunks for repeated patterns.
Encoding Module      | Performs data compression (Huffman encoding) on target chunks.
Decompression Module | Regenerates the original file from the compressed chunks.

 

In addition to dual-port 40Gb Ethernet interfaces, each F810 node’s data reduction off-load adapter contains an FPGA chip, which is dedicated to the compression of data received via client connections to the node. These cards reside in the backend PCI-e slot in each of the four nodes. The two Ethernet ports in each adapter are used for the node’s redundant backend network connectivity.

 

The table below illustrates the relationship between the effective to usable and effective to raw ratios for the F810 and H5600 platforms:


[Figure: inline-dedupe3_0.png - effective to usable and effective to raw ratios for the F810 and H5600 platforms]

 

When a file is written to OneFS using in-line data compression, the file’s logical space is divided up into equal sized chunks called compression chunks. Compaction is used to create 128KB compression chunks, with each chunk comprising sixteen 8KB data blocks. This is optimal since 128KB is the same chunk size that OneFS uses for its data protection stripe units, providing simplicity and efficiency by avoiding the overhead of additional chunk packing.

 

For example, consider the following 128KB chunk:

 

[Figure: compression_1.png - a 128KB compression chunk of sixteen 8KB blocks]

 

After compression, this chunk is reduced from sixteen to six 8KB blocks, meaning it now occupies just 48KB of physical space. OneFS provides a transparent logical overlay to the physical attributes. This overlay describes whether the backing data is compressed or not and which blocks in the chunk are physical or sparse, such that file system consumers are unaffected by compression. As such, the compressed chunk is logically represented as 128KB in size, regardless of its actual physical size. The orange sector in the illustration above represents the trailing, partially filled 8KB block in the chunk. Depending on how each 128KB chunk compresses, the last block may be under-utilized by up to 7KB after compression.

 

Efficiency savings must be at least 8KB (one block) for compression to occur; otherwise, that chunk or file will be passed over and remain in its original, uncompressed state. For example, a 16KB file that yields 8KB (one block) of savings would be compressed. Once a file has been compressed, it is then protected with Forward Error Correction (FEC) parity blocks. Since protection is applied to the smaller, compressed data, fewer FEC blocks are required, providing further overall storage savings.
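Putting the chunk arithmetic above into a short worked example (the decision rule follows the text: a 128KB chunk of sixteen 8KB blocks is stored compressed only if at least one whole block is saved; the 41KB figure below is an assumed compressed output size chosen to reproduce the six-block example):

import math

BLOCK_SIZE = 8 * 1024
CHUNK_BLOCKS = 16                      # a 128KB compression chunk

def physical_blocks(compressed_bytes: int) -> int:
    blocks = math.ceil(compressed_bytes / BLOCK_SIZE)
    # Store compressed only if at least one whole 8KB block is saved
    return blocks if blocks <= CHUNK_BLOCKS - 1 else CHUNK_BLOCKS

# The example chunk compresses to ~41KB of output: six 8KB blocks (48KB
# physical), with up to 7KB unused in the trailing block.
print(physical_blocks(41 * 1024))      # -> 6

# A chunk that compresses to 125KB saves less than one block, so it is
# left uncompressed at sixteen blocks.
print(physical_blocks(125 * 1024))     # -> 16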

 

Note that compression chunks will never cross node pools. This avoids the need to de-compress or recompress data to change protection levels, perform recovered writes, or otherwise shift protection-group boundaries.

 

Compression and deduplication can significantly increase the storage efficiency of data. However, the actual space savings will vary depending on the specific attributes of the data itself.

 

Configuration of OneFS in-line data reduction is via the command line interface (CLI), using the ‘isi compression’ and ‘isi dedupe inline’ commands. There are also utilities provided to decompress, or rehydrate, compressed and deduplicated files if necessary. Plus, there are tools for viewing on-disk capacity savings that in-line data reduction has generated.


The ‘isi_hw_status’ CLI command can be used to confirm and verify the node type(s) in a cluster. For example:


# isi_hw_status -i | grep Product

Product: F810-4U-Single-256GB-1x1GE-2x40GE SFP+-24TB SSD

 

Since compression configuration is binary, either on or off across a cluster, it can be easily controlled via the OneFS command line interface (CLI). For example, the following syntax will enable compression and verify the configuration:


# isi compression settings view

    Enabled: No

# isi compression settings modify --enabled=True

# isi compression settings view

    Enabled: Yes

 

Be aware that in-line compression is enabled by default on new H5600 and F810 clusters.

 

In a mixed cluster containing other node styles in addition to compression nodes, files will only be stored in a compressed form on H5600 and F810 node pool(s). Data that is written or tiered to storage pools of other hardware styles will be uncompressed on the fly as it moves between pools. A node that does not support in-line compression can act as the initiator for compressed writes to a compression node pool, performing the compression in software. However, this may generate significant CPU overhead on lower-powered nodes, such as the A-series hardware, and provides only software fallback compression, with lower compressibility.

 

While there are no visible userspace changes when files are compressed, the ‘isi get’ CLI command provides a straightforward method to verify whether a file is compressed. If compression has occurred, both the ‘disk usage’ and the ‘physical blocks’ metrics reported by the ‘isi get -DD’ CLI command will be reduced. Additionally, at the bottom of the command’s output, the logical block statistics will report the number of compressed blocks. For example:


Metatree logical blocks:

zero=260814 shadow=0 ditto=0 prealloc=0 block=2 compressed=1328
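For reference, the counters on that line can also be read programmatically; the small helper below is hypothetical, but the field names match the ‘isi get -DD’ output shown above.

def parse_metatree(line: str) -> dict:
    """Parse a 'Metatree logical blocks' line into a dict of counters."""
    return {key: int(val) for key, val in (field.split("=") for field in line.split())}

stats = parse_metatree(
    "zero=260814 shadow=0 ditto=0 prealloc=0 block=2 compressed=1328")
print(stats["compressed"])   # -> 1328 blocks are stored in compressed form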

 

For more detailed information, the -O flag, which displays the logical overlay, can be used with the ‘isi get’ command.

OneFS in-line data compression can be disabled from the CLI with the following syntax:


# isi compression settings modify --enabled=False

# isi compression settings view

    Enabled: No

 

Since in-line deduplication configuration is binary, either on or off across a cluster, it can be easily controlled via the OneFS command line interface (CLI). For example, the following syntax will enable in-line deduplication and verify the configuration:


# isi dedupe inline settings view

    Mode: disabled

    Wait: -

   Local: -


# isi dedupe inline settings modify --mode enabled

# isi dedupe inline settings view

    Mode: enabled

    Wait: -

   Local: -

 

Note that in-line deduplication is disabled by default on new H5600 and F810 clusters.


If deduplication has occurred, both the ‘disk usage’ and the ‘physical blocks’ metrics reported by the ‘isi get -DD’ CLI command will be reduced. Additionally, at the bottom of the command’s output, the logical block statistics will report the number of shadow blocks. For example:


Metatree logical blocks:

zero=260814 shadow=362 ditto=0 prealloc=0 block=2 compressed=0

 

OneFS in-line data deduplication can be disabled from the CLI with the following syntax:


# isi dedupe inline settings modify --mode disabled

# isi dedupe inline settings view

    Mode: disabled

    Wait: -

   Local: -

 

OneFS in-line data deduplication can also be paused from the CLI with the following syntax:


# isi dedupe inline settings modify --mode paused