As we've seen in the last couple of articles, compression and deduplication can significantly increase the storage efficiency of data. However, the actual space savings can and will vary dramatically depending on the specific attributes of the data itself.

 

The following table illustrates the relationship between the effective-to-usable and effective-to-raw ratios for the three drive configurations in which the F810 chassis is available (3.8 TB, 7.6 TB, and 15.4 TB SSDs):


[Figure inline-dedupe3_1.png: effective-to-usable and effective-to-raw ratios for the three F810 drive configurations]


Let's take a look at the descriptions of the various OneFS reporting metrics, such as those returned by the ‘isi statistics data-reduction’ command described in the previous blog article. The following table attempts, where appropriate, to equate the Isilon nomenclature with more general industry terminology:


[Figure inline-dedupe3_2.png: OneFS reporting metrics mapped to general industry terminology]


The interrelation of the data capacity metrics described above can be illustrated as follows:

 

[Figure inline-dedupe3_3.png: interrelation of the OneFS data capacity metrics]

 

The preprotected physical (usable) value is derived by subtracting the protection overhead from the protected physical (raw) metric. Similarly, the difference in size between the preprotected physical (usable) and logical data (effective) values is the data reduction savings. If OneFS SmartDedupe is also licensed and running on the cluster, this data reduction savings value will reflect a combination of compression, in-line deduplication, and post-process deduplication savings.
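Expressed as simple formulas, these relationships are:

Preprotected Physical (usable) = Protected Physical (raw) – Protection Overhead

Data Reduction Savings = Logical Data (effective) – Preprotected Physical (usable)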


As with most things in life, data efficiency is a compromise. To gain increased levels of storage efficiency, additional cluster resources (CPU, memory, and disk I/O) are utilized to compress, deduplicate, and re-inflate files. As such, the following factors can affect the performance of in-line data reduction and the I/O performance of compressed and deduplicated pools:


  • Application and the type of dataset being used
  • Data access pattern (for example, sequential versus random access, the size of the I/O)
  • Compressibility and duplicity of the data
  • Amount of total data
  • Average file size
  • Nature of the data layout
  • Hardware platform: the amount of CPU, RAM, and type of storage in the system
  • Amount of load on the system
  • Level of protection


Clearly, hardware offload compression will perform considerably better, in terms of both speed and efficiency, than the software fallback option – which is used both on F810 nodes where the hardware compression engine has been disabled, and on all other node types, where software data reduction is the only available option.

Another important performance impact consideration with in-line data efficiency is the potential for data fragmentation. After compression or deduplication, files that previously enjoyed contiguous on-disk layout will often have chunks spread across less optimal file system regions. This can lead to slightly increased latencies when accessing these files directly from disk, rather than from cache.


Because in-line data reduction is a data efficiency feature rather than a performance-enhancing tool, in most cases the primary consideration will be managing cluster impact, both from the client data access performance front and from the data reduction execution perspective, since additional cluster resources are consumed when shrinking and inflating files.


With in-line data reduction enabled on F810 nodes, highly incompressible data sets may experience a small performance penalty. Conversely, for highly compressible and duplicate data there may be a performance boost. Workloads performing small, random operations will likely see a small performance degradation.

Since it resides on the same card, the compression FPGA engine shares PCIe bandwidth with the node’s backend Ethernet interfaces. In general, there is plenty of bandwidth available; however, a best practice is to run incompressible, performance-sensitive streaming workflows on F810 nodes with in-line data reduction disabled, to avoid any potential bandwidth limits. Rehydration typically requires considerably less overhead than compression.


When considering effective usable space on a cluster with in-line data reduction enabled, bear in mind that every capacity saving from file compression and deduplication also serves to reduce the per-TB compute ratio (CPU, memory, etc.). For performance workloads, the recommendation is to size for performance (IOPS, throughput, etc.) rather than effective capacity.
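As a simplified, hypothetical illustration: a cluster sized at 100 TB of usable capacity that achieves a 2:1 data reduction ratio presents 200 TB of effective capacity, meaning each effective TB is now served by only half the CPU and memory that each usable TB was originally sized for. Hence the guidance to size performance workloads on raw performance metrics rather than on effective capacity.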


Similarly, it is challenging to broadly characterize the in-line dedupe performance overhead with any accuracy, since it is dependent on various factors, including the duplicity of the data set, whether matches are found against other LINs or SINs, and so on. Workloads requiring a large amount of deduplication might see an impact of 5-10%, although they will enjoy an attractive efficiency ratio in return. In contrast, certain other workloads may see a slight performance gain because of in-line dedupe. If there is block scanning but no deduplication to perform, the overhead is typically in the 1-2% range.

 

In-line data reduction is included as a core component of Isilon OneFS 8.2.1 on the F810 hardware platform and does not require a product license key to activate. In-line compression is enabled by default and in-line deduplication can be activated via the following command: 


# isi dedupe inline settings modify --enabled=True
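Once enabled, the corresponding ‘view’ subcommand should confirm the current in-line dedupe configuration:

# isi dedupe inline settings view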


Note that an active Isilon SmartQuotas license is required to use quota reporting. An unlicensed cluster will show a SmartQuotas warning until a valid product license has been purchased and applied to the cluster. License keys can be easily added via the ‘Activate License’ section of the OneFS WebUI, accessed by navigating to Cluster Management > Licensing.


Below are some examples of typical space reclamation levels that have been achieved with OneFS in-line data efficiency. These data efficiency space savings values are provided solely as rough guidance. Since no two data sets are alike (unless they’re replicated), actual results can and will vary considerably from these examples.

 

Workflow / Data Type                 Typical Efficiency Ratio     Typical Space Savings

Home Directories / File Shares       1.3:1                        25%
Engineering Source Code              1.4:1                        30%
EDA Data                             2:1                          50%
Genomics Data                        2.2:1                        55%
Oil and Gas                          1.4:1                        30%
Pre-compressed Data                  N/A                          No savings
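Note that the two columns above are directly related: Typical Space Savings ≈ 1 – (1 / Efficiency Ratio), with the percentages rounded to the nearest 5%. For example, a 1.3:1 efficiency ratio equates to 1 – (1/1.3) ≈ 23% savings, shown as 25% in the table.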

 

To calculate the data reduction ratio, the ‘logical data’ (effective) is divided by the ‘preprotected physical’ (usable) value. From the output above, this would be:

 

339.50 / 192.87 = 1.76        Or a Data Reduction ratio of 1.76:1

 

Similarly, the ‘efficiency ratio’ is calculated by dividing the ‘logical data’ (effective) by the ‘protected physical’ (raw) value. From the output above, this yields:

 

339.50 / 350.13 = 0.97        Or an Efficiency ratio of 0.97:1
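For anyone scripting these calculations against ‘isi statistics data-reduction’ output, the following minimal Python sketch reproduces the arithmetic (the variable names are illustrative, and the values are the sample figures used above):

# Sample capacity metrics from the example above (in TB)
logical_data = 339.50            # effective
preprotected_physical = 192.87   # usable
protected_physical = 350.13      # raw

# Data reduction ratio: effective divided by usable
print(f"Data reduction ratio: {logical_data / preprotected_physical:.2f}:1")   # 1.76:1

# Efficiency ratio: effective divided by raw
print(f"Efficiency ratio:     {logical_data / protected_physical:.2f}:1")      # 0.97:1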