For the final article in this in-line data reduction series, we’ll turn our attention to deduplication and compression efficiency estimation tools.
Firstly, OneFS includes a dry-run Dedupe Assessment job to help estimate the amount of space savings that will be seen on a dataset. Run against a specific directory or set of directories on a cluster, the dedupe assessment job reports a total potential space savings. The job uses its own separate configuration, does not require a product license, and can be run prior to purchasing F810 hardware to determine whether deduplication is appropriate for a particular data set or environment.
The dedupe assessment job uses an index table that is separate from those used by both in-line dedupe and SmartDedupe. For efficiency, the assessment job also samples fewer candidate blocks and does not actually perform deduplication. Using the sampling and consolidation statistics, the job provides a report which estimates the total dedupe space savings in bytes.
The dedupe assessment job can be run from the OneFS command line (CLI):
# isi job jobs start DedupeAssessment
Alternatively, in-line deduplication can be enabled in assessment mode:
# isi dedupe inline settings modify --mode assess
Once the job has completed, review the following three metrics from each node:
# sysctl efs.sfm.inline_dedupe.stats.zero_block
# sysctl efs.sfm.inline_dedupe.stats.dedupe_block
# sysctl efs.sfm.inline_dedupe.stats.write_block
The formula to calculate the estimated dedupe rate from these statistics is:
dedupe_block / write_block * 100 = dedupe%
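The calculation above can be scripted once the counters have been collected. The following is a minimal sketch rather than a supported tool. It assumes the sysctl output uses the FreeBSD-style `name: value` format, and it assumes that summing the counters across nodes before applying the formula is a reasonable way to get a cluster-wide figure. The function names `parse_stats` and `dedupe_percent` are hypothetical.

```python
import re

# Assumption: each counter is printed as "name: value", one per line.
STAT_RE = re.compile(r"^efs\.sfm\.inline_dedupe\.stats\.(\w+):\s*(\d+)$")


def parse_stats(sysctl_output):
    """Sum each counter over all matching lines.

    The input may hold output gathered from several nodes; summing the
    per-node values is an assumption, not documented OneFS behavior.
    """
    totals = {}
    for line in sysctl_output.splitlines():
        match = STAT_RE.match(line.strip())
        if match:
            name, value = match.group(1), int(match.group(2))
            totals[name] = totals.get(name, 0) + value
    return totals


def dedupe_percent(stats):
    """Apply dedupe_block / write_block * 100, guarding against zero writes."""
    write_blocks = stats.get("write_block", 0)
    if write_blocks == 0:
        return 0.0
    return stats.get("dedupe_block", 0) / write_blocks * 100
```

To use it, paste the output of the three sysctl commands from every node into a single string, pass it to `parse_stats`, and hand the result to `dedupe_percent`. The `zero_block` counter is parsed but not used, because the formula above does not reference it.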
Note that the dedupe assessment does not distinguish a fresh run from one where a previous SmartDedupe job has already shared some blocks in the files in that directory. Isilon recommends running the assessment job only once on a specific directory, since it does not report incremental differences between instances of the job.
Similarly, the Dell Live Optics Dossier utility can be used to estimate the potential benefits of Isilon’s in-line data compression on a data set. Dossier is available for Windows and has no dependency on an Isilon cluster. This makes it useful for analyzing and estimating efficiency across real data in situ, without the need for copying data onto a cluster. The Dossier tool operates in three phases:
1. Users manually browse and select root folders on the local host to analyze.
2. Once the paths to folders have been selected, Dossier will begin walking the file system trees for the target folders. This process will likely take up to several hours for large file systems. Walking the filesystem has a similar impact to a malware/anti-virus scan in terms of the CPU, memory, and disk resources that will be utilized during the collection. A series of customizable options allow the user to deselect more invasive operations and govern the CPU and memory resources allocated to the Dossier collector.
3. Users upload the resulting .dossier file to create a PowerPoint report.
To obtain a Live Optics Dossier report, first download, extract and run the Dossier collector. Local and remote UNC paths can be added for scanning. Ensure you are authenticated to the desired UNC path before adding it to Dossier’s ‘custom paths’ configuration. Be aware that the Dossier compression option only processes the first 64KB of each file to determine its compressibility. Additionally, the default configuration samples only 5% of the dataset, but this is configurable with a slider. Increasing this value improves the accuracy of the estimation report, albeit at the expense of extended job execution time.
The compressibility scan executes rapidly, with minimal CPU and memory resource consumption. It also provides thread and memory usage controls, progress reporting, and a scheduling option that allows scanning to be throttled during heavy usage windows.
When the scan is complete, a ‘*.dossier’ file is generated. This file is then uploaded to the Live Optics website.
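To make the sampling behavior concrete, here is a rough sketch of a head-of-file compressibility estimate. It is not Dossier's algorithm: it uses zlib as a stand-in software compressor, and it assumes the sample percentage applies to the number of files, which the Dossier documentation may define differently. The names `head_ratio` and `estimate_savings` are hypothetical.

```python
import os
import random
import zlib

HEAD_BYTES = 64 * 1024  # only the first 64 KB of each file is examined


def head_ratio(path):
    """Return (raw_len, compressed_len) for the first HEAD_BYTES of a file."""
    with open(path, "rb") as handle:
        head = handle.read(HEAD_BYTES)
    return len(head), len(zlib.compress(head))


def estimate_savings(root, sample_fraction=0.05, seed=0):
    """Estimate percent space savings from a random sample of files under root.

    Assumption: the sample is drawn per file, not per byte. The result can be
    slightly negative for incompressible data because of zlib's overhead.
    """
    paths = sorted(
        os.path.join(folder, name)
        for folder, _, names in os.walk(root)
        for name in names
    )
    if not paths:
        return 0.0
    count = min(len(paths), max(1, round(len(paths) * sample_fraction)))
    raw_total = packed_total = 0
    for path in random.Random(seed).sample(paths, count):
        raw, packed = head_ratio(path)
        raw_total += raw
        packed_total += packed
    if raw_total == 0:
        return 0.0
    return (1 - packed_total / raw_total) * 100
```

Because only the head of each file is read, a file that starts with highly compressible data and continues with incompressible data will look far more compressible than it really is. Raising `sample_fraction` trades a longer run for an estimate based on more files, mirroring the slider trade-off described above.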
Once uploaded and processed, a PowerPoint report is generated in real time and delivered via email.
Compression reports are easy to comprehend. If multiple SMB shares or paths are scanned, a summary is generated at the beginning of the report, followed by the details of each individually selected path.
Live Optics Dossier can be found at URL: https://app.liveoptics.com/tools/dossier
Documentation is at: https://support.liveoptics.com/hc/en-us/articles/229590207-Dossier-User-Guide
When running the Live Optics Dossier tool, keep the following considerations in mind. Dossier does not use exactly the same algorithm as the OneFS hardware in-line compression: it evaluates software compression rather than hardware compression. As a result, actual results will generally be better than the Dossier report indicates.
Note that Dossier will overestimate compression for some data, for example files whose first blocks are significantly more compressible than later blocks. Dossier is intended to be run against SMB shares on any storage array or DAS, and it has no NFS export support. The tool can also take a significant amount of time to run against a large data set. By default, it samples only a portion of the data (the first 64KB of each file), so results can be inaccurate. Dossier reports only the sizes of the uncompressed and compressed data; it does not provide performance estimates for different compression algorithms. It also does not attempt to compress files with certain known extensions that are generally incompressible.