
In OneFS 8.2, SmartQuotas now has four principal resources used in quota accounting:


  • Physical size - This includes all the on-disk storage associated with files and directories, with the exception of some metadata objects, including the LIN tree and snapshot tracking files (STFs). For deduplicated data and file clones, each file's 8KB reference to a shadow store is included in the physical space calculation.

  • File system logical size - This calculation approximates disk usage on 'typical' storage arrays by ignoring the erasure code (FEC) protection overhead that OneFS employs. For regular files, the logical data space is the amount of storage required to house a particular file if it were 1x mirrored. Logical space also incorporates a file's metadata resources.

  • Application logical size - Reports the total logical data stored across different tiers, including CloudPools. This allows users to view quotas and free space as an application would view them, in terms of how much capacity is available to store logical data, regardless of data reduction or tiering technology.

  • Inodes - SmartQuotas counts the number of logical inodes, which allows accounting for files without any ambiguity from hard links or protection.

 

When configuring a quota, these accounting options are available as enforcement limits. For example, from the OneFS WebUI:


quotas_applogical_1.png

 

Application logical size quotas are introduced in OneFS 8.2, allowing users to view quotas and free space as an application sees it – regardless of data reduction, stubbing, sparse blocks, etc.


Existing quotas can easily be configured to use application logical size upon upgrading to 8.2. The benefits of application logical size quotas include:


  • Snapshots, protection overhead, dedupe, compression, and location of files all have no effect on quota consumption
  • Removes previous limitation where SmartQuotas only reported on-cluster storage, ignoring cloud consumption
  • Presents view that aligns with Windows storage accounting
  • Enables accounting and enforcing quotas based on actual file sizes
  • Precisely accounts for small files
  • Enables enforcing quotas on a path irrespective of the physical location of the files


The following table describes how SmartQuotas accounts for a 1KB file with the various data types:

 

  • File: physical size - Every non-sparse 8KB disk block a file consumes, including protection.
  • File: file system logical size - Every non-sparse 8KB disk block a file consumes, excluding protection.
  • File: application logical size - The actual size of the file, rather than the total of the 8KB disk blocks consumed.
  • CloudPools file: file system logical size - The size of the CloudPools SmartLink stub file (8KB).
  • CloudPools file: application logical size - The actual size of the file on cloud storage, rather than the local stub file.
  • Directories - The sum of all directory entries.
  • Symlinks - Data size.
  • ACLs and similar - Data size.
  • Alternate data streams - Each ADS is charged as a file, and its container as a directory.

 

The following example shows each method of accounting for a 1KB file:


quotas_applogical_2.png

 

File system logical size reports 8KB (one block), physical size reports 24KB (the file has 3x mirroring protection), and application logical size reports the file's actual size of 1KB.
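
To make the arithmetic above concrete, here's a minimal Python sketch (illustrative only, not OneFS code) that reproduces the three accounting values for a 1KB file, assuming an 8KB block size, 3x mirroring and ignoring metadata charges:

BLOCK_SIZE = 8 * 1024   # OneFS native block size (8KB)
MIRRORS = 3             # assumed 3x mirroring for this small file

def blocks_needed(size_bytes):
    """Round a file size up to whole 8KB blocks."""
    return -(-size_bytes // BLOCK_SIZE)   # ceiling division

def quota_accounting(size_bytes):
    blocks = blocks_needed(size_bytes)
    return {
        "application_logical": size_bytes,            # actual file size: 1KB
        "fs_logical": blocks * BLOCK_SIZE,            # blocks excluding protection: 8KB
        "physical": blocks * BLOCK_SIZE * MIRRORS,    # blocks including protection: 24KB
    }

print(quota_accounting(1024))
# {'application_logical': 1024, 'fs_logical': 8192, 'physical': 24576}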


Quotas with application logical size accounting can easily be created from either the OneFS WebUI or the CLI. For example, the following syntax will configure a hard quota with a 5GB application logical size threshold on the /ifs/data directory:


# isi quota quotas create /ifs/data directory --thresholds-on=applogicalsize --hard-threshold=5G


Other resources encountered during quota accounting include:


Hard Links - Each logical inode is accounted exactly once in every domain to which it belongs. If an inode is present in multiple domains, it is accounted in multiple domains. Alternatives such as shared accounting were considered. However, if inodes are not accounted once in every domain, it is possible for the deletion of a hard link in one domain to put another domain over quota.
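
As a simple illustration of the 'once per domain' rule, the following Python sketch (hypothetical, not OneFS code) charges a hard-linked inode once in each quota domain that contains one of its links:

# One inode with two hard links that fall under two different quota domains.
inode = {"size": 8192, "links": ["/ifs/projects/a/file", "/ifs/archive/file"]}

usage = {"/ifs/projects": 0, "/ifs/archive": 0}

# The inode is accounted exactly once in every domain it belongs to,
# no matter how many of its links live under that domain.
for domain in usage:
    if any(link.startswith(domain + "/") for link in inode["links"]):
        usage[domain] += inode["size"]

print(usage)   # {'/ifs/projects': 8192, '/ifs/archive': 8192}
# Deleting the link under /ifs/archive releases that domain's charge only;
# the /ifs/projects accounting is unaffected.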


Alternate Data Streams (ADS) - A file with an alternate data stream or resource fork is accounted as the sum of the resource usage of the individual file, the usage for the container directory and the usage for each ADS. SmartQuotas handles the rename of a file with ADS synchronously, despite the fact that the ADS container is just a directory. SmartQuotas will store an accounting summary on the ADS container to handle renames.


Directory Rename – A directory rename presents a unique challenge to a per-directory quota system. Renames of directories within a domain are trivial: if both the source and target directories have the same domain membership, no accounting changes. However, non-empty directories are not permitted to be moved when the SmartQuotas configuration differs between the source and the target parent directories. If a user trusts the client operating systems to copy files and preserve all the necessary attributes, the user may set dir_rename_errno to EXDEV, which causes most Unix and Windows clients to perform a copy and delete of the directory tree to effect the move.


Snapshot Accounting – If desired, a quota domain can also include snapshot usage in its accounting. SmartQuotas only supports snapshots created after the quota domain was created, because determining quota governance (including the QuotaScan job) for existing snapshots is a very time- and resource-consuming operation. As most administrators cycle their snapshots through timed expirations, SmartQuotas will eventually accrue enough accounting information to include the entire set of relevant snapshots on the system.

In the first article of this series, we explored the architecture and configuration of OneFS Small File Storage Efficiency. Next, we'll take a look at SFSE monitoring & reporting, defragmentation, plus some considerations and recommended practices.


There are three main CLI commands that report on the status and effect of SFSE. These are:

 

  • isi job reports view <job_id>
  • isi_packing --fsa
  • isi_sfse_assess

 

When running the isi job reports view command, enter the job ID as an argument. In the command output, the ‘Files packed’ field indicates how many files have been successfully containerized. For example, for job ID 1018:

 

# isi job reports view -v 1018

SmartPools[1018] phase 1 (2019-02-31T10:29:47

---------------------------------------------

Elapsed time                        12 seconds

Working time                        12 seconds

Group at phase end                  <1,6>: { 1:0-5, smb: 1, nfs: 1, hdfs: 1, swift: 1, all_enabled_protocols: 1}

Errors

'dicom':
      {'Policy Number': 0,
      'Files matched': {'head': 512, 'snapshot': 256},
      'Directories matched': {'head': 20, 'snapshot': 10},
      'ADS containers matched': {'head': 0, 'snapshot': 0},
      'ADS streams matched': {'head': 0, 'snapshot': 0},
      'Access changes skipped': 0,
      'Protection changes skipped': 0,
      'Packing changes skipped': 0,
      'File creation templates matched': 0,
      'Skipped packing non-regular files': 2,
      'Files packed': 48672,
      'Files repacked': 0,
      'Files unpacked': 0,
      },
}

 

The second command, isi_packing --fsa, provides a storage efficiency percentage in the last line of its output. This command requires InsightIQ to be licensed on the cluster and a successful run of the file system analysis (FSA) job.


If FSA has not been run previously, it can be kicked off with the 'isi job jobs start FSAnalyze' command. For example:

 

# isi job jobs start FSAnalyze

Started job [1018]

 

When this job has completed, run:

 

# isi_packing --fsa --fsa-jobid 1018

FSAnalyze job: 1018 (Mon Feb 29 22:01:21 2019)

Logical size:  47.371T

Physical size: 58.127T

Efficiency:    81.50%

 

In this case, the storage efficiency achieved after containerizing the data is 81.50%, as reported by isi_packing.
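
The efficiency figure is simply the ratio of logical to physical space. A quick Python sketch of the arithmetic behind the output above:

logical_tb = 47.371    # logical size reported by isi_packing --fsa
physical_tb = 58.127   # physical size reported by isi_packing --fsa

efficiency = logical_tb / physical_tb
print(f"Efficiency: {efficiency:.2%}")   # Efficiency: 81.50%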

If you don't specify an FSAnalyze job ID, the --fsa option defaults to the results of the last successful FSAnalyze job run.

Be aware that the isi_packing --fsa command reports on the whole /ifs file system. This means that the overall utilization percentage can be misleading if other, non-containerized data is also present on the cluster.

 

There is also a storage efficiency assessment tool available in OneFS 8.2. This can be run from the CLI with the following syntax:

# isi_sfse_assess <options>

 

Estimated storage efficiency is presented in the tool's output in terms of raw space savings (as a total and a percentage) and a percentage reduction in protection group (PG) overhead. For example:

 

SFSE estimation summary:

* Raw space saving: 1.7 GB (25.86%)

* PG reduction: 25978 (78.73%)

 

 

When containerized files with shadow references are deleted, truncated, or overwritten, unreferenced blocks can be left behind in shadow stores. These blocks are later freed and can result in holes, which reduces storage efficiency.


sfse_2.png

 

The actual efficiency loss depends on the protection level layout used by the shadow store. Smaller protection group sizes are more susceptible, as are containerized files, since all the blocks in containers have at most one referring file and the packed sizes (file size) are small.


In OneFS 8.2, a shadow store defragmenter is added to reduce the fragmentation that results from overwrites and deletes of containerized files. This defragmenter is integrated into the ShadowStoreDelete job. The defragmentation process works by dividing each containerized file into logical chunks (~32MB each) and assessing each chunk for fragmentation.


sfse_3.png

 

If the storage efficiency of a fragmented chunk is below target, that chunk is processed by evacuating the data to another location. The default target efficiency is 90% of the maximum storage efficiency available with the protection level used by the shadow store. Larger protection group sizes can tolerate a higher level of fragmentation before the storage efficiency drops below this threshold.
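
The following Python sketch outlines that decision under assumed values (a 16+2 protection layout and block counts for a ~32MB chunk); it is illustrative only, not the actual OneFS implementation:

def chunk_needs_defrag(logical_blocks, physical_blocks,
                       data_units=16, fec_units=2, target_ratio=0.9):
    """Flag a chunk whose storage efficiency has dropped below 90% of the
    maximum possible efficiency for its protection layout (assumed logic)."""
    max_efficiency = data_units / (data_units + fec_units)   # best case for 16+2
    efficiency = logical_blocks / physical_blocks            # live data vs. blocks used
    return efficiency < target_ratio * max_efficiency

# A chunk that has lost roughly half its logical blocks to overwrites/deletes:
print(chunk_needs_defrag(logical_blocks=2048, physical_blocks=4608))   # True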


In OneFS 8.2, the ‘isi_sstore list’ command is enhanced to display fragmentation and efficiency scores. For example:


# isi_sstore list -v                    

              SIN lsize   psize   refs filesize  date       sin type underfull frag score efficiency

4100:0001:0001:0000 128128K 192864K 32032 128128K Sep 20 22:55 container no       0.01        0.66

 

The fragmentation score is the ratio of holes in the data where FEC is still required, whereas the efficiency value is a ratio of logical data blocks to total physical blocks used by the shadow store. Fully sparse stripes don't need FEC so are not included. The rule of thumb is that lower fragmentation scores and higher efficiency scores are better.


The defragmenter does not require a license to run and is disabled by default in OneFS 8.2. It can easily be activated with the following CLI command:


# isi_gconfig -t defrag-config defrag_enabled=true


Once enabled, the defragmenter can be started via the job engine’s ShadowStoreDelete job, either from the OneFS WebUI or via the following CLI command:


# isi job jobs start ShadowStoreDelete


The defragmenter can also be run in an assessment mode, which reports the amount of disk space that would be reclaimed without actually moving any data. The ShadowStoreDelete job can run the defragmenter in assessment mode, but the statistics generated are not reported by the job. Instead, the isi_sstore CLI command has a ‘defrag’ option and can be run with the following syntax to generate a defragmentation assessment:


# isi_sstore defrag -d -a -c -p -v

Processed 1 of 1 (100.00%) shadow stores, space reclaimed 31M

Summary:

Shadows stores total: 1

Shadows stores processed: 1

Shadows stores skipped: 0

Shadows stores with error: 0

Chunks needing defrag: 4

Estimated space savings: 31M


Isilon Small File Storage Efficiency for Archive is not free. There's always a trade-off between cluster resource consumption (CPU, memory, disk), the potential for data fragmentation, and the benefit of improved space utilization. As such, it's worth bearing the following in mind:


  • This is a storage efficiency product, not a performance product.
  • A valid Isilon SmartPools software license is required in order to configure small file storage efficiency on a cluster.
  • The time to retrieve a packed archive image should not be much greater than that of an unpacked image - unless fragmentation has occurred.
  • Configuration is only via the OneFS CLI, rather than the WebUI, at this point.
  • After enabling a filepool policy, the first SmartPools job may take a relatively long time due to packing work, but subsequent runs should be much faster.
  • For clusters using CloudPools you cannot containerize stubbed files.
  • SyncIQ data will be unpacked during replication, so SmartPools will need to be licensed and packing configured on the target cluster.
  • If the data is in a snapshot, it won’t be packed – only HEAD file data will be containerized.
  • The isi_packing --fsa command reports on the whole filesystem, so the overall utilization percentage can be misleading if other, non-containerized data is also present on the cluster.
  • Alternate data streams (ADS, i.e. the streams themselves, not the parent files) will not be containerized by default.
  • Packing and unpacking are logically preserving actions: they do not cause logical changes to a file and therefore will not trigger snapshot copy-on-write (COW).
  • If you’ve already run Isilon SmartDedupe data deduplication software on your data, you won’t see much additional benefit because your data is already in shadow stores.
  • If you run SmartDedupe against packed data, the deduped files will be skipped.
  • You can clone files with packed data.
  • Containerization is managed by the SmartPools job. However, the SmartPoolsTree job, isi filepool apply, and isi set will also be able to perform file packing.


Similarly, some recommended best practices for SFSE include:


  • Only enable storage efficiency on an archive workflow with a high percentage of small files.
  • Ensure that the majority of logical space used on the cluster is consumed by small files - in this context, files less than 512KB in size.
  • The default minimum age for packing is anything over one day, and this will override anything configured in the filepool policy.
  • Limit changes (overwrites and deletes) to containerized files, since these cause fragmentation and impact both file read performance and storage efficiency.
  • Ensure there’s sufficient free space available on the cluster before unpacking any containerized data.
  • Ensure the archive solution being used does not natively perform file containerization, or the benefits of Isilon small file storage efficiency will likely be negated.
  • Use a path based filepool policy for configuration, where possible, rather than more complex filepool filtering logic.
  • Don’t configure the maximum file size value inside the file pool filter itself. Instead set this parameter via the isi_packing command.
  • Use SFSE to archive static small file workloads, or those with only moderate overwrites and deletes.
  • If necessary, run the defragmentation job on a defined schedule (e.g. weekly) to eliminate fragmentation.

Archive applications such as next generation healthcare Picture Archiving and Communication Systems (PACS) are increasingly moving away from housing large archive file formats (such as tar and zip files) to storing the smaller files individually. To directly address this trend, OneFS now includes a Small File Storage Efficiency (SFSE) component. And we'll take an in-depth look at this feature over the next couple of blog articles.


SFSE maximizes the space utilization of a cluster by decreasing the amount of physical storage required to house the small files that often comprise an archive, such as a typical healthcare DICOM dataset. Efficiency is achieved by scanning the on-disk data for small files and packing them into larger OneFS data structures, known as shadow stores. These shadow stores are then parity protected using erasure coding, and typically provide storage efficiency of 80% or greater.


Isilon SFSE is specifically designed for infrequently modified, archive datasets. As such, it trades a small read latency performance penalty for improved storage utilization. Files obviously remain writable, since archive applications are assumed to periodically need to update at least some of the small file data.


Under the covers, SFSE comprises six main components:


  • File pool configuration policy
  • SmartPools Job
  • Shadow Store
  • Configuration control path
  • File packing and data layout infrastructure
  • Defragmenter


But first, a quick review of file system layout...


OneFS provides one vast, scalable namespace, free from multiple volume concatenations or single points of failure. As such, an Isilon cluster can support data sets with hundreds of billions of small files, all within the same file system.


OneFS lays data out across multiple nodes, allowing files to benefit from the resources (spindles and cache) of up to twenty nodes. Reed-Solomon erasure coding is used to protect at the file level, enabling the cluster to recover data quickly and efficiently, and providing exceptional levels of storage utilization. OneFS provides protection against up to four simultaneous component failures, where a single failure can be as small as an individual disk or as large as an entire node.


A variety of mirroring options are also available, and OneFS typically uses these to protect metadata and small files. Striped, distributed metadata coupled with continuous auto-balancing affords OneFS near linear performance characteristics, regardless of the capacity utilization of the system. Both metadata and file data are spread across the entire cluster keeping the cluster balanced at all times.


The OneFS file system employs a native block size of 8KB, and sixteen of these blocks are combined to create a 128KB stripe unit. Files larger than 128KB are protected with error-correcting code (FEC) parity blocks and striped across nodes. This allows files to use the combined resources of up to twenty nodes, based on per-file policies.

 

Files smaller than 128KB are unable to fill a stripe unit, so are typically mirrored rather than FEC protected, resulting in a less efficient on-disk footprint. For most data sets, this is rarely an issue, since the presence of a smaller number of larger FEC protected files offsets the mirroring of the small files.

 

sfse_1.png

 

For example, if a file is 24KB in size, it will occupy three 8KB blocks. If it has two mirrors for protection, a total of nine 8KB blocks, or 72KB, will be needed to store and protect it on disk. Clearly, being able to pack several of these small files into a larger, striped and parity-protected container provides a significant space benefit.
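
Here's a short Python sketch of that arithmetic (illustrative only), comparing the mirrored footprint of a 24KB file with its approximate share of an FEC-protected container, assuming 8KB blocks and a 16+2 protection layout for the shadow store:

BLOCK = 8 * 1024

def mirrored_footprint(size_bytes, copies=3):
    """Small file stored as the original plus two mirrors (three copies total)."""
    blocks = -(-size_bytes // BLOCK)
    return blocks * BLOCK * copies

def containerized_footprint(size_bytes, data_units=16, fec_units=2):
    """Approximate share of a striped, FEC-protected container (assumed 16+2)."""
    blocks = -(-size_bytes // BLOCK)
    return blocks * BLOCK * (data_units + fec_units) // data_units

size = 24 * 1024                                      # 24KB file -> three 8KB blocks
print(mirrored_footprint(size) // 1024, "KB")         # 72 KB at 3x mirroring
print(containerized_footprint(size) // 1024, "KB")    # 27 KB in a 16+2 container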


Additionally, files in the 150KB to 300KB range typically see utilization of around 50%, as compared to 80% or better when containerized with SFSE.


Under the hood, SFSE has similarities to the OneFS file cloning process; both operations utilize the same underlying infrastructure: the shadow store.


As we saw in previous blog articles on the topic, shadow stores are similar to regular files, but don’t contain all the metadata typically associated with regular file inodes. In particular, time-based attributes (creation time, modification time, etc.) are explicitly not maintained.



The shadow stores for SFSE differ from existing shadow stores in a few ways in order to isolate fragmentation, to support tiering, and to support future optimizations which will be specific to single-reference stores.


Containerization is managed by the SmartPools job. This job typically runs by default on a cluster with a 10pm nightly schedule and a low impact management setting, but can also be run manually on-demand. Additionally, the SmartPoolsTree job, isi filepool apply, and the isi set command are also able to perform file packing.

File attributes indicate each file's pack state:


packing_policy: container or native. This indicates whether the file meets the criteria set by your file pool policies and is eligible for packing. Container indicates that the file is eligible to be packed; native indicates that the file is not eligible to be packed. Your file pool policies determine this value. The value is updated by the SmartPools job.


packing_target: container or native. This is how the system evaluates a file's eligibility for packing based on additional criteria such as file size, type, and age. Container indicates that the file should reside in a container shadow store. Native indicates that the file should not be containerized.


packing_complete: complete or incomplete. This field establishes whether or not the target is satisfied. Complete indicates that the target is satisfied, and the file is packed. Incomplete indicates that the target is not satisfied, and the packing operation is not finished.
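
Purely as a hypothetical illustration (not an OneFS API), the three attributes can be read together to summarize a file's pack state:

def pack_state(packing_policy, packing_target, packing_complete):
    """Summarize the pack-state attributes described above (hypothetical helper)."""
    if packing_policy != "container":
        return "not eligible for packing (excluded by file pool policies)"
    if packing_target != "container":
        return "eligible by policy, but excluded by size/type/age criteria"
    if packing_complete == "complete":
        return "packed into a container shadow store"
    return "awaiting packing; target not yet satisfied"

print(pack_state("container", "container", "incomplete"))
# awaiting packing; target not yet satisfied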


It’s worth noting that several healthcare archive applications can natively perform file containerization. In these cases, the benefits of OneFS small file efficiency will be negated.


Before configuring small file storage efficiency on a cluster, make sure that the following pre-requisites are met:


  1. Enable on an archive workflow only: This feature is not intended for performance or write-heavy workflows.
  2. The majority of the archived data comprises small files. By default, the threshold target file size is from 0-1 MB.
  3. Isilon SmartPools software is licensed and active on the cluster.


Additionally, it’s highly recommended to have Isilon InsightIQ software licensed on the cluster. This enables the file system analysis (FSAnalyze) job to be run, which provides enhanced storage efficiency reporting statistics.


The first step in configuring Isilon small file storage efficiency on a cluster is to enable the packing process. To do so, run the following command from the OneFS CLI:

 

# isi_packing --enabled=true

 

Once the isi_packing variable is set, and the licensing agreement is confirmed, configuration is done via a filepool policy. The following CLI example will containerize data under the cluster directory /ifs/data/dicom.

 

# isi filepool policies create dicom --enable-packing=true --begin-filter --path=/ifs/data/dicom --end-filter

 

The SmartPools configuration for the resulting ‘dicom’ filepool can be verified with the following command:

 

# isi filepool policies view dicom

Name: dicom

Description: -

State: OK

State Details:

Apply Order: 1

File Matching Pattern: Birth Time > 1D AND Path == dicom (begins with)

          Set Requested Protection: -

Data Access Pattern: -

Enable Coalescer: -

Enable Packing: Yes

...

 

Note:  There is no dedicated WebUI for SFSE, so configuration is performed via the CLI.


The isi_packing command will also confirm that packing has been enabled:

 

# isi_packing --ls

Enabled:                            Yes
Enable ADS:                         No
Enable snapshots:                   No
Enable mirror containers:           No
Enable mirror translation:          No
Unpack recently modified:           No
Unpack snapshots:                   No
Avoid deduped files:                Yes
Maximum file size:                  1016.0k
SIN cache cutoff size:              8.00M
Minimum age before packing:         0s
Directory hint maximum entries:     16
Container minimum size:             1016.0k
Container maximum size:             1.000G

 

While the defaults will work for most use cases, the two values you may wish to adjust are maximum file size (--max-size <bytes>) and minimum age for packing (--min-age <seconds>).


Files are then containerized in the background via the SmartPools job, which can be run on-demand, or via the nightly schedule.

 

# isi job jobs start SmartPools

Started job [1016]

 

After enabling a new filepool policy, the SmartPools job may take a relatively long time due to packing work. However, subsequent job runs should be significantly faster.


SFSE reporting can be viewed via the SmartPools job reports, which detail the number of files packed. For example:

 

# isi job reports view -v 1016

 

For clusters with a valid InsightIQ license, if the FSA (file system analytics) job has run, a limited efficiency report will be available. This can be viewed via the following command:

 

# isi_packing --fsa

 

For clusters using Isilon CloudPools software, you cannot containerize stubbed files. SyncIQ data will be unpacked, so packing will need to be configured on the target cluster.


To unpack previously packed, or containerized, files, in this case from the ‘dicom’ filepool policy, run the following command from the OneFS CLI:

 

# isi filepool policies modify dicom --enable-packing=false

 

Before performing any unpacking, ensure there’s sufficient free space on the cluster. Also, be aware that any data in a snapshot won’t be packed – only HEAD file data will be containerized.


A threshold is provided which prevents very recently modified files from being containerized. The default value is 24 hours, but this can be reconfigured via the isi_packing --min-age <seconds> command, if desired. This threshold guards against accidental misconfiguration within a file pool policy, which could otherwise lead to the containerization of files that are actively being modified and, in turn, to container fragmentation.


OneFS GMP Scale Enhancements

Posted by trimbn May 14, 2019

As mentioned in previous articles, in OneFS 8.2 the maximum cluster size is extended from 144 to 252 nodes. To support this increase in scale, GMP transaction latency has been dramatically improved by eliminating serialization and reducing its reliance on exclusive merge locks.


Instead, GMP now employs a shared merge locking model.

 

gmp_paxos_1.png

 

Take the four-node cluster above. In this example, the interaction between the two operations is condensed, illustrating how each node can finish its operation independently of its peers. Note that the diamond icons represent the ‘loopback’ messaging to node 1.


Each node takes only its local exclusive merge lock rather than serializing the change across the cluster (stopping GMP messaging on all nodes is expensive), which significantly reduces the impact of a group change and allows OneFS to support greater node counts. While state is not synchronized immediately, it converges after a short while: the caller of a service change will not return until all nodes have been updated, and once all nodes have replied, the service change has completed. It is possible for multiple nodes to change a service at the same time, or for multiple services on the same node to change.


The example above illustrates nodes {1,2} merging with nodes {3,4}. The operation is serialized, and the exclusive merge lock is taken. In the diagram, the wide arrows represent multiple messages being exchanged, and the green arrows show the new service exchange. Each node sends its service state to all the nodes that are new to it and receives the state from all new nodes. There is no need to send the current service state to any node that was in the same group prior to the merge.


During a node split, there are no synchronization issues because either order results in the services being down, and the existing OneFS algorithm still applies.


OneFS 8.2 also sees the introduction of a new daemon, isi_array_d, which replaces isi_boot_d from prior versions. Isi_array_d is based on the Paxos consensus protocol.


gmp_paxos_2.png

 

Paxos is used to manage the process of agreeing on a single, cluster-wide result amongst a group of potential transient nodes. Although no deterministic, fault-tolerant consensus protocol can guarantee progress in an asynchronous network, Paxos guarantees safety (consistency), and the conditions that could prevent it from making progress are difficult to trigger.


In 8.2, the unique, cluster-wide GMP ID is replaced with a unique GMP Cookie on each node. For example:


# sysctl efs.gmp.group

efs.gmp.group: <889a5e> (5) :{ 1-3:0-5, smb: 1-3, nfs: 1-3, all_enabled_protocols: 1-3, isi_cbind_d: 1-3 }


The GMP Cookie is a hexadecimal number. The initial value is calculated as a function of the current time, so it remains unique even after a node is rebooted. The cookie changes whenever there is a GMP event and is unique on power-up. In this instance, the (5) represents the configuration generation number.


In the interest of readability on large clusters, logging verbosity is also reduced in OneFS 8.2. Take the following syslog entry, for example:


2019-05-12T15:27:40-07:00 <0.5> (id1) /boot/kernel.amd64/kernel: connects: { { 1.7.135.(65-67)=>1-3 (IP), 0.0.0.0=>1-3, 0.0.0.0=>1-3, }, cfg_gen:1=>1-3, owv:{ build_version=0x0802009000000478 overrides=[ { version=0x08020000 bitmask=0x0000ae1d7fffffff }, { version=0x09000100 bitmask=0x0000000000004151 } ] }=>1-3, }


Only the lowest-numbered node in a group proposes a merge or split, which avoids excessive retries from multiple proposing nodes.


GMP will always select nodes to merge so as to form the biggest group, and equally sized groups are weighted towards the smaller node numbers. For example:


{1, 2, 3, 5} > {1, 2, 4, 5}
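
A minimal Python sketch of that selection rule (assumed logic, not the GMP implementation): prefer the larger candidate group, and break ties in favor of the smaller node numbers:

def group_preference(group):
    """Sort key: larger groups first, then lexicographically smaller node numbers."""
    return (-len(group), sorted(group))

candidates = [{1, 2, 4, 5}, {1, 2, 3, 5}]
best = min(candidates, key=group_preference)
print(sorted(best))   # [1, 2, 3, 5]  ->  {1,2,3,5} is preferred over {1,2,4,5}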

 

Discerning readers will have likely noticed a new ‘isi_cbind_d’ entry appended to the group sysctl output above. This new GMP service shows which nodes have connectivity to the DNS servers. For instance, in the following example node 2 is not communicating with DNS.


# sysctl efs.gmp.group

efs.gmp.group: <889a65> (5) :{ 1-3:0-5, smb: 1-3, nfs: 1-3, all_enabled_protocols: 1-3, isi_cbind_d: 1,3 }


 

As you may recall, isi_cbind_d is the distributed DNS cache daemon in OneFS. The primary purpose of cbind is to accelerate DNS lookups, especially when netgroups are used. The design of the cache is to distribute the cache and DNS workload among each node of the cluster.


Cbind has been re-architected in OneFS 8.2 to improve its operation with large clusters. The primary change is the introduction of a consistent hash to determine the gateway node on which to cache a request. This consistent hashing algorithm, which decides which node caches an entry, is designed to minimize the number of entry transfers as nodes are added or removed. In so doing, it has also reduced the number of threads and UDP ports used.


The cache is logically divided into two parts:


  • Gateway cache - The entries that this node will refresh from the DNS server.
  • Local cache - The entries that this node will refresh from the gateway node.

 

To illustrate cbind consistent hashing, consider the following three-node cluster:

 

gmp_paxos_3.png

 

In the scenario above, when the cbind service on node 3 becomes active, one third of the gateway cache from each of nodes 1 and 2 is transferred to node 3.


Similarly, if node 3’s cbind service goes down, its gateway cache is divided equally between nodes 1 and 2.

For a DNS request on node 3, the node first checks its local cache. If the entry is not found, it will automatically query the gateway (for example, node 2). This means that even if node 3 cannot talk to the DNS server directly, it can still cache the entries from a different node.
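
For illustration, here's a minimal consistent-hash sketch in Python (not the actual cbind implementation) showing how DNS names could map to gateway nodes, and how relatively few entries move when a third node joins:

import hashlib
from bisect import bisect

def build_ring(nodes, vnodes=64):
    """Hash ring: each node contributes multiple virtual points."""
    points = []
    for node in nodes:
        for i in range(vnodes):
            h = int(hashlib.md5(f"node{node}:{i}".encode()).hexdigest(), 16)
            points.append((h, node))
    return sorted(points)

def gateway_for(name, points):
    """Pick the first ring point clockwise of the name's hash."""
    h = int(hashlib.md5(name.encode()).hexdigest(), 16)
    return points[bisect(points, (h,)) % len(points)][1]

names = [f"host{i}.example.com" for i in range(1000)]
two_nodes, three_nodes = build_ring([1, 2]), build_ring([1, 2, 3])

moved = sum(gateway_for(n, two_nodes) != gateway_for(n, three_nodes) for n in names)
print(f"{moved / len(names):.0%} of cached entries move when node 3 joins")
# Roughly a third of the entries move to node 3; the rest stay where they were.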

In the previous article, we discussed the new SmartPools FilePolicy job. This is just one of several new jobs that are introduced to the job engine in OneFS 8.2. These include:

 

  • Cache-pre-populate - Adds data to the CloudPools cache. Having a single job helps avoid the large number of individual jobs often seen in CloudPools 1.0.
  • ComplianceStoreDelete - SmartLock Compliance mode garbage collection job.
  • DomainTag - Associates a path and its contents with a domain.
  • FilePolicy - Efficient SmartPools file pool policy job.
  • IndexUpdate - Creates and updates an efficient file system index for the FilePolicy and FSAnalyze jobs.
  • Smartlink-upgrade - Updates CloudPools SmartLink file formats.
  • Snapshot-writeback - Writes updated snapshot data to the cloud.

 

In addition to new jobs, 8.2 introduces enhancements to increase job engine performance, reliability, and scalability, and reduce impact.

 

For example, in situations where the job engine sees the available capacity on one or more disk pools fall below a low-space threshold, it now engages a low-space mode. This enables space-saving jobs to run and reclaim space before the job engine, or even the cluster, becomes unusable.

 

pipeline_je_1.png

 

When the job engine is in low-space mode, new jobs will not be started, and any jobs that are not space-saving will be paused. Once free space rises back above the low-space threshold, jobs that were paused for space are resumed.

 

The space-saving jobs in OneFS 8.2 are:


  • AutoBalance(LIN)
  • Collect
  • MultiScan
  • ShadowStoreDelete
  • SnapshotDelete
  • TreeDelete


Once the cluster is no longer space constrained, any paused jobs are automatically resumed.
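
A simplified Python sketch of the low-space behavior (assumed logic and an assumed threshold, not the actual Job Engine code):

SPACE_SAVING_JOBS = {"AutoBalance", "AutoBalanceLin", "Collect", "MultiScan",
                     "ShadowStoreDelete", "SnapshotDelete", "TreeDelete"}

def low_space_action(job_name, free_pct, low_space_threshold_pct=5.0):
    """In low-space mode, only space-saving jobs are allowed to run."""
    if free_pct >= low_space_threshold_pct:
        return "run normally (resume any jobs previously paused for space)"
    if job_name in SPACE_SAVING_JOBS:
        return "run to reclaim space"
    return "pause until free space recovers"

print(low_space_action("SnapshotDelete", free_pct=2.0))   # run to reclaim space
print(low_space_action("SmartPools", free_pct=2.0))       # pause until free space recovers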

 

On large clusters with multiple jobs running at high impact, the job coordinator can become bombarded by the volume of task results being sent directly from the worker threads. In OneFS 8.2, this is mitigated by certain jobs performing intermediate merging of results on individual nodes and batching delivery of their results to the coordinator. The jobs that support results merging efficiency include:


  • AutoBalance(Lin)
  • MultiScan
  • AVScan
  • PermissionRepair
  • CloudPoolsLin
  • QuotaScan
  • CloudPoolsTreewalk
  • SnapRevert
  • Collect
  • SnapshotDelete
  • FlexProtect(Lin)
  • TreeDelete
  • LinCount
  • Upgrade

 

The MultiScan job, which combines the functionality of AutoBalance and Collect, is automatically run after a group change which adds a device to the cluster. AutoBalance(Lin) and/or Collect are only run manually if MultiScan has been disabled.


OneFS 8.2 scalability enhancements mean that fewer group change notifications are received. This results in MultiScan being triggered less frequently. To compensate for this in OneFS 8.2, MultiScan is now started when:


  • Data is unbalanced within one or more disk pools, which triggers MultiScan to start the AutoBalance phase only.
  • Drives have been unavailable for long enough to warrant a Collect job, which triggers MultiScan to start both its AutoBalance and Collect phases.

In 8.2, the FilePolicy and FSAnalyze jobs automatically share the same snapshots and index, created and managed by the new IndexUpdate job. When a cluster running FSA is upgraded to OneFS 8.2, the legacy index and snapshots are removed and replaced by new snapshots the first time that IndexUpdate is run. The new index stores considerably more file and snapshot attributes than the old FSA index. Until the IndexUpdate job effects this change, FSA keeps using old index and snapshots.

Note that the DomainMark job is still required for SyncIQ (which still uses legacy BAM domain).


SmartPools FilePolicy Job

Posted by trimbn May 7, 2019

As mentioned in the previous introduction to OneFS 8.2, scalability is the linchpin of this release. In order to support the increase in size of up to 252 nodes per cluster, most of the OneFS data services have received some performance efficiency work to reduce execution time and cluster resource burden. And SmartPools, the OneFS data tiering engine, is no exception in this regard.


Traditionally, OneFS has used the SmartPools jobs to apply its file pool policies. To accomplish this, the SmartPools job visits every file, and the SmartPoolsTree job visits a tree of files. However, the scanning portion of these jobs can result in significant random impact to the cluster and lengthy execution times, particularly in the case of the SmartPools job.


To address this, OneFS 8.2 introduces the FilePolicy job, which provides a faster, lower-impact method for applying file pool policies than the full-blown SmartPools job.


But first, a quick Job Engine refresher…


As we know, the Job Engine is OneFS’ parallel task scheduling framework, and is responsible for the distribution, execution, and impact management of critical jobs and operations across the entire cluster.


The OneFS Job Engine schedules and manages all the data protection and background cluster tasks: creating jobs for each task, prioritizing them and ensuring that inter-node communication and cluster wide capacity utilization and performance are balanced and optimized.  Job Engine ensures that core cluster functions have priority over less important work and gives applications integrated with OneFS – Isilon add-on software or applications integrating to OneFS via the OneFS API – the ability to control the priority of their various functions to ensure the best resource utilization.


Each job, for example the SmartPools job, has an ‘impact profile’ comprising a configurable impact policy and an impact schedule, which characterize how much of the system’s resources the job will take. The amount of work a job has to do is fixed, but the resources dedicated to that work can be tuned to minimize the impact on other cluster functions, like serving client data.

In 8.2 the specific jobs that the SmartPools feature encompasses include:

 

  • SmartPools - Runs and moves data between the tiers of nodes within the same cluster. Also executes the CloudPools functionality if licensed and configured.
  • SmartPoolsTree - Enforces SmartPools file pool policies on a subtree.
  • FilePolicy - Efficient, changelist-based SmartPools file pool policy job.
  • IndexUpdate - Creates and updates an efficient file system index for the FilePolicy job.
  • SetProtectPlus - Applies the default file policy. This job is disabled if SmartPools is activated on the cluster.


In conjunction with the new IndexUpdate job, FilePolicy improves job scan performance by using a ‘file system index’, or changelist, to find files needing policy changes, rather than a full tree scan.


filepolicy_1.png

 

Avoiding a full treewalk dramatically decreases the amount of locking and metadata scanning work the job is required to perform, reducing impact on CPU and disk - albeit at the expense of not doing everything that SmartPools does. The FilePolicy job enforces just the SmartPools file pool policies, as opposed to the storage pool settings. For example, FilePolicy does not deal with changes to storage pools or storage pool settings, such as:

 

  • Restriping activity due to adding, removing, or reorganizing node pools.
  • Changes to storage pool settings or defaults, including protection.


However, the majority of the time SmartPools and FilePolicy perform the same work. Disabled by default, FilePolicy supports the full range of file pool policy features, reports the same information, and provides the same configuration options as the SmartPools job. Since FilePolicy is a changelist-based job, it performs best when run frequently - once or multiple times a day, depending on the configured file pool policies, data size and rate of change.
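
Conceptually, the difference between the two approaches looks something like the toy Python sketch below (illustrative only, not the OneFS implementation):

all_files = [f"/ifs/data/file{i}" for i in range(100_000)]
changed_since_last_index = all_files[:250]     # only a small subset has changed

def apply_file_pool_policies(path):
    pass   # placeholder for evaluating the configured file pool policies

# SmartPools-style full treewalk: every file is visited and evaluated.
for path in all_files:
    apply_file_pool_policies(path)

# FilePolicy-style changelist scan: only files recorded as changed since the
# last IndexUpdate snapshot are visited, avoiding the full metadata treewalk.
for path in changed_since_last_index:
    apply_file_pool_policies(path)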


Job schedules can easily be configured from the OneFS WebUI by navigating to Cluster Management > Job Operations, highlighting the desired job and selecting ‘View/Edit’. The following example illustrates configuring the IndexUpdate job to run every six hours at a LOW impact level with a priority value of 5:


filepolicy_2.png

When enabling and using the FilePolicy and IndexUpdate jobs, the recommendation is to continue running the SmartPools job as well, but at a reduced frequency (monthly).

In addition to running on a configured schedule, the FilePolicy job can also be executed manually.


FilePolicy requires access to a current index. If the IndexUpdate job has not yet been run, attempting to start the FilePolicy job will fail with the error below, which prompts you to run the IndexUpdate job first. Once the index has been created, the FilePolicy job will run successfully. The IndexUpdate job can be run several times daily (e.g. every six hours) to keep the index current and prevent the snapshots from getting large.


filepolicy_3.png

 

Consider using the FilePolicy job with the job schedules below for workflows and datasets with the following characteristics:


  • Data with long retention times
  • Large number of small files
  • Path-based File Pool filters configured
  • Where FSAnalyze job is already running on the cluster (InsightIQ monitored clusters)
  • There is already a SnapshotIQ schedule configured
  • When the SmartPools job typically takes a day or more to run to completion at LOW impact


For clusters without the characteristics described above, the recommendation is to continue running the SmartPools job as usual and to not activate the FilePolicy job.

The following table provides a suggested job schedule when deploying FilePolicy:

 

  • FilePolicy - Every day at 22:00, LOW impact, priority 6
  • IndexUpdate - Every six hours, every day, LOW impact, priority 5
  • SmartPools - Monthly (Sunday at 23:00), LOW impact, priority 6

 

Bear in mind that no two clusters are the same, so this suggested job schedule may require additional tuning to meet the needs of a specific environment.

Also, when a cluster running the FSAnalyze job is upgraded to OneFS 8.2, the legacy FSAnalyze index and snapshots are removed and replaced by new snapshots the first time that IndexUpdate is run. The new index stores considerably more file and snapshot attributes than the old FSA index. Until the IndexUpdate job effects this change, FSA keeps running on the old index and snapshots.


Scaling OneFS To New Heights

Posted by trimbn May 6, 2019

The last year has been a busy one over here at Dell EMC Isilon, culminating in last week’s GA release of OneFS 8.2.

 

This new release is a milestone for OneFS, delivering, among other things, a 75% increase in scale:

  • Raising the maximum cluster size from 144 to 252 nodes.
  • Enabling up to 58PB of capacity.
  • Delivering up to 945 GB/s of aggregate throughput per cluster.

 

The four key themes of OneFS 8.2 are:

 

Massive scale

  • New max cluster size of 252 nodes
  • Greater parallelism across all data services

Deep file to object integration

  • CloudPools 2.0 puts cloud-tiered content at file system level
  • Integrated data services across on-prem and multi-cloud

Analytics security

  • TDE support, enabling a secure shared data lake for Hadoop

Broad enterprise enhancements

  • Security -- data movement encryption, multi-factor authentication, improved role-based access, etc.
  • Data services – improvements across snapshots, quotas, tiering, replication, small-file efficiency, etc.

In addition, we also unveiled a new high-performance hybrid platform, the H5600:

 

pipeline_1.png

 

The H5600 is a 4U Gen6 chassis that houses four nodes, each containing twenty 10TB SATA hard drives. This provides both dense capacity and high performance, making it an ideal platform for data-heavy AI workloads such as ADAS (advanced driver assistance systems).
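
That equates to 80 drives and 800TB of raw capacity per 4U chassis (4 nodes x 20 drives x 10TB), before protection overhead is taken into account.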

 

But back to OneFS 8.2… We’ll delve into this new release over the course of the next several blog articles and explore many of the new features and functionality enhancements, which include the following areas:


pipeline_2.png
