Had several field conversations recently around the effect of high storage capacity utilization on cluster performance. Capacity management is a vital part of Isilon system administration and would seem to warrant a blog article.

 

Because OneFS is a single, scalable file system, unencumbered by underlying volume management requirements, it can lead to reduced vigilance on cluster capacity utilization. While the cluster will fire alerts before things become critical, not all sites have additional nodes on hand, sitting around waiting for cluster expansion. The reality is there’s a lead time between ordering and taking delivery of new hardware. As such, it pays to be proactive when it comes to cluster capacity management.

 

When a cluster, or any of its nodepools, becomes more than 90% full, OneFS can experience slower performance and possible workflow interruptions in high-transaction or write-speed-critical operations. Furthermore, when a cluster or pool approaches full capacity (over 95% full), the following issues can occur (a monitoring sketch for these thresholds follows the list):

 

  • Substantially slower performance
  • Workflow disruptions - failed file operations and inability to write data
  • Inability to make configuration changes or run commands to delete data and free up space
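
For proactive monitoring of these thresholds, here’s a minimal sketch that polls overall cluster utilization through the OneFS Platform API (PAPI) and flags the 90% and 95% marks. The cluster address and credentials are placeholders, and the endpoint path and statistics key names are assumptions based on recent OneFS releases, so verify them against your cluster’s API reference:

```python
# Minimal capacity-threshold check against the OneFS Platform API (PAPI).
# Assumptions to verify for your OneFS version: PAPI on port 8080, HTTP
# basic auth, and the statistics keys ifs.bytes.used / ifs.bytes.total.
import requests

CLUSTER = "https://cluster.example.com:8080"  # placeholder address
AUTH = ("monitor_user", "password")           # read-only account

def cluster_utilization():
    """Return overall cluster capacity utilization as a fraction."""
    resp = requests.get(
        f"{CLUSTER}/platform/1/statistics/current",
        params={"key": ["ifs.bytes.used", "ifs.bytes.total"]},
        auth=AUTH,
        verify=False,  # use a proper CA bundle in production
    )
    resp.raise_for_status()
    stats = {s["key"]: s["value"] for s in resp.json()["stats"]}
    return stats["ifs.bytes.used"] / stats["ifs.bytes.total"]

if __name__ == "__main__":
    pct = cluster_utilization() * 100
    if pct >= 95:
        print(f"CRITICAL: cluster {pct:.1f}% full; writes may start failing")
    elif pct >= 90:
        print(f"WARNING: cluster {pct:.1f}% full; expect slower performance")
    else:
        print(f"OK: cluster {pct:.1f}% full")
```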

 

Allowing a cluster or pool to fill can put the cluster into a non-operational state that can take significant time (hours, or even days) to correct. Therefore, it is important to keep your cluster or pool from becoming full. To ensure that a cluster or its constituent pools do not run out of space:

 

  • Add new nodes to existing clusters or pools
  • Replace smaller-capacity nodes with larger-capacity nodes
  • Create more clusters

 

OneFS will issue notifications when cluster capacity starts to reach levels of concern. If the warning events and alerts are not heeded, the following error messages can be displayed when attempting to write to a full, or nearly full, cluster or pool:


 

| Error Message | Where Error is Displayed |
|---|---|
| The operation can’t be completed because the disk “<share name>” is full. | OneFS WebUI, or the command line interface on an NFS client. |
| No space left on device. | OneFS WebUI, or the command line interface on an NFS client. |
| No available space. | OneFS WebUI, or the command line interface on a Windows or SMB client. |
| ENOSPC (error code) | Written to the cluster’s /var/log/messages file; the error code is embedded in another message. |
| Failed to satisfy layout preference. | Written to the cluster’s /var/log/messages file. |
| Disk Quota Exceeded. | Cluster command line interface, or an NFS client when a snapshot reserve limitation is encountered. |

 

 

When deciding to add new nodes to an existing cluster or pool, contact your sales team to order the nodes well in advance of the cluster or pool running short on space. The recommendation is to start planning for additional capacity when the cluster or pool reaches 75% full. This allows enough time to receive and install the new hardware while still maintaining sufficient free space.
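
To make that 75% trigger concrete, the following sketch projects how many days remain before each threshold is crossed, given current utilization and an observed growth rate, and compares the 90% date against the hardware lead time. All of the input figures are illustrative placeholders; substitute real values from your own monitoring:

```python
# Estimate days remaining before a pool crosses each capacity threshold,
# assuming linear growth. All figures below are illustrative placeholders.
TOTAL_TB = 500.0          # pool capacity
USED_TB = 360.0           # current usage (72% full)
GROWTH_TB_PER_DAY = 0.8   # observed average daily growth
LEAD_TIME_DAYS = 45       # order-to-install time for new nodes

def days_until(threshold_pct):
    """Days until the pool reaches threshold_pct percent full."""
    target_tb = TOTAL_TB * threshold_pct / 100.0
    return max((target_tb - USED_TB) / GROWTH_TB_PER_DAY, 0.0)

for pct, label in [(75, "start planning"), (90, "performance risk"),
                   (95, "workflow risk")]:
    print(f"{pct}% ({label}): ~{days_until(pct):.0f} days away")

# If 90% is closer than the hardware lead time, it is time to order.
if days_until(90) <= LEAD_TIME_DAYS:
    print("Order capacity now: 90% will be reached within the lead time.")
```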


Here’s the recommended timeline for cluster capacity planning purposes:


[Figure: capacity_management_1.png (recommended capacity planning timeline)]

 

If your data availability and protection SLA varies across different data categories (for example, home directories, file services, and so on), ensure that any snapshot, replication, and backup schedules are configured accordingly to meet the required availability and recovery objectives, and fit within the overall capacity plan.
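
As one example of aligning data protection with the capacity plan, the sketch below creates a daily snapshot schedule with an explicit expiry via the Platform API, so snapshot consumption stays bounded. The schedule name is hypothetical, and the endpoint and payload fields reflect the PAPI snapshot schedules resource as I understand it, so treat them as assumptions to check against your OneFS API documentation:

```python
# Create a daily snapshot schedule with a one-week expiry so snapshot
# usage stays predictable. Endpoint and payload fields are assumptions
# to verify against your OneFS version's PAPI reference.
import requests

CLUSTER = "https://cluster.example.com:8080"  # placeholder address
AUTH = ("admin_user", "password")

schedule = {
    "name": "home-daily",                # hypothetical schedule name
    "path": "/ifs/home",
    "pattern": "home-daily-%Y-%m-%d",    # snapshot naming pattern
    "schedule": "every day at 01:00",
    "duration": 7 * 24 * 3600,           # expire after 7 days (seconds)
}

resp = requests.post(
    f"{CLUSTER}/platform/1/snapshot/schedules",
    json=schedule,
    auth=AUTH,
    verify=False,  # use a proper CA bundle in production
)
resp.raise_for_status()
print("Created snapshot schedule:", resp.json())
```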

 

Consider configuring a separate accounting quota for /ifs/home and /ifs/data directories (or wherever data and home directories are provisioned) to monitor aggregate disk space usage and issue administrative alerts as necessary to avoid running low on overall capacity.
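
A minimal sketch of that configuration through the Platform API follows. It creates unenforced (accounting-only) directory quotas with an advisory threshold, so alerts fire without ever blocking writes. The payload field names follow the SmartQuotas PAPI resource as I recall it, and the threshold value is purely an example, so verify both against your cluster’s API documentation:

```python
# Create accounting (unenforced) directory quotas with advisory
# thresholds so administrative alerts fire well before space runs low.
# Payload fields are assumptions; check the SmartQuotas PAPI docs.
import requests

CLUSTER = "https://cluster.example.com:8080"  # placeholder address
AUTH = ("admin_user", "password")
ADVISORY_BYTES = 40 * 1024**4  # example threshold: 40 TiB per directory

for path in ("/ifs/home", "/ifs/data"):
    quota = {
        "path": path,
        "type": "directory",
        "enforced": False,        # accounting-only: never blocks writes
        "include_snapshots": False,
        "thresholds": {"advisory": ADVISORY_BYTES},
    }
    resp = requests.post(
        f"{CLUSTER}/platform/1/quota/quotas",
        json=quota,
        auth=AUTH,
        verify=False,  # use a proper CA bundle in production
    )
    resp.raise_for_status()
    print(f"Created accounting quota on {path}")
```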

 

InsightIQ provides detailed monitoring and trending functionality to help with capacity consumption projections and usage forecasting.
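
If InsightIQ isn’t deployed, a rough forecast can still be derived from periodic capacity samples. Here’s a small sketch that fits a least-squares trend line to historical utilization and projects when the 90% mark will be crossed; the sample series is fabricated purely for illustration:

```python
# Rough capacity forecast: fit a least-squares line to periodic
# utilization samples and project when 90% full will be reached.
# The sample series is illustrative; feed in real capacity history.
from statistics import mean

# (day, percent_full) samples, e.g. one per week
samples = [(0, 61.0), (7, 62.1), (14, 63.4), (21, 64.2), (28, 65.5)]

xs = [d for d, _ in samples]
ys = [p for _, p in samples]
x_bar, y_bar = mean(xs), mean(ys)

# Least-squares slope (percent per day) and intercept
slope = (sum((x - x_bar) * (y - y_bar) for x, y in samples)
         / sum((x - x_bar) ** 2 for x in xs))
intercept = y_bar - slope * x_bar

if slope > 0:
    days_to_90 = (90.0 - intercept) / slope - xs[-1]
    print(f"Projected to reach 90% in ~{days_to_90:.0f} days at current growth")
else:
    print("Utilization is flat or shrinking; no projection")
```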


[Figure: capacity_management_2.png (InsightIQ capacity monitoring and forecasting)]

 

For optimal performance, the recommendation is to maintain at least 10% free space in each pool of a cluster, regardless of cluster size.


To better protect smaller clusters (containing 3 to 7 nodes), the recommendation is to maintain 15 to 20% free space. A full smartfail of a node in a smaller cluster may require more than one node’s worth of free space; keeping 15 to 20% free allows the cluster to continue operating while Isilon support assists with recovery.
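
These sizing rules are straightforward to encode in monitoring. A small sketch, using the conservative 20% end of that band for small clusters and 10% otherwise (the pool figures are illustrative):

```python
# Check a pool's free space against the recommended minimums:
# 10% generally, and 15-20% (the conservative 20% is used here)
# for small clusters of 3 to 7 nodes.

def min_free_fraction(node_count):
    """Recommended minimum free-space fraction for a cluster or pool."""
    return 0.20 if 3 <= node_count <= 7 else 0.10

def check_pool(name, node_count, free_bytes, total_bytes):
    required = min_free_fraction(node_count)
    actual = free_bytes / total_bytes
    status = "OK" if actual >= required else "BELOW RECOMMENDED MINIMUM"
    print(f"{name}: {actual:.1%} free (target >= {required:.0%}) -> {status}")

# Illustrative values only; pull real figures from your own monitoring.
check_pool("h500_pool", node_count=4,
           free_bytes=90 * 1024**4, total_bytes=400 * 1024**4)
```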


Plan for contingencies: having sufficient free space, plus a replica or fully updated backup of your data, can limit downtime and mitigate the risk of data loss if a node fails.