In the previous article, we looked at some of the effects of high capacity utilization on a cluster. Now, let’s explore some of the practical ways to monitor and manage cluster storage capacity, starting with maintaining appropriate protection levels.

 

Every time you add nodes, re-evaluate protection levels and ensure that your cluster and pools are protected at the appropriate level. OneFS includes a ‘suggested protection’ function that calculates a recommended protection level based on the cluster configuration, and alerts you if the cluster falls below this suggested level.

 

capacity_management_3.png

 

OneFS supports several protection schemes. These include the ubiquitous +2d:1n, which protects against two drive failures or one node failure. Use the recommended protection level for a particular cluster configuration. This recommended level of protection is clearly marked as ‘suggested’ in the OneFS WebUI storage pools configuration pages, and is typically configured by default.

 

For all current Gen6 hardware configurations, the recommended protection level is “+2d:1n”.

Monitoring cluster capacity is typically a two-pronged approach which involves:


  • Proactive monitoring and trend analysis:  Includes setting notification quotas and using the FSAnalyze job in conjunction with InsightIQ, etc.


  • Reactive alerting: Automated notification when capacity thresholds are reached.

 

Set up event notification rules so that you will be notified when the cluster begins to reach capacity thresholds. Make sure to enter a current email address in order to receive the notifications.


This can be configured from the WebUI by browsing to Cluster Management > General Settings > Email Settings.


capacity_management_3-2.png


More info can be found here.


The cluster sends notifications when it has reached 95 percent and 99 percent capacity. On some larger clusters, 5 percent (or even 1 percent) capacity remaining might mean that a lot of space is still available, so you might be inclined to ignore these notifications. However, it is best to pay attention to the alerts, closely monitor the cluster, and have a plan in place to take action when necessary.
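
As a rough illustration of the two alert thresholds above, the following Python sketch classifies a cluster's utilization and shows how much free space a "95 percent full" alert can still represent on a large cluster. The function name, the level labels, and the example sizes are assumptions made for illustration; this is not OneFS code or a OneFS API.

```python
def capacity_alert_level(used: float, total: float) -> str:
    """Classify utilization against the 95% and 99% notification thresholds.

    The labels 'ok', 'warning' and 'critical' are illustrative names only.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    pct_used = 100.0 * used / total
    if pct_used >= 99.0:
        return "critical"
    if pct_used >= 95.0:
        return "warning"
    return "ok"


# Hypothetical 2048 TiB cluster with 1946 TiB used (just over 95%).
total_tib = 2048
used_tib = 1946
level = capacity_alert_level(used_tib, total_tib)
free_tib = total_tib - used_tib
print(f"{level}: {free_tib} TiB still free")
```

Even in the "warning" state, this hypothetical cluster has roughly 100 TiB free, which is why the alerts are tempting to ignore and why a plan of action matters more than the raw number.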


There are three main options for monitoring the data ingest rate on a cluster:


    • SNMP
    • SmartQuotas
    • FSAnalyze


FSAnalyze is a job-engine job that the system runs to create data for InsightIQ’s file system analytics tools. It provides details about data properties and space usage within the /ifs directory. Unlike SmartQuotas, FSAnalyze updates its views only when the FSAnalyze job runs. Since FSAnalyze is a fairly low-priority job by default, it can sometimes be preempted by higher-priority jobs and therefore take significant time to gather all of the data. An InsightIQ license is required to run an FSAnalyze job.


Quotas are useful for both monitoring and enforcing administrator-defined storage limits on the cluster. SmartQuotas manages storage use, monitors disk storage, and issues alerts when disk storage limits are exceeded. Although it does not provide the same level of file system detail that FSAnalyze does, SmartQuotas maintains a real-time view of space utilization so that you can quickly obtain the information you need.


On the data management side, good housekeeping includes regularly archiving data that is rarely accessed and deleting any unused and unwanted data. Ensure that pools do not become too full by setting up file pool policies to move data to other tiers and pools.


To guard against your cluster or pools running out of space, you can add new nodes to existing clusters or pools, replace smaller-capacity nodes with larger-capacity nodes, or even create more clusters. If you decide to add new nodes to an existing cluster or pool, contact your sales team to order the nodes long before the cluster or pool runs out of space. We recommend that you begin the ordering process when the cluster or pool reaches 75% used capacity. This will allow enough time to receive and install the new equipment and still maintain enough free space.
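
To turn the 75% guideline into a date on the calendar, you can estimate how long the cluster has before it reaches that point. The sketch below assumes a constant daily ingest rate, which real workloads rarely have, so treat the result as a rough planning figure. The function name and the example numbers are hypothetical.

```python
import math
from typing import Optional


def days_until_threshold(
    used_tb: float,
    total_tb: float,
    daily_growth_tb: float,
    threshold_pct: float = 75.0,
) -> Optional[int]:
    """Estimate whole days until used capacity reaches threshold_pct.

    Assumes linear growth. Returns 0 if the threshold is already reached,
    and None if there is no growth (the threshold is never reached).
    """
    if total_tb <= 0:
        raise ValueError("total_tb must be positive")
    target_tb = total_tb * threshold_pct / 100.0
    if used_tb >= target_tb:
        return 0
    if daily_growth_tb <= 0:
        return None
    return math.ceil((target_tb - used_tb) / daily_growth_tb)


# Hypothetical pool: 1000 TB total, 600 TB used, growing by 2 TB per day.
print(days_until_threshold(600, 1000, 2.0))
```

Comparing this estimate against the lead time for ordering, delivery, and installation indicates whether the ordering process needs to start now.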


Sometimes a cluster has many old snapshots that take up a lot of space. Reasons for this include inefficient deletion schedules, a degraded cluster preventing job execution, an expired SnapshotIQ license, etc. Clearing out old snapshots or, preferably, configuring an automated snapshot deletion schedule, helps to reclaim cluster capacity.
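
When reviewing snapshots, the first step is usually to identify which ones are older than your intended retention period. The sketch below shows that filtering logic on a plain list of (name, creation time) pairs. The snapshot names, dates, and the 30-day retention value are made up for illustration, and the code does not call any OneFS interface.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def snapshots_past_retention(
    snapshots: List[Tuple[str, datetime]],
    now: datetime,
    max_age_days: int,
) -> List[str]:
    """Return the names of snapshots created more than max_age_days before now."""
    cutoff = now - timedelta(days=max_age_days)
    return [name for name, created in snapshots if created < cutoff]


# Hypothetical snapshot inventory.
inventory = [
    ("daily_2023-01-05", datetime(2023, 1, 5)),
    ("daily_2023-03-01", datetime(2023, 3, 1)),
    ("daily_2023-03-20", datetime(2023, 3, 20)),
]
stale = snapshots_past_retention(inventory, now=datetime(2023, 3, 25), max_age_days=30)
print(stale)
```

A list like this is only a review aid; an automated deletion schedule remains the preferable long-term fix.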

Each version of OneFS supports only certain nodes. Refer to the “OneFS and node compatibility” section of the Isilon Supportability and Compatibility Guide for a list of which nodes are compatible with each version of OneFS. When upgrading OneFS, make sure that the new version supports your existing nodes. If it does not, you might need to replace the nodes.


Space and performance are optimized when all nodes in a pool are compatible. When you add new nodes to a cluster, OneFS automatically provisions nodes into pools with other nodes of compatible type, hard drive capacity, SSD capacity, and RAM. Occasionally, however, the system might put a node into an unexpected location. If you believe that a node has been placed into a pool incorrectly, contact Isilon Technical Support for assistance. Different versions of OneFS have different rules regarding what makes nodes compatible.


OneFS provides a feature known as Virtual Hot Spare (VHS), which keeps space in reserve in case you need to smartfail drives when the cluster gets close to capacity. Enabling VHS will not give you more free space, but it will help protect your data in the event that space becomes scarce. VHS is enabled by default. Isilon strongly recommends that you do not disable VHS unless directed by a Support engineer. If you disable VHS in order to free some space, the space you just freed will probably fill up again very quickly with new writes. At that point, if a drive were to fail, you might not have enough space to smartfail the drive and re-protect its data, potentially leading to data loss. If VHS is disabled and you upgrade OneFS, VHS will remain disabled. If VHS is disabled on your cluster, first check to make sure the cluster has enough free space to safely enable VHS, and then enable it.


capacity_management_4.png


The SmartPools Spillover feature allows data that is being sent to a full pool to be diverted to an alternate pool. Spillover is enabled by default on clusters that have more than one pool. If you have a SmartPools license on the cluster, you can disable Spillover. However, it is recommended that you keep Spillover enabled. If a pool is full and Spillover is disabled, you might get a “no space available” error even though a large amount of space is still left elsewhere on the cluster.
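
The behavior described above can be pictured with a small conceptual model. The sketch below is not how OneFS implements Spillover; the `Pool` class, the `choose_pool` function, and the pool names are hypothetical and exist only to show why a write can fail with plenty of free space elsewhere when Spillover is disabled.

```python
from dataclasses import dataclass
from typing import List


class NoSpaceError(Exception):
    """Raised when a write cannot be placed in any permitted pool."""


@dataclass
class Pool:
    name: str
    free_bytes: int


def choose_pool(
    target: Pool,
    alternates: List[Pool],
    write_size: int,
    spillover_enabled: bool = True,
) -> Pool:
    """Pick the pool that receives a write of write_size bytes.

    The target pool is used if it has room. Otherwise, the write is
    diverted to the first alternate with room, but only if spillover
    is enabled.
    """
    if target.free_bytes >= write_size:
        return target
    if spillover_enabled:
        for pool in alternates:
            if pool.free_bytes >= write_size:
                return pool
    raise NoSpaceError(f"no space available for {write_size} bytes")


# Hypothetical two-pool cluster: the target pool is full, the other is not.
performance = Pool("performance", free_bytes=0)
archive = Pool("archive", free_bytes=10**12)

print(choose_pool(performance, [archive], 4096).name)
```

With `spillover_enabled=False`, the same call raises `NoSpaceError` even though the archive pool in this example has about a terabyte free.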


Additionally, it’s well worth periodically running and reviewing the Isilon Advisor health check report, especially prior to re-configuring and/or adding new nodes to the cluster.

 

capacity_management_5.png

 

Isilon Advisor will help confirm that:

  • There are no cluster issues.
  • OneFS configuration is as expected.