Top tips every admin should know to keep their Isilon cluster healthy

NOTE: This topic is part of the Uptime Information Hub.

 

 

Isilon OneFS Cluster Utilization: Keep below 90%

  • As a general rule of thumb, keep cluster utilization below 90%. If your nodes have 8 TB or 10 TB drives, keep utilization at 80% or below for better performance. System performance may degrade beyond the recommended utilization level.
  • When a drive fails at utilization above 90%, the system takes longer to smartfail (reprotect) the drive. Reprotection after a node failure at 90% or above may take a long time, and in some cases may not complete if there is not enough free capacity.
          Note: A node failure in Isilon does not automatically start the smartfail (reprotect) process; an administrator has to initiate it.
  • When capacity is above 90% and a drive fails, the load increases on the remaining drives, which further degrades performance.
  • During a drive failure, the Job Engine goes into degraded status to run FlexProtect or FlexProtectLin, the Isilon maintenance jobs that re-protect data on the remaining drives. By default, no other maintenance jobs (for example, SnapshotDelete or Collect) can run, although Support can modify the degraded status in certain cases to allow other jobs to run. As a result, utilization can increase rapidly as snapshots pending deletion accumulate.
  • Verify that nodes, node pools, and disk pools are below 90%. Isilon OneFS has maintenance jobs to keep them balanced, but in some cases workflow requirements (for example, the use of file pool policies, node additions, or data deletion) can push some nodes, node pools, or disk pools above 90%.
  • You will prevent many problems by keeping Isilon system utilization below 90%. Depending on your sales cycle and the internal and external processes required to get nodes onsite and added to a cluster, start the conversation about adding nodes when utilization reaches 80 to 85%.
  • For more information, refer to the Best Practices Guide for Maintaining Enough Free Space on Isilon clusters.
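
The thresholds above can be turned into a simple check. The following is a minimal Python sketch, not an Isilon tool: the function names, the status labels, and the example pool figures are all hypothetical, and the used and total capacity values are assumed to come from whatever reporting you already use.

```python
# Thresholds taken from the guidance above. All names here are hypothetical.
LIMIT_PCT = 90.0  # keep nodes, node pools, and disk pools below this level
PLAN_PCT = 80.0   # start the conversation about adding nodes at 80 to 85%


def utilization_pct(used_tb, total_tb):
    """Return utilization as a percentage of total capacity."""
    if total_tb <= 0:
        raise ValueError("total_tb must be positive")
    return 100.0 * used_tb / total_tb


def classify(pct, limit=LIMIT_PCT, plan=PLAN_PCT):
    """Map a utilization percentage to a coarse status label."""
    if pct >= limit:
        return "over limit"
    if pct >= plan:
        return "plan expansion"
    return "ok"


def report(pools):
    """pools maps a pool name to a (used_tb, total_tb) tuple."""
    return {name: classify(utilization_pct(used, total))
            for name, (used, total) in pools.items()}


# Made-up figures for illustration only.
example = {"pool_a": (410.0, 500.0), "pool_b": (120.0, 400.0)}
print(report(example))
```

For node pools with 8 TB or 10 TB drives, the same check could be run with a lower limit, for example `classify(pct, limit=80.0, plan=70.0)`, where the 70% planning value is an assumption rather than a figure from this article.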

 

Protection Level: Keep node pools at the recommended protection level

  • OneFS uses the Reed Solomon algorithm for N+M protection. In the N+M data protection model, N represents the number of data-stripe units, and M represents the number of simultaneous node or drive failures—or a combination of node and drive failures—that the cluster can withstand without incurring data loss. N must be larger than M.
  • OneFS 7.2.x and later recommends a protection level for each node pool in the Web Administration Interface under File System > Storage Pools > SmartPools. Re-evaluate the protection level every time you add a new node, and use the recommended protection level for each node pool.
  • Changing the protection level can change capacity utilization because data needs to be protected at higher/new protection levels. If your utilization level is high, consult a Dell EMC Isilon expert to consider the impact on utilization.
  • Nodes that are the same or equivalent (see the Isilon Support and Compatibility Guide) are placed in the same node pool.
  • Data will be re-protected at the new protection level after the successful completion of an Isilon maintenance job.
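
To see why a change in protection level affects capacity utilization, it can help to work through the arithmetic of an N+M layout. The sketch below is a simplified illustration in Python: it assumes every stripe has exactly N data units and M protection units, which ignores how OneFS actually chooses stripe widths for a given node pool and file size. The function names are hypothetical.

```python
def protection_overhead(n, m):
    """Fraction of each N+M stripe that holds protection rather than data."""
    if m < 1 or n <= m:
        raise ValueError("require N > M >= 1")
    return float(m) / (n + m)


def usable_tb(raw_tb, n, m):
    """Approximate capacity left for data under the simplified model."""
    return raw_tb * (1.0 - protection_overhead(n, m))


# Comparing two layouts on the same (made-up) 100 TB of raw capacity.
for n, m in [(8, 1), (8, 2)]:
    print("N=%d M=%d overhead=%.1f%% usable=%.1f TB"
          % (n, m, 100 * protection_overhead(n, m), usable_tb(100.0, n, m)))
```

Under these assumptions, moving from M=1 to M=2 with the same N raises the protection overhead, so the same data consumes more raw capacity after it is re-protected. Treat the numbers as a rough guide and confirm the real impact with a Dell EMC Isilon expert.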

 

Target Code: Always stay on Target Code or Target -1

  • Dell EMC Isilon releases code as generally available (GA) after it has completed internal testing. Once code satisfies specific criteria, which include production time in the field, deployments across all supported node platforms, and other quality metrics, Dell EMC designates that code as Target Code. To ensure that Isilon clusters are running the most stable and reliable version of OneFS, upgrade to the latest available Target Code for the OneFS family that meets your business needs. If you can’t upgrade to Target Code, stay at least at Target -1 code. There are some exceptions to these guidelines: Isilon experts such as Dell EMC Support or a Technical Account Manager (TAM) can recommend a specific code version that may not be Target Code. For more information, see the Isilon Uptime Info Hub.
  • Plan your upgrade and refer to the OneFS Upgrades - Isilon Info Hub. Check with your account team or Dell EMC support to see if you can utilize the Dell EMC Remote Proactive Support (RPS) team to do a pre-health check and upgrade.

 

Patches: Review and install applicable patches

  • Refer to the Current Isilon OneFS Patches guide to find information about a patch for any version of OneFS.
  • Look for patches applicable to your version of OneFS and workflow.
    • Each patch lists a summary of the patch, what version of OneFS it applies to, and what MR version the bug is fixed in.
    • Not all patches will be applicable to your workflow (for example, if you don’t use HDFS, you don’t need HDFS patches).
    • Some patches require a reboot of nodes, some require only the restart of a few services, and some can be installed online.
    • Download the patch file by clicking on the Patch-ID. The zip file contains the patch and a README file. The README provides full details on the impact of installing the patch and on the installation procedure.
    • You can also view this video to understand the patch and install process.
  • Periodically check for new patches that apply to your system and are relevant to your workflow.

 

Firmware: Check firmware version every three months

Note: On 5th gen nodes (S210, X410, X210, NL410, HD400), BMC firmware and CMC firmware need to be upgraded as well. These updates include new features and resolve known issues that might be relevant to you. Node firmware requires a node reboot. BMC firmware requires a node reboot. Drive firmware upgrades are mostly online, with some exceptions for certain drive models and/or OneFS versions.
  • Some firmware may have a patch or OneFS requirement. Always go through the Release Notes before upgrading firmware. Release Notes for each version of OneFS can be found in the Target Code section of the Isilon Uptime Info Hub.
  • Refer to Current Isilon Software Releases to find the latest firmware versions.

 

EMC Technical and Security Advisories: Do not ignore

  • Make sure you are registered on the Dell EMC Web site to receive EMC Technical Advisories (ETAs) and EMC Security Advisories (ESAs).
    • When you receive an email, read it carefully. If you have a question, contact Dell EMC Support, Sales, or other Dell EMC employees (for example, TAM, SAM, CSA, DSE) to get more details.
    • See Dell EMC knowledge base article 334017 for steps to subscribe, or watch the following video.
  • ETAs and ESAs are specific to OneFS versions, so check whether you have an Isilon cluster running an affected version of OneFS and whether that cluster is impacted.
  • ETAs alert you about potential hardware or software issues that could cause serious negative impacts to a production environment, such as data loss, data unavailability, loss of system functionality, or anything that could result in a significant safety risk. The advisories include specific details about the issue and instructions to help prevent or alleviate the problem. To determine the impact of the ETA, read the severity rating description in the impact section of the ETA and impacted OneFS versions. For more information, see: EMC Technical Advisories (ETAs) for Isilon OneFS.
  • ESAs alert you to potential security vulnerabilities and their remedies for Dell EMC products. The advisories include specific details about the issue and instructions to help prevent or alleviate the problem. Common Vulnerabilities and Exposures (CVEs) identify publicly known security concerns. A Dell EMC ESA can address one or more CVEs. For more information, see: EMC Isilon Security Advisories (ESAs).

 

ESRS/Alerts: Connect your Isilon systems to EMC through ESRS and configure events by email

  • With the release of OneFS 7.1, Isilon products can utilize ESRS for remote connectivity (video).
Note: The ESRS version must be at 2.24 or higher. Isilon events can go out via email, SNMP, or ESRS. Not all events generate an Isilon SR, so it is important to configure email notification for events.
  • ESRS also allows remote support to gather logs and connect to devices securely. You can manage access to devices using ESRS policy manager.
  • Without ESRS and email notifications, you could miss out on important events and FCOs since Dell EMC may not have any information about the device in your data center.
  • Ask your Dell EMC Support representative for more information on configuring ESRS (at no cost).

 

Isilon Maintenance Job: Verify that it is running

OneFS uses the Job Engine to schedule maintenance tasks, known as jobs. Some of these jobs are critical (for example, SnapshotDelete to delete expired snapshots, MultiScan or AutoBalance to keep utilization balanced across all nodes, and FlexProtect to reprotect data from failed devices). Regularly confirm that jobs are running by checking the Job Engine status with the isi job status command.
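
If you want to capture the Job Engine status as part of a routine check, the command can be wrapped in a small script. The following Python sketch is only an illustration: the function name is hypothetical, it assumes Python is available wherever the isi command is on the PATH, and it returns the raw output without parsing it because the output layout is not described here.

```python
import subprocess


def job_status_text(command=("isi", "job", "status")):
    """Run the Job Engine status command and return its output as text.

    The command is a parameter so the wrapper can be exercised without
    a cluster. check_output raises CalledProcessError if the command
    exits with a non-zero status.
    """
    return subprocess.check_output(list(command), universal_newlines=True)


# Example, to be run where the isi command is available:
#     print(job_status_text())
```

Review the captured output manually, or save it alongside your other health-check records, rather than relying on automated interpretation.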

For additional details about the Isilon Job Engine, read Ask the Expert: The What, Why and How of the Isilon Job Engine.

Many configuration options are available for the number of jobs, the number of workers, and the priority and impact of each job. Change them only when directed by a Dell EMC engineer.

 

Isilon Events: Verify that you and your cluster are receiving event alerts

The Isilon cluster creates an event when it detects an issue. Not all events result in Dell EMC Service Requests (for example, quota exceeded or SyncIQ RPO events). It is important to keep an eye on those events and to ensure that you have configured channels (for example, SMTP or SNMP) to receive them.

  • If you don't receive events, check your SMTP and SNMP channels. If those are working, try resetting the CELOG (Clusterwide Event Log) database. See: OneFS: How to reset the CELOG database and clear all historical events, article 304312.
  • Send a test event once a week to ensure that your system is sending events.
  • OneFS 8.0 and above allows you to put the CELOG into maintenance mode to avoid receiving alerts or triggering Dial-Home service requests while tests or planned activities are being performed on the Isilon cluster. See OneFS 8.0+ How to place the CELOG into Maintenance mode, article 494523.

 

Cluster Health Check: Gather logs and check health using Isilon Advisor, and use InsightIQ to monitor performance

  • Use the Isilon Advisor (IA), a free application that enables you to check the health of your Isilon cluster and to resolve common Isilon issues.
  • InsightIQ is Dell EMC software that lets you monitor performance and file system statistics, understand how the system performs during normal operation, and find out what types of files are being written to Isilon OneFS. It is a single pane of glass for performance monitoring of different Isilon systems.

Note: InsightIQ does require a license. Talk with your account representative to obtain a license.


Additional Information


  For additional troubleshooting, visit the Customer Troubleshooting Info Hub for troubleshooting guides that will help you to resolve common Isilon issues.