NOTE: This topic is part of the Uptime Information Hub.
Troubleshooting performance issues
Performance occupies a space all of its own: troubleshooting it is always about finding and moving the weakest link. When you're troubleshooting performance issues, start by breaking the problem down into its smallest and simplest parts.
Performance issues are typically caused by network traffic, network configuration issues, client or cluster processing loads, or a combination thereof. Symptoms include client computers that perform slowly, and jobs—particularly those that run on the cluster—that either fail or take longer than expected to complete.
Simple maintenance tasks, such as checking free disk space both for the local file systems on each cluster node and for the /ifs file system, are critical to stay on top of. We cannot emphasize enough how important it is to maintain at least 10 percent free space on /ifs. If you fill the file system up, you can experience needless downtime. When a cluster or pool is more than 90 percent full, the system can experience slower performance and possible workflow interruptions in high-transaction or write-speed-critical operations.
Further, when a cluster or pool approaches full capacity (over 98 percent full), the following issues can occur:
- Substantially slower performance
- Workflow disruptions (failed file operations)
- Inability to write data
- Inability to make configuration changes
- Inability to run commands that would ordinarily be used to free up space
- Inability to delete data
- Data unavailability
- Potential failure of client operations or authentication (to connect/mount and navigate)
- Potential data loss
Allowing a cluster or pool to fill up can put the cluster into a non-operational state that can take significant time (hours, or even days) to correct. Therefore, it is important to keep your cluster or pool from becoming full.
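The thresholds above can be folded into a simple monitoring check. The sketch below is a minimal, hypothetical example: it assumes you can obtain used and total capacity for the cluster or pool by some means (for example, from `df` output for /ifs on a node, or from the OneFS platform API); the numbers shown are illustrative, not from a real cluster.

```python
# Minimal sketch: classify cluster or pool fullness against the thresholds
# described above (over 90% full: degraded; over 98% full: critical).
# The capacity figures passed in are hypothetical placeholders.

def fullness_status(used_tb, total_tb):
    """Return a coarse status string for a cluster or pool."""
    pct_used = 100.0 * used_tb / total_tb
    if pct_used > 98:
        # Risk of failed writes, unrunnable commands, data unavailability.
        return "critical"
    if pct_used > 90:
        # Slower performance and possible workflow interruptions.
        return "degraded"
    return "ok"

# Hypothetical example: a 100 TB pool with 92 TB used.
print(fullness_status(92, 100))  # degraded
```

A check like this, run on a schedule and wired to an alert, helps you act before the cluster crosses into the non-operational territory described above.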
Monitor the performance of your cluster regularly, so that you can identify unexpected behavior before it adversely affects performance. Check the cluster status page by using the OneFS web administration interface to monitor how much space is being used, and how much space remains free, for the entire cluster and for each node. If the cluster, or certain directories on the cluster, are at or approaching full capacity, the cluster's overall performance might be degraded.
By default, the virtual hot spare is enabled and maintains sufficient free space for one drive to be Smartfailed. This function should not be disabled, except under exceptional circumstances. If a drive in the node fails, the virtual hot spare reserves the free space needed to rebuild the data from that failed drive. Every six months, confirm that the virtual hot spare is enabled, and that sufficient space is allotted to it. For more information about enabling the virtual hot spare, and for additional steps on how to maintain enough free space on your cluster, refer to Best Practices Guide for Maintaining Enough Free Space on Isilon Clusters and Pools.
Also, on the cluster status page in the web administration interface, review the cluster throughput, CPU usage, client connections, and active events. If you find any active critical events or emergency events, contact EMC Online Support.
As mentioned earlier, performance is a subject of its own, distinct from and yet encompassing all other areas. For performance issues, it is vital to quantify the issue and determine:
- What, exactly, is slow?
- What is the expected performance metric?
- What is the actual performance metric?
- Is the expectation reasonable and mathematically feasible?
- Is this an issue of degraded performance?
Quantifying the performance degradation
The essential first step in a performance issue is to quantify it. "It's slow" isn't helpful. It is necessary to break down, in as much detail as possible, exactly what "it" is and exactly what "slow" is. It's also important to understand the expectation. The fact that anyone would say something is slow means that they have a performance expectation in mind. This specific expectation needs to be captured, along with any justification and reasoning as to why the expectation exists.
Is it "reasonable"? The point here is that, if there's a client farm and a OneFS server, and they are linked by a 10 Gigabit Ethernet-based network, any expectation that there will be greater than 10 Gigabits per second of actual data transferred in either direction is not realistic and must be reset. Disk input/output (I/O) performance is more difficult to quantify, but it is unlikely, for example, that you'll achieve 1 Gigabyte per second of sustained streaming write performance to a three-node 36 NL cluster. In this case, use your judgment.
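The feasibility check for the network portion is simple arithmetic and can be sketched as follows. This models only the raw link capacity; protocol overhead, disk I/O, and contention will reduce real-world numbers further.

```python
# Sanity-check a performance expectation against raw link capacity.
# Only the network link is modeled here; everything else in the path
# (protocol overhead, disks, CPU) can only lower the achievable number.

def link_limit_bytes_per_sec(link_gigabits):
    """Theoretical upper bound on throughput for a link, in bytes per second."""
    return link_gigabits * 1e9 / 8  # 10 GbE -> 1.25e9 bytes/s

def expectation_is_feasible(expected_bytes_per_sec, link_gigabits):
    return expected_bytes_per_sec <= link_limit_bytes_per_sec(link_gigabits)

# Expecting 2 GB/s through a single 10 GbE link is mathematically infeasible;
# 1 GB/s is at least within the link's theoretical 1.25 GB/s ceiling.
print(expectation_is_feasible(2e9, 10))  # False
print(expectation_is_feasible(1e9, 10))  # True
```

If the expectation fails even this best-case test, it must be reset before any further troubleshooting.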
When an issue arises, first determine when it occurred and what changed on the cluster. Determine whether the change was expected and is a known issue, or if the change caused an unexpected issue. If some functionality worked in the past, but a change caused it to not work, then you will need to investigate further. In this case, a rigorous change control process can help significantly. An Isilon cluster exists in a complex environment and relies on a large number of external entities and systems. As such, a holistic change control process will serve you well. As an example, a firewall change was made on an internal EMC Isilon cluster that broke SmartConnect DNS, but because EMC IT had a thorough change control process in place, we were quickly able to determine what environmental changes had occurred around the time of the event and revert the change that caused the issue.
Is this a regression? If so, understand what changed on the cluster. Something changed. This is where you get out your change control log to see if any changes were made to the environment. Did somebody upgrade the client operating system or application(s)? Were network changes made? There are things to look for on the cluster—especially in OneFS 7.1, where Job Engine changes were made. Is there a Job Engine job running on the cluster, and what job impact policy is in effect? The job will be consuming disk input/output operations (IOPS) that would otherwise go to clients.
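The change-control review described above amounts to filtering the change log for entries near the incident time. The sketch below is hypothetical: the log format and entries are invented for illustration, and in practice the log would come from a ticketing system or configuration-management database.

```python
# Sketch of a regression investigation step: given an incident time,
# list the changes recorded in the change-control log within a
# surrounding window. Log entries here are hypothetical examples.
from datetime import datetime, timedelta

def changes_near(log, incident_time, window_hours=24):
    """Return log entries within +/- window_hours of the incident."""
    window = timedelta(hours=window_hours)
    return [entry for entry in log
            if abs(entry["time"] - incident_time) <= window]

log = [
    {"time": datetime(2014, 3, 1, 9, 0),  "change": "client OS upgrade"},
    {"time": datetime(2014, 3, 3, 14, 0), "change": "firewall rule update"},
    {"time": datetime(2014, 2, 10, 8, 0), "change": "switch firmware update"},
]
incident = datetime(2014, 3, 3, 16, 30)

for entry in changes_near(log, incident):
    print(entry["change"])  # firewall rule update
```

In the SmartConnect DNS example above, this kind of lookup is exactly what a thorough change control process makes possible: the firewall change stands out immediately once you look at what changed around the time of the event.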
Performance troubleshooting is all about moving the bottleneck, so:
- Find the bottleneck.
- Test individual parts of the file system—for example, slow writes.
- Try the I/O locally on the cluster.
- Measure the network throughput without writing to disk.
- Address and repeat.
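As an example of testing one component in isolation, the sketch below measures network throughput over a loopback socket without writing anything to disk, so network and disk effects are not conflated. This is an illustrative upper-bound measurement only; a real test would run between an actual client and a cluster node (or use a dedicated tool for the purpose).

```python
# Sketch: measure raw socket throughput with no disk I/O involved, so a
# slow-write symptom can be separated into its network and disk parts.
import socket
import threading
import time

PAYLOAD = b"x" * (1 << 20)  # 1 MiB send buffer
TOTAL_MB = 64               # total amount to transfer

def sink(server_sock):
    """Accept one connection and discard everything received."""
    conn, _ = server_sock.accept()
    while conn.recv(1 << 16):
        pass
    conn.close()

server = socket.socket()
server.bind(("127.0.0.1", 0))  # loopback only; OS picks a free port
server.listen(1)
threading.Thread(target=sink, args=(server,), daemon=True).start()

client = socket.socket()
client.connect(server.getsockname())
start = time.monotonic()
for _ in range(TOTAL_MB):
    client.sendall(PAYLOAD)
client.close()
elapsed = max(time.monotonic() - start, 1e-9)

throughput_mb_s = TOTAL_MB / elapsed
print(f"loopback throughput: {throughput_mb_s:.0f} MB/s")
```

If this number is far above the throughput your workflow achieves, the network path is unlikely to be the weakest link, and you can move on to testing disk I/O locally on the cluster.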
At the end of the day, performance is always about finding the weakest link, strengthening it, and then repeating the process. A very effective technique is to break the workflow down into component parts—ideally, parts that each test an individual component, such as the network, the disks, and so on.