NOTE: This topic is part of the Uptime Information Hub.
- Troubleshooting using cluster statistics
Troubleshooting performance issues (cont'd)
This is a continuation of the troubleshooting performance issues series. Performance issues are typically caused by network traffic, network configuration issues, client or cluster processing loads, or a combination of these factors. Symptoms include client computers that perform slowly, and jobs, particularly those that run on the cluster, that either fail or take longer than expected to complete.
One complication is that most clusters are not "single taskers." Therefore, you must consider the possibility that poor performance is due to contention and overall load. In this case, the cluster-wide statistics are very helpful.
You can run a variety of commands to investigate performance issues. OneFS maintains a very rich set of statistics for protocols, disks, clients, and file system activity. A significant portion of the statistics is saved (or “persisted”) and can be queried for historical data. All of the statistics can be queried in real time. You can aggregate and sort the query results by using the --totalby and --orderby options. Examining the statistics makes it possible to inspect the cluster for specific behaviors, including:
- Showing active and connected client counts to each node.
- Determining which files in the file system are the busiest, also known as heat.
- Querying disk activity to see if there are drives that are being overtaxed.
- Measuring the response times of configured directory services, such as Active Directory.
Use the isi statistics command to break out performance by protocol, client, and so on. For more information, see Using the isi statistics command, article 89158, on the EMC Online Support site.
Checking NFS clients
To check NFS clients, run the following command:
isi statistics query --nodes=all --stats=node.clientstats.connected.nfs,node.clientstats.active.nfs
The output displays the number of clients connected to each node and how many of those clients are active on each node.
Checking SMB clients
To check SMB clients, run the following command:
isi statistics query --nodes=all --stats=node.clientstats.connected.smb,node.clientstats.active.smb
The output displays the number of clients connected per node and how many of those clients are active on each node.
Checking disk I/O
Examining disk I/O can help you determine whether certain disks are being overused. Run the following command to examine disk I/O for the cluster:
isi statistics pstat
From the output, divide the disk IOPS by the total number of disks in the cluster. For X-Series nodes and NL-Series nodes, you should expect to see disk IOPS of 70 or less for 100 percent random workflows, or disk IOPS of 140 or less for 100 percent sequential workflows. Because NL-Series nodes have less RAM and lower CPU speeds than X-Series nodes, X-Series nodes can handle higher disk IOPS.
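As a worked example, the per-disk arithmetic above can be scripted. All figures here are hypothetical; substitute the disk IOPS reported by isi statistics pstat and your cluster's actual disk count:

```shell
#!/bin/sh
# Hypothetical inputs: substitute the disk IOPS reported by
# "isi statistics pstat" and your cluster's real disk count.
TOTAL_DISK_IOPS=4200   # hypothetical cluster-wide disk IOPS
DISK_COUNT=72          # hypothetical: e.g., 6 nodes with 12 disks each

PER_DISK=$((TOTAL_DISK_IOPS / DISK_COUNT))
echo "Per-disk IOPS: $PER_DISK"

# Rule-of-thumb ceilings from the guidance above (X-Series/NL-Series):
# ~70 IOPS per disk for fully random workflows, ~140 for fully sequential.
if [ "$PER_DISK" -gt 140 ]; then
    echo "Exceeds the sequential guideline; disks are likely overtaxed"
elif [ "$PER_DISK" -gt 70 ]; then
    echo "Within the sequential guideline, but above the random-workflow guideline"
else
    echo "Within both guidelines"
fi
```

With these sample numbers, the script reports about 58 IOPS per disk, which is under both guidelines.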
You can examine disk I/O by node and disk. To discover disks that are overused, run the following command to determine disk IOPS by node:
isi statistics query --nodes=all --stats=node.disk.xfers.rate.sum --top
To query for statistics on a per disk basis, use the following command:
isi statistics describe --stats=all | grep disk
Another way to determine whether disks are being overused is to find out how many operations are queued for each disk in the cluster. For a single-stream SMB-based workflow, a queue depth of four can indicate an issue, while for high-concurrency NFS namespace operations, the queue can be much deeper. To determine how many operations are queued for each disk in the cluster, run the following command:
isi_for_array -s sysctl hw.iosched | grep total_inqueue
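To interpret the total_inqueue values, the rule of thumb above can be sketched as a simple threshold check. The queue depth and workflow label below are hypothetical placeholders, not parsed command output:

```shell
#!/bin/sh
# Hypothetical queue depth for one disk; substitute a total_inqueue
# value reported by the isi_for_array command above.
QUEUE_DEPTH=6
WORKFLOW="smb-single-stream"   # or "nfs-high-concurrency"

# Rule of thumb from the text: about four queued operations can
# indicate an issue for a single-stream SMB workflow, while
# high-concurrency NFS namespace workloads tolerate a deeper queue.
if [ "$WORKFLOW" = "smb-single-stream" ] && [ "$QUEUE_DEPTH" -ge 4 ]; then
    RESULT="queue depth $QUEUE_DEPTH may indicate a disk bottleneck"
else
    RESULT="queue depth $QUEUE_DEPTH is within the expected range"
fi
echo "$RESULT"
```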
To determine the latency caused by the queued operations, run the following command:
sysctl -aN hw.iosched | grep bios_inqueue | xargs sysctl -D
Checking CPU usage
CPU issues are frequently traced to the operations that clients are performing on the cluster. Determine which operations are being performed across the network, and then assess which of those operations take the most time by running the following command:
isi statistics protocol --orderby=TimeAvg --top
This command’s output gives you detailed statistics for all network protocols, ordered by how long the cluster takes to respond to clients. Although the results might not identify the slowest operation outright, they can point you in the right direction.
For additional details about CPU processing, such as which nodes' CPUs are the most heavily used, run the following command:
isi statistics system --top
To determine the four processes on each node that are consuming the most CPU resources, run:
isi_for_array -sq 'top -d1|grep PID -A4'
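The grep in the command above relies on top printing a header line that contains PID, followed by processes sorted by CPU usage; -A4 keeps that header plus the next four lines. A minimal illustration using made-up process lines in place of live top output:

```shell
#!/bin/sh
# Simulated top output; the process lines below are hypothetical.
# The real command runs top itself on every node via isi_for_array.
OUT=$(printf '%s\n' \
  'last pid: 1234;  load averages: 0.52, 0.48, 0.45' \
  '  PID USERNAME   CPU COMMAND' \
  ' 2001 root       41%  lwio' \
  ' 2002 root       22%  nfsd' \
  ' 2003 root        9%  isi_stats_d' \
  ' 2004 root        4%  isi_job_d' \
  ' 2005 root        1%  sshd' \
  | grep PID -A4)
# grep matches the header line containing "PID"; -A4 keeps the next
# four lines, so only the top four processes (not sshd) survive.
echo "$OUT"
```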
Although the isi statistics command is very powerful, it is also fairly low level. Consider using the OneFS InsightIQ module. InsightIQ is a separate Unix virtual appliance (also available as a standalone RPM) that continuously pulls statistics from the cluster and presents the results graphically in a web browser. It also leverages the FSAnalyze (file system analytics) Job Engine job on the cluster to provide deeper insight into file-level statistics, such as file size and file age. Isilon InsightIQ makes it easy to see, graphically, what is happening over time, and can be very helpful in correlating performance with load.
With InsightIQ, you can monitor and analyze Isilon cluster activity through flexible, customizable chart views in a web-based application. These charts provide detailed information about cluster hardware, software, file system, and protocol operations. InsightIQ transforms data into visual information that highlights any performance outliers, enabling you to quickly and easily diagnose bottlenecks or optimize workflows. For details about InsightIQ, see the InsightIQ User Guide.
Troubleshooting large workloads
For performance guidelines for large workloads, covering protocol connections, file system components, software modules, and network settings for OneFS 7.1.1, see the Isilon Guidelines for Large Workloads document.