Got the following question from a customer recently, and thought it was worth elaborating on in an article:
Is there a command to show the machines and customers using my Isilon? I realize I can get it over time from the InsightIQ, just wondering if there is something more current and on-cluster?
Absolutely! The ‘isi statistics client’ CLI command displays the most active clients, by throughput, that are accessing the cluster for each supported protocol. You can also specify options to track access by user. For example, when more than one user on the same client host accesses the cluster, the output lists the client name or IP address, depending on the options, along with the user ID, assuming it can be determined.
# isi statistics client --protocols=all
This will show results for all the protocols across the cluster. These include the familiar NAS protocols such as NFS, SMB, HDFS, FTP, and HTTP, plus OneFS-specific services such as SyncIQ (siq), PlatformAPI (papi), NFS network lock manager (nlm), the OneFS security authority subsystem (lsass), and Job Engine (jobd).
For instance, the following syntax will show the SMB2 client stats, displayed in ‘top’ format, where data is continuously overwritten in a single table.
# isi statistics client --protocols=smb2 --format=top
It’s worth noting that Windows client statistics can only be displayed for SMB1 and SMB2. SMB3 is considered by Microsoft to be a ‘dialect’ of SMB2, rather than a distinct ‘protocol’ version. As such, SMB3 client stats are included with the SMB2 statistics.
The following OneFS CLI command tallies usage by user. While it obscures the specific nodes themselves, for HPC workloads that involve a distributed compute farm it can be useful to view the combined user load on a cluster:
# isi statistics client --protocols=nfs3 --numeric --totalby=username --sort=Ops,TimeMax --format=top
Isi statistics is a versatile utility, with the following subcommand-level reporting areas:
All the keys available via the isi statistics command can be viewed with the following syntax:
# isi statistics list keys
To include every node in the cluster, use the ‘--nodes all’ flag:
# isi statistics client --nodes all
The ‘--nodes all’ flag can also be truncated to simply ‘-nall’ for command efficiency:
# isi statistics client -nall
A good overview of the cluster can be obtained via the ‘isi statistics system’ command. This will show CPU, core protocols, network, disk and totals in a single line. To split out these stats by node and stack rank, use:
# isi statistics system -nall --top
The following command can be useful for identifying the files that are being most heavily utilized. The output can be piped to ‘sort’ or ‘head’ to manage its length (i.e. limited to ten lines, in this case):
# isi statistics heat --long --classes=read,write,namespace_read,namespace_write | head -10
This syntax will show the amount of contention where concurrent user operations are targeting the same object:
# isi statistics heat --long --classes=read,write,namespace_read,namespace_write --event=blocked,contended,deadlocked | head -20
Other useful stats to monitor include latency, disk activity, and CPU load.
Useful commands for latency include:
# isi statistics query current --nodes=all --stats=node.disk.access.latency.all
# isi statistics query current --nodes=all --stats=node.disk.iosched.latency.all
For disk activity, to show the busiest drives on a cluster in order, you can use the following syntax:
# isi statistics drive -nall --sort Busy
On the disk side, the sum of DiskIn (writes) and DiskOut (reads) gives the total IOPS for all the drives per node.
For the next level of granularity, the following drive statistics command provides individual SATA disk info. The sum of OpsIn and OpsOut is the total IOPS per drive in the cluster.
# isi statistics drive -nall --long --type=sata --sort=busy | head -20
And the same info for SSDs:
# isi statistics drive -nall --long --type=ssd --sort=busy | head -20
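As a worked illustration of the sums described above, the following Python sketch adds OpsIn and OpsOut per drive and rolls the result up per node. The row layout, field names, and sample numbers are assumptions made for the example; they are not the actual output format of isi statistics drive.

```python
from collections import defaultdict

def drive_iops(rows):
    """Return {drive: OpsIn + OpsOut}, i.e. total IOPS per drive."""
    return {r["drive"]: r["ops_in"] + r["ops_out"] for r in rows}

def node_iops(rows):
    """Return {node: total IOPS across that node's drives}."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["node"]] += r["ops_in"] + r["ops_out"]
    return dict(totals)

# Hypothetical rows, as if transcribed by hand from the drive stats output.
sample = [
    {"node": 1, "drive": "1:1", "ops_in": 12.5, "ops_out": 30.5},
    {"node": 1, "drive": "1:2", "ops_in": 8.0, "ops_out": 20.0},
    {"node": 2, "drive": "2:1", "ops_in": 4.5, "ops_out": 10.5},
]

print(drive_iops(sample))
print(node_iops(sample))
```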
The primary counters of interest in drive stats data are often ‘TimeInQ’, ‘Queued’, ‘OpsIn’, ‘OpsOut’, IO, and the ‘Busy’ percentage of each disk. If most or all the drives have high busy percentages, this indicates a uniform resource constraint, and there is a strong likelihood that the cluster is spindle bound. If, say, the top five drives are much busier than the rest, this suggests a workflow hot-spot.
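The distinction between a uniform constraint and a hot-spot can be sketched in Python as follows. Every threshold here (80% as ‘high’, 80% of drives as ‘most’, and a 2x ratio as ‘much busier’) is an assumption chosen for illustration and should be tuned to the cluster and workload in question.

```python
from statistics import mean

def classify_drive_load(busy, high=80.0, most=0.8, top_n=5, ratio=2.0):
    """Classify a list of per-drive Busy percentages.

    All thresholds are illustrative assumptions, not OneFS guidance.
    """
    if not busy:
        return "no data"
    ranked = sorted(busy, reverse=True)
    # Most or all drives are busy: uniform constraint, likely spindle bound.
    if sum(b >= high for b in ranked) >= most * len(ranked):
        return "uniform constraint"
    # The top few drives are much busier than the rest: workflow hot-spot.
    top, rest = ranked[:top_n], ranked[top_n:]
    if rest and mean(top) >= ratio * max(mean(rest), 1.0):
        return "hot spot"
    return "no obvious pattern"

print(classify_drive_load([95, 92, 90, 88, 91, 93, 89, 94]))
print(classify_drive_load([90, 85, 88, 92, 87] + [20] * 15))
```

With these made-up inputs, the first call reports a uniform constraint and the second a hot spot.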
CPU stats can be displayed by user, node, etc:
# isi statistics query current -nall --stats=node.cpu.user.avg
# isi statistics query current -nall --stats=node.cpu.sys.avg
Or by CPU idle:
# isi statistics query current -nall --stats=node.cpu.idle.avg
Note: These counter values sum up to 1000, so divide by 10 to get the percentage.
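As a quick illustration of that conversion, here is a minimal Python sketch. The counter values are invented for the example, and the dictionary layout is an assumption rather than the format the CLI returns.

```python
def cpu_percentages(counters):
    """Convert per-mille CPU counters (0-1000) to percentages (0-100)."""
    return {name: value / 10 for name, value in counters.items()}

# Hypothetical readings for one node; together they sum to 1000.
sample = {
    "node.cpu.user.avg": 250,
    "node.cpu.sys.avg": 50,
    "node.cpu.idle.avg": 700,
}

print(cpu_percentages(sample))
```

Here a user counter of 250 corresponds to 25% user CPU, and an idle counter of 700 to 70% idle.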
Under normal circumstances, Gen 5 and prior nodes tend to be bottlenecked on disk, rather than CPU-bound. Even so, CPU can be a useful barometer of workflow health. Since the OneFS NFS and SMB protocol stacks both reside in userspace, you’ll typically see the user (rather than kernel) as the predominant CPU consumer. If not, the cluster either has very low client traffic, or too high an impact setting on currently running Job Engine jobs.
A reasonable rule of thumb is to take the average CPU utilization for all the nodes in the same pool. If the delta between one node and its peers reaches 10% or more, then something is likely defeating, or will soon defeat, OneFS’ ability to balance load, and it’s time to start investigating. Once this delta reaches 50%, there is typically a specific actor and often a cluster resource bottleneck, which could be negatively impacting clients and user experience.
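This rule of thumb can be expressed as a short Python sketch. It assumes that the delta is measured in percentage points of CPU utilization against the pool average, which is one reading of the guidance above; the function name and the sample figures are hypothetical.

```python
from statistics import mean

def cpu_outliers(util_by_node, warn=10.0, severe=50.0):
    """Flag nodes whose CPU utilization (percent) strays from the pool average.

    Assumes deltas are in percentage points; thresholds follow the
    10% / 50% rule of thumb.
    """
    pool_avg = mean(util_by_node.values())
    findings = {}
    for node, util in util_by_node.items():
        delta = abs(util - pool_avg)
        if delta >= severe:
            findings[node] = "bottleneck"
        elif delta >= warn:
            findings[node] = "investigate"
    return findings

# Hypothetical utilization figures for a four-node pool (average is 28%).
pool = {1: 22.0, 2: 25.0, 3: 24.0, 4: 41.0}
print(cpu_outliers(pool))
```

In this sample, node 4 sits 13 points above the pool average and is the only node flagged for investigation.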
To determine whether the CPUs are pegged for extended periods, just query “node.cpu.idle.avg” history:
# isi statistics query history -nall --stats=node.cpu.idle.avg
Finally, the isi statistics metrics are also available via OneFS’ RESTful platform API, for use in scripts. For example, the following URL will show the active SMB2 clients across all nodes:
Or, for connected NFS clients:
It’s worth noting that isi statistics doesn’t directly tie a client to a file or directory path. Both isi statistics heat and isi statistics client provide some of this information, but not together. The only directory/file related metrics come from the ‘heat’ stats, which track the hottest accesses in the filesystem.